Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"
Hi,

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Monday, August 12, 2013 8:50 PM
> To: Gonglei (Arei)
> Cc: xen-devel@xxxxxxxxxxxxx; Zhangbo (Oscar); Luonengjun;
> ian.campbell@xxxxxxxxxx; stefano.stabellini@xxxxxxxxxxxxx; rjw@xxxxxxx;
> rshriram@xxxxxxxxx; Yanqiangjun; Jinjian (Ken)
> Subject: Re: [Xen-devel] pvops: Does PVOPS guest os support online
> "suspend/resume"
>
> On Sat, Aug 10, 2013 at 08:29:43AM +0000, Gonglei (Arei) wrote:
> >
> > > -----Original Message-----
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> > > Sent: Friday, August 09, 2013 3:17 AM
> > > To: Gonglei (Arei)
> > > Cc: xen-devel@xxxxxxxxxxxxx; Zhangbo (Oscar); Luonengjun; Hanweidong
> > > Subject: Re: [Xen-devel] pvops: Does PVOPS guest os support online
> > > "suspend/resume"
> > >
> > > On Thu, Aug 08, 2013 at 02:23:06PM +0000, Gonglei (Arei) wrote:
> > > > Hi all,
> > > >
> > > > When we suspend and resume a PVOPS guest OS while it is running, we
> > > > found that its block/net I/O gets stuck. However, a non-PVOPS guest
> > > > OS has no such problem.
> > > >
> > >
> > > With what version of Linux is this? Have you tried with v3.10?
> >
> > Thanks for responding. We've tried kernel "3.5.0-17 generic" (ubuntu 12.10),
> > the problem still exists.
>
> So you have not tried v3.10. v3.5 is ancient from the upstream perspective.

Thank you, I hadn't noticed that. I will try 3.10 later.

> > Although we are not sure about the result with kernel 3.10, we suspect
> > it would also have the same problem.
>
> Potentially.
> There were fixes added in 3.5:
>
> commit 569ca5b3f94cd0b3295ec5943aa457cf4a4f6a3a
> Author: Jan Beulich <JBeulich@xxxxxxxx>
> Date:   Thu Apr 5 16:10:07 2012 +0100
>
>     xen/gnttab: add deferred freeing logic
>
>     Rather than just leaking pages that can't be freed at the point where
>     access permission for the backend domain gets revoked, put them on a
>     list and run a timer to (infrequently) retry freeing them. (This can
>     particularly happen when unloading a frontend driver when devices are
>     still present, and the backend still has them in non-closed state or
>     hasn't finished closing them yet.)
>
> and that seems to be triggered.

I've tried applying this patch, but it didn't fix the problem: it retries
endlessly to free the leaking pages, and there seems to be no end to it.
The following message keeps appearing every second:

"WARNING: leaking g.e. and page still in use!"

> > Xen version: 4.3.0
> >
> > Another method to reproduce:
> > 1) xl create dom1.cfg
> > 2) xl save -c dom1 /path/to/save/file
> >    (-c Leave domain running after creating the snapshot.)
> >
> > As I mentioned before, the problem occurs because the PVOPS guest OS
> > resumes blkfront when the guest resumes.
> > The "blkfront_resume" method seems unnecessary here.
>
> It has to do that otherwise it can't replay the I/Os that might not have
> hit the platter when it migrated from the original host.
>
> But you are exercising the case where it does a checkpoint,
> not a full save/restore cycle.
>
> In which case you might be indeed hitting a bug.

If we add a suspend method to blkfront, so that when we suspend the guest OS
the frontend/backend block devices change their states from
{XenbusStateConnected, XenbusStateConnected} to
{XenbusStateInitialising, XenbusStateInitWait}, would that cause any problem?
We found that the Windows Xen PV driver does this, and we hope that such an
approach would solve the problem.

> > A non-PVOPS guest OS doesn't resume blkfront, thus it works fine.
>
> Potentially.
> The non-PVOPS guests are based on ancient kernels and
> the upstream logic in the generic suspend/resume machinery has also
> changed.
>
> > So, here come the 2 questions. Is the problem caused because:
> > 1) PVOPS kernel doesn't take this situation into account, and has a bug here?
> > or
> > 2) PVOPS has other ways to avoid such a problem?
>
> Just to make sure I am not confused here. The problem does not
> appear if you do NOT use -c, correct?

Yes. The purpose of using "-c" here is to do an "ONLINE" suspend/resume.
The problem occurs only with ONLINE suspend/resume, not with OFFLINE
suspend/resume.

To be precise, two examples are listed below:

<1>
1) xl create dom1.cfg
2) xl save -c dom1 /opt/dom1.save
   After this, the dom1 guest OS has its I/O stuck, which means something
   is wrong with ONLINE suspend/resume.
3) xl destroy dom1
4) xl restore /opt/dom1.save
   The restored dom1 works fine, which means OFFLINE suspend/resume is OK.

<2>
1) xl create dom1.cfg
2) xl save dom1 /opt/dom1.save
   No "-c" here, so it destroys the guest dom1 automatically.
3) xl restore /opt/dom1.save
   The restored dom1 works fine, which means OFFLINE suspend/resume is OK.

-Gonglei

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
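For reference, the two sequences <1> and <2> above can be written down as a small Python sketch. It only builds the xl command lines and does not run them (running them needs a dom0 with xl installed). The function names `save_restore_steps` and `render` are made up for illustration; the domain name, config file and save path are the ones used in the examples above.

```python
import shlex


def save_restore_steps(domain_cfg, domain, save_file, checkpoint):
    """Build the xl command sequence for one of the two cases.

    checkpoint=True  -> case <1>: "xl save -c" leaves the domain running,
                        so it has to be destroyed before the restore.
    checkpoint=False -> case <2>: "xl save" destroys the domain itself.
    """
    steps = [["xl", "create", domain_cfg]]
    if checkpoint:
        # ONLINE suspend/resume: this is the step after which I/O gets stuck.
        steps.append(["xl", "save", "-c", domain, save_file])
        steps.append(["xl", "destroy", domain])
    else:
        # OFFLINE suspend: the domain is gone once the save completes.
        steps.append(["xl", "save", domain, save_file])
    steps.append(["xl", "restore", save_file])
    return steps


def render(steps):
    """Turn argv lists into shell-quoted command strings for display."""
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    for label, checkpoint in (("<1> online", True), ("<2> offline", False)):
        print(label)
        for cmd in render(save_restore_steps("dom1.cfg", "dom1",
                                             "/opt/dom1.save", checkpoint)):
            print("  " + cmd)
```

The only difference between the two cases is the "-c" flag and the explicit "xl destroy", which is why the sketch takes a single `checkpoint` switch.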