Xen project Mailing List

Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: "Gonglei (Arei)" <arei.gonglei@xxxxxxxxxx>

Date: Wed, 14 Aug 2013 10:52:16 +0000

Accept-language: zh-CN, en-US

Cc: "ian.campbell@xxxxxxxxxx" <ian.campbell@xxxxxxxxxx>, "stefano.stabellini@xxxxxxxxxxxxx" <stefano.stabellini@xxxxxxxxxxxxx>, "Zhangbo (Oscar)" <oscar.zhangbo@xxxxxxxxxx>, Yanqiangjun <yanqiangjun@xxxxxxxxxx>, Luonengjun <luonengjun@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, "rjw@xxxxxxx" <rjw@xxxxxxx>, "rshriram@xxxxxxxxx" <rshriram@xxxxxxxxx>, "Jinjian (Ken)" <jinjian@xxxxxxxxxx>

Delivery-date: Wed, 14 Aug 2013 10:52:55 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Thread-index: Ac6UQsmophhm9G7yR2q0shBcQAssaf//y/gA//0N1wCACM9LAP//Ye/ggAD2L4D//iJfwABq2IIA//5H2MA=

Thread-topic: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Wednesday, August 14, 2013 12:35 AM
> To: Gonglei (Arei)
> Cc: rshriram@xxxxxxxxx; xen-devel@xxxxxxxxxxxxx; Zhangbo (Oscar); Luonengjun; ian.campbell@xxxxxxxxxx; stefano.stabellini@xxxxxxxxxxxxx; rjw@xxxxxxx; Yanqiangjun; Jinjian (Ken)
> Subject: Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"
>
> On Tue, Aug 13, 2013 at 02:38:18PM +0000, Gonglei (Arei) wrote:
> > Hi,
> > I rechecked the different kernels today and found that I made a mistake before. Sorry for misleading you all :)
> >
> > All in all, the problems come down to the 2 items below:
> > 1 The kernel 2.6.32 PVOPS guest OS (I tested RHEL6.1 and RHEL6.3) does have bugs in ONLINE suspend/resume (checkpoint), which were, as Shriram mentioned, fixed in:
> > http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/drivers/xen/manage.c?id=b3e96c0c756211e805c6941d4a6e5f6e1995cb6b
> > 2 Kernels above 3.0 (I tested Ubuntu 12.10 with kernel 3.5 and Ubuntu 13.04 with kernel 3.8) seem to have another "bug":
> >    1) If we set MULTIPLE VCPUS for the guest OS, it has problems resuming (to be precise, thawing).
> >       In detail:
> >       <1> Set the guest OS to 4 vcpus
> >           in dom1.cfg: vcpus=4
> >       <2> xl create dom1.cfg
> >           Execute the command "top -d 1" in guest dom1's VNC window
> >       <3> xl save -c dom1 /opt/dom1.save
> >       <4> After step <3>, we checked guest dom1's VNC window and found that:
> >           kernel threads migration/1, migration/2, migration/3 got their CPU usage up to 100%;
> >           the guest OS couldn't respond to any request such as mouse movement or keyboard input;
> >           no "thaw" messages were printed in dom1's serial output.
> >
> >    2) If we set only 1 vcpu for the guest OS, it thaws back and works fine.
> >    3) Another odd thing is that if we use the saved file generated in 2-1) to restore the guest, and then do an online suspend/resume (xl save -c, checkpoint), it is fine; no problems occur.
> >
> > This problem occurs on guest OSes with kernel 3.5/3.8 (maybe other kernels as well, not tested). I hope that the steps I followed were correct.
>
> Please do check with the upstream kernel. There were some CPU hotplug issues in older kernels, and just to make sure that this is not one of them it would be good to eliminate this.
>
> Please do test with v3.11-rc5.
>
> > Have you ever encountered such a "suspend/resume checkpoint on multi-vcpu guest OS" problem?
> >
> > -------
> > PS: BTW, I'm wondering why using freeze/thaw instead of suspend/resume would solve the problem with kernels below 3.0?
> > It seems that blkfront_resume is still called if we use the thaw method here, because blkfront has no available pm_op.
> >
> > static int device_resume(struct device *dev, pm_message_t state, bool async)
> > {
> >     ...
> >     if (dev->bus) {
> >         if (dev->bus->pm) {
> >             info = "bus ";
> >             callback = pm_op(dev->bus->pm, state);
> >         } else if (dev->bus->resume) {
> >             info = "legacy bus ";
> >             callback = dev->bus->resume;  // blkfront_resume is called here?
> >             goto End;
>
> One easy way to figure this out is to stick printks in here to see if that blkfront code is indeed called. You can also use 'dump_stack()' to get a nice stack-trace.

Hi,

1 I tried kernel 3.11-rc6; it has the same problem: after doing the checkpoint, the multi-vcpu guest OS can't respond to anything, because its kernel threads migration/1, migration/2, etc. got their CPU usage up to 100%.
2 Kernel 3.0 doesn't have this problem.

So it seems that some bug was introduced between v3.0 and v3.5, something concerning vcpu freeze/thaw?

Thanks!
-Gonglei

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
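Editorial sketch (not part of the original message): the PS question in the quoted text turns on which bus-level callback device_resume picks. The userspace model below is only an illustration of that branch as it appears in the quoted snippet. All type and function names here (pm_event, dev_pm_ops_model, bus_model, pm_op_model, pick_bus_callback, demo_blkfront_resume) are hypothetical stand-ins, not the kernel's real definitions, and the model assumes, as the message does, that the bus in question exposes only a legacy resume callback.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only. These simplified types are assumptions made for
 * illustration; they are not the kernel's struct dev_pm_ops or bus_type. */
enum pm_event { PM_EVENT_RESUME_MODEL, PM_EVENT_THAW_MODEL };

typedef int (*pm_callback_t)(void);

struct dev_pm_ops_model {
    pm_callback_t resume;   /* used for a resume message */
    pm_callback_t thaw;     /* used for a thaw message   */
};

struct bus_model {
    const struct dev_pm_ops_model *pm;  /* NULL when the bus has no PM ops */
    pm_callback_t resume;               /* legacy, message-agnostic callback */
};

/* Hypothetical stand-in for a driver resume routine such as blkfront's. */
int demo_blkfront_resume(void)
{
    return 0;
}

/* Stand-in for pm_op(): the callback depends on the PM message. */
pm_callback_t pm_op_model(const struct dev_pm_ops_model *ops, enum pm_event ev)
{
    switch (ev) {
    case PM_EVENT_RESUME_MODEL:
        return ops->resume;
    case PM_EVENT_THAW_MODEL:
        return ops->thaw;
    }
    return NULL;
}

/* Mirrors only the bus branch of the quoted snippet: with PM ops present the
 * message selects the callback; otherwise the legacy resume callback is
 * chosen whether the message is resume or thaw. */
pm_callback_t pick_bus_callback(const struct bus_model *bus, enum pm_event ev,
                                const char **info)
{
    assert(bus != NULL && info != NULL);
    if (bus->pm) {
        *info = "bus ";
        return pm_op_model(bus->pm, ev);
    }
    if (bus->resume) {
        *info = "legacy bus ";
        return bus->resume;
    }
    *info = "none";
    return NULL;
}
```

Under these assumptions, a bus_model with pm set to NULL and resume set to demo_blkfront_resume returns the same callback for both the thaw and the resume message, which matches the behaviour the PS describes. Whether a real guest kernel actually takes that path is what the suggested printk or dump_stack() in the driver would confirm.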
