
Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"




> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Friday, August 09, 2013 3:17 AM
> To: Gonglei (Arei)
> Cc: xen-devel@xxxxxxxxxxxxx; Zhangbo (Oscar); Luonengjun; Hanweidong
> Subject: Re: [Xen-devel] pvops: Does PVOPS guest os support online
> "suspend/resume"
> 
> On Thu, Aug 08, 2013 at 02:23:06PM +0000, Gonglei (Arei) wrote:
> > Hi all,
> >
> > When we suspend and then resume a running PVOPS guest OS, its block/net I/O
> > gets stuck. Non-PVOPS guest OSes do not have this problem.
> >
> 
> With what version of Linux is this? Have you tried with v3.10?

Thanks for responding. We've tried kernel "3.5.0-17 generic" (Ubuntu 12.10),
and the problem still exists.
We have not tested kernel 3.10, but we suspect it has the same problem.

Xen version:  4.3.0

Another method to reproduce:
1) xl create dom1.cfg
2) xl save -c dom1 /path/to/save/file
   (-c  Leave domain running after creating the snapshot.)

As I mentioned before, the problem occurs because the PVOPS guest OS resumes
blkfront when the guest resumes.
The "blkfront_resume" call seems unnecessary here.
Non-PVOPS guest OSes don't resume blkfront, so they work fine.

So there are two questions. Is the problem caused because:
1) the PVOPS kernel doesn't take this situation into account and has a bug here,
or
2) PVOPS has some other way to avoid this problem?

-Gonglei
> 
> Thanks.
> > How reproducible:
> > -------------------
> > 1/1
> >
> > Steps to reproduce:
> > ------------------
> >   1) suspend the guest OS
> >     Note: do not migrate or shut down the guest OS.
> >   2) resume the guest OS
> >
> > (Consider rolling back (resuming) a guest while it is being core-dumped
> > (suspended): this problem would leave the guest OS inoperable.)
> >
> >
> > ====================================================================
> > We found these warning messages in the guest OS:
> > --------------------------------------------------------------------
> > Aug  2 10:17:34 localhost kernel: [38592.985159] platform pcspkr: resume
> > Aug  2 10:17:34 localhost kernel: [38592.989890] platform vesafb.0: resume
> > Aug  2 10:17:34 localhost kernel: [38592.996075] input input0: type resume
> > Aug  2 10:17:34 localhost kernel: [38593.001330] input input1: type resume
> > Aug  2 10:17:34 localhost kernel: [38593.005496] vbd vbd-51712: legacy resume
> > Aug  2 10:17:34 localhost kernel: [38593.011506] WARNING: g.e. still in use!
> > Aug  2 10:17:34 localhost kernel: [38593.016909] WARNING: leaking g.e. and page still in use!
> > Aug  2 10:17:34 localhost kernel: [38593.026204] xen vbd-51760: legacy resume
> > Aug  2 10:17:34 localhost kernel: [38593.033070] vif vif-0: legacy resume
> > Aug  2 10:17:34 localhost kernel: [38593.039327] WARNING: g.e. still in use!
> > Aug  2 10:17:34 localhost kernel: [38593.045304] WARNING: leaking g.e. and page still in use!
> > Aug  2 10:17:34 localhost kernel: [38593.052101] WARNING: g.e. still in use!
> > Aug  2 10:17:34 localhost kernel: [38593.057965] WARNING: leaking g.e. and page still in use!
> > Aug  2 10:17:34 localhost kernel: [38593.066795] serial8250 serial8250: resume
> > Aug  2 10:17:34 localhost kernel: [38593.073556] input input2: type resume
> > Aug  2 10:17:34 localhost kernel: [38593.079385] platform Fixed MDIO bus.0: resume
> > Aug  2 10:17:34 localhost kernel: [38593.086285] usb usb1: type resume
> > ------------------------------------------------------
> >
> > which means that grant-table entries are being released while they are still in use.
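
A simplified model of where those two warnings come from, assuming the v1 grant-table semantics: revoking a grant only succeeds if Xen has not flagged the entry as currently mapped by the backend, and if it fails the page is leaked rather than freed. The flag values mirror GTF_permit_access, _GTF_reading and _GTF_writing from Xen's public grant_table.h; the struct and the model_* functions are hypothetical, and the real kernel uses an atomic compare-and-exchange that this sketch omits.

```c
/*
 * Simplified model, not kernel code: why revoking a grant can fail.
 * Flag values mirror Xen's public grant_table.h; everything else is
 * a hypothetical illustration.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_GTF_PERMIT_ACCESS (1U << 0) /* frontend granted the page */
#define MODEL_GTF_READING       (1U << 3) /* backend has it mapped for reading */
#define MODEL_GTF_WRITING       (1U << 4) /* backend has it mapped for writing */

/* Hypothetical stand-in for one shared grant entry plus its data page. */
struct model_grant_entry {
    uint16_t flags;
    bool page_freed;
};

/* Like gnttab_end_foreign_access_ref(): the grant can only be revoked
 * when the backend no longer has the page mapped. */
static bool model_end_access_ref(struct model_grant_entry *e)
{
    if (e->flags & (MODEL_GTF_READING | MODEL_GTF_WRITING))
        return false; /* the kernel logs "WARNING: g.e. still in use!" */
    e->flags = 0;
    return true;
}

/* Like gnttab_end_foreign_access(): if revocation fails, the page cannot
 * be reused safely, so it is deliberately leaked instead of freed. */
static void model_end_access(struct model_grant_entry *e)
{
    if (model_end_access_ref(e))
        e->page_freed = true;
    /* else the kernel logs "WARNING: leaking g.e. and page still in use!" */
}
```

In this model, calling model_end_access() on an entry that still has MODEL_GTF_WRITING set leaves the flags untouched and the page unfreed, which matches the pair of warnings in the log above.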
> >
> > The cause lies in the suspend/resume code:
> > --------------------------------------------------------
> > //drivers/xen/manage.c
> > static void do_suspend(void)
> > {
> >     int err;
> >     struct suspend_info si;
> >
> >     shutting_down = SHUTDOWN_SUSPEND;
> >
> >     /* ... */
> >     err = dpm_suspend_start(PMSG_FREEZE);
> >     /* ... */
> >     dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
> >
> >     if (err) {
> >             pr_err("failed to start xen_suspend: %d\n", err);
> >             si.cancelled = 1;
> >     }
> >     //NOTE: in our scenario si.cancelled == 1 here
> >
> > out_resume:
> >     if (!si.cancelled) {
> >             xen_arch_resume();
> >             xs_resume();
> >     } else
> >             xs_suspend_cancel();
> >
> >     dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);  //blkfront device got resumed here.
> >
> > out_thaw:
> > #ifdef CONFIG_PREEMPT
> >     thaw_processes();
> > out:
> > #endif
> >     shutting_down = SHUTDOWN_INVALID;
> > }
> > ------------------------------------
> >
> > The function "dpm_suspend_start" suspends devices, and "dpm_resume_end"
> > resumes them. However, we found that the blkfront driver has a RESUME method
> > but no SUSPEND method.
> >
> > -------------------------------------
> > //drivers/block/xen-blkfront.c
> > static DEFINE_XENBUS_DRIVER(blkfront, ,
> >     .probe = blkfront_probe,
> >     .remove = blkfront_remove,
> >     .resume = blkfront_resume,  // only RESUME method found here.
> >     .otherend_changed = blkback_changed,
> >     .is_ready = blkfront_is_ready,
> > );
> > --------------------------------------
> >
> > So the blkfront device is resumed even though it was never suspended, which
> > causes the problem above.
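
A hypothetical model of the asymmetry described above: a driver table whose suspend hook is NULL but whose resume hook is set. During the suspend phase the missing hook is simply skipped, while during the resume phase (as reported in this thread) the resume hook still runs. The struct and function names are illustrative only and are not the driver-core or xenbus code.

```c
/*
 * Hypothetical model of the reported asymmetry: no suspend hook, but a
 * resume hook. Not the driver-core or xenbus implementation.
 */
#include <assert.h>
#include <stddef.h>

struct model_frontend {
    int suspended;   /* set only if a suspend hook actually ran */
    int resume_runs; /* how many times the resume hook fired */
};

struct model_driver {
    void (*suspend)(struct model_frontend *fe);
    void (*resume)(struct model_frontend *fe);
};

/* Stand-in for blkfront_resume(): would reconnect to the backend. */
static void model_blkfront_resume(struct model_frontend *fe)
{
    fe->resume_runs++;
}

/* Mirrors the quoted blkfront table: .resume set, no .suspend. */
static const struct model_driver model_blkfront = {
    .suspend = NULL,
    .resume = model_blkfront_resume,
};

/* Suspend phase: a NULL hook is skipped, so nothing is quiesced. */
static void model_suspend_device(const struct model_driver *drv,
                                 struct model_frontend *fe)
{
    if (drv->suspend) {
        drv->suspend(fe);
        fe->suspended = 1;
    }
}

/* Resume phase as reported: the hook runs whether or not a suspend ran. */
static void model_resume_device(const struct model_driver *drv,
                                struct model_frontend *fe)
{
    if (drv->resume)
        drv->resume(fe);
}

/* The problematic state: resumed without ever being quiesced. */
static int model_resumed_without_suspend(const struct model_frontend *fe)
{
    return fe->resume_runs > 0 && !fe->suspended;
}
```

Running one suspend/resume cycle against model_blkfront leaves suspended at 0 and resume_runs at 1, i.e. the "resumed but never suspended" state that the report blames for the stuck I/O.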
> >
> >
> > =========================================
> > To check whether this is a PVOPS problem or a hypervisor (Xen)/dom0 problem, we
> > suspended/resumed other, non-PVOPS guest OSes; no such problem occurred.
> >
> > Other non-PVOPS guests use their own Xen drivers, as shown in
> > https://github.com/jpaton/xen-4.1-LJX1/blob/master/unmodified_drivers/linux-2.6/platform-pci/machine_reboot.c :
> >
> > int __xen_suspend(int fast_suspend, void (*resume_notifier)(int))
> > {
> >     int err, suspend_cancelled, nr_cpus;
> >     struct ap_suspend_info info;
> >
> >     xenbus_suspend();
> >
> >     /* ... */
> >     preempt_enable();
> >
> >     if (!suspend_cancelled)
> >         xenbus_resume();          //on a checkpoint resume suspend_cancelled == 1, so this is not reached.
> >     else
> >         xenbus_suspend_cancel();  //this branch runs instead, so blkfront is not resumed.
> >
> >     return 0;
> > }
> >
> >
> > In non-PVOPS guest OSes, blkfront has no SUSPEND method either, but their Xen
> > driver doesn't resume the blkfront device, so they have no problem after
> > suspend/resume.
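
A hypothetical side-by-side model of the two resume paths quoted in this thread for the checkpoint case, where the suspend is "cancelled" and the guest keeps running. The PVOPS side reflects the behaviour reported here (dpm_resume_end() runs on both branches and the log shows "legacy resume" for vbd/vif), not a verified reading of every kernel version; all enum and function names are illustrative, not kernel symbols.

```c
/*
 * Hypothetical model of the two resume paths quoted in this thread.
 * Names are illustrative, not kernel symbols.
 */
#include <assert.h>
#include <stdbool.h>

enum model_pm_msg { MODEL_PMSG_THAW, MODEL_PMSG_RESTORE };

enum model_frontend_action {
    MODEL_FRONTEND_RESUMED,           /* frontend resume hook reached */
    MODEL_FRONTEND_SUSPEND_CANCELLED  /* only the cancel path ran */
};

/* PVOPS do_suspend(): the message it passes to dpm_resume_end(). */
static enum model_pm_msg pvops_resume_msg(bool cancelled)
{
    return cancelled ? MODEL_PMSG_THAW : MODEL_PMSG_RESTORE;
}

/* PVOPS, as reported: dpm_resume_end() is called on both branches, and
 * the frontends end up resumed either way. */
static enum model_frontend_action pvops_after_suspend(bool cancelled)
{
    (void)pvops_resume_msg(cancelled); /* message differs; outcome did not */
    return MODEL_FRONTEND_RESUMED;
}

/* Non-PVOPS __xen_suspend(): xenbus_resume() only when not cancelled,
 * otherwise xenbus_suspend_cancel(). */
static enum model_frontend_action classic_after_suspend(bool cancelled)
{
    return cancelled ? MODEL_FRONTEND_SUSPEND_CANCELLED
                     : MODEL_FRONTEND_RESUMED;
}
```

With cancelled == true (the `xl save -c` case), the two models diverge: the PVOPS path still reaches the frontend resume hook, while the non-PVOPS path only takes the cancel branch.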
> >
> >
> > I'm wondering why the two types of drivers (PVOPS and non-PVOPS) differ here.
> > Is that because:
> > 1) the PVOPS kernel doesn't take this situation into account and has a bug here,
> > or
> > 2) PVOPS has some other way to avoid this problem?
> >
> > Thank you in advance.
> >
> > -Gonglei
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxx
> > http://lists.xen.org/xen-devel