
Re: [Xen-devel] Re: [PATCH] libxl: do slow resume after failed migration attempt



On Wed, 2011-02-16 at 11:49 +0000, Ian Campbell wrote:
> On Wed, 2011-02-16 at 11:47 +0000, Ian Campbell wrote:
> > # HG changeset patch
> > # User Ian Campbell <ian.campbell@xxxxxxxxxx>
> > # Date 1297856874 0
> > # Node ID 1728ed4bbec9e82ca13c2639c8e4ef8b4dc231b6
> > # Parent  aa466613328f5de78fdfc968473cb06e948c1f5d
> > libxl: do slow resume after failed migration attempt
> > 
> > Both of the current callers of libxl_domain_resume call it after
> > a migration has failed: one after a failure to suspend on the sender,
> > the other after a failure to start on the destination. Both lead to a
> > resume attempt on the sender.
> > 
> > However in the first case, failure to suspend, there is no guarantee
> > that the guest has made it as far as the suspend hypercall and
> > therefore the fast resume method, which frobs the hypercall return to
> > indicate a cancelled suspend, cannot safely be used since it will
> > corrupt %eax/%rax.
> > 
> > For the second case, failure to start on destination, I don't think it
> > really matters if the resume is fast or slow.
> > 
> > Therefore always use the slow/uncooperative version of xc_domain_resume from
> > libxl_domain_resume.
> > 
> > This makes a PV domain which failed to suspend (e.g. because the core
> > Linux PM infrastructure within the guest didn't allow it) recover
> > gracefully.
> 
> A PVHVM domain never suffered from this because libxl_domain_resume
> bails due to a libxl__domain_is_hvm check. I'm not 100% clear whether
> this is correct but I didn't change it. My test with a PVHVM guest which
> acknowledges the suspend but doesn't go on to do anything seems to work.

Looking closer, even for a PV guest which is hacked to not actually try
to suspend, this new xc_domain_resume call fails, and it's actually the
original domain which continues.

I'm inclined to suggest that this is OK: trying a slow xc_domain_resume
will save guests which have suffered certain types of failure and will
be harmless for other types. That said, I wouldn't argue strongly
against a suggestion that the right thing to do in the "failed to
suspend" case is to simply unpause the original domain and let it try
to continue...
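
To make that alternative concrete, here is a rough, untested sketch in C
of how the choice between the two recovery paths might look. The enum
names and choose_recovery() are hypothetical and exist in neither libxl
nor libxc. It is an assumption that the two actions would map to
xc_domain_unpause() and xc_domain_resume(ctx->xch, domid, 0) in real
code; the patch as posted always takes the slow resume path.

```c
#include <assert.h>

/* Hypothetical labels for the two failure cases discussed in this thread. */
enum migrate_failure {
    FAILED_TO_SUSPEND,        /* guest may never have reached the suspend hypercall */
    FAILED_TO_START_ON_DEST   /* guest suspended; the destination failed to start it */
};

/* Hypothetical recovery actions. Assumed mapping in real code:
 *   ACTION_UNPAUSE     -> xc_domain_unpause(xch, domid)
 *   ACTION_SLOW_RESUME -> xc_domain_resume(xch, domid, 0)
 * The fast resume (last argument 1) is deliberately never chosen, because
 * it rewrites the hypercall return value in %eax/%rax. */
enum recovery_action {
    ACTION_UNPAUSE,
    ACTION_SLOW_RESUME
};

/* Pick a recovery action on the sender after a failed migration. */
static enum recovery_action choose_recovery(enum migrate_failure why)
{
    switch (why) {
    case FAILED_TO_SUSPEND:
        /* The guest never suspended, so let the original domain carry on. */
        return ACTION_UNPAUSE;
    default:
        /* Only one other case is modelled here. */
        assert(why == FAILED_TO_START_ON_DEST);
        return ACTION_SLOW_RESUME;
    }
}
```

Whether a plain unpause is actually sufficient in the "failed to
suspend" case is exactly the open question above; the sketch only shows
where the two paths would diverge.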

Ian.

> 
> Ian.
> 
> > 
> > Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
> > 
> > diff -r aa466613328f -r 1728ed4bbec9 tools/libxl/libxl.c
> > --- a/tools/libxl/libxl.c   Tue Feb 15 13:40:50 2011 +0000
> > +++ b/tools/libxl/libxl.c   Wed Feb 16 11:47:54 2011 +0000
> > @@ -226,7 +226,7 @@ int libxl_domain_resume(libxl_ctx *ctx, 
> >          rc = ERROR_NI;
> >          goto out;
> >      }
> > -    if (xc_domain_resume(ctx->xch, domid, 1)) {
> > +    if (xc_domain_resume(ctx->xch, domid, 0)) {
> >          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, 
> >                          "xc_domain_resume failed for domain %u", 
> >                          domid);
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel





 

