[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH] libxl/xl: improve behaviour when guest fails to suspend itself
On Tue, 8 Feb 2011, Ian Campbell wrote: > # HG changeset patch > # User Ian Campbell <ian.campbell@xxxxxxxxxx> > # Date 1297171349 0 > # Node ID ad6f318aac5dcdc5a7be03a4a80f27ed80729638 > # Parent deaa7bc1a7ff6cfad7865394cb8b7205741004c9 > libxl/xl: improve behaviour when guest fails to suspend itself. > > The PV suspend protocol requires guest co-operating whereb the guest > must respond to a suspend request written to the xenstore control node > by clearing the node and then making a suspend hypercall. > > Currently when a guest fails to do this libxl times out and returns > a generic failure code to the caller. > > In response to this failure xl attempts to resume the guest. However > if the guest has not responded to the suspend request then the is no > guarantee that the guest has made the suspend hypercall (in fact it is > quite unlikely). Since the resume process attempts to modify the > return value of the hypercall (to indicate a cancelled suspend) this > results in the guest eax/rax register being corrupted! > > Firstly ensure that libxl cancels the suspend request in a synchronous > manner, to avoid racing with a slow guest. Both pvops and classic-Xen > guests read and clear the control node in a single transaction so it > sufficient to do the same on the toolstack side. > > Secondly add a mechanism to propagate guest timeouts from the > xc_domain_suspend ->suspend callback > (libxl__domain_suspend_common_callback in the case of libxl) to the > caller (libxl__domain_suspend_common). > > Thirdly add a new libxl error code to indicate that the guest did not > respond in time. > > Lastly in xl do not attempt to resume a guest if it has not responded > to the suspend request. > > Tested by live migration of PVops kernels which either ignore the > suspend request, have already crashed and those which suspend/resume > correctly. In the first two cases the source domain is left alone (and > continues to function in the first case) and in the third the > migration is successful. > > Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx> > > diff -r deaa7bc1a7ff -r ad6f318aac5d tools/libxl/libxl.h > --- a/tools/libxl/libxl.h Tue Feb 08 09:13:38 2011 +0000 > +++ b/tools/libxl/libxl.h Tue Feb 08 13:22:29 2011 +0000 > @@ -239,6 +239,7 @@ enum { > ERROR_NOMEM = -5, > ERROR_INVAL = -6, > ERROR_BADFAIL = -7, > + ERROR_GUEST_TIMEDOUT = -8, > }; > > #define LIBXL_VERSION 0 > diff -r deaa7bc1a7ff -r ad6f318aac5d tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c Tue Feb 08 09:13:38 2011 +0000 > +++ b/tools/libxl/libxl_dom.c Tue Feb 08 13:22:29 2011 +0000 > @@ -319,7 +319,7 @@ struct suspendinfo { > int suspend_eventchn; > int domid; > int hvm; > - unsigned int flags; > + int guest_responded; > }; > Even though they are not currently used, I would keep the flags field. > static int libxl__domain_suspend_common_switch_qemu_logdirty(int domid, > unsigned int enable, void *data) > @@ -349,6 +349,7 @@ static int libxl__domain_suspend_common_ > char *path, *state = "suspend"; > int watchdog = 60; > libxl_ctx *ctx = libxl__gc_owner(si->gc); > + xs_transaction_t t; > > if (si->hvm) > xc_get_hvm_param(ctx->xch, si->domid, HVM_PARAM_ACPI_S_STATE, > &s_state); > @@ -363,6 +364,7 @@ static int libxl__domain_suspend_common_ > LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "xc_await_suspend failed > ret=%d", ret); > return 0; > } > + si->guest_responded = 1; > return 1; > } > path = libxl__sprintf(si->gc, "%s/control/shutdown", > libxl__xs_get_dompath(si->gc, si->domid)); > @@ -386,16 +388,40 @@ static int libxl__domain_suspend_common_ > int shutdown_reason; > > shutdown_reason = (info.flags >> XEN_DOMINF_shutdownshift) & > XEN_DOMINF_shutdownmask; > - if (shutdown_reason == SHUTDOWN_suspend) > + if (shutdown_reason == SHUTDOWN_suspend) { > + si->guest_responded = 1; > return 1; > + } > } > state = libxl__xs_read(si->gc, XBT_NULL, path); > watchdog--; > } > + > + /* > + * Guest appear to not be responding, clear the suspend request > + * synchronously using a transaction. > + */ > if (!strcmp(state, "suspend")) { > - LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "guest didn't suspend in time"); > - libxl__xs_write(si->gc, XBT_NULL, path, ""); > + LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "guest didn't suspend in time, > clearing suspend request"); > + retry_transaction: > + t = xs_transaction_start(ctx->xsh); > + > + state = libxl__xs_read(si->gc, t, path); > + > + libxl__xs_write(si->gc, t, path, ""); > + Doesn't it make sense to clear path only if state is still "suspend"? Clearing the path again even if the guest already cleared it may call a spurious xenstore watch event in the guest, right? > + > + /* Final check, guest may have suspended while we were clearing the > request. */ > + if (!strcmp(state, "suspend")) { > + LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "final check, guest didn't > suspend"); > + return 0; > + } > + > + si->guest_responded = 1; > return 1; > } If "suspend" was gone before we cleared it, shouldn't we check xc_domain_getinfolist again before asserting that the domain suspended correcty? Or maybe is better just to return 0 without the final check, assuming that the guest failed in any case. > @@ -414,10 +440,10 @@ int libxl__domain_suspend_common(libxl_c > | (hvm) ? XCFLAGS_HVM : 0; > > si.domid = domid; > - si.flags = flags; > si.hvm = hvm; > si.gc = &gc; > si.suspend_eventchn = -1; > + si.guest_responded = 0; > > si.xce = xc_evtchn_open(NULL, 0); > if (si.xce == NULL) > @@ -441,8 +467,14 @@ int libxl__domain_suspend_common(libxl_c > > rc = xc_domain_save(ctx->xch, fd, domid, 0, 0, flags, &callbacks, hvm); > if ( rc ) { > - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "saving domain"); > - rc = ERROR_FAIL; > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "saving domain: %s", > + si.guest_responded ? > + "domain responded to suspend request" : > + "domain did not respond to suspend request"); > + if ( !si.guest_responded ) > + rc = ERROR_GUEST_TIMEDOUT; > + else > + rc = ERROR_FAIL; > } > > if (si.suspend_eventchn > 0) > diff -r deaa7bc1a7ff -r ad6f318aac5d tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c Tue Feb 08 09:13:38 2011 +0000 > +++ b/tools/libxl/xl_cmdimpl.c Tue Feb 08 13:22:29 2011 +0000 > @@ -2583,7 +2583,10 @@ static void migrate_domain(const char *d > if (rc) { > fprintf(stderr, "migration sender: libxl_domain_suspend failed" > " (rc=%d)\n", rc); > - goto failed_resume; > + if (rc == ERROR_GUEST_TIMEDOUT) > + goto failed_suspend; > + else > + goto failed_resume; > } > > //fprintf(stderr, "migration sender: Transfer complete.\n"); > @@ -2661,6 +2664,12 @@ static void migrate_domain(const char *d > libxl_domain_destroy(&ctx, domid, 1); /* bang! */ > fprintf(stderr, "Migration successful.\n"); > exit(0); > + > + failed_suspend: > + close(send_fd); > + migration_child_report(child, recv_fd); > + fprintf(stderr, "Migration failed, failed to suspend at sender.\n"); > + exit(-ERROR_FAIL); > > failed_resume: > close(send_fd); > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxxxxxxxx > http://lists.xensource.com/xen-devel > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |