[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [RFC PATCH 2/3] remus: implement remus checkpoint in v2 save



On 16/07/14 17:02, Shriram Rajagopalan wrote:


 
 
+            rc = send_domain_memory_live(ctx);
+        }
+        else
+        {
+            DPRINTF("Starting nonlive save");
+            rc = send_domain_memory_nonlive(ctx);
+        }

-    if ( rc )
-        goto err;
+        if ( rc )
+            goto err;

-    /* Refresh domain information now it has paused. */
-    if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
-         (ctx->dominfo.domid != ctx->domid) )
-    {
-        PERROR("Unable to refresh domain information");
-        rc = -1;
-        goto err;
-    }
-    else if ( (!ctx->dominfo.shutdown ||
-               ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
-              !ctx->dominfo.paused )
-    {
-        ERROR("Domain has not been suspended");
-        rc = -1;
-        goto err;
-    }
+        /* Refresh domain information now it has paused. */
+        if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) != 1) ||
+             (ctx->dominfo.domid != ctx->domid) )
+        {
+            PERROR("Unable to refresh domain information");
+            rc = -1;
+            goto err;
+        }
+        else if ( (!ctx->dominfo.shutdown ||
+                  ctx->dominfo.shutdown_reason != SHUTDOWN_suspend ) &&
+                  !ctx->dominfo.paused )
+        {
+            ERROR("Domain has not been suspended");
+            rc = -1;
+            goto err;
+        }


A silly question: Shouldn't this check for 'suspended status' be before the call to 
send_domain_memory_live() [under remus]. I am assuming that the
send_domain_memory_live() function is the one that sends the dirty page data
out on the wire.

Even for non-live migrates, we have to ensure that the VM has entered the suspend state.  A PV guest cannot possibly recover from resume if it didn't voluntarily suspend as it will loose its reference to the start info page (whos mfn lives in vcpu0.edx for the duration of the migrate and must be updated on the receiving side).  Any VM which doesn't sufficiently quiesce its fronends risks ring and memory corruption on resume.
In this case, the send_domain_memory_XXX functions do take care of pausing the domain at the appropriate time, but this here is a sanity check.


True. I am assuming that the send_domain_memory_XXX functions do some
heavy lifting (copying out dirty pages onto the fd, etc).  So, shouldn't the sanity 
check occur right after issuing a suspend op, i.e., in the send_domain_memory_XXX functions,
before attempting to send the domain memory on the wire?

Assuming that the check is already there in those functions, then this block 
above would be redundant.



You are correct.  The code is in the way it is because of the order in which I implemented things when building live migration from first principles.

suspend_domain() confirming that the domain has indeed suspended is a recent addition, and does indeed negate the refresh of the domain information in save().  I will fix that in the upcoming series.

However, I think keeping the sanity check that the domain has actually been paused is still a good idea.  It is a stated precondition of the following functions and I know for certain that debugging the tail of migration with an unpaused VM causes some very subtle bugs on resume.

~Andrew
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.