[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [Patch v2] tools/migrate: Fix regression when migrating from older version of Xen



On 22/07/13 18:59, Ian Campbell wrote:
> On Mon, 2013-07-22 at 18:52 +0100, Andrew Cooper wrote:
>> On 18/07/13 19:47, Shriram Rajagopalan wrote:
>>
>>> On Thu, Jul 18, 2013 at 12:27 PM, Andrew Cooper
>>> <andrew.cooper3@xxxxxxxxxx> wrote:
>>>         On 18/07/13 17:20, Shriram Rajagopalan wrote:
>>>         
>>>         > On Tue, Jul 16, 2013 at 6:52 AM, Andrew Cooper
>>>         > <andrew.cooper3@xxxxxxxxxx> wrote: 
>>>         > and you setting ctx->last_checkpoint = 0 basically means
>>>         > that you are
>>>         > banking on the far end (with older version of tools) to
>>>         > close the socket, causing
>>>         > "get pagebuf" to fail and subsequently land on finish.
>>>         > IIRC, this was the behavior before
>>>         > XC_SAVE_ID_LAST_CHECKPOINT was introduced
>>>         > by Ian Campbell, to get rid of this benign error message.
>>>         > 
>>>         
>>>         
>>>         That might have been 'fine' for PV guests, but it cause
>>>         active breakage for HVM domains where the qemu save record
>>>         immediately follows in the migration stream.
>>>         
>>>         
>>>
>>>
>>> Just to clarify.. the code flows like this, iirc.
>>>
>>>
>>> loadpages:
>>>  while (1)
>>>      if !completed
>>>         get pagebufs
>>>         if 0 pages sent, break
>>>      endif
>>>      apply batch (pagebufs)
>>>  endwhile
>>>
>>>
>>>  if !completed
>>>    get tailbuf [[this is where the QEMU record would be obtained]]
>>>    completed = 1
>>>  endif
>>>
>>>
>>>  if last_checkpoint
>>>    goto finish
>>>  endif
>>>
>>>
>>>  get pagebuf or goto finish on error ---> this is where old code
>>> used to exit
>>>  get tailbuf
>>> goto loadpages
>>> finish:
>>>    apply tailbuf [tailbuf obtained inside the 'if !completed' block]
>>>    do the rest of the restore
>> (Logically joining the two divergent threads, as this is the answer to
>> both)
>>
>> This has nothing to do with the buffering mode, and that is not where
>> the Qemu record would be obtained from.
>>
>> As the code currently stands, if a XC_SAVE_ID_LAST_CHECKPOINT chunk is
>> not seen in the stream, we complete the loadpages: section, including
>> the magic pages (TSS, console info etc).
>>
>> We then read the buffer_tail() on line 171, set ctx->completed on line
>> 1725, but fail the ctx->last_checkpoint check on line 1758.
>>
>> What we should do is pass the last_checkpoint test, and goto finish
>> which then calls dump_qemu().  What actually happens is a call to
>> pagebuf_get() on line 1766 which raises an error because of finding a
>> Qemu save record rather than more pages.
>>
>> So this is very much a bug directly introduced by 00a4b65f85, and can
>> only be fixed with knowledge from the higher levels of the toolstack.
>>
>> As for the wording of the parameter, I still prefer the original
>> "last_checkpoint_unaware" over "checkpointed_stream" as it is more
>> accurate.
>>
>> Any migration stream started from a version of the tools after c/s
>> 00a4b65f85 will work, whether it is checkpointed or not (as the
>> XC_SAVE_ID_LAST_CHECKPOINT chunk will be sent in the correct place).
>> Any migration started from a version of the tools before c/s
>> 00a4b65f85 will fail because it has no idea that the receiving end is
>> expecting it to insert XC_SAVE_ID_LAST_CHECKPOINT chunk.  The fault
>> here is with the receiving end expecting to find a
>> XC_SAVE_ID_LAST_CHECKPOINT chunk.
>>
>> The only fix is for newer toolstack to be aware that the migration
>> stream is from an older toolstack, and to set
>> last_checkpoint_unaware=1, which will set ctx->last_checkpoint to 1,
>> allowing the receiving side of the migration to correctly read the
>> migration stream.
> The toolstack cannot in general know that (i.e. how does xl know?)
>
> It does know if it is doing checkpointing or not though.
>
> There are four cases:
>
> Regular migrate from pre-00a4b65f85 to post-00a4b65f85. Receiver sets
> last_checkpoint = 1 at start of day. Doesn't care that sender never
> sends XC_SAVE_ID_LAST_CHECKPOINT.
>
> Regular migrate from post-00a4b65f85 to post-00a4b65f85. Receiver sets
> last_checkpoint = 1 at start of day. Doesn't care that it sees
> XC_SAVE_ID_LAST_CHECKPOINT again, just ends up (pointlessly) setting
> last_checkpoint = 1.
>
> Checkpointed migrate from post-00a4b65f85 to post-00a4b65f85. Receiver
> doesn't set last_checkpoint = 1 at start of day.
> XC_SAVE_ID_LAST_CHECKPOINT is sent at the right time and both sides
> agree etc. Things work.
>
> Checkpointed migrate from pre-00a4b65f85 to post-00a4b65f85. We don't
> support this, because checkpoints should only be supported between like
> versions of Xen (N->N+1 case doesn't apply).
>
> Ian.
>
>

Ok - that seems to make it easier.  I will see about plumbing the
correct information through from xend/remus.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.