On 22/06/14 15:36, Shriram Rajagopalan wrote:
On Jun 19, 2014 4:16 PM, "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx> wrote:
>
> On 19/06/14 11:23, Hongyang Yang wrote:
> > On 06/19/2014 05:36 PM, Andrew Cooper wrote:
> >> On 19/06/14 10:13, Hongyang Yang wrote:
> >>> Hi Andrew, Ian,
> >>>
> >>> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
> >>>> On 17/06/14 17:40, Ian Campbell wrote:
> >>>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> >>>>>> +The following features are not yet fully specified and will be
> >>>>>> +included in a future draft.
> >>>>>> +
> >>>>>> +* Remus
> >>>>> What is the plan for Remus here?
> >>>>>
> >>>>> It has pretty large implications for the flow of a migration
> >>>>> stream and therefore on the code in the final two patches, I
> >>>>> suspect it will require high level changes to those functions, so
> >>>>> I'm reluctant to spend a lot of time on them as they are.
> >>>>
> >>>> I don't believe too much change will be required to the final two
> >>>> patches, but it does depend on fixing the current qemu record layer
> >>>> violations.
> >>>>
> >>>> It will be much easier to do after a prototype to the libxl level
> >>>> fixes.
> >>>
> >>> I'm trying to port Remus to migration v2...
> >>
> >> Ah fantastic! Here I was expecting to eventually have to brave that
> >> code myself.
> >>
> >> How is it going? How are you finding hacking on v2 compared to the
> >> legacy code? (I think you are the first person who isn't me trying to
> >> extend it.) Is there anything I can do while still developing v2 to
> >> make things easier?
> >
> > > It's just starting, and only on the libxc side, based on your patch
> > > series. The v2 code is much cleaner than the legacy code, easy to
> > > understand, and yes, it makes hacking easier. Maybe I will need your
> > > help as the hacking goes on...
> >
> >>
> >>
> >> I really need to get a prototype libxl framing document sorted, but
> >> in principle my plan (given only a minimum understanding of the
> >> algorithm) is this:
> >>
> >> ...
> >> * Write page data update
> >> * Write vcpu context etc
> >> * Write a REMUS_CHECKPOINT record (or appropriate name)
> >> * Call the checkpoint callback, passing ownership of the fd to libxl
> >> ** libxl writes a libxl qemu record into the stream
> >> * checkpoint callback returns to libxl, returning ownership of the fd
> >> * libxc chooses between sending an END record or looping
> >> ...
> >>
> >> The fd ownership is expected to work exactly the same on the
> >> receiving side, using the REMUS_CHECKPOINT record as an indicator.
> >
> > > It mostly looks plausible, but the save side and restore side need
> > > to be synchronised, otherwise the following problem may exist:
> > >   The sending side is in libxl and sends qemu records while the
> > >   receiving side is still in libxc; by the time it switches to
> > >   libxl, part of a record may be lost.
> > > Maybe a handshake will solve the problem, whether it's in libxl or
> > > libxc, but the current migration framing does not support sending
> > > messages from the receiving side to the sending side, so it needs
> > > modification. We should support this feature.
>
> Ah yes I see.
>
> How about this?
>
> Libxc REMUS_CHECKPOINT is defined as a 0-length record (like the
> current END record).
> Libxl REMUS_CHECKPOINT is defined as containing at least a "last
> checkpoint" bit in the header.
>
> Libxc writes a libxc REMUS_CHECKPOINT record into the stream and always
> hands the fd to libxl.
> Libxl then writes a libxl REMUS_CHECKPOINT record, including the last
> checkpoint bit if needed.
>
I am a bit lost on this part. A silly question: the last I recall (a
long time ago), the v2 format didn't allow for the page compression to
be done asynchronously. Has this limitation changed?

The v2 format specifies records in a stream; nothing more. It has no
bearing on whether the page compression happens asynchronously wrt
unpausing the domain or not.

I presume you actually mean the current implementation...

IOW, in the current migration process, the dirty page data is written
out while the guest remains suspended. With remus, the compressed page
data is written out after resuming the guest. This deferred write out
logic needs to be incorporated into v2 code.

... which is the way it is because the first implementation was done
with regular basic migration as a top priority. This can certainly be
reworked when remus support is reintroduced.

> This means that it is libxl on the receiving side which determines
> whether the last checkpoint has been reached, and libxc must always
> pass the fd up. This fixes the synchronisation issues, without
> requiring a back channel, but still maintaining appropriate layering.
>
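
To make the two-layer REMUS_CHECKPOINT idea above concrete, here is a
minimal sketch of how the records might be framed and consumed. It is
an illustration only: the record type numbers, the flag value, the
little-endian byte order, the placement of the "last checkpoint" bit in
a 4-byte body, and the type/length header with 8-byte body padding are
all assumptions made for the example, not anything settled in this
thread.

```python
import struct

# Hypothetical values, chosen only for illustration.
LIBXC_REC_REMUS_CHECKPOINT = 0x1000
LIBXL_REC_REMUS_CHECKPOINT = 0x2000
LAST_CHECKPOINT = 0x1  # hypothetical "last checkpoint" flag bit


def pack_record(rec_type, body=b""):
    """Assumed framing: u32 type, u32 body length, body padded to 8 bytes."""
    pad = (-len(body)) % 8
    return struct.pack("<II", rec_type, len(body)) + body + b"\0" * pad


def unpack_record(buf, offset=0):
    """Return (type, body, offset of the next record)."""
    rec_type, length = struct.unpack_from("<II", buf, offset)
    start = offset + 8
    body = buf[start:start + length]
    return rec_type, body, start + length + (-length) % 8


def libxc_checkpoint():
    # libxc layer: a 0-length marker, like the END record.
    return pack_record(LIBXC_REC_REMUS_CHECKPOINT)


def libxl_checkpoint(last):
    # libxl layer: carries the "last checkpoint" bit.
    flags = LAST_CHECKPOINT if last else 0
    return pack_record(LIBXL_REC_REMUS_CHECKPOINT, struct.pack("<I", flags))


def receive_checkpoint(buf, offset=0):
    """Receiver: libxc sees its marker and always hands over to libxl,
    which reads its own record and decides whether this was the last one."""
    rec_type, body, offset = unpack_record(buf, offset)
    if rec_type != LIBXC_REC_REMUS_CHECKPOINT or body:
        raise ValueError("expected a 0-length libxc checkpoint record")
    rec_type, body, offset = unpack_record(buf, offset)
    if rec_type != LIBXL_REC_REMUS_CHECKPOINT:
        raise ValueError("expected a libxl checkpoint record")
    (flags,) = struct.unpack("<I", body)
    return bool(flags & LAST_CHECKPOINT), offset
```

Because the libxc marker carries no payload, the receiving libxc never
has to know whether more checkpoints follow; that decision sits entirely
in the libxl record, which is what removes the need for a back channel
in this part of the proposal.
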
So there is a TODO item in the current libxl-remus patches. We need an
explicit acknowledgement from the receiver side that it has gotten the
memory checkpoint. Whether it is from libxc or libxl on the receiver
side does not matter, as long as the ack signifies reception of the
memory checkpoint.

The need for an explicit memory ack is because the disk and memory
checkpoint channels are independent. We need both acks before releasing
the buffered network output on the receiver side.

The disk channel (blktap2 or DRBD) has always sent an explicit ack, but
not the memory channel. Though it's over TCP, on a given iteration,
memory checkpoint data may still reside in the sender side socket
buffer while the disk checkpoint has reached the other end -- which
isn't good.

Existing libxc code does a fdatasync or fsync on the fd at the end of
each iteration. I don't think it works as intended on TCP sockets.
Please correct me if I am wrong about this.
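
The explicit memory ack described above can be sketched as follows.
This is a toy model under stated assumptions, not the libxc or libxl
code: the one-byte ack, the length-prefixed payload, and every function
name here are hypothetical, and a local socket pair stands in for the
TCP connection between the two hosts.

```python
import socket
import struct

ACK = b"A"  # hypothetical one-byte acknowledgement


def recv_exact(sock, n):
    """Read exactly n bytes, or fail if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the stream")
        buf += chunk
    return buf


def send_memory_checkpoint(sock, payload):
    # sendall() returning only means the data reached the local socket
    # buffer, not that the peer has received it.
    sock.sendall(struct.pack("<I", len(payload)) + payload)


def receive_memory_checkpoint(sock):
    (length,) = struct.unpack("<I", recv_exact(sock, 4))
    payload = recv_exact(sock, length)
    sock.sendall(ACK)  # explicit ack once the whole checkpoint is in hand
    return payload


def wait_for_memory_ack(sock):
    return recv_exact(sock, 1) == ACK


def may_release_network_output(memory_acked, disk_acked):
    # Both independent channels must have acknowledged the checkpoint.
    return memory_acked and disk_acked


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    send_memory_checkpoint(sender, b"dirty pages")
    receive_memory_checkpoint(receiver)
    print(may_release_network_output(wait_for_memory_ack(sender), True))
    sender.close()
    receiver.close()
```

The point of the sketch is that the sender only treats the memory
checkpoint as delivered after reading the ack, rather than after its
own write or sync call returns.
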

That is a very sensible need for an explicit ack, although it would seem
to make more sense at the libxl level rather than the libxc level.

~Andrew