
Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero ioreq page only one time



On 06/11/2015 06:20 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Wen Congyang [mailto:wency@xxxxxxxxxxxxxx]
>> Sent: 11 June 2015 09:48
>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-devel@xxxxxxxxxxxxx
>> Cc: Wei Liu; Ian Campbell; guijianfeng@xxxxxxxxxxxxxx;
>> yunhong.jiang@xxxxxxxxx; Eddie Dong; rshriram@xxxxxxxxx; Ian Jackson
>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero ioreq
>> page only one time
>>
>> On 06/11/2015 04:32 PM, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Wen Congyang [mailto:wency@xxxxxxxxxxxxxx]
>>>> Sent: 11 June 2015 02:14
>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-
>> devel@xxxxxxxxxxxxx
>>>> Cc: Wei Liu; Ian Campbell; guijianfeng@xxxxxxxxxxxxxx;
>>>> yunhong.jiang@xxxxxxxxx; Eddie Dong; rshriram@xxxxxxxxx; Ian Jackson
>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero
>> ioreq
>>>> page only one time
>>>>
>>>> On 06/10/2015 07:47 PM, Paul Durrant wrote:
>>>>>> -----Original Message-----
>>>>>> From: xen-devel-bounces@xxxxxxxxxxxxx [mailto:xen-devel-
>>>>>> bounces@xxxxxxxxxxxxx] On Behalf Of Wen Congyang
>>>>>> Sent: 10 June 2015 12:38
>>>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-
>>>> devel@xxxxxxxxxxxxx
>>>>>> Cc: Wei Liu; Ian Campbell; guijianfeng@xxxxxxxxxxxxxx;
>>>>>> yunhong.jiang@xxxxxxxxx; Eddie Dong; rshriram@xxxxxxxxx; Ian Jackson
>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero
>>>> ioreq
>>>>>> page only one time
>>>>>>
>>>>>> On 06/10/2015 06:58 PM, Paul Durrant wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Wen Congyang [mailto:wency@xxxxxxxxxxxxxx]
>>>>>>>> Sent: 10 June 2015 11:55
>>>>>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-
>>>>>> devel@xxxxxxxxxxxxx
>>>>>>>> Cc: Wei Liu; Ian Campbell; yunhong.jiang@xxxxxxxxx; Eddie Dong;
>>>>>>>> guijianfeng@xxxxxxxxxxxxxx; rshriram@xxxxxxxxx; Ian Jackson
>>>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore:
>> zero
>>>>>> ioreq
>>>>>>>> page only one time
>>>>>>>>
>>>>>>>> On 06/10/2015 06:40 PM, Paul Durrant wrote:
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Wen Congyang [mailto:wency@xxxxxxxxxxxxxx]
>>>>>>>>>> Sent: 10 June 2015 10:06
>>>>>>>>>> To: Andrew Cooper; Yang Hongyang; xen-devel@xxxxxxxxxxxxx;
>> Paul
>>>>>>>> Durrant
>>>>>>>>>> Cc: Wei Liu; Ian Campbell; yunhong.jiang@xxxxxxxxx; Eddie Dong;
>>>>>>>>>> guijianfeng@xxxxxxxxxxxxxx; rshriram@xxxxxxxxx; Ian Jackson
>>>>>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore:
>>>> zero
>>>>>>>> ioreq
>>>>>>>>>> page only one time
>>>>>>>>>>
>>>>>>>>>> Cc: Paul Durrant
>>>>>>>>>>
>>>>>>>>>> On 06/10/2015 03:44 PM, Andrew Cooper wrote:
>>>>>>>>>>> On 10/06/2015 06:26, Yang Hongyang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 06/09/2015 03:30 PM, Andrew Cooper wrote:
>>>>>>>>>>>>> On 09/06/2015 01:59, Yang Hongyang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 06/08/2015 06:15 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>> On 08/06/15 10:58, Yang Hongyang wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 06/08/2015 05:46 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>>>> On 08/06/15 04:43, Yang Hongyang wrote:
>>>>>>>>>>>>>>>>>> ioreq page contains evtchn which will be set when we resume
>>>>>>>>>>>>>>>>>> the secondary vm the first time. The hypervisor will check if
>>>>>>>>>>>>>>>>>> the evtchn is corrupted, so we cannot zero the ioreq page more
>>>>>>>>>>>>>>>>>> than one time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The ioreq->state is always STATE_IOREQ_NONE after the vm is
>>>>>>>>>>>>>>>>>> suspended, so it is OK if we only zero it one time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Yang Hongyang <yanghy@xxxxxxxxxxxxxx>
>>>>>>>>>>>>>>>>>> Signed-off-by: Wen congyang <wency@xxxxxxxxxxxxxx>
>>>>>>>>>>>>>>>>>> CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The issue here is that we are running the restore algorithm
>>>>>>>>>>>>>>>>> over a domain which has already been running in Xen for a
>>>>>>>>>>>>>>>>> while. This is a brand new usecase, as far as I am aware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exactly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Does the qemu process associated with this domain get frozen
>>>>>>>>>>>>>>>>> while the secondary is being reset, or does the process get
>>>>>>>>>>>>>>>>> destroyed and recreated?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you mean by reset? Do you mean the secondary is
>>>>>>>>>>>>>>>> suspended at checkpoint?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Well - at the point that the buffered records are being
>>>>>>>>>>>>>>> processed, we are in the process of resetting the state of the
>>>>>>>>>>>>>>> secondary to match the primary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, at this point, the qemu process associated with this domain
>>>>>>>>>>>>>> is frozen. The suspend callback will call libxl__qmp_stop()
>>>>>>>>>>>>>> (vm_stop() in qemu) to pause qemu. After we have processed all
>>>>>>>>>>>>>> records, qemu will be restored with the received state; that's why
>>>>>>>>>>>>>> we add a libxl__qmp_restore() (qemu_load_vmstate() in qemu) API to
>>>>>>>>>>>>>> restore qemu with the received state. Currently in libxl, qemu
>>>>>>>>>>>>>> only starts with the received state; there is no API to load a
>>>>>>>>>>>>>> received state while qemu has been running for a while.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now that I consider this more, it is absolutely wrong not to zero
>>>>>>>>>>>>> the page here.  The event channel in the page is not guaranteed to
>>>>>>>>>>>>> be the same between the primary and secondary,
>>>>>>>>>>>>
>>>>>>>>>>>> That's why we don't zero it on the secondary.
>>>>>>>>>>>
>>>>>>>>>>> I think you missed my point.  Apologies for the double negative.  It
>>>>>>>>>>> must, under all circumstances, be zeroed at this point, for safety
>>>>>>>>>>> reasons.
>>>>>>>>>>>
>>>>>>>>>>> The page in question is subject to logdirty just like any other guest
>>>>>>>>>>> page, which means that if the guest writes to it naturally (i.e. not
>>>>>>>>>>> a Xen or Qemu write, both of which have magic mappings that are not
>>>>>>>>>>> subject to logdirty), it will be transmitted in the stream.  As the
>>>>>>>>>>> event channel could be different, the lack of zeroing at this point
>>>>>>>>>>> means that the event channel would be wrong as opposed to simply
>>>>>>>>>>> missing.  This is a worse position to be in.
>>>>>>>>>>
>>>>>>>>>> The guest should not access this page. I am not sure whether the
>>>>>>>>>> guest can access the ioreq page at all.
>>>>>>>>>>
>>>>>>>>>> But in the exceptional case, the ioreq page is dirtied and is copied
>>>>>>>>>> to the secondary vm. The ioreq page will then contain a wrong event
>>>>>>>>>> channel, and the hypervisor will check it: if the event channel is
>>>>>>>>>> wrong, the guest will be crashed.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> and we don't want to unexpectedly
>>>>>>>>>>>>> find a pending/in-flight ioreq.
>>>>>>>>>>>>
>>>>>>>>>>>> ioreq->state is always STATE_IOREQ_NONE after the vm is suspended,
>>>>>>>>>>>> so there should be no pending/in-flight ioreq at checkpoint.
>>>>>>>>>>>
>>>>>>>>>>> In the common case perhaps, but we must consider the exceptional
>>>>>>>>>>> case.  The exceptional case here is some corruption which happens to
>>>>>>>>>>> appear as an in-flight ioreq.
>>>>>>>>>>
>>>>>>>>>> If the state is not STATE_IOREQ_NONE here, it may be a hypervisor
>>>>>>>>>> bug. If the hypervisor has a bug, anything can happen. I think we
>>>>>>>>>> should trust the hypervisor.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Either qemu needs to take care of re-initialising the event
>>>>>>>>>>>>> channels back to appropriate values, or Xen should tolerate the
>>>>>>>>>>>>> channels disappearing.
>>>>>>>>>>>
>>>>>>>>>>> I still stand by this statement.  I believe it is the only safe way
>>>>>>>>>>> of solving the issue you have discovered.
>>>>>>>>>>
>>>>>>>>>> Add a new qemu monitor command to update the ioreq page?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you're attaching to a 'new' VM (i.e. one with an updated image)
>>>>>>>>> then I suspect you're going to have to destroy and re-create the ioreq
>>>>>>>>> server so that the shared page gets re-populated with the correct
>>>>>>>>> event channels. Either that, or you're going to have to ensure that
>>>>>>>>> the page is not part of the restored image and sample the new one that
>>>>>>>>> Xen should have set up.
>>>>>>>>
>>>>>>>>
>>>>>>>> I agree with that. I will try to add a new qemu monitor command (or do
>>>>>>>> it when updating qemu's state) to destroy and re-create it.
>>>>>>>
>>>>>>> The slightly tricky part of that is that you're going to have to cache
>>>>>>> and replay all the registrations that were done on the old instance, but
>>>>>>> you need to do that in any case as it's not state that is transferred in
>>>>>>> the VM save record.
>>>>>>
>>>>>> Why do we have to cache and replay all the registrations that were done
>>>>>> on the old instance?
>>>>>
>>>>> Do you not have device models that you need to continue to function?  When
>>>>> the ioreq server is torn down, all MMIO, port IO and PCI config ranges
>>>>> that were mapped to it will disappear.
>>>>
>>>> Yes, I don't know exactly what needs to be done until I implement and test
>>>> it.
>>>>
>>>> I have some questions about it:
>>>> 1. Can the guest access the ioreq page? If the page is modified by the
>>>>    guest unexpectedly, what will happen?
>>>
>>> No, the guest cannot modify the pages once a non-default ioreq server is
>>> active. The pages are removed from the guest P2M when it is activated, which
>>> is one of the reasons for modifying QEMU to not behave as a legacy default
>>> server.
>>>
>>>> 2. If the ioreq page is dirtied by the guest, it will be transferred from
>>>>    primary to secondary during the checkpoint. The evtchn is then invalid,
>>>>    so I think the best behavior is to crash the guest rather than let it
>>>>    continue to run.
>>>
>>> As I said, the pages are not in the P2M if the server is active, so they
>>> will not be transferred as part of the VM state. However, this presents a
>>> problem; at the far end, the emulator will not be able to hook into the
>>> guest. So, when the source domain is paused, the ioreq server needs to be
>>> torn down (so that its pages are re-inserted into the P2M and marked dirty
>>> for transfer). This is what happens in a normal migration. One extra problem
>>> you have is that the source domain is not then killed; it is resumed along
>>> with the emulator. Thus, on resume, the emulator needs to create a new ioreq
>>> server and re-register all its device models with that new server.
>>>
>>> I don't know the details of what you do at the far end, but if you always
>>> start a new emulator instance using the QEMU save record then you should be
>>> fine (just like with a normal migration).
>>
>> I can't find the code where the ioreq server is torn down when the source
>> domain is paused. Which function is it?
> 
> Sorry, I overstated that. By 'torn down' I meant disabled. The function that 
> does it is:
> 
> static void xen_hvm_change_state_handler(void *opaque, int running,
>                                          RunState rstate)
> {
>     XenIOState *state = opaque;
> 
>     if (running) {
>         xen_main_loop_prepare(state);
>     }
> 
>     xen_set_ioreq_server_state(xen_xc, xen_domid,
>                                state->ioservid,
>                                (rstate == RUN_STATE_RUNNING));
> }
> 

Yes, I see it now.

If the ioreq page is cleared, can xen_main_loop_prepare() put the correct
evtchn in the ioreq page? If so, I think COLO can work without this patch.
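
For context, all this patch really does is make sure the magic page is cleared
only on the first pass of the restore, so that later COLO checkpoints don't wipe
out the event channel Xen has already written into the live page. Roughly like
this (a simplified sketch, not the exact libxc change; the flag name is just
illustrative):

    #include <stdbool.h>
    #include <xenctrl.h>

    /*
     * Sketch of the idea only: clear the magic page on the first restore
     * pass, and skip the clear on later COLO checkpoints so the event
     * channel Xen has written into the live page survives.
     * "first_checkpoint" stands in for whatever per-restore state the real
     * code keeps.
     */
    static int clear_magic_page(xc_interface *xch, uint32_t domid,
                                unsigned long pfn, bool first_checkpoint)
    {
        if ( !first_checkpoint )
            return 0;               /* keep the live evtchn on later passes */

        return xc_clear_domain_page(xch, domid, pfn);
    }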

In the hypervisor, the ioreq page is cleared before being re-inserted into the
P2M, so the far end always gets a zeroed ioreq page.
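
As far as I understand it, the disable path in the hypervisor behaves roughly
like this (a simplified sketch from memory; the helper name and exact calls are
approximations, not verbatim Xen code):

    /*
     * Rough sketch of the idea only: when a non-default ioreq server is
     * disabled, its page is wiped before being handed back to the guest
     * physmap and marked dirty, so a zeroed page is what ends up in the
     * migration stream.  Names and signatures are approximate.
     */
    static void ioreq_gfn_add_back(struct domain *d, struct hvm_ioreq_page *iorp)
    {
        clear_page(iorp->va);                           /* wipe state + evtchn */
        guest_physmap_add_page(d, iorp->gmfn, page_to_mfn(iorp->page), 0);
        paging_mark_dirty(d, page_to_mfn(iorp->page));  /* make sure it is sent */
    }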

IIRC, there was only the default ioreq server when I wrote this patch. In that
case we don't have xen_main_loop_prepare(). It is OK if COLO cannot work with an
older version of qemu.
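
If we do end up destroying and re-creating a non-default ioreq server from a
monitor command, as discussed earlier in this thread, the emulator would also
have to replay its range registrations against the new server. At the libxenctrl
level that would look roughly like this (a sketch only; the calls are from my
memory of the Xen 4.5-era API, and the MMIO range is made up for illustration):

    #include <xenctrl.h>

    /* Sketch only: re-create an ioreq server and replay one registration. */
    static int recreate_ioreq_server(xc_interface *xch, domid_t domid)
    {
        ioservid_t id;
        int rc;

        /* Create a fresh server; Xen allocates new magic pages/event channels. */
        rc = xc_hvm_create_ioreq_server(xch, domid, 1 /* handle_bufioreq */, &id);
        if ( rc )
            return rc;

        /* Replay the device model's registrations, e.g. one MMIO range. */
        rc = xc_hvm_map_io_range_to_ioreq_server(xch, domid, id, 1 /* is_mmio */,
                                                 0xfe000000, 0xfe000fff);
        if ( rc )
            return rc;

        /* Activate it: the magic pages are removed from the guest P2M. */
        return xc_hvm_set_ioreq_server_state(xch, domid, id, 1);
    }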

Yang, can you test COLO without this patch? I guess it can work, and we can drop
this patch now.

Thanks
Wen Congyang

>>
>> In our implementation, we don't start a new emulator. The code can work, but
>> some bugs may not be triggered.
>>
> 
> How do you reconcile the incoming QEMU save record with the running emulator 
> state?
> 
>   Paul
> 
>> Thanks
>> Wen Congyang
>>
>>>
>>>   Paul
>>>
>>>>
>>>> Thanks
>>>> Wen Congyang
>>>>
>>>>>
>>>>>   Paul
>>>>>
>>>>>> We will set the guest to a new state; the old state should be dropped.
>>>>>>
>>>>>> Thanks
>>>>>> Wen Congyang
>>>>>>
>>>>>>>
>>>>>>>   Paul
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen Congyang
>>>>>>>>
>>>>>>>>>
>>>>>>>>>   Paul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen Congyang
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ~Andrew
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
> 
> 
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

