On 04/01/16 15:31, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [Xen-devel] question about migration"):
>> On 25/12/2015 03:06, Wen Congyang wrote:
>>> Another problem:
>>> If migration fails after the guest is suspended, we will resume it in the 
>>> source.
>>> In this case, we cannot shut it down, because no process handles the 
>>> shutdown event.
>>> The log in /var/log/xen/xl-hvm_nopv.log:
>>> Waiting for domain hvm_nopv (domid 1) to die [pid 5508]
>>> Domain 1 has shut down, reason code 2 0x2
>>> Domain has suspended.
>>> Done. Exiting now
>>> The xl has exited...
> ...
>> Hmm yes.  This is a libxl bug in libxl_evenable_domain_death(). CC'ing 
>> the toolstack maintainers.
> AIUI this is a response to Wen's comments above.
>> It waits for the @releasedomain watch, but doesn't interpret the results 
>> correctly.  In particular, if it can still make successful hypercalls 
>> with the provided domid, that domain was not the subject of 
>> @releasedomain.  (I also observe that domain_death_xswatch_callback() is 
>> very inefficient.  It only needs to make a single hypercall, not query 
>> the entire state of all domains.)
> I don't understand precisely what you allege this bug to be, but:
> * libxl_evenable_domain_death may generate two events, a
>   This is documented in libxl.h (although it refers to DESTROY rather
>   than DEATH - see patch below to fix the doc).
> * @releaseDomain usually triggers twice for each domain: once when it
>   goes to SHUTDOWN and once when it is actually destroyed.  (This is
>   obviously necessary to implement the above.)

So it does.  I clearly had an accident with `git grep` when I came to the
opposite conclusion.  Apologies for the noise generated from this.

> * @releaseDomain does not have a specific domain which is the "subject
>   of @releaseDomain".  Arguably this is unhelpful, but it is not
>   libxl's fault.  It arises from the VIRQ generated by Xen.  Note that
>   xenstored needs to search its own list of active domains to see what
>   has happened; it generates the @releaseDomain event and throws away
>   the domid.

The semantics of @releaseDomain are quite mad, but this is how it has
always been.

The current semantics are a scalability limitation which someone in
XenServer will likely get around to fixing in due course (we support
1000 VMs per host).
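
To illustrate why this scales badly, here is a rough sketch of what any
watcher has to do when @releaseDomain fires without a domid.  This is not
xenstored or libxl code; the types, the state names and
on_release_domain() are all made up for illustration, and the "current"
array stands in for whatever per-domain queries a real watcher would make.

```c
#include <stddef.h>

/* Hypothetical model; none of these names exist in Xen or libxl. */
enum dom_state { DOM_RUNNING, DOM_SHUTDOWN, DOM_GONE };

struct tracked_dom {
    int domid;
    enum dom_state last_seen;   /* state recorded at the previous event */
};

/*
 * Called once per anonymous @releaseDomain firing.  Because the event
 * names no domain, every tracked domain must be compared against its
 * current state (current[i] models the result of querying doms[i]).
 * Writes up to max_changed affected domids into changed[] and returns
 * the total number of domains whose state differs.  Cost is O(n) per
 * event regardless of how many domains actually changed.
 */
static size_t on_release_domain(struct tracked_dom *doms,
                                const enum dom_state *current,
                                size_t n,
                                int *changed, size_t max_changed)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (current[i] != doms[i].last_seen) {
            doms[i].last_seen = current[i];
            if (found < max_changed)
                changed[found] = doms[i].domid;
            found++;
        }
    }
    return found;
}
```

With 1000 domains, every shutdown or destroy of any one of them costs each
watcher a pass over all 1000 entries.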

> * It is not possible to resume the domain in the source after it has
>   suspended.

This functionality exists and is already used in several circumstances,
both by libxl, and other toolstacks.

xl has an added split-brain problem here that plain daemonic toolstacks
don't have; specifically that there are two completely independent
processes playing with the domain state at the same time.

The daemonic xl needs to ignore DOMAIN_SHUTDOWN and tidy up only after
DOMAIN_DEATH.  Under these circumstances, a failed migrate which resumes
the domain won't result in qemu being cleaned up.
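
As a minimal sketch of that policy (not the actual xl code; the enum
values are hypothetical stand-ins named after the two libxl events, and
monitor_decide()/monitor_run() are invented for illustration):

```c
#include <stddef.h>

/* Hypothetical event kinds, named after the two libxl events above. */
enum dom_event { EV_DOMAIN_SHUTDOWN, EV_DOMAIN_DEATH };

enum monitor_action { KEEP_WATCHING, TIDY_UP_AND_EXIT };

/* Decide what the monitoring process does for a single event. */
static enum monitor_action monitor_decide(enum dom_event ev)
{
    switch (ev) {
    case EV_DOMAIN_SHUTDOWN:
        /* The domain may still be resumed (e.g. after a failed
         * migration), so nothing is torn down yet. */
        return KEEP_WATCHING;
    case EV_DOMAIN_DEATH:
        /* The domain is really gone; cleaning up is now safe. */
        return TIDY_UP_AND_EXIT;
    }
    return KEEP_WATCHING;
}

/*
 * Feed a sequence of events to the monitor.  Returns how many events
 * were consumed before it decided to exit, or n if it never did.
 */
static size_t monitor_run(const enum dom_event *evs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (monitor_decide(evs[i]) == TIDY_UP_AND_EXIT)
            return i + 1;
    }
    return n;
}
```

The sketch treats every shutdown alike, as the sentence above does; the
point is only that a suspend followed by a resume leaves the monitor
alive to see later events.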


