
Re: [Xen-devel] [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]

Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]"):
> On Mon, Nov 4, 2013 at 10:06 AM, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx> 
> wrote:
>     Which of the xc_domain_save (and _restore) callbacks are called each
>     Remus iteration?
> Almost all of them on the xc_domain_save side (suspend, resume,
> save QEMU state, checkpoint).

>  xc_domain_restore doesn't have any
> callbacks AFAIK, and Remus as of now does not have a component on
> the restore side; it piggybacks on live migration's restore
> framework.

But the libxl memory management in the restore code is currently
written to assume a finite lifetime for the ao.  So I think this needs
to be improved.

Perhaps all the suspend/restore callbacks should each get one of the
nested ao type things that Roger needs for his driver domain daemon.

> FWIW, the remus related code that executes per iteration does not
> allocate anything.  All allocations happen only during setup and I
> was under the impression that no other allocations are taking place
> every time xc_domain_save calls back into libxl.

If this is true, then good, because we don't need to do anything, but
there is a lot of code there and I would want to check.

> However, it may be possible that other parts of the AO machinery
> (and there are a lot of them) are allocating stuff per
> iteration. And if that is the case, it could easily lead to OOMs
> since Remus technically runs as long as the domain lives.

The ao and event machinery doesn't do much allocation itself.

>     Having said that, libxl is not performance-optimised.  Indeed the
>     callback mechanism involves context switching, and IPC, between the
>     save/restore helper and libxl proper.  Probably not too much to be
>     doing every 20ms for a single domain, but if you have a lot of these
>     it's going to end up taking a lot of dom0 cpu etc.
> Yes and that is a problem. Xend+Remus avoided this by linking
> the libcheckpoint library, which interfaced with both the Python and libxc code.

Have you observed whether the performance is acceptable with your V3
patches?

>     I assume you're not doing this for HVM domains, which involve saving
>     the qemu state each time too.
> It includes HVM domains too, although in that case xenstore-based suspend
> takes about 5ms, so the checkpoint interval is typically 50ms or so.

> If there is a latency-sensitive task running inside
> the VM, a lower checkpoint interval leads to better performance.





