Re: [Xen-devel] [libvirt] limit downtime during live migration from xl/virsh

On Mon, Mar 10, 2014 at 15:36:06 +0100, Olaf Hering wrote:
> During live migration of VMs from one host to another the VM is
> suspended for an unpredictable amount of time. The actual downtime
> depends on how many pages are newly dirtied and on the bandwidth to the
> destination host. Since VM memory size grows faster than transfer rates,
> the currently available tunables will cause trouble for workloads
> within the VM which cannot handle large time jumps.
> I have already written code to tweak the inner loop doing the actual
> migration work in libxc. But the patchset exposes the details of the
> loop to the command line, so it is neither portable nor a friendly UI
> for the host admin.
> Here is my proposal for a new option for virsh and 2 new options for xl:
> [xl | virsh --live] --max-suspend-time N --timeout N VM host
> --max-suspend-time N: as the name suggests, the VM downtime must not be
> longer than specified. The code doing the migration has to estimate the
> transfer speed. If the VM is about to be suspended, it has to check if
> the remaining dirty pages can be transferred within the required
> timeframe. If not, the migration is aborted, the VM continues to run on
> the src host, the new VM on the dst host is destroyed and an error is
> returned.

Libvirt already has the virDomainMigrateSetMaxDowntime API with these
semantics. However, with virsh one can only set it using the virsh
migrate-setmaxdowntime command while the migration is in progress. I'm
not sure that exposing it as yet another parameter of the already quite
complicated migrate command would buy us much.
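
For illustration only, the check behind such a downtime limit could be
sketched roughly as below. This is not libxc or libvirt code: the
function and parameter names are made up, and it assumes the migration
code already has a measured transfer rate in bytes per second and a
count of the pages still dirty.

```python
def estimated_downtime_ms(dirty_pages, page_size, bytes_per_sec):
    """Estimate how long sending the remaining dirty pages would take."""
    remaining_bytes = dirty_pages * page_size
    return remaining_bytes * 1000 / bytes_per_sec


def fits_in_downtime(dirty_pages, page_size, bytes_per_sec, max_downtime_ms):
    """Return True if the final copy is expected to fit in the limit."""
    if bytes_per_sec <= 0:
        # No usable rate estimate (assumption: treat this as "does not fit").
        return False
    needed_ms = estimated_downtime_ms(dirty_pages, page_size, bytes_per_sec)
    return needed_ms <= max_downtime_ms
```

With 25000 dirty 4 KiB pages and roughly 125000000 bytes/s, the estimate
is about 819 ms, so a 500 ms limit would not be met while a 1000 ms limit
would.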

> --timeout N: if a VM is busy and its workload causes many new dirty
> pages, the migrate command would take forever. This option is supposed to
> stop the migration attempt if the number of new dirty pages is too high.
> It would change the semantics of "virsh migrate --timeout n", which
> currently forces a suspend (according to the help text).

This is not acceptable. If you want an option to automatically cancel
migration after a given timeout, you would need to introduce a new
option instead of changing the semantics of an existing option.
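
To make the distinction concrete, here is a rough sketch of keeping the
two behaviours as separate settings. Nothing here is existing virsh or
libvirt code; cancel_timeout_s stands in for whatever the new option
would be called, and checking the cancel timeout first when both have
expired is just an assumption of this sketch.

```python
from enum import Enum


class TimeoutAction(Enum):
    SUSPEND = "suspend"  # what --timeout does today, per the help text
    CANCEL = "cancel"    # hypothetical new behaviour behind a new option


def action_at(elapsed_s, suspend_timeout_s=None, cancel_timeout_s=None):
    """Return the action due after elapsed_s seconds, or None if none is."""
    # Assumption: if both timeouts have expired, cancelling takes precedence.
    if cancel_timeout_s is not None and elapsed_s >= cancel_timeout_s:
        return TimeoutAction.CANCEL
    if suspend_timeout_s is not None and elapsed_s >= suspend_timeout_s:
        return TimeoutAction.SUSPEND
    return None
```

Callers who only pass suspend_timeout_s keep exactly the behaviour they
have now, which is the point of adding a new option.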


