Xen project Mailing List

Notes from the design session on guest unaware migration:

Migration is often needed for e.g. host maintenance. When this is done, we (XenServer) see two classes of issues with guests:
- Guest kernel crashes (relatively rare). Often detectable by the toolstack and thus reported to the admin; distros generally take patches quickly.

- Guest userspace issues (more common).
Primarily seen around networking - e.g. iptables rules get cleaned up, and not re-injected. This can break e.g. Kubernetes networking.

Some other examples around clustered services (though not clear if this is the guest being aware of the migration or just a result of the downtime).
Generally impossible for the toolstack to detect, so the admin is normally unaware until users/monitoring complain.

It was also mentioned that NetBSD has issues with live migration around suspend of the network interface.

Possible solutions
1. Do the migration in a way that the guest is entirely unaware of it
Amazon produced a proposal for this non-cooperative migration: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob_plain;f=docs/designs/non-cooperative-migration.md;hb=HEAD
There is believed to be an older patch series on this.

Some notes from VM forking work that might be relevant:
Some state was not saved as part of regular VM save, so resuming the VM didn't work in some cases; this state will likely need to be saved if doing non-cooperative migration.
Dumping / restoring qemu state worked for Windows, but Linux needed a save, fork, restore sequence, so there appears to be some sort of dependency there.

There is an issue around domids - in the proposal these are randomised, but that still means certain destinations aren't possible (in Amazon's case they just find a compatible target, but this is not necessarily an option in server virt scenarios where the admin specifies where they want the VM migrated to).
The domid is a 15-bit integer, so if you have < 32k VMs you could allocate domids centrally across a pool of servers.
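
As a rough illustration of central allocation, here is a minimal Python sketch of a pool-wide allocator. The class and method names are hypothetical and not part of any existing toolstack. It assumes the toolstack can create a domain with a chosen domid, and that guest domids run from 1 up to (but not including) DOMID_FIRST_RESERVED (0x7FF0 in Xen's public headers).

```python
# Sketch only: a hypothetical pool-wide domid allocator, not an existing Xen/XAPI interface.

DOMID_FIRST_RESERVED = 0x7FF0  # from Xen's public headers; IDs at or above this are reserved


class PoolDomidAllocator:
    """Hands out domids that are unique across every host in a pool."""

    def __init__(self, first=1, limit=DOMID_FIRST_RESERVED):
        # domid 0 is each host's own dom0, so guest IDs start at 1.
        self._first = first
        self._limit = limit
        self._next = first
        self._in_use = {}  # domid -> VM UUID

    def allocate(self, vm_uuid):
        """Reserve a pool-unique domid for vm_uuid, or raise if none is free."""
        span = self._limit - self._first
        for offset in range(span):
            # Scan forward from the last allocation, wrapping around at the limit.
            candidate = self._first + (self._next - self._first + offset) % span
            if candidate not in self._in_use:
                self._in_use[candidate] = vm_uuid
                self._next = candidate + 1
                return candidate
        raise RuntimeError("no free domid left in the pool")

    def release(self, domid):
        """Return a domid to the pool once its VM is destroyed."""
        self._in_use.pop(domid, None)
```

Because no two VMs in the pool share a domid, a VM could keep its ID on whichever pool member it migrates to. The same guarantee does not extend across pools, which matches the "within a pool, but not cross-pool" limitation noted below.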

Could use non-cooperative migration where possible, but not expect it to work everywhere (e.g. within a pool, but not cross-pool in a XenServer example).

Alternative idea from Alejandro - could VMs be faked to always think they have a fixed domid (e.g. 1), then have dom0 know the actual one, with e.g. xenstore translating?
Suggestion to talk to Juergen, he may have thoughts on this.
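
To make the translation idea concrete, a minimal Python sketch of rewriting the /local/domain/<domid> prefix at the xenstore boundary follows. The function names and the choice of 1 as the guest-visible domid are assumptions for illustration only; nothing like this exists in xenstored today. It only handles the leading path prefix; domids embedded in backend paths or in node values (e.g. frontend-id / backend-id) would need separate handling, as would a real domain that happens to have domid 1.

```python
# Sketch only: hypothetical domid translation at the xenstore boundary.
import re

GUEST_VISIBLE_DOMID = 1  # fixed value every guest would be told (assumption)

_DOMAIN_PATH = re.compile(r"^/local/domain/(\d+)(/.*)?$")


def guest_to_real(path, real_domid):
    """Rewrite a path sent by the guest so it names the guest's real domid."""
    m = _DOMAIN_PATH.match(path)
    if m and int(m.group(1)) == GUEST_VISIBLE_DOMID:
        return f"/local/domain/{real_domid}{m.group(2) or ''}"
    return path  # not under the guest's own home directory: leave untouched


def real_to_guest(path, real_domid):
    """Rewrite a path in a reply or watch event back to the fixed domid."""
    m = _DOMAIN_PATH.match(path)
    if m and int(m.group(1)) == real_domid:
        return f"/local/domain/{GUEST_VISIBLE_DOMID}{m.group(2) or ''}"
    return path
```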

Could we use a UUID instead of domid in the protocols?
A UUID is a large string/value that would be in lots of xenstore messages; could that cause problems?
Does a VM need to know its domid (e.g. for giving to other guests to set up grants), or could it be hidden?
Is this too much of a hack?
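
For a feel of the size question, a small Python sketch comparing the length of one xenstore key written with a domid against the same key written with a UUID. The /local/domain/<uuid> layout is purely hypothetical; the 4096-byte figure is XENSTORE_PAYLOAD_MAX from Xen's xs_wire.h.

```python
# Sketch only: compares key lengths; "/local/domain/<uuid>" is a hypothetical layout.
import uuid

XENSTORE_PAYLOAD_MAX = 4096  # per-message payload limit in Xen's xs_wire.h


def path_lengths(domid, vm_uuid, suffix="/device/vif/0/state"):
    """Return (length with domid, length with UUID) for the same key."""
    by_domid = f"/local/domain/{domid}{suffix}"
    by_uuid = f"/local/domain/{vm_uuid}{suffix}"
    return len(by_domid), len(by_uuid)


short, long_ = path_lengths(5, uuid.UUID("00000000-0000-0000-0000-000000000001"))
print(f"domid path: {short} bytes, UUID path: {long_} bytes, "
      f"limit: {XENSTORE_PAYLOAD_MAX} bytes")
```

A single key stays well inside the payload limit either way. The open concern is more likely the aggregate cost across many messages and any code that assumes a short numeric ID.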

If the guest is unaware, we still need to make sure the gratuitous ARP gets sent after migration.
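
For reference, a minimal Python sketch that builds the frame something outside the guest (which component is left open here) would have to send on the guest's behalf. The function name is hypothetical. It builds the ARP-request form of a gratuitous ARP, with sender and target IP both set to the guest's address; actually transmitting it (e.g. via a raw socket on the bridge) is deliberately left out.

```python
# Sketch only: builds a gratuitous ARP frame; does not send it.
import socket
import struct


def gratuitous_arp(mac, ip):
    """Return the 42-byte Ethernet frame for a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    addr = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, guest MAC as source, EtherType ARP.
    eth = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)

    # ARP header: hardware type Ethernet (1), protocol IPv4 (0x0800),
    # address lengths 6 and 4, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr            # sender MAC and sender IP
    arp += b"\x00" * 6 + addr   # target MAC unset; target IP equals sender IP
    return eth + arp


frame = gratuitous_arp("00:16:3e:12:34:56", "192.0.2.10")
```

The MAC and IP above are placeholders. An external sender would need to know the guest's current IP address(es), which the toolstack does not necessarily have.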

There are other use cases for non-cooperative migration, which would require not having anything custom in the VM.

2. Can we modify netfront so we don't generate the events (link down / interface removed - not clear which?) across a migration, thus userspace isn't aware even if the kernel is?
Likely needs some code inspection to understand what's actually happening here and what improvements are possible.
