
Design session notes: guest-unaware migration



Notes from the design session on guest-unaware migration:

Migration is often needed, e.g. for host maintenance. When it happens, we (XenServer) see two classes of issues with guests:
  - Guest kernel crashes (relatively rare). These are often detectable by the toolstack and can therefore be reported to the admin; distros generally take patches quickly.

  - Guest userspace issues (more common).
    Primarily seen around networking - e.g. iptables rules get cleaned up but not re-injected, which can break e.g. Kubernetes networking.
    There are other examples around clustered services (though it is not clear whether these are caused by the guest being aware of the migration or are just a result of the downtime).
    These are generally impossible for the toolstack to detect, so the admin is normally unaware until users/monitoring complain.

It was also mentioned that NetBSD has issues with live migration around suspending the network interface.

Possible solutions
1. Do the migration in a way that the guest is entirely unaware of it
Amazon produced a proposal for this non-cooperative migration: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob_plain;f=docs/designs/non-cooperative-migration.md;hb=HEAD
There is believed to be an older patch series implementing this.
 
Some notes from VM forking work that might be relevant:
  - Some state was not saved as part of a regular VM save, so resuming the VM didn't work in some cases - this state will likely need to be saved when doing non-cooperative migration.
  - Dumping/restoring qemu state worked for Windows, but Linux needed a save, fork, restore sequence, so there appears to be some sort of dependency there.
 
There is an issue around domids. In the proposal these are randomised, but that still means certain destinations aren't possible, since the target host must not already be using the domid. (In Amazon's case they simply find a compatible target, but this is not necessarily an option in server-virt scenarios where the admin specifies where they want the VM migrated to.)
The domid is a 15-bit integer, so with fewer than ~32k VMs you could allocate domids centrally across a pool of servers.

Could use non-cooperative migration where possible, but not expect it to work everywhere (e.g. within a pool, but not cross-pool, in a XenServer example).

Alternative idea from Alejandro - could VMs be made to always think they have a fixed domid (e.g. 1), with dom0 knowing the actual one and e.g. xenstore translating?
  Suggestion to talk to Juergen, he may have thoughts on this.
 
Could we use a UUID instead of domid in the protocols?
  A large string/value that would appear in lots of xenstore messages - could that cause problems?
  Does a VM need to know its domid (e.g. for giving to other guests to set up grants), or could it be hidden?
  Is this too much of a hack?

If the guest is unaware, we still need to make sure the gratuitous ARP gets sent after migration.
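Since the guest won't announce its new location itself, the toolstack would have to emit the gratuitous ARP on the guest's behalf. A sketch of building such a frame (sender IP equal to target IP, broadcast destination); the function name is invented, and actually injecting it on the backend bridge would need e.g. a raw socket with CAP_NET_RAW:

```python
import struct

def gratuitous_arp(mac: bytes, ip: bytes) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for
    the guest's MAC/IP. Illustrative only - a real toolstack would
    inject this on the destination host's bridge after migration."""
    assert len(mac) == 6 and len(ip) == 4
    # Ethernet header: broadcast dest, guest's MAC as source, EtherType ARP.
    eth = b"\xff" * 6 + mac + b"\x08\x06"
    # ARP header: htype=Ethernet, ptype=IPv4, hlen=6, plen=4, oper=request.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac + ip            # sender hardware / protocol address
    arp += b"\x00" * 6 + ip    # target hw addr zeroed; target IP == sender IP
    return eth + arp
```

The sender-IP-equals-target-IP form is what makes the ARP "gratuitous": switches and neighbours update their tables for the guest's address without anyone having asked.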

There are other use cases for non-cooperative migration, which would require not having anything custom in the VM.


2. Can we modify netfront so that it doesn't generate the events (link down / interface removed - it's not clear which) across a migration, so that userspace isn't aware even if the kernel is?
  This likely needs some code inspection to understand what's actually happening before any potential improvements can be identified.

 

