Xen project Mailing List

[Xen-devel] Buggy interaction of live migration and p2m updates

To: Xen-devel List <xen-devel@xxxxxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 20 Nov 2014 18:28:11 +0000

Cc: Juergen Gross <JGross@xxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, David Vrabel <david.vrabel@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Shriram Rajagopalan <rshriram@xxxxxxxxx>, Hongyang Yang <yanghy@xxxxxxxxxxxxxx>

Delivery-date: Thu, 20 Nov 2014 18:28:37 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hello,

Tim, David and I were discussing this over lunch. This email is a (hopefully accurate) account of our findings, and potential solutions. (If I have messed up, please shout.)

Currently, correct live migration of PV domains relies on the toolstack (which has a live mapping of the guest's p2m) not observing stale values when the guest updates its p2m, and not hitting the race condition between a p2m update and an m2p update. Realistically, this means no updates to the p2m at all, due to several potential race conditions. Should any race conditions happen (e.g. ballooning while live migrating), the effects could be anything from an aborted migration to VM memory corruption.

It should be noted that migrationv2 does not fix any of this. It alters the way in which some race conditions might be observed. During development of migrationv2, there was an explicit non-requirement of fixing the existing Ballooning+LiveMigration issues we were aware of, although at the time, we were not aware of this specific set of issues. Our goal was simply to make migrationv2 work in the same circumstances as previously, but with a bitness-agnostic wire format and forward-extensible protocol.

As far as these issues are concerned, there are two distinct p2m modifications which we care about:

1) p2m structure changes (rearranging the layout of the p2m)
2) p2m content changes (altering entries in the p2m)

There is no possible way for the toolstack to prevent a domain from altering its p2m. At the moment, ballooning typically only occurs when requested by the toolstack, but the underlying operations (increase/decrease_reservation, mem_exchange, etc) can be used by the guest at any point. This includes Wei's guest memory fragmentation changes. Changes to the content of the p2m also occur for grant map and unmap operations.

Currently in PV guests, the p2m is implemented using a 3-level tree, with its root in the guest's shared_info page.
It provides a hard VM memory limit of 4TB for 32bit PV guests (which is far higher than the 128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.

Juergen has a proposed new p2m interface using a virtual linear mapping. This is conceptually similar to the previous implementation (which is fine from the toolstack's point of view), but far less complicated from the guest's point of view, and removes the memory limits imposed by the p2m structure.

The new virtual linear mapping suffers from the same interaction issues as the old 3-level tree did, but the introduction of the new interface affords us an opportunity to make all API modifications at once to reduce churn.

During live migration, the toolstack maps the guest's p2m into a linear mapping in the toolstack's virtual address space. This is done once at the start of migration, and never subsequently altered. During live migration, the p2m is cross-verified with the m2p, and frames are sent using pfns as a reference, as they will be located in different frames on the receiving side.

Should the guest change the p2m structure during live migration, the toolstack ends up with a stale p2m with a non-p2m frame in the middle, resulting in bogus cross-referencing. Should the guest change an entry in the p2m, the p2m frame itself will be resent as it would be marked as dirty in the logdirty bitmap, but the target pfn will remain unsent and probably stale on the receiving side.

Another factor which needs to be taken into account is Remus/COLO, which run the domains under live migration conditions for the duration of their lifetime.

During the live part of migration, the toolstack already has to be able to tolerate failures to normalise the pagetables, which result as a consequence of the pagetables being in active use. These failures are fatal on the final iteration after the guest has been paused, but the same logic could be extended to p2m/m2p issues, if needed.
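
As a sanity check on the 4TB and 512GB limits quoted above for the 3-level tree, the arithmetic can be sketched as follows. This assumes 4KiB pages, that every level of the tree is a single page of entries, and that each entry is one guest word (4 bytes for 32bit, 8 bytes for 64bit). The function name `p2m_tree_limit` is made up for illustration and is not a Xen or libxc interface.

```python
PAGE_SIZE = 4096  # assumption: 4KiB x86 pages
GIB = 1 << 30
TIB = 1 << 40


def p2m_tree_limit(entry_bytes, levels=3):
    """Bytes of guest memory a p2m tree of the given depth can describe.

    Assumes each level is one page full of entries, and each leaf entry
    maps exactly one page of guest memory.
    """
    entries_per_page = PAGE_SIZE // entry_bytes
    return entries_per_page ** levels * PAGE_SIZE


limit_32 = p2m_tree_limit(4)  # 1024 entries per page on 32bit
limit_64 = p2m_tree_limit(8)  # 512 entries per page on 64bit

print(limit_32 // TIB, "TiB for 32bit guests")
print(limit_64 // GIB, "GiB for 64bit guests")
```

The 64bit limit is the lower of the two because the wider entries halve the fan-out at each of the three levels.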
There are several potential solutions to these problems.

1) Freeze the guest's p2m during live migrate

This is the simplest sounding option, but is quite problematic from the point of view of the guest. It is essentially a shared spinlock between the toolstack and the guest kernel. It would prevent any grant map/unmap operations from occurring, and might interact badly with certain p2m updates in the guest which would previously be expected to unconditionally succeed.

Pros) (Can't think of any)
Cons) Not easy to implement (even conceptually), requires invasive guest changes, will cripple Remus/COLO

2) Deep p2m dirty tracking

In the case that a p2m frame is discovered dirty in the logdirty bitmap, we can be certain that a write has occurred to it, and in the common case, this means that a mapping has changed. The toolstack could maintain a non-live copy of the p2m which is updated as new frames are sent. When a dirty p2m frame is found, the live and non-live copies can be consulted to find which pfn mappings have changed, and locally mark all the altered pfns for retransmit.

Pros) No guest changes required
Cons) Toolstack needs to keep an additional copy of the guest's p2m on the sending side

3) Eagerly check for p2m structure changes

p2m structure changes are rare after boot, but not impossible. Each iteration of live migration, the toolstack can check for dirty higher-level p2m frames in the dirty bitmap. In the case that a structure update occurs, the toolstack can use information it already has to calculate a subset of pfns affected by the update, and mark them for resending. (This can currently be done to the frame granularity given the p2m frame list, but in combination with 2), could result in fewer pfns needing resending.)

Pros) No guest changes required
Cons) Moderately high toolstack overhead, possibility of resending far more pfns than strictly required
4) Request p2m structure change updates from the guest

The guest could provide a "p2m generation count" to allow the toolstack to evaluate whether the structure had changed. This would allow the live part of migration to periodically re-evaluate whether it should remap the p2m to avoid stale mappings.

Pros) Easy to implement alongside the virtual linear mapping support, easy for toolstack and guest
Cons) Only works with new virtual linear guests

Proposed solution: A combination of 2, 3 and 4.

For legacy 3-level p2m guests, the toolstack can detect p2m structure updates by tracking the p2m top and mid levels in the logdirty bitmap, and invalidating the modified subset of pfns. It has to eagerly check the p2m frame list list mfn entry in the shared info to see whether the guest has swapped onto a completely new p2m.

For a virtual linear map, the intermediate levels are not available to track, but we can require that the guest increment a p2m generation count in the shared info.

When the structure changes, the toolstack can remap the p2m and calculate the altered subset of pfns, and mark them for resend.

The toolstack must also track changes in the p2m itself, and compare to a local copy showing the mapping at the time at which the pfn was last sent. This can be used to work out which p2m mappings have changed, and also be used to confirm whether the pfns on the receiving side are stale or not.

I believe this covers all cases and race conditions. In the case that the p2m is updated before the m2p, the p2m frame will be marked dirty in the bitmap, and discoverable on the next iteration. At that point, if the p2m and m2p are inconsistent, the pfn will be deferred until the final iteration. If not, the frame is sent and everything is ok. In the case that the p2m is updated after the m2p, the p2m/m2p will be consistent when the dirty bitmap is acted on.

Thoughts? (for anyone who has made it this far :) I think I have covered everything.)
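
To make the content-tracking half of the proposal more concrete, here is a minimal sketch of the comparison step, under stated assumptions. The p2m is modelled as a flat list indexed by pfn, the m2p as a dict from mfn to pfn, and the dirty p2m frames as a set of frame indices. All names (`changed_pfns`, `classify`, `sent_p2m`) are hypothetical and do not correspond to existing libxc code.

```python
def changed_pfns(live_p2m, sent_p2m, dirty_p2m_frames, entries_per_frame):
    """Find pfns whose mapping differs from the copy taken at last send.

    Only entries that live in p2m frames flagged dirty are compared,
    so the cost scales with the number of dirty p2m frames.
    """
    changed = []
    for frame in sorted(dirty_p2m_frames):
        start = frame * entries_per_frame
        end = min(start + entries_per_frame, len(live_p2m))
        for pfn in range(start, end):
            if live_p2m[pfn] != sent_p2m[pfn]:
                changed.append(pfn)
    return changed


def classify(pfns, live_p2m, m2p):
    """Split changed pfns into those safe to resend now and those to defer.

    A pfn is resent only if the m2p agrees with the p2m. Otherwise the
    guest may be midway through an update, so the pfn is left for a
    later (ultimately the final, paused) iteration.
    """
    resend, deferred = [], []
    for pfn in pfns:
        mfn = live_p2m[pfn]
        if m2p.get(mfn) == pfn:
            resend.append(pfn)
        else:
            deferred.append(pfn)
    return resend, deferred


# Toy example: 8 pfns, 4 p2m entries per p2m frame (illustrative sizes only).
sent_p2m = [100, 101, 102, 103, 104, 105, 106, 107]
live_p2m = [100, 101, 205, 103, 104, 105, 106, 300]
m2p = {100: 0, 101: 1, 205: 2, 103: 3, 104: 4, 105: 5, 106: 6, 300: 9}

changed = changed_pfns(live_p2m, sent_p2m, {0, 1}, entries_per_frame=4)
resend, deferred = classify(changed, live_p2m, m2p)
print("changed:", changed, "resend:", resend, "deferred:", deferred)
```

In the toy data, pfn 2 has a new mfn which the m2p already agrees with, so it can be resent, whereas pfn 7 points at an mfn the m2p still attributes to a different pfn, so it is deferred. After a pfn is actually sent, the sender would copy `live_p2m[pfn]` into `sent_p2m[pfn]` so that the next comparison is against what the receiver holds.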
~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
