Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list
On 18/11/14 05:33, Juergen Gross wrote:
> On 11/14/2014 05:08 PM, Andrew Cooper wrote:
>> On 14/11/14 15:32, Juergen Gross wrote:
>>> On 11/14/2014 03:59 PM, Andrew Cooper wrote:
>>>> On 14/11/14 14:14, Jürgen Groß wrote:
>>>>> On 11/14/2014 02:56 PM, Andrew Cooper wrote:
>>>>>> On 14/11/14 12:53, Juergen Gross wrote:
>>>>>>> On 11/14/2014 12:41 PM, Andrew Cooper wrote:
>>>>>>>> On 14/11/14 09:37, Juergen Gross wrote:
>>>>>>>>> The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list currently contains the mfn of the top-level page frame of the 3-level p2m tree, which is used by the Xen tools during saving and restoring (and live migration) of pv domains and for crash dump analysis. With three levels of the p2m tree it is possible to support up to 512 GB of RAM for a 64-bit pv domain.
>>>>>>>>>
>>>>>>>>> A 32-bit pv domain can support more, as each memory page can hold 1024 instead of 512 entries, leading to a limit of 4 TB.
>>>>>>>>>
>>>>>>>>> To be able to support more RAM on x86-64, switch to a virtually mapped p2m list.
>>>>>>>>>
>>>>>>>>> This patch expands struct arch_shared_info with a new p2m list virtual address and the mfn of the page table root. The domain indicates that the new information is valid by storing ~0UL into pfn_to_mfn_frame_list_list. The hypervisor indicates usability of this feature by a new flag, XENFEAT_virtual_p2m.
>>>>>>>>
>>>>>>>> How do you envisage this being used? Are you expecting the tools to do manual pagetable walks using xc_map_foreign_xxx()?
>>>>>>>
>>>>>>> Yes. Not very different compared to today's mapping via the 3-level p2m tree. Just another entry format, 4 instead of 3 levels, and starting at an offset.
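[The size limits quoted in the patch description follow directly from the tree geometry. A hedged sketch of the arithmetic and the proposed structure extension; the field names `p2m_vaddr` and `p2m_root_mfn` are illustrative guesses, not the patch's actual identifiers:]

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_FRAME_LIST (~0UL)   /* sentinel: "use the virtual p2m list" */

/* Illustrative sketch of the discussed arch_shared_info extension; the
 * authoritative layout is whatever the patch adds to
 * xen/include/public/arch-x86/xen.h. */
struct arch_shared_info_sketch {
    unsigned long max_pfn;
    /* mfn of the top page of the classic 3-level p2m tree, or ~0UL
     * when the virtually mapped list described below is valid instead. */
    unsigned long pfn_to_mfn_frame_list_list;
    unsigned long nmi_reason;
    /* New fields proposed by the patch (names are hypothetical): */
    unsigned long p2m_vaddr;        /* guest virtual address of the p2m list */
    unsigned long p2m_root_mfn;     /* mfn of the page table root to walk */
};

/* Why the 3-level tree tops out where it does:
 * 64-bit PV: 512 8-byte entries per 4 KiB page, 3 levels
 *            -> 512^3 pfns * 4 KiB/page = 512 GiB of guest RAM.
 * 32-bit PV: 1024 4-byte entries per page -> 1024^3 * 4 KiB = 4 TiB. */
static uint64_t p2m_tree_limit(unsigned entries_per_page)
{
    uint64_t pfns = (uint64_t)entries_per_page * entries_per_page
                    * entries_per_page;
    return pfns * 4096;             /* bytes of guest RAM covered */
}
```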
>>>>>>
>>>>>> Yes - David and I were discussing this over lunch, and it is not actually very different.
>>>>>>
>>>>>> In reality, how likely is it that the pages backing this virtual linear array change?
>>>>>
>>>>> Very unlikely, I think. But not impossible.
>>>>>
>>>>>> One issue currently is that, during the live part of migration, the toolstack has no way of working out whether the structure of the p2m has changed (intermediate leaves rearranged, or the length increasing).
>>>>>>
>>>>>> In the case that the VM does change the structure of the p2m under the feet of the toolstack, migration will either blow up in a non-subtle way with a p2m/m2p mismatch, or in a subtle way with the receiving side copying the new p2m over the wrong part of the new domain.
>>>>>>
>>>>>> I am wondering whether, with this new p2m method, we can take sufficient steps to be able to guarantee mishaps like this can't occur.
>>>>>
>>>>> This should be easy: I could add a counter in arch_shared_info which is incremented whenever a p2m mapping is being changed. The toolstack could compare the counter values before start and at end of migration and redo the migration (or fail) if they are different. In order to avoid races I would have to increment the counter before and after changing the mapping.
>>>>
>>>> That is insufficient, I believe.
>>>>
>>>> Consider:
>>>>
>>>> * Toolstack walks pagetables and maps the frames containing the linear p2m
>>>> * Live migration starts
>>>> * VM remaps a frame in the middle of the linear p2m
>>>> * Live migration continues, but the toolstack has a stale frame in the middle of its view of the p2m.
>>>
>>> This would be covered by my suggestion.
>>> At the end of the memory transfer (with some bogus contents) the toolstack would discover the change of the p2m structure and either fail the migration or start it from the beginning, thus overwriting the bogus frames.
>>
>> Checking after pause is too late. The content of the p2m is used to verify each frame being sent on the wire, so it is in active use for the entire duration of live migration.
>>
>> If the toolstack starts verifying frames being sent using information from a stale p2m, the best that can be hoped for is that the toolstack declares that the p2m and m2p are inconsistent and aborts the migration.
>>
>>>> As the p2m is almost never expected to change, I think it might be better to have a flag the toolstack can set to say "The toolstack is peeking at your p2m behind your back - you must not change its structure."
>>>
>>> Be careful here: changes of the structure can be due to two scenarios:
>>> - ballooning (invalid entries being populated): this is no problem, as we can stop the ballooning during live migration.
>>> - mapping of grant pages, e.g. in a stub domain (first map in an area formerly marked as invalid): you can't stop this, as the stub domain has to do some work. Here a restart of the migration should work, as the p2m structure change can only happen once for each affected p2m page.
>>
>> Migration is not at all possible with a domain referencing foreign frames.
>>
>> The live part can cope with foreign frames referenced in the ptes. As part of the pause handling in the VM, the frontends must unmap any grants they have. After pause, any remaining foreign frames cause a migration failure.
>>
>>>> Having just thought this through, I think there is also a race condition between a VM changing an entry in the p2m, and the toolstack doing verification of frames being sent.
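[The counter Juergen proposes - incremented before *and* after each mapping change - is essentially the read side of a seqlock: an odd value means a change is in flight, and a changed value means the walk raced with an update. A minimal sketch of the toolstack-side retry loop under those assumptions; `p2m_generation` and `walk_p2m_once()` are hypothetical stand-ins, not real Xen or libxc interfaces:]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical generation counter as it might appear in arch_shared_info.
 * The guest increments it before and after each p2m mapping change, so an
 * odd value means "change in progress". */
static volatile unsigned long p2m_generation;

/* Stand-in for the real work: mapping and walking the 4-level p2m list
 * via xc_map_foreign_xxx(). */
static bool walk_p2m_once(void)
{
    return true;
}

/* Seqlock-style read side: the walk only counts if we observe the same
 * even generation value on both sides of it. */
static bool walk_p2m_consistent(int max_retries)
{
    while (max_retries-- > 0) {
        unsigned long start = p2m_generation;
        if (start & 1)
            continue;             /* a change is in flight; try again */
        if (!walk_p2m_once())
            return false;
        if (p2m_generation == start)
            return true;          /* nothing changed under our feet */
    }
    return false;                 /* structure kept changing: fail or restart */
}
```

This only detects a racing change; as Andrew notes, detection after the fact is not enough during live migration, where the p2m is consulted for every frame sent, so the check would have to wrap each batch of verifications rather than the whole migration.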
>>>
>>> Okay, so the flag you mentioned should just prohibit changes in the p2m list related to memory frames of the affected domain: ballooning up or down, or rearranging the memory layout (does this happen today?). Mapping and unmapping of grant pages should still be allowed.
>>
>> HVM guests don't have any of their p2m updates represented in the logdirty bitmap, so ballooning an HVM guest during migrate leads to unexpected holes or lack of holes on the resuming side, leading to a very confused balloon driver.
>>
>> At the time I had not found a problem with PV guests, but it is now clear that there is a period of time when a guest is altering its p2m where the p2m and m2p are out of sync, which will cause a migration failure if the toolstack observes this artefact.
>
> So ballooning should be disabled during migration. I think this should be handled via callbacks triggered by xenstore: one at start of migration to stop ballooning and one at end to restart it. I wouldn't want to tie this functionality to the p2m list structure, as it is not related to it.

It is not just ballooning. It is any change to the p2m whatsoever. This includes mapping/unmapping grants, XENMEM_exchange, and the guest simply changing the p2m layout.

I suspect that the only reason this has not been encountered in practice is that no one has attempted migrating a domain which makes use of foreign mappings. It is typically only the backend drivers which map frontend memory, and dom0 doesn't migrate.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel