
Re: [Xen-devel] Analysis of using balloon page compaction in Xen balloon driver



On 16/10/2014 18:12, Wei Liu wrote:
> This document analyses the impact of using balloon compaction
> infrastructure in Xen balloon driver.

This is a fantastic start (and I actively recommend similar documents
from others for future work).

> ## Motives
>
> 1. Balloon pages fragment guest physical address space.
> 2. Balloon compaction infrastructure can migrate ballooned pages from
>    start of zone to end of zone, hence creating contiguous guest physical
>    address space.
> 3. Having contiguous guest physical address space enables some options
>    to improve performance.
>
> ## Benefit for auto-translated guest
>
> HVM/PVH/ARM guests can have contiguous guest physical address space
> after balloon pages are compacted, which potentially improves memory
> performance, provided the guest makes use of huge pages, either via
> hugetlbfs or Transparent Huge Pages (THP).
>
> Consider the memory access pattern of these guests: one access to a
> guest physical address involves several accesses to machine memory.
> The total number of memory accesses can be represented as:
>
>> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> Hx denotes second stage page table walk levels and Gx denotes guest
> page table walk levels.

I don't think this expresses what you intend to express (or I don't
understand how you are trying to convey it).

Consider a single memory access in the guest, with no pagefaults, and
ignoring for now any TLB effects.  This description is based on my
knowledge of x86, but I assume ARM functions in a similar way.

Both the guest, G, and host, H, are 64-bit, so each uses 4-level
translations.  Consider first the worst case, where all mappings are
4K pages.

Following gcr3 to find gl4 involves a complete host pagetable walk
(4 translations).
Following gl4 to find gl3 involves a complete host pagetable walk
(4 translations).
gl3 to gl2: another 4 translations.
gl2 to gl1: another 4 translations.
gl1 to the final gpa: another 4 translations.

In the worst case, it takes 20 translations for a single guest memory
access.

Altering the guest to use a 2MB superpage would save 4 translations.

Altering the host to use 2MB superpages would save 5 translations.
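
To make the arithmetic concrete, here is a throwaway sketch (a toy
model, not code from any tree) that reproduces these counts, assuming
each guest-physical access costs one full host walk:

    /* Toy model of one guest memory access under nested paging: each
     * guest pagetable level, plus the final data access, is a guest-
     * physical access requiring one full host pagetable walk. */
    #include <stdio.h>

    static unsigned int host_translations(unsigned int guest_levels,
                                          unsigned int host_levels)
    {
        return (guest_levels + 1) * host_levels;
    }

    int main(void)
    {
        printf("4K guest,  4K host:  %u\n", host_translations(4, 4)); /* 20 */
        printf("2MB guest, 4K host:  %u\n", host_translations(3, 4)); /* 16 */
        printf("4K guest,  2MB host: %u\n", host_translations(4, 3)); /* 15 */
        return 0;
    }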

It should be noted that the host pagetable walks are distinct, so a
single 2MB superpage could turn only a single host walk from 4
translations to 3 translations.  A guest can help itself substantially
by allocating its pagetables contiguously, which would cause multiple
guest translations to be contained within the same host translation,
getting much better temporal locality of reference from the TLB.

It should also be noted that Xen is in a far better position to make
easy use of 2MB and 1GB superpages, than a guest running normal
workloads is.  However, the ideal case is to make use of both host and
guest superpages.

> By having a contiguous guest physical address space, the guest can
> make use of huge pages.  This can reduce the number of G's in the
> formula.
>
> Reducing the number of H's is a separate project for hypervisor-side
> improvement and should be decoupled from the Linux-side changes.
>
> ## Design and implementation
>
> The use of balloon compaction doesn't require introducing new
> interfaces between the Xen balloon driver and the rest of the system.
> Most changes are internal to the Xen balloon driver.
>
> Currently, the Xen balloon driver gets its pages directly from the
> page allocator.  To enable balloon page migration, those pages now
> need to be allocated from the core balloon driver.  Pages allocated
> from the core balloon driver are subject to balloon page compaction.
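
A minimal sketch of what that allocation path could look like, using
the generic API from <linux/balloon_compaction.h> (exact signatures
vary between kernel versions; xen_balloon_info and the function name
here are illustrative, not the proposed patch):

    #include <linux/balloon_compaction.h>
    #include <linux/errno.h>

    /* Illustrative device-info structure the driver would own. */
    static struct balloon_dev_info xen_balloon_info;

    static int xen_balloon_get_page(void)
    {
        /* Allocate via the core balloon driver rather than the raw
         * page allocator, so the page is tracked as movable... */
        struct page *page = balloon_page_alloc();

        if (!page)
            return -ENOMEM;

        /* ...and enqueue it, making it visible to balloon compaction. */
        balloon_page_enqueue(&xen_balloon_info, page);

        /* The existing decrease_reservation machinery would follow. */
        return 0;
    }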
>
> The Xen balloon driver will also need to provide a callback to
> migrate a balloon page.  In essence, the callback function receives
> an "old page", which is an already ballooned-out page, and a "new
> page", which is a page to be ballooned out; it then inflates the
> "old page" and deflates the "new page".
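
The shape such a callback could take, modelled on the migratepage hook
in struct balloon_dev_info; xen_exchange_pages() below is a
hypothetical wrapper for the hypercall discussed next, not an existing
function:

    #include <linux/balloon_compaction.h>

    static int xen_balloon_migratepage(struct balloon_dev_info *info,
                                       struct page *newpage,
                                       struct page *page,
                                       enum migrate_mode mode)
    {
        /* "page" is already ballooned out; "newpage" is about to be.
         * xen_exchange_pages() is a placeholder for XENMEM_exchange. */
        if (xen_exchange_pages(page_to_pfn(page), page_to_pfn(newpage)))
            return -EAGAIN;

        /* Account "newpage" as ballooned; "page" goes back to the guest. */
        balloon_page_insert(info, newpage);
        balloon_page_delete(page);

        return MIGRATEPAGE_SUCCESS;
    }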
>
> The core of the migration callback is the XENMEM\_exchange hypercall.
> This makes sure that inflation of the old page and deflation of the
> new page are done atomically, so even if a domain is beyond its
> memory target and the target is being enforced, it can still compact
> memory.
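
For a single 4K page, the hypercall invocation could look roughly like
this (a sketch; the real driver would batch requests and handle errors,
and exchange_single_page() is an invented name):

    #include <xen/interface/memory.h>
    #include <asm/xen/hypercall.h>

    static int exchange_single_page(xen_pfn_t in_frame, xen_pfn_t out_frame)
    {
        /* "in" is the frame handed back to Xen; "out" is the frame to
         * be populated with fresh memory in exchange. */
        struct xen_memory_exchange exchange = {
            .in = {
                .nr_extents   = 1,
                .extent_order = 0,
                .domid        = DOMID_SELF,
            },
            .out = {
                .nr_extents   = 1,
                .extent_order = 0,
                .domid        = DOMID_SELF,
            },
        };

        set_xen_guest_handle(exchange.in.extent_start, &in_frame);
        set_xen_guest_handle(exchange.out.extent_start, &out_frame);

        /* A single hypercall: the two sides succeed or fail together. */
        return HYPERVISOR_memory_op(XENMEM_exchange, &exchange);
    }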
>
> ## HAP table fragmentation is not made worse
>
> *Assumption*: guest physical address space is already heavily
> fragmented by balloon pages when balloon page compaction is required.
>
> For a typical test case like ballooning up and down while doing a
> kernel compilation, there are usually only a handful of huge pages
> left in the end, so the observation matches the assumption.  On the
> other hand, if the guest physical address space is not heavily
> fragmented, balloon page compaction is not likely to be triggered
> automatically.
>
> In practice, balloon page compaction is not likely to make things
> worse. Here is the analysis based on the above assumption.
>
> Note that the HAP table is already shattered by balloon pages.  When
> a guest page is ballooned out, the underlying HAP entry needs to be
> split if that entry pointed to a huge page.
>
> XENMEM\_exchange works as follows, where "old page" is the guest page
> about to get inflated and "new page" is the guest page about to get
> deflated:
>
> 1. Steal the old page from the domain.
> 2. Allocate a heap page from the domheap.
> 3. Release the new page back to Xen.
> 4. Update the guest physmap: the old page now points to the heap
>    page, and the new page points to INVALID\_MFN.
>
> The end result is that the HAP entry for "old page" now points to a
> valid MFN instead of INVALID\_MFN, and the HAP entry for "new page"
> now points to INVALID\_MFN.
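
Illustrative pseudocode for those four steps (loosely following what
Xen's memory_exchange() in common/memory.c does; locking, refcounting
and error paths omitted, and exchange_sketch() is not a real function):

    #include <xen/errno.h>
    #include <xen/mm.h>
    #include <xen/sched.h>
    #include <asm/p2m.h>

    static int exchange_sketch(struct domain *d,
                               unsigned long old_gfn, struct page_info *old_pg,
                               unsigned long new_gfn, struct page_info *new_pg)
    {
        struct page_info *heap_pg;

        /* 1. Steal the old page from the domain. */
        if ( steal_page(d, old_pg, 0) )
            return -EINVAL;

        /* 2. Allocate a heap page from the domheap. */
        heap_pg = alloc_domheap_pages(d, 0, 0);
        if ( heap_pg == NULL )
            return -ENOMEM;

        /* 3. Release the new page back to Xen. */
        guest_physmap_remove_page(d, new_gfn, page_to_mfn(new_pg), 0);
        free_domheap_pages(new_pg, 0);

        /* 4. old_gfn now maps the heap page; new_gfn maps INVALID_MFN. */
        guest_physmap_add_page(d, old_gfn, page_to_mfn(heap_pg), 0);

        return 0;
    }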
>
> So for the old page we're in the same position as before: the HAP
> table is fragmented, but no more fragmented than before.
>
> For the new page, the risk is that if the targeted new page is part
> of a huge page, we need to split a HAP entry, hence fragmenting the
> HAP table.  This is a valid concern.  However, in practice the guest
> address space is already fragmented by ballooning, so it's unlikely
> we need to break up many more huge pages, because there aren't that
> many left.  So we're in a position no worse than before.
>
> Another downside is that when Xen is exchanging a page, it may need
> to break up a huge page to get a 4K page, because the Xen domheap is
> fragmented.  However, we're no worse off than before, as ballooning
> already fragments the domheap.
>
> ## Beyond Linux balloon compaction infrastructure
>
> Currently there's no mechanism in Xen to coalesce HAP table entries.
> To coalesce HAP entries we would need to make sure that all the
> discrete entries belong to one huge page, and are in the correct
> order and the correct state.
>
> By introducing the necessary infrastructure inside the hypervisor
> (page migration etc.), we might eventually be able to coalesce HAP
> entries, hence reducing the number of H's in the aforementioned
> formula.  This, combined with the work on the guest side, can help
> the guest achieve the best possible performance.

If I understand your proposal correctly, a VM lifetime would look like this:

1 Xen allocates pages (hopefully 2MB where possible)
2 Guest starts up, and shatters both guest and host superpages by
blindly ballooning random gfns
3a During runtime, guest spends time copying pages around in an attempt
to coalesce
3b (optionally, given Xen support) Xen spends time copying pages around
in an attempt to coalesce
4 Guest reshatters guest and host pages by more ballooning.
5 goto 3

Step 3 is expensive (especially for Xen, which has far more important
things to be doing with its time) and self-perpetuating, because the
balloon driver keeps reshattering pages.

Several factors contribute to shattering host pages.  The ones which
come to mind are:
* Differing cacheability from MTRRs
* Mapping a foreign grant into ones own physical address space
* Releasing pages back to Xen via the decrease_reservation hypercall

In addition, other factors complicate Xen's ability to move pages.
* Mappings from other domains (Qemu, PV backends, etc.) will pin mfns in
place
* Any IOMMU mappings will pin all (mapped) mfns in place.

As a result, by far the most efficient way of preventing superpage
fragmentation is to not shatter superpages in the first place.  This
can be done by changing the balloon driver in the guest to co-locate
all the pages it decides to balloon, rather than taking individual
pages at random from the main memory pools.
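
As a sketch of the idea (names invented, GFP details glossed over):
the guest balloon driver could prefer whole, naturally aligned 2MB
blocks (order 9 on x86) when choosing what to balloon, so a superpage
is shattered at most once, and only fall back to scattered 4K pages
under memory pressure.

    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define BALLOON_2M_ORDER 9  /* 512 x 4K pages = one 2MB superpage */

    static struct page *balloon_alloc_colocated(void)
    {
        /* Buddy allocations of order 9 are naturally 2MB-aligned, so
         * ballooning this block shatters no additional superpages. */
        struct page *block = alloc_pages(GFP_KERNEL | __GFP_NOWARN,
                                         BALLOON_2M_ORDER);
        if (block)
            return block;

        /* Under pressure, fall back to a single scattered 4K page. */
        return alloc_page(GFP_KERNEL);
    }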

~Andrew

