
Re: [Xen-devel] Analysis of using balloon page compaction in Xen balloon driver

On 16/10/2014 18:12, Wei Liu wrote:
> This document analyses the impact of using balloon compaction
> infrastructure in Xen balloon driver.

This is a fantastic start (and I would actively encourage similar
documents from others for future work).

> ## Motives
> 1. Balloon pages fragment guest physical address space.
> 2. Balloon compaction infrastructure can migrate ballooned pages from
>    start of zone to end of zone, hence creating contiguous guest physical
>    address space.
> 3. Having contiguous guest physical address enables some options to
>    improve performance.
> ## Benefit for auto-translated guest
> HVM/PVH/ARM guests can have contiguous guest physical address space
> after balloon pages are compacted, which potentially improves memory
> performance provided the guest makes use of huge pages, either via
> Hugetlbfs or Transparent Huge Pages (THP).
> Consider the memory access pattern of these guests: one access to a
> guest physical address involves several accesses to machine memory.
> The total number of memory accesses can be represented as:
>> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> Hx denotes second stage page table walk levels and Gx denotes guest
> page table walk levels.

I don't think this expresses what you intend to express (or I don't
understand how you are trying to convey it).

Consider a single memory access in the guest, with no pagefaults, and
ignoring for now any TLB effects.  This description is based on my
knowledge of x86, but I assume ARM functions in a similar way.

Both the guest, G, and host, H, are 64-bit, so use 4-level
translations.  Consider first the worst case, where all mappings are 4K:

Following gcr3 to find gl4 involves a complete host pagetable walk
(4 translations).
Following gl4 to find gl3 involves a complete host pagetable walk
(4 translations).
gl3 to gl2: likewise (4 translations).
gl2 to gl1: likewise (4 translations).
gl1 to gpa: likewise (4 translations).

In the worst case, it takes 20 translations for a single guest memory
access.

Altering the guest to use a 2MB superpage would alleviate 4 translations.

Altering the host to use 2MB superpages would alleviate 5 translations.

It should be noted that the host pagetable walks are distinct, so a
single 2MB superpage could turn only a single host walk from 4
translations to 3 translations.  A guest can help itself substantially
by allocating its pagetables contiguously, which would cause multiple
guest translations to be contained within the same host translation,
getting much better temporal locality of reference from the TLB.

It should also be noted that Xen is in a far better position to make
easy use of 2MB and 1GB superpages, than a guest running normal
workloads is.  However, the ideal case is to make use of both host and
guest superpages.

> By having contiguous guest physical address space, the guest can make
> use of huge pages, reducing the number of G's in the formula.
> Reducing the number of H's is a separate project for hypervisor-side
> improvement and should be decoupled from the Linux-side changes.
> ## Design and implementation
> The use of balloon compaction doesn't require introducing new
> interfaces between Xen balloon driver and the rest of the system. Most
> changes are internal to Xen balloon driver.
> Currently, Xen balloon driver gets its pages directly from the page
> allocator. To enable balloon page migration, those pages now need to
> be allocated from the core balloon driver. Pages allocated from the
> core balloon driver are subject to balloon page compaction.
> Xen balloon driver will also need to provide a callback to migrate
> balloon pages. In essence the callback function receives "old page",
> which is an already ballooned-out page, and "new page", which is a
> page to be ballooned out, then it inflates "old page" and deflates
> "new page".
> The core of the migration callback is the XENMEM\_exchange hypercall.
> This makes sure that inflation of the old page and deflation of the
> new page are done atomically, so even if a domain is beyond its
> memory target and being enforced, it can still compact memory.
> ## HAP table fragmentation is not made worse
> *Assumption*: guest physical address space is already heavily
> fragmented by balloon pages when balloon page compaction is required.
> For a typical test case like ballooning up and down when doing kernel
> compilation, there's usually only a handful of huge pages left in the
> end. So the observation matches the assumption. On the other hand, if
> guest physical address space is not heavily fragmented, it's not
> likely balloon page compaction will be triggered automatically.
> In practice, balloon page compaction is not likely to make things
> worse. Here is the analysis based on the above assumption.
> Note that the HAP table is already shattered by balloon pages. When a
> guest page is ballooned out, the underlying HAP entry needs to be
> split, should that entry point to a huge page.
> XENMEM\_exchange works as follows: "old page" is the guest page about
> to be inflated and "new page" is the guest page about to be deflated.
> The steps are:
> 1. Steal old page from domain.
> 2. Allocate a heap page from domheap.
> 3. Release new page back to Xen.
> 4. Update guest physmap: old page points to heap page, new page points
>    to INVALID\_MFN.
> The end result is that the HAP entry for "old page" now points to a
> valid MFN instead of INVALID\_MFN; the HAP entry for "new page" now
> points to INVALID\_MFN.
> So for old page we're in the same position as before. The HAP table
> is fragmented, but not more fragmented than before.
> For new page, the risk is that if the targeted guest page is part of
> a huge page, we need to split a HAP entry, hence fragmenting the HAP
> table. This is a valid concern. However in practice, guest address
> space is already fragmented by ballooning, so it's unlikely we need
> to break up any more huge pages, because there aren't many left.
> We're in a position no worse than before.
> Another downside is that when Xen is exchanging a page, it may need
> to break up a huge page to get a 4K page, fragmenting the Xen
> domheap. However we're no worse off than before, as ballooning
> already fragments the domheap.
> ## Beyond Linux balloon compaction infrastructure
> Currently there's no mechanism in Xen to coalesce HAP table
> entries. To coalesce HAP entries we would need to make sure all
> discrete entries belonging to one huge page are in the correct order
> and correct state.
> By introducing the necessary infrastructure inside the hypervisor (page
> migration etc.), we might eventually be able to coalesce HAP entries,
> hence reducing the number of H's in the aforementioned formula. This,
> combined with the work on guest side, can help guest achieve best
> possible performance.

If I understand your proposal correctly, a VM lifetime would look like this:

1 Xen allocates pages (hopefully 2MB where possible)
2 Guest starts up, and shatters both guest and host superpages by
blindly ballooning random gfns
3a During runtime, guest spends time copying pages around in an attempt
to coalesce
3b (optionally, given Xen support) Xen spends time copying pages around
in an attempt to coalesce
4 Guest reshatters guest and host pages by more ballooning.
5 goto 3

Step 3 is expensive (especially for Xen, which has far more important
things to be doing with its time) and is self-perpetuating, because the
balloon driver reshatters pages.

Several factors contribute to shattering host pages.  The ones which
come to mind are:
* Differing cacheability from MTRRs
* Mapping a foreign grant into ones own physical address space
* Releasing pages back to Xen via the decrease_reservation hypercall

In addition, other factors complicate Xen's ability to move pages.
* Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
place.
* Any IOMMU mappings will pin all (mapped) mfns in place.

As a result, by far the most efficient way of preventing superpage
fragmentation is to not shatter superpages in the first place.  This can
be done by changing the balloon driver in the guest to co-locate all
pages it decides to balloon, rather than taking individual pages at
random from the main memory pools.

