
Re: [Xen-devel] Analysis of using balloon page compaction in Xen balloon driver

On Fri, Oct 17, 2014 at 01:30:21PM +0100, Andrew Cooper wrote:
> On 16/10/2014 18:12, Wei Liu wrote:
> > This document analyses the impact of using balloon compaction
> > infrastructure in Xen balloon driver.
> This is a fantastic start (and I actively recommend similar documents
> from others for future work).
> > ## Motives
> >
> > 1. Balloon pages fragment the guest physical address space.
> > 2. Balloon compaction infrastructure can migrate ballooned pages from
> >    start of zone to end of zone, hence creating contiguous guest physical
> >    address space.
> > 3. Having contiguous guest physical address enables some options to
> >    improve performance.
> >
> > ## Benefit for auto-translated guest
> >
> > HVM/PVH/ARM guests can have contiguous guest physical address space
> > after balloon pages are compacted, which potentially improves memory
> > performance, provided the guest makes use of huge pages, either via
> > Hugetlbfs or Transparent Huge Pages (THP).
> >
> > Consider the memory access pattern of these guests: one access to a
> > guest physical address involves several accesses to machine memory. The
> > total number of memory accesses can be represented as:
> >
> >> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> > Hx denotes second stage page table walk levels and Gx denotes guest
> > page table walk levels.
> I don't think this expresses what you intend to express (or I don't
> understand how you are trying to convey it).
> Consider a single memory access in the guest, with no pagefaults, and
> ignoring for now any TLB effects.  This description is based on my
> knowledge of x86, but I assume ARM functions in a similar way.
> Both the guest, G, and host, H, are 64bit, so using 4-level
> translations.  Consider first, the worst case where all mappings are 4K
> pages.
> gcr3 needs following to find gl4.  This involves a complete host
> pagetable walk (4 translations)
> gl4 needs following to find gl3.  This involves a complete host
> pagetable walk (4 translations)
> gl3 to gl2 ...
> gl2 to gl1 ...
> gl1 to gpa ...
> In the worst case, it takes 20 translations for a single guest memory
> access.
> Altering the guest to use a 2MB superpage would alleviate 4 translations.
> Altering the host to use 2MB superpages would alleviate 5 translations.

That's basically what I was trying to express in that formula, except
that I missed the gcr3 lookup. I think I wrote the wrong definition of
Hx: Hx should be the number of host page table walks (not levels). Also,
"reduce the number of H's" should read "make each individual H smaller".

So we're still on the same page here.
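With that correction, one way to restate the worst case is the sketch
below, assuming every host walk has the same depth H and the guest walk
has G levels (the extra walk accounts for following gcr3):

```latex
% G = guest page table levels, H = depth of one host (second stage) walk
X = (G + 1) \cdot H
% 4-level guest, 4-level host:   X = (4 + 1) \cdot 4 = 20 translations
% 2MB guest superpages (G = 3):  X = 16  (alleviates 4)
% 2MB host superpages  (H = 3):  X = 15  (alleviates 5)
```

The two reduced cases match the "alleviate 4" / "alleviate 5" numbers
above.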

> It should be noted that the host pagetable walks are distinct, so a
> single 2MB superpage could turn only a single host walk from 4
> translations to 3 translations.  A guest can help itself substantially
> by allocating its pagetables contiguously, which would cause multiple
> guest translations to be contained within the same host translation,
> getting much better temporal locality of reference from the TLB.


> It should also be noted that Xen is in a far better position to make
> easy use of 2MB and 1GB superpages, than a guest running normal
> workloads is. 

I don't think there is / should be linkage between Xen's knowledge and
the guest's knowledge, is there? The guest only sees its own address
space; it cannot control what types of pages back its address space. So
this is a moot point IMHO.

> However, the ideal case is to make use of both host and
> guest superpages.

Yes. This work is to enable guest to more easily make use of huge pages.

Host part is not included and should be solved separately.

> > By having contiguous guest physical address, guest can make use of
> > huge pages. This can reduce the number of G's in formula.
> >
> > Reducing number of H's is another project for hypervisor side
> > improvement and should be decoupled from Linux side changes.
> >
> > ## Design and implementation
> >
> > The use of balloon compaction doesn't require introducing new
> > interfaces between Xen balloon driver and the rest of the system. Most
> > changes are internal to Xen balloon driver.
> >
> > Currently, the Xen balloon driver gets its pages directly from the page
> > allocator. To enable balloon page migration, those pages now need to
> > be allocated from the core balloon driver. Pages allocated from the
> > core balloon driver are subject to balloon page compaction.
> >
> > The Xen balloon driver will also need to provide a callback to migrate
> > balloon pages. In essence, the callback function receives an "old page",
> > which is an already ballooned-out page, and a "new page", which is a
> > page to be ballooned out; it then inflates the "old page" and deflates
> > the "new page".
> >
> > The core of the migration callback is the XENMEM\_exchange hypercall.
> > This makes sure that inflation of the old page and deflation of the new
> > page are done atomically, so even if a domain is beyond its memory
> > target and being enforced, it can still compact memory.
> >
> > ## HAP table fragmentation is not made worse
> >
> > *Assumption*: guest physical address space is already heavily
> > fragmented by balloon pages when balloon page compaction is required.
> >
> > For a typical test case, like ballooning up and down while doing a
> > kernel compilation, there are usually only a handful of huge pages left
> > in the end, so the observation matches the assumption. On the other
> > hand, if the guest physical address space is not heavily fragmented,
> > it's not likely that balloon page compaction will be triggered
> > automatically.
> >
> > In practice, balloon page compaction is not likely to make things
> > worse. Here is the analysis based on the above assumption.
> >
> > Note that the HAP table is already shattered by balloon pages. When a
> > guest page is ballooned out, the underlying HAP entry needs to be
> > split, should that entry point to a huge page.
> >
> > XENMEM\_exchange works as follows, where "old page" is the guest page
> > about to get inflated and "new page" is the guest page about to get
> > deflated:
> >
> > 1. Steal the old page from the domain.
> > 2. Allocate a heap page from the domheap.
> > 3. Release the new page back to Xen.
> > 4. Update the guest physmap: the old page points to the heap page, the
> >    new page points to INVALID\_MFN.
> >
> > The end result is that the HAP entry for the "old page" now points to a
> > valid MFN instead of INVALID\_MFN; the HAP entry for the "new page" now
> > points to INVALID\_MFN.
> >
> > So for the old page we're in the same position as before: the HAP table
> > is fragmented, but it's not more fragmented than before.
> >
> > For the new page, the risk is that if the targeted new page is part of
> > a huge page, we need to split a HAP entry, hence fragmenting the HAP
> > table. This is a valid concern. However, in practice the guest address
> > space is already fragmented by ballooning, and it's not likely we need
> > to break up any more huge pages, because there aren't that many left.
> > So we're in a position no worse than before.
> >
> > Another downside is that when Xen is exchanging a page, it may need to
> > break up a huge page to get a 4K page, fragmenting the Xen domheap.
> > However, we're not getting any worse than before, as ballooning already
> > fragments the domheap.
> >
> > ## Beyond Linux balloon compaction infrastructure
> >
> > Currently there's no mechanism in Xen to coalesce HAP table
> > entries. To coalesce HAP entries we would need to make sure that all
> > the discrete entries belong to one huge page, and are in the correct
> > order and correct state.
> >
> > By introducing the necessary infrastructure inside the hypervisor (page
> > migration etc.), we might eventually be able to coalesce HAP entries,
> > hence reducing the number of H's in the aforementioned formula. This,
> > combined with the work on the guest side, can help the guest achieve
> > the best possible performance.
> If I understand your proposal correctly, a VM lifetime would look like this:
> 1 Xen allocates pages (hopefully 2MB where possible)
> 2 Guest starts up, and shatters both guest and host superpages by
> blindly ballooning random gfns
> 3a During runtime, guest spends time copying pages around in an attempt
> to coalesce
> 3b (optionally, given Xen support) Xen spends time copying pages around
> in an attempt to coalesce
> 4 Guest reshatters guest and host pages by more ballooning.
> 5 goto 3


> Step 3 is an expensive operation (especially for Xen, which has far more
> important things to be doing with its time) and is self-perpetuating
> because the balloon driver reshatters pages.
> Several factors contribute to shattering host pages.  The ones which
> come to mind are:
> * Differing cacheability from MTRRs
> * Mapping a foreign grant into ones own physical address space

I'm not quite sure about the first one, but the second one is not
affected by balloon compaction, as those pages, even when mapped over
ballooned pages, are handled separately.

> * Releasing pages back to Xen via the decrease_reservation hypercall

This will affect Xen heap, however we're not making it worse. See my
analysis above.

> In addition, other factors complicate Xen's ability to move pages.
> * Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
> place
> * Any IOMMU mappings will pin all (mapped) mfns in place.

Right. But balloon compaction is not making things any worse either.

> As a result, by far the most efficient way of preventing superpage
> fragmentation is to not shatter them in the first place.  This can be
> done by changing the balloon driver in the guest to co-locate all pages
> it decides to balloon, rather than taking individual pages at random
> from the main memory pools.

I understand your concern, and I agree that the best solution so far is
to avoid shattering in the first place.  However, these points won't
invalidate this work, because balloon page compaction is not making
things any worse, and a few tricks inside the Xen balloon driver (as you
proposed) can make things better.

As for your proposal of "changing the balloon driver in the guest to
co-locate all pages it decides to balloon", I can see two upstreamable
solutions at a quick glance:

1. Allocate / release huge pages from / to the hypervisor in the first
   place.
2. Allocate normal pages, and occasionally swap them with huge pages if
   resources (both in Xen and the guest) permit.

#1 is not practical in a busy system, and it also won't work against a
"bad neighbor".  (I'm very happy to just change the "order" in every
page allocation call if that would do what we want.)

#2 requires balloon page compaction, which is exactly what this series
provides.


> ~Andrew

Xen-devel mailing list


