
Re: [Xen-devel] Linux Xen Balloon Driver Improvement (Draft 2)

On Mon, Oct 27, 2014 at 05:29:16PM +0000, David Vrabel wrote:
> On 27/10/14 16:29, Wei Liu wrote:
> > On Mon, Oct 27, 2014 at 02:23:22PM +0000, David Vrabel wrote:
> >> On 27/10/14 12:33, Wei Liu wrote:
> >>>
> >>> Changes in this version:
> >>>
> >>> 1. Style, grammar and typo fixes.
> >>> 2. Make this document Linux centric.
> >>> 3. Add a new section for NUMA-aware ballooning.
> >>
> >> You've not included the required changes to the toolstack and
> >> autoballoon driver to always use 2M multiples when creating VMs and
> >> setting targets.
> >>
> > 
> > When creating VM, toolstack already tries to use as many huge pages as
> > possible.
> > 
> > Setting target doesn't use 2M multiples.  But I don't think this is
> > necessary. To balloon in / out X MB memory
> > 
> >   nr_2m = X / 2M
> >   nr_4k = (X % 2M) / 4k
> > 
> > The remainder just goes to 4K queue.
> I understand that it will work with 4k multiples but it is not /optimal/
> to do so since it will result in more fragmentation.

The fragmentation should be less than 2M right? Is that terrible?
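
To put a number on that, here is a minimal C sketch of the split
arithmetic. The names (`split_request`, `struct balloon_split`) are made
up for illustration and are not taken from the driver. Whatever the
request size, the 4K remainder can never exceed 511 pages, i.e. just
under 2M:

```c
#include <stdint.h>

#define SIZE_4K (UINT64_C(1) << 12)
#define SIZE_2M (UINT64_C(1) << 21)

/* Hypothetical result type: how a request is divided between queues. */
struct balloon_split {
    uint64_t nr_2m; /* whole 2M pages */
    uint64_t nr_4k; /* leftover 4K pages, always < 512 */
};

/* Split a request of `bytes` (assumed to be a 4K multiple). */
static struct balloon_split split_request(uint64_t bytes)
{
    struct balloon_split s;

    s.nr_2m = bytes / SIZE_2M;             /* quotient goes to the 2M queue */
    s.nr_4k = (bytes % SIZE_2M) / SIZE_4K; /* remainder goes to the 4K queue */
    return s;
}
```

For example, a request of 5M + 12K splits into two 2M pages plus 259 4K
pages.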

> > And what do you mean by "autoballoon" driver? Do you mean functionality
> > of xl? In the end the request is still fulfilled by Xen balloon driver
> > in kernel. So if dom0 is using the new balloon driver proposed here, it
> > should balloon down in 2M multiples automatically.
> Both xl and the auto-balloon driver in the kernel should only set the
> target in multiples of 2M.

This is easy to achieve. I can always round up to 2M multiples in the
balloon driver. The toolstack change is simple as well. However, it is
quite possible that a newer kernel runs on an older toolstack, or even
on a homebrew toolstack that doesn't set the target in 2M multiples. So
there will always be suboptimal situations, be it 1) ballooning out a
bit more memory than requested or 2) a little bit of fragmentation.
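
A rough sketch of the rounding in C, assuming the driver counts the
request in 4K pages. The helper names are hypothetical, not existing
kernel functions:

```c
#include <stdint.h>

#define PAGES_PER_2M UINT64_C(512) /* 2M / 4K */

/* Round a count of 4K pages up to the next multiple of 512 (one 2M page). */
static uint64_t round_up_to_2m(uint64_t nr_4k_pages)
{
    return (nr_4k_pages + PAGES_PER_2M - 1) / PAGES_PER_2M * PAGES_PER_2M;
}

/* Extra 4K pages ballooned out beyond the request: at most 511. */
static uint64_t overshoot_pages(uint64_t nr_4k_pages)
{
    return round_up_to_2m(nr_4k_pages) - nr_4k_pages;
}
```

So the cost of case 1) is bounded by one page short of 2M per request.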

I don't have a very strong opinion on this. I will round up to 2M
multiples in the balloon driver. Toolstack change will be introduced

> >>> ## Goal of improvement
> >>>
> >>> The balloon driver makes use of as many huge pages as possible,
> >>> defragmenting guest address space. Contiguous guest address space
> >>> permits huge page ballooning which helps prevent host address space
> >>> fragmentation.
> >>>
> >>> This should be achieved without any particular hypervisor side
> >>> feature.
> >>
> >> I really think you need to be taking whole-system view and not focusing
> >> on just the guest balloon driver.
> >>
> > 
> > I don't think there's terribly tight linkage between hypervisor side
> > change and guest side change.
> I don't see how you can think this unless you also have a design for the
> hypervisor side.

Because the basic requirement for this design is to not rely on any
hypervisor side feature, so that it works on older hypervisors as well.
So far the proposed design seems to stick to that principle well.

> I do not want a situation where effective and efficient host
> defragmentation requires balloon driver changes to avoid a regression.

Fair enough. I think that should be classified as a bug in the
hypervisor. We should not change the guest side for that reason. More
reasoning follows near the end of this mail...

> > To have the guest automatically defragment its address space while at
> > the same time helping prevent hypervisor memory from fragmenting (at least
> > this is what the design aims for, as for how it works in practice, it
> > needs to be prototyped and benchmarked).
> > 
> > The above reasoning is good enough to justify this change, isn't it?
> Having a whole system design does not mean that it must be all
> implemented.  If one part has benefits independently from the rest then
> it can be implemented and merged.

So are you worried that this change in the guest makes the corresponding
feature in the hypervisor harder to implement? Do you see harm (whether
to the guest itself or to the hypervisor) in this guest side change? Are
we any worse off than before, at least from a theoretical point of view?
Of course in practice we would still need to see how this goes.

> >>> ### Periodically exchange normal size pages with huge pages
> >>>
> >>> Worker thread wakes up periodically to check if there are enough pages
> >>> in normal size page queue to coalesce into a huge page. If so, it will
> >>> try to exchange that huge page into a number of normal size pages with
> >>> XENMEM\_exchange hypercall.
> >>
> >> I don't see what this is supposed to achieve.  This is going to take a
> >> (potentially) non-fragmented superpage and fragment it.
> >>
> > 
> > Let's look at this from start of day.
> > 
> > Guest always tries to balloon in / out as many 2M pages as possible. So
> > if we have a long list of 4K pages, it means the underlying host super
> > frames are fragmented already.
> > 
> > So if 1) there are enough 4K pages in ballooned out list, 2) there is a
> > spare 2M page, it means that the 2M page comes from the result of
> > balloon page compaction, which means the underlying host super frame is
> > fragmented.
> This assumption is only true because your page migration isn't trying
> hard enough to defragment super frames,

However hard it tries, if the hypervisor is not defragmenting, this
assumption still stands. As long as you get the 2M page as a result of
balloon compaction, the underlying host frame is fragmented. Note, we're
not worse than before.
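
For reference, the two conditions the worker checks can be sketched in C
as below. This is only an illustration under my own assumptions: the
struct and function names are hypothetical, and the actual
XENMEM\_exchange call is not shown.

```c
#include <stdint.h>

#define SMALL_PER_HUGE UINT64_C(512) /* 4K pages per 2M page */

/* Hypothetical bookkeeping for the periodic worker. */
struct balloon_queues {
    uint64_t nr_4k_ballooned; /* 4K pages on the ballooned out list */
    uint64_t nr_2m_spare;     /* spare 2M pages produced by compaction */
};

/*
 * Number of exchanges the worker could attempt on this wakeup:
 * each one needs 1) 512 ballooned out 4K pages and 2) one spare 2M page.
 */
static uint64_t nr_exchanges_possible(const struct balloon_queues *q)
{
    uint64_t by_4k = q->nr_4k_ballooned / SMALL_PER_HUGE;

    return by_4k < q->nr_2m_spare ? by_4k : q->nr_2m_spare;
}
```

If the result is zero the worker simply goes back to sleep.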

> and it is assuming that Xen does
> nothing to address host super frame fragmentation.  This highlights the
> importance of looking at a system-level for designs, IMO.

What would make this design different when Xen knows how to defragment

We end up ballooning out a 2M host frame if the underlying huge frame is
defragmented (instead of a bunch of 4K frames). We're giving a huge
frame back to Xen, so it's OK; then we exchange in 512 consecutive 4K
pages (or a 2M page if we merge them) with a 2M frame backing them. Xen
is not harmed; the guest has now got a huge frame. It only makes things
better if Xen knows how to defragment.

In any case, I will need to prototype different approaches to see which
works best. I think figures of Xen heap fragmentation and guest P2M
entry counts grouped by page order will be interesting.
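
The per-order counting itself is trivial. A sketch in C, assuming the
orders of the P2M entries have already been collected into an array (how
they are obtained is not covered here, and the names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: orders of interest are 0 (4K), 9 (2M) and 18 (1G). */
#define MAX_ORDER_TRACKED 18

/* Fill hist[o] with the number of entries whose page order is o. */
static void count_by_order(const unsigned int *orders, size_t n,
                           uint64_t hist[MAX_ORDER_TRACKED + 1])
{
    size_t i;

    for (i = 0; i <= MAX_ORDER_TRACKED; i++)
        hist[i] = 0;

    for (i = 0; i < n; i++) {
        if (orders[i] <= MAX_ORDER_TRACKED) /* ignore unexpected orders */
            hist[orders[i]]++;
    }
}
```

Comparing hist[0] and hist[9] before and after a ballooning run would
give a first idea of how fragmented the guest is.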

Does implementing the Xen side feature require changes to the guest
balloon driver? From the guest's point of view I don't see any. To do
any work with regard to changing the guest P2M we would surely need to
get hold of the domain lock and the p2m lock, in which case a guest is
blocked from issuing any memory hypercall anyway.

If contention is a problem, how are we worse off than what we have now?
Ballooning in / out certainly causes contention too. All we can do from
the guest side is avoid trying too hard. But if the guest tries too hard
it's harming itself anyway.


> David
