
Re: [Xen-devel] Proposed new "memory capacity claim" hypercall/feature



> From: Tim Deegan [mailto:tim@xxxxxxx]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature

Hi Tim --

> At 11:43 -0800 on 04 Nov (1352029386), Dan Magenheimer wrote:
> > > From: Keir Fraser [mailto:keir@xxxxxxx]
> > > Sent: Friday, November 02, 2012 3:30 AM
> > > To: Jan Beulich; Dan Magenheimer
> > > Cc: Olaf Hering; Ian Campbell; George Dunlap; Ian Jackson; George Shuklin;
> > > Dario Faggioli; xen-devel@xxxxxxxxxxxxx; Konrad Rzeszutek Wilk;
> > > Kurt Hackel; Mukesh Rathor; Zhigang Wang; Tim Deegan
> > > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> > >
> > > On 02/11/2012 09:01, "Jan Beulich" <JBeulich@xxxxxxxx> wrote:
> > >
> > > > Plus, if necessary, that loop could be broken up so that only the
> > > > initial part of it gets run with the lock held (see c/s
> > > > 22135:69e8bb164683 for why the unlock was moved past the
> > > > loop). That would make for a shorter lock hold time, but for a
> > > higher allocation latency on large order allocations (due to worse
> > > > cache locality).
> > >
> > > In fact I believe only the first page needs to have its count_info set
> > > to != PGC_state_free, while the lock is held. That is sufficient to
> > > defeat the buddy merging in free_heap_pages(). Similarly, we could hoist
> > > most of the first loop in free_heap_pages() outside the lock. There's a
> > > lot of scope for optimisation here.
> >
> > (sorry for the delayed response)
> >
> > Aren't we getting a little sidetracked here?  (Maybe my fault for
> > looking at whether this specific loop is fast enough...)
> >
> > This loop handles only order-N chunks of RAM.  Speeding up this
> > loop and holding the heap_lock here for a shorter period only helps
> > with the TOCTOU race if the entire domain can be allocated as a
> > single order-N allocation.
> 
> I think the idea is to speed up allocation so that, even for a large VM,
> you can just allocate memory instead of needing a reservation hypercall
> (whose only purpose, AIUI, is to give you an immediate answer).

Its purpose is to give an immediate answer on whether sufficient
space is available for allocation AND to (atomically) claim it, so
that no other call to the allocator can race in and steal some or
all of it away.  So unless allocation can be sped up enough (for an
arbitrarily sized domain and an arbitrary state of memory
fragmentation) that the heap_lock can be held for the entire
duration, speeding up allocation doesn't solve the problem.
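
To pin down what "atomically claim" means here, below is a minimal
standalone C sketch (toy names only: try_claim, claimed_pages and
alloc_pages are hypothetical, not the proposed hypercall's actual
interface).  The capacity check and the bookkeeping that records the
claim happen under a single acquisition of the modelled heap_lock, so
no competing allocation can slip in between the check and the
reservation.

/* Toy model of an atomic capacity claim.  NOT Xen code: free_pages,
 * claimed_pages, try_claim and alloc_pages are illustrative names. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_pages = 1UL << 20;   /* pages currently free      */
static unsigned long claimed_pages = 0;        /* pages promised but unused */

/* Check capacity and record a claim for 'nr' pages in one critical
 * section, so the "yes" cannot be stolen by a racing allocation.    */
static bool try_claim(unsigned long nr)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if ( free_pages >= claimed_pages + nr )
    {
        claimed_pages += nr;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

/* Ordinary allocations must leave room for outstanding claims.
 * (A domain drawing down its own claim would also decrement
 * claimed_pages here; omitted for brevity.)                    */
static bool alloc_pages(unsigned long nr)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if ( free_pages >= claimed_pages + nr )
    {
        free_pages -= nr;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

int main(void)
{
    printf("claim 768k pages: %s\n", try_claim(768UL * 1024) ? "ok" : "fail");
    printf("alloc 512k pages: %s\n", alloc_pages(512UL * 1024) ? "ok" : "fail");
    return 0;
}

The point is not the data structure but the single lock acquisition:
once try_claim() returns true, the later (possibly slow, possibly
fragmented) per-chunk allocations for the new domain cannot fail for
lack of memory, however long they take.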
 
> > So unless the code for the _entire_ memory allocation path can
> > be optimized so that the heap_lock can be held across _all_ the
> > allocations necessary to create an arbitrary-sized domain, for
> > any arbitrary state of memory fragmentation, the original
> > problem has not been solved.
> >
> > Or am I misunderstanding?
> >
> > I _think_ the claim hypercall/subop should resolve this, though
> > admittedly I have yet to prove (and code) it.
> 
> I don't think it solves it - or rather it might solve this _particular_
> instance of it but it doesn't solve the bigger problem.  If you have a
> set of overcommitted hosts and you want to start a new VM, you need to:
> 
>  - (a) decide which of your hosts is the least overcommitted;
>  - (b) free up enough memory on that host to build the VM; and
>  - (c) build the VM.
>
> The claim hypercall _might_ fix (c) (if it could handle allocations that
> need address-width limits or contiguous pages).  But (b) and (a) have
> exactly the same problem, unless there is a central arbiter of memory
> allocation (or equivalent distributed system).  If you try to start 2
> VMs at once,
> 
>  - (a) the toolstack will choose to start them both on the same machine,
>        even if that's not optimal, or in the case where one creation is
>        _bound_ to fail after some delay.
>  - (b) the other VMs (and perhaps tmem) start ballooning out enough
>        memory to start the new VM.  This can take even longer than
>        allocating it since it depends on guest behaviour.  It can fail
>        after an arbitrary delay (ditto).
> 
> If you have a toolstack with enough knowledge and control over memory
> allocation to sort out stages (a) and (b) in such a way that there are
> no delayed failures, (c) should be trivial.

(You've used the labels (a) and (b) twice so I'm not quite sure
I understand... but in any case)

Sigh.  No, you are missing the beauty of tmem and dynamic allocation;
you are thinking from the old static paradigm where the toolstack
controls how much memory is available.  There is no central arbiter
of memory, any more than there is a central toolstack (other than the
hypervisor in a one-server Xen environment) that decides exactly
when to assign vcpus to pcpus.  There is no "free up enough memory
on that host".  Tmem doesn't start ballooning out enough memory
to start the VM... the guests are responsible for doing the ballooning
and it is _already done_.  The machine either has sufficient free+freeable
memory or it does not; and it is _that_ determination that needs
to be done atomically because many threads are micro-allocating, and
possibly multiple toolstack threads are macro-allocating,
simultaneously.
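
As a rough illustration of that determination (again a standalone toy,
not the real tmem interface: freeable_pages, outstanding_claims and
claim_capacity are made-up names), the test sums the truly free pages
and the tmem pages that could be surrendered on demand, and records
the claim in the same critical section, so the concurrent micro- and
macro-allocations all see a consistent answer:

/* Toy model of the free+freeable capacity determination.
 * NOT the real tmem/allocator interface; all names are illustrative. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned long free_pages         = 800000; /* pages on the free lists */
static unsigned long freeable_pages     = 300000; /* tmem pages that can be  */
                                                  /* surrendered on demand   */
static unsigned long outstanding_claims = 0;      /* pages already promised  */

/* The macro-allocation path: is there enough free+freeable capacity
 * for a whole domain, and if so, claim it before anyone else can.   */
static bool claim_capacity(unsigned long nr)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if ( free_pages + freeable_pages >= outstanding_claims + nr )
    {
        outstanding_claims += nr;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

/* A micro-allocation (e.g. a single tmem or ballooning page) takes the
 * same lock and must not eat into capacity promised to a prior claim. */
static bool micro_alloc(void)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if ( free_pages > 0 &&
         free_pages + freeable_pages >= outstanding_claims + 1 )
    {
        free_pages -= 1;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

(In this toy, whether a freeable page is actually surrendered matters
only later, when the claim is drawn down; the determination itself
never waits on a guest.)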

Everything is handled dynamically.  And just as a CPU scheduler
built into the hypervisor, dynamically assigning vcpus to pcpus,
has proven more effective than partitioning pcpus among different
domains, dynamic memory management should prove more effective
than some bossy toolstack trying to control memory statically.

I understand that you can solve "my" problem in your paradigm
without a claim hypercall and/or by speeding up allocations.
I _don't_ see that you can solve "my" problem in _my_ paradigm
without a claim hypercall... speeding up allocations doesn't
solve the TOCTOU race, so claiming sufficient space for a
domain must be atomic.
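
For completeness, the race in question in toy form (hypothetical names
again; this is the pattern being argued against, not code from Xen or
any toolstack): the capacity check and the per-chunk allocations are
separate critical sections, so a concurrent allocator can consume the
pages after the check says "yes", and the failure surfaces only
part-way through building the domain.

/* Toy illustration of the time-of-check/time-of-use window when there
 * is no atomic claim.  All names are illustrative.                    */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_pages = 1000000;

static unsigned long check_free(void)
{
    unsigned long n;

    pthread_mutex_lock(&heap_lock);
    n = free_pages;
    pthread_mutex_unlock(&heap_lock);
    return n;                          /* stale the moment it is returned */
}

static bool alloc_chunk(unsigned long nr)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if ( free_pages >= nr )
    {
        free_pages -= nr;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

/* Racy build: the test and the many chunk allocations are separate
 * critical sections, so another thread can take the memory after
 * check_free() says "yes"; the build then fails part-way through
 * (and real code would also have to unwind what it allocated).     */
static bool build_domain_racy(unsigned long nr_pages, unsigned long chunk)
{
    unsigned long done;

    if ( check_free() < nr_pages )            /* time of check ...     */
        return false;

    for ( done = 0; done < nr_pages; done += chunk )
        if ( !alloc_chunk(chunk) )            /* ... time of use: racy */
            return false;

    return true;
}

No amount of speeding up alloc_chunk() closes that window; only making
the check and the reservation a single operation does.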

Sigh.

Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

