
Re: [Xen-devel] domain creation vs querying free memory (xend and xl)



> From: Tim Deegan [mailto:tim@xxxxxxx]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
> 
> At 12:33 -0700 on 02 Oct (1349181195), Dan Magenheimer wrote:
> > > From: Tim Deegan [mailto:tim@xxxxxxx]
> > > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend 
> > > and xl)
> > >
> > > At 13:03 -0700 on 01 Oct (1349096617), Dan Magenheimer wrote:
> > > > Bearing in mind that I know almost nothing about xl or
> > > > the tools layer, and that, as a result, I tend to look
> > > > for hypervisor solutions, I'm thinking it's not possible to
> > > > solve this without direct participation of the hypervisor anyway,
> > > > at least while ensuring the solution will successfully
> > > > work with any memory technology that involves ballooning
> > > > with the possibility of overcommit (i.e. tmem, page sharing
> > > > and host-swapping, manual ballooning, PoD)...  EVEN if the
> > > > toolset is single threaded (i.e. only one domain may
> > > > be created at a time, such as xapi). [1]
> > >
> > > TTBOMK, Xapi actually _has_ solved this problem, even with ballooning
> > > and PoD.  I don't know if they have any plans to support sharing,
> > > swapping or tmem, though.
> >
> > Is this because PoD never independently increases the size of a domain's
> > allocation?
> 
> AIUI xapi uses the domains' maximum allocations, centrally controlled,
> to place an upper bound on the amount of guest memory that can be in
> use.  Within those limits there can be ballooning activity.  But TBH I
> don't know the details.

Yes, that's the same as saying there is no memory-overcommit.

The original problem occurs only if there are multiple threads
of execution that can be simultaneously asking the hypervisor
to allocate memory without the knowledge of a single centralized
"controller".
 
> > > Adding a 'reservation' of free pages that may only be allocated by a
> > > given domain should be straightforward enough, but I'm not sure it helps
> >
> > It absolutely does help.  With tmem (and I think with paging), the
> > total allocation of a domain may be increased without knowledge by
> > the toolset.
> 
> But not past the domains' maximum allowance, right?  That's not the case
> with paging, anyway.

Right.  We can quibble about memory hot-add, depending on its design.

> > > much.  In the 'balloon-to-fit' model where all memory is already
> > > allocated to some domain (or tmem), some part of the toolstack needs to
> > > sort out freeing up the memory before allocating it to another VM.
> >
> > By balloon-to-fit, do you mean that all RAM is occupied?  Tmem
> > handles the "sort out freeing up the memory" entirely in the
> > hypervisor, so the toolstack never knows.
> 
> Does tmem replace ballooning/sharing/swapping entirely?  I thought they
> could coexist.  Or, if you just mean that tmem owns all otherwise-free
> memory and will relinquish it on demand, then the same problems occur
> while the toolstack is moving memory from owned-by-guests to
> owned-by-tmem.

Tmem replaces sharing/swapping entirely for guests that support it.
Since supporting it requires kernel changes, not all guests will ever
support it; but now that full tmem support is in the Linux kernel, it
is possible that eventually all non-legacy Linux guests will have it.

Tmem handles all of these transfers of memory capacity between owners
dynamically in the hypervisor, essentially augmenting the page
allocator, so the hypervisor itself is the "controller".

Oh, and tmem doesn't replace ballooning at all... it works best with
selfballooning (which is also now in the Linux kernel).  Ballooning
is still a useful mechanism for moving memory capacity between
the guest and the hypervisor; tmem caches data and handles policy.

> > > Surely that component needs to handle the exclusion too - otherwise a
> > > series of small VM creations could stall a large one indefinitely.
> >
> > Not sure I understand this, but it seems feasible.
> 
> If you ask for a large VM and a small VM to be started at about the same
> time, the small VM will always win (since you'll free enough memory for
> the small VM before you free enough for the big one).  If you then ask
> for another small VM it will win again, and so forth, indefinitely
> postponing the large VM in the waiting-for-memory state, unless some
> agent explicitly enforces that VMs be started in order.  If you have
> such an agent you probably don't need a hypervisor interlock as well.

OK, I see, thanks.

> I think it would be better to back up a bit.  Maybe you could sketch out
> how you think [lib]xl ought to be handling ballooning/swapping/sharing/tmem
> when it's starting VMs.  I don't have a strong objection to accounting
> free memory to particular domains if it turns out to be useful, but as
> always I prefer not to have things happen in the hypervisor if they
> could happen in less privileged code.

I sketched it out earlier in this thread; I'll attach it again below.

I agree with your last statement in general, but would modify it to
"if they could happen efficiently and effectively in less privileged code".
Obviously everything that Xen does can be done in less privileged code...
in an emulator.  Emulators just don't do it fast enough.

Tmem's position is that "memory capacity transfers" at page
granularity can only be done efficiently in the hypervisor.  The same
is true for page-sharing when it breaks a "share": it can't go ask the
toolstack to approve the allocation of a new page every time a write
to a shared page occurs.
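
A hedged, self-contained sketch of that argument (hypothetical types
and names, simplified far beyond the real Xen mem_sharing code): the
unshare path runs in the hypervisor's write-fault handling and must
produce a private copy right there, so the domain's allocation grows
without any toolstack round trip.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct page   { unsigned char data[PAGE_SIZE]; };
    struct domain { long pages_allocated; };   /* grows behind the toolstack's back */

    static struct page *alloc_page_for(struct domain *d)
    {
        d->pages_allocated++;           /* accounting happens here, in the
                                         * "hypervisor", not in the toolstack */
        return malloc(sizeof(struct page));
    }

    /* Called from the write-fault path: cannot block waiting on dom0. */
    static struct page *break_share(struct domain *d, struct page *shared)
    {
        struct page *priv = alloc_page_for(d);
        if (!priv)
            return NULL;                /* out of memory: the fault cannot complete */
        memcpy(priv->data, shared->data, PAGE_SIZE);
        return priv;                    /* caller remaps the guest frame to priv */
    }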

Does that make sense?

So the original problem must be solved if either:
1) Domain creation is not serialized, or
2) Any domain's current memory allocation can be increased without
   the toolstack's approval.
Problem (1) arose independently and my interest is that it gets
solved in a way that (2) can also benefit.

Dan

(rough proposed design re-attached below, with a sketch of the
hypercall bookkeeping after it)

> From: Dan Magenheimer
> Sent: Monday, October 01, 2012 2:04 PM
>    :
>    :
> Back to design brainstorming:
> 
> The way I am thinking about it, the tools need to be involved
> to the extent that they would need to communicate to the
> hypervisor the following facts (probably via new hypercall):
> 
> X1) I am launching a domain X and it is eventually going to
>    consume up to a maximum of N MB.  Please tell me if
>    there is sufficient RAM available AND, if so, reserve
>    it until I tell you I am done. ("AND" implies transactional
>    semantics)
> X2) The launch of X is complete and I will not be requesting
>    the allocation of any more RAM for it.  Please release
>    the reservation, whether or not I've requested a total
>    of N MB.
> 
> The calls may be nested or partially ordered, i.e.
>    X1...Y1...Y2...X2
>    X1...Y1...X2...Y2
> and the hypervisor must be able to deal with this.
> 
> Then there would need to be two "versions" of "xm/xl free".
> We can quibble about which should be the default, but
> they would be:
> 
> - "xl --reserved free" asks the hypervisor how much RAM
>    is available taking into account reservations
> - "xm --raw free" asks the hypervisor for the instantaneous
>    amount of RAM unallocated, not counting reservations
> 
> When the tools are not launching a domain (that is there
> has been a matching X2 for all X1), the results of the
> above "free" queries are always identical.
> 
> So, IanJ, does this match up with the design you were thinking
> about?
> 
> Thanks,
> Dan
> 
> [1] I think the core culprits are (a) the hypervisor accounts for
> memory allocation of pages strictly on a first-come-first-served
> basis and (b) the tools don't have any form of need-this-much-memory
> "transaction" model

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

