
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



> From: Ian Campbell [mailto:Ian.Campbell@xxxxxxxxxx]
> Sent: Tuesday, January 08, 2013 2:03 AM
> To: Dan Magenheimer
> Cc: Andres Lagar-Cavilla; Tim (Xen.org); Konrad Rzeszutek Wilk; 
> xen-devel@xxxxxxxxxxxxx; Keir
> (Xen.org); George Dunlap; Ian Jackson; Jan Beulich
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of 
> problem and alternate
> solutions
> 
> On Mon, 2013-01-07 at 18:41 +0000, Dan Magenheimer wrote:
> > > From: Ian Campbell [mailto:Ian.Campbell@xxxxxxxxxx]
> > >
> > > On Thu, 2013-01-03 at 18:49 +0000, Dan Magenheimer wrote:
> > > >
> > > > Well, perhaps my statement is a bit heavy-handed, but I don't see
> > > > how it ends the discussion... you simply need to prove my statement
> > > > incorrect! ;-)  To me, that would mean pointing out any existing
> > > > implementation or even university research that successfully
> > > > predicts or externally infers future memory demand for guests.
> > > > (That's a good approximation of my definition of an omniscient
> > > > toolstack.)
> > >
> > > I don't think a solution involving massaging of tot_pages need involve
> > > either frequent changes to tot_pages or omniscience from the tool
> > > stack.
> > >
> > > Start by separating the lifetime_maxmem from current_maxmem. The
> > > lifetime_maxmem is internal to the toolstack (it is effectively your
> > > tot_pages from today) and current_maxmem becomes whatever the toolstack
> > > has actually pushed down into tot_pages at any given time.
> > >
> > > In the normal steady state lifetime_maxmem == current_maxmem.
> > >
> > > When you want to claim some memory in order to start a new domain of
> > > size M you *temporarily* reduce current_maxmem for some set of domains
> > > on the chosen host and arrange that the total of all the current_maxmems
> > > on the host is such that "HOST_MEM - SUM(current_maxmems) > M".
> > >
> > > Once the toolstack has built (or failed to build) the domain it can set
> > > all the current_maxmems back to their lifetime_maxmem values.
> > >
> > > If you want to build multiple domains in parallel then M just becomes
> > > the sum over all the domains currently being built.
> >
> > Hi Ian --
> >
> > Happy New Year!
> >
> > Perhaps you are missing an important point that is leading
> > you to oversimplify and draw conclusions based on that
> > oversimplification...
> >
> > We are _primarily_ discussing the case where physical RAM is
> > overcommitted, or to use your terminology IIUC:
> >
> >    SUM(lifetime_maxmem) > HOST_MEM
> 
> I understand this perfectly well.
> 
> > Thus:
> >
> > > In the normal steady state lifetime_maxmem == current_maxmem.
> >
> > is a flawed assumption, except perhaps as an initial condition
> > or in systems where RAM is almost never a bottleneck.
> 
> I see that I have incorrectly (but it seems at least consistently) said
> "d->tot_pages" where I meant d->max_pages. This was no doubt extremely
> confusing and does indeed render the scheme unworkable. Sorry.
> 
> AIUI you currently set d->max_pages == lifetime_maxmem. In the steady
> state therefore current_maxmem == lifetime_maxmem == d->max_pages and
> nothing changes compared with how things are for you today.
> 
> In the case where you are claiming some memory you change only max_pages
> (and not tot_pages as I incorrectly stated before; tot_pages can
> continue to vary dynamically, albeit with reduced range). So
> d->max_pages == current_maxmem which is derived as I describe previously
> (managing to keep my tot and max straight for once):
> 
>         When you want to claim some memory in order to start a new
>         domain of size M you *temporarily* reduce current_maxmem for
>         some set of domains on the chosen host and arrange that the
>         total of all the current_maxmems on the host is such that
>         "HOST_MEM - SUM(current_maxmems) > M".
> 
> I hope that clarifies what I was suggesting.
> 
> > Without that assumption, in your model, the toolstack must
> > make intelligent policy decisions about how to vary
> > current_maxmem relative to lifetime_maxmem, across all the
> > domains on the system.  Since the memory demands of any domain
> > often vary frequently, dramatically and unpredictably (i.e.
> > "spike") and since the performance consequences of inadequate
> > memory can be dire (i.e. "swap storm"), that is why I say the
> > toolstack (in your model) must both make frequent changes
> > to tot_pages and "be omniscient".
> 
> Agreed, I was mistaken in saying tot_pages where I meant max_pages.
> 
> My intention was to describe a scheme where max_pages would change only
> a) when you start building a new domain and b) when you finish building
> a domain. There should be no need to make adjustments between those
> events.
> 
> The inputs into the calculations are lifetime_maxmems for all domains,
> the current number of domains in the system, the initial allocation of
> any domain(s) currently being built (AKA the current claim) and the
> total physical RAM present in the host. AIUI all of those are either
> static or, if dynamic, only actually change when new domains are
> introduced/removed (or otherwise change infrequently).
> 
> > So, Ian, would you please acknowledge that the Oracle model
> > is valid and, in such cases where your maxmem assumption
> > is incorrect, that hypervisor-controlled capacity allocation
> > (i.e. XENMEM_claim_pages) is an acceptable solution?
> 
> I have no problem with the validity of the Oracle model. I don't think
> we have reached the consensus that the hypervisor-controlled capacity
> allocation is the only possible solution, or the preferable solution
> from the PoV of the hypervisor maintainers. In that sense it is
> "unacceptable" because things which can be done outside the hypervisor
> should be, and so I cannot acknowledge what you ask.
> 
> Apologies again for my incorrect use of tot_pages, which has led to this
> confusion.

Hi Ian --

> I have no problem with the validity of the Oracle model. I don't think
> we have reached the consensus that the hypervisor-controlled capacity
> allocation is the only possible solution, or the preferable solution
> from the PoV of the hypervisor maintainers. In that sense it is
> "unacceptable" because things which can be done outside the hypervisor
> should be, and so I cannot acknowledge what you ask.

IMHO, you have not yet demonstrated that your alternate proposal
solves the problem in the context that Oracle cares about, so I regret
that we must continue this discussion.

> I see that I have incorrectly (but it seems at least consistently) said
> "d->tot_pages" where I meant d->max_pages. This was no doubt extremely
> confusing and does indeed render the scheme unworkable. Sorry.

I am fairly sure I understood exactly what you were saying, and my
comments stand even with max_pages substituted for tot_pages: your
proposal works fine when no memory-overcommit technologies are active
(and thus for legacy proprietary guests), but it fails in the Oracle
context.
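
To be concrete, here is a minimal sketch of the scheme as I read it.
xc_domain_setmaxmem() is the real libxc wrapper for XEN_DOMCTL_max_mem;
the victim list and the choice of clamped values are hypothetical
toolstack bookkeeping, and that part is exactly what I question below:

    /* Sketch only, as I read Ian's proposal.  xc_domain_setmaxmem() is
     * the existing libxc call; struct victim and the clamped values
     * are hypothetical toolstack bookkeeping. */
    #include <xenctrl.h>
    #include <stdint.h>

    struct victim {
        uint32_t domid;
        uint64_t lifetime_maxmem_kb;   /* restored after the build */
        uint64_t clamped_maxmem_kb;    /* temporary current_maxmem */
    };

    /* Clamp d->max_pages on a chosen set of domains so that
     * HOST_MEM - SUM(current_maxmems) > M, build the new domain,
     * then restore every victim's lifetime_maxmem. */
    static int build_with_clamp(xc_interface *xch,
                                struct victim *v, int nr_victims,
                                int (*build_domain)(xc_interface *))
    {
        int i, rc = 0;

        for (i = 0; i < nr_victims; i++) {
            rc = xc_domain_setmaxmem(xch, v[i].domid,
                                     v[i].clamped_maxmem_kb);
            if (rc)
                goto restore;          /* undo partial clamping */
        }

        rc = build_domain(xch);        /* may fail; clamp ends either way */

    restore:
        while (--i >= 0)
            xc_domain_setmaxmem(xch, v[i].domid,
                                v[i].lifetime_maxmem_kb);
        return rc;
    }

Note that nothing above says how the victims or their clamped values
are chosen; that is where the policy intelligence has to live.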

So let's ensure we agree on a few premises:

First, you said we agree that we are discussing the case of overcommitted
memory, where:

   SUM(lifetime_maxmem) > HOST_MEM

So that's good.

Then a second premise I would like to confirm we agree on: in the
Oracle model, as I said, "open source guest kernels can intelligently
participate in optimizing their own memory usage... such guests are
now shipping" (FYI Fedora, Ubuntu, and Oracle Linux).  With these
mechanisms, there is direct guest->hypervisor interaction that causes
d->tot_pages to increase without the toolstack's knowledge.  This
interaction may (and does) occur from several domains simultaneously,
and the increase for any one domain may be frequent, unpredictable and
sometimes dramatic.
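
To make the mechanism concrete, this is roughly what such an
interaction looks like from inside the guest kernel (a condensed
sketch of the pattern the pvops balloon driver uses; the only check
the hypervisor applies is against d->max_pages):

    /* Guest-kernel-side sketch, condensed from the pvops balloon
     * driver pattern for reclaiming pages from the hypervisor.  The
     * hypervisor enforces only d->tot_pages + nr <= d->max_pages;
     * no toolstack component sees or approves this. */
    #include <xen/interface/memory.h>  /* struct xen_memory_reservation */
    #include <asm/xen/hypercall.h>     /* HYPERVISOR_memory_op() */

    static long grow_reservation(xen_pfn_t *frame_list,
                                 unsigned long nr_pages)
    {
        struct xen_memory_reservation reservation = {
            .extent_order = 0,         /* order-0 (4k) pages */
            .domid        = DOMID_SELF,
            .nr_extents   = nr_pages,
        };

        set_xen_guest_handle(reservation.extent_start, frame_list);

        /* Returns how many extents were actually populated;
         * d->tot_pages grows by that amount, asynchronously to any
         * toolstack view of the system. */
        return HYPERVISOR_memory_op(XENMEM_populate_physmap,
                                    &reservation);
    }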

Ian, do you agree with this premise, and that a "capacity allocation
solution" (whether hypervisor-based or toolstack-based) must work
properly in this context?  Or are you proposing to eliminate all such
interactions, or to insert the toolstack in the middle of them?

Next, in your most recent reply, I think you skipped replying to my
comment that "[in your proposal] the toolstack must make intelligent
policy decisions about how to vary current_maxmem relative to
lifetime_maxmem, across all the domains on the system [1]".  We seem
to disagree on whether this needs to be done only twice per domain
launch (once when domain creation starts and once when it finishes, in
your proposal) or more frequently.  But in either case, do you agree
that the toolstack is not equipped to make policy decisions across
multiple guests to do this, and that poor choices may have dire
consequences (swap storm, OOM) for a guest?  This is a third premise:
launching a domain should never cause another, unrelated domain to
crash.  Do you agree?
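
For contrast, the reason XENMEM_claim_pages sidesteps this policy
problem is that the capacity check and the reservation happen
atomically inside the hypervisor's page allocator.  A rough sketch of
the proposed semantics (simplified; treat the names as illustrative,
not as the exact patch):

    /* Rough sketch of the claim semantics, simplified.  The check and
     * the reservation are atomic under the allocator lock, so they
     * cannot race with any other allocation path, including
     * guest-driven ballooning. */
    #include <xen/spinlock.h>
    #include <xen/sched.h>

    static DEFINE_SPINLOCK(heap_lock);  /* the allocator's lock */
    static long total_avail_pages;      /* maintained by the allocator */
    static long outstanding_claims;     /* sum of unfulfilled claims */

    static int claim_pages(struct domain *d, unsigned long pages)
    {
        long unclaimed;
        int rc = -ENOMEM;

        spin_lock(&heap_lock);

        /* Succeed only if *unclaimed* free memory covers the request. */
        unclaimed = total_avail_pages - outstanding_claims;
        if (unclaimed >= 0 && pages <= (unsigned long)unclaimed) {
            d->outstanding_pages = pages;
            outstanding_claims += pages;
            rc = 0;
        }

        spin_unlock(&heap_lock);
        return rc;  /* immediate and O(1): no per-page work */
    }

Subsequent allocations for d are debited against d->outstanding_pages,
allocations for every other domain are checked against the reduced
unclaimed pool, and the claim is dropped when the build completes or
fails.  There is no per-domain max_pages juggling and no window in
which a parallel balloon inflation can eat the memory out from under
the build.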

I have more, but let's make sure we are on the same page
with these first.

Thanks,
Dan

[1] A clarification: in the Oracle model there is only one maxmem,
i.e. current_maxmem is always the same as lifetime_maxmem;
d->max_pages is fixed for the life of the domain and only
d->tot_pages varies, so no intelligence is required in the
toolstack.  AFAIK, the distinction between current_maxmem and
lifetime_maxmem was added for Citrix DMC support.
