
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



> From: Ian Campbell [mailto:Ian.Campbell@xxxxxxxxxx]
> Sent: Tuesday, January 08, 2013 2:03 AM
> To: Dan Magenheimer
> Cc: Andres Lagar-Cavilla; Tim (Xen.org); Konrad Rzeszutek Wilk; 
> xen-devel@xxxxxxxxxxxxx; Keir
> (Xen.org); George Dunlap; Ian Jackson; Jan Beulich
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of 
> problem and alternate
> solutions
> 
> On Mon, 2013-01-07 at 18:41 +0000, Dan Magenheimer wrote:
> > > From: Ian Campbell [mailto:Ian.Campbell@xxxxxxxxxx]
> > >
> > > On Thu, 2013-01-03 at 18:49 +0000, Dan Magenheimer wrote:
> > > >
> > > > Well, perhaps my statement is a bit heavy-handed, but I don't see
> > > > how it ends the discussion... you simply need to prove my statement
> > > > incorrect! ;-)  To me, that would mean pointing out any existing
> > > > implementation or even university research that successfully
> > > > predicts or externally infers future memory demand for guests.
> > > > (That's a good approximation of my definition of an omniscient
> > > > toolstack.)
> > >
> > > I don't think a solution involving massaging of tot_pages need involve
> > > either frequent changes to tot_pages or omniscience from the tool
> > > stack.
> > >
> > > Start by separating the lifetime_maxmem from current_maxmem. The
> > > lifetime_maxmem is internal to the toolstack (it is effectively your
> > > tot_pages from today) and current_maxmem becomes whatever the toolstack
> > > has actually pushed down into tot_pages at any given time.
> > >
> > > In the normal steady state lifetime_maxmem == current_maxmem.
> > >
> > > When you want to claim some memory in order to start a new domain of
> > > size M you *temporarily* reduce current_maxmem for some set of domains
> > > on the chosen host and arrange that the total of all the current_maxmems
> > > on the host is such that "HOST_MEM - SUM(current_maxmems) > M".
> > >
> > > Once the toolstack has built (or failed to build) the domain it can set
> > > all the current_maxmems back to their lifetime_maxmem values.
> > >
> > > If you want to build multiple domains in parallel then M just becomes
> > > the sum over all the domains currently being built.
> >
> > Hi Ian --
> >
> > Happy New Year!
> >
> > Perhaps you are missing an important point that is leading
> > you to oversimplify and draw conclusions based on that
> > oversimplification...
> >
> > We are _primarily_ discussing the case where physical RAM is
> > overcommitted, or to use your terminology IIUC:
> >
> >    SUM(lifetime_maxmem) > HOST_MEM
> 
> I understand this perfectly well.
> 
> > Thus:
> >
> > > In the normal steady state lifetime_maxmem == current_maxmem.
> >
> > is a flawed assumption, except perhaps as an initial condition
> > or in systems where RAM is almost never a bottleneck.
> 
> I see that I have incorrectly (but it seems at least consistently) said
> "d->tot_pages" where I meant d->max_pages. This was no doubt extremely
> confusing and does indeed render the scheme unworkable. Sorry.
> 
> AIUI you currently set d->max_pages == lifetime_maxmem. In the steady
> state therefore current_maxmem == lifetime_maxmem == d->max_pages and
> nothing changes compared with how things are for you today.
> 
> In the case where you are claiming some memory you change only max_pages
> (and not tot_pages as I incorrectly stated before; tot_pages can
> continue to vary dynamically, albeit with reduced range). So
> d->max_pages == current_maxmem which is derived as I describe previously
> (managing to keep my tot and max straight for once):
> 
>         When you want to claim some memory in order to start a new
>         domain of size M you *temporarily* reduce current_maxmem for
>         some set of domains on the chosen host and arrange that the
>         total of all the current_maxmems on the host is such that
>         "HOST_MEM - SUM(current_maxmems) > M".
> 
> I hope that clarifies what I was suggesting.
> 
> > Without that assumption, in your model, the toolstack must
> > make intelligent policy decisions about how to vary
> > current_maxmem relative to lifetime_maxmem, across all the
> > domains on the system.  Since the memory demands of any domain
> > often vary frequently, dramatically and unpredictably (i.e.
> > "spike") and since the performance consequences of inadequate
> > memory can be dire (i.e. "swap storm"), that is why I say the
> > toolstack (in your model) must both make frequent changes
> > to tot_pages and "be omniscient".
> 
> Agreed, I was mistaken in saying tot_pages where I meant max_pages.
> 
> My intention was to describe a scheme where max_pages would change only
> a) when you start building a new domain and b) when you finish building
> a domain. There should be no need to make adjustments between those
> events.
> 
> The inputs into the calculations are lifetime_maxmems for all domains,
> the current number of domains in the system, the initial allocation of
> any domain(s) currently being built (AKA the current claim) and the
> total physical RAM present in the host. AIUI all of those are either
> static or, if dynamic, only actually change when new domains are
> introduced/removed (or otherwise change infrequently).
> 
> > So, Ian, would you please acknowledge that the Oracle model
> > is valid and, in such cases where your maxmem assumption
> > is incorrect, that hypervisor-controlled capacity allocation
> > (i.e. XENMEM_claim_pages) is an acceptable solution?
> 
> I have no problem with the validity of the Oracle model. I don't think
> we have reached the consensus that the hypervisor-controlled capacity
> allocation is the only possible solution, or the preferable solution
> from the PoV of the hypervisor maintainers. In that sense it is
> "unacceptable" because things which can be done outside the hypervisor
> should be, and so I cannot acknowledge what you ask.
> 
> Apologies again for my incorrect use of tot_pages, which has led to this
> confusion.

Hi Ian --

> I have no problem with the validity of the Oracle model. I don't think
> we have reached the consensus that the hypervisor-controlled capacity
> allocation is the only possible solution, or the preferable solution
> from the PoV of the hypervisor maintainers. In that sense it is
> "unacceptable" because things which can be done outside the hypervisor
> should be, and so I cannot acknowledge what you ask.

IMHO, you have not yet demonstrated that your alternate proposal
solves the problem in the context that Oracle cares about, so I regret
that we must continue this discussion.

> I see that I have incorrectly (but it seems at least consistently) said
> "d->tot_pages" where I meant d->max_pages. This was no doubt extremely
> confusing and does indeed render the scheme unworkable. Sorry.

I am fairly sure I understood exactly what you were saying, and my
comments stand even with max_pages substituted for tot_pages: your
proposal works fine when no memory-overcommit technologies are active
(and thus for legacy proprietary guests), but it fails in the Oracle
context.
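
To be concrete, here is a minimal sketch of the scheme as I read it.
xc_domain_setmaxmem() is the real libxc wrapper for XEN_DOMCTL_max_mem;
the victim list and the choice of clamped values are hypothetical
toolstack bookkeeping, and that part is exactly what I question below:

    /* Sketch only, as I read Ian's proposal.  xc_domain_setmaxmem() is
     * the existing libxc call; struct victim and the clamped values
     * are hypothetical toolstack bookkeeping. */
    #include <xenctrl.h>
    #include <stdint.h>

    struct victim {
        uint32_t domid;
        uint64_t lifetime_maxmem_kb;   /* restored after the build */
        uint64_t clamped_maxmem_kb;    /* temporary current_maxmem */
    };

    /* Clamp d->max_pages on a chosen set of domains so that
     * HOST_MEM - SUM(current_maxmems) > M, build the new domain,
     * then restore every victim's lifetime_maxmem. */
    static int build_with_clamp(xc_interface *xch,
                                struct victim *v, int nr_victims,
                                int (*build_domain)(xc_interface *))
    {
        int i, rc = 0;

        for (i = 0; i < nr_victims; i++) {
            rc = xc_domain_setmaxmem(xch, v[i].domid,
                                     v[i].clamped_maxmem_kb);
            if (rc)
                goto restore;          /* undo partial clamping */
        }

        rc = build_domain(xch);        /* may fail; clamp ends either way */

    restore:
        while (--i >= 0)
            xc_domain_setmaxmem(xch, v[i].domid,
                                v[i].lifetime_maxmem_kb);
        return rc;
    }

Note that nothing above says how the victims or their clamped values
are chosen; that is where the policy intelligence has to live.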

So let's ensure we agree on a few premises:

First, you said we agree that we are discussing the case of overcommitted
memory, where:

   SUM(lifetime_maxmem) > HOST_MEM

So that's good.

Then a second premise I would like to confirm we agree on: in the
Oracle model, as I said, "open source guest kernels can intelligently
participate in optimizing their own memory usage... such guests are
now shipping" (FYI Fedora, Ubuntu, and Oracle Linux).  With these
mechanisms, there is direct guest->hypervisor interaction that causes
d->tot_pages to increase without the toolstack's knowledge.  This
interaction may (and does) occur from several domains simultaneously,
and the increase for any one domain may be frequent, unpredictable and
sometimes dramatic.
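
To make the mechanism concrete, this is roughly what such an
interaction looks like from inside the guest kernel (a condensed
sketch of the pattern the pvops balloon driver uses; the only check
the hypervisor applies is against d->max_pages):

    /* Guest-kernel-side sketch, condensed from the pvops balloon
     * driver pattern for reclaiming pages from the hypervisor.  The
     * hypervisor enforces only d->tot_pages + nr <= d->max_pages;
     * no toolstack component sees or approves this. */
    #include <xen/interface/memory.h>  /* struct xen_memory_reservation */
    #include <asm/xen/hypercall.h>     /* HYPERVISOR_memory_op() */

    static long grow_reservation(xen_pfn_t *frame_list,
                                 unsigned long nr_pages)
    {
        struct xen_memory_reservation reservation = {
            .extent_order = 0,         /* order-0 (4k) pages */
            .domid        = DOMID_SELF,
            .nr_extents   = nr_pages,
        };

        set_xen_guest_handle(reservation.extent_start, frame_list);

        /* Returns how many extents were actually populated;
         * d->tot_pages grows by that amount, asynchronously to any
         * toolstack view of the system. */
        return HYPERVISOR_memory_op(XENMEM_populate_physmap,
                                    &reservation);
    }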

Ian, do you agree with this premise, and that a "capacity allocation
solution" (whether hypervisor-based or toolstack-based) must work
properly in this context?  Or are you proposing to eliminate all such
interactions, or to insert the toolstack in the middle of them?

Next, in your most recent reply, I think you skipped replying to my
comment that "[in your proposal] the toolstack must make intelligent
policy decisions about how to vary current_maxmem relative to
lifetime_maxmem, across all the domains on the system [1]".  We seem
to disagree on whether this needs to be done only twice per domain
launch (once when domain creation starts and once when it finishes, in
your proposal) or more frequently.  But in either case, do you agree
that the toolstack is not equipped to make policy decisions across
multiple guests to do this, and that poor choices may have dire
consequences (swap storm, OOM) for a guest?  This is a third premise:
launching a domain should never cause another, unrelated domain to
crash.  Do you agree?
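
For contrast, the reason XENMEM_claim_pages sidesteps this policy
problem is that the capacity check and the reservation happen
atomically inside the hypervisor's page allocator.  A rough sketch of
the proposed semantics (simplified; treat the names as illustrative,
not as the exact patch):

    /* Rough sketch of the claim semantics, simplified.  The check and
     * the reservation are atomic under the allocator lock, so they
     * cannot race with any other allocation path, including
     * guest-driven ballooning. */
    #include <xen/spinlock.h>
    #include <xen/sched.h>

    static DEFINE_SPINLOCK(heap_lock);  /* the allocator's lock */
    static long total_avail_pages;      /* maintained by the allocator */
    static long outstanding_claims;     /* sum of unfulfilled claims */

    static int claim_pages(struct domain *d, unsigned long pages)
    {
        long unclaimed;
        int rc = -ENOMEM;

        spin_lock(&heap_lock);

        /* Succeed only if *unclaimed* free memory covers the request. */
        unclaimed = total_avail_pages - outstanding_claims;
        if (unclaimed >= 0 && pages <= (unsigned long)unclaimed) {
            d->outstanding_pages = pages;
            outstanding_claims += pages;
            rc = 0;
        }

        spin_unlock(&heap_lock);
        return rc;  /* immediate and O(1): no per-page work */
    }

Subsequent allocations for d are debited against d->outstanding_pages,
allocations for every other domain are checked against the reduced
unclaimed pool, and the claim is dropped when the build completes or
fails.  There is no per-domain max_pages juggling and no window in
which a parallel balloon inflation can eat the memory out from under
the build.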

I have more, but let's make sure we are on the same page
with these first.

Thanks,
Dan

[1] A clarification: in the Oracle model there is only one maxmem,
i.e. current_maxmem is always the same as lifetime_maxmem;
d->max_pages is fixed for the life of the domain and only
d->tot_pages varies, so no intelligence is required in the
toolstack.  AFAIK, the distinction between current_maxmem and
lifetime_maxmem was added for Citrix DMC support.
