Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
On 18/12/12 22:17, Konrad Rzeszutek Wilk wrote:

Hi Dan, an issue with your reasoning throughout has been the constant invocation of the multi-host environment as a justification for your proposal. But this argument is not used in your proposal below beyond this mention in passing. Further, there is no relation between what you are changing (the hypervisor) and what you are claiming it is needed for (multi-host VM management).

Heh. I hadn't realized that the emails need to conform to the way legal briefs are written in the US :-) Meaning that each topic must be addressed.

Every time we try to suggest alternatives, Dan goes on some rant about how we're on different planets, how we're all old-guard stuck in static-land thinking, and how we're focused on single-server use cases, but that multi-server use cases are so different. That's not a one-off; Dan has brought up the multi-server case as a reason that a user-space version won't work several times. But when it comes down to it, he (apparently) has barely mentioned it. If it's such a key point, why does he not bring it up here? It turns out we were right all along -- the whole multi-server thing has nothing to do with it. That's the point Andres is getting at, I think.

(FYI I'm not wasting my time reading mail from Dan anymore on this subject. As far as I can tell, in this entire discussion he has never changed his mind or his core argument in response to anything anyone has said, nor has he understood our ideas or where we are coming from any better. He has only responded by generating more verbiage than anyone has the time to read and understand, much less respond to. That's why I suggested to Dan that he ask someone else to take over the conversation.)

Anyhow, a multi-host environment and a single-host environment have the same issue: you try to launch multiple guests, and some of them might not launch. The changes that Dan is proposing (the claim hypercall) would provide the functionality to fix this problem.

A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host? This may be justified with arguments pertaining to the stability of the in-guest workload. What they really reveal are limitations of ballooning. But the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hypercall.

Why is this a limitation? Why shouldn't the guest be allowed to change its memory usage? It can go up and down as it sees fit. And if it goes down and it gets better performance -- well, why shouldn't it do it? I concur it is odd -- but it has been like that for decades.

Well, it shouldn't be allowed to do it because it causes this problem you're having with creating guests in parallel. Ultimately, that is the core of your problem. So if you want us to solve the problem by implementing something in the hypervisor, then you need to justify why "Just don't have guests balloon down" is an unacceptable option. Saying "why shouldn't it", and "it's been that way for decades*", isn't a good enough reason.

* Xen is only just 10, so "decades" is a bit of a hyperbole. :-)

[...] the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack's knowledge. The toolstack controls constraints (essentially a minimum and maximum) which the hypervisor enforces.
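(To make the enforcement just described concrete: a minimal stand-alone sketch, with invented types that only mirror the idea of Xen's per-domain accounting, not the actual hypervisor source. The point is that the footprint can move freely below the ceiling with no toolstack involvement; only the ceiling itself is enforced.)

#include <stdbool.h>
#include <stdint.h>

/* Invented stand-in for the hypervisor's per-domain accounting; the
 * field names mirror the idea, not Xen's real struct domain. */
struct dom_accounting {
    uint64_t tot_pages;   /* current footprint */
    uint64_t max_pages;   /* toolstack-set ceiling */
};

/* The guest may balloon its footprint up or down at will; the only
 * toolstack-visible constraint is that an allocation pushing tot_pages
 * past max_pages is refused.  No upcall to the toolstack happens when
 * the footprint changes within those bounds. */
static bool try_grow_footprint(struct dom_accounting *d, uint64_t nr_pages)
{
    if (d->tot_pages + nr_pages > d->max_pages)
        return false;          /* over the toolstack-imposed maximum */
    d->tot_pages += nr_pages;  /* free-page count changes, toolstack unaware */
    return true;
}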
The toolstack can ensure that the minimum and maximum are identical to essentially disallow Linux from using this functionality. Indeed, this is precisely what Citrix's Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory footprint changes. But DMC is not prescribed by the toolstack, ...

Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained and how often it wants to be alerted, and when to stall the domain.

There is a down-call (so, events) to the toolstack from the hypervisor when the guest tries to balloon in/out? So the need to handle this arose, but the mechanism to deal with it has been shifted to user space then? What to do when the guest balloons in/out at frequent intervals? I am actually missing the reasoning behind wanting to stall the domain. Is that to compress/swap the pages that the guest requests? Meaning a user-space daemon that does "things" and has ownership of the pages?

...and some real Oracle Linux customers use and depend on the flexibility provided by in-guest ballooning. So guest-privileged-user-driven ballooning is a potential issue for toolstack-based capacity allocation. [IIGT: This is why I have brought up DMC several times and have called this the "Citrix model"... I'm not trying to be snippy or impugn your morals as maintainers.]

B) Xen's page sharing feature has slowly been completed over a number of recent Xen releases. It takes advantage of the fact that many pages often contain identical data; the hypervisor merges them to save...

Great care has been taken for this statement to not be exactly true. The hypervisor discards one of two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discarded page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"

Is the toolstack (or a daemon in userspace) doing this? I would have thought that there would be some optimization to do this somewhere?

...physical RAM. When any "shared" page is written, the hypervisor "splits" the page (aka copy-on-write) by allocating a new physical page. There is a long history of this feature in other virtualization products and it is known to be possible that, under many circumstances, thousands of splits may occur in any fraction of a second. The hypervisor does not notify or ask permission of the toolstack. So, page-splitting is an issue for toolstack-based capacity allocation, at least as currently coded in Xen. [Andre: Please hold your objection here until you read further.]

Name is Andres. And please cc me if you'll be addressing me directly! Note that I don't disagree with your previous statement in itself, although "page-splitting" is fairly unique terminology, and confusing (at least to me). CoW works.

<nods>
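(As an aside, the capacity effect of such a "split" can be sketched as follows -- a simplified stand-in, not the real mem_sharing code, with all names invented. The point is only that a guest write to a shared page consumes a free host page with no toolstack notification.)

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Simplified stand-ins: a shared frame with a reference count, and a
 * global counter standing in for the hypervisor's free-page heap. */
struct shared_frame {
    uint8_t data[PAGE_SIZE];
    unsigned int refcount;       /* how many guest pfns map this frame */
};

static uint64_t total_free_pages;

/* Copy-on-write "split" (unshare) on a guest write to a shared page:
 * a fresh page is taken from the heap and the contents copied, so the
 * host's free memory shrinks -- with no toolstack involvement at all. */
static uint8_t *split_shared_page(struct shared_frame *sf)
{
    uint8_t *priv;

    if (total_free_pages == 0)
        return NULL;                     /* can fail under memory pressure */

    priv = malloc(PAGE_SIZE);            /* stand-in for taking a page off the heap */
    if (!priv)
        return NULL;

    total_free_pages--;                  /* one more host page silently in use */
    memcpy(priv, sf->data, PAGE_SIZE);   /* give the writer its own private copy */
    sf->refcount--;                      /* one fewer mapping of the shared original */
    return priv;
}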
C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and toolstack for over three years. It depends on an in-guest-kernel adaptive technique to constantly adjust the domain memory footprint, as well as hooks in the in-guest kernel to move data to and from the hypervisor. While the data is in the hypervisor's care, interesting memory-load balancing between guests is done, including optional compression and deduplication. All of this has been in Xen since 2009 and has been awaiting changes in the (guest-side) Linux kernel. Those changes are now merged into the mainstream kernel and are fully functional in shipping distros.

While a complete description of tmem's guest<->hypervisor interaction is beyond the scope of this document, it is important to understand that any tmem-enabled guest kernel may unpredictably request thousands or even millions of pages directly via hypercalls from the hypervisor in a fraction of a second, with absolutely no interaction with the toolstack. Further, the guest-side hypercalls that allocate pages via the hypervisor are done in "atomic" code deep in the Linux mm subsystem. Indeed, if one truly understands tmem, it should become clear that tmem is fundamentally incompatible with toolstack-based capacity allocation. But let's stop discussing tmem for now and move on.

You have not discussed tmem pool thaw and freeze in this proposal.

Oooh, you know about it :-) Dan didn't want to go too verbose on people. It is a bit of a rathole -- and this hypercall would allow said freeze/thaw calls to be deprecated.

OK. So with existing code both in Xen and Linux guests, there are three challenges to toolstack-based capacity allocation. We'd really still like to do capacity allocation in the toolstack. Can something be done in the toolstack to "fix" these three cases? Possibly. But let's first look at hypervisor-based capacity allocation: the proposed "XENMEM_claim_pages" hypercall.

HYPERVISOR-BASED CAPACITY ALLOCATION

The posted patch for the claim hypercall is quite simple, but let's look at it in detail. The claim hypercall is actually a subop of an existing hypercall. After checking parameters for validity, a new function is called in the core Xen memory management code. This function takes the hypervisor heaplock, checks for a few special cases, does some arithmetic to ensure a valid claim, stakes the claim, releases the hypervisor heaplock, and then returns. To review from earlier, the hypervisor heaplock protects _all_ page/slab allocations, so we can be absolutely certain that there are no other page allocation races. This new function is about 35 lines of code, not counting comments.

The patch includes two other significant changes to the hypervisor. First, when any adjustment to a domain's memory footprint is made (either through a toolstack-aware hypercall or one of the three toolstack-unaware methods described above), the heaplock is taken, arithmetic is done, and the heaplock is released. This is 12 lines of code. Second, when any memory is allocated within Xen, a check must be made (with the heaplock already held) to determine whether, given a previous claim, the domain has exceeded its upper bound, maxmem. This code is a single conditional test. With some declarations, but not counting the copious comments, all told, the new code provided by the patch is well under 100 lines.

What about the toolstack side? First, it's important to note that the toolstack changes are entirely optional. If any toolstack wishes either to not fix the original problem, or to avoid toolstack-unaware allocation completely by ignoring the functionality provided by in-guest ballooning, page sharing, and/or tmem, that toolstack need not use the new hypercall.
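(For concreteness, here is one plausible shape of the claim logic described above. This is an illustration only, not the posted patch; the names, the plain counters standing in for Xen's heap accounting, and the assumption that the heaplock is held around every call are all invented for the example.)

#include <stdint.h>

/* Invented stand-ins for per-domain and global heap accounting.
 * In the real hypervisor all of this would sit under the heaplock. */
struct dom {
    uint64_t tot_pages;          /* pages currently allocated to the domain */
    uint64_t max_pages;          /* toolstack-imposed ceiling (maxmem) */
    uint64_t outstanding_pages;  /* claim staked but not yet allocated */
};

static uint64_t total_free_pages;   /* free pages in the heap */
static uint64_t total_outstanding;  /* sum of all domains' outstanding claims */

/* XENMEM_claim_pages-style subop: try to stake a claim so that domain d
 * can reach a total footprint of 'pages'.  Caller holds the heaplock. */
static int stake_claim(struct dom *d, uint64_t pages)
{
    uint64_t needed;

    /* Retract any previous claim by this domain first. */
    total_outstanding -= d->outstanding_pages;
    d->outstanding_pages = 0;

    if (pages > d->max_pages)
        return -1;                      /* claim may not exceed maxmem */
    if (pages <= d->tot_pages)
        return 0;                       /* already satisfied; nothing to stake */

    needed = pages - d->tot_pages;
    if (needed > total_free_pages - total_outstanding)
        return -1;                      /* not enough unclaimed free memory */

    d->outstanding_pages = needed;      /* stake the claim */
    total_outstanding += needed;
    return 0;
}

/* On the allocation path (heaplock still held): memory claimed by other
 * domains is treated as unavailable, so parallel domain builds cannot eat
 * each other's reservations.  A successful allocation for d would also
 * reduce d->outstanding_pages accordingly (omitted here). */
static int claim_allows_alloc(const struct dom *d, uint64_t pages)
{
    return pages <= total_free_pages - total_outstanding + d->outstanding_pages;
}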
You are ruling out any other possibility here. In particular, but not limited to, the use of max_pages.

The one max_pages check that comes to my mind is the one that xapi uses. That is, it has a daemon that sets the max_pages of all the guests at some value so that it can squeeze in as many guests as possible. It also balloons pages out of a guest to make space if it needs to launch one. The heuristic for how many pages, or the ratio of max/min, looks to be proportional (so to make space for 1GB for a guest, and say we have 10 guests, we will subtract 101MB from each guest -- the extra 1MB is for overhead). This depends on one hypercall that the 'xl' or 'xm' toolstacks do not use -- the one that sets max_pages. That code makes certain assumptions: that the guest will not go up/down in its ballooning once the toolstack has decreed how much memory the guest should use. It also assumes that the operations are semi-atomic -- and, to make them so as much as it can, it executes these operations serially.

No, the xapi code makes no such assumptions. After it tells a guest to balloon down, it watches to see what actually happens, and has heuristics to deal with "non-cooperative guests". It does assume that if it sets max_pages lower than or equal to the current amount of used memory, the hypervisor will not allow the guest to balloon up -- but that's a pretty safe assumption. A guest can balloon down if it wants to, but as xapi does not consider that memory free, it will never use it. BTW, I don't know if you realize this: originally Xen would return an error if you tried to set max_pages below tot_pages. But as a result of the DMC work, it was seen as useful to allow the toolstack to tell the hypervisor once, "Once the VM has ballooned down to X, don't let it balloon up above X anymore."

This goes back to the problem statement -- if we try to parallelize this, we run into the problem that the amount of memory we thought was free is not true anymore. The start of this email has a good description of some of the issues. In essence, max_pages does work -- _if_ one does these operations serially. We are trying to make this work in parallel and without any failures. One way to do that, which is quite simplistic, is the claim hypercall. It sets up a 'stake' on the amount of memory that the hypervisor should reserve. This way other guests' creation/ballooning does not infringe on the 'claimed' amount.

I'm not sure what you mean by "do these operations in serial" in this context. Each of your "reservation hypercalls" has to happen in serial. If we had a user-space daemon that was in charge of freeing up or reserving memory, each request to that daemon would happen in serial as well. But once the allocation / reservation happened, the domain builds could happen in parallel.

I believe with this hypercall xapi can be made to do its operations in parallel as well.

xapi can already boot guests in parallel when there's enough memory to do so -- what operations did you have in mind?

I haven't followed all of the discussion (for reasons mentioned above), but I think the alternative to Dan's solution is something like below. Maybe you can tell me why it's not suitable: Have one place in user space -- either in the toolstack, or a separate daemon -- that is responsible for knowing all the places where memory might be in use. Memory can be in use either by Xen, or by one of several VMs, or in a tmem pool. In your case, when not creating VMs, it can remove all limitations -- allow the guests or tmem to grow or shrink as much as they want.
When a request comes in for a certain amount of memory, it will go and set each VM's max_pages, and the max tmem pool size. It can then check whether there is enough free memory to complete the allocation or not (since there's a race between checking how much memory a guest is using and setting max_pages). If that succeeds, it can return "success". If, while that VM is being built, another request comes in, it can again go around and set the max sizes lower. It has to know how much of the memory is "reserved" for the first guest being built, but if there's enough left after that, it can return "success" and allow the second VM to start being built.

After the VMs are built, the toolstack can remove the limits again if it wants, again allowing the free flow of memory. Do you see any problems with this scheme? All it requires is for the toolstack to be able to temporarily set limits on both guests ballooning up and on tmem allocating more than a certain amount of memory. We already have mechanisms for the first, so if we had a "max_pages" for tmem, then you'd have all the tools you need to implement it. (A rough sketch of this scheme is appended below.)

This is the point at which Dan says something about giant multi-host deployments, which has absolutely no bearing on the issue -- the reservation happens at a host level, whether it's in userspace or the hypervisor. It's also where he goes on about how we're stuck in an old stodgy static world and he lives in a magical dynamic hippie world of peace and free love... er, free memory. Which is also not true -- in the scenario I describe above, tmem is actively being used, and guests can actively balloon down and up, while the VM builds are happening.

In Dan's proposal, tmem and guests are prevented from allocating "reserved" memory by some complicated scheme inside the allocator; in the above proposal, tmem and guests are prevented from allocating "reserved" memory by simple hypervisor-enforced max_pages settings. The end result looks the same to me.

 -George
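A rough sketch of the scheme above, assuming the existing per-VM max_pages control plus a hypothetical tmem cap; the helper functions are illustrative placeholders, not real libxc/libxl calls:

#include <stdint.h>

/* set_vm_max_pages() and set_tmem_max_pages() stand in for the existing
 * per-domain max_pages control and a (hypothetical) tmem pool cap;
 * host_free_pages() stands in for querying free host memory. */
extern void set_vm_max_pages(uint32_t domid, uint64_t max_pages);
extern void set_tmem_max_pages(uint64_t max_pages);
extern uint64_t host_free_pages(void);

struct vm {
    uint32_t domid;
    uint64_t cur_pages;     /* what the VM is using right now */
};

static uint64_t reserved_pages;   /* total earmarked for in-progress builds */

/* Requests to this function are serialized (e.g. one daemon thread);
 * the domain builds themselves can then proceed in parallel. */
static int reserve_for_new_vm(struct vm *vms, int nr_vms,
                              uint64_t need_pages, uint64_t tmem_cap_pages)
{
    int i;

    /* Clamp everyone at their current usage so nobody can grow while we
     * decide; the hypervisor enforces these limits for us. */
    for (i = 0; i < nr_vms; i++)
        set_vm_max_pages(vms[i].domid, vms[i].cur_pages);
    set_tmem_max_pages(tmem_cap_pages);

    /* Now check free memory, remembering what earlier builds reserved.
     * (The clamp above closes the check-then-set race.) */
    if (host_free_pages() < reserved_pages + need_pages)
        return -1;                 /* not enough; caller may squeeze/balloon */

    reserved_pages += need_pages;  /* earmarked for this build */
    return 0;
}

/* Once a build completes (or fails), release the earmark and, if no other
 * builds are in flight, lift the limits so guests and tmem can grow and
 * shrink freely again. */
static void build_finished(uint64_t need_pages)
{
    reserved_pages -= need_pages;
    /* ... restore the max_pages / tmem caps here if reserved_pages == 0 ... */
}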