Re: [Xen-devel] [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
> On 16/08/13 05:13, Yechen Li wrote:
> >
> > +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> > +
> > +This service is provided by the hypervisor (and wired, if necessary, all the
> > +way up to the proper toolstack layer or guest kernel), since it is only Xen
> > +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.

I think exposing any NUMA topology to a guest - regardless of whether it
is based on real NUMA or not - is OK, and actually a pretty neat thing.
Meaning you could tell a PV guest that it is running on a 16 socket NUMA
box while in reality it is running on a single socket box. Or vice versa.
It can serve as a way to increase (or decrease) performance, and also to
do resource capping (this PV guest will only get 1GB of really fast
memory and then 7GB of slow memory) and let the OS handle the details of
it (which it does nowadays).

The mapping though - of which PV pages should belong to which fake PV
NUMA node, and how they bind to the real NUMA topology - that part I am
not sure how to solve. More on this later.

> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

Correct. And that is OK - it just means that the performance can suck
horribly while it is there. Or the guest can be migrated to an even
better NUMA machine where it will perform even better. That is nothing
new, and it is no different whether the guest has PV NUMA or not.

> > +## Description of the problem ##

I think you have to back up a bit with the problem description. That is,
you need to think of:
 - how a PV guest will allocate pages at boot based on this,
 - how it will balloon up/down within those "buckets".

If you are using the guest's NUMA hints, they usually come in the form of
"allocate pages on this node", and the node information is of the type
"pfn X to pfn Y are on this NUMA node". That does not work very well with
ballooning, as ballooned pages can be scattered across various nodes. But
that is mostly because the balloon driver is not even trying to use the
NUMA APIs. It could use them, do the best it can, and perhaps balloon
round-robin across the NUMA pools. Or perhaps a better option would be to
use the hotplug memory mechanism (which is implemented in the balloon
driver) and do large swaths of memory.

But more problematic is migration. If you migrate a guest to a host that
has a different NUMA topology, what you really, really want is to:
 - unplug all of the memory in the guest,
 - replug the memory with the new NUMA topology.

Obviously this means you need some dynamic NUMA system - and I don't know
of one. The unplug/replug can be done via the balloon driver and/or the
hotplug memory system. But then the boundaries of the NUMA pools are set
at boot time, and you would want to change them. Is SRAT/SLIT dynamic?
Could it change during runtime?

Then there is the concept of AutoNUMA, where you would migrate pages from
one node to another. With a PV guest that would imply that the hypervisor
would poke the guest and say: "ok, time to alter your P2M table". Which I
guess right now is best done via the balloon driver - so what you really
want is a callback to tell the balloon driver: hey, balloon down and then
back up this PFN block on NUMA node X.
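To make that last bit a bit more concrete, here is a very rough sketch -
not against any real tree - of what a "balloon down N pages on virtual
node X" operation could look like. balloon_out_on_node() and the way the
node id gets passed in are invented for illustration; balloon_append() is
the existing list helper in drivers/xen/balloon.c, and the rest are plain
kernel NUMA/page-allocator APIs. The round-robin idea above would then
just be a loop over the online nodes calling this with a small batch:

/*
 * Hypothetical: balloon out (hand back to Xen) up to nr_pages pages that
 * the guest allocator takes from virtual NUMA node nid.  The actual
 * XENMEM_decrease_reservation call would still happen later, as today.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/nodemask.h>

static unsigned long balloon_out_on_node(int nid, unsigned long nr_pages)
{
        unsigned long done = 0;

        while (done < nr_pages) {
                /*
                 * __GFP_THISNODE pins the allocation to the requested
                 * node; if that node's pool is exhausted, stop instead
                 * of spilling over onto other nodes.
                 */
                struct page *page = alloc_pages_node(nid,
                                GFP_KERNEL | __GFP_THISNODE | __GFP_NOWARN,
                                0);
                if (!page)
                        break;

                balloon_append(page);   /* existing helper: queue the page */
                done++;
        }

        return done;
}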
Perhaps what could be done is to set up, across the cluster of hosts, the
worst case NUMA topology and force it on all the guests. Then, when
migrating, the "pools" can be filled/unfilled depending on which host the
guest is on - and whether it can fill up the NUMA pools properly. For
example, it migrates from a 1 node box to a 16 node box and all the
memory is remote. It will empty out the PV NUMA pool of the "closest"
memory to zero and fill up the PV NUMA pool of the "farthest" with all
the memory, to balance it out and have some real sense of the PV to
machine host memory.
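A toy, toolstack-side sketch of that pool filling/unfilling, just to show
the shape of it. Everything here is made up - NR_VNODES, MAX_PNODES,
vnode_to_pnode[], mem_on_pnode_kb[], balloon_set_vnode_target() - and the
stub just prints the per-vnode targets a real toolstack would push to the
guest somehow (xenstore, hypercall, whatever):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#define NR_VNODES  16       /* worst case forced on every guest */
#define MAX_PNODES 64

/* Stand-in for whatever would actually carry the target to the guest. */
static void balloon_set_vnode_target(int vnode, uint64_t target_kb)
{
    printf("vnode %2d -> target %" PRIu64 " kB\n", vnode, target_kb);
}

/*
 * Split each physical node's share of the guest's memory evenly among
 * the virtual nodes mapped onto it, so the guest's pools mirror where
 * its memory really lives on the current host.
 */
static void rebalance_vnode_pools(const uint64_t mem_on_pnode_kb[MAX_PNODES],
                                  const int vnode_to_pnode[NR_VNODES])
{
    int vnodes_on_pnode[MAX_PNODES] = { 0 };

    for (int v = 0; v < NR_VNODES; v++)
        vnodes_on_pnode[vnode_to_pnode[v]]++;

    for (int v = 0; v < NR_VNODES; v++) {
        int p = vnode_to_pnode[v];
        balloon_set_vnode_target(v, mem_on_pnode_kb[p] / vnodes_on_pnode[p]);
    }
}

int main(void)
{
    /*
     * Example: after migration the guest runs near pnode 0 of a two-node
     * host, but all 8GB of its memory ended up on (remote) pnode 1 - so
     * the "close" virtual pools get drained to zero and the "far" ones
     * soak up everything, as described above.
     */
    uint64_t mem_on_pnode_kb[MAX_PNODES] = { [1] = 8ULL * 1024 * 1024 };
    int vnode_to_pnode[NR_VNODES];

    for (int v = 0; v < NR_VNODES; v++)
        vnode_to_pnode[v] = (v < NR_VNODES / 2) ? 0 : 1;

    rebalance_vnode_pools(mem_on_pnode_kb, vnode_to_pnode);
    return 0;
}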
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel