
Re: [Xen-devel] [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation



On Wed, 2012-04-11 at 14:17 +0100, Dario Faggioli wrote:
> Add some rationale and usage documentation for the new automatic
> NUMA placement feature of xl.
> 
> TODO: * Decide whether we want to have things like "Future Steps/Roadmap"
>         and/or "Performances/Benchmarks Results" here as well.

I think these would be better in the list archives and on the wiki
respectively.

> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> 
> diff --git a/docs/misc/xl-numa-placement.txt b/docs/misc/xl-numa-placement.txt
> new file mode 100644
> --- /dev/null
> +++ b/docs/misc/xl-numa-placement.txt

It looks like you are using something approximating markdown syntax
here, so you might as well name this xl-numa-placement.markdown and get
a .html version etc almost for free.

> @@ -0,0 +1,205 @@
> +               -------------------------------------
> +               NUMA Guest Placement Design and Usage
> +               -------------------------------------
> +
> +Xen deals with Non-Uniform Memory Access (NUMA) machines in many ways. For
> +example each domain has its own "node affinity", i.e., a set of NUMA nodes
> +of the host from which memory for that domain is allocated. That becomes
> +very important as soon as many domains start running memory-intensive
> +workloads on a shared host. In fact, accessing non node-local memory
> +locations costs much more than node-local ones, to the point that the
> +degradation in performance is likely to be noticeable.
> +
> +It is then quite clear that any mechanism enabling most of the memory
> +accesses of most of the guest domains to stay local is very important to
> +have when dealing with NUMA platforms.
> +
> +
> +Node Affinity and CPU Affinity
> +------------------------------
> +
> +There is another very popular 'affinity', besides the node affinity we are
> +discussing here, namely '(v)cpu affinity'. To make things trickier, the two
> +are different but related concepts. In fact, in both the Xen and Linux
> +worlds, 'cpu affinity' is the set of CPUs a domain (that would be a task,
> +when talking about Linux) can be scheduled on.
> +This seems to have few to do with memory accesses, but it does, as the

                      ^little

> +CPU where a domain runs is also where it accesses its memory from, i.e.,
> +it is one half of what decides whether a memory access is remote or local
> +--- the other half being where the location it wants to access is stored.
> +
> +Of course, if a domain is known to only run on a subset of the physical
> +CPUs of the host, it is very easy to turn all its memory accesses into
> +local ones, by just constructing it's node affinity (in Xen) basing on

                                                                ^based

> +what nodes these CPUs belong to. Actually, that is exactly what the
> +hypervisor does by default, as soon as it finds out a domain (or better,
> +the vcpus of a domain, but let's not get into too much detail here) has a
> +cpu affinity.
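
Might be worth a tiny illustrative example at this point. E.g., assuming a
(hypothetical) host where CPUs 0-3 all belong to node 0, something like:

        cpus = "0-3"

in the config file should, if I'm reading the above correctly, make Xen
derive a node affinity of just node 0, so that all the domain's memory is
allocated there.
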
> +
> +This works quite well, but it requires the user/system administrator to
> +explicitly specify such a property --- the cpu affinity --- while the
> +domain is being created, or Xen won't be able to exploit it for ensuring
> +access locality.
> +
> +On the other hand, as node affinity directly affects where a domain's memory
> +lives, it makes a lot of sense for it to be involved in scheduling decisions:
> +it would be great if the hypervisor managed to schedule all the vcpus of all
> +the domains on CPUs attached to the various domains' local memory. That is
> +why the node affinity of a domain is treated by the scheduler as the set of
> +nodes on which it would be preferable to run it, although not at the cost of
> +violating the scheduling algorithm's behavior and invariants. This means Xen
> +will check whether a vcpu of a domain can run on one of the CPUs belonging
> +to the nodes of the domain's node affinity, but would rather run it somewhere
> +else --- even on another, remote, CPU --- than violate the priority ordering
> +(e.g., by kicking out another running vcpu with higher priority) it is
> +designed to enforce.
> +
> +So, last but not least, what if a domain has both vcpu and node affinity, and
> +they only partially match, or do not match at all (to understand how that can
> +happen, see the following sections)? Well, in such a case, all the domain's
> +memory will be allocated reflecting its node affinity, while scheduling will
> +happen according to its vcpu affinity, meaning that it is easy enough to
> +construct optimal, sub-optimal, neutral and even bad and awful configurations
> +(which is something nice, e.g., for benchmarking purposes). The remainder of
> +this document explains how to do so.
> +
> +
> +Specifying Node Affinity
> +------------------------
> +
> +Besides being automatically computed from the vcpu affinity of a domain
> +(or from its being part of a cpupool) within Xen, it might make sense for
> +the user to specify the node affinity of their domains by hand, while
> +editing their config files, as another form of partitioning the host
> +resources. If that is the case, this is where the "nodes" option of the xl
> +config file becomes useful. In fact, specifying something like the below
> +
> +        nodes = [ '0', '1', '3', '4' ]
> +
> +in a domain configuration file would result in Xen assigning host NUMA nodes
> +number 0, 1, 3 and 4 to the domain's node affinity, regardless of any vcpu
> +affinity setting for the same domain. The idea is, yes, the to things are

                                                               two

> +related, and if only one is present, it makes sense to use it for inferring
> +the other, but it is always possible to explicitly specify both of them,
> +independently of how good or awful the result could end up being.
> +
> +Therefore, this is what one should expect when using "nodes", perhaps in
> +conjunction with "cpus" in a domain configuration file:
> +
> + * `cpus = "0, 1"` and no `nodes=` at all
> +   (i.e., only vcpu affinity specified):
> +     domain's vcpus can and will run only on host CPUs 0 and 1. Also, as
> +     the domain's node affinity will be computed by Xen and set to whatever
> +     nodes host CPUs 0 and 1 belong to, all the domain's memory accesses
> +     will be local accesses;
> +
> + * `nodes = [ '0', '1' ]` and no `cpus=` at all
> +   (i.e., only node affinity present):
> +     domain's vcpus can run on any of the host CPUs, but the scheduler (at
> +     least if credit is used, as it is the only scheduler supporting this
> +     right now) will try running them on the CPUs that are part of host
> +     NUMA nodes 0 and 1. Memory-wise, all the domain's memory will be
> +     allocated on host NUMA nodes 0 and 1. This means most of the memory
> +     accesses of the domain should be local, but that will depend on the
> +     on-line load, behavior and actual scheduling of both the domain in
> +     question and all the other domains on the same host;
> +
> + * `nodes = [ '0', '1' ]` and `cpus = "0"`, with CPU 0 within node 0:
> +   (i.e., cpu affinity subset of node affinity):
> +     domain's vcpus can and will only run on host CPU 0. As node affinity
> +     is being explicitly set to host NUMA nodes 0 and 1 --- which includes
> +     CPU 0 --- all the memory accesses of the domain will be local;

In this case won't some of (half?) the memory come from node 1 and
therefore be non-local to cpu 0?

> +
> + * `nodes = [ '0', '1' ]` and `cpus = "0, 4", with CPU 0 in node 0 but
> +   CPU 4 in, say, node 2 (i.e., cpu affinity superset of node affinity):
> +     domain's vcpus can run on host CPUs 0 and 4, with CPU 4 not being within
> +     the node affinity (explicitly set to host NUMA nodes 0 and 1). The
> +     (credit) scheduler will try to keep memory accesses local by scheduling
> +     the domain's vcpus on CPU 0, but it may not achieve 100% success;
> +
> + * `nodes = [ '0', '1' ]` and `cpus = "4"`, with CPU 4 within, say, node 2

These examples might be a little clearer if you defined up front what
the nodes and cpus were and then used that for all of them?

> +   (i.e., cpu affinity disjoint from node affinity):
> +     domain's vcpus can and will run only on host CPU 4, i.e., completely
> +     "outside" of the chosen node affinity. That necessarily means all the
> +     domain's memory accesses will be remote.
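
It might also help to show at least one of these cases as a complete config
fragment. For instance, for the "nodes only" case above, something along the
lines of (name/memory/vcpus values made up purely for illustration):

        name   = "example"
        memory = 2048
        vcpus  = 2
        nodes  = [ '0', '1' ]

i.e., no "cpus=" at all, so the vcpus can run anywhere but the scheduler
prefers the CPUs of nodes 0 and 1, which is where all the memory lives.
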
> +
> +
> +Automatic NUMA Placement
> +------------------------
> +
> +In case one does not want to take on the burden of manually specifying all
> +the node (and, perhaps, CPU) affinities for all their domains, xl implements
> +some automatic placement logic. This basically means the user can ask the
> +toolstack to try sorting things out in the best possible way for them.
> +This is instead of specifying a domain's node affinity manually, and it can
> +be paired or not with any vcpu affinity (in case it is, the relationship
> +between vcpu and node affinities stays as stated above). To serve this
> +purpose, a new domain config switch has been introduced, i.e., the
> +"nodes_policy" option. As the name suggests, it allows for specifying a
> +policy to be used while attempting automatic placement of the new domain.
> +Available policies at the time of writing are:

A bunch of what follows would be good to have in the xl or xl.cfg man
pages too/instead. (I started with this docs patch so I haven't actually
looked at the earlier ones yet, perhaps this is already the case)

> +
> + * "auto": automatic placement by means of a not better specified (xl
> +           implementation dependant) algorithm. It is basically for those
> +           who do want automatic placement, but have no idea what policy
> +           or algorithm would be better... <<Just give me a sane default!>>
> +
> + * "ffit": automatic placement via the First Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "bfit": automatic placement via the Best Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "wfit": automatic placement via the Worst Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
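
Perhaps stick a one-line example right after this list, e.g.:

        nodes_policy = "ffit"

just to make it clear that this is a domain config file option rather than
an xl command line switch.
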
> +
> +The various algorithms have been implemented as they offer different behavior
> +and performance (for different performance metrics). For instance, First Fit
> +is known to be efficient and quick, and it generally works better than Best
> +Fit wrt memory fragmentation, although it tends to occupy "early" nodes more
> +than "late" ones. On the other hand, Best Fit aims at optimizing memory usage,
> +although it introduces quite a bit of fragmentation, by leaving large amounts
> +of small free memory areas. Finally, the idea behind Worst Fit is that it will
> +leave big enough free memory chunks to limit the amount of fragmentation, but
> +it (as well as Best Fit) is more expensive in terms of execution time, as it
> +needs the "list" of free memory areas to be kept sorted.
> +
> +Therefore, achieving automatic placement actually happens by properly using
> +the "nodes" and "nodes_policy" configuration options as follows:
> +
> + * `nodes="auto` or `nodes_policy="auto"`:
> +     xl will try fitting the domain on the host NUMA nodes by using its
> +     own default placing algorithm, with default parameters. Most likely,
> +     all nodes will be considered suitable for the domain (unless a vcpu
> +     affinity is specified, see the last entry of this list;
> +
> + * `nodes_policy="ffit"` (or `"bfit"`, `"wfit"`) and no `nodes=` at all:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy. All nodes will be considered suitable for the
> +     domain, and consecutive fitting attempts will be performed while
> +     increasing the number of nodes on which to put the domain itself
> +     (unless a vcpu affinity is specified, see the last entry of this list);
> +
> + * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `nodes=2`:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy and only the number of nodes specified in `nodes=`
> +     (2 in this example).

Number of nodes rather than specifically node 2? This is different to
the examples in the preceding section?

> +     All the nodes will be considered suitable for
> +     the domain, and consecutive attempts will be performed while
> +     increasing such a value;
> +
> + * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `cpus="0-6":
> +     xl will try fitting the domain on the host NUMA nodes to which the CPUs
> +     specified as vcpu affinity (0 to 6 in this example) belong, by using the
> +     requested policy. In case it fails, consecutive fitting attempts will
> +     be performed with both a reduced (first) and an increased (next) number
> +     of nodes).
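
Again, a short combined example might help here, e.g. (assuming host CPUs
0-6 span NUMA nodes 0 and 1):

        cpus         = "0-6"
        nodes_policy = "ffit"

which, as described above, should make xl try to fit the domain on the
nodes CPUs 0-6 belong to, using First Fit.
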
> +
> +Different usage patterns --- like specifying both a policy and a list of
> +nodes --- are accepted, but do not make much sense after all. Therefore,
> +although xl will try its best to interpret the user's will, the resulting
> +behavior is somewhat unspecified.


