Xen project Mailing List

[Xen-changelog] [xen master] docs: rearrange and update NUMA placement documentation

Date: Wed, 17 Apr 2013 23:56:50 +0000

Delivery-date: Wed, 17 Apr 2013 23:57:00 +0000

List-id: "Change log for Mercurial \(receive only\)" <xen-changelog.lists.xen.org>

commit b3fc812715e35b730bdd1fcf7a942fa0380adc64
Author:     Dario Faggioli <dario.faggioli@xxxxxxxxxx>
AuthorDate: Wed Apr 17 10:57:38 2013 +0000
Commit:     Ian Campbell <ian.campbell@xxxxxxxxxx>
CommitDate: Wed Apr 17 12:11:15 2013 +0100

    docs: rearrange and update NUMA placement documentation

    To include the new concept of NUMA aware scheduling and
    describe its impact.

    Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
    Acked-by: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
    Acked-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
---
 docs/misc/xl-numa-placement.markdown |   94 ++++++++++++++++++++++++++++++----
 1 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
index 802bcfe..caa3fec 100644
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -14,22 +14,67 @@ the memory directly attached to the set of cores.

 The Xen hypervisor deals with NUMA machines by assigning to each domain
 a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated.
+get their memory allocated. Also, even if the node affinity of a domain
+is allowed to change on-line, it is very important to "place" the domain
+correctly when it is fist created, as the most of its memory is allocated
+at that time and can not (for now) be moved easily.

 NUMA awareness becomes very important as soon as many domains start
 running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
 performance degradation is likely to be noticeable.

-## Guest Placement in xl ##
+For more information, have a look at the [Xen NUMA Introduction][numa_intro]
+page on the Wiki.
+
+### Placing via pinning and cpupools ###
+
+The simplest way of placing a domain on a NUMA node is statically pinning
+the domain's vCPUs to the pCPUs of the node. This goes under the name of
+CPU affinity and can be set through the "cpus=" option in the config file
+(more about this below). Another option is to pool together the pCPUs
+spanning the node and put the domain in such a cpupool with the "pool="
+config option (as documented in our [Wiki][cpupools_howto]).
+
+In both the above cases, the domain will not be able to execute outside
+the specified set of pCPUs for any reasons, even if all those pCPUs are
+busy doing something else while there are others, idle, pCPUs.
+
+So, when doing this, local memory accesses are 100% guaranteed, but that
+may come at he cost of some load imbalances.
+
+### NUMA aware scheduling ###
+
+If the credit scheduler is in use, the concept of node affinity defined
+above does not only apply to memory. In fact, starting from Xen 4.3, the
+scheduler always tries to run the domain's vCPUs on one of the nodes in
+its node affinity. Only if that turns out to be impossible, it will just
+pick any free pCPU.
+
+This is, therefore, something more flexible than CPU affinity, as a domain
+can still run everywhere, it just prefers some nodes rather than others.
+Locality of access is less guaranteed than in the pinning case, but that
+comes along with better chances to exploit all the host resources (e.g.,
+the pCPUs).
+
+In fact, if all the pCPUs in a domain's node affinity are busy, it is
+possible for the domain to run outside of there, but it is very likely that
+slower execution (due to remote memory accesses) is still better than no
+execution at all, as it would happen with pinning. For this reason, NUMA
+aware scheduling has the potential of bringing substantial performances
+benefits, although this will depend on the workload.
+
+## Guest placement in xl ##

 If using xl for creating and managing guests, it is very easy to ask for
 both manual or automatic placement of them across the host's NUMA nodes.

-Note that xm/xend does the very same thing, the only differences residing
-in the details of the heuristics adopted for the placement (see below).
+Note that xm/xend does a very similar thing, the only differences being
+the details of the heuristics adopted for automatic placement (see below),
+and the lack of support (in both xm/xend and the Xen versions where that\
+was the default toolstack) for NUMA aware scheduling.

-### Manual Guest Placement with xl ###
+### Placing the guest manually ###

 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
@@ -41,14 +86,19 @@ This is very simple and effective, but requires the user/system
 administrator to explicitly specify affinities for each and every domain,
 or Xen won't be able to guarantee the locality for their memory accesses.

-It is also possible to deal with NUMA by partitioning the system using
-cpupools. Again, this could be "The Right Answer" for many needs and
-occasions, but has to be carefully considered and setup by hand.
+Notice that this also pins the domain's vCPUs to the specified set of
+pCPUs, so it not only sets the domain's node affinity (its memory will
+come from the nodes to which the pCPUs belong), but at the same time
+forces the vCPUs of the domain to be scheduled on those same pCPUs.

-### Automatic Guest Placement with xl ###
+### Placing the guest automatically ###

 If no "cpus=" option is specified in the config file, libxl tries
 to figure out on its own on which node(s) the domain could fit best.
+If it finds one (some), the domain's node affinity get set to there,
+and both memory allocations and NUMA aware scheduling (for the credit
+scheduler and starting from Xen 4.3) will comply with it.
+
 It is worthwhile noting that optimally fitting a set of VMs on the NUMA
 nodes of an host is an incarnation of the Bin Packing Problem. In fact,
 the various VMs with different memory sizes are the items to be packed,
@@ -81,7 +131,7 @@ largest amounts of free memory helps keeping the memory fragmentation
 small, and maximizes the probability of being able to put more domains
 there.

-## Guest Placement within libxl ##
+## Guest placement in libxl ##

 xl achieves automatic NUMA placement because that is what libxl does
 by default. No API is provided (yet) for modifying the behaviour of
@@ -93,15 +143,34 @@ any placement from happening:
     libxl_defbool_set(&domain_build_info->numa_placement, false);

 Also, if `numa_placement` is set to `true`, the domain must not
-have any cpu affinity (i.e., `domain_build_info->cpumap` must
+have any CPU affinity (i.e., `domain_build_info->cpumap` must
 have all its bits set, as it is by default), or domain creation
 will fail returning `ERROR_INVAL`.

+Starting from Xen 4.3, in case automatic placement happens (and is
+successful), it will affect the domain's node affinity and _not_ its
+CPU affinity. Namely, the domain's vCPUs will not be pinned to any
+pCPU on the host, but the memory from the domain will come from the
+selected node(s) and the NUMA aware scheduling (if the credit scheduler
+is in use) will try to keep the domain there as much as possible.
+
 Besides than that, looking and/or tweaking the placement algorithm
 search "Automatic NUMA placement" in libxl\_internal.h. Note this may
 change in future versions of Xen/libxl.

+## Xen < 4.3 ##
+
+As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
+bit different for earlier version of Xen. If no "cpus=" option is specified
+and Xen 4.2 is in use, the automatic placement algorithm still runs, but
+the results is used to _pin_ the vCPUs of the domain to the output node(s).
+This is consistent with what was happening with xm/xend, which were also
+affecting the domain's CPU affinity.
+
+On a version of Xen earlier than 4.2, there is not automatic placement at
+all in xl or libxl, and hence no node or CPU affinity being affected.
+
 ## Limitations ##

 Analyzing various possible placement solutions is what makes the
@@ -109,3 +178,6 @@ algorithm flexible and quite effective. However, that also means it
 won't scale well to systems with arbitrary number of nodes. For this
 reason, automatic placement is disabled (with a warning) if it is
 requested on a host with more than 16 NUMA nodes.
+
+[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
+[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto
--
generated by git-patchbot for /home/xen/git/xen.git#master

_______________________________________________
Xen-changelog mailing list
Xen-changelog@xxxxxxxxxxxxx
http://lists.xensource.com/xen-changelog
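
Reader's note, not part of the patch: the difference between pinning and node affinity described in the "NUMA aware scheduling" section can be sketched roughly as below. This is only an illustrative model under stated assumptions. The function name `pick_pcpu`, the representation of pCPUs as integers in sets, and the lowest-number tie-break are all hypothetical; the real credit scheduler is considerably more involved.

```python
def pick_pcpu(idle_pcpus, node_affinity_pcpus, hard_affinity_pcpus=None):
    """Pick a pCPU for a vCPU, or return None if no allowed pCPU is idle.

    idle_pcpus:          pCPUs currently free on the host (hypothetical model)
    node_affinity_pcpus: pCPUs belonging to the domain's node affinity,
                         treated as a preference only
    hard_affinity_pcpus: pCPUs the vCPU is pinned to (e.g. via "cpus="),
                         treated as a hard limit; None means "not pinned"
    """
    allowed = set(idle_pcpus)
    if hard_affinity_pcpus is not None:
        # Pinning: the vCPU can never run outside this set.
        allowed &= set(hard_affinity_pcpus)

    # Node affinity: try the preferred nodes first.
    preferred = allowed & set(node_affinity_pcpus)
    if preferred:
        return min(preferred)  # tie-break is an assumption of this sketch

    # Nothing idle in the preferred nodes: fall back to any allowed pCPU.
    if allowed:
        return min(allowed)
    return None


if __name__ == "__main__":
    node0 = {0, 1, 2, 3}
    # Only pCPUs on another node are idle.
    print(pick_pcpu({5, 6}, node0))                             # node affinity only
    print(pick_pcpu({5, 6}, node0, hard_affinity_pcpus=node0))  # pinned to node 0
```

In this model a domain with node affinity alone still gets a (remote) pCPU when its preferred node is busy, while a pinned domain gets nothing, which is the trade-off between locality and load balance that the documentation text describes.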
