Before this change, each vcpu had its own vcpu-affinity
(in v->cpu_affinity), representing the set of pcpus where
the vcpu is allowed to run. Since NUMA-aware scheduling was
introduced, the scheduler (credit1 only, for now) also tries
as hard as it can to run all the vcpus of a domain on one of
the nodes that constitute the domain's node-affinity.
The idea here is to make the mechanism more general by:
   * allowing this 'preference' for some pcpus/nodes to be
     expressed on a per-vcpu basis, rather than for the domain
     as a whole. That is to say, each vcpu should have its own
     set of preferred pcpus/nodes, rather than the set being the
     very same for all the vcpus of the domain;
   * generalizing the idea of 'preferred pcpus' beyond NUMA
     awareness and support. That is to say, regardless of
     whether it is (mostly) useful on NUMA systems, it should
     be possible to specify, for each vcpu, a set of pcpus where
     it prefers to run (in addition to, and possibly unrelated to,
     the set of pcpus where it is allowed to run).
We will call this set of *preferred* pcpus the vcpu's soft
affinity. This change introduces it and starts using it for
scheduling, replacing the indirect use of the domain's NUMA
node-affinity. This is more general, as soft affinity does not
have to be related to NUMA. Nevertheless, it makes it possible to
achieve the same results as NUMA-aware scheduling, just by making
soft affinity equal to the domain's node affinity for all the
vCPUs (e.g., from the toolstack).
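To illustrate the relationship between the two affinities, here is a
minimal sketch of a two-step pcpu selection: try the pcpus that are both
preferred and allowed first, then fall back to any allowed pcpu. This is
not the credit1 code from this patch. The names cpuset_t, first_cpu() and
pick_cpu() are hypothetical, and a plain 64-bit bitmask stands in for
cpumask_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit i set means pcpu i is in the set. */
typedef uint64_t cpuset_t;

/* Return the lowest-numbered pcpu in 'set', or -1 if the set is empty. */
static int first_cpu(cpuset_t set)
{
    for (int cpu = 0; cpu < 64; cpu++)
        if (set & ((cpuset_t)1 << cpu))
            return cpu;
    return -1;
}

/*
 * Two-step pick (illustrative only):
 *   1. look for an idle pcpu that is both preferred (soft) and allowed (hard);
 *   2. if there is none, fall back to any idle pcpu that is allowed (hard).
 * Hard affinity is never violated; soft affinity is only a preference.
 * Returns -1 if no allowed pcpu is idle.
 */
static int pick_cpu(cpuset_t hard, cpuset_t soft, cpuset_t idle)
{
    assert(hard != 0); /* a vcpu must be allowed to run somewhere */

    int cpu = first_cpu(hard & soft & idle);
    if (cpu >= 0)
        return cpu;
    return first_cpu(hard & idle);
}
```

Note that a soft affinity lying entirely outside the hard affinity
simply has no effect in this sketch, since step 1 finds nothing and
step 2 takes over.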
This also means renaming most of the NUMA-aware scheduling related
functions, in credit1, to something more generic, hinting toward
the concept of soft affinity rather than directly to NUMA awareness.
As a side effect, this simplifies the code quite a bit. In fact,
prior to this change, we needed to cache the translation of
d->node_affinity (which is a nodemask_t) to a cpumask_t, since that
is what scheduling decisions require (we used to keep it in
node_affinity_cpumask). This, and all the complicated logic
required to keep it updated, is not necessary any longer.
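For reference, the kind of nodemask-to-cpumask translation mentioned
above can be sketched as follows. This is only an illustration under
stated assumptions, not the code being removed: cpuset_t, nodeset_t and
nodes_to_cpus() are hypothetical names, plain integer bitmasks stand in
for cpumask_t and nodemask_t, and the per-node pcpu table is assumed to
be supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpuset_t;   /* hypothetical stand-in for cpumask_t  */
typedef uint32_t nodeset_t;  /* hypothetical stand-in for nodemask_t */

/*
 * Translate a set of nodes into the set of pcpus belonging to them.
 * node_cpus[n] is assumed to hold the pcpus of node n.
 */
static cpuset_t nodes_to_cpus(nodeset_t nodes, const cpuset_t *node_cpus,
                              size_t nr_nodes)
{
    cpuset_t cpus = 0;

    assert(nr_nodes <= 32); /* nodeset_t has room for 32 nodes only */

    for (size_t n = 0; n < nr_nodes; n++)
        if (nodes & ((nodeset_t)1 << n))
            cpus |= node_cpus[n];

    return cpus;
}
```

With soft affinity stored directly as a set of pcpus per vcpu, a
toolstack could perform this kind of translation once, when setting the
affinity, and the scheduler would not need to keep a cached copy in
sync.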
The high level description of NUMA placement and scheduling in
docs/misc/xl-numa-placement.markdown is being updated too, to match
the new architecture.
Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
Reviewed-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
---
Changes from v2:
  * this patch folds patches 6 ("xen: sched: make space for
    cpu_soft_affinity") and 10 ("xen: sched: use soft-affinity
    instead of domain's node-affinity"), as suggested during
    review. 'Reviewed-by' from George is there since both patch
    6 and 10 had it, and I did nothing other than squash
    them.
Changes from v1:
  * in v1, "7/12 xen: numa-sched: use per-vcpu node-affinity for
    actual scheduling" was doing something very similar to this
    patch.
---
  docs/misc/xl-numa-placement.markdown |  148 ++++++++++++++++++++------------
  xen/common/domain.c                  |    5 +-
  xen/common/keyhandler.c              |    2 +
  xen/common/sched_credit.c            |  153 +++++++++++++---------------------
  xen/common/schedule.c                |    3 +
  xen/include/xen/sched.h              |    3 +
  6 files changed, 168 insertions(+), 146 deletions(-)
diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
index caa3fec..b1ed361 100644
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually
  defined as a set of processor cores (typically a physical CPU package) and
  the memory directly attached to the set of cores.
  
-The Xen hypervisor deals with NUMA machines by assigning to each domain
-a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated. Also, even if the node affinity of a domain
-is allowed to change on-line, it is very important to "place" the domain
-correctly when it is fist created, as the most of its memory is allocated
-at that time and can not (for now) be moved easily.
-
  NUMA awareness becomes very important as soon as many domains start
  running memory-intensive workloads on a shared host. In fact, the cost
  of accessing non node-local memory locations is very high, and the
@@ -27,14 +20,37 @@ performance degradation is likely to be noticeable.
  For more information, have a look at the [Xen NUMA Introduction][numa_intro]
  page on the Wiki.
  
+## Xen and NUMA machines: the concept of _node-affinity_ ##
+
+The Xen hypervisor deals with NUMA machines through the concept of
+_node-affinity_. The node-affinity of a domain is the set of NUMA nodes
+of the host where the memory for the domain is being allocated (mostly,
+at domain creation time). This is, at least in principle, different from
+and unrelated to the vCPU (hard and soft, see below) scheduling affinity,
+which instead is the set of pCPUs where the vCPU is allowed (or prefers)
+to run.
+
+Of course, despite the fact that they belong to and affect different
+subsystems, the domain node-affinity and the vCPUs affinity are not
+completely independent.
+In fact, if the domain node-affinity is not explicitly specified by the
+user, via the proper libxl calls or xl config item, it will be computed
+based on the vCPUs' scheduling affinity.
+
+Notice that, even if the node affinity of a domain may change on-line,
+it is very important to "place" the domain correctly when it is first
+created, as most of its memory is allocated at that time and cannot
+(for now) be moved easily.
+
  ### Placing via pinning and cpupools ###
  
-The simplest way of placing a domain on a NUMA node is statically pinning
-the domain's vCPUs to the pCPUs of the node. This goes under the name of
-CPU affinity and can be set through the "cpus=" option in the config file
-(more about this below). Another option is to pool together the pCPUs
-spanning the node and put the domain in such a cpupool with the "pool="
-config option (as documented in our [Wiki][cpupools_howto]).
+The simplest way of placing a domain on a NUMA node is setting the hard
+scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
+also goes under the name of vCPU pinning, and can be done through the
+"cpus=" option in the config file (more about this below). Another option
+is to pool together the pCPUs spanning the node and put the domain in
+such a _cpupool_ with the "pool=" config option (as documented in our
+[Wiki][cpupools_howto]).
  
  In both the above cases, the domain will not be able to execute outside
  the specified set of pCPUs for any reasons, even if all those pCPUs are
@@ -45,24 +61,45 @@ may come at he cost of some load imbalances.
  
  ### NUMA aware scheduling ###
  
-If the credit scheduler is in use, the concept of node affinity defined
-above does not only apply to memory. In fact, starting from Xen 4.3, the
-scheduler always tries to run the domain's vCPUs on one of the nodes in
-its node affinity. Only if that turns out to be impossible, it will just
-pick any free pCPU.
-
-This is, therefore, something more flexible than CPU affinity, as a domain
-can still run everywhere, it just prefers some nodes rather than others.
-Locality of access is less guaranteed than in the pinning case, but that
-comes along with better chances to exploit all the host resources (e.g.,
-the pCPUs).
-
-In fact, if all the pCPUs in a domain's node affinity are busy, it is
-possible for the domain to run outside of there, but it is very likely that
-slower execution (due to remote memory accesses) is still better than no
-execution at all, as it would happen with pinning. For this reason, NUMA
-aware scheduling has the potential of bringing substantial performances
-benefits, although this will depend on the workload.
+If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
+itself always tries to run the domain's vCPUs on one of the nodes in
+its node-affinity. Only if that turns out to be impossible, it will just
+pick any free pCPU. Locality of access is less guaranteed than in the
+pinning case, but that comes along with better chances to exploit all
+the host resources (e.g., the pCPUs).
+
+Starting from Xen 4.4, credit1 supports two forms of affinity: hard and