
[Xen-devel] [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling



instead of relying on the domain-wide node-affinity. To achieve
that, we use the specific vcpu's node-affinity when computing
the NUMA load balancing mask in csched_balance_cpumask().
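
To give an idea, the NUMA balancing step now boils down to something
like the following (a simplified sketch of the csched_balance_cpumask()
hunk below, with comments added for illustration):

    /* Sketch: NUMA balancing step, now based on the vcpu's own affinity */
    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
    {
        /* Prefer the pcpus that are both in the vcpu's node-affinity
         * and in its vcpu-affinity... */
        cpumask_and(mask, vc->node_affinity, vc->cpu_affinity);

        /* ...but never leave the mask empty: if the intersection is
         * empty, fall back to plain vcpu-affinity. */
        if ( unlikely(cpumask_empty(mask)) )
            cpumask_copy(mask, vc->cpu_affinity);
    }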

As a side effect, this simplifies the code quite a bit. In
fact, prior to this change, we needed to cache the translation
of d->node_affinity (which is a nodemask_t) to a cpumask_t,
since that is what is needed during the actual scheduling
(we used to keep it in node_affinity_cpumask).

Since the per-vcpu node-affinity is already maintained in a cpumask_t
field (v->node_affinity), we no longer need that complicated updating
logic, and this change can therefore remove sched_set_node_affinity(),
csched_set_node_affinity() and, of course, node_affinity_cpumask
from csched_dom.
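
For instance, the check deciding whether the NUMA balancing step is
worth doing at all now just looks at the vcpu directly. A sketch of how
it looks after this patch, with illustrative comments:

    /* Sketch: is the NUMA balancing step worth doing for this vcpu? */
    static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
                                               const cpumask_t *mask)
    {
        if ( vc->auto_node_affinity == 1                /* auto-computed,       */
             || cpumask_full(vc->node_affinity)         /* or spans everything, */
             || !cpumask_intersects(vc->node_affinity, mask) ) /* or useless    */
            return 0;                                   /* ==> skip the step    */

        return 1;
    }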

The high level description of NUMA placement and scheduling in
docs/misc/xl-numa-placement.markdown is updated too, to match
the new behavior. While at it, an attempt is made to make the
document as unambiguous as possible with respect to the
concepts of vCPU pinning, domain node-affinity and vCPU
node-affinity.
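
For reference, and purely as an illustration of the difference between
pinning and node-affinity described there, a guest config fragment could
look like the following (CPU numbers and the pool name are made up):

    # Manual placement: pin the domain's vCPUs to the pCPUs of one node.
    # The domain's node-affinity (and hence where its memory is allocated
    # from) is then derived from this pinning.
    cpus = "0-3"

    # Alternatively, confine the domain to a cpupool spanning that node
    # (the pool must exist already; the name here is made up):
    #pool = "node0-pool"

    # With neither of the two, automatic placement kicks in, and the
    # vCPUs get a node-affinity (but no pinning) out of its outcome.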

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
---
 docs/misc/xl-numa-placement.markdown |  124 +++++++++++++++++++++++-----------
 xen/common/domain.c                  |    2 -
 xen/common/sched_credit.c            |   62 ++---------------
 xen/common/schedule.c                |    5 -
 xen/include/xen/sched-if.h           |    2 -
 xen/include/xen/sched.h              |    1 
 6 files changed, 94 insertions(+), 102 deletions(-)

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
index caa3fec..890b856 100644
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually
 defined as a set of processor cores (typically a physical CPU package) and
 the memory directly attached to the set of cores.
 
-The Xen hypervisor deals with NUMA machines by assigning to each domain
-a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated. Also, even if the node affinity of a domain
-is allowed to change on-line, it is very important to "place" the domain
-correctly when it is fist created, as the most of its memory is allocated
-at that time and can not (for now) be moved easily.
-
 NUMA awareness becomes very important as soon as many domains start
 running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
@@ -27,13 +20,37 @@ performance degradation is likely to be noticeable.
 For more information, have a look at the [Xen NUMA Introduction][numa_intro]
 page on the Wiki.
 
+## Xen and NUMA machines: the concept of _node-affinity_ ##
+
+The Xen hypervisor deals with NUMA machines through the concept of
+_node-affinity_. When talking about node-affinity, it is important to
+distinguish two different situations. The node-affinity *of a domain* is
+the set of NUMA nodes of the host where the memory for the domain is
+being allocated (mostly, at domain creation time). The node-affinity *of
+a vCPU* is the set of NUMA nodes of the host where the vCPU prefers to
+run, and the (credit1 only, for now) scheduler will try to accomplish
+that, whenever it is possible.
+
+Of course, despite the fact that they belong to and affect different
+subsystems, domain and vCPU node-affinities are related. In fact, the
+node-affinity of a domain is the union of the node-affinities of all the
+domain's vCPUs.
+
+The above means that, when changing the vCPU node-affinity, the domain
+node-affinity also changes. Although this is allowed to happen
+on-line (i.e., when a domain is already running), that will not result
+in the memory that has already been allocated being moved to a different
+host NUMA node. This is why it is very important to "place" the domain
+correctly when it is first created, as most of its memory is allocated
+at that time and cannot (for now) be moved easily.
+
 ### Placing via pinning and cpupools ###
 
 The simplest way of placing a domain on a NUMA node is statically pinning
-the domain's vCPUs to the pCPUs of the node. This goes under the name of
-CPU affinity and can be set through the "cpus=" option in the config file
-(more about this below). Another option is to pool together the pCPUs
-spanning the node and put the domain in such a cpupool with the "pool="
+the domain's vCPUs to the pCPUs of the node. This also goes under the name
+of vCPU-affinity and can be set through the "cpus=" option in the config
+file (more about this below). Another option is to pool together the pCPUs
+spanning the node and put the domain in such a _cpupool_ with the "pool="
 config option (as documented in our [Wiki][cpupools_howto]).
 
 In both the above cases, the domain will not be able to execute outside
@@ -45,24 +62,45 @@ may come at he cost of some load imbalances.
 
 ### NUMA aware scheduling ###
 
-If the credit scheduler is in use, the concept of node affinity defined
-above does not only apply to memory. In fact, starting from Xen 4.3, the
-scheduler always tries to run the domain's vCPUs on one of the nodes in
-its node affinity. Only if that turns out to be impossible, it will just
-pick any free pCPU.
+If using the credit scheduler, and starting from Xen 4.3, the scheduler always
+tries to run the domain's vCPUs on one of the nodes in their node-affinity.
+Only if that turns out to be impossible will it just pick any free pCPU.
+Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity,
+potentially different from those of all the other vCPUs of the domain.
 
-This is, therefore, something more flexible than CPU affinity, as a domain
-can still run everywhere, it just prefers some nodes rather than others.
+This is, therefore, something more flexible than vCPU pinning, as vCPUs
+can still run everywhere; they just prefer some nodes rather than others.
 Locality of access is less guaranteed than in the pinning case, but that
 comes along with better chances to exploit all the host resources (e.g.,
 the pCPUs).
 
-In fact, if all the pCPUs in a domain's node affinity are busy, it is
-possible for the domain to run outside of there, but it is very likely that
-slower execution (due to remote memory accesses) is still better than no
-execution at all, as it would happen with pinning. For this reason, NUMA
-aware scheduling has the potential of bringing substantial performances
-benefits, although this will depend on the workload.
+In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is
+possible for the vCPU to run outside of it. The idea is that
+slower execution (due to remote memory accesses) is still better than
+no execution at all (as would happen with pinning). For this reason,
+NUMA aware scheduling has the potential of bringing substantial
+performance benefits, although this will depend on the workload.
+
+Notice that, for each vCPU, the following three scenarios are possible:
+
+  * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU
+    node-affinity. In this case, the vCPU is always scheduled on one
+    of the pCPUs to which it is pinned, without any specific preference
+    among them. Internally, the vCPU's node-affinity is just
+    automatically computed from the vCPU pinning, and the scheduler
+    just ignores it;
+  * a vCPU *has* its own vCPU node-affinity and *is not* pinned to
+    any particular pCPU. In this case, the vCPU can run on every pCPU.
+    Nevertheless, the scheduler will try to have it running on one of
+    the pCPUs of the node(s) it has node-affinity with;
+  * a vCPU *has* its own vCPU node-affinity and *is also* pinned to
+    some pCPUs. In this case, the vCPU is always scheduled on one of the
+    pCPUs to which it is pinned, with, among them, a preference for the
+    ones that are from the node(s) it has node-affinity with. In case
+    pinning and node-affinity form two disjoint sets of pCPUs, pinning
+    "wins", and the node-affinity, although it is still used to derive
+    the domain's node-affinity (for memory allocation), is, from the
+    scheduler's perspective, just ignored.
 
 ## Guest placement in xl ##
 
@@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes.
 
 Note that xm/xend does a very similar thing, the only differences being
 the details of the heuristics adopted for automatic placement (see below),
-and the lack of support (in both xm/xend and the Xen versions where that\
+and the lack of support (in both xm/xend and the Xen versions where that
 was the default toolstack) for NUMA aware scheduling.
 
 ### Placing the guest manually ###
 
 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
-affects NUMA placement and memory accesses as the hypervisor constructs
-the node affinity of a VM basing right on its CPU affinity when it is
-created.
+affects NUMA placement and memory accesses as, in this case, the hypervisor
+constructs the node-affinity of a VM directly from its vCPU pinning
+when it is created.
 
 This is very simple and effective, but requires the user/system
-administrator to explicitly specify affinities for each and every domain,
+administrator to explicitly specify the pinning of each and every domain,
 or Xen won't be able to guarantee the locality for their memory accesses.
 
-Notice that this also pins the domain's vCPUs to the specified set of
-pCPUs, so it not only sets the domain's node affinity (its memory will
-come from the nodes to which the pCPUs belong), but at the same time
-forces the vCPUs of the domain to be scheduled on those same pCPUs.
+That, of course, also means the vCPUs of the domain will only be able to
+execute on those same pCPUs.
 
 ### Placing the guest automatically ###
 
@@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries
 to figure out on its own on which node(s) the domain could fit best.
 If it finds one (some), the domain's node affinity get set to there,
 and both memory allocations and NUMA aware scheduling (for the credit
-scheduler and starting from Xen 4.3) will comply with it.
+scheduler and starting from Xen 4.3) will comply with it. Starting from
+Xen 4.4, this just means all the vCPUs of the domain will have the same
+vCPU node-affinity, which is the outcome of such an automatic "fitting"
+procedure.
 
 It is worthwhile noting that optimally fitting a set of VMs on the NUMA
 nodes of an host is an incarnation of the Bin Packing Problem. In fact,
@@ -143,22 +182,29 @@ any placement from happening:
     libxl_defbool_set(&domain_build_info->numa_placement, false);
 
 Also, if `numa_placement` is set to `true`, the domain must not
-have any CPU affinity (i.e., `domain_build_info->cpumap` must
-have all its bits set, as it is by default), or domain creation
-will fail returning `ERROR_INVAL`.
+have any vCPU pinning (i.e., `domain_build_info->cpumap` must have
+all its bits set, as it is by default), or domain creation will fail
+with an `ERROR_INVAL`.
 
 Starting from Xen 4.3, in case automatic placement happens (and is
 successful), it will affect the domain's node affinity and _not_ its
-CPU affinity. Namely, the domain's vCPUs will not be pinned to any
+vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
 pCPU on the host, but the memory from the domain will come from the
 selected node(s) and the NUMA aware scheduling (if the credit scheduler
-is in use) will try to keep the domain there as much as possible.
+is in use) will try to keep the domain's vCPUs there as much as possible.
 
 Besides than that, looking and/or tweaking the placement algorithm
 search "Automatic NUMA placement" in libxl\_internal.h.
 
 Note this may change in future versions of Xen/libxl.
 
+## Xen < 4.4 ##
+
+The concept of per-vCPU node-affinity has been introduced in Xen 4.4.
+In Xen versions earlier than that, the node-affinity is the same for
+the whole domain, that is to say the same for all the vCPUs of the
+domain.
+
 ## Xen < 4.3 ##
 
 As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 366d9b9..ae29945 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d)
     for_each_cpu ( cpu, cpumask )
         node_set(cpu_to_node(cpu), d->node_affinity);
 
-    sched_set_node_affinity(d, &d->node_affinity);
-
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index c53a36b..d228127 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -178,9 +178,6 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
-    /* cpumask translated from the domain's node-affinity.
-     * Basically, the CPUs we prefer to be scheduled on. */
-    cpumask_var_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc)
     list_del_init(&svc->runq_elem);
 }
 
-/*
- * Translates node-affinity mask into a cpumask, so that we can use it during
- * actual scheduling. That of course will contain all the cpus from all the
- * set nodes in the original node-affinity mask.
- *
- * Note that any serialization needed to access mask safely is complete
- * responsibility of the caller of this function/hook.
- */
-static void csched_set_node_affinity(
-    const struct scheduler *ops,
-    struct domain *d,
-    nodemask_t *mask)
-{
-    struct csched_dom *sdom;
-    int node;
-
-    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
-    if ( unlikely(is_idle_domain(d)) )
-        return;
-
-    sdom = CSCHED_DOM(d);
-    cpumask_clear(sdom->node_affinity_cpumask);
-    for_each_node_mask( node, *mask )
-        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
-                   &node_to_cpumask(node));
-}
 
 #define for_each_csched_balance_step(step) \
     for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
@@ -294,12 +265,12 @@ static void csched_set_node_affinity(
 
 /*
  * vcpu-affinity balancing is always necessary and must never be skipped.
- * OTOH, if a domain's node-affinity is said to be automatically computed
- * (or if it just spans all the nodes), we can safely avoid dealing with
- * node-affinity entirely.
+ * OTOH, if the vcpu's numa-affinity is being automatically computed from
+ * the vcpu's vcpu-affinity, or if it just spans all the nodes, we can
+ * safely avoid dealing with numa-affinity entirely.
  *
- * Node-affinity is also deemed meaningless in case it has empty
- * intersection with mask, to cover the cases where using the node-affinity
+ * A vcpu's numa-affinity is also deemed meaningless in case it has empty
+ * intersection with mask, to cover the cases where using the numa-affinity
  * mask seems legit, but would instead led to trying to schedule the vcpu
  * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
  * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
@@ -308,11 +279,9 @@ static void csched_set_node_affinity(
 static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
                                            const cpumask_t *mask)
 {
-    const struct domain *d = vc->domain;
-    const struct csched_dom *sdom = CSCHED_DOM(d);
-
-    if (cpumask_full(sdom->node_affinity_cpumask)
-         || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
+    if ( vc->auto_node_affinity == 1
+         || cpumask_full(vc->node_affinity)
+         || !cpumask_intersects(vc->node_affinity, mask) )
         return 0;
 
     return 1;
@@ -330,8 +299,7 @@ csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
 {
     if ( step == CSCHED_BALANCE_NODE_AFFINITY )
     {
-        cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask,
-                    vc->cpu_affinity);
+        cpumask_and(mask, vc->node_affinity, vc->cpu_affinity);
 
         if ( unlikely(cpumask_empty(mask)) )
             cpumask_copy(mask, vc->cpu_affinity);
@@ -1110,13 +1078,6 @@ csched_alloc_domdata(const struct scheduler *ops, struct domain *dom)
     if ( sdom == NULL )
         return NULL;
 
-    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
-    {
-        xfree(sdom);
-        return NULL;
-    }
-    cpumask_setall(sdom->node_affinity_cpumask);
-
     /* Initialize credit and weight */
     INIT_LIST_HEAD(&sdom->active_vcpu);
     INIT_LIST_HEAD(&sdom->active_sdom_elem);
@@ -1146,9 +1107,6 @@ csched_dom_init(const struct scheduler *ops, struct domain *dom)
 static void
 csched_free_domdata(const struct scheduler *ops, void *data)
 {
-    struct csched_dom *sdom = data;
-
-    free_cpumask_var(sdom->node_affinity_cpumask);
     xfree(data);
 }
 
@@ -1975,8 +1933,6 @@ const struct scheduler sched_credit_def = {
     .adjust         = csched_dom_cntl,
     .adjust_global  = csched_sys_cntl,
 
-    .set_node_affinity  = csched_set_node_affinity,
-
     .pick_cpu       = csched_cpu_pick,
     .do_schedule    = csched_schedule,
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index b3966ad..454f27d 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -648,11 +648,6 @@ int cpu_disable_scheduler(unsigned int cpu)
     return ret;
 }
 
-void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
-{
-    SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
-}
-
 int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
 {
     cpumask_t online_affinity;
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index d95e254..4164dff 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -158,8 +158,6 @@ struct scheduler {
                                     struct xen_domctl_scheduler_op *);
     int          (*adjust_global)  (const struct scheduler *,
                                     struct xen_sysctl_scheduler_op *);
-    void         (*set_node_affinity) (const struct scheduler *,
-                                       struct domain *, nodemask_t *);
     void         (*dump_settings)  (const struct scheduler *);
     void         (*dump_cpu_state) (const struct scheduler *, int);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index d8e4735..83f89c7 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -595,7 +595,6 @@ void sched_destroy_domain(struct domain *d);
 int sched_move_domain(struct domain *d, struct cpupool *c);
 long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
 long sched_adjust_global(struct xen_sysctl_scheduler_op *);
-void sched_set_node_affinity(struct domain *, nodemask_t *);
 int  sched_id(void);
 void sched_tick_suspend(void);
 void sched_tick_resume(void);

