[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] [PATCH v8 02/13] xen: sched: introduce soft-affinity and use it instead d->node-affinity



Before this change, each vcpu had its own vcpu-affinity
(in v->cpu_affinity), representing the set of pcpus where
the vcpu is allowed to run. Since when NUMA-aware scheduling
was introduced the (credit1 only, for now) scheduler also
tries as much as it can to run all the vcpus of a domain
on one of the nodes that constitutes the domain's
node-affinity.

The idea here is making the mechanism more general by:
  * allowing for this 'preference' for some pcpus/nodes to be
    expressed on a per-vcpu basis, instead than for the domain
    as a whole. That is to say, each vcpu should have its own
    set of preferred pcpus/nodes, instead than it being the
    very same for all the vcpus of the domain;
  * generalizing the idea of 'preferred pcpus' to not only NUMA
    awareness and support. That is to say, independently from
    it being or not (mostly) useful on NUMA systems, it should
    be possible to specify, for each vcpu, a set of pcpus where
    it prefers to run (in addition, and possibly unrelated to,
    the set of pcpus where it is allowed to run).

We will be calling this set of *preferred* pcpus the vcpu's
soft affinity, and this changes introduce it, and starts using it
for scheduling, replacing the indirect use of the domain's NUMA
node-affinity. This is more general, as soft affinity does not
have to be related to NUMA. Nevertheless, it allows to achieve the
same results of NUMA-aware scheduling, just by making soft affinity
equal to the domain's node affinity, for all the vCPUs (e.g.,
from the toolstack).

This also means renaming most of the NUMA-aware scheduling related
functions, in credit1, to something more generic, hinting toward
the concept of soft affinity rather than directly to NUMA awareness.

As a side effects, this simplifies the code quit a bit. In fact,
prior to this change, we needed to cache the translation of
d->node_affinity (which is a nodemask_t) to a cpumask_t, since that
is what scheduling decisions require (we used to keep it in
node_affinity_cpumask). This, and all the complicated logic
required to keep it updated, is not necessary any longer.

The high level description of NUMA placement and scheduling in
docs/misc/xl-numa-placement.markdown is being updated too, to match
the new architecture.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
Reviewed-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
Acked-by: Jan Beulich <jbeulich@xxxxxxxx>
---
Changes from v6:
 * set_node_affinity completely removed (I mean the field and
   the scheduling abstraction around it), as requested during
   review.

Changes from v2:
 * this patch folds patches 6 ("xen: sched: make space for
   cpu_soft_affinity") and 10 ("xen: sched: use soft-affinity
   instead of domain's node-affinity"), as suggested during
   review. 'Reviewed-by' from George is there since both patch
   6 and 10 had it, and I didn't do anything else than squashing
   them.

Changes from v1:
 * in v1, "7/12 xen: numa-sched: use per-vcpu node-affinity for
   actual scheduling" was doing something very similar to this
   patch.
---
 docs/misc/xl-numa-placement.markdown |  148 ++++++++++++++++++++++-----------
 xen/common/domain.c                  |    5 +
 xen/common/keyhandler.c              |    2 
 xen/common/sched_credit.c            |  153 +++++++++++++---------------------
 xen/common/schedule.c                |    8 +-
 xen/include/xen/sched-if.h           |    2 
 xen/include/xen/sched.h              |    4 +
 7 files changed, 168 insertions(+), 154 deletions(-)

diff --git a/docs/misc/xl-numa-placement.markdown 
b/docs/misc/xl-numa-placement.markdown
index caa3fec..9d64eae 100644
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA 
node is usually
 defined as a set of processor cores (typically a physical CPU package) and
 the memory directly attached to the set of cores.
 
-The Xen hypervisor deals with NUMA machines by assigning to each domain
-a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated. Also, even if the node affinity of a domain
-is allowed to change on-line, it is very important to "place" the domain
-correctly when it is fist created, as the most of its memory is allocated
-at that time and can not (for now) be moved easily.
-
 NUMA awareness becomes very important as soon as many domains start
 running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
@@ -27,14 +20,37 @@ performance degradation is likely to be noticeable.
 For more information, have a look at the [Xen NUMA Introduction][numa_intro]
 page on the Wiki.
 
+## Xen and NUMA machines: the concept of _node-affinity_ ##
+
+The Xen hypervisor deals with NUMA machines throughout the concept of
+_node-affinity_. The node-affinity of a domain is the set of NUMA nodes
+of the host where the memory for the domain is being allocated (mostly,
+at domain creation time). This is, at least in principle, different and
+unrelated with the vCPU (hard and soft, see below) scheduling affinity,
+which instead is the set of pCPUs where the vCPU is allowed (or prefers)
+to run.
+
+Of course, despite the fact that they belong to and affect different
+subsystems, the domain node-affinity and the vCPUs affinity are not
+completely independent.
+In fact, if the domain node-affinity is not explicitly specified by the
+user, via the proper libxl calls or xl config item, it will be computed
+basing on the vCPUs' scheduling affinity.
+
+Notice that, even if the node affinity of a domain may change on-line,
+it is very important to "place" the domain correctly when it is fist
+created, as the most of its memory is allocated at that time and can
+not (for now) be moved easily.
+
 ### Placing via pinning and cpupools ###
 
-The simplest way of placing a domain on a NUMA node is statically pinning
-the domain's vCPUs to the pCPUs of the node. This goes under the name of
-CPU affinity and can be set through the "cpus=" option in the config file
-(more about this below). Another option is to pool together the pCPUs
-spanning the node and put the domain in such a cpupool with the "pool="
-config option (as documented in our [Wiki][cpupools_howto]).
+The simplest way of placing a domain on a NUMA node is setting the hard
+scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
+also goes under the name of vCPU pinning, and can be done through the
+"cpus=" option in the config file (more about this below). Another option
+is to pool together the pCPUs spanning the node and put the domain in
+such a _cpupool_ with the "pool=" config option (as documented in our
+[Wiki][cpupools_howto]).
 
 In both the above cases, the domain will not be able to execute outside
 the specified set of pCPUs for any reasons, even if all those pCPUs are
@@ -45,24 +61,45 @@ may come at he cost of some load imbalances.
 
 ### NUMA aware scheduling ###
 
-If the credit scheduler is in use, the concept of node affinity defined
-above does not only apply to memory. In fact, starting from Xen 4.3, the
-scheduler always tries to run the domain's vCPUs on one of the nodes in
-its node affinity. Only if that turns out to be impossible, it will just
-pick any free pCPU.
-
-This is, therefore, something more flexible than CPU affinity, as a domain
-can still run everywhere, it just prefers some nodes rather than others.
-Locality of access is less guaranteed than in the pinning case, but that
-comes along with better chances to exploit all the host resources (e.g.,
-the pCPUs).
-
-In fact, if all the pCPUs in a domain's node affinity are busy, it is
-possible for the domain to run outside of there, but it is very likely that
-slower execution (due to remote memory accesses) is still better than no
-execution at all, as it would happen with pinning. For this reason, NUMA
-aware scheduling has the potential of bringing substantial performances
-benefits, although this will depend on the workload.
+If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
+itself always tries to run the domain's vCPUs on one of the nodes in
+its node-affinity. Only if that turns out to be impossible, it will just
+pick any free pCPU. Locality of access is less guaranteed than in the
+pinning case, but that comes along with better chances to exploit all
+the host resources (e.g., the pCPUs).
+
+Starting from Xen 4.5, credit1 supports two forms of affinity: hard and
+soft, both on a per-vCPU basis. This means each vCPU can have its own
+soft affinity, stating where such vCPU prefers to execute on. This is
+less strict than what it (also starting from 4.5) is called hard affinity,
+as the vCPU can potentially run everywhere, it just prefers some pCPUs
+rather than others.
+In Xen 4.5, therefore, NUMA-aware scheduling is achieved by matching the
+soft affinity of the vCPUs of a domain with its node-affinity.
+
+In fact, as it was for 4.3, if all the pCPUs in a vCPU's soft affinity
+are busy, it is possible for the domain to run outside from there. The
+idea is that slower execution (due to remote memory accesses) is still
+better than no execution at all (as it would happen with pinning). For
+this reason, NUMA aware scheduling has the potential of bringing
+substantial performances benefits, although this will depend on the
+workload.
+
+Notice that, for each vCPU, the following three scenarios are possbile:
+
+  * a vCPU *is pinned* to some pCPUs and *does not have* any soft affinity
+    In this case, the vCPU is always scheduled on one of the pCPUs to which
+    it is pinned, without any specific peference among them.
+  * a vCPU *has* its own soft affinity and *is not* pinned to any particular
+    pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the
+    scheduler will try to have it running on one of the pCPUs in its soft
+    affinity;
+  * a vCPU *has* its own vCPU soft affinity and *is also* pinned to some
+    pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs
+    onto which it is pinned, with, among them, a preference for the ones
+    that also forms its soft affinity. In case pinning and soft affinity
+    form two disjoint sets of pCPUs, pinning "wins", and the soft affinity
+    is just ignored.
 
 ## Guest placement in xl ##
 
@@ -71,25 +108,23 @@ both manual or automatic placement of them across the 
host's NUMA nodes.
 
 Note that xm/xend does a very similar thing, the only differences being
 the details of the heuristics adopted for automatic placement (see below),
-and the lack of support (in both xm/xend and the Xen versions where that\
+and the lack of support (in both xm/xend and the Xen versions where that
 was the default toolstack) for NUMA aware scheduling.
 
 ### Placing the guest manually ###
 
 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
-affects NUMA placement and memory accesses as the hypervisor constructs
-the node affinity of a VM basing right on its CPU affinity when it is
-created.
+affects NUMA placement and memory accesses as, in this case, the
+hypervisor constructs the node-affinity of a VM basing right on its
+vCPU pinning when it is created.
 
 This is very simple and effective, but requires the user/system
-administrator to explicitly specify affinities for each and every domain,
+administrator to explicitly specify the pinning for each and every domain,
 or Xen won't be able to guarantee the locality for their memory accesses.
 
-Notice that this also pins the domain's vCPUs to the specified set of
-pCPUs, so it not only sets the domain's node affinity (its memory will
-come from the nodes to which the pCPUs belong), but at the same time
-forces the vCPUs of the domain to be scheduled on those same pCPUs.
+That, of course, also mean the vCPUs of the domain will only be able to
+execute on those same pCPUs.
 
 ### Placing the guest automatically ###
 
@@ -97,7 +132,9 @@ If no "cpus=" option is specified in the config file, libxl 
tries
 to figure out on its own on which node(s) the domain could fit best.
 If it finds one (some), the domain's node affinity get set to there,
 and both memory allocations and NUMA aware scheduling (for the credit
-scheduler and starting from Xen 4.3) will comply with it.
+scheduler and starting from Xen 4.3) will comply with it. Starting from
+Xen 4.5, this also means that the mask resulting from this "fitting"
+procedure will become the soft affinity of all the vCPUs of the domain.
 
 It is worthwhile noting that optimally fitting a set of VMs on the NUMA
 nodes of an host is an incarnation of the Bin Packing Problem. In fact,
@@ -142,34 +179,43 @@ any placement from happening:
 
     libxl_defbool_set(&domain_build_info->numa_placement, false);
 
-Also, if `numa_placement` is set to `true`, the domain must not
-have any CPU affinity (i.e., `domain_build_info->cpumap` must
-have all its bits set, as it is by default), or domain creation
-will fail returning `ERROR_INVAL`.
+Also, if `numa_placement` is set to `true`, the domain's vCPUs must
+not be pinned (i.e., `domain_build_info->cpumap` must have all its
+bits set, as it is by default), or domain creation will fail with
+`ERROR_INVAL`.
 
 Starting from Xen 4.3, in case automatic placement happens (and is
-successful), it will affect the domain's node affinity and _not_ its
-CPU affinity. Namely, the domain's vCPUs will not be pinned to any
+successful), it will affect the domain's node-affinity and _not_ its
+vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
 pCPU on the host, but the memory from the domain will come from the
 selected node(s) and the NUMA aware scheduling (if the credit scheduler
-is in use) will try to keep the domain there as much as possible.
+is in use) will try to keep the domain's vCPUs there as much as possible.
 
 Besides than that, looking and/or tweaking the placement algorithm
 search "Automatic NUMA placement" in libxl\_internal.h.
 
 Note this may change in future versions of Xen/libxl.
 
+## Xen < 4.5 ##
+
+The concept of vCPU soft affinity has been introduced for the first time
+in Xen 4.5. In 4.3, it is the domain's node-affinity that drives the
+NUMA-aware scheduler. The main difference is soft affinity is per-vCPU,
+and so each vCPU can have its own mask of pCPUs, while node-affinity is
+per-domain, that is the equivalent of having all the vCPUs with the same
+soft affinity.
+
 ## Xen < 4.3 ##
 
 As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
 bit different for earlier version of Xen. If no "cpus=" option is specified
 and Xen 4.2 is in use, the automatic placement algorithm still runs, but
 the results is used to _pin_ the vCPUs of the domain to the output node(s).
-This is consistent with what was happening with xm/xend, which were also
-affecting the domain's CPU affinity.
+This is consistent with what was happening with xm/xend.
 
 On a version of Xen earlier than 4.2, there is not automatic placement at
-all in xl or libxl, and hence no node or CPU affinity being affected.
+all in xl or libxl, and hence no node-affinity, vCPU affinity or pinning
+being introduced/modified.
 
 ## Limitations ##
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 141a5dc..e20d3bf 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -128,6 +128,7 @@ struct vcpu *alloc_vcpu(
     if ( !zalloc_cpumask_var(&v->cpu_hard_affinity) ||
          !zalloc_cpumask_var(&v->cpu_hard_affinity_tmp) ||
          !zalloc_cpumask_var(&v->cpu_hard_affinity_saved) ||
+         !zalloc_cpumask_var(&v->cpu_soft_affinity) ||
          !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) )
         goto fail_free;
 
@@ -159,6 +160,7 @@ struct vcpu *alloc_vcpu(
         free_cpumask_var(v->cpu_hard_affinity);
         free_cpumask_var(v->cpu_hard_affinity_tmp);
         free_cpumask_var(v->cpu_hard_affinity_saved);
+        free_cpumask_var(v->cpu_soft_affinity);
         free_cpumask_var(v->vcpu_dirty_cpumask);
         free_vcpu_struct(v);
         return NULL;
@@ -446,8 +448,6 @@ void domain_update_node_affinity(struct domain *d)
                 node_set(node, d->node_affinity);
     }
 
-    sched_set_node_affinity(d, &d->node_affinity);
-
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
@@ -795,6 +795,7 @@ static void complete_domain_destroy(struct rcu_head *head)
             free_cpumask_var(v->cpu_hard_affinity);
             free_cpumask_var(v->cpu_hard_affinity_tmp);
             free_cpumask_var(v->cpu_hard_affinity_saved);
+            free_cpumask_var(v->cpu_soft_affinity);
             free_cpumask_var(v->vcpu_dirty_cpumask);
             free_vcpu_struct(v);
         }
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
index d6eb026..809378c 100644
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -297,6 +297,8 @@ static void dump_domains(unsigned char key)
             printk("dirty_cpus=%s ", tmpstr);
             cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_hard_affinity);
             printk("cpu_affinity=%s\n", tmpstr);
+            cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_soft_affinity);
+            printk("cpu_soft_affinity=%s\n", tmpstr);
             printk("    pause_count=%d pause_flags=%lx\n",
                    atomic_read(&v->pause_count), v->pause_flags);
             arch_dump_vcpu_info(v);
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index c6a2560..8b02b7b 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -112,10 +112,24 @@
 
 
 /*
- * Node Balancing
+ * Hard and soft affinity load balancing.
+ *
+ * Idea is each vcpu has some pcpus that it prefers, some that it does not
+ * prefer but is OK with, and some that it cannot run on at all. The first
+ * set of pcpus are the ones that are both in the soft affinity *and* in the
+ * hard affinity; the second set of pcpus are the ones that are in the hard
+ * affinity but *not* in the soft affinity; the third set of pcpus are the
+ * ones that are not in the hard affinity.
+ *
+ * We implement a two step balancing logic. Basically, every time there is
+ * the need to decide where to run a vcpu, we first check the soft affinity
+ * (well, actually, the && between soft and hard affinity), to see if we can
+ * send it where it prefers to (and can) run on. However, if the first step
+ * does not find any suitable and free pcpu, we fall back checking the hard
+ * affinity.
  */
-#define CSCHED_BALANCE_NODE_AFFINITY    0
-#define CSCHED_BALANCE_CPU_AFFINITY     1
+#define CSCHED_BALANCE_SOFT_AFFINITY    0
+#define CSCHED_BALANCE_HARD_AFFINITY    1
 
 /*
  * Boot parameters
@@ -138,7 +152,7 @@ struct csched_pcpu {
 
 /*
  * Convenience macro for accessing the per-PCPU cpumask we need for
- * implementing the two steps (vcpu and node affinity) balancing logic.
+ * implementing the two steps (soft and hard affinity) balancing logic.
  * It is stored in csched_pcpu so that serialization is not an issue,
  * as there is a csched_pcpu for each PCPU and we always hold the
  * runqueue spin-lock when using this.
@@ -178,9 +192,6 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
-    /* cpumask translated from the domain's node-affinity.
-     * Basically, the CPUs we prefer to be scheduled on. */
-    cpumask_var_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -261,59 +272,28 @@ __runq_remove(struct csched_vcpu *svc)
     list_del_init(&svc->runq_elem);
 }
 
-/*
- * Translates node-affinity mask into a cpumask, so that we can use it during
- * actual scheduling. That of course will contain all the cpus from all the
- * set nodes in the original node-affinity mask.
- *
- * Note that any serialization needed to access mask safely is complete
- * responsibility of the caller of this function/hook.
- */
-static void csched_set_node_affinity(
-    const struct scheduler *ops,
-    struct domain *d,
-    nodemask_t *mask)
-{
-    struct csched_dom *sdom;
-    int node;
-
-    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
-    if ( unlikely(is_idle_domain(d)) )
-        return;
-
-    sdom = CSCHED_DOM(d);
-    cpumask_clear(sdom->node_affinity_cpumask);
-    for_each_node_mask( node, *mask )
-        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
-                   &node_to_cpumask(node));
-}
 
 #define for_each_csched_balance_step(step) \
-    for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
+    for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY; (step)++ )
 
 
 /*
- * vcpu-affinity balancing is always necessary and must never be skipped.
- * OTOH, if a domain's node-affinity is said to be automatically computed
- * (or if it just spans all the nodes), we can safely avoid dealing with
- * node-affinity entirely.
+ * Hard affinity balancing is always necessary and must never be skipped.
+ * OTOH, if the vcpu's soft affinity is full (it spans all the possible
+ * pcpus) we can safely avoid dealing with it entirely.
  *
- * Node-affinity is also deemed meaningless in case it has empty
- * intersection with mask, to cover the cases where using the node-affinity
+ * A vcpu's soft affinity is also deemed meaningless in case it has empty
+ * intersection with mask, to cover the cases where using the soft affinity
  * mask seems legit, but would instead led to trying to schedule the vcpu
  * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
- * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
+ * hard affinity, or to the && of hard affinity and the set of online cpus
  * in the domain's cpupool.
  */
-static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
+static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
                                            const cpumask_t *mask)
 {
-    const struct domain *d = vc->domain;
-    const struct csched_dom *sdom = CSCHED_DOM(d);
-
-    if ( d->auto_node_affinity
-         || cpumask_full(sdom->node_affinity_cpumask)
-         || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
+    if ( cpumask_full(vc->cpu_soft_affinity)
+         || !cpumask_intersects(vc->cpu_soft_affinity, mask) )
         return 0;
 
     return 1;
@@ -321,23 +301,22 @@ static inline int __vcpu_has_node_affinity(const struct 
vcpu *vc,
 
 /*
  * Each csched-balance step uses its own cpumask. This function determines
- * which one (given the step) and copies it in mask. For the node-affinity
- * balancing step, the pcpus that are not part of vc's vcpu-affinity are
+ * which one (given the step) and copies it in mask. For the soft affinity
+ * balancing step, the pcpus that are not part of vc's hard affinity are
  * filtered out from the result, to avoid running a vcpu where it would
  * like, but is not allowed to!
  */
 static void
 csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
 {
-    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+    if ( step == CSCHED_BALANCE_SOFT_AFFINITY )
     {
-        cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask,
-                    vc->cpu_hard_affinity);
+        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
 
         if ( unlikely(cpumask_empty(mask)) )
             cpumask_copy(mask, vc->cpu_hard_affinity);
     }
-    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
+    else /* step == CSCHED_BALANCE_HARD_AFFINITY */
         cpumask_copy(mask, vc->cpu_hard_affinity);
 }
 
@@ -398,15 +377,15 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
     else if ( !idlers_empty )
     {
         /*
-         * Node and vcpu-affinity balancing loop. For vcpus without
-         * a useful node-affinity, consider vcpu-affinity only.
+         * Soft and hard affinity balancing loop. For vcpus without
+         * a useful soft affinity, consider hard affinity only.
          */
         for_each_csched_balance_step( balance_step )
         {
             int new_idlers_empty;
 
-            if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-                 && !__vcpu_has_node_affinity(new->vcpu,
+            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+                 && !__vcpu_has_soft_affinity(new->vcpu,
                                               new->vcpu->cpu_hard_affinity) )
                 continue;
 
@@ -418,11 +397,11 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
 
             /*
              * Let's not be too harsh! If there aren't idlers suitable
-             * for new in its node-affinity mask, make sure we check its
-             * vcpu-affinity as well, before taking final decisions.
+             * for new in its soft affinity mask, make sure we check its
+             * hard affinity as well, before taking final decisions.
              */
             if ( new_idlers_empty
-                 && balance_step == CSCHED_BALANCE_NODE_AFFINITY )
+                 && balance_step == CSCHED_BALANCE_SOFT_AFFINITY )
                 continue;
 
             /*
@@ -649,23 +628,23 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu 
*vc, bool_t commit)
         /*
          * We want to pick up a pcpu among the ones that are online and
          * can accommodate vc, which is basically what we computed above
-         * and stored in cpus. As far as vcpu-affinity is concerned,
+         * and stored in cpus. As far as hard affinity is concerned,
          * there always will be at least one of these pcpus, hence cpus
          * is never empty and the calls to cpumask_cycle() and
          * cpumask_test_cpu() below are ok.
          *
-         * On the other hand, when considering node-affinity too, it
+         * On the other hand, when considering soft affinity too, it
          * is possible for the mask to become empty (for instance, if the
          * domain has been put in a cpupool that does not contain any of the
-         * nodes in its node-affinity), which would result in the ASSERT()-s
+         * pcpus in its soft affinity), which would result in the ASSERT()-s
          * inside cpumask_*() operations triggering (in debug builds).
          *
-         * Therefore, in this case, we filter the node-affinity mask against
-         * cpus and, if the result is empty, we just skip the node-affinity
+         * Therefore, in this case, we filter the soft affinity mask against
+         * cpus and, if the result is empty, we just skip the soft affinity
          * balancing step all together.
          */
-        if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(vc, &cpus) )
+        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+             && !__vcpu_has_soft_affinity(vc, &cpus) )
             continue;
 
         /* Pick an online CPU from the proper affinity mask */
@@ -1122,13 +1101,6 @@ csched_alloc_domdata(const struct scheduler *ops, struct 
domain *dom)
     if ( sdom == NULL )
         return NULL;
 
-    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
-    {
-        xfree(sdom);
-        return NULL;
-    }
-    cpumask_setall(sdom->node_affinity_cpumask);
-
     /* Initialize credit and weight */
     INIT_LIST_HEAD(&sdom->active_vcpu);
     INIT_LIST_HEAD(&sdom->active_sdom_elem);
@@ -1158,9 +1130,6 @@ csched_dom_init(const struct scheduler *ops, struct 
domain *dom)
 static void
 csched_free_domdata(const struct scheduler *ops, void *data)
 {
-    struct csched_dom *sdom = data;
-
-    free_cpumask_var(sdom->node_affinity_cpumask);
     xfree(data);
 }
 
@@ -1486,19 +1455,19 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int 
balance_step)
             BUG_ON( is_idle_vcpu(vc) );
 
             /*
-             * If the vcpu has no useful node-affinity, skip this vcpu.
-             * In fact, what we want is to check if we have any node-affine
-             * work to steal, before starting to look at vcpu-affine work.
+             * If the vcpu has no useful soft affinity, skip this vcpu.
+             * In fact, what we want is to check if we have any "soft-affine
+             * work" to steal, before starting to look at "hard-affine work".
              *
              * Notice that, if not even one vCPU on this runq has a useful
-             * node-affinity, we could have avoid considering this runq for
-             * a node balancing step in the first place. This, for instance,
+             * soft affinity, we could have avoid considering this runq for
+             * a soft balancing step in the first place. This, for instance,
              * can be implemented by taking note of on what runq there are
-             * vCPUs with useful node-affinities in some sort of bitmap
+             * vCPUs with useful soft affinities in some sort of bitmap
              * or counter.
              */
-            if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-                 && !__vcpu_has_node_affinity(vc, vc->cpu_hard_affinity) )
+            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
                 continue;
 
             csched_balance_cpumask(vc, balance_step, csched_balance_mask);
@@ -1546,17 +1515,17 @@ csched_load_balance(struct csched_private *prv, int cpu,
         SCHED_STAT_CRANK(load_balance_other);
 
     /*
-     * Let's look around for work to steal, taking both vcpu-affinity
-     * and node-affinity into account. More specifically, we check all
+     * Let's look around for work to steal, taking both hard affinity
+     * and soft affinity into account. More specifically, we check all
      * the non-idle CPUs' runq, looking for:
-     *  1. any node-affine work to steal first,
-     *  2. if not finding anything, any vcpu-affine work to steal.
+     *  1. any "soft-affine work" to steal first,
+     *  2. if not finding anything, any "hard-affine work" to steal.
      */
     for_each_csched_balance_step( bstep )
     {
         /*
          * We peek at the non-idling CPUs in a node-wise fashion. In fact,
-         * it is more likely that we find some node-affine work on our same
+         * it is more likely that we find some affine work on our same
          * node, not to mention that migrating vcpus within the same node
          * could well expected to be cheaper than across-nodes (memory
          * stays local, there might be some node-wide cache[s], etc.).
@@ -1982,8 +1951,6 @@ const struct scheduler sched_credit_def = {
     .adjust         = csched_dom_cntl,
     .adjust_global  = csched_sys_cntl,
 
-    .set_node_affinity  = csched_set_node_affinity,
-
     .pick_cpu       = csched_cpu_pick,
     .do_schedule    = csched_schedule,
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index b579dfa..d76a425 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -198,6 +198,8 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor)
     else
         cpumask_setall(v->cpu_hard_affinity);
 
+    cpumask_setall(v->cpu_soft_affinity);
+
     /* Initialise the per-vcpu timers. */
     init_timer(&v->periodic_timer, vcpu_periodic_timer_fn,
                v, v->processor);
@@ -286,6 +288,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
         migrate_timer(&v->poll_timer, new_p);
 
         cpumask_setall(v->cpu_hard_affinity);
+        cpumask_setall(v->cpu_soft_affinity);
 
         lock = vcpu_schedule_lock_irq(v);
         v->processor = new_p;
@@ -645,11 +648,6 @@ int cpu_disable_scheduler(unsigned int cpu)
     return ret;
 }
 
-void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
-{
-    SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
-}
-
 int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
 {
     cpumask_t online_affinity;
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index d95e254..4164dff 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -158,8 +158,6 @@ struct scheduler {
                                     struct xen_domctl_scheduler_op *);
     int          (*adjust_global)  (const struct scheduler *,
                                     struct xen_sysctl_scheduler_op *);
-    void         (*set_node_affinity) (const struct scheduler *,
-                                       struct domain *, nodemask_t *);
     void         (*dump_settings)  (const struct scheduler *);
     void         (*dump_cpu_state) (const struct scheduler *, int);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 6f91abd..445b659 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -225,6 +225,9 @@ struct vcpu
     /* Used to restore affinity across S3. */
     cpumask_var_t    cpu_hard_affinity_saved;
 
+    /* Bitmask of CPUs on which this VCPU prefers to run. */
+    cpumask_var_t    cpu_soft_affinity;
+
     /* Bitmask of CPUs which are holding onto this VCPU's state. */
     cpumask_var_t    vcpu_dirty_cpumask;
 
@@ -627,7 +630,6 @@ void sched_destroy_domain(struct domain *d);
 int sched_move_domain(struct domain *d, struct cpupool *c);
 long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
 long sched_adjust_global(struct xen_sysctl_scheduler_op *);
-void sched_set_node_affinity(struct domain *, nodemask_t *);
 int  sched_id(void);
 void sched_tick_suspend(void);
 void sched_tick_resume(void);


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.