
[Xen-devel] [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`



The basic idea is: cpu-affinity tells where a vcpu can run,
while node-affinity tells where (in terms of what NUMA nodes,
which of course implies a set of CPUs) all the vcpus of the
domain should/prefer to run... Problems start when you have
both of them speaking at the same time!
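
To make this a bit more concrete, here is a tiny standalone sketch (not
Xen code: plain uint64_t bitmaps instead of nodemask_t/cpumask_t, and a
made-up two-node topology and helper names) of how a domain's
node-affinity translates into a set of "preferred" CPUs:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NR_NODES 2

    /* Hypothetical topology: node 0 owns CPUs 0-1, node 1 owns CPUs 2-3. */
    static const uint64_t node_to_cpus[NR_NODES] = { 0x3, 0xc };

    /* Expand a node mask into the cpumask of all the CPUs of those nodes. */
    static uint64_t nodemask_to_cpumask(uint64_t nodemask)
    {
        uint64_t cpus = 0;
        int node;

        for ( node = 0; node < NR_NODES; node++ )
            if ( nodemask & (1ULL << node) )
                cpus |= node_to_cpus[node];

        return cpus;
    }

    int main(void)
    {
        /* A domain affine to node 1 only "prefers" CPUs 2-3 (mask 0xc). */
        printf("preferred cpus: 0x%" PRIx64 "\n",
               nodemask_to_cpumask(1ULL << 1));
        return 0;
    }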

Respecting vcpu-affinity should remain the primary concern, thus
what we do here is add some two-step logic here and there in the
scheduler (i.e., wherever vcpu-affinity related decisions are being
made). During the first step, we check the `&&` of the vcpu and
node affinity masks, aiming at giving some precedence to the CPUs
that are both suitable and preferable for the domain. However,
if that fails to find a valid CPU, the node-affinity is just
ignored and, in the second step, we simply fall back to vcpu-affinity,
which is the original behaviour.
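
FWIW, the two-step logic boils down to something like the following
standalone sketch (again just a toy model with uint64_t bitmaps and a
made-up function name, not the actual cpumask_t code in the patch below):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Step 1: try the intersection of vcpu-affinity and the cpumask
     *         derived from node-affinity (the CPUs that are both
     *         suitable and preferable).
     * Step 2: if that turns out empty, fall back to plain vcpu-affinity,
     *         i.e., the original behaviour.
     */
    static uint64_t pick_candidate_cpus(uint64_t cpu_affinity,
                                        uint64_t node_affinity_cpus)
    {
        uint64_t preferred = cpu_affinity & node_affinity_cpus;

        return preferred ? preferred : cpu_affinity;
    }

    int main(void)
    {
        uint64_t cpu_affinity = 0x0f; /* vcpu may run on CPUs 0-3 */

        /* Node-affinity covers CPUs 2-3: the preferred subset (0xc) wins. */
        printf("0x%" PRIx64 "\n", pick_candidate_cpus(cpu_affinity, 0x0c));

        /* Node-affinity covers CPUs 4-5: no overlap, fall back to 0xf. */
        printf("0x%" PRIx64 "\n", pick_candidate_cpus(cpu_affinity, 0x30));

        return 0;
    }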

Both the introduced overhead and the benefits this provides have
to be investigated and compared against each other, possibly under
varying conditions and running different workloads.

Finally, although still at the RFC level, efforts have been made
to write the code in a flexible enough fashion so that, if we
ever want to introduce further balancing steps, it shouldn't be
too much of a pain.

TODO: * Better investigate and test interactions with cpupools.
      * Test, verify and benchmark. Then test, verify and benchmark
        again, and again and again. What I know as of now is that
        it does not explode that easily, but whether it works properly
        and, more importantly, brings any benefit, has yet to be
        proven (and yes, I'm out running these tests and benchmarks
        already, but do not hesitate to volunteer to help with
        that :-D).
      * Add some counters/stats, e.g., to serve the purpose outlined
        right above.

XXX I'm benchmarking this right now and, peeking at the results,
    they don't seem too bad. Numbers are mostly close to the case where
    the VM is already pinned to a node when created. I'll post the
    results as soon as I can, and I'll be happy to investigate any issues,
    if you feel like the approach could be the right one.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -117,6 +117,8 @@ will run on cpus 2 and 3).
 List of the NUMA nodes of the host the guest should be considered
 `affine` with. Being affine to a (set of) NUMA node(s) mainly means
 the guest's memory is going to be allocated on those node(s).
+Also, the scheduler will prefer (but not force) running the guest's
+vcpus on the cpus of the affine node(s).
 
 A list of nodes should be specified as follows: `nodes=["0", "3"]`
 for the guest to be affine with nodes 0 and 3.
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -101,6 +101,15 @@
 
 
 /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_NODE_AFFINITY    1
+#define CSCHED_BALANCE_CPU_AFFINITY     0
+#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
+#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY
+
+
+/*
  * Boot parameters
  */
 static bool_t __read_mostly sched_credit_default_yield;
@@ -150,6 +159,17 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
+    /*
+     * Let's cache domain's dom->node_affinity here as an
+     * optimization for a couple of hot paths. In fact,
+     * knowing whether or not dom->node_affinity has changed
+     * would allow us to avoid rebuilding node_affinity_cpumask
+     * (below) during node balancing and/or scheduling.
+     */
+    nodemask_t node_affinity_cache;
+    /* Based on what dom->node_affinity says,
+     * on what CPUs would we like to run most? */
+    cpumask_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -230,6 +250,87 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+#define csched_balance(__step) \
+    for ( (__step) = CSCHED_BALANCE_START_STEP; \
+          (__step) >= CSCHED_BALANCE_END_STEP; \
+          (__step)-- )
+
+/*
+ * Sort-of conversion between node-affinity and vcpu-affinity for the domain,
+ * i.e., a cpumask containing all the cpus from all the set nodes in the
+ * node-affinity mask of the domain.
+ *
+ * Notice that we completely forget about serializing this (both here and
+ * in the various sites where node_affinity_cpumask is used) against
+ * d->node_affinity_lock. That would be hard to do properly, as that lock
+ * is a non IRQ-safe one, and it would be nearly impossible to access it
+ * from the scheduling code. However, although this leaves some room for
+ * races with code paths that update d->node_affinity, it all should still
+ * be fine, considering:
+ *  (1) d->node_affinity updates are going to be quite rare;
+ *  (2) this balancing logic is kind-of "best effort" anyway;
+ *  (3) given (1), any inconsistencies are likely to be fixed by the next
+ *      call to this same function without risking going into instability.
+ *
+ * XXX Validate (via testing/benchmarking) whether this is true, as it
+ *     sounds likely to be, or whether it causes unexpected issues.
+ */
+static void
+csched_build_balance_cpumask(struct csched_dom *sdom)
+{
+    struct domain *d = sdom->dom;
+    int node;
+
+    /* Check whether we need to refresh the node-affinity cache we keep */
+    if ( likely(nodes_equal(d->node_affinity, sdom->node_affinity_cache)) )
+        return;
+    else
+        nodes_copy(sdom->node_affinity_cache, sdom->dom->node_affinity);
+
+    /* Create the cpumask matching the current node-affinity mask */
+    cpumask_clear(&sdom->node_affinity_cpumask);
+    for_each_node_mask( node, sdom->node_affinity_cache )
+        cpumask_or(&sdom->node_affinity_cpumask, &sdom->node_affinity_cpumask,
+                   &node_to_cpumask(node));
+}
+
+/* Put together the cpumask we are going to use for this balancing step */
+static int
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    struct domain *d = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(d);
+
+    /*
+     * Only two possible steps exist for now: node and vcpu affinity.
+     * If this domain does not have a node-affinity or is affine to all the
+     * nodes, just return the vcpu-affinity mask (for *this* vcpu) and
+     * stop the balancing process.
+     */
+    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
+         step == CSCHED_BALANCE_CPU_AFFINITY )
+    {
+        cpumask_copy(mask, vc->cpu_affinity);
+        return CSCHED_BALANCE_CPU_AFFINITY;
+    }
+
+    csched_build_balance_cpumask(sdom);
+
+    /*
+     * XXX As of now (with only two possible steps), this is easy and readable
+     *     enough. If more steps become necessary at some point in time, we
+     *     can keep an array of cpumask_t in dom_data and return the proper
+     *     element via step-indexing such an array. Of course, we could
+     *     switch to that even right now... Thoughts?
+     */
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+        cpumask_and(mask, &sdom->node_affinity_cpumask, vc->cpu_affinity);
+    else
+        cpumask_copy(mask, vc->cpu_affinity);
+
+    return step;
+}
+
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
 {
     s_time_t delta;
@@ -252,6 +353,20 @@ boolean_param("tickle_one_idle_cpu", opt
 DEFINE_PER_CPU(unsigned int, last_tickle_cpu);
 
 static inline void
+__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
+{
+    CSCHED_STAT_CRANK(tickle_idlers_some);
+    if ( opt_tickle_one_idle )
+    {
+        this_cpu(last_tickle_cpu) = 
+            cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
+        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
+    }
+    else
+        cpumask_or(mask, mask, idle_mask);
+}
+
+static inline void
 __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
 {
     struct csched_vcpu * const cur =
@@ -289,22 +404,27 @@ static inline void
         }
         else
         {
-            cpumask_t idle_mask;
+            cpumask_t idle_mask, balance_mask;
+            int balance_step;
 
-            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
-            if ( !cpumask_empty(&idle_mask) )
+            csched_balance(balance_step)
             {
-                CSCHED_STAT_CRANK(tickle_idlers_some);
-                if ( opt_tickle_one_idle )
-                {
-                    this_cpu(last_tickle_cpu) = 
-                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
-                }
-                else
-                    cpumask_or(&mask, &mask, &idle_mask);
+                /*
+                 * At each balancing step, look for any idler in the proper
+                 * affinity mask for the step itself (for new->vcpu).
+                 */
+                csched_balance_cpumask(new->vcpu, balance_step, &balance_mask);
+                cpumask_and(&idle_mask, prv->idlers, &balance_mask);
+
+                if ( !cpumask_empty(&idle_mask) )
+                    __cpumask_tickle(&mask, &idle_mask);
+
+                cpumask_and(&mask, &mask, &balance_mask);
+
+                /* We can quit balancing if we found someone to tickle */
+                if ( !cpumask_empty(&mask) )
+                    break;
             }
-            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
         }
     }
 
@@ -445,7 +565,7 @@ static inline int
 }
 
 static inline int
-__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
+__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
      * Don't pick up work that's in the peer's scheduling tail or hot on
@@ -453,27 +573,37 @@ static inline int
      */
     return !vc->is_running &&
            !__csched_vcpu_is_cache_hot(vc) &&
-           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
+           cpumask_test_cpu(dest_cpu, mask);
 }
 
 static int
 _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
 {
-    cpumask_t cpus;
+    cpumask_t cpus, start_cpus;
     cpumask_t idlers;
     cpumask_t *online;
+    struct csched_dom *sdom = CSCHED_DOM(vc->domain);
     struct csched_pcpu *spc = NULL;
     int cpu;
 
+    csched_build_balance_cpumask(sdom);
+
     /*
      * Pick from online CPUs in VCPU's affinity mask, giving a
      * preference to its current processor if it's in there.
+     *
+     * It's worth trying to take balancing into account,
+     * at least for selecting which cpu to start looking
+     * around from. XXX ... Or is this going to be too much?
      */
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
     cpumask_and(&cpus, online, vc->cpu_affinity);
-    cpu = cpumask_test_cpu(vc->processor, &cpus)
+    cpumask_and(&start_cpus, &cpus, &sdom->node_affinity_cpumask);
+    if ( unlikely(cpumask_empty(&start_cpus)) )
+        cpumask_copy(&start_cpus, &cpus);
+    cpu = cpumask_test_cpu(vc->processor, &start_cpus)
             ? vc->processor
-            : cpumask_cycle(vc->processor, &cpus);
+            : cpumask_cycle(vc->processor, &start_cpus);
     ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
 
     /*
@@ -877,6 +1007,14 @@ csched_alloc_domdata(const struct schedu
     sdom->active_vcpu_count = 0;
     INIT_LIST_HEAD(&sdom->active_sdom_elem);
     sdom->dom = dom;
+    /*
+     * XXX This would be 'The Right Thing', but as it is still too
+     *     early and d->node_affinity has not settled yet, maybe we
+     *     can just init the two masks with something like all-nodes
+     *     and all-cpus and rely on the first balancing call to
+     *     have them updated?
+     */
+    csched_build_balance_cpumask(sdom);
     sdom->weight = CSCHED_DEFAULT_WEIGHT;
     sdom->cap = 0U;
 
@@ -1216,30 +1354,54 @@ csched_runq_steal(int peer_cpu, int cpu,
      */
     if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
     {
-        list_for_each( iter, &peer_pcpu->runq )
+        int balance_step;
+
+        /*
+         * The idea here is to take NUMA affinity (if any) into account.
+         * While at it, try to be general enough, exploiting the balancing
+         * infrastructure as follows:
+         *  - we act in steps (two for now: node and vcpu affinity);
+         *  - at each step we check each item of peer_pcpu's runq
+         *    looking for a candidate vcpu against the proper CPU
+         *    mask, depending on the step itself;
+         *  - masks are (should be!) checked in order of preference,
+         *    as a match will cause the end of the balancing process;
+         *  - a preferred mask should not contain any 'illegal' CPU(s)
+         *    (wrt, e.g., the domain's 'cpu_affinity', if any), as
+         *    this might cause the vcpu to be shipped there!
+         */
+        csched_balance(balance_step)
         {
-            speer = __runq_elem(iter);
+            list_for_each( iter, &peer_pcpu->runq )
+            {
+                cpumask_t balance_mask;
 
-            /*
-             * If next available VCPU here is not of strictly higher
-             * priority than ours, this PCPU is useless to us.
-             */
-            if ( speer->pri <= pri )
-                break;
+                speer = __runq_elem(iter);
 
-            /* Is this VCPU is runnable on our PCPU? */
-            vc = speer->vcpu;
-            BUG_ON( is_idle_vcpu(vc) );
+                /*
+                 * If next available VCPU here is not of strictly higher
+                 * priority than ours, this PCPU is useless to us.
+                 */
+                if ( speer->pri <= pri )
+                    break;
 
-            if (__csched_vcpu_is_migrateable(vc, cpu))
-            {
-                /* We got a candidate. Grab it! */
-                CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
-                CSCHED_STAT_CRANK(migrate_queued);
-                WARN_ON(vc->is_urgent);
-                __runq_remove(speer);
-                vc->processor = cpu;
-                return speer;
+                /* Is this VCPU runnable on our PCPU? */
+                vc = speer->vcpu;
+                BUG_ON( is_idle_vcpu(vc) );
+
+                balance_step = csched_balance_cpumask(vc, balance_step,
+                                                      &balance_mask);
+
+                if (__csched_vcpu_is_migrateable(vc, cpu, &balance_mask))
+                {
+                    /* We got a candidate. Grab it! */
+                    CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
+                    CSCHED_STAT_CRANK(migrate_queued);
+                    WARN_ON(vc->is_urgent);
+                    __runq_remove(speer);
+                    vc->processor = cpu;
+                    return speer;
+                }
             }
         }
     }
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -195,6 +195,13 @@ static inline int __nodes_weight(const n
        return bitmap_weight(srcp->bits, nbits);
 }
 
+#define nodes_copy(dst, src) __nodes_copy(&(dst), &(src), MAX_NUMNODES)
+static inline void __nodes_copy(nodemask_t *dstp,
+                               const nodemask_t *srcp, int nbits)
+{
+       bitmap_copy(dstp->bits, srcp->bits, nbits);
+}
+
 #define nodes_shift_right(dst, src, n) \
                        __nodes_shift_right(&(dst), &(src), (n), MAX_NUMNODES)
 static inline void __nodes_shift_right(nodemask_t *dstp,
