[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity



On 19/12/12 19:07, Dario Faggioli wrote:
As vcpu-affinity tells where VCPUs must run, node-affinity tells
where they should or, better, prefer. While respecting vcpu-affinity
remains mandatory, node-affinity is not that strict, it only expresses
a preference, although honouring it is almost always true that will
bring significant performances benefit (especially as compared to
not having any affinity at all).

This change modifies the VCPU load balancing algorithm (for the
credit scheduler only), introducing a two steps logic.
During the first step, we use the node-affinity mask. The aim is
giving precedence to the CPUs where it is known to be preferable
for the domain to run. If that fails in finding a valid PCPU, the
node-affinity is just ignored and, in the second step, we fall
back to using cpu-affinity only.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

This one has a lot of structural comments; so I'm going to send a couple of different mails as I'm going through it, so we can parallize the discussion better. :-)

---
Changes from v1:
  * CPU masks variables moved off from the stack, as requested during
    review. As per the comments in the code, having them in the private
    (per-scheduler instance) struct could have been enough, but it would be
    racy (again, see comments). For that reason, use a global bunch of
    them of (via per_cpu());
  * George suggested a different load balancing logic during v1's review. I
    think he was right and then I changed the old implementation in a way
    that resembles exactly that. I rewrote most of this patch to introduce
    a more sensible and effective noda-affinity handling logic.

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -111,6 +111,33 @@


  /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_CPU_AFFINITY     0
+#define CSCHED_BALANCE_NODE_AFFINITY    1
+#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
[snip]
+#define for_each_csched_balance_step(__step) \
+    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )

Why are we starting at the top and going down? Is there any good reason for it?

Every time you do anything unexpected, you add to the cognitive load of the person reading your code, leaving less spare processing power or memory for other bits of the code, and increasing (sligthly) the chance of making a mistake. The most natural thing would be for someone to expect that the steps start at 0 and go up; just reversing it means it's that little bit harder to understand. When you name it "LAST", it's even worse, because that would definitely imply that this step is going to be executed last.

So why not just have this be as follows?

for(step=0; step<CSCHED_BALANCE_MAX; step++)

+
+/*
+ * Each csched-balance step has to use its own cpumask. This function
+ * determines which one, given the step, and copies it in mask. Notice
+ * that, in case of node-affinity balancing step, it also filters out from
+ * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
+ * as we do not want to end up running a vcpu where it would like, but
+ * is not allowed to!
+ *
+ * As an optimization, if a domain does not have any node-affinity at all
+ * (namely, its node affinity is automatically computed), not only the
+ * computed mask will reflect its vcpu-affinity, but we also return -1 to
+ * let the caller know that he can skip the step or quit the loop (if he
+ * wants).
+ */
+static int
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+    {
+        struct domain *d = vc->domain;
+        struct csched_dom *sdom = CSCHED_DOM(d);
+
+        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
+
+        if ( cpumask_full(sdom->node_affinity_cpumask) )
+            return -1;

There's no optimization in having this comparison done here. You're not reading something from a local variable that you've just calculated. But hiding this comparison inside this function, and disguising it as "returns -1", does increase the cognitive load on anybody trying to read and understand the code -- particularly, as how the return value is used is not really clear.

Also, when you use this value, effectively what you're doing is saying, "Actually, we just said we were doing the NODE_BALANCE step, but it turns out that the results of NODE_BALANCE and CPU_BALANCE will be the same, so we're just going to pretend that we've been doing the CPU_BALANCE step instead." (See for example, "balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret" -- why the !ret in this clause? Because if !ret then we're not actually doing NODE_AFFINITY now, but CPU_AFFINITY.) Another non-negligible chunk of cognitive load for someone reading the code to 1) figure out, and 2) keep in mind as she tries to analyze it.

I took a look at all the places which use this return value, and it seems like the best thing in each case would just be to have the *caller*, before getting into the loop, call cpumask_full(sdom->node_affinity_cpumask) and just skip the CSCHED_NODE_BALANCE step altogether if it's true. (Example below.)

@@ -266,67 +332,94 @@ static inline void
      struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
      struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
      cpumask_t mask, idle_mask;
-    int idlers_empty;
+    int balance_step, idlers_empty;

      ASSERT(cur);
-    cpumask_clear(&mask);
-
      idlers_empty = cpumask_empty(prv->idlers);

      /*
-     * If the pcpu is idle, or there are no idlers and the new
-     * vcpu is a higher priority than the old vcpu, run it here.
-     *
-     * If there are idle cpus, first try to find one suitable to run
-     * new, so we can avoid preempting cur.  If we cannot find a
-     * suitable idler on which to run new, run it here, but try to
-     * find a suitable idler on which to run cur instead.
+     * Node and vcpu-affinity balancing loop. To speed things up, in case
+     * no node-affinity at all is present, scratch_balance_mask reflects
+     * the vcpu-affinity, and ret is -1, so that we then can quit the
+     * loop after only one step.
       */
-    if ( cur->pri == CSCHED_PRI_IDLE
-         || (idlers_empty && new->pri > cur->pri) )
+    for_each_csched_balance_step( balance_step )
      {
-        if ( cur->pri != CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_idlers_none);
-        cpumask_set_cpu(cpu, &mask);
-    }
-    else if ( !idlers_empty )
-    {
-        /* Check whether or not there are idlers that can run new */
-        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
+        int ret, new_idlers_empty;
+
+        cpumask_clear(&mask);

          /*
-         * If there are no suitable idlers for new, and it's higher
-         * priority than cur, ask the scheduler to migrate cur away.
-         * We have to act like this (instead of just waking some of
-         * the idlers suitable for cur) because cur is running.
+         * If the pcpu is idle, or there are no idlers and the new
+         * vcpu is a higher priority than the old vcpu, run it here.
           *
-         * If there are suitable idlers for new, no matter priorities,
-         * leave cur alone (as it is running and is, likely, cache-hot)
-         * and wake some of them (which is waking up and so is, likely,
-         * cache cold anyway).
+         * If there are idle cpus, first try to find one suitable to run
+         * new, so we can avoid preempting cur.  If we cannot find a
+         * suitable idler on which to run new, run it here, but try to
+         * find a suitable idler on which to run cur instead.
           */
-        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
+        if ( cur->pri == CSCHED_PRI_IDLE
+             || (idlers_empty && new->pri > cur->pri) )
          {
-            SCHED_STAT_CRANK(tickle_idlers_none);
-            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
-            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
-            SCHED_STAT_CRANK(migrate_kicked_away);
-            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+            if ( cur->pri != CSCHED_PRI_IDLE )
+                SCHED_STAT_CRANK(tickle_idlers_none);
              cpumask_set_cpu(cpu, &mask);
          }
-        else if ( !cpumask_empty(&idle_mask) )
+        else if ( !idlers_empty )
          {
-            /* Which of the idlers suitable for new shall we wake up? */
-            SCHED_STAT_CRANK(tickle_idlers_some);
-            if ( opt_tickle_one_idle )
+            /* Are there idlers suitable for new (for this balance step)? */
+            ret = csched_balance_cpumask(new->vcpu, balance_step,
+                                         &scratch_balance_mask);
+            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
+            new_idlers_empty = cpumask_empty(&idle_mask);
+
+            /*
+             * Let's not be too harsh! If there aren't idlers suitable
+             * for new in its node-affinity mask, make sure we check its
+             * vcpu-affinity as well, before tacking final decisions.
+             */
+            if ( new_idlers_empty
+                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
+                continue;
+
+            /*
+             * If there are no suitable idlers for new, and it's higher
+             * priority than cur, ask the scheduler to migrate cur away.
+             * We have to act like this (instead of just waking some of
+             * the idlers suitable for cur) because cur is running.
+             *
+             * If there are suitable idlers for new, no matter priorities,
+             * leave cur alone (as it is running and is, likely, cache-hot)
+             * and wake some of them (which is waking up and so is, likely,
+             * cache cold anyway).
+             */
+            if ( new_idlers_empty && new->pri > cur->pri )
              {
-                this_cpu(last_tickle_cpu) =
-                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                SCHED_STAT_CRANK(tickle_idlers_none);
+                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+                SCHED_STAT_CRANK(migrate_kicked_away);
+                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+                cpumask_set_cpu(cpu, &mask);
              }
-            else
-                cpumask_or(&mask, &mask, &idle_mask);
+            else if ( !new_idlers_empty )
+            {
+                /* Which of the idlers suitable for new shall we wake up? */
+                SCHED_STAT_CRANK(tickle_idlers_some);
+                if ( opt_tickle_one_idle )
+                {
+                    this_cpu(last_tickle_cpu) =
+                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                }
+                else
+                    cpumask_or(&mask, &mask, &idle_mask);
+            }
          }
+
+        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
+        if ( !cpumask_empty(&mask) || ret )
+            break;
      }

The whole logic here is really convoluted and hard to read. For example, if cur->pri==IDLE, then you will always just break of the loop after the first iteration. In that case, why have the if() inside the loop to begin with? And if idlers_empty is true but cur->pri >= new->pri, then you'll go through the loop two times, even though both times it will come up empty. And, of course, the whole thing about the node affinity mask being checked inside csched_balance_cpumask(), but not used until the very end.

A much more straighforward way to arrange it would be:

if(cur->pri=IDLE &c &c)
{
  foo;
}
else if(!idlers_empty)
{
  if(cpumask_full(sdom->node_affinity_cpumask)
    balance_step=CSCHED_BALANCE_CPU_AFFINITY;
  else
    balance_step=CSCHED_BALANCE_NODE_AFFINITY;

  for(; balance_step <= CSCHED_BALANCE_MAX; balance_step++)
 {
 ...
 }
}

That seems a lot clearer to me -- does that make sense?

[To be continued...]

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.