[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity

To: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
Date: Thu, 20 Dec 2012 15:56:39 +0000
Cc: Marcus Granado <Marcus.Granado@xxxxxxxxxxxxx>, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Anil Madhavapeddy <anil@xxxxxxxxxx>, Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Daniel De Graaf <dgdegra@xxxxxxxxxxxxx>, Matt Wilson <msw@xxxxxxxxxx>
Delivery-date: Thu, 20 Dec 2012 16:03:35 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 19/12/12 19:07, Dario Faggioli wrote:

As vcpu-affinity tells where VCPUs must run, node-affinity tells
where they should or, better, prefer. While respecting vcpu-affinity
remains mandatory, node-affinity is not that strict, it only expresses
a preference, although honouring it is almost always true that will
bring significant performances benefit (especially as compared to
not having any affinity at all).

This change modifies the VCPU load balancing algorithm (for the
credit scheduler only), introducing a two steps logic.
During the first step, we use the node-affinity mask. The aim is
giving precedence to the CPUs where it is known to be preferable
for the domain to run. If that fails in finding a valid PCPU, the
node-affinity is just ignored and, in the second step, we fall
back to using cpu-affinity only.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

This one has a lot of structural comments; so I'm going to send a coupleof different mails as I'm going through it, so we can parallize thediscussion better. :-)

---
Changes from v1:
  * CPU masks variables moved off from the stack, as requested during
    review. As per the comments in the code, having them in the private
    (per-scheduler instance) struct could have been enough, but it would be
    racy (again, see comments). For that reason, use a global bunch of
    them of (via per_cpu());
  * George suggested a different load balancing logic during v1's review. I
    think he was right and then I changed the old implementation in a way
    that resembles exactly that. I rewrote most of this patch to introduce
    a more sensible and effective noda-affinity handling logic.

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -111,6 +111,33 @@


  /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_CPU_AFFINITY     0
+#define CSCHED_BALANCE_NODE_AFFINITY    1
+#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY

[snip]

+#define for_each_csched_balance_step(__step) \
+    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )

Why are we starting at the top and going down? Is there any good reasonfor it?

Every time you do anything unexpected, you add to the cognitive load ofthe person reading your code, leaving less spare processing power ormemory for other bits of the code, and increasing (sligthly) the chanceof making a mistake. The most natural thing would be for someone toexpect that the steps start at 0 and go up; just reversing it means it'sthat little bit harder to understand. When you name it "LAST", it'seven worse, because that would definitely imply that this step is goingto be executed last.


So why not just have this be as follows?

for(step=0; step<CSCHED_BALANCE_MAX; step++)

+
+/*
+ * Each csched-balance step has to use its own cpumask. This function
+ * determines which one, given the step, and copies it in mask. Notice
+ * that, in case of node-affinity balancing step, it also filters out from
+ * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
+ * as we do not want to end up running a vcpu where it would like, but
+ * is not allowed to!
+ *
+ * As an optimization, if a domain does not have any node-affinity at all
+ * (namely, its node affinity is automatically computed), not only the
+ * computed mask will reflect its vcpu-affinity, but we also return -1 to
+ * let the caller know that he can skip the step or quit the loop (if he
+ * wants).
+ */
+static int
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+    {
+        struct domain *d = vc->domain;
+        struct csched_dom *sdom = CSCHED_DOM(d);
+
+        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
+
+        if ( cpumask_full(sdom->node_affinity_cpumask) )
+            return -1;

There's no optimization in having this comparison done here. You're notreading something from a local variable that you've just calculated.But hiding this comparison inside this function, and disguising it as"returns -1", does increase the cognitive load on anybody trying to readand understand the code -- particularly, as how the return value is usedis not really clear.

Also, when you use this value, effectively what you're doing is saying,"Actually, we just said we were doing the NODE_BALANCE step, but itturns out that the results of NODE_BALANCE and CPU_BALANCE will be thesame, so we're just going to pretend that we've been doing theCPU_BALANCE step instead." (See for example, "balance_step ==CSCHED_BALANCE_NODE_AFFINITY && !ret" -- why the !ret in this clause?Because if !ret then we're not actually doing NODE_AFFINITY now, butCPU_AFFINITY.) Another non-negligible chunk of cognitive load forsomeone reading the code to 1) figure out, and 2) keep in mind as shetries to analyze it.

I took a look at all the places which use this return value, and itseems like the best thing in each case would just be to have the*caller*, before getting into the loop, callcpumask_full(sdom->node_affinity_cpumask) and just skip theCSCHED_NODE_BALANCE step altogether if it's true. (Example below.)

@@ -266,67 +332,94 @@ static inline void
      struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
      struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
      cpumask_t mask, idle_mask;
-    int idlers_empty;
+    int balance_step, idlers_empty;

      ASSERT(cur);
-    cpumask_clear(&mask);
-
      idlers_empty = cpumask_empty(prv->idlers);

      /*
-     * If the pcpu is idle, or there are no idlers and the new
-     * vcpu is a higher priority than the old vcpu, run it here.
-     *
-     * If there are idle cpus, first try to find one suitable to run
-     * new, so we can avoid preempting cur.  If we cannot find a
-     * suitable idler on which to run new, run it here, but try to
-     * find a suitable idler on which to run cur instead.
+     * Node and vcpu-affinity balancing loop. To speed things up, in case
+     * no node-affinity at all is present, scratch_balance_mask reflects
+     * the vcpu-affinity, and ret is -1, so that we then can quit the
+     * loop after only one step.
       */
-    if ( cur->pri == CSCHED_PRI_IDLE
-         || (idlers_empty && new->pri > cur->pri) )
+    for_each_csched_balance_step( balance_step )
      {
-        if ( cur->pri != CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_idlers_none);
-        cpumask_set_cpu(cpu, &mask);
-    }
-    else if ( !idlers_empty )
-    {
-        /* Check whether or not there are idlers that can run new */
-        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
+        int ret, new_idlers_empty;
+
+        cpumask_clear(&mask);

          /*
-         * If there are no suitable idlers for new, and it's higher
-         * priority than cur, ask the scheduler to migrate cur away.
-         * We have to act like this (instead of just waking some of
-         * the idlers suitable for cur) because cur is running.
+         * If the pcpu is idle, or there are no idlers and the new
+         * vcpu is a higher priority than the old vcpu, run it here.
           *
-         * If there are suitable idlers for new, no matter priorities,
-         * leave cur alone (as it is running and is, likely, cache-hot)
-         * and wake some of them (which is waking up and so is, likely,
-         * cache cold anyway).
+         * If there are idle cpus, first try to find one suitable to run
+         * new, so we can avoid preempting cur.  If we cannot find a
+         * suitable idler on which to run new, run it here, but try to
+         * find a suitable idler on which to run cur instead.
           */
-        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
+        if ( cur->pri == CSCHED_PRI_IDLE
+             || (idlers_empty && new->pri > cur->pri) )
          {
-            SCHED_STAT_CRANK(tickle_idlers_none);
-            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
-            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
-            SCHED_STAT_CRANK(migrate_kicked_away);
-            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+            if ( cur->pri != CSCHED_PRI_IDLE )
+                SCHED_STAT_CRANK(tickle_idlers_none);
              cpumask_set_cpu(cpu, &mask);
          }
-        else if ( !cpumask_empty(&idle_mask) )
+        else if ( !idlers_empty )
          {
-            /* Which of the idlers suitable for new shall we wake up? */
-            SCHED_STAT_CRANK(tickle_idlers_some);
-            if ( opt_tickle_one_idle )
+            /* Are there idlers suitable for new (for this balance step)? */
+            ret = csched_balance_cpumask(new->vcpu, balance_step,
+                                         &scratch_balance_mask);
+            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
+            new_idlers_empty = cpumask_empty(&idle_mask);
+
+            /*
+             * Let's not be too harsh! If there aren't idlers suitable
+             * for new in its node-affinity mask, make sure we check its
+             * vcpu-affinity as well, before tacking final decisions.
+             */
+            if ( new_idlers_empty
+                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
+                continue;
+
+            /*
+             * If there are no suitable idlers for new, and it's higher
+             * priority than cur, ask the scheduler to migrate cur away.
+             * We have to act like this (instead of just waking some of
+             * the idlers suitable for cur) because cur is running.
+             *
+             * If there are suitable idlers for new, no matter priorities,
+             * leave cur alone (as it is running and is, likely, cache-hot)
+             * and wake some of them (which is waking up and so is, likely,
+             * cache cold anyway).
+             */
+            if ( new_idlers_empty && new->pri > cur->pri )
              {
-                this_cpu(last_tickle_cpu) =
-                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                SCHED_STAT_CRANK(tickle_idlers_none);
+                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+                SCHED_STAT_CRANK(migrate_kicked_away);
+                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+                cpumask_set_cpu(cpu, &mask);
              }
-            else
-                cpumask_or(&mask, &mask, &idle_mask);
+            else if ( !new_idlers_empty )
+            {
+                /* Which of the idlers suitable for new shall we wake up? */
+                SCHED_STAT_CRANK(tickle_idlers_some);
+                if ( opt_tickle_one_idle )
+                {
+                    this_cpu(last_tickle_cpu) =
+                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                }
+                else
+                    cpumask_or(&mask, &mask, &idle_mask);
+            }
          }
+
+        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
+        if ( !cpumask_empty(&mask) || ret )
+            break;
      }

The whole logic here is really convoluted and hard to read. Forexample, if cur->pri==IDLE, then you will always just break of the loopafter the first iteration. In that case, why have the if() inside theloop to begin with? And if idlers_empty is true but cur->pri >=new->pri, then you'll go through the loop two times, even though bothtimes it will come up empty. And, of course, the whole thing about thenode affinity mask being checked inside csched_balance_cpumask(), butnot used until the very end.


A much more straighforward way to arrange it would be:

if(cur->pri=IDLE &c &c)
{
  foo;
}
else if(!idlers_empty)
{
  if(cpumask_full(sdom->node_affinity_cpumask)
    balance_step=CSCHED_BALANCE_CPU_AFFINITY;
  else
    balance_step=CSCHED_BALANCE_NODE_AFFINITY;

  for(; balance_step <= CSCHED_BALANCE_MAX; balance_step++)
 {
 ...
 }
}

That seems a lot clearer to me -- does that make sense?

[To be continued...]

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

Follow-Ups:
- Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  - From: Dario Faggioli

References:
- [Xen-devel] [PATCH 00 of 10 v2] NUMA aware credit scheduling
  - From: Dario Faggioli
- [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  - From: Dario Faggioli

Prev by Date: Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Next by Date: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
Previous by thread: Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Next by thread: Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.