
Re: [Xen-devel] [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`



On 11/04/12 14:17, Dario Faggioli wrote:
The basic idea is: cpu-affinity tells where a vcpu can run,
while node-affinity tells where (in terms of NUMA nodes, which
of course implies a set of CPUs) all the vcpus of the domain
should/prefer to run... Problems start when you have both of
them speaking at the same time!

Respecting vcpu-affinity should remain the primary concern, thus
what we do here is add some two-step logic here and there in the
scheduler (i.e., wherever vcpu-affinity related decisions are
made). During the first step, we check the `&&` of the vcpu and
node affinity masks, aiming at giving some precedence to the CPUs
that are both suitable and preferable for the domain. However,
if that fails to find a valid CPU, the node-affinity is just
ignored and, in the second step, we simply fall back to
vcpu-affinity, which is the original behaviour.
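
In code, the two-step check boils down to something like this (only a
sketch; node_affinity_cpumask here stands for the domain's node-affinity
translated into a cpumask):

    cpumask_t cpus;

    /* Step 1: consider only the CPUs that are both in the vcpu's
     * cpu-affinity and in the domain's node-affinity. */
    cpumask_and(&cpus, vc->cpu_affinity, &sdom->node_affinity_cpumask);

    /* Step 2: if that leaves nothing, ignore node-affinity and fall
     * back to plain vcpu-affinity, as before. */
    if ( cpumask_empty(&cpus) )
        cpumask_copy(&cpus, vc->cpu_affinity);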

Both the introduced overhead and the benefits this provides have
to be investigated and weighed against each other, possibly under
varying conditions and running different workloads.

Finally, although still at the RFC level, efforts have been made
to write the code in a flexible enough fashion so that, if we
ever want to introduce further balancing steps, it shouldn't be
too much of a pain.

TODO: * Better investigate and test interactions with cpupools.
       * Test, verify and benchmark. Then test, verify and benchmark
         again, and again and again. What I know as of now is that
         it does not explode that easily, but whether it works properly
         and, more importantly, brings any benefit has yet to be
         proven (and yes, I'm already out running these tests and
         benchmarks, but do not hesitate to volunteer to help with
         that :-D).
       * Add some counters/stats, e.g., to serve the purpose outlined
         right above.

XXX I'm benchmarking this right now and, peeking at the results,
     they don't seem too bad. Numbers are mostly close to the case where
     the VM is already pinned to a node when created. I'll post the
     results as soon as I can, and I'll be happy to investigate any issue,
     if you feel like the approach could be the right one.

Hey Dario,

Sorry for the long delay in reviewing this.

Overall I think the approach is good.  A few points:
  /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_NODE_AFFINITY    1
+#define CSCHED_BALANCE_CPU_AFFINITY     0
+#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
+#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY
+
+
This thing of defining "START_STEP" and "END_STEP" seems a bit fragile. I think it would be better to always start at 0, and go until CSCHED_BALANCE_MAX.
+    /*
+     * Let's cache domain's dom->node_affinity here as an
+     * optimization for a couple of hot paths. In fact,
+     * knowing whether or not dom->node_affinity has changed
+     * would allow us to avoid rebuilding node_affinity_cpumask
+     * (below) during node balancing and/or scheduling.
+     */
+    nodemask_t node_affinity_cache;
+    /* Based on what dom->node_affinity says,
+     * on what CPUs would we like to run most? */
+    cpumask_t node_affinity_cpumask;
I think the comments here need to be more clear.  The main points are:
* node_affinity_cpumask is the dom->node_affinity translated from a nodemask into a cpumask
* Because doing the nodemask -> cpumask translation may be expensive, node_affinity_cache stores the last translated value, so we can avoid doing the translation if nothing has changed.
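
Something along these lines, maybe (just a sketch of the comment wording):

    /*
     * node_affinity_cpumask is dom->node_affinity translated from a
     * nodemask into a cpumask: the CPUs we would prefer the domain's
     * vcpus to run on.
     *
     * Since the nodemask -> cpumask translation may be expensive,
     * node_affinity_cache remembers the last dom->node_affinity that
     * was translated, so node_affinity_cpumask only needs rebuilding
     * when the cached value and dom->node_affinity differ.
     */
    nodemask_t node_affinity_cache;
    cpumask_t node_affinity_cpumask;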

+#define csched_balance(__step) \
+    for ( (__step) = CSCHED_BALANCE_START_STEP; \
+          (__step) >= CSCHED_BALANCE_END_STEP; \
+          (__step)-- )
I don't like this. For one, it's fragile; if for some reason you switched NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop would go the wrong direction.

I don't think there's any reason not to just use "for(step=0; step < CSCHED_BALANCE_MAX; step++)" -- it's not ugly and it means you know exactly what's going on without having to go search for the macro.
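
Concretely, I'm thinking of something like this (just a sketch, names illustrative):

    /* One balancing step per affinity level, tried in this order. */
    #define CSCHED_BALANCE_NODE_AFFINITY    0
    #define CSCHED_BALANCE_CPU_AFFINITY     1
    #define CSCHED_BALANCE_MAX              2

    int balance_step;
    cpumask_t cpus;

    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        csched_balance_cpumask(vc, balance_step, &cpus);
        /* ... look for a suitable CPU in cpus, break out if one is found ... */
    }

That way, reordering the two defines can never make the loop run in the
wrong direction.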

+
+/*
+ * Sort-of conversion between node-affinity and vcpu-affinity for the domain,
+ * i.e., a cpumask containing all the cpus from all the set nodes in the
+ * node-affinity mask of the domain.
This needs to be clearer -- vcpu-affinity doesn't have anything to do with this function, and there's nothing "sort-of" about the conversion. :-)

I think you mean to say, "Create a cpumask from the node affinity mask."
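
For reference, the conversion itself would then be roughly (a sketch only;
I'm assuming for_each_node_mask() and node_to_cpumask() are the helpers used):

    /* Create a cpumask from the domain's node affinity mask. */
    static void
    csched_build_balance_cpumask(struct csched_dom *sdom)
    {
        unsigned int node;

        cpumask_clear(&sdom->node_affinity_cpumask);
        for_each_node_mask ( node, sdom->dom->node_affinity )
        {
            /* All the CPUs belonging to this node... */
            cpumask_t node_cpus = node_to_cpumask(node);

            cpumask_or(&sdom->node_affinity_cpumask,
                       &sdom->node_affinity_cpumask, &node_cpus);
        }
    }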
+ *
+ * Notice that we completely forget about serializing this (both here and
+ * in the various sites where node_affinity_cpumask is used) against
+ * d->node_affinity_lock. That would be hard to do properly, as that lock
+ * is a non IRQ-safe one, and it would be nearly impossible to access it
+ * from the scheduling code. However, although this leaves some room for
+ * races with code paths that update d->node_affinity, it all should still
+ * be fine, considering:
+ *  (1) d->node_affinity updates are going to be quite rare;
+ *  (2) this balancing logic is kind-of "best effort" anyway;
+ *  (3) given (1), any inconsistencies are likely to be fixed by the next
+ *      call to this same function without risking going into instability.
+ *
+ * XXX Validate (via testing/benchmarking) whether this is true, as it
+ *     likely is, or whether it causes unexpected issues.
Ack.
+/* Put together the cpumask we are going to use for this balancing step */
+static int
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    struct domain *d = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(d);
+
+    /*
+     * Only two possible steps exist for now: node and vcpu affinity.
+     * If this domain does not have a node-affinity or is affine to all the
+     * nodes, just return the vcpu-affinity mask (for *this* vcpu) and
+     * stop the balancing process.
+     */
+    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
+         step == CSCHED_BALANCE_CPU_AFFINITY )
+    {
+        cpumask_copy(mask, vc->cpu_affinity);
+        return CSCHED_BALANCE_CPU_AFFINITY;
+    }
Hmm -- this mechanism seems kind of fragile. It seems like returning -1 or something, and having the caller call "continue", would be easier for future coders (potentially you or I) to reason about.
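
Something like this is what I have in mind (sketch only):

    /* In csched_balance_cpumask(): nothing node-specific to do? */
    if ( step == CSCHED_BALANCE_NODE_AFFINITY &&
         (!d->has_node_affinity || nodes_full(d->node_affinity)) )
        return -1;

    /* ... otherwise build and return the mask for this step ... */

    /* And in the callers: */
    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        if ( csched_balance_cpumask(vc, balance_step, &cpus) < 0 )
            continue;   /* this step adds no information, try the next one */

        /* ... look for a CPU in cpus ... */
    }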
+    /*
+     * XXX As of now (with only two possible steps), this is easy and readable
+     *     enough. If more steps become necessary at some point in time, we
+     *     can keep an array of cpumask_t in dom_data and return the proper
+     *     element via step-indexing such an array. Of course, we could do
+     *     it that way even right now... Thoughts?
+     */
Beware of early optimization. :-) Just make sure to mark this down for something to look at for profiling.
  static inline void
+__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
+{
+    CSCHED_STAT_CRANK(tickle_idlers_some);
+    if ( opt_tickle_one_idle )
+    {
+        this_cpu(last_tickle_cpu) =
+            cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
+        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
+    }
+    else
+        cpumask_or(mask, mask, idle_mask);
+}
I don't see any reason to make this into a function -- it's only called once, and it's not that long. Unless you're concerned about too many indentations making the lines too short?
+            csched_balance(balance_step)
              {
Again, I'd make this a for() loop.
     sdom->dom = dom;
+    /*
+     *XXX This would be 'The Right Thing', but as it is still too
+     *    early and d->node_affinity has not settled yet, maybe we
+     *    can just init the two masks with something like all-nodes
+     *    and all-cpus and rely on the first balancing call for
+     *    having them updated?
+     */
+    csched_build_balance_cpumask(sdom);
We might as well do what you've got here, unless it's likely to produce garbage. This isn't exactly a hot path. :-)
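
And if it does turn out to produce garbage, the all-nodes / all-cpus
fallback you describe in the XXX should only be a couple of lines anyway
(a sketch, assuming the nodes_setall() / cpumask_setall() helpers):

    /* Start completely unrestricted; the first balancing pass will
     * narrow things down once d->node_affinity has settled. */
    nodes_setall(sdom->node_affinity_cache);
    cpumask_setall(&sdom->node_affinity_cpumask);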

Other than that, looks good -- Thanks!

 -George



 

