
Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity



On Thu, Dec 20, 2012 at 4:56 PM, George Dunlap
<george.dunlap@xxxxxxxxxxxxx> wrote:
>> This change modifies the VCPU load balancing algorithm (for the
>> credit scheduler only), introducing a two steps logic.
>> During the first step, we use the node-affinity mask. The aim is
>> giving precedence to the CPUs where it is known to be preferable
>> for the domain to run. If that fails in finding a valid PCPU, the
>> node-affinity is just ignored and, in the second step, we fall
>> back to using cpu-affinity only.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
>
>
> This one has a lot of structural comments; so I'm going to send a couple of
> different mails as I'm going through it, so we can parallelize the discussion
> better. :-)
>
Ok.

>> +#define for_each_csched_balance_step(__step) \
>> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
>
>
> Why are we starting at the top and going down?  Is there any good reason for
> it?
>
You're totally right. It looked like this in the very first RFC, when the
whole set of macros was different. I changed the macros later but never
reconsidered the direction of the loop; I agree that counting up would be
more natural.

> So why not just have this be as follows?
>
> for(step=0; step<CSCHED_BALANCE_MAX; step++)
>
Will do.
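
Something like this, I guess (just a sketch: the CSCHED_BALANCE_* names
are the ones from this series, renumbered so that the loop can count up;
the exact values are of course still up for discussion):

#define CSCHED_BALANCE_NODE_AFFINITY    0
#define CSCHED_BALANCE_CPU_AFFINITY     1
#define CSCHED_BALANCE_MAX              2

/* Walk the balancing steps from the most to the least restrictive one. */
#define for_each_csched_balance_step(step) \
    for ( (step) = 0; (step) < CSCHED_BALANCE_MAX; (step)++ )

That way the node-affinity step naturally comes first and cpu-affinity
acts as the fallback, which is what the changelog describes.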

>> +static int
>> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
>> +{
>> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
>> +    {
>> +        struct domain *d = vc->domain;
>> +        struct csched_dom *sdom = CSCHED_DOM(d);
>> +
>> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
>> +
>> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
>> +            return -1;
>
>
> There's no optimization in having this comparison done here.  You're not
> reading something from a local variable that you've just calculated.  But
> hiding this comparison inside this function, and disguising it as "returns
> -1", does increase the cognitive load on anybody trying to read and
> understand the code -- particularly, as how the return value is used is not
> really clear.
>
Yes, I agree again that this is ugly. The previous version was better from
the caller's point of view, but it had other downsides (IIRC it was fiddling
with the loop control variable directly), so we agreed on this '-1'
convention.

> Also, when you use this value, effectively what you're doing is saying,
> "Actually, we just said we were doing the NODE_BALANCE step, but it turns
> out that the results of NODE_BALANCE and CPU_BALANCE will be the same, so
> we're just going to pretend that we've been doing the CPU_BALANCE step
> instead."  (See for example, "balance_step == CSCHED_BALANCE_NODE_AFFINITY
> && !ret" -- why the !ret in this clause?  Because if !ret then we're not
> actually doing NODE_AFFINITY now, but CPU_AFFINITY.)  Another non-negligible
> chunk of cognitive load for someone reading the code to 1) figure out, and
> 2) keep in mind as she tries to analyze it.
>
Totally agreed. :-)

> I took a look at all the places which use this return value, and it seems
> like the best thing in each case would just be to have the *caller*, before
> getting into the loop, call cpumask_full(sdom->node_affinity_cpumask) and
> just skip the CSCHED_NODE_BALANCE step altogether if it's true.  (Example
> below.)
>
I will think about it and see if I can find a nice way of making that
happen. The point is that I'm not sure I like exposing and spreading that
cpumask_full() check around too much, but I guess I can hide it behind some
more macro-ing.
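
To give an idea of what I have in mind (only a sketch, and
__vcpu_has_node_affinity() is just a name I'm making up here):

static void
csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
{
    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
    {
        struct csched_dom *sdom = CSCHED_DOM(vc->domain);

        /* Just build the mask for this step, no special return value. */
        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
    }
    else /* CSCHED_BALANCE_CPU_AFFINITY */
        cpumask_copy(mask, vc->cpu_affinity);
}

/* Hypothetical helper hiding the cpumask_full() check from callers. */
#define __vcpu_has_node_affinity(vc) \
    ( !cpumask_full(CSCHED_DOM((vc)->domain)->node_affinity_cpumask) )

Callers would then just do something like

    if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
         && !__vcpu_has_node_affinity(vc) )
        continue;

at the top of the balancing loop, and the '-1' return value disappears.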

>> @@ -266,67 +332,94 @@ static inline void
>>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>>       cpumask_t mask, idle_mask;
>> -    int idlers_empty;
>> +    int balance_step, idlers_empty;
>>
>>       ASSERT(cur);
>> -    cpumask_clear(&mask);
>> -
>>       idlers_empty = cpumask_empty(prv->idlers);
>>
>>       /*
>> -     * If the pcpu is idle, or there are no idlers and the new
>> -     * vcpu is a higher priority than the old vcpu, run it here.
>> -     *
>> -     * If there are idle cpus, first try to find one suitable to run
>> -     * new, so we can avoid preempting cur.  If we cannot find a
>> -     * suitable idler on which to run new, run it here, but try to
>> -     * find a suitable idler on which to run cur instead.
>> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
>> +     * no node-affinity at all is present, scratch_balance_mask reflects
>> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
>> +     * loop after only one step.
>>        */
>> [snip]
>
> The whole logic here is really convoluted and hard to read.  For example, if
> cur->pri==IDLE, then you will always just break of the loop after the first
> iteration.  In that case, why have the if() inside the loop to begin with?
> And if idlers_empty is true but cur->pri >= new->pri, then you'll go through
> the loop two times, even though both times it will come up empty.  And, of
> course, the whole thing about the node affinity mask being checked inside
> csched_balance_cpumask(), but not used until the very end.
>
I fear it looks convoluted and complex because, well, it _is_ quite complex!
However, I see your point: even if the logic is inherently complex, there
are certainly ways to keep it from looking more complex than it needs to. :-)

> A much more straighforward way to arrange it would be:
>
> if(cur->pri == IDLE &c &c)
> {
>   foo;
> }
> else if(!idlers_empty)
> {
>   if(cpumask_full(sdom->node_affinity_cpumask)
>     balance_step=CSCHED_BALANCE_CPU_AFFINITY;
>   else
>     balance_step=CSCHED_BALANCE_NODE_AFFINITY;
>
Yes, I think this can be taken out. I'll give this some more thought and
see if I can simplify the flow.
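
Just to check that I'm reading your suggestion correctly, the tickling
path would become something like this (a rough sketch only, with the
surrounding code elided and your "&c &c" conditions left out; the names
are the ones from the quoted hunk, so treat the details as tentative):

if ( cur->pri == CSCHED_PRI_IDLE /* &c &c */ )
{
    /* The pcpu is idle: just tickle it, no balancing loop needed. */
    cpumask_set_cpu(cpu, &mask);
}
else if ( !idlers_empty )
{
    struct csched_dom *sdom = CSCHED_DOM(new->vcpu->domain);

    /* Skip the node-affinity step up front if it cannot help. */
    if ( cpumask_full(sdom->node_affinity_cpumask) )
        balance_step = CSCHED_BALANCE_CPU_AFFINITY;
    else
        balance_step = CSCHED_BALANCE_NODE_AFFINITY;

    for ( ; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        csched_balance_cpumask(new->vcpu, balance_step, &mask);
        cpumask_and(&idle_mask, prv->idlers, &mask);
        if ( !cpumask_empty(&idle_mask) )
            break;
    }
}

with the actual tickling of the cpus in mask/idle_mask happening after
this block, as it does now.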

> [To be continued...]
>
Thanks for now :-)
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

