Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity

On 19/12/12 19:07, Dario Faggioli wrote:

+static inline int
+__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
+{
+    /*
+     * Consent to migration if cpu is one of the idlers in the VCPU's
+     * affinity mask. In fact, if that is not the case, it just means it
+     * was some other CPU that was tickled and should hence come and pick
+     * VCPU up. Migrating it to cpu would only make things worse.
+     */
+    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
 }

 static int
@@ -493,85 +599,98 @@ static int
     cpumask_t idlers;
     cpumask_t *online;
     struct csched_pcpu *spc = NULL;
+    int ret, balance_step;
     int cpu;

-    /*
-     * Pick from online CPUs in VCPU's affinity mask, giving a
-     * preference to its current processor if it's in there.
-     */
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
-    cpumask_and(&cpus, online, vc->cpu_affinity);
-    cpu = cpumask_test_cpu(vc->processor, &cpus)
-            ? vc->processor
-            : cpumask_cycle(vc->processor, &cpus);
-    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+    for_each_csched_balance_step( balance_step )
+    {
+        /* Pick an online CPU from the proper affinity mask */
+        ret = csched_balance_cpumask(vc, balance_step, &cpus);
+        cpumask_and(&cpus, &cpus, online);

-    /*
-     * Try to find an idle processor within the above constraints.
-     *
-     * In multi-core and multi-threaded CPUs, not all idle execution
-     * vehicles are equal!
-     *
-     * We give preference to the idle execution vehicle with the most
-     * idling neighbours in its grouping. This distributes work across
-     * distinct cores first and guarantees we don't do something stupid
-     * like run two VCPUs on co-hyperthreads while there are idle cores
-     * or sockets.
-     *
-     * Notice that, when computing the "idleness" of cpu, we may want to
-     * discount vc. That is, iff vc is the currently running and the only
-     * runnable vcpu on cpu, we add cpu to the idlers.
-     */
-    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
-    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
-        cpumask_set_cpu(cpu, &idlers);
-    cpumask_and(&cpus, &cpus, &idlers);
-    cpumask_clear_cpu(cpu, &cpus);
+        /* If present, prefer vc's current processor */
+        cpu = cpumask_test_cpu(vc->processor, &cpus)
+                ? vc->processor
+                : cpumask_cycle(vc->processor, &cpus);
+        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );

-    while ( !cpumask_empty(&cpus) )
-    {
-        cpumask_t cpu_idlers;
-        cpumask_t nxt_idlers;
-        int nxt, weight_cpu, weight_nxt;
-        int migrate_factor;
+        /*
+         * Try to find an idle processor within the above constraints.
+         *
+         * In multi-core and multi-threaded CPUs, not all idle execution
+         * vehicles are equal!
+         *
+         * We give preference to the idle execution vehicle with the most
+         * idling neighbours in its grouping. This distributes work across
+         * distinct cores first and guarantees we don't do something stupid
+         * like run two VCPUs on co-hyperthreads while there are idle cores
+         * or sockets.
+         *
+         * Notice that, when computing the "idleness" of cpu, we may want to
+         * discount vc. That is, iff vc is the currently running and the only
+         * runnable vcpu on cpu, we add cpu to the idlers.
+         */
+        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
+        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
+            cpumask_set_cpu(cpu, &idlers);
+        cpumask_and(&cpus, &cpus, &idlers);
+        /* If there are idlers and cpu is still not among them, pick one */
+        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
+            cpu = cpumask_cycle(cpu, &cpus);

This seems to be an addition to the algorithm -- particularly hidden in
this kind of "indent a big section that's almost exactly the same" change,
I think this at least needs to be called out in the changelog message, and
perhaps put in a separate patch.

Can you comment on why you think it's necessary?  Was there a particular
problem you were seeing?
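
(Just so the new bit is easy to see amid the re-indentation: as far as I
can tell, the logic being added in this hunk boils down to the two lines
below.  The comment is mine, restating my reading of them.)

    /* At this point &cpus is the set of idlers in this step's affinity
     * mask.  If that set is non-empty but the candidate cpu is not itself
     * one of those idlers, take one of the idlers as the candidate instead
     * of keeping the (busy) current processor. */
    if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
        cpu = cpumask_cycle(cpu, &cpus);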

+        cpumask_clear_cpu(cpu, &cpus);

-        nxt = cpumask_cycle(cpu, &cpus);
+        while ( !cpumask_empty(&cpus) )
+        {
+            cpumask_t cpu_idlers;
+            cpumask_t nxt_idlers;
+            int nxt, weight_cpu, weight_nxt;
+            int migrate_factor;

-        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
-        {
-            /* We're on the same socket, so check the busy-ness of threads.
-             * Migrate if # of idlers is less at all */
-            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 1;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
-        }
-        else
-        {
-            /* We're on different sockets, so check the busy-ness of cores.
-             * Migrate only if the other core is twice as idle */
-            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 2;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            nxt = cpumask_cycle(cpu, &cpus);
+
+            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
+            {
+                /* We're on the same socket, so check the busy-ness of threads.
+                 * Migrate if # of idlers is less at all */
+                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 1;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                                                          cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                                                          nxt));
+            }
+            else
+            {
+                /* We're on different sockets, so check the busy-ness of cores.
+                 * Migrate only if the other core is twice as idle */
+                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 2;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            }
+
+            weight_cpu = cpumask_weight(&cpu_idlers);
+            weight_nxt = cpumask_weight(&nxt_idlers);
+            /* smt_power_savings: consolidate work rather than spreading it */
+            if ( sched_smt_power_savings ?
+                 weight_cpu > weight_nxt :
+                 weight_cpu * migrate_factor < weight_nxt )
+            {
+                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
+                spc = CSCHED_PCPU(nxt);
+                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
+                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
+            }
+            else
+            {
+                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
+            }
         }

-        weight_cpu = cpumask_weight(&cpu_idlers);
-        weight_nxt = cpumask_weight(&nxt_idlers);
-        /* smt_power_savings: consolidate work rather than spreading it */
-        if ( sched_smt_power_savings ?
-             weight_cpu > weight_nxt :
-             weight_cpu * migrate_factor < weight_nxt )
-        {
-            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
-            spc = CSCHED_PCPU(nxt);
-            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
-            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
-        }
-        else
-        {
-            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
-        }
+        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
+        if ( cpumask_test_cpu(cpu, &idlers) || ret )
+            break;

Right -- OK, I think everything looks good here, except the "return -1
from csched_balance_cpumask" thing.  I think it would be better if we
explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped
the NODE step if that's the case.
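
Something like the following is the kind of thing I mean -- completely
untested, and I'm guessing at the names from this series
(CSCHED_BALANCE_NODE_AFFINITY for the node step, CSCHED_DOM() to reach the
csched_dom and its node_affinity_cpumask field), so treat it purely as a
sketch:

    /* Sketch only: skip the node-affinity step up front when the domain's
     * node affinity mask is full (i.e. it covers every cpu), instead of
     * having csched_balance_cpumask() report that via a special return
     * value. */
    for_each_csched_balance_step( balance_step )
    {
        if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
             && cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
            continue;

        /* ... rest of the loop body as in the patch ... */
    }

That would keep the special case visible at the top of the loop, rather
than buried in csched_balance_cpumask()'s return value.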

Also -- and sorry to have to ask this kind of thing, but after sorting
through the placement algorithm my head hurts -- under what circumstances
would "cpumask_test_cpu(cpu, &idlers)" be false at this point?  It seems
like the only possibility would be if:

  ( (vc->processor was not in the original &cpus [1])
    || !IS_RUNQ_IDLE(vc->processor) )
  && (there are no idlers in the original &cpus)

Which I suppose probably matches the time when we want to move on from
looking at NODE affinity and look for CPU affinity.

[1] This could happen either if the vcpu/node affinity has changed, or if
we're currently running outside our node affinity and we're doing the NODE
step.

OK -- I think I've convinced myself that this is OK as well (apart from
the hidden check).  I'll come back to look at your response to the load
balancing thing tomorrow.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel