On 02/03/17 10:38, Dario Faggioli wrote:
 
During load balancing, we check the non-idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.
If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).
On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue lock,
 - we figure out there's actually nothing to be stolen!
To mitigate this, we introduce here the concept of
overloaded runqueues, and a cpumask in which to record
which pCPUs are in that state.
An overloaded runqueue has at least 2 runnable vCPUs
(plus the idle one, which is always there). Typically,
this means 1 vCPU is running, and 1 is sitting in the
runqueue, and can hence be stolen.
Then, in csched_balance_load(), it is enough to go
over the overloaded pCPUs instead of all non-idle
pCPUs, which avoids the pointless lock attempts.
Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
---
Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Malcolm’s solution to this problem is
https://github.com/xenserver/xen-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966
This has been in 2 releases of XenServer now, and has a very visible
improvement for aggregate multi-queue multi-vm intrahost network
performance (although I can't find the numbers right now).
 
The root of the performance problems is that pcpu_schedule_trylock()
is expensive even for the local case, while cross-cpu locking is
much worse.  Locking every single pcpu in turn is terribly
expensive, in terms of hot cacheline pingpong, and the lock is
frequently contended.
 
As a first opinion of this patch, you are adding another cpumask
which is going to play hot cacheline pingpong.
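
To make the trade-off concrete, here is a rough, self-contained sketch in C of
the overloaded-mask idea. It is not the Xen code and not the patch: `struct
runq`, `runq_update_overloaded()`, `steal_from_overloaded()` and the fixed
64-CPU limit are made-up names and assumptions for illustration, and a plain
atomic flag stands in for the runqueue lock. The point is that the balancer
only attempts a trylock on pCPUs whose bit is set, but every transition in or
out of the overloaded state is a write to one shared word.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64 /* assumption: the whole mask fits in one 64-bit word */

/* Hypothetical per-pCPU runqueue; not Xen's real data structure. */
struct runq {
    atomic_bool locked;       /* stand-in for the runqueue spinlock */
    unsigned int nr_runnable; /* runnable vCPUs, not counting the idle one */
};

static struct runq runqs[NR_CPUS];

/* One bit per pCPU; set while that pCPU has at least 2 runnable vCPUs. */
static _Atomic uint64_t overloaded_mask;

/* Call with the runqueue "lock" held, after nr_runnable has changed. */
static void runq_update_overloaded(unsigned int cpu)
{
    uint64_t bit = UINT64_C(1) << cpu;

    /* Each of these is a write to a cacheline shared by all pCPUs. */
    if (runqs[cpu].nr_runnable >= 2)
        atomic_fetch_or(&overloaded_mask, bit);
    else
        atomic_fetch_and(&overloaded_mask, ~bit);
}

/*
 * Try to steal one vCPU on behalf of the idle pCPU 'self'.
 * Returns the pCPU stolen from, or -1 if nothing could be stolen.
 */
static int steal_from_overloaded(unsigned int self)
{
    uint64_t mask = atomic_load(&overloaded_mask) & ~(UINT64_C(1) << self);

    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        bool stolen = false;

        if (!(mask & (UINT64_C(1) << cpu)))
            continue; /* not overloaded: skip without touching its lock */

        if (atomic_exchange(&runqs[cpu].locked, true))
            continue; /* trylock failed: contended, move on */

        /* Re-check under the lock; the snapshot of the mask may be stale. */
        if (runqs[cpu].nr_runnable >= 2) {
            runqs[cpu].nr_runnable--;
            runq_update_overloaded(cpu);
            stolen = true;
        }

        atomic_store(&runqs[cpu].locked, false);

        if (stolen)
            return (int)cpu;
    }

    return -1;
}
```

With only pCPU 3 marked as overloaded, a call such as
`steal_from_overloaded(0)` attempts exactly one trylock rather than one per
non-idle pCPU. The cost is that `overloaded_mask` is modified every time any
runqueue crosses the 2-vCPU threshold, which is where the extra cacheline
traffic would come from.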
 
 ~Andrew
 