
Re: [Xen-devel] [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.



On Thu, 2017-03-02 at 11:06 +0000, Andrew Cooper wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > 
> > To mitigate this, we introduce here the concept of
> > overloaded runqueues, and a cpumask in which we record
> > which pCPUs are in such a state.
> > 
> > An overloaded runqueue has at least 2 runnable vCPUs
> > (plus the idle one, which is always there). Typically,
> > this means 1 vCPU is running, and 1 is sitting in the
> > runqueue, and can hence be stolen.
> > 
> > Then, in csched_balance_load(), it is enough to go
> > over the overloaded pCPUs, rather than over all the
> > non-idle pCPUs, which is better.
> > 
> Malcolm’s solution to this problem is
> https://github.com/xenserver/xen-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966
> This has
> been in 2 releases of XenServer now, and has a very visible
> improvement for aggregate multi-queue multi-vm intrahost network
> performance (although I can't find the numbers right now).
> 
> The root of the performance problems is that pcpu_schedule_trylock()
> is expensive even for the local case, while cross-cpu locking is much
> worse.  Locking every single pcpu in turn is terribly expensive, in
> terms of hot cacheline pingpong, and the lock is frequently
> contended.
> 
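Just to make the quoted changelog a bit more concrete, the idea boils
down to something like the following. This is only a simplified
sketch, not the actual patch: names like nr_runnable and
prv->overloaded are illustrative (and note the counter being bumped
inside __runq_insert(), which is exactly the part I come back to
below):

  /* Called with the runqueue lock of svc's pCPU held. */
  static inline void
  __runq_insert(struct csched_vcpu *svc)
  {
      unsigned int cpu = svc->vcpu->processor;
      struct csched_pcpu *spc = CSCHED_PCPU(cpu);

      list_add_tail(&svc->runq_elem, RUNQ(cpu));

      /*
       * 2 runnable vCPUs (the idle one does not count) means one is
       * running and at least one is waiting in the runqueue, i.e.,
       * there is work that other pCPUs could steal. Flag this pCPU
       * in a scheduler-wide "overloaded" mask (prv being the global
       * struct csched_private, assumed reachable from here).
       */
      if ( ++spc->nr_runnable >= 2 )
          cpumask_set_cpu(cpu, prv->overloaded);
  }

  /* Then, in the load balancing path, only the overloaded pCPUs get
   * scanned, instead of all the non-idle ones: */
  for_each_cpu ( peer_cpu, prv->overloaded )
  {
      /* try-lock peer_cpu's runqueue and attempt to steal from it */
  }
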
BTW, both my patch in this series and the patch linked above are
_wrong_ in using __runq_insert() and __runq_remove() for counting the
runnable vCPUs.

In fact, in Credit1, during the main scheduling function
(csched_schedule()), we call the runqueue insert helper to temporarily
put the running vCPU back in the runqueue. This increments the
counter, making all the other pCPUs think that there is a vCPU
available for stealing in there, while in fact:
1) that may not be true, if we end up choosing to run the same vCPU again;
2) even if it is true, they'll always fail on the trylock, at least
until we're out of csched_schedule(), as it holds the runqueue lock
itself.
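
To show what I mean, the relevant part of csched_schedule() looks,
very much simplified, like this (a sketch, not the exact code):

  /*
   * Inside csched_schedule(); the runqueue lock is held for the
   * whole duration of the function.
   */
  if ( vcpu_runnable(current) )
      /*
       * Put the running vCPU back in the runqueue, so that it is
       * considered, together with the others, when picking who runs
       * next. If the "runnable" counter is bumped here, remote pCPUs
       * now believe there is something to steal...
       */
      __runq_insert(scurr);
  else
      BUG_ON( is_idle_vcpu(current) );

  /*
   * ...but what we pick may well be scurr itself again and, in any
   * case, a remote pcpu_schedule_trylock() on this runqueue can only
   * fail until we are out of here.
   */
  snext = __runq_elem(runq->next);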

So, yeah, it's not really a matter of correctness, but it means there
is more overhead that can be cut.

In v2 of this series, which I'm about to send, I've "fixed" this
(i.e., I'm only modifying the counter when it is really necessary).
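
Concretely, the shape of the v2 change is something like this
(sketched with hypothetical helper names, the point being that the
counting is decoupled from __runq_insert()/__runq_remove()):

  static inline void inc_nr_runnable(unsigned int cpu)
  {
      CSCHED_PCPU(cpu)->nr_runnable++;
  }

  static inline void dec_nr_runnable(unsigned int cpu)
  {
      ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 1);
      CSCHED_PCPU(cpu)->nr_runnable--;
  }

These get called from the paths where a vCPU really becomes runnable
on, or really leaves, a pCPU (wakeup, sleep, migration), and not from
the temporary re-insertion done inside csched_schedule(), so remote
pCPUs no longer see the running vCPU as stealable work.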

> As a first opinion of this patch, you are adding another cpumask
> which is going to play hot cacheline pingpong.
> 
Yeah, well, despite liking the cpumask-based approach, I agree it's
overkill in this case. In v2, I got rid of it, and I am doing
something even closer to Malcolm's patch above.
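
I.e., roughly (again, just a sketch of the shape of it, with 'workers'
standing for the set of candidate peer pCPUs):

  /* In the stealing loop: don't even touch a peer's runqueue lock
   * unless it has something that can actually be stolen. */
  for_each_cpu ( peer_cpu, &workers )
  {
      if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
          continue;              /* just the running vCPU there: skip */

      lock = pcpu_schedule_trylock(peer_cpu);
      if ( !lock )
          continue;              /* contended: don't spin, move on    */

      speer = csched_runq_steal(peer_cpu, cpu, snext->pri, balance_step);

      pcpu_schedule_unlock(lock, peer_cpu);

      if ( speer != NULL )
          break;
  }

This keeps the cheap check (a per-pCPU counter read) outside the lock,
so the expensive cross-CPU locking only happens when there is a
reasonable chance of actually finding a vCPU to steal.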

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



 

