
Re: [PATCH v2 1/2] credit: Limit load balancing to once per millisecond


  • To: George Dunlap <george.dunlap@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 22 Sep 2023 15:49:48 +0200
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Dario Faggioli <dfaggioli@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Juergen Gross <jgross@xxxxxxxx>
  • Delivery-date: Fri, 22 Sep 2023 13:50:03 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 22.09.2023 14:20, George Dunlap wrote:
> On Thu, Sep 21, 2023 at 2:12 PM Juergen Gross <jgross@xxxxxxxx> wrote:
>>
>> On 21.09.23 14:23, George Dunlap wrote:
>>> The credit scheduler tries as hard as it can to ensure that it always
>>> runs scheduling units with positive credit (PRI_TS_UNDER) before
>>> running those with negative credit (PRI_TS_OVER).  If the next
>>> runnable scheduling unit is of priority OVER, it will always run the
>>> load balancer, which will scour the system looking for another
>>> scheduling unit of the UNDER priority.
>>>
>>> Unfortunately, as the number of cores on a system has grown, the cost
>>> of the work-stealing algorithm has dramatically increased; a recent
>>> trace on a system with 128 cores showed this taking over 50
>>> microseconds.
>>>
>>> Add a parameter, load_balance_ratelimit, to limit the frequency of
>>> load balance operations on a given pcpu.  Default this to 1
>>> millisecond.
>>>
>>> Invert the load balancing conditional to make it more clear, and line
>>> up more closely with the comment above it.
>>>
>>> Overall it might be cleaner to have the last_load_balance checking
>>> happen inside csched_load_balance(), but that would require either
>>> passing both now and spc into the function, or looking them up again;
>>> both of which seemed to be worse than simply checking and setting the
>>> values before calling it.
>>>
>>> On a system with a vcpu:pcpu ratio of 2:1, running Windows guests
>>> (which will end up calling YIELD during spinlock contention), this
>>> patch increased performance significantly.
>>>
>>> Signed-off-by: George Dunlap <george.dunlap@xxxxxxxxx>
>>> ---
>>> Changes since v1:
>>> - Fix editing mistake in commit message
>>> - Improve documentation
>>> - global var is __ro_after_init
>>> - Remove sysctl, as it's not used.  Define max value in credit.c.
>>> - Fix some style issues
>>> - Move comment tweak to the right patch
>>> - In the event that the commandline-parameter value is too high, clip
>>>    to the maximum value rather than setting to the default.
>>>
>>> CC: Dario Faggioli <dfaggioli@xxxxxxxx>
>>> CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>>> CC: George Dunlap <george.dunlap@xxxxxxxxxx>
>>> CC: Jan Beulich <jbeulich@xxxxxxxx>
>>> CC: Julien Grall <julien@xxxxxxx>
>>> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
>>> CC: Wei Liu <wl@xxxxxxx>
>>> ---
>>>   docs/misc/xen-command-line.pandoc |  8 ++++++
>>>   xen/common/sched/credit.c         | 47 +++++++++++++++++++++++++------
>>>   2 files changed, 46 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
>>> index f88e6a70ae..9c3c72a7f9 100644
>>> --- a/docs/misc/xen-command-line.pandoc
>>> +++ b/docs/misc/xen-command-line.pandoc
>>> @@ -1884,6 +1884,14 @@ By default, Xen will use the INVPCID instruction for TLB management if
>>>   it is available.  This option can be used to cause Xen to fall back to
>>>   older mechanisms, which are generally slower.
>>>
>>> +### load-balance-ratelimit
>>> +> `= <integer>`
>>> +
>>> +The minimum interval between load balancing events on a given pcpu,
>>> +in microseconds.  A value of '0' disables rate limiting.  The
>>> +maximum value is 1 second; the default is 1ms.  At the moment only
>>> +the credit scheduler honors this parameter.
>>> +
>>>   ### noirqbalance (x86)
>>>   > `= <boolean>`
>>>
>>> diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
>>> index f2cd3d9da3..5c06f596d2 100644
>>> --- a/xen/common/sched/credit.c
>>> +++ b/xen/common/sched/credit.c
>>> @@ -50,6 +50,10 @@
>>>   #define CSCHED_TICKS_PER_TSLICE     3
>>>   /* Default timeslice: 30ms */
>>>   #define CSCHED_DEFAULT_TSLICE_MS    30
>>> +/* Default load balancing ratelimit: 1ms */
>>> +#define CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US 1000
>>> +/* Max load balancing ratelimit: 1s */
>>> +#define CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US     1000000
>>>   #define CSCHED_CREDITS_PER_MSEC     10
>>>   /* Never set a timer shorter than this value. */
>>>   #define CSCHED_MIN_TIMER            XEN_SYSCTL_SCHED_RATELIMIT_MIN
>>> @@ -153,6 +157,7 @@ struct csched_pcpu {
>>>
>>>       unsigned int idle_bias;
>>>       unsigned int nr_runnable;
>>> +    s_time_t last_load_balance;
>>>
>>>       unsigned int tick;
>>>       struct timer ticker;
>>> @@ -218,7 +223,7 @@ struct csched_private {
>>>
>>>       /* Period of master and tick in milliseconds */
>>>       unsigned int tick_period_us, ticks_per_tslice;
>>> -    s_time_t ratelimit, tslice, unit_migr_delay;
>>> +    s_time_t ratelimit, tslice, unit_migr_delay, load_balance_ratelimit;
>>>
>>>       struct list_head active_sdom;
>>>       uint32_t weight;
>>> @@ -612,6 +617,8 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
>>>       BUG_ON(!is_idle_unit(curr_on_cpu(cpu)));
>>>       cpumask_set_cpu(cpu, prv->idlers);
>>>       spc->nr_runnable = 0;
>>> +
>>> +    spc->last_load_balance = NOW();
>>>   }
>>>
>>>   static void cf_check
>>> @@ -1676,9 +1683,17 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>>>       return NULL;
>>>   }
>>>
>>> +/*
>>> + * Minimum delay, in microseconds, between load balance operations.
>>> + * This prevents spending too much time doing load balancing, particularly
>>> + * when the system has a high number of YIELDs due to spinlock priority inversion.
>>> + */
>>> +static unsigned int __ro_after_init load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
>>> +integer_param("load-balance-ratelimit", load_balance_ratelimit_us);
>>> +
>>>   static struct csched_unit *
>>>   csched_load_balance(struct csched_private *prv, int cpu,
>>> -    struct csched_unit *snext, bool *stolen)
>>> +                    struct csched_unit *snext, bool *stolen)
>>>   {
>>>       const struct cpupool *c = get_sched_res(cpu)->cpupool;
>>>       struct csched_unit *speer;
>>> @@ -1958,15 +1973,19 @@ static void cf_check csched_schedule(
>>>           /*
>>>            * SMP Load balance:
>>>            *
>>> -         * If the next highest priority local runnable UNIT has already eaten
>>> -         * through its credits, look on other PCPUs to see if we have more
>>> -         * urgent work... If not, csched_load_balance() will return snext, but
>>> -         * already removed from the runq.
>>> +         * If the next highest priority local runnable UNIT has
>>> +         * already eaten through its credits (and we're below the
>>> +         * balancing ratelimit), look on other PCPUs to see if we have
>>> +         * more urgent work... If we don't, csched_load_balance() will
>>> +         * return snext, but already removed from the runq.
>>>            */
>>> -        if ( snext->pri > CSCHED_PRI_TS_OVER )
>>> -            __runq_remove(snext);
>>> -        else
>>> +        if ( snext->pri <= CSCHED_PRI_TS_OVER
>>> +             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
>>
>> ^ Just found a style issue (after Andrew pointing out the ones in patch 2).
> 
> Just checking, you mean the space before the closing `)`?

And the placement of the opening {.

Jan
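
Taken together, the patch amounts to a cheap elapsed-time gate in front of the expensive work-stealing pass, plus clipping of over-large command-line values. A minimal, self-contained sketch of that logic follows; the helper names `should_load_balance` and `clamp_ratelimit_us` are illustrative only, not functions in the Xen tree:

```c
#include <stdint.h>

/* Stand-in for Xen's s_time_t: signed 64-bit time in nanoseconds. */
typedef int64_t s_time_t;

#define MICROSECS(us) ((s_time_t)(us) * 1000)
/* 1s, as CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US in the patch. */
#define MAX_RATELIMIT_US 1000000

/*
 * Illustrative clamping, per the v2 changelog: a command-line value
 * above the maximum is clipped to the maximum rather than reset to
 * the default.
 */
static s_time_t clamp_ratelimit_us(unsigned int us)
{
    if ( us > MAX_RATELIMIT_US )
        us = MAX_RATELIMIT_US;
    return MICROSECS(us);
}

/*
 * The gate added in csched_schedule(): only fall into the
 * work-stealing path when more than the configured minimum interval
 * has elapsed on this pcpu (strict '>', as in the patch).
 */
static int should_load_balance(s_time_t now, s_time_t last_load_balance,
                               s_time_t ratelimit)
{
    return now - last_load_balance > ratelimit;
}
```

With the default of 1000us, a pcpu that last balanced at t=0 skips stealing at t=0.5ms and tries again at t=1.5ms, with `spc->last_load_balance` updated each time the balancing path is entered.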
