Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
On Thu, Apr 02, 2026 at 01:57:00PM +0200, Andrew Cooper wrote:
> On 02/04/2026 12:57 pm, Ross Lagerwall wrote:
> > On 4/2/26 9:49 AM, Jan Beulich wrote:
> >> On 02.04.2026 10:40, Ross Lagerwall wrote:
> >>> On 4/2/26 7:09 AM, Jan Beulich wrote:
> >>>> On 01.04.2026 18:35, Ross Lagerwall wrote:
> >>>>> We have observed that the TLB flush lock can be a point of
> >>>>> contention for certain workloads, e.g. migrating 10 VMs off a
> >>>>> host during a host evacuation.
> >>>>>
> >>>>> Performance numbers:
> >>>>>
> >>>>> I wrote a synthetic benchmark to measure the performance. The
> >>>>> benchmark has one or more CPUs in Xen calling on_selected_cpus()
> >>>>> with between 1 and 64 CPUs in the selected mask. The executed
> >>>>> function simply delays for 500 microseconds.
> >>>>>
> >>>>> The table below shows the % change in execution time of
> >>>>> on_selected_cpus():
> >>>>>
> >>>>>                   1 thread  2 threads  4 threads
> >>>>> 1 CPU in mask         0.02     -35.23     -51.18
> >>>>> 2 CPUs in mask        0.01     -47.20     -69.27
> >>>>> 4 CPUs in mask       -0.02     -42.40     -66.55
> >>>>> 8 CPUs in mask       -0.03     -47.82     -68.39
> >>>>> 16 CPUs in mask       0.12     -41.95     -58.26
> >>>>> 32 CPUs in mask       0.02     -25.43     -39.35
> >>>>> 64 CPUs in mask       0.00     -24.70     -37.83
> >>>>>
> >>>>> With 1 thread (i.e. no contention), there is no regression in
> >>>>> execution time. With multiple threads, as expected, there is a
> >>>>> significant improvement in execution time.
> >>>>>
> >>>>> As a more practical benchmark to simulate host evacuation, I
> >>>>> measured the memory dirtying rate across 10 VMs after enabling
> >>>>> log dirty (on an AMD system, so without PML). The rate increased
> >>>>> by 16% with this patch series, even after the recent deferred
> >>>>> TLB flush changes.
> >>>>
> >>>> Is this a positive thing though? In the context of some related
> >>>> work something similar was mentioned iirc, accompanied by stating
> >>>> that this is actually problematic. A guest in log-dirty mode
> >>>> generally wants to be making progress, but also wants to be
> >>>> throttled enough to limit re-dirtying, such that subsequent
> >>>> iterations (in particular the final one) of page contents
> >>>> migration won't have to process overly many pages a 2nd time.
> >>>
> >>> In the context of a real migration, both the process copying the
> >>> pages out of the guest and the guest itself will be hitting the
> >>> TLB flush lock, so reducing that bottleneck may increase
> >>> throughput on both sides. Whether the overall migration time
> >>> increases or decreases depends on many factors (number of
> >>> migrations in parallel, the rate the guest is dirtying memory,
> >>> the line speed of the NIC, whether PML is used, ...) which is why
> >>> I measured a more controlled scenario to demonstrate the change.
> >>>
> >>> IMO throttling of a guest during a migration should be something
> >>> intentional and controlled by userspace policy rather than a side
> >>> effect of some internal global locks.
> >>
> >> I definitely agree here, but side effects going away may make it
> >> necessary to add such explicit throttling.
> >
> > Explicit throttling is much more important for the already existing
> > case of Intel systems with PML. With log dirty enabled, a VM on an
> > Intel system can dirty memory an order of magnitude faster than an
> > AMD system without PML.
> >
> > As an aside, for the same test an Intel machine without PML is
> > still a lot faster than AMD, so there is probably something to
> > improve in this area for AMD machines.
>
> AMD have PML on the way.
>
> https://docs.amd.com/v/u/en-US/69208_1.00_AMD64_PML_PUB
>
> There is a mis-step with how support for Intel's PML is done, meaning
> that draining the vCPU's PML buffers is extraordinarily expensive
> even when there's no action to take. (Specifically, the remote VMCS
> acquire.)
>
> A better option is this: when logdirty is active, any VMExit will
> drain the PML buffer into the logdirty bitmap before processing the
> main exit reason. This way, you drain all the PML buffers by just
> IPI-ing the domain dirty mask.

Seems like a good and easy-to-implement optimization. However, we are
already too fast when using PML, in the sense that the toolstack
cannot keep up with the rate of dirtied memory :).

Thanks, Roger.
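[Andrew's drain-on-VMExit suggestion could be sketched roughly as
below. This is C-style pseudocode, not a buildable patch: the handler
shape is illustrative, and while vmx_vcpu_flush_pml_buffer() and
paging_mode_log_dirty() resemble existing Xen names, their exact
placement here is an assumption.]

```c
/* Illustrative sketch only, not real Xen code. */
void vmexit_handler(struct cpu_user_regs *regs)
{
    struct vcpu *v = current;

    /*
     * Drain this vCPU's PML buffer into the logdirty bitmap up front,
     * while its VMCS is already loaded on this pCPU, avoiding the
     * remote VMCS acquire that makes on-demand draining expensive.
     * A toolstack wanting fresh bitmaps then only has to IPI the
     * domain's dirty mask to force every vCPU through this path.
     */
    if ( paging_mode_log_dirty(v->domain) )
        vmx_vcpu_flush_pml_buffer(v);

    /* ... dispatch the real exit reason as before ... */
}
```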