
Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Thu, 9 Apr 2026 10:13:02 +0200
  • Cc: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Thu, 09 Apr 2026 08:13:14 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, Apr 02, 2026 at 01:57:00PM +0200, Andrew Cooper wrote:
> On 02/04/2026 12:57 pm, Ross Lagerwall wrote:
> > On 4/2/26 9:49 AM, Jan Beulich wrote:
> >> On 02.04.2026 10:40, Ross Lagerwall wrote:
> >>> On 4/2/26 7:09 AM, Jan Beulich wrote:
> >>>> On 01.04.2026 18:35, Ross Lagerwall wrote:
> >>>>> We have observed that the TLB flush lock can be a point of
> >>>>> contention for certain workloads, e.g. migrating 10 VMs off a
> >>>>> host during a host evacuation.
> >>>>>
> >>>>> Performance numbers:
> >>>>>
> >>>>> I wrote a synthetic benchmark to measure the performance. The
> >>>>> benchmark has one or more CPUs in Xen calling on_selected_cpus()
> >>>>> with between 1 and 64 CPUs in the selected mask. The executed
> >>>>> function simply delays for 500 microseconds.
> >>>>>
> >>>>> The table below shows the % change in execution time of
> >>>>> on_selected_cpus():
> >>>>>
> >>>>>                     1 thread   2 threads    4 threads
> >>>>> 1 CPU in mask     0.02       -35.23       -51.18
> >>>>> 2 CPUs in mask    0.01       -47.20       -69.27
> >>>>> 4 CPUs in mask    -0.02      -42.40       -66.55
> >>>>> 8 CPUs in mask    -0.03      -47.82       -68.39
> >>>>> 16 CPUs in mask   0.12       -41.95       -58.26
> >>>>> 32 CPUs in mask   0.02       -25.43       -39.35
> >>>>> 64 CPUs in mask   0.00       -24.70       -37.83
> >>>>>
> >>>>> With 1 thread (i.e. no contention), there is no regression in
> >>>>> execution time. With multiple threads, as expected there is a
> >>>>> significant improvement in execution time.
> >>>>>
> >>>>> As a more practical benchmark to simulate host evacuation, I
> >>>>> measured the memory dirtying rate across 10 VMs after enabling
> >>>>> log dirty (on an AMD system, so without PML). The rate increased
> >>>>> by 16% with this patch series, even after the recent deferred TLB
> >>>>> flush changes.
> >>>>
> >>>> Is this a positive thing though? In the context of some related
> >>>> work something similar was mentioned iirc, accompanied by stating
> >>>> that this is actually problematic. A guest in log-dirty mode
> >>>> generally wants to be making progress, but also wants to be
> >>>> throttled enough to limit re-dirtying, such that subsequent
> >>>> iterations (in particular the final one) of page contents
> >>>> migration won't have to process overly many pages a 2nd time.
> >>>
> >>> In the context of a real migration, both the process copying the
> >>> pages out of the guest and the guest itself will be hitting the TLB
> >>> flush lock, so reducing that bottleneck may increase throughput on
> >>> both sides. Whether or not the overall migration time increases or
> >>> decreases depends on many factors (number of migrations in
> >>> parallel, the rate the guest is dirtying memory, the line speed of
> >>> the NIC, whether PML is used, ...) which is why I measured a more
> >>> controlled scenario to demonstrate the change.
> >>>
> >>> IMO throttling of a guest during a migration should be something
> >>> intentional and controlled by userspace policy rather than a side
> >>> effect of some internal global locks.
> >>
> >> I definitely agree here, but side effects going away may make it
> >> necessary to add such explicit throttling.
> >>
> >
> > Explicit throttling is much more important for the already existing
> > case of Intel systems with PML. With log dirty enabled, a VM on an Intel
> > system can dirty memory an order of magnitude faster than an AMD system
> > without PML.
> >
> > As an aside, for the same test an Intel machine without PML is still a
> > lot faster than AMD so there is probably something to improve in this
> > area for AMD machines. 
> 
> AMD have PML on the way. 
> https://docs.amd.com/v/u/en-US/69208_1.00_AMD64_PML_PUB
> 
> There is a mis-step with how support for Intel's PML is done, meaning
> that draining the vCPU's PML buffers is extraordinarily expensive even
> when there's no action to take.  (Specifically, the remote VMCS acquire)
> 
> A better option is this:  When logdirty is active, any VMExit will drain
> the PML buffer into the logdirty bitmap before processing the main exit
> reason.  This way, you drain all the PML buffers by just IPI-ing the
> domain dirty mask.

Seems like a good and easy-to-implement optimization.  However, we are
already too fast when using PML, in the sense that the toolstack cannot
keep up with the rate of dirtied memory :).

Thanks, Roger.


