Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
On Thu, Apr 02, 2026 at 01:57:00PM +0200, Andrew Cooper wrote:
> On 02/04/2026 12:57 pm, Ross Lagerwall wrote:
> > On 4/2/26 9:49 AM, Jan Beulich wrote:
> >> On 02.04.2026 10:40, Ross Lagerwall wrote:
> >>> On 4/2/26 7:09 AM, Jan Beulich wrote:
> >>>> On 01.04.2026 18:35, Ross Lagerwall wrote:
> >>>>> We have observed that the TLB flush lock can be a point of
> >>>>> contention for certain workloads, e.g. migrating 10 VMs off a
> >>>>> host during a host evacuation.
> >>>>>
> >>>>> Performance numbers:
> >>>>>
> >>>>> I wrote a synthetic benchmark to measure the performance. The
> >>>>> benchmark has one or more CPUs in Xen calling on_selected_cpus()
> >>>>> with between 1 and 64 CPUs in the selected mask. The executed
> >>>>> function simply delays for 500 microseconds.
> >>>>>
> >>>>> The table below shows the % change in execution time of
> >>>>> on_selected_cpus():
> >>>>>
> >>>>>                   1 thread  2 threads  4 threads
> >>>>> 1 CPU in mask         0.02     -35.23     -51.18
> >>>>> 2 CPUs in mask        0.01     -47.20     -69.27
> >>>>> 4 CPUs in mask       -0.02     -42.40     -66.55
> >>>>> 8 CPUs in mask       -0.03     -47.82     -68.39
> >>>>> 16 CPUs in mask       0.12     -41.95     -58.26
> >>>>> 32 CPUs in mask       0.02     -25.43     -39.35
> >>>>> 64 CPUs in mask       0.00     -24.70     -37.83
> >>>>>
> >>>>> With 1 thread (i.e. no contention), there is no regression in
> >>>>> execution time. With multiple threads, as expected, there is a
> >>>>> significant improvement in execution time.
> >>>>>
> >>>>> As a more practical benchmark to simulate host evacuation, I
> >>>>> measured the memory dirtying rate across 10 VMs after enabling
> >>>>> log dirty (on an AMD system, so without PML). The rate increased
> >>>>> by 16% with this patch series, even after the recent deferred
> >>>>> TLB flush changes.
> >>>>
> >>>> Is this a positive thing though? In the context of some related
> >>>> work something similar was mentioned iirc, accompanied by stating
> >>>> that this is actually problematic. A guest in log-dirty mode
> >>>> generally wants to be making progress, but also wants to be
> >>>> throttled enough to limit re-dirtying, such that subsequent
> >>>> iterations (in particular the final one) of page contents
> >>>> migration won't have to process overly many pages a 2nd time.
> >>>
> >>> In the context of a real migration, both the process copying the
> >>> pages out of the guest and the guest itself will be hitting the
> >>> TLB flush lock, so reducing that bottleneck may increase
> >>> throughput on both sides. Whether the overall migration time
> >>> increases or decreases depends on many factors (number of
> >>> migrations in parallel, the rate the guest is dirtying memory,
> >>> the line speed of the NIC, whether PML is used, ...) which is why
> >>> I measured a more controlled scenario to demonstrate the change.
> >>>
> >>> IMO throttling of a guest during a migration should be something
> >>> intentional and controlled by userspace policy rather than a side
> >>> effect of some internal global locks.
> >>
> >> I definitely agree here, but side effects going away may make it
> >> necessary to add such explicit throttling.
> >
> > Explicit throttling is much more important for the already existing
> > case of Intel systems with PML. With log dirty enabled, a VM on an
> > Intel system can dirty memory an order of magnitude faster than an
> > AMD system without PML.
> >
> > As an aside, for the same test an Intel machine without PML is
> > still a lot faster than AMD, so there is probably something to
> > improve in this area for AMD machines.
>
> AMD have PML on the way.
>
> https://docs.amd.com/v/u/en-US/69208_1.00_AMD64_PML_PUB
>
> There is a mis-step with how support for Intel's PML is done, meaning
> that draining the vCPU's PML buffers is extraordinarily expensive
> even when there's no action to take. (Specifically, the remote VMCS
> acquire.)
>
> A better option is this: when logdirty is active, any VMExit will
> drain the PML buffer into the logdirty bitmap before processing the
> main exit reason. This way, you drain all the PML buffers by just
> IPI-ing the domain dirty mask.

Seems like a good and easy-to-implement optimization. However, we are
already too fast when using PML, in the sense that the toolstack
cannot keep up with the rate of dirtied memory :).

Thanks, Roger.
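[Andrew's drain-on-VMExit suggestion could be sketched roughly as
below. This is C-style pseudocode, not a buildable patch: the handler
shape is illustrative, and while vmx_vcpu_flush_pml_buffer() and
paging_mode_log_dirty() resemble existing Xen names, their exact
placement here is an assumption.]

```c
/* Illustrative sketch only, not real Xen code. */
void vmexit_handler(struct cpu_user_regs *regs)
{
    struct vcpu *v = current;

    /*
     * Drain this vCPU's PML buffer into the logdirty bitmap up front,
     * while its VMCS is already loaded on this pCPU, avoiding the
     * remote VMCS acquire that makes on-demand draining expensive.
     * A toolstack wanting fresh bitmaps then only has to IPI the
     * domain's dirty mask to force every vCPU through this path.
     */
    if ( paging_mode_log_dirty(v->domain) )
        vmx_vcpu_flush_pml_buffer(v);

    /* ... dispatch the real exit reason as before ... */
}
```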