[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH RESEND] Performance regression due to XSA-336
On 3/31/21 5:53 AM, Hongyan Xia wrote: > Thank you for the patch. We observed similar issues. > > However, would you mind sharing more details on the test setup? We > tested the hypervisor after injecting XSA-336. There were regressions > in some benchmarks but far from being enough to raise a serious alarm. > In particular, we benchmarked CPU, disk I/O, network and timer > performance on VMs with 64 and even 128 vCPUs and the noticeable > regression was a 60% drop in the ACPI PM timer when all vCPUs are > actively accessing the timer. We did not see concerns in real-world > benchmarks so I am curious how the -40% was seen. We tested two guest kernel versions and observed this regression only on one (older) kernel. The newer one showed almost no change so this appears to be quite sensitive to guest kernel. Unfortunately it is not easy to get access to the test rig so I wasn't able to do a more thorough investigation of kernels' behavior. This was mostly done by staring at the code, making some changes and getting in line to have the test run. > > Is it possible that the guest itself was using ACPI timer as the > clocksource? No, we never use it. > That could explain a lot. It might also be useful to know > what OS that was and other specs around the -40%. Guest OS is Oracle Linux 6 (I think we also tried OL7 with same results), the kernel is a modified 4.1. Skylake processor, few hundred GB of memory. Benchmark is TPCC. -boris > > Thank you, > Hongyan > > On Wed, 2021-03-24 at 17:05 -0400, Boris Ostrovsky wrote: >> (Re-sending with Stephen added) >> >> >> While running performance tests with recent XSAs backports to our >> product we've >> discovered significant regression in TPCC performance. With a >> particular guest >> kernel the numbers dropped by as much as 40%. >> >> We've narrowed that down to XSA-336 patch, specifically to the >> pt_migrate rwlock, >> and even more specifically to this lock being taken in >> pt_update_irq(). >> >> We have quite a large guest (92 VCPUs) doing lots of VMEXITs and the >> theory is >> that lock's cnts atomic is starting to cause lots of coherence >> traffic. As a >> quick test of this replacing pt_vcpu_lock() in pt_update_irq() with >> just >> spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all performance >> back. >> >> Stephen Brennan came up with new locking algorithm, I just coded it >> up. >> >> A couple of notes: >> >> * We have only observed the problem and tested this patch for >> performance on >> a fairly old Xen version. However, vpt code is almost identical and >> I expect >> upstream to suffer from the same issue. >> >> * Stephen provided the following (slightly edited by me) writeup >> explaining the >> locking algorithm. I would like to include it somewhere but not >> sure what the >> right place would be. Commit message perhaps? >> >> >> Currently, every periodic_time is protected by locking the vcpu it is >> on. You >> can think of the per-vCPU lock (tm_lock) as protecting the fields of >> every >> periodic_time which is attached to that vCPU, as well as the list >> itself, and so >> it must be held when read or written, or when an object is added or >> removed >> to/from the list. >> >> It seems that there are three types of access to the peridic_time >> objects: >> >> 1. Functions which read (maybe write) all periodic_time instances >> attached to a >> particular vCPU. These are functions which use pt_vcpu_lock() >> after the >> commit, such as pt_restore_timer(), pt_save_timer(), etc. >> 2. Functions which want to modify a particular periodic_time object. >> These guys >> lock whichever vCPU the periodic_time is attached to, but since >> the vCPU >> could be modified without holding any lock, they are vulnerable to >> the bug. >> Functions in this group use pt_lock(), such as pt_timer_fn() or >> destroy_periodic_time(). >> 3. Functions which not only want to modify the periodic_time, but >> also would >> like to modify the =vcpu= fields. These are create_periodic_time() >> or >> pt_adjust_vcpu(). They create the locking imbalance bug for group >> 2, but we >> can't simply hold 2 vcpu locks due to the deadlock risk. >> >> My proposed option is to add a per-periodic_time spinlock, which >> protects only >> the periodic_time.vcpu field. Whenever reading the vcpu field of a >> periodic_time >> struct, you must first take that lock. The critical sections of group >> 1 (your >> "fast path" functions) would look like this: >> >> 1. lock vcpu >> 2. do whatever you want with pts currently on the vcpu. It is safe to >> read or write >> fields of pt, because the vcpu lock protects those fields. You >> simply cannot >> write pt->vcpu, because somebody holding the pt lock may already >> be spinning >> waiting for your vcpu lock. >> 3. unlock vcpu >> >> >> Note that there is no additional locking in this fast path. For group >> 2 >> functions (which are attempting to lock an individual periodic_time), >> the >> critical section would look like this: >> >> 1. lock pt lock (stabilizing the vcpu field) >> 2. lock vcpu >> 3. feel free to modify any field of the periodic_time >> 4. unlock vcpu (due to the mutual exclusion of the pt lock, we know >> that we are >> unlocking the correct vcpu -- we have not been migrated) >> 5. unlock pt >> >> For functions in group 3, the critical section would be: >> >> 1. lock pt (stabilizing the vcpu field) >> 2. lock current vcpu >> 3. remove from vcpu list >> 4. unlock vcpu. At this point, you're guaranteed that the vcpu >> functions >> (callers of pt_vcpu_lock()) are not accessing your pt. >> 5. assign pt->vcpu (we still have mutual exclusion against group 2 >> functions) >> 6. lock destination vcpu >> 7. add to vcpu list >> 8. unlock destination vcpu >> 9. unlock pt >> >> If functions from group 2 and 3 are less frequent, then you won't see >> too much >> added lock overhead in this situation! Plus, even if group 2 and 3 >> are somewhat >> common, the performance overhead of an uncontented fine-grained lock >> is muuch >> smaller than the overhead of a heavily contended coarse-grained lock, >> like the >> per-domain rw lock. >> >> >> Boris Ostrovsky (1): >> x86/vpt: Replace per-guest pt_migrate lock with per pt lock >> >> xen/arch/x86/emul-i8254.c | 2 + >> xen/arch/x86/hvm/hpet.c | 1 + >> xen/arch/x86/hvm/hvm.c | 2 - >> xen/arch/x86/hvm/rtc.c | 1 + >> xen/arch/x86/hvm/vlapic.c | 1 + >> xen/arch/x86/hvm/vpt.c | 122 +++++++++++++++++++++++--------- >> ---------- >> xen/include/asm-x86/hvm/vpt.h | 9 +--- >> 7 files changed, 74 insertions(+), 64 deletions(-) >>
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |