[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH RESEND] Performance regression due to XSA-336



Thank you for the patch. We observed similar issues.

However, would you mind sharing more details on the test setup? We
tested the hypervisor after injecting XSA-336. There were regressions
in some benchmarks but far from being enough to raise a serious alarm.
In particular, we benchmarked CPU, disk I/O, network and timer
performance on VMs with 64 and even 128 vCPUs and the noticeable
regression was a 60% drop in the ACPI PM timer when all vCPUs are
actively accessing the timer. We did not see concerns in real-world
benchmarks so I am curious how the -40% was seen.

Is it possible that the guest itself was using ACPI timer as the
clocksource? That could explain a lot. It might also be useful to know
what OS that was and other specs around the -40%.

Thank you,
Hongyan

On Wed, 2021-03-24 at 17:05 -0400, Boris Ostrovsky wrote:
> (Re-sending with Stephen added)
> 
> 
> While running performance tests with recent XSAs backports to our
> product we've
> discovered significant regression in TPCC performance. With a
> particular guest
> kernel the numbers dropped by as much as 40%.
> 
> We've narrowed that down to XSA-336 patch, specifically to the
> pt_migrate rwlock,
> and even more specifically to this lock being taken in
> pt_update_irq().
> 
> We have quite a large guest (92 VCPUs) doing lots of VMEXITs and the
> theory is
> that lock's cnts atomic is starting to cause lots of coherence
> traffic. As a
> quick test of this replacing pt_vcpu_lock() in pt_update_irq() with
> just
> spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all performance
> back.
> 
> Stephen Brennan came up with new locking algorithm, I just coded it
> up.
> 
> A couple of notes:
> 
> * We have only observed the problem and tested this patch for
> performance on
>   a fairly old Xen version. However, vpt code is almost identical and
> I expect
>   upstream to suffer from the same issue.
> 
> * Stephen provided the following (slightly edited by me) writeup
> explaining the
>   locking algorithm. I would like to include it somewhere but not
> sure what the
>   right place would be. Commit message perhaps?
> 
> 
> Currently, every periodic_time is protected by locking the vcpu it is
> on. You
> can think of the per-vCPU lock (tm_lock) as protecting the fields of
> every
> periodic_time which is attached to that vCPU, as well as the list
> itself, and so
> it must be held when read or written, or when an object is added or
> removed
> to/from the list.
> 
> It seems that there are three types of access to the peridic_time
> objects:
> 
> 1. Functions which read (maybe write) all periodic_time instances
> attached to a
>    particular vCPU. These are functions which use pt_vcpu_lock()
> after the
>    commit, such as pt_restore_timer(), pt_save_timer(), etc.
> 2. Functions which want to modify a particular periodic_time object.
> These guys
>    lock whichever vCPU the periodic_time is attached to, but since
> the vCPU
>    could be modified without holding any lock, they are vulnerable to
> the bug.
>    Functions in this group use pt_lock(), such as pt_timer_fn() or
>    destroy_periodic_time().
> 3. Functions which not only want to modify the periodic_time, but
> also would
>    like to modify the =vcpu= fields. These are create_periodic_time()
> or
>    pt_adjust_vcpu(). They create the locking imbalance bug for group
> 2, but we
>    can't simply hold 2 vcpu locks due to the deadlock risk.
> 
> My proposed option is to add a per-periodic_time spinlock, which
> protects only
> the periodic_time.vcpu field. Whenever reading the vcpu field of a
> periodic_time
> struct, you must first take that lock. The critical sections of group
> 1 (your
> "fast path" functions) would look like this:
> 
> 1. lock vcpu
> 2. do whatever you want with pts currently on the vcpu. It is safe to
> read or write
>    fields of pt, because the vcpu lock protects those fields. You
> simply cannot
>    write pt->vcpu, because somebody holding the pt lock may already
> be spinning
>    waiting for your vcpu lock.
> 3. unlock vcpu
> 
> 
> Note that there is no additional locking in this fast path. For group
> 2
> functions (which are attempting to lock an individual periodic_time),
> the
> critical section would look like this:
> 
> 1. lock pt lock (stabilizing the vcpu field)
> 2. lock vcpu
> 3. feel free to modify any field of the periodic_time
> 4. unlock vcpu (due to the mutual exclusion of the pt lock, we know
> that we are
>    unlocking the correct vcpu -- we have not been migrated)
> 5. unlock pt
> 
> For functions in group 3, the critical section would be:
> 
> 1. lock pt (stabilizing the vcpu field)
> 2. lock current vcpu
> 3. remove from vcpu list
> 4. unlock vcpu. At this point, you're guaranteed that the vcpu
> functions
>    (callers of pt_vcpu_lock()) are not accessing your pt.
> 5. assign pt->vcpu  (we still have mutual exclusion against group 2
> functions)
> 6. lock destination vcpu
> 7. add to vcpu list
> 8. unlock destination vcpu
> 9. unlock pt
> 
> If functions from group 2 and 3 are less frequent, then you won't see
> too much
> added lock overhead in this situation! Plus, even if group 2 and 3
> are somewhat
> common, the performance overhead of an uncontented fine-grained lock
> is muuch
> smaller than the overhead of a heavily contended coarse-grained lock,
> like the
> per-domain rw lock.
> 
> 
> Boris Ostrovsky (1):
>   x86/vpt: Replace per-guest pt_migrate lock with per pt lock
> 
>  xen/arch/x86/emul-i8254.c     |   2 +
>  xen/arch/x86/hvm/hpet.c       |   1 +
>  xen/arch/x86/hvm/hvm.c        |   2 -
>  xen/arch/x86/hvm/rtc.c        |   1 +
>  xen/arch/x86/hvm/vlapic.c     |   1 +
>  xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++---------
> ----------
>  xen/include/asm-x86/hvm/vpt.h |   9 +---
>  7 files changed, 74 insertions(+), 64 deletions(-)
> 




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.