[PATCH] Performance regression due to XSA-336
While running performance tests with recent XSA backports to our product
we've discovered a significant regression in TPCC performance. With a
particular guest kernel the numbers dropped by as much as 40%.

We've narrowed that down to the XSA-336 patch, specifically to the
pt_migrate rwlock, and even more specifically to this lock being taken
in pt_update_irq(). We have quite a large guest (92 vCPUs) doing lots of
VMEXITs and the theory is that the lock's cnts atomic is starting to
cause lots of coherence traffic. As a quick test, replacing
pt_vcpu_lock() in pt_update_irq() with just
spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all of the
performance back.

Stephen Brennan came up with a new locking algorithm, I just coded it
up.

A couple of notes:

* We have only observed the problem and tested this patch for
  performance on a fairly old Xen version. However, the vpt code is
  almost identical and I expect upstream to suffer from the same issue.

* Stephen provided the following (slightly edited by me) writeup
  explaining the locking algorithm. I would like to include it somewhere
  but am not sure what the right place would be. The commit message
  perhaps?

Currently, every periodic_time is protected by locking the vCPU it is
on. You can think of the per-vCPU lock (tm_lock) as protecting the
fields of every periodic_time which is attached to that vCPU, as well as
the list itself, and so it must be held when the list is read or
written, or when an object is added to or removed from it.

It seems that there are three types of access to the periodic_time
objects:

1. Functions which read (and maybe write) all periodic_time instances
   attached to a particular vCPU. These are the functions which use
   pt_vcpu_lock() after the commit, such as pt_restore_timer(),
   pt_save_timer(), etc.

2. Functions which want to modify a particular periodic_time object.
   These lock whichever vCPU the periodic_time is attached to, but since
   the vCPU could be modified without holding any lock, they are
   vulnerable to the bug. Functions in this group use pt_lock(), such as
   pt_timer_fn() or destroy_periodic_time().

3. Functions which not only want to modify the periodic_time, but also
   would like to modify its vcpu field. These are create_periodic_time()
   and pt_adjust_vcpu(). They create the locking imbalance bug for
   group 2, but we can't simply hold two vCPU locks due to the deadlock
   risk.

My proposed option is to add a per-periodic_time spinlock, which
protects only the periodic_time.vcpu field. Whenever reading the vcpu
field of a periodic_time struct, you must first take that lock.

The critical sections of group 1 (your "fast path" functions) would look
like this:

1. lock vcpu
2. do whatever you want with the pts currently on the vcpu. It is safe
   to read or write fields of a pt, because the vcpu lock protects those
   fields. You simply cannot write pt->vcpu, because somebody holding
   the pt lock may already be spinning waiting for your vcpu lock.
3. unlock vcpu

Note that there is no additional locking in this fast path.

For group 2 functions (which are attempting to lock an individual
periodic_time), the critical section would look like this:

1. lock pt lock (stabilizing the vcpu field)
2. lock vcpu
3. feel free to modify any field of the periodic_time
4. unlock vcpu (due to the mutual exclusion of the pt lock, we know that
   we are unlocking the correct vcpu -- we have not been migrated)
5. unlock pt

(A sketch of these two locking patterns in code follows below.)
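To make the above concrete, here is a minimal compilable sketch of the
group 1 and group 2 patterns. The field and helper names (tm_lock,
pt_vcpu_lock(), pt_lock()) follow the vpt code; the types and the
pthread-based spinlock are simplified stand-ins, not the actual Xen
definitions:

    #include <pthread.h>

    typedef pthread_spinlock_t spinlock_t;     /* stand-in for Xen's spinlock_t */
    #define spin_lock(l)   pthread_spin_lock(l)
    #define spin_unlock(l) pthread_spin_unlock(l)

    struct vcpu {
        struct { struct { spinlock_t tm_lock; } hvm_vcpu; } arch;
    };

    struct periodic_time {
        spinlock_t pt_lock;  /* new: protects only the vcpu field below */
        struct vcpu *vcpu;   /* vCPU this timer is currently attached to */
        /* ... timer fields and list entry, protected by tm_lock ... */
    };

    /* Group 1 (fast path): take only the per-vCPU lock.  Holding it
     * makes it safe to touch any field of any pt on this vCPU except
     * pt->vcpu itself. */
    static void pt_vcpu_lock(struct vcpu *v)
    {
        spin_lock(&v->arch.hvm_vcpu.tm_lock);
    }

    static void pt_vcpu_unlock(struct vcpu *v)
    {
        spin_unlock(&v->arch.hvm_vcpu.tm_lock);
    }

    /* Group 2: take pt->pt_lock first, which stabilizes pt->vcpu, then
     * lock that (now guaranteed correct) vCPU. */
    static void pt_lock(struct periodic_time *pt)
    {
        spin_lock(&pt->pt_lock);       /* pt->vcpu can no longer change */
        spin_lock(&pt->vcpu->arch.hvm_vcpu.tm_lock);
    }

    static void pt_unlock(struct periodic_time *pt)
    {
        spin_unlock(&pt->vcpu->arch.hvm_vcpu.tm_lock);
        spin_unlock(&pt->pt_lock);     /* same vCPU we locked: no imbalance */
    }

Note that pt_unlock() can rely on pt->vcpu being unchanged since the
matching pt_lock(), which is exactly the unlock-the-wrong-vcpu hazard
the per-pt lock removes.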
For functions in group 3, the critical section would be (sketched in
code further below):

1. lock pt (stabilizing the vcpu field)
2. lock current vcpu
3. remove from vcpu list
4. unlock vcpu. At this point, you're guaranteed that the vcpu functions
   (callers of pt_vcpu_lock()) are not accessing your pt.
5. assign pt->vcpu (we still have mutual exclusion against group 2
   functions)
6. lock destination vcpu
7. add to vcpu list
8. unlock destination vcpu
9. unlock pt

If functions from groups 2 and 3 are less frequent, then you won't see
too much added lock overhead in this situation! Plus, even if groups 2
and 3 are somewhat common, the performance overhead of an uncontended
fine-grained lock is much smaller than the overhead of a heavily
contended coarse-grained lock, like the per-domain rwlock.
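And a sketch of the group 3 migration path, under the same stand-in
types as above; pt_move_vcpu() is a hypothetical name used for
illustration, not a function from the patch, and the numbered comments
correspond to the steps listed above:

    /* Group 3: migrate a pt to another vCPU.  The two vCPU locks are
     * never held at the same time, so there is no deadlock risk, and
     * pt->pt_lock keeps group 2 callers out for the whole transition. */
    static void pt_move_vcpu(struct periodic_time *pt, struct vcpu *dst)
    {
        spin_lock(&pt->pt_lock);                        /* 1 */

        spin_lock(&pt->vcpu->arch.hvm_vcpu.tm_lock);    /* 2 */
        /* 3: remove pt from pt->vcpu's timer list here */
        spin_unlock(&pt->vcpu->arch.hvm_vcpu.tm_lock);  /* 4 */

        /* No pt_vcpu_lock() caller can reach this pt any more. */
        pt->vcpu = dst;                                 /* 5 */

        spin_lock(&dst->arch.hvm_vcpu.tm_lock);         /* 6 */
        /* 7: add pt to dst's timer list here */
        spin_unlock(&dst->arch.hvm_vcpu.tm_lock);       /* 8 */

        spin_unlock(&pt->pt_lock);                      /* 9 */
    }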
Boris Ostrovsky (1):
  x86/vpt: Replace per-guest pt_migrate lock with per pt lock

 xen/arch/x86/emul-i8254.c     |   2 +
 xen/arch/x86/hvm/hpet.c       |   1 +
 xen/arch/x86/hvm/hvm.c        |   2 -
 xen/arch/x86/hvm/rtc.c        |   1 +
 xen/arch/x86/hvm/vlapic.c     |   1 +
 xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++------------------
 xen/include/asm-x86/hvm/vpt.h |   9 +---
 7 files changed, 74 insertions(+), 64 deletions(-)

--
1.8.3.1