
Re: [PATCH RESEND] Performance regression due to XSA-336


  • To: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Thu, 25 Mar 2021 12:51:32 +0100
  • Cc: <xen-devel@xxxxxxxxxxxxxxxxxxxx>, <jbeulich@xxxxxxxx>, <andrew.cooper3@xxxxxxxxxx>, <wl@xxxxxxx>, <stephen.s.brennan@xxxxxxxxxx>
  • Delivery-date: Thu, 25 Mar 2021 11:51:56 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, Mar 24, 2021 at 05:05:04PM -0400, Boris Ostrovsky wrote:
> 
> (Re-sending with Stephen added)
> 
> 
> While running performance tests with recent XSA backports to our product,
> we've discovered a significant regression in TPCC performance. With a
> particular guest kernel the numbers dropped by as much as 40%.
> 
> We've narrowed that down to the XSA-336 patch, specifically to the
> pt_migrate rwlock, and even more specifically to this lock being taken in
> pt_update_irq().
> 
> We have quite a large guest (92 vCPUs) doing lots of VMEXITs, and the
> theory is that the lock's cnts atomic is starting to cause lots of
> coherence traffic. As a quick test of this, replacing pt_vcpu_lock() in
> pt_update_irq() with just spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us
> almost all of the performance back.

Right, we were kind of worried about this when doing the original
fix, but no one reported such performance regressions; I guess Oracle
is likely the only one running such big VMs.
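
For reference, the quick test you describe would amount to something
like the following sketch (field naming per the older tree you quote,
where it's still arch.hvm_vcpu; current staging spells it
v->arch.hvm.tm_lock):

    /* In pt_update_irq(): take only the per-vCPU lock, skipping the
     * per-domain pt_migrate rwlock added by the XSA-336 fix.  NB: this
     * reintroduces the race XSA-336 closed, so it's only a diagnostic. */
    -    pt_vcpu_lock(v);
    +    spin_lock(&v->arch.hvm_vcpu.tm_lock);

That neatly confirms the contention comes from the rwlock's shared
counter rather than from the per-vCPU lock itself.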

I have a pending patch series to remove all this vpt logic from the VM
entry path, as it's not working properly except for very simple
scenarios:

https://lore.kernel.org/xen-devel/20200930104108.35969-1-roger.pau@xxxxxxxxxx/

I hope I will find time to post a new version of the series soon-ish.

> Stephen Brennan came up with a new locking algorithm; I just coded it up.
> 
> A couple of notes:
> 
> * We have only observed the problem and tested this patch for performance
>   on a fairly old Xen version. However, the vpt code is almost identical
>   and I expect upstream to suffer from the same issue.
> 
> * Stephen provided the following (slightly edited by me) writeup
>   explaining the locking algorithm. I would like to include it somewhere
>   but I'm not sure what the right place would be. Commit message perhaps?
> 
> 
> Currently, every periodic_time is protected by locking the vCPU it is on.
> You can think of the per-vCPU lock (tm_lock) as protecting the fields of
> every periodic_time which is attached to that vCPU, as well as the list
> itself, and so it must be held when those fields are read or written, or
> when an object is added to or removed from the list.
> 
> It seems that there are three types of access to the periodic_time objects:
> 
> 1. Functions which read (maybe write) all periodic_time instances
>    attached to a particular vCPU. These are functions which use
>    pt_vcpu_lock() after the commit, such as pt_restore_timer(),
>    pt_save_timer(), etc.
> 2. Functions which want to modify a particular periodic_time object.
>    These guys lock whichever vCPU the periodic_time is attached to, but
>    since the pt's vcpu field could be modified without holding any lock,
>    they are vulnerable to the bug. Functions in this group use pt_lock(),
>    such as pt_timer_fn() or destroy_periodic_time().
> 3. Functions which not only want to modify the periodic_time, but also
>    would like to modify its vcpu field. These are create_periodic_time()
>    and pt_adjust_vcpu(). They create the locking imbalance bug for group
>    2, but we can't simply hold 2 vcpu locks due to the deadlock risk.
> 
> My proposed option is to add a per-periodic_time spinlock, which protects only
> the periodic_time.vcpu field.

I wonder whether we really need a per-periodic_time spinlock. It seems
like functions using access type 1 are the only ones that suffer from
the contention caused by the global rwlock, so maybe we could adapt
this proposal to still use a per-domain lock for the type 2 and 3
accesses, seeing that type 1 accesses are likely safe by just holding
the vCPU lock.

Not that using a per-periodic_time spinlock is wrong, but it's likely
too fine-grained (and adds more memory usage), as type 2 and 3 accesses
shouldn't be common anyway.
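
To make the comparison concrete, the proposal as I understand it would
add a field along these lines (name hypothetical):

    struct periodic_time {
        ...
        spinlock_t pt_lock;  /* NEW: protects only the vcpu field below */
        struct vcpu *vcpu;   /* vCPU the timer is currently attached to */
        ...
    };

with every other field staying under the tm_lock of whichever vCPU the
timer is attached to, exactly as today.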

Let me make some comments on the patch itself.

> Whenever reading the vcpu field of a periodic_time
> struct, you must first take that lock. The critical sections of group 1 (your
> "fast path" functions) would look like this:
> 
> 1. lock vcpu
> 2. do whatever you want with pts currently on the vcpu. It is safe to
>    read or write fields of pt, because the vcpu lock protects those
>    fields. You simply cannot write pt->vcpu, because somebody holding the
>    pt lock may already be spinning waiting for your vcpu lock.
> 3. unlock vcpu
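
As a minimal sketch of that fast path (field names per current staging,
so adjust for older trees):

    spin_lock(&v->arch.hvm.tm_lock);            /* 1. lock vcpu */
    list_for_each_entry ( pt, &v->arch.hvm.tm_list, list )
    {
        /* 2. read/write any field of pt except pt->vcpu */
    }
    spin_unlock(&v->arch.hvm.tm_lock);          /* 3. unlock vcpu */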
> 
> 
> Note that there is no additional locking in this fast path. For group 2
> functions (which are attempting to lock an individual periodic_time), the
> critical section would look like this:
> 
> 1. lock pt lock (stabilizing the vcpu field)
> 2. lock vcpu
> 3. feel free to modify any field of the periodic_time
> 4. unlock vcpu (due to the mutual exclusion of the pt lock, we know that
>    we are unlocking the correct vcpu -- we have not been migrated)
> 5. unlock pt
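
Or in code form, under the same naming assumptions as before:

    spin_lock(&pt->pt_lock);                    /* 1. stabilize pt->vcpu */
    spin_lock(&pt->vcpu->arch.hvm.tm_lock);     /* 2. lock that vcpu */
    /* 3. modify fields of pt */
    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);   /* 4. same vcpu: pt->vcpu
                                                   cannot have changed */
    spin_unlock(&pt->pt_lock);                  /* 5. unlock pt */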
> 
> For functions in group 3, the critical section would be:
> 
> 1. lock pt (stabilizing the vcpu field)
> 2. lock current vcpu
> 3. remove from vcpu list
> 4. unlock vcpu. At this point, you're guaranteed that the vcpu functions
>    (callers of pt_vcpu_lock()) are not accessing your pt.
> 5. assign pt->vcpu  (we still have mutual exclusion against group 2 functions)
> 6. lock destination vcpu
> 7. add to vcpu list
> 8. unlock destination vcpu
> 9. unlock pt
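
Sketched in the same style, with new_v as the destination vCPU:

    spin_lock(&pt->pt_lock);                    /* 1. stabilize pt->vcpu */
    spin_lock(&pt->vcpu->arch.hvm.tm_lock);     /* 2. lock current vcpu */
    list_del(&pt->list);                        /* 3. remove from its list */
    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);   /* 4. group 1 can no longer
                                                   reach this pt */
    pt->vcpu = new_v;                           /* 5. safe: pt_lock held */
    spin_lock(&new_v->arch.hvm.tm_lock);        /* 6. lock destination */
    list_add(&pt->list, &new_v->arch.hvm.tm_list); /* 7. add to its list */
    spin_unlock(&new_v->arch.hvm.tm_lock);      /* 8. unlock destination */
    spin_unlock(&pt->pt_lock);                  /* 9. unlock pt */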
> 
> If functions from groups 2 and 3 are less frequent, then you won't see too
> much added lock overhead in this situation! Plus, even if groups 2 and 3
> are somewhat common, the performance overhead of an uncontended
> fine-grained lock is much smaller than the overhead of a heavily contended
> coarse-grained lock, like the per-domain rwlock.

Thanks, that's a very good description of the different locking
accesses by vpt. The original fix already aimed to make this
distinction by introducing the pt_vcpu_{un}lock() and pt_{un}lock()
helpers.

This all stems from a very bad design decision of tying vpts to a
vCPU, which is a broken assumption.

Roger.



 

