[Xen-devel] [PATCH 6/9] qspinlock: Use a simple write to grab the lock

To: Waiman.Long@xxxxxx
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Mon, 16 Mar 2015 14:16:19 +0100
Cc: raghavendra.kt@xxxxxxxxxxxxxxxxxx, kvm@xxxxxxxxxxxxxxx, peterz@xxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, hpa@xxxxxxxxx, boris.ostrovsky@xxxxxxxxxx, linux-arch@xxxxxxxxxxxxxxx, x86@xxxxxxxxxx, mingo@xxxxxxxxxx, doug.hatch@xxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, paulmck@xxxxxxxxxxxxxxxxxx, riel@xxxxxxxxxx, scott.norton@xxxxxx, paolo.bonzini@xxxxxxxxx, tglx@xxxxxxxxxxxxx, virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx, oleg@xxxxxxxxxx, luto@xxxxxxxxxxxxxx, david.vrabel@xxxxxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx
Delivery-date: Mon, 16 Mar 2015 13:36:08 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

From: Waiman Long <Waiman.Long@xxxxxx>

Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. In that case,
a simple write to set the lock bit is enough, as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending-bit
waiting code ensures that the pending bit will not be set once the
tail code in the lock is set.
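
The decision described above can be sketched in userspace roughly as
follows. This is only a model under stated assumptions, not the kernel
code: the union, the QH_* constants and qhead_acquire() are hypothetical
names, the bit layout (locked in bits 0-7, pending in bits 8-15, tail in
bits 16-31) is assumed from the _Q_PENDING_BITS == 8 case, and the
GCC/Clang __atomic builtins stand in for the kernel primitives. Like the
kernel structure, it relies on union type punning between the 32-bit
word and its locked byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 0-7 locked, bits 8-15 pending, bits 16-31 tail. */
#define QH_LOCKED_VAL          1U
#define QH_LOCKED_PENDING_MASK 0xFFFFU

/* Hypothetical stand-in for struct __qspinlock (GCC/Clang macros assumed). */
union qhead_lock {
    uint32_t val;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    uint8_t locked;             /* lowest address holds bits 0-7 */
#else
    struct {
        uint8_t reserved[3];
        uint8_t locked;         /* highest address holds bits 0-7 */
    };
#endif
};

/*
 * Model of the queue head taking the lock. Returns true when the cmpxchg
 * path ran (sole waiter, tail cleared) and false when the plain byte
 * store was enough (other waiters are queued, tail left alone).
 */
static bool qhead_acquire(union qhead_lock *l, uint32_t tail)
{
    uint32_t val;

    /* The tail code passed in must not carry locked or pending bits. */
    assert((tail & QH_LOCKED_PENDING_MASK) == 0);

    /* Wait for the owner and the pending waiter to go away. */
    while ((val = __atomic_load_n(&l->val, __ATOMIC_ACQUIRE)) &
           QH_LOCKED_PENDING_MASK)
        ;   /* the kernel calls cpu_relax() here */

    for (;;) {
        if (val != tail) {
            /* Someone is queued behind us: write only the locked byte. */
            __atomic_store_n(&l->locked, (uint8_t)QH_LOCKED_VAL,
                             __ATOMIC_RELAXED);
            return false;
        }
        /* Sole waiter: clear the tail and set locked in one atomic step. */
        if (__atomic_compare_exchange_n(&l->val, &val, QH_LOCKED_VAL, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return true;
        /* On failure, val now holds the current word; retry. */
    }
}
```

In this model the expensive read-modify-write is only paid when the
queue head is the last waiter and must clear the tail; in every other
case a relaxed byte store suffices.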

With that change, there is a slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.

                [Standalone/Embedded - same node]
  # of tasks    Before patch    After patch     %Change
  ----------    -----------     ----------      -------
       3         2324/2321      2248/2265        -3%/-2%
       4         2890/2896      2819/2831        -2%/-2%
       5         3611/3595      3522/3512        -2%/-2%
       6         4281/4276      4173/4160        -3%/-3%
       7         5018/5001      4875/4861        -3%/-3%
       8         5759/5750      5563/5568        -3%/-3%

                [Standalone/Embedded - different nodes]
  # of tasks    Before patch    After patch     %Change
  ----------    -----------     ----------      -------
       3        12242/12237     12087/12093      -1%/-1%
       4        10688/10696     10507/10521      -2%/-2%

It was also found that this change produced a much bigger performance
improvement on the newer IvyBridge-EX chip and was essential to closing
the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

                AIM7 XFS Disk Test
  kernel                 JPM    Real Time   Sys Time    Usr Time
  -----                  ---    ---------   --------    --------
  ticketlock            5678233    3.17       96.61       5.81
  qspinlock             5750799    3.13       94.83       5.97

                AIM7 EXT4 Disk Test
  kernel                 JPM    Real Time   Sys Time    Usr Time
  -----                  ---    ---------   --------    --------
  ticketlock            1114551   16.15      509.72       7.11
  qspinlock             2184466    8.24      232.99       6.01

The ext4 filesystem run had much higher spinlock contention than
the xfs filesystem run.

The "ebizzy -m" test was also run with the following results:

  kernel               records/s  Real Time   Sys Time    Usr Time
  -----                ---------  ---------   --------    --------
  ticketlock             2075       10.00      216.35       3.49
  qspinlock              3023       10.00      198.20       4.80

Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Scott J Norton <scott.norton@xxxxxx>
Cc: Paolo Bonzini <paolo.bonzini@xxxxxxxxx>
Cc: Douglas Hatch <doug.hatch@xxxxxx>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Raghavendra K T <raghavendra.kt@xxxxxxxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Signed-off-by: Waiman Long <Waiman.Long@xxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: http://lkml.kernel.org/r/1421784755-21945-7-git-send-email-Waiman.Long@xxxxxx
---
 kernel/locking/qspinlock.c |   61 +++++++++++++++++++++++++++++++++------------
 1 file changed, 45 insertions(+), 16 deletions(-)
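
To illustrate why a store to the locked byte leaves the pending and
tail fields alone, here is a minimal userspace sketch of the byte
overlay. The names (union qm_lock, QM_LOCKED_VAL,
qm_word_after_set_locked) are hypothetical, the layout is assumed to be
the _Q_PENDING_BITS == 8 one, and the endianness check uses GCC/Clang
predefined macros; it is not the kernel structure itself.

```c
#include <assert.h>
#include <stdint.h>

#define QM_LOCKED_VAL 1U

/*
 * Hypothetical model of the lock word: the locked byte always overlays
 * bits 0-7 of val, whichever end of the word that byte lives at.
 */
union qm_lock {
    uint32_t val;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    uint8_t locked;             /* first byte is the least significant */
#else
    struct {
        uint8_t reserved[3];
        uint8_t locked;         /* last byte is the least significant */
    };
#endif
};

/* Relaxed single-byte store: no read-modify-write, no full barrier. */
static inline void qm_set_locked(union qm_lock *l)
{
    __atomic_store_n(&l->locked, (uint8_t)QM_LOCKED_VAL, __ATOMIC_RELAXED);
}

/* Return the 32-bit word that results from setting locked on `word`. */
static inline uint32_t qm_word_after_set_locked(uint32_t word)
{
    union qm_lock l = { .val = word };

    /* The caller is expected to have seen the locked byte clear. */
    assert((word & 0xFFU) == 0);
    qm_set_locked(&l);
    return l.val;
}
```

For example, a word holding only a tail code of 5 (0x00050000) would
come back as 0x00050001: the upper three bytes are untouched because
the store is one byte wide.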

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -105,24 +105,33 @@ static inline struct mcs_spinlock *decod
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
        union {
                atomic_t val;
-               struct {
 #ifdef __LITTLE_ENDIAN
+               u8       locked;
+               struct {
                        u16     locked_pending;
                        u16     tail;
+               };
 #else
+               struct {
                        u16     tail;
                        u16     locked_pending;
-#endif
                };
+               struct {
+                       u8      reserved[3];
+                       u8      locked;
+               };
+#endif
        };
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -209,6 +218,19 @@ static __always_inline u32 xchg_tail(str
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+       struct __qspinlock *l = (void *)lock;
+
+       WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -343,8 +365,13 @@ void queue_spin_lock_slowpath(struct qsp
         * go away.
         *
         * *,x,y -> *,0,0
+        *
+        * this wait loop must use a load-acquire such that we match the
+        * store-release that clears the locked bit and create lock
+        * sequentiality; this is because the set_locked() function below
+        * does not imply a full barrier.
         */
-       while ((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK)
+       while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
                cpu_relax();
 
        /*
@@ -352,15 +379,19 @@ void queue_spin_lock_slowpath(struct qsp
         *
         * n,0,0 -> 0,0,1 : lock, uncontended
         * *,0,0 -> *,0,1 : lock, contended
+        *
+        * If the queue head is the only one in the queue (lock value == tail),
+        * clear the tail code and grab the lock. Otherwise, we only need
+        * to grab the lock.
         */
        for (;;) {
-               new = _Q_LOCKED_VAL;
-               if (val != tail)
-                       new |= val;
-
-               old = atomic_cmpxchg(&lock->val, val, new);
-               if (old == val)
+               if (val != tail) {
+                       set_locked(lock);
                        break;
+               }
+               old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+               if (old == val)
+                       goto release;   /* No contention */
 
                val = old;
        }
@@ -368,12 +399,10 @@ void queue_spin_lock_slowpath(struct qsp
        /*
         * contended path; wait for next, release.
         */
-       if (new != _Q_LOCKED_VAL) {
-               while (!(next = READ_ONCE(node->next)))
-                       cpu_relax();
+       while (!(next = READ_ONCE(node->next)))
+               cpu_relax();
 
-               arch_mcs_spin_unlock_contended(&next->locked);
-       }
+       arch_mcs_spin_unlock_contended(&next->locked);
 
 release:
        /*



