
Re: [Xen-devel] [PATCH 1/2] rwlock: add per-cpu reader-writer locks

On 17/11/15 17:00, Jan Beulich wrote:
>>>> On 03.11.15 at 18:58, <malcolm.crossley@xxxxxxxxxx> wrote:
>> Per-cpu read-write locks allow the fast-path read case to have low overhead
>> by only setting/clearing a per-cpu variable for using the read lock.
>> The per-cpu read fast path also avoids locked compare-and-swap operations,
>> which can be particularly slow on coherent multi-socket systems, particularly
>> if there is heavy usage of the read lock itself.
>> The per-cpu reader-writer lock uses a global variable to control the read
>> lock fast path. This allows a writer to disable the fast path and ensure the
>> readers use the underlying read-write lock implementation.
>> Once the writer has taken the write lock and disabled the fast path, it must
>> poll the per-cpu variable for all CPUs which have entered the critical
>> section for the specific read-write lock the writer is attempting to take.
>> This design allows a single per-cpu variable to be used for read/write locks
>> belonging to separate data structures, as long as multiple per-cpu read
>> locks are not simultaneously held by one particular CPU. This also means
>> per-cpu reader-writer locks are not recursion safe.
> This sounds like pretty severe a restriction, and something to easily
> get wrong once more than a handful of users got introduced.

I can add an ASSERT to detect the recursion in the debug hypervisor, but you
are right that it is a restriction. I believe ticket locks are not recursion
safe either, and rely on additional infrastructure to prevent the recursion
occurring on the lock itself.

>> Slow path readers which are unblocked set the per-cpu variable and drop the
>> read lock. This simplifies the implementation and allows the fairness of the
>> underlying read-write lock to be taken advantage of.
>> There may be slightly more overhead on the per-cpu write lock path due to
>> checking each CPU's fast path read variable, but this overhead is likely to
>> be hidden by the required delay of waiting for readers to exit the critical
>> section.
> Hmm, the "slightly" here seems to depend on the number of CPUs.

It also depends on the lock contention between the readers and the writers.

The worst case is the uncontended case where only writers are using the lock.
Each time a writer takes the lock it will also read all the per-cpu areas.

The best case is the read lock path, in both the contended and uncontended
cases, which simply checks a global variable and then updates a per-cpu area.

The reader/writer contended case can have variable overhead, which depends on
the time it takes the readers to exit the critical section. If the last
readers are just about to exit the critical section when the writer attempts
to take the lock, then the overhead will be similar to the worst case.
However, if the time it takes the readers to exit the critical section equals
the time the writer takes to do a first pass read of all the per-cpu areas,
then the writer overhead is much lower. Potentially, if there is a single
"lagging" reader exiting the critical section, then the writer will spin
reading only that reader's per-cpu area, and the writer lock overhead may be
lower than the cmpxchg operation a standard rwlock would have.

>> --- a/xen/common/spinlock.c
>> +++ b/xen/common/spinlock.c
>> @@ -492,6 +492,38 @@ int _rw_is_write_locked(rwlock_t *lock)
>>      return (lock->lock == RW_WRITE_FLAG); /* writer in critical section? */
>>  }
>> +void percpu_write_lock(rwlock_t **per_cpudata, bool_t *writer_activating,
>> +            rwlock_t *rwlock)
>> +{
>> +    int cpu;
> unsigned int

Will fix.
>> +    cpumask_t active_readers;
> Did you consider ways to avoid such a large variable to be put on
> the stack, and none turned out suitable?

I wasn't aware the cpumask_t variable at 32 bytes was considered problematic
for the stack. I can't think of any way around the issue without causing
further restrictions (a percpu area will prevent taking two different percpu
rwlocks at the same time).

>> +    cpumask_copy(&active_readers, &cpu_online_map);
> While we have other cases not fully hotplug safe, I don't think we
> can allow this in basic locking infrastructure code. You need to
> protect against cpu_online_map changing.

I could move this copy to after the point the global barrier has been raised.
This should make it safe for CPU online/offline, because newly onlined CPUs
will have to take the read lock. CPUs which go offline should have exited the
read critical section and so should not cause any problems.

>> +    /* First take the write lock to protect against other writers */
> Most if not all of your comments are missing a full stop.

OK, I'll fix this up.
>> +    write_lock(rwlock);
>> +
>> +    /* Now set the global variable so that readers start using read_lock */
>> +    *writer_activating = true;
>> +    smp_mb();
>> +
>> +    /* Check if there are any percpu readers in progress on our rwlock*/
>> +    do
>> +    {
>> +        for_each_cpu(cpu, &active_readers)
>> +        {
>> +            /* Remove any percpu readers not contending 
>> +             * from our check mask */
> Comment style.

>> +            if ( per_cpu_ptr(per_cpudata, cpu) != rwlock )
>> +                cpumask_clear_cpu(cpu, &active_readers);
> This is a LOCKed op, and with the subject of the patch being to lower
> the load on the fabric, I wonder whether time isn't ripe for introducing
> a non-LOCKed variant of set, clear, and their test_and_ variants.

As long as the stack is allocated per-cpu then I would hope there should be
no fabric cost for operations on stack variables. Non-locked variants would
be more efficient.

>> +        }
>> +        /* Check if we've cleared all percpu readers */
>> +        if ( cpumask_empty(&active_readers) )
>> +            break;
>> +        /* Give the coherency fabric a break */
>> +        cpu_relax();
>> +    } while ( 1 );
> While certainly a matter of taste, could I talk you into using
> for ( ; ; ) instead?

That's fine by me.

>> --- a/xen/include/xen/spinlock.h
>> +++ b/xen/include/xen/spinlock.h
>> @@ -3,6 +3,7 @@
>>  #include <asm/system.h>
>>  #include <asm/spinlock.h>
>> +#include <xen/stdbool.h>
> No inclusion of this header in code other than such shared with the
> tools please.

Will fix this up to use <asm/types.h>.

>> @@ -261,4 +262,40 @@ int _rw_is_write_locked(rwlock_t *lock);
>>  #define rw_is_locked(l)               _rw_is_locked(l)
>>  #define rw_is_write_locked(l)         _rw_is_write_locked(l)
>> +static inline void percpu_read_lock(rwlock_t **per_cpudata, bool_t 
>> *writer_activating,
>> +            rwlock_t *rwlock)
>> +{
>> +    /* Indicate this cpu is reading */
>> +    this_cpu_ptr(per_cpudata) = rwlock;
>> +    smp_mb();
>> +    /* Check if a writer is waiting */
>> +    if ( unlikely(*writer_activating) )
>> +    {
>> +        /* Let the waiting writer know we aren't holding the lock */
>> +        this_cpu_ptr(per_cpudata) = NULL;
>> +        /* Wait using the read lock to keep the lock fair */
>> +        read_lock(rwlock);
>> +        /* Set the per CPU data again and continue*/
>> +        this_cpu_ptr(per_cpudata) = rwlock;
>> +        /* Drop the read lock because we don't need it anymore */
>> +        read_unlock(rwlock);
>> +    }
>> +}
> So am I getting it right that the whole point of the introduction of
> this lock flavor is to replace the cmpxchg() in read_lock() with just
> the write+read above? If so I'm sure you considered alternatives,
> namely of modifying the plain rw locks - would you mind sharing
> the outcome thereof? 

I attempted to use Linux style queued rwlocks but they have the same
limitation of maintaining read lock state in a shared data variable.
This shared data variable will always suffer from cache line bouncing
and so a data parallel technique was the only way to avoid the cache
line bouncing.

I also tried using an atomic counter and a global barrier in the grant table
specific use case, so that the grant table version could be changed safely,
but the shared state with multiple writers still suffered from the cache
line bouncing problem.

> I admit I can't see alternatives, but the price
> - as said above - seems quite high. Or maybe I'm misunderstanding
> the "as long as multiple per-cpu read locks are not simultaneously
> held by one particular cpu", as having made it here I can't right
> away see where that restriction comes from?

I've tried to detail the overhead earlier in my reply. The price is
mainly paid when writers are using an uncontended lock. The grant
table write lock is almost never taken in the typical use case and
so for particular algorithms/code flows the price may be worth paying.

>> +static inline void percpu_read_unlock(rwlock_t **per_cpudata)
>> +{
>> +    this_cpu_ptr(per_cpudata) = NULL;
>> +    smp_wmb();
>> +}
>> +
>> +/* Don't inline percpu write lock as it's a complex function */
>> +void percpu_write_lock(rwlock_t **per_cpudata, bool_t *writer_activating,
>> +            rwlock_t *rwlock);
>> +
>> +static inline void percpu_write_unlock(bool_t *writer_activating, rwlock_t 
>> *rwlock)
>> +{
>> +    *writer_activating = false;
>> +    write_unlock(rwlock);
>> +}
> Considering that the arguments one needs to pass to any of the
> functions here taking multiple arguments need to be in close sync,
> I'm missing abstracting macros here that need to be passed just
> a single argument (plus a declaration and a definition one).

I agree the per_cpudata and writer_activating arguments are both linked
"global" state and could be abstracted with macros. The rwlock_t is
per-resource state and would need to be passed as a separate argument.

So I have some high level questions:

With the current restrictions on the percpu rwlocks (no recursion, and no
accessing two percpu rwlock resources simultaneously on the same PCPU), do
you think it is worth creating a generic implementation?

Or should I just add a grant table specific implementation because the grant
table code conforms to the above restrictions?

What requirements would there be for a generic percpu rwlock implementation
to be accepted?

Is infrastructure allowing percpu areas to be used in data structures required?

Does the "no recursion" restriction need to be removed?

Does the "no accessing two percpu rwlock resources simultaneously on the same
PCPU" restriction need to be removed?

FYI, I will add assertions to my next patch set so that violations of the
restrictions will be caught at run time.

Thanks for the review feedback and I appreciate the time you spent to review 
this code.


> Jan
