[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1/2] rwlock: add per-cpu reader-writer locks

On 18/11/15 14:15, Jan Beulich wrote:
>>>> On 18.11.15 at 14:49, <malcolm.crossley@xxxxxxxxxx> wrote:
>> On 17/11/15 17:00, Jan Beulich wrote:
>>>>>> On 03.11.15 at 18:58, <malcolm.crossley@xxxxxxxxxx> wrote:
>>>> Per-cpu read-write locks allow for the fast path read case to have low 
>>>> overhead
>>>> by only setting/clearing a per-cpu variable for using the read lock.
>>>> The per-cpu read fast path also avoids locked compare swap operations 
>>>> which can
>>>> be particularly slow on coherent multi-socket systems, particularly if 
>>>> there is
>>>> heavy usage of the read lock itself.
>>>> The per-cpu reader-writer lock uses a global variable to control the read 
>>>> lock
>>>> fast path. This allows a writer to disable the fast path and ensure the 
>>>> readers
>>>> use the underlying read-write lock implementation.
>>>> Once the writer has taken the write lock and disabled the fast path, it 
>>>> must
>>>> poll the per-cpu variable for all CPU's which have entered the critical 
>>>> section
>>>> for the specific read-write lock the writer is attempting to take. This 
>>>> design
>>>> allows for a single per-cpu variable to be used for read/write locks 
>>>> belonging
>>>> to seperate data structures as long as multiple per-cpu read locks are not
>>>> simultaneously held by one particular cpu. This also means per-cpu
>>>> reader-writer locks are not recursion safe.
>>> This sounds like pretty severe a restriction, and something to easily
>>> get wrong once more than a handful of users got introduced.
>> I can add an ASSERT to detect the recursion for the debug hypervisor but you 
>> are right
>> that it is a restriction. I believe that the tickets locks are not recursion 
>> safe
>> either and rely on additional infrastructure to prevent the recursion 
>> occurring on
>> the lock itself.
> No ordinary spin lock can be acquired recursively. The restriction here
> is more severe: No two spin locks protecting the same kind of resource
> can't be used recursively.

I have a suggestion to remove the "no accessing two percpu rwlock resources
simultaneously on the same PCPU" restriction:

On accessing the second resource, we can detect that the percpu read area is 
in use, if this is the case then we can just take standard read_lock on the 
resource instead of using the percpu read area. This makes the percpu rw lock 
to use on multiple resources simulatenously but it falls back to standard rw 
performance for taking a two or more percpu rw lock simultaneously.

I'm suggesting the above option instead of per resource (NR_DOM_IDS), percpu 
data areas
because there may be cases where the number of resources is very large and so
difficult/expensive to preallocate as percpu data areas.

>>>> +    cpumask_t active_readers;
>>> Did you consider ways to avoid such a large variable to be put on
>>> the stack, and none turned out suitable?
>> I wasn't aware the cpumask_t variable at 32 bytes was considered problematic
>> for the stack. I can't think of any way around the issue without causing
>> further restrictions (a percpu area will prevent taking two different percpu
>> rwlocks at the same time).
> 32 bytes it is only for 256 CPU builds. What about a 4095 CPU config?
> Wouldn't it be possible (either excluding use of such locks from
> interrupt context, or by disabling interrupts for the duration of
> the mask's use) to have a per-CPU mask?

That sounds possible, I'd prefer preventing the use of the lock in
irq context so that the common case does have the penalty of disabling IRQS.

It may also take some time for the readers to exit their critical section
so disabling IRQ's may cause significant interrupt latency.

The trade off here is between IRQ latency and stack usage.
Worst case stack usage would occur with many nested interrupts.
The only way I can think of to preserve IRQ latency and stop stack usage
would be to have preallocated percpu cpumask for the maximum level of interrupt

I believe for x86 the maximum level of interrupt nesting is 16 (one per vector 

With 4095 CPU's a cpumask_t would be 512 bytes and so 16 nested interrupts 
would use all 8K
of stack, where as with 256 CPU we have 32 bytes and so only use a maximum of 
512 bytes.

With there being so many users of cpumask_t maybe we should scale the stack 
size with number
of NR_CPU's instead?

Or I can preallocate "MAX_NR_NESTING_LEVELS" percpu array of cpumask_t?

Or just disable interrupts and take the hit on IRQ latency?

>>> I admit I can't see alternatives, but the price
>>> - as said above - seems quite high. Or maybe I'm misunderstanding
>>> the "as long as multiple per-cpu read locks are not simultaneously
>>> held by one particular cpu", as having made it here I can't right
>>> away see where that restriction comes from?
>> I've tried to detail the overhead earlier in my reply. The price is
>> mainly paid when writers are using an uncontended lock. The grant
>> table write lock is almost never taken in the typical use case and
>> so for particular algorithms/code flows the price may be worth paying.
> With "price" I was really referring to the usage restriction, not any
> performance effect.

I thought you meant overhead when you mentioned "price" :) I should have
picked up the context you provided after the "price" statement however.

>>>> +static inline void percpu_read_unlock(rwlock_t **per_cpudata)
>>>> +{
>>>> +    this_cpu_ptr(per_cpudata) = NULL;
>>>> +    smp_wmb();
>>>> +}
>>>> +
>>>> +/* Don't inline percpu write lock as it's a complex function */
>>>> +void percpu_write_lock(rwlock_t **per_cpudata, bool_t *writer_activating,
>>>> +          rwlock_t *rwlock);
>>>> +
>>>> +static inline void percpu_write_unlock(bool_t *writer_activating, 
>>>> rwlock_t *rwlock)
>>>> +{
>>>> +    *writer_activating = false;
>>>> +    write_unlock(rwlock);
>>>> +}
>>> Considering that the arguments one needs to pass to any of the
>>> functions here taking multiple arguments need to be in close sync,
>>> I'm missing abstracting macros here that need to be passed just
>>> a single argument (plus a declaration and a definition one).
>> I agree the per_cpudata and writer_activating arguments are both linked
>> "global" state and could be abstracted with macro's. The rwlock_t is a
>> per resource state and would need to be passed as a seperate argument.
>> So I have some high level questions:
>> With the current restrictions on the percpu rwlocks (no recursion and no
>> accessing two percpu rwlocks resources simultaneously on the same PCPU,
>> Do you think it is worth creating a generic implementation?
> Yes, namely with Andrew already hinting at the p2m lock also wanting
> to be converted to this model.

Converting the p2m lock is definitely on the list to be converted due to it's
use in grant operations as well.

>> Or should I just add a grant table specific implementation because the grant
>> table code conforms to the above restrictions?
>> What requirements would there be for a generic percpu rwlock implementation
>> to be accepted?
> At least the multiple resource lock issue should be avoided, and I think
> I've pointed out before how I think this can be achieved without any
> fundamental changes.

Does the suggestion above satisfy this requirement?

>> Is infrastructure allowing percpu areas to be used in data structures 
>> required?
> At some point it seems this would be desirable, but I think I've
> explained a way to avoid this getting overly complex for now.
>> Does the "no recursion" restriction need to be removed?
> Preferably yes, but I think this one can be easier lived with.
>> Does the "no accessing two percpu rwlock resources simultaneously on the 
>> same PCPU"
>> restriction need to be removed?
> See above - yes.

Thanks again for the feedback


> Jan

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.