[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1/2] rwlock: add per-cpu reader-writer locks

>>> On 18.11.15 at 14:49, <malcolm.crossley@xxxxxxxxxx> wrote:
> On 17/11/15 17:00, Jan Beulich wrote:
>>>>> On 03.11.15 at 18:58, <malcolm.crossley@xxxxxxxxxx> wrote:
>>> Per-cpu read-write locks allow for the fast path read case to have low 
>>> overhead
>>> by only setting/clearing a per-cpu variable for using the read lock.
>>> The per-cpu read fast path also avoids locked compare swap operations which 
>>> can
>>> be particularly slow on coherent multi-socket systems, particularly if 
>>> there is
>>> heavy usage of the read lock itself.
>>> The per-cpu reader-writer lock uses a global variable to control the read 
>>> lock
>>> fast path. This allows a writer to disable the fast path and ensure the 
>>> readers
>>> use the underlying read-write lock implementation.
>>> Once the writer has taken the write lock and disabled the fast path, it must
>>> poll the per-cpu variable for all CPU's which have entered the critical 
>>> section
>>> for the specific read-write lock the writer is attempting to take. This 
>>> design
>>> allows for a single per-cpu variable to be used for read/write locks 
>>> belonging
>>> to seperate data structures as long as multiple per-cpu read locks are not
>>> simultaneously held by one particular cpu. This also means per-cpu
>>> reader-writer locks are not recursion safe.
>> This sounds like pretty severe a restriction, and something to easily
>> get wrong once more than a handful of users got introduced.
> I can add an ASSERT to detect the recursion for the debug hypervisor but you 
> are right
> that it is a restriction. I believe that the tickets locks are not recursion 
> safe
> either and rely on additional infrastructure to prevent the recursion 
> occurring on
> the lock itself.

No ordinary spin lock can be acquired recursively. The restriction here
is more severe: No two spin locks protecting the same kind of resource
can't be used recursively.

>>> +    cpumask_t active_readers;
>> Did you consider ways to avoid such a large variable to be put on
>> the stack, and none turned out suitable?
> I wasn't aware the cpumask_t variable at 32 bytes was considered problematic
> for the stack. I can't think of any way around the issue without causing
> further restrictions (a percpu area will prevent taking two different percpu
> rwlocks at the same time).

32 bytes it is only for 256 CPU builds. What about a 4095 CPU config?

Wouldn't it be possible (either excluding use of such locks from
interrupt context, or by disabling interrupts for the duration of
the mask's use) to have a per-CPU mask?

>> I admit I can't see alternatives, but the price
>> - as said above - seems quite high. Or maybe I'm misunderstanding
>> the "as long as multiple per-cpu read locks are not simultaneously
>> held by one particular cpu", as having made it here I can't right
>> away see where that restriction comes from?
> I've tried to detail the overhead earlier in my reply. The price is
> mainly paid when writers are using an uncontended lock. The grant
> table write lock is almost never taken in the typical use case and
> so for particular algorithms/code flows the price may be worth paying.

With "price" I was really referring to the usage restriction, not any
performance effect.

>>> +static inline void percpu_read_unlock(rwlock_t **per_cpudata)
>>> +{
>>> +    this_cpu_ptr(per_cpudata) = NULL;
>>> +    smp_wmb();
>>> +}
>>> +
>>> +/* Don't inline percpu write lock as it's a complex function */
>>> +void percpu_write_lock(rwlock_t **per_cpudata, bool_t *writer_activating,
>>> +           rwlock_t *rwlock);
>>> +
>>> +static inline void percpu_write_unlock(bool_t *writer_activating, rwlock_t 
>>> *rwlock)
>>> +{
>>> +    *writer_activating = false;
>>> +    write_unlock(rwlock);
>>> +}
>> Considering that the arguments one needs to pass to any of the
>> functions here taking multiple arguments need to be in close sync,
>> I'm missing abstracting macros here that need to be passed just
>> a single argument (plus a declaration and a definition one).
> I agree the per_cpudata and writer_activating arguments are both linked
> "global" state and could be abstracted with macro's. The rwlock_t is a
> per resource state and would need to be passed as a seperate argument.
> So I have some high level questions:
> With the current restrictions on the percpu rwlocks (no recursion and no
> accessing two percpu rwlocks resources simultaneously on the same PCPU,
> Do you think it is worth creating a generic implementation?

Yes, namely with Andrew already hinting at the p2m lock also wanting
to be converted to this model.

> Or should I just add a grant table specific implementation because the grant
> table code conforms to the above restrictions?
> What requirements would there be for a generic percpu rwlock implementation
> to be accepted?

At least the multiple resource lock issue should be avoided, and I think
I've pointed out before how I think this can be achieved without any
fundamental changes.

> Is infrastructure allowing percpu areas to be used in data structures 
> required?

At some point it seems this would be desirable, but I think I've
explained a way to avoid this getting overly complex for now.

> Does the "no recursion" restriction need to be removed?

Preferably yes, but I think this one can be easier lived with.

> Does the "no accessing two percpu rwlock resources simultaneously on the 
> same PCPU"
> restriction need to be removed?

See above - yes.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.