Xen project Mailing List

Re: [Xen-devel] [PATCH 1/2] rwlock: add per-cpu reader-writer locks

On 18/11/15 14:15, Jan Beulich wrote: >>>> On 18.11.15 at 14:49, <malcolm.crossley@xxxxxxxxxx> wrote: >> On 17/11/15 17:00, Jan Beulich wrote: >>>>>> On 03.11.15 at 18:58, <malcolm.crossley@xxxxxxxxxx> wrote: >>>> Per-cpu read-write locks allow for the fast path read case to have low >>>> overhead >>>> by only setting/clearing a per-cpu variable for using the read lock. >>>> The per-cpu read fast path also avoids locked compare swap operations >>>> which can >>>> be particularly slow on coherent multi-socket systems, particularly if >>>> there is >>>> heavy usage of the read lock itself. >>>> >>>> The per-cpu reader-writer lock uses a global variable to control the read >>>> lock >>>> fast path. This allows a writer to disable the fast path and ensure the >>>> readers >>>> use the underlying read-write lock implementation. >>>> >>>> Once the writer has taken the write lock and disabled the fast path, it >>>> must >>>> poll the per-cpu variable for all CPU's which have entered the critical >>>> section >>>> for the specific read-write lock the writer is attempting to take. This >>>> design >>>> allows for a single per-cpu variable to be used for read/write locks >>>> belonging >>>> to seperate data structures as long as multiple per-cpu read locks are not >>>> simultaneously held by one particular cpu. This also means per-cpu >>>> reader-writer locks are not recursion safe. >>> >>> This sounds like pretty severe a restriction, and something to easily >>> get wrong once more than a handful of users got introduced. >> >> I can add an ASSERT to detect the recursion for the debug hypervisor but you >> are right >> that it is a restriction. I believe that the tickets locks are not recursion >> safe >> either and rely on additional infrastructure to prevent the recursion >> occurring on >> the lock itself. > > No ordinary spin lock can be acquired recursively. The restriction here > is more severe: No two spin locks protecting the same kind of resource > can't be used recursively. I have a suggestion to remove the "no accessing two percpu rwlock resources simultaneously on the same PCPU" restriction: On accessing the second resource, we can detect that the percpu read area is already in use, if this is the case then we can just take standard read_lock on the second resource instead of using the percpu read area. This makes the percpu rw lock safe to use on multiple resources simulatenously but it falls back to standard rw lock performance for taking a two or more percpu rw lock simultaneously. I'm suggesting the above option instead of per resource (NR_DOM_IDS), percpu data areas because there may be cases where the number of resources is very large and so difficult/expensive to preallocate as percpu data areas. > >>>> + cpumask_t active_readers; >>> >>> Did you consider ways to avoid such a large variable to be put on >>> the stack, and none turned out suitable? >> >> I wasn't aware the cpumask_t variable at 32 bytes was considered problematic >> for the stack. I can't think of any way around the issue without causing >> further restrictions (a percpu area will prevent taking two different percpu >> rwlocks at the same time). > > 32 bytes it is only for 256 CPU builds. What about a 4095 CPU config? > > Wouldn't it be possible (either excluding use of such locks from > interrupt context, or by disabling interrupts for the duration of > the mask's use) to have a per-CPU mask? That sounds possible, I'd prefer preventing the use of the lock in irq context so that the common case does have the penalty of disabling IRQS. It may also take some time for the readers to exit their critical section so disabling IRQ's may cause significant interrupt latency. The trade off here is between IRQ latency and stack usage. Worst case stack usage would occur with many nested interrupts. The only way I can think of to preserve IRQ latency and stop stack usage would be to have preallocated percpu cpumask for the maximum level of interrupt nesting. I believe for x86 the maximum level of interrupt nesting is 16 (one per vector nibble) With 4095 CPU's a cpumask_t would be 512 bytes and so 16 nested interrupts would use all 8K of stack, where as with 256 CPU we have 32 bytes and so only use a maximum of 512 bytes. With there being so many users of cpumask_t maybe we should scale the stack size with number of NR_CPU's instead? Or I can preallocate "MAX_NR_NESTING_LEVELS" percpu array of cpumask_t? Or just disable interrupts and take the hit on IRQ latency? > >>> I admit I can't see alternatives, but the price >>> - as said above - seems quite high. Or maybe I'm misunderstanding >>> the "as long as multiple per-cpu read locks are not simultaneously >>> held by one particular cpu", as having made it here I can't right >>> away see where that restriction comes from? >> >> I've tried to detail the overhead earlier in my reply. The price is >> mainly paid when writers are using an uncontended lock. The grant >> table write lock is almost never taken in the typical use case and >> so for particular algorithms/code flows the price may be worth paying. > > With "price" I was really referring to the usage restriction, not any > performance effect. I thought you meant overhead when you mentioned "price" :) I should have picked up the context you provided after the "price" statement however. > >>>> +static inline void percpu_read_unlock(rwlock_t **per_cpudata) >>>> +{ >>>> + this_cpu_ptr(per_cpudata) = NULL; >>>> + smp_wmb(); >>>> +} >>>> + >>>> +/* Don't inline percpu write lock as it's a complex function */ >>>> +void percpu_write_lock(rwlock_t **per_cpudata, bool_t *writer_activating, >>>> + rwlock_t *rwlock); >>>> + >>>> +static inline void percpu_write_unlock(bool_t *writer_activating, >>>> rwlock_t *rwlock) >>>> +{ >>>> + *writer_activating = false; >>>> + write_unlock(rwlock); >>>> +} >>> >>> Considering that the arguments one needs to pass to any of the >>> functions here taking multiple arguments need to be in close sync, >>> I'm missing abstracting macros here that need to be passed just >>> a single argument (plus a declaration and a definition one). >> >> I agree the per_cpudata and writer_activating arguments are both linked >> "global" state and could be abstracted with macro's. The rwlock_t is a >> per resource state and would need to be passed as a seperate argument. >> >> >> So I have some high level questions: >> >> With the current restrictions on the percpu rwlocks (no recursion and no >> accessing two percpu rwlocks resources simultaneously on the same PCPU, >> Do you think it is worth creating a generic implementation? > > Yes, namely with Andrew already hinting at the p2m lock also wanting > to be converted to this model. Converting the p2m lock is definitely on the list to be converted due to it's use in grant operations as well. > >> Or should I just add a grant table specific implementation because the grant >> table code conforms to the above restrictions? >> >> What requirements would there be for a generic percpu rwlock implementation >> to be accepted? > > At least the multiple resource lock issue should be avoided, and I think > I've pointed out before how I think this can be achieved without any > fundamental changes. Does the suggestion above satisfy this requirement? > >> Is infrastructure allowing percpu areas to be used in data structures >> required? > > At some point it seems this would be desirable, but I think I've > explained a way to avoid this getting overly complex for now. > >> Does the "no recursion" restriction need to be removed? > > Preferably yes, but I think this one can be easier lived with. > >> Does the "no accessing two percpu rwlock resources simultaneously on the >> same PCPU" >> restriction need to be removed? > > See above - yes. Thanks again for the feedback Malcolm > > Jan > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.