
Re: [Xen-devel] [PATCHv2 0/3] Implement per-cpu reader-writer locks



On 24/11/15 18:16, George Dunlap wrote:
> On 20/11/15 16:03, Malcolm Crossley wrote:
>> This patch series adds per-cpu reader-writer locks as a generic lock
>> implementation and then converts the grant table and p2m rwlocks to
>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>
>> CPU profiling has revealed the rwlocks themselves suffer from severe cache
>> line bouncing due to the cmpxchg operation used even when taking a read lock.
>> Multiqueue paravirtualised I/O results in heavy contention of the grant table
>> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
>> by the overhead of the cache line bouncing itself.
>>
>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>> area to record that a CPU has taken the read lock. Correctness is enforced
>> for the write lock by using a per-lock barrier, which forces the per-cpu
>> read lock to revert to using a standard read lock. The write lock then
>> polls all the per-cpu data areas until active readers for the lock have
>> exited.
>>
>> Removing the cache line bouncing on a multi-socket Haswell-EP system
>> dramatically improves performance, with 16 vCPU network IO performance going
>> from 15 Gb/s to 64 Gb/s! The host under test was fully utilising all 40
>> logical CPUs at 64 Gb/s, so a host with more logical CPUs may see an even
>> better IO improvement.
> Impressive -- thanks for doing this work.
>
> One question: Your description here sounds like you've tested with a
> single large domain, but what happens with multiple domains?

The test was with two domUs, doing network traffic between them.

>
> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
> particular type (e.g., all domains share the per-cpu p2m rwlock).
> (Correct me if I'm wrong here.)

The per-cpu pointer area is shared by all locks of a particular type, but
the individual lock can still be distinguished by following the pointer
back.

Therefore, the locks themselves are not actually shared.

>
> Which means two things:
>
>  1) *Any* writer will have to wait for the rwlock of that type to be
> released on *all* domains before being able to write.  Is there any risk
> that on a busy system, this will be an unusually long wait?

No.  The write-locker looks through all pcpus and ignores any whose
per-cpu pointer is not pointing at the intended percpu_rwlock.

From _percpu_write_lock():

/*
 * Remove any percpu readers not contending on this rwlock
 * from our check mask.
 */
if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
    cpumask_clear_cpu(cpu, &this_cpu(percpu_rwlock_readers));

As a result, two calls to _percpu_write_lock() against different
percpu_rwlock_t's will not interact.

>
> 2) *All* domains will have to take the slow path for reading when a
> *any* domain has or is trying to acquire the write lock.  What is the
> probability that on a busy system that turns out to be "most of the time"?

_percpu_read_lock() will only take the slow path if:

1) The intended lock is (or is in the process of being) taken for writing
2) The call path has nested _percpu_read_lock() calls on the same type of lock

By inspection, case 2) does not currently occur in the code.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

