Re: [Xen-devel] [PATCH v10 03/19] qspinlock: Add pending bit
On 05/14/2014 03:13 PM, Radim Krčmář wrote:
> 2014-05-14 19:00+0200, Peter Zijlstra:
>> On Wed, May 14, 2014 at 06:51:24PM +0200, Radim Krčmář wrote:
>>> Ok.
>>> I've seen merit in pvqspinlock even with slightly slower first-waiter,
>>> so I would have happily sacrificed those horrible branches.
>>> (I prefer elegant to optimized code, but I can see why we want to be
>>> strictly better than ticketlock.)
>>> Peter mentioned that we are focusing on bare-metal patches, so I'll
>>> withhold my other paravirt rants until they are polished.
> (It was an ambiguous sentence, I have comments for later patches.)
>
>> Well, paravirt must happen too, but comes later in this series; patch 3,
>> which we're replying to, is still very much in the bare-metal part of
>> the series.
> (I think that bare metal spans the first 7 patches.)
>
>> I've not had time yet to decode all that Waiman has done to make
>> paravirt work. But as a general rule I like patches that start with
>> something simple and working and then optimize it; this series doesn't
>> seem to quite grasp that.

Assuming that taking a spinlock is fairly frequent in the kernel, the
node structure cacheline won't be so cold after all.

>> So traditional locks like test-and-set and the ticket lock only ever
>> access the spinlock word itself; this MCS-style queueing lock has a
>> second (and, see my other rants in this thread, when done wrong more
>> than 2) cacheline to touch.
>>
>> That said, all our benchmarking is pretty much for the cache-hot case,
>> so I'm not entirely convinced yet that the one pending bit makes up
>> for it; it does in the cache-hot case.
> Yeah, we probably use the faster pre-lock quite a lot.
> Cover letter states that queue depth 1-3 is a bit slower than ticket
> spinlock, so we might not be losing if we implemented a faster
> in-word lock of this capacity.
> (Not that I'm a fan of the hybrid lock.)

I had tried an experimental patch with 2 pending bits. However, the
benchmark test that I used shows the performance is even worse than
without any pending bit. I probably need to revisit that later as to
why this is the case. For now, I will focus on having just one pending
bit. If we can find a way to get better performance out of more than
one pending bit later on, we can always submit another patch to do that.

>> But... writing cache-cold benchmarks is _hard_ :/
> Wouldn't clflush of the second cacheline before trying for the lock
> give us a rough estimate?

clflush is a very expensive operation, and I doubt that it will be
indicative of real-life performance at all.

BTW, there is no way to write a cache-cold benchmark for that 2nd
cacheline, as any call to spin_lock will likely access it if there is
enough contention.

-Longman
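[Editor's note: for readers following the thread, below is a minimal userspace sketch of the pending-bit idea under discussion: one extra bit in the lock word lets a single extra waiter spin on the lock word itself instead of touching the MCS node cacheline. The bit layout, names, and the spinning slow path are simplified assumptions for illustration; this is not the qspinlock code from the patch.]

```c
/*
 * Illustrative sketch only (not the kernel qspinlock): a lock word with a
 * locked byte plus a single "pending" bit, so the second contender never
 * leaves the lock word's cacheline.  Layout and names are assumptions.
 */
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED  0x01U    /* assumed: low byte holds the lock */
#define PENDING 0x100U   /* assumed: single pending bit above it */

struct sketch_qspinlock {
    atomic_uint val;     /* one lock word; no MCS tail encoded here */
};

static bool sketch_trylock(struct sketch_qspinlock *l)
{
    unsigned int old = 0;
    return atomic_compare_exchange_strong(&l->val, &old, LOCKED);
}

static void sketch_lock(struct sketch_qspinlock *l)
{
    unsigned int old;

    if (sketch_trylock(l))          /* uncontended: one word, one CAS */
        return;

    /* Try to become the single "pending" waiter. */
    old = atomic_fetch_or(&l->val, PENDING);
    if (!(old & PENDING)) {
        /* We own the pending bit: spin on the lock word only. */
        while (atomic_load(&l->val) & LOCKED)
            ;
        /* Holder released; convert pending -> locked. */
        atomic_store(&l->val, LOCKED);
        return;
    }

    /*
     * Slow path: a real qspinlock would queue on a per-CPU MCS node here
     * (the "second cacheline" in the thread); this sketch just spins
     * until both the locked and pending bits are clear.
     */
    do {
        old = 0;
    } while (!atomic_compare_exchange_weak(&l->val, &old, LOCKED));
}

static void sketch_unlock(struct sketch_qspinlock *l)
{
    atomic_fetch_and(&l->val, ~LOCKED);
}
```

In this sketch the pending bit's benefit is exactly the cache-hot argument above: the second contender stays on the lock word's cacheline, and only the third and later waiters would pay for the separate node structure.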
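[Editor's note: and a rough x86-only sketch of the clflush experiment floated above, which evicts the node's cacheline before timing an access to it. As the reply points out, clflush is itself expensive, so numbers from this would only be a crude upper bound; the node layout and timing loop are assumptions for illustration.]

```c
/*
 * Rough sketch of "clflush the second cacheline before the lock attempt"
 * to approximate the cache-cold cost of the MCS node.  Illustration only;
 * the struct layout and loop are assumed, not taken from the patch.
 */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* _mm_clflush, _mm_mfence, __rdtsc */

struct mcs_node {               /* stand-in for the per-CPU queue node */
    struct mcs_node *next;
    int locked;
} __attribute__((aligned(64))); /* keep it on its own cacheline */

static struct mcs_node node;

int main(void)
{
    uint64_t start, cycles = 0;
    const int iters = 100000;

    for (int i = 0; i < iters; i++) {
        _mm_clflush(&node);     /* force the "second cacheline" cold */
        _mm_mfence();           /* order the flush before the access */

        start = __rdtsc();
        node.locked = 0;        /* touch the node as the slow path would */
        node.next = NULL;
        cycles += __rdtsc() - start;
    }

    printf("avg cycles to touch cold node: %llu\n",
           (unsigned long long)(cycles / iters));
    return 0;
}
```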