
Re: [Xen-devel] [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

To: Raghavendra K T <raghavendra.kt@xxxxxxxxxxxxxxxxxx>
From: Attilio Rao <attilio.rao@xxxxxxxxxx>
Date: Wed, 21 Mar 2012 13:04:25 +0000
Cc: Marcelo Tosatti <mtosatti@xxxxxxxxxx>, KVM <kvm@xxxxxxxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, the arch/x86 maintainers <x86@xxxxxxxxxx>, LKML <linux-kernel@xxxxxxxxxxxxxxx>, Virtualization <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>, Andi Kleen <andi@xxxxxxxxxxxxxx>, Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx>, Avi Kivity <avi@xxxxxxxxxx>, Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Ingo Molnar <mingo@xxxxxxx>, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>, Xen Devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, Stephan Diestelhorst <stephan.diestelhorst@xxxxxxx>
Delivery-date: Wed, 21 Mar 2012 13:04:45 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 21/03/12 10:20, Raghavendra K T wrote:

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>

Rather than outright replacing the entire spinlock implementation in
order to paravirtualize it, keep the ticket lock implementation but add
a couple of pvops hooks on the slow path (long spin on lock, unlocking
a contended lock).

Ticket locks have a number of nice properties, but they also have some
surprising behaviours in virtual environments.  They enforce a strict
FIFO ordering on cpus trying to take a lock; however, if the hypervisor
scheduler does not schedule the cpus in the correct order, the system can
waste a huge amount of time spinning until the next cpu can take the lock.

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf  for more details.)
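
The FIFO problem described above can be illustrated with a small model. This is only a sketch, not the kernel's code: the struct, the function names and the "scheduled" flags are hypothetical, and the model is single-threaded on purpose so that the ordering is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a ticket lock: head is the ticket currently
 * being served, tail is the next ticket to hand out. */
struct ticketlock_model {
    uint16_t head;
    uint16_t tail;
};

/* A cpu that wants the lock takes the next ticket. */
static uint16_t take_ticket(struct ticketlock_model *l)
{
    return l->tail++;
}

/* Strict FIFO: only the cpu whose ticket equals head may enter. */
static bool may_enter(const struct ticketlock_model *l, uint16_t ticket)
{
    return l->head == ticket;
}

/* Releasing the lock hands it to the next ticket in line. */
static void release(struct ticketlock_model *l)
{
    assert(l->head != l->tail); /* the lock must be held */
    l->head++;
}

/* Count the waiters that could take the lock right now. A waiter makes
 * progress only if it holds the head ticket AND the hypervisor is
 * currently running it; "scheduled" models the second condition. */
static int waiters_able_to_progress(const struct ticketlock_model *l,
                                    const uint16_t *tickets,
                                    const bool *scheduled, int n)
{
    int count = 0;

    for (int i = 0; i < n; i++)
        if (scheduled[i] && may_enter(l, tickets[i]))
            count++;
    return count;
}
```

With one holder and two waiters, if the holder releases the lock while the first waiter is descheduled, the function returns 0: the second waiter is running, but it can only spin until the hypervisor schedules the first one again.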

To address this, we add two hooks:
  - __ticket_spin_lock which is called after the cpu has been
    spinning on the lock for a significant number of iterations but has
    failed to take the lock (presumably because the cpu holding the lock
    has been descheduled).  The lock_spinning pvop is expected to block
    the cpu until it has been kicked by the current lock holder.
  - __ticket_spin_unlock, which on releasing a contended lock
    (there are more cpus with tail tickets) looks to see if the next
    cpu is blocked and wakes it if so.

When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
functions causes all the extra code to go away.
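
As a rough illustration of where the two hooks sit, here is a user-space sketch using C11 atomics. It is not the patch's code: the hook table, SPIN_THRESHOLD, and the counting hook are hypothetical names made up for this example, and the no-op defaults only stand in for the stub functions mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>

#define SPIN_THRESHOLD 1024u /* hypothetical spin budget before blocking */

struct ticketlock {
    atomic_uint head; /* ticket currently being served */
    atomic_uint tail; /* next ticket to hand out */
};

/* Hypothetical hook table, loosely modelled on the idea of pvops. */
struct pv_lock_hooks {
    void (*lock_spinning)(struct ticketlock *lock, unsigned ticket);
    void (*unlock_kick)(struct ticketlock *lock, unsigned next);
};

/* No-op defaults: without a hypervisor backend the lock just spins. */
static void nop_spinning(struct ticketlock *lock, unsigned ticket)
{
    (void)lock; (void)ticket;
}

static void nop_kick(struct ticketlock *lock, unsigned next)
{
    (void)lock; (void)next;
}

static struct pv_lock_hooks pv_hooks = { nop_spinning, nop_kick };

/* Demo instrumentation only: records which ticket would be kicked. */
static unsigned kick_count, last_kicked;

static void counting_kick(struct ticketlock *lock, unsigned next)
{
    (void)lock;
    kick_count++;
    last_kicked = next;
}

static void ticket_lock(struct ticketlock *lock)
{
    unsigned ticket = atomic_fetch_add(&lock->tail, 1);

    for (;;) {
        /* Fast path: spin for a bounded number of iterations. */
        for (unsigned i = 0; i < SPIN_THRESHOLD; i++)
            if (atomic_load(&lock->head) == ticket)
                return;
        /* Slow path: a real backend would block this cpu here until
         * the lock holder kicks it. */
        pv_hooks.lock_spinning(lock, ticket);
    }
}

static void ticket_unlock(struct ticketlock *lock)
{
    assert(atomic_load(&lock->head) != atomic_load(&lock->tail));

    unsigned next = atomic_fetch_add(&lock->head, 1) + 1;

    /* Contended: more tickets were handed out, so wake the next owner. */
    if (atomic_load(&lock->tail) != next)
        pv_hooks.unlock_kick(lock, next);
}
```

An uncontended lock/unlock pair never reaches either hook. Installing counting_kick as pv_hooks.unlock_kick and unlocking while another ticket is outstanding triggers exactly one kick, for the next ticket in line.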

I've run some real-world benchmarks based on this series of patches applied on top of a vanilla Linux-3.3-rc6 (commit 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both CONFIG_PARAVIRT_SPINLOCKS=y and n, which means essentially 4 versions compared:

* vanilla - CONFIG_PARAVIRT_SPINLOCKS - patch
* vanilla + CONFIG_PARAVIRT_SPINLOCKS - patch
* vanilla - CONFIG_PARAVIRT_SPINLOCKS + patch
* vanilla + CONFIG_PARAVIRT_SPINLOCKS + patch

(You can check out the monolithic kernel configurations I used, and verify that this is the sole difference, here:)

http://xenbits.xen.org/people/attilio/jeremy-spinlock/kernel-configs/

Tests, information and results are summarized below.

== System used information:
* Machine is a Xeon X3450, 2.6GHz, 8-way system:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/dmesg
* System version, a Debian Squeeze 6.0.4:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/debian-version
* gcc version, 4.4.5:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/gcc-version

== Tests performed

* pgbench based on PostgreSQL 9.2 (development version), as it has a lot of scalability improvements in it:

http://www.postgresql.org/docs/devel/static/install-getsource.html

I used a stock installation, with only this simple configuration change:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/postsgresql.conf.patch

For collecting data I used this simple script, which runs the test 10 times, every time with a different set of threads (from 1 to 64). Please note that the first 8 runs cache all the data in memory in order to avoid subsequent I/O, thus they are discarded in sampling and calculation:

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench_script

Here is the raw data (please remember this is tps, thus the higher the better):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench-crude-datas/

And here is the data charted with the ministat tool, comparing all 4 kernel configurations for every thread configuration:

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench-9.2-total.bench

As you can see, the patch doesn't really show a statistically meaningful difference for this workload, excluding the single-thread run for the patched + CONFIG_PARAVIRT_SPINLOCKS=y case, which seems 5% faster.

* pbzip2, which is a parallel version of bzip2, meant to reproduce a CPU-intensive, multithreaded application. The file chosen for compression is 1GB in size, taken from /dev/urandom (it is not published but I may still have it, so if you need it for more tests please just ask), and all the I/O is done on a tmpfs volume in order to avoid I/O noise.

For collecting data I used this simple script, which runs the test 10 times, every time with a different set of threads (from 1 to 64):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2bench_script

Here is the raw data (please remember this is time(1) output, thus the lower the better):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2-crude-datas/

And here is the data charted with the ministat tool, comparing all 4 kernel configurations for every thread configuration:

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2-1.1.1-total.bench

As you can see, the patch doesn't really show a statistically meaningful difference for this workload.

* kernbench-0.50 run, doing I/O on a 10GB tmpfs volume (thus no actual I/O involved), with the following invocation:

./kernbench -n10 -s -c16 -M -f

(I had to do that because kernbench wasn't getting a good maximum value at all, thus I disabled the default maximum and forced 16 threads.)

Here is the raw data (please remember this is time(1) output, thus the lower the better):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/kernbench-crude-datas/

Please note that kernbench already calculates the standard deviation for them. However, I also wanted a ministat summary in order to quickly display any possible difference, thus I just replicated every value 3 times (the minimum required by ministat) and charted them:

http://xenbits.xen.org/people/attilio/jeremy-spinlock/kernbench-0.50-total.bench

Again, there doesn't seem to be any statistically meaningful difference.

== Results

These tests point in the direction that Jeremy's rebased patches don't introduce a performance penalty at all, but also that we could likely consider removing the CONFIG_PARAVIRT_SPINLOCKS option, or turning it on by default and suggesting disabling it only on very old CPUs (assuming a performance regression can be proven there).


If you have questions please let me know.

Thanks,
Attilio
