Xen project Mailing List

Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: Dante Cinco <dantecinco@xxxxxxxxx>

Date: Thu, 18 Nov 2010 10:43:57 -0800

Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, mathieu.desnoyers@xxxxxxxxxx, andrew.thomas@xxxxxxxxxx, keir.fraser@xxxxxxxxxxxxx, chris.mason@xxxxxxxxxx

Delivery-date: Thu, 18 Nov 2010 10:44:52 -0800

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing in a PCI device to his Xen PV guest and trying
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx's
> sparse_irq rework.
>
>> I wanted to confirm that bounce buffering was indeed occurring so I
>> modified swiotlb.c in the kernel and added printks in the following
>> functions:
>>   swiotlb_bounce
>>   swiotlb_tbl_map_single
>>   swiotlb_tbl_unmap_single
>> Sure enough we were calling all 3 five times per I/O. We took your
>> suggestion and replaced pci_map_single with pci_pool_alloc. The
>> swiotlb calls were gone but the I/O performance only improved 6% (29k
>> IOPS to 31k IOPS) which is still abysmal.
>
> Hey! 6% that is nothing to sneeze at.

When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at
least 20x higher (~700k IOPS).

>> Any suggestions on where to look next? I have one question about the
>
> So since you are talking IOPS I figured you must be using fio to run those
> numbers. And since you mentioned HVM at some point, you are not running
> this PV domain as a back-end for another PV guest. You are probably going
> to run some form of iSCSI target and stuff those down the PCI device.

Our setup is pure Fibre Channel. We're using a physically separate
system (also Linux-based) to initiate the SCSI I/Os.

> Couple of things that pop in my head.. but let's first address your question.
>
>> P2M array: Does the P2M lookup occur every DMA or just during the
>> allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central
>
> It only occurs during allocation. Also since you are bypassing the
> bounce buffer those calls are done without any spinlock. The lookup
> of P2M is bitshifting, division - and are constant - so O(1).
>
>> resource that could be a bottleneck?
>
> Doubt it. Your best bet to figure this out is to play with ftrace, or
> perf trace. But I don't know how well they work with Xen nowadays - Jeremy
> and Mathieu Desnoyers poked it a bit and I think I overheard that Mathieu
> got it working?
>
> So the next couple of possibilities are:
> 1). you are hitting the spinlock issues on 'struct request' or any of
> the paths on the I/O. Oracle did a lot of work on those - and one
> way to find this out is to look at tracing and see where the contention
> is. I don't know where or if those patches have been posted upstream..
> but as said, if you are seeing the spinlock usage high - that might be it.
> 1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled. Otherwise
> you are going to hit dreadful conditions.

I checked the config file and it is enabled: CONFIG_PARAVIRT_SPINLOCKS=y

> 2). You are hitting the 64-bit syscall wall. Basically your user-mode
> application (fio) is doing a write(), which used to be int 0x80 but now
> is a syscall. The syscall gets trapped in the hypervisor which has to
> call in your PV kernel. You get hit with two context switches for each
> 'write()' call. The solution is to use a 32-bit DomU as the guest user
> application and guest kernel run in different rings.

There is no user-space application involved in the I/O. It's all
kernel driver code that handles the I/O.

> 3). Xen CPU pools. You didn't say where the application that sends the IOs
> is located. But if it was in a separate domain then you will want to use
> Xen CPU pools. Basically this way you can get gang-scheduling where the
> guest that submits the I/O and the guest that picks up the I/O are running
> right after each other. I don't know much more details, but this is what
> I understand it does.
> 4). CPU/MSI-X affinity. I think you already did this, but make sure you pin
> your guest to specific CPUs and also pin the MSI-X (vectors) to the proper
> destination. You can use the 'xm debug-keys i' to see the MSI-X affinity
> - it is a mask and basically see if it overlays the CPUs you are running
> your guest at. Not sure how to actually set the MSI-X affinity ... now
> that I think. Keir or some of the Intel folks might know better.

There are 16 devices (multi-function) that are PCI-passed through to
domU. There are 16 VCPUs in domU and all are pinned to individual
PCPUs (24-CPU platform). Each IRQ in domU is affinitized to a CPU.
This strategy has worked well for us with the HVM kernel. Here's the
output of 'xm debug-keys i':

(XEN) IRQ: 67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a type=PCI-MSI status=00000010 in-flight=0 domain-list=1:127(----),
(XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000200 vec:43 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:126(----),
(XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000400 vec:83 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:125(----),
(XEN) IRQ: 70 affinity:00000000,00000000,00000000,00000800 vec:4b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:124(----),
(XEN) IRQ: 71 affinity:00000000,00000000,00000000,00001000 vec:8b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:123(----),
(XEN) IRQ: 72 affinity:00000000,00000000,00000000,00002000 vec:53 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:122(----),
(XEN) IRQ: 73 affinity:00000000,00000000,00000000,00004000 vec:93 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:121(----),
(XEN) IRQ: 74 affinity:00000000,00000000,00000000,00008000 vec:5b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:120(----),
(XEN) IRQ: 75 affinity:00000000,00000000,00000000,00010000 vec:9b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:119(----),
(XEN) IRQ: 76 affinity:00000000,00000000,00000000,00020000 vec:63 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:118(----),
(XEN) IRQ: 77 affinity:00000000,00000000,00000000,00040000 vec:a3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:117(----),
(XEN) IRQ: 78 affinity:00000000,00000000,00000000,00080000 vec:6b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:116(----),
(XEN) IRQ: 79 affinity:00000000,00000000,00000000,00100000 vec:ab type=PCI-MSI status=00000010 in-flight=0 domain-list=1:115(----),
(XEN) IRQ: 80 affinity:00000000,00000000,00000000,00200000 vec:73 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:114(----),
(XEN) IRQ: 81 affinity:00000000,00000000,00000000,00400000 vec:b3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:113(----),
(XEN) IRQ: 82 affinity:00000000,00000000,00000000,00800000 vec:7b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:112(----),

> 5). Andrew, Mukesh, Keir, Dan, any other ideas?

We're also trying Chris' 4 things to try and will consider Mathieu's
LTT suggestion.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
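The affinity field in the 'xm debug-keys i' output above is a hex bitmap, so checking whether each IRQ overlays the intended PCPU means decoding it by hand. Below is a minimal Python sketch that does the decoding. It assumes the mask is a list of comma-separated 32-bit words with the most significant word first (the usual cpumask print format); the names SAMPLE, mask_to_cpus and parse_affinity are hypothetical helpers for illustration, not part of any Xen tool, and only three of the lines above are reproduced (trimmed to the fields the parser reads).

```python
import re

# Three lines taken from the 'xm debug-keys i' output above, trimmed to
# the fields this sketch reads (IRQ number and affinity mask).
SAMPLE = """\
(XEN) IRQ: 67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a type=PCI-MSI
(XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000200 vec:43 type=PCI-MSI
(XEN) IRQ: 82 affinity:00000000,00000000,00000000,00800000 vec:7b type=PCI-MSI
"""

IRQ_RE = re.compile(r"IRQ:\s*(\d+)\s+affinity:([0-9a-fA-F,]+)")


def mask_to_cpus(mask):
    """Return the CPU numbers whose bits are set in an affinity mask.

    Assumption: comma-separated, zero-padded 32-bit hex words, most
    significant word first, so dropping the commas yields one hex number.
    """
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def parse_affinity(text):
    """Map each IRQ number found in the text to its list of CPUs."""
    return {int(m.group(1)): mask_to_cpus(m.group(2))
            for m in IRQ_RE.finditer(text)}


if __name__ == "__main__":
    for irq, cpus in sorted(parse_affinity(SAMPLE).items()):
        if len(cpus) > 1:
            print(f"IRQ {irq}: {len(cpus)} CPUs set (mask not narrowed)")
        else:
            print(f"IRQ {irq}: CPU {cpus[0]}")
```

Read this way, IRQs 68 through 82 each map to a single CPU, 9 through 23 in order, while IRQ 67 has every bit set in the mask Xen reports.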
