
[Xen-devel] Is: pci=assign-busses blows up Xen 4.4 Was:Re: [PATCH] x86/msi: Validate the guest-identified PCI devices in pci_prepare_msix()



On Fri, Jan 24, 2014 at 12:43:49PM -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Jan 24, 2014 at 04:19:15PM +0000, Jan Beulich wrote:
> > >>> On 24.01.14 at 16:01, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> 
> > >>> wrote:
> > > I built the kernel without the igb driver just to eliminate it being
> > > the culprit. Now I can boot without issues and this is what lspci
> > > reports:
> > > 
> > > -bash-4.1# lspci -s 02:00.0 -v
> > > 02:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network 
> > > Connection (rev 01)
> > >         Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter
> > >         Flags: bus master, fast devsel, latency 0, IRQ 10
> > >         Memory at f1420000 (32-bit, non-prefetchable) [size=128K]
> > >         Memory at f1000000 (32-bit, non-prefetchable) [size=4M]
> > >         I/O ports at e020 [size=32]
> > >         Memory at f1444000 (32-bit, non-prefetchable) [size=16K]
> > >         Expansion ROM at f0c00000 [disabled] [size=4M]
> > >         Capabilities: [40] Power Management version 3
> > >         Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> > >         Capabilities: [70] MSI-X: Enable- Count=10 Masked-
> > 
> > So here's a patch to figure out why we don't find this.
> 
> Thank you!
> 
> See attached log. The corresponding xen-syms is compressed and
> updated at : http://darnok.org/xen/xen-syms.gz
> 
> The interesting bit is:
> 
> (XEN) 02:00.0: status=0010 (alloc_pdev+0xb4/0x2e9 wants 11)
> (XEN) 02:00.0: pos=40
> (XEN) 02:00.0: id=01
> (XEN) 02:00.0: pos=50
> (XEN) 02:00.0: id=05
> (XEN) 02:00.0: pos=70
> (XEN) 02:00.0: id=11
> (XEN) 02:00.1: status=0010 (alloc_pdev+0xb4/0x2e9 wants 11)
> (XEN) 02:00.1: pos=40
> (XEN) 02:00.1: id=01
> (XEN) 02:00.1: pos=50
> (XEN) 02:00.1: id=05
> (XEN) 02:00.1: pos=70
> (XEN) 02:00.1: id=11

You were right that the device might be lacking the right
capability, but we were looking at the wrong BDF. I instrumented
the faulting operation to make sure I knew which BDF it was:

(XEN) 02:00.0: alloced (179)
(XEN) 02:00.0: alloced (189) ffff830239467f70,pdev ffff8302394660d0
(XEN) 02:00.1: alloced (179)
(XEN) 02:00.1: alloced (189) ffff830239466250,pdev ffff830239466190
(XEN) 04:00.0: alloced (179)
(XEN) 04:00.0: alloced (189) ffff830239466520,pdev ffff830239466460
(XEN) 05:00.0: status=0010 (alloc_pdev+0xb7/0x360 wants 11)
(XEN) 05:00.0: pos=60
(XEN) 05:00.0: id=0d
(XEN) 05:00.0: pos=a0
(XEN) 05:00.0: id=01
(XEN) 05:00.0: pos=00
(XEN) 05:00.0: no cap 11
(XEN) 08:00.0: alloced (179)
(XEN) 08:00.0: alloced (189) ffff830239466eb0,pdev ffff830239466df0

(XEN) [2014-01-25 03:42:08] msix_capability_init:759 for 05:00.0:, msix:0 dev:ffff8302394665b0
(XEN) [2014-01-25 03:42:08] ----[ Xen-4.4-rc2  x86_64  debug=y  Tainted:    C ]----
(XEN) [2014-01-25 03:42:08] CPU:    0
(XEN) [2014-01-25 03:42:08] RIP:    e008:[<ffff82d0801683d6>] msix_capability_init+0x210/0x63e
... snip..
(XEN) [2014-01-25 03:42:08] Xen call trace:
(XEN) [2014-01-25 03:42:08]    [<ffff82d0801683d6>] msix_capability_init+0x210/0x63e
(XEN) [2014-01-25 03:42:08]    [<ffff82d0801689c2>] pci_enable_msi+0x1be/0x4d7
(XEN) [2014-01-25 03:42:08]    [<ffff82d08016c68c>] map_domain_pirq+0x222/0x5ad
(XEN) [2014-01-25 03:42:08]    [<ffff82d08017f134>] physdev_map_pirq+0x507/0x5d1
(XEN) [2014-01-25 03:42:08]    [<ffff82d08017f844>] do_physdev_op+0x646/0x1232
(XEN) [2014-01-25 03:42:08]    [<ffff82d0802223ab>] syscall_enter+0xeb/0x145
(XEN) [2014-01-25 03:42:08] 
(XEN) [2014-01-25 03:42:08] Pagetable walk from 0000000000000004:
(XEN) [2014-01-25 03:42:08]  L4[0x000] = 0000000000000000 ffffffffffffffff
(XEN) [2014-01-25 03:42:08] 
(XEN) [2014-01-25 03:42:08] ****************************************
(XEN) [2014-01-25 03:42:08] Panic on CPU 0:
(XEN) [2014-01-25 03:42:08] FATAL PAGE FAULT
(XEN) [2014-01-25 03:42:08] [error_code=0000]
(XEN) [2014-01-25 03:42:08] Faulting linear address: 0000000000000004
(XEN) [2014-01-25 03:42:08] ****************************************
(XEN) [2014-01-25 03:42:08] 
(XEN) [2014-01-25 03:42:08] Manual reset required ('noreboot' specified)

lspci shows (baremetal kernel, with said driver):

bash-4.1# lspci -s 05:00.0 -v 
05:00.0 Ethernet controller: Intel Corporation Device 1533 (rev 03)
        Subsystem: Super Micro Computer Inc Device 1533
        Flags: bus master, fast devsel, latency 0, IRQ 19
        Memory at f1900000 (32-bit, non-prefetchable) [size=512K]
        I/O ports at c000 [size=32]
        Memory at f1980000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 00-25-90-ff-ff-86-be-f1
        Capabilities: [1a0] #17
        Kernel driver in use: igb

aka, Intel I210 

lspci shows (Xen, kernel does not have igb built-in):

-bash-4.1# lspci -s 05:00.0 -v
05:00.0 Ethernet controller: Intel Corporation Device 1533 (rev 03)
        Subsystem: Super Micro Computer Inc Device 1533
        Flags: bus master, fast devsel, latency 0, IRQ 11
        Memory at f1900000 (32-bit, non-prefetchable) [size=512K]
        I/O ports at c000 [size=32]
        Memory at f1980000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable- Count=5 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 00-25-90-ff-ff-86-be-f1
        Capabilities: [1a0] #17

And with -xxx:

bash-4.1# lspci -s 05:00.0 -xxx
05:00.0 Ethernet controller: Intel Corporation Device 1533 (rev 03)
00: 86 80 33 15 07 00 10 00 03 00 00 02 10 00 00 00
10: 00 00 90 f1 00 00 00 00 01 c0 00 00 00 00 98 f1
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 33 15
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 00 00
40: 01 50 23 c8 08 20 00 00 00 00 00 00 00 00 00 00
50: 05 70 80 01 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 11 a0 04 00 03 00 00 00 03 20 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff
a0: 10 00 02 00 c2 8c 00 10 07 28 19 00 11 5c 42 00
b0: 40 00 11 10 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
d0: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Which would imply that we should start at offset '40' (the
capabilities pointer stored at 0x34), not '60'!
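
For reference, the capability walk that the Xen debug output traces
(pos=..., id=...) can be reproduced offline from the dumps in this mail.
This is a minimal Python sketch, not Xen's code: parse_dump() and
walk_capabilities() are hypothetical helper names, and only the rows the
walk actually touches are copied in (all other bytes are left as zero).

```python
# Rows copied from 'lspci -s 05:00.0 -xxx' (the I210).
I210 = """
00: 86 80 33 15 07 00 10 00 03 00 00 02 10 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 00 00
40: 01 50 23 c8 08 20 00 00 00 00 00 00 00 00 00 00
50: 05 70 80 01 00 00 00 00 00 00 00 00 00 00 00 00
70: 11 a0 04 00 03 00 00 00 03 20 00 00 00 00 00 00
a0: 10 00 02 00 c2 8c 00 10 07 28 19 00 11 5c 42 00
"""

# Rows copied from the 'pci=earlydump' output for 0000:05:00.0.
BRIDGE = """
00: e3 10 13 81 07 00 10 00 01 01 04 06 00 00 01 00
30: ff 00 00 00 60 00 00 00 00 00 00 00 ff 00 10 00
60: 0d a0 00 00 d9 15 05 08 00 00 00 00 00 00 00 00
a0: 01 00 03 f8 08 00 00 00 00 00 00 00 00 00 00 00
"""

def parse_dump(text):
    """Turn 'offset: b0 b1 ...' rows into a 256-byte config-space image."""
    cfg = bytearray(256)
    for row in text.strip().splitlines():
        offset, data = row.split(":")
        base = int(offset, 16)
        for i, byte in enumerate(data.split()):
            cfg[base + i] = int(byte, 16)
    return cfg

def walk_capabilities(cfg):
    """Return [(position, capability ID), ...] by following next pointers."""
    caps = []
    status = cfg[0x06] | (cfg[0x07] << 8)
    if not status & 0x10:           # status bit 4: capability list present
        return caps
    pos = cfg[0x34] & 0xFC          # capabilities pointer, dword aligned
    seen = set()                    # guard against a looping list
    while pos >= 0x40 and pos not in seen:
        seen.add(pos)
        caps.append((pos, cfg[pos]))
        pos = cfg[pos + 1] & 0xFC   # byte 1 of each capability: next pointer
    return caps

for name, dump in (("I210", I210), ("bridge", BRIDGE)):
    chain = walk_capabilities(parse_dump(dump))
    print(name, " ".join(f"pos={p:02x}/id={i:02x}" for p, i in chain))
```

On the I210 image the chain starts at 0x40 and reaches MSI-X (id 0x11)
at 0x70. On the bridge image it starts at 0x60 and ends without an MSI-X
capability, which is the same 'pos=60, id=0d, pos=a0, id=01, no cap 11'
sequence Xen printed for 05:00.0.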


If I boot baremetal with 'pci=earlydump' I get:

[    0.000000] pci 0000:05:00.0 config space:
[    0.000000]   00: e3 10 13 81 07 00 10 00 01 01 04 06 00 00 01 00
[    0.000000]   10: 00 00 00 00 00 00 00 00 05 06 07 20 f1 01 a0 22
[    0.000000]   20: 50 f1 60 f1 f1 ff 01 00 00 00 00 00 00 00 00 00
[    0.000000]   30: ff 00 00 00 60 00 00 00 00 00 00 00 ff 00 10 00
[    0.000000]   40: 00 aa 00 00 00 19 90 7d 80 01 00 00 07 03 00 00
[    0.000000]   50: 68 89 09 80 00 1f 00 00 00 01 00 00 00 00 00 00
[    0.000000]   60: 0d a0 00 00 d9 15 05 08 00 00 00 00 00 00 00 00
[    0.000000]   70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   a0: 01 00 03 f8 08 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   b0: 00 00 00 00 40 00 00 00 00 00 00 00 ef fb be 07
[    0.000000]   c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Which does indeed show that at bootup the PCI configuration
space is different.

<blink>And the vendor/device ID does not match!

If I look at one that has it:
[    0.000000] pci 0000:04:00.0 config space:
[    0.000000]   00: 86 80 33 15 07 00 10 00 03 00 00 02 10 00 00 00
[    0.000000]   10: 00 00 90 f1 00 00 00 00 01 c0 00 00 00 00 98 f1
[    0.000000]   20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 33 15
[    0.000000]   30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 00 00
[    0.000000]   40: 01 50 23 c8 08 20 00 00 00 00 00 00 00 00 00 00
[    0.000000]   50: 05 70 80 01 00 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   70: 11 a0 04 00 03 00 00 00 03 20 00 00 00 00 00 00
[    0.000000]   80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   90: 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff
[    0.000000]   a0: 10 00 02 00 c2 8c 00 10 07 28 19 00 11 5c 42 00
[    0.000000]   b0: 42 00 11 10 00 00 00 00 00 00 00 00 00 00 00 00
[    0.000000]   c0: 00 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

That matches reality better: what is 04:00.0 at boot is what later
shows up as 05:00.0.
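
To make the mismatch concrete, the first 0x20 bytes of the two early
dumps can be decoded by hand. The sketch below is only an illustration
(decode_header() is a hypothetical helper); the field offsets are the
standard PCI configuration header ones.

```python
# First two rows of each 'pci=earlydump' block quoted in this mail.
EARLY_05_00_0 = ("e3 10 13 81 07 00 10 00 01 01 04 06 00 00 01 00 "
                 "00 00 00 00 00 00 00 00 05 06 07 20 f1 01 a0 22")
EARLY_04_00_0 = ("86 80 33 15 07 00 10 00 03 00 00 02 10 00 00 00 "
                 "00 00 90 f1 00 00 00 00 01 c0 00 00 00 00 98 f1")

def decode_header(hex_bytes):
    """Decode the identifying fields of a PCI configuration header."""
    cfg = bytes.fromhex(hex_bytes)
    info = {
        "vendor": int.from_bytes(cfg[0x00:0x02], "little"),
        "device": int.from_bytes(cfg[0x02:0x04], "little"),
        "class": cfg[0x0B],
        "subclass": cfg[0x0A],
        "header_type": cfg[0x0E] & 0x7F,   # bit 7 is the multi-function flag
    }
    if info["header_type"] == 1:           # type 1 header: PCI-to-PCI bridge
        info["primary"], info["secondary"], info["subordinate"] = cfg[0x18:0x1B]
    return info

for name, dump in (("05:00.0", EARLY_05_00_0), ("04:00.0", EARLY_04_00_0)):
    hdr = decode_header(dump)
    print(name, {k: hex(v) for k, v in hdr.items()})
```

The early 05:00.0 decodes as vendor 0x10e3, device 0x8113 with a type 1
header and bus numbers 05/06/07, that is, the Tundra bridge. The early
04:00.0 decodes as vendor 0x8086, device 0x1533, class 02 (network),
which is the I210.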

The reason that is happening is probably because of:

-bash-4.1# cat /proc/cmdline 
initrd=initramfs.cpio.gz console=ttyS0,115200 kgdboc=ttyS0 pci=assign-busses pci=earlydump BOOT_IMAGE=vmlinuz 
-bash-4.1# 

Namely 'pci=assign-busses', which is needed for SR-IOV to work.

If I don't use that parameter, the Linux kernel (baremetal and under Xen)
tells me:


-bash-4.1# cat /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/sriov_numvfs
0
-bash-4.1# cat /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/sriov_totalvfs
7
-bash-4.1# echo 7 > /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/sriov_numvfs
-bash: echo: write error: Cannot allocate memory
-bash-4.1# dmesg | tail
[  241.874349] random: sshd urandom read with 63 bits of entropy available
[  242.918267] Loading iSCSI transport class v2.0-870.
[  242.926046] iscsi: registered transport (tcp)
[  244.689798] scsi8 : iSCSI Initiator over TCP/IP
[  244.709799]  connection1:0: detected conn error (1020)
[  244.969450] device-mapper: ioctl: 4.27.0-ioctl (2013-10-30) initialised: dm-devel@xxxxxxxxxx
[  244.980434] device-mapper: multipath: version 1.6.0 loaded
[  250.027291] random: nonblocking pool is initialized
[  256.282312] switch: port 1(eth0) entered forwarding state
[  365.468641] igb 0000:02:00.0: SR-IOV: bus number out of range


And sure enough if I boot Xen without 'pci=assign-busses' it works just
fine.

Ugh.

I wonder how Xen 4.3 would actually do the PCI passthrough - it booted with
the 'assign-busses' - but I hadn't tried to do PCI passthrough of the
PF device (the I210).

If I do pass in '05:00.0' (the new bus number), I wonder if it will use the
IOMMU context of whatever '05:00.0' was _before_ the bus re-assignment, aka:

05:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01) (prog-if 01 [Subtractive decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=05, secondary=06, subordinate=07, sec-latency=32
        Memory behind bridge: f1500000-f16fffff
        Capabilities: [60] Subsystem: Super Micro Computer Inc Device 0805
        Capabilities: [a0] Power Management version 3

Which I think would confuse Xen, as this is clearly labeled as a bridge,
not a PCI device.


The reason for me using 'pci=assign-busses' is that it looks to be
the only option to use SR-IOV.

Which I suppose makes sense, as it tries to create VFs right after its own
bus id:


  +-01.1-[02-03]--+-[0000:03]-+-10.0  Intel Corporation 82576 Virtual Function
           |               |           +-10.1  Intel Corporation 82576 Virtual Function
           |               |           +-10.2  Intel Corporation 82576 Virtual Function
           |               |           +-10.3  Intel Corporation 82576 Virtual Function
           |               |           +-10.4  Intel Corporation 82576 Virtual Function
           |               |           +-10.5  Intel Corporation 82576 Virtual Function
           |               |           +-10.6  Intel Corporation 82576 Virtual Function
           |               |           +-10.7  Intel Corporation 82576 Virtual Function
           |               |           +-11.0  Intel Corporation 82576 Virtual Function
           |               |           +-11.1  Intel Corporation 82576 Virtual Function
           |               |           +-11.2  Intel Corporation 82576 Virtual Function
           |               |           +-11.3  Intel Corporation 82576 Virtual Function
           |               |           +-11.4  Intel Corporation 82576 Virtual Function
           |               |           \-11.5  Intel Corporation 82576 Virtual Function
           |               \-[0000:02]-+-00.0  Intel Corporation 82576 Gigabit Network Connection
           |                           \-00.1  Intel Corporation 82576 Gigabit Network Connection


But why does it have to have the bus _right_ after its own? Can't it
use one at the end of its bus-space? The bus right after it is occupied
by another card (if I boot without 'pci=assign-busses').

I do recall using this particular SR-IOV card on different hardware
a year ago or so. And it did work. I think that might be because
there were no PCI cards _after_ the SR-IOV card.
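
As far as I understand the SR-IOV spec, the VF addresses are not chosen
by the OS: VF n sits at routing ID = PF routing ID + First VF Offset +
n * VF Stride, with offset and stride read from the PF's SR-IOV
capability. The Python sketch below works that out for the 82576. The
offset (0x180) and stride (2) are assumptions inferred from the VF
addresses in the tree above, not values read from the card, and the
helper names are hypothetical.

```python
def rid(bus, dev, fn):
    """Pack bus/device/function into a 16-bit routing ID."""
    return (bus << 8) | (dev << 3) | fn

def bdf(r):
    """Format a 16-bit routing ID as bus:dev.fn."""
    return f"{r >> 8:02x}:{(r >> 3) & 0x1f:02x}.{r & 7}"

def vf_rids(pf_rid, first_vf_offset, vf_stride, num_vfs):
    """Routing IDs of VF 0..num_vfs-1 belonging to one PF."""
    return [pf_rid + first_vf_offset + n * vf_stride for n in range(num_vfs)]

FIRST_VF_OFFSET = 0x180   # assumption, inferred from 02:00.0 -> 03:10.0
VF_STRIDE = 2             # assumption, inferred from the interleaved VFs
NUM_VFS = 7               # sriov_totalvfs reported for 02:00.0

all_vfs = []
for pf in (rid(0x02, 0, 0), rid(0x02, 0, 1)):
    vfs = vf_rids(pf, FIRST_VF_OFFSET, VF_STRIDE, NUM_VFS)
    print(bdf(pf), "->", " ".join(bdf(v) for v in vfs))
    all_vfs += vfs

highest_bus = max(all_vfs) >> 8
print("highest bus needed by the VFs:", f"{highest_bus:02x}")
```

With those values both PFs on bus 02 need bus 03 for their VFs. Without
'pci=assign-busses' bus 03 already belongs to the 82571EB, so the bridge
above the 82576 presumably covers only bus 02 and the VFs fall outside
its range, which would explain 'SR-IOV: bus number out of range'.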

For posterity, with pci=assign-busses under baremetal (with SR-IOV enabled):
02:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:10.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:10.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:10.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:10.5 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:10.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:10.7 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:11.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:11.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:11.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:11.3 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
03:11.5 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
04:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
04:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
05:00.0 Ethernet controller: Intel Corporation Device 1533 (rev 03)
06:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01)
07:01.0 PCI bridge: Hint Corp HB6 Universal PCI-PCI bridge (non-transparent mode) (rev 11)
07:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
08:08.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
08:08.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
08:09.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
08:09.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
08:0a.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
08:0a.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
08:0b.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
08:0b.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
09:00.0 USB Controller: Renesas Technology Corp. Device 0015 (rev 02)
0a:00.0 SATA controller: Device 1b21:0612 (rev 01)

Without 'pci=assign-busses' under baremetal:
02:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
03:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
04:00.0 Ethernet controller: Intel Corporation Device 1533 (rev 03)
05:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01)
06:01.0 PCI bridge: Hint Corp HB6 Universal PCI-PCI bridge (non-transparent mode) (rev 11)
06:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
07:08.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
07:08.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
07:09.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
07:09.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
07:0a.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
07:0a.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
07:0b.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
07:0b.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
08:00.0 USB Controller: Renesas Technology Corp. Device 0015 (rev 02)
09:00.0 SATA controller: Device 1b21:0612 (rev 01)


This problem with the SR-IOV bus numbering seems to have been solved in 2009:

commit a28724b0fb909d247229a70761c90bb37b13366a
Author: Yu Zhao <yu.zhao@xxxxxxxxx>
Date:   Fri Mar 20 11:25:13 2009 +0800

    PCI: reserve bus range for SR-IOV device
    
    Reserve the bus number range used by the Virtual Function when
    pcibios_assign_all_busses() returns true.

And pcibios_assign_all_busses() is the one that returns true if
'pci=assign-busses' is set.

Attachment: tst035-jan-debug-2.txt
Description: Text document


 

