
Re: [Xen-devel] PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)



On 12/11/14 09:24, Jan Beulich wrote:
>>>> On 12.11.14 at 02:37, <konrad.wilk@xxxxxxxxxx> wrote:
>> When we PCI-insert a device, the BARs are not set at all - and hence
>> the Linux kernel is the one that tries to set the BARs. The
>> reason it cannot fit the device in the MMIO region is that the
>> _CRS only has certain ranges (even though the MMIO region can
>> cover 2GB). See:
>>
>> Without any devices (and me doing PCI insertion after that):
>> # dmesg | grep "bus resource"
>> [    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
>> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
>> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
>> [    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
>> [    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]
>>
>> With the device (my GPU card) inserted so that hvmloader can enumerate it:
>>  dmesg | grep 'resource'     
>> [    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
>> [    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
>> [    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
>> [    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
>> [    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
>>
>> I chatted with Bjorn and Rafael on IRC about how PCI insertion works
>> on baremetal, and it sounds like Thunderbolt device insertion is an
>> interesting case. The SMM sets the BAR regions to fit within the MMIO
>> (which is advertised by the _CRS) and it then pokes the OS to enumerate
>> the BARs. The OS is free to use what the firmware has set or renumber
>> it. The end result is that since the SMM 'fits' the BAR inside the
>> pre-set _CRS window it all works. We do not do that.
> 
> Who does the BAR assignment is pretty much orthogonal to the
> problem at hand: If the region reserved for MMIO is too small,
> no-one will be able to fit a device in there. Plus, what is being
> reported as root bus resource doesn't have to have a
> connection to the ranges usable for MMIO at all, at least if I
> assume that the (Dell) system I'm right now looking at isn't
> completely screwed:
> 
> pci_bus 0000:00: root bus resource [bus 00-ff]
> pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> pci_bus 0000:00: root bus resource [mem 0x00000000-0x3fffffffff]
> 
> (i.e. it simply reports the full usable 38 bits wide address space)
> 
> Looking at another (Intel) one, there is no mention of regions
> above the 4G boundary at all:
> 
> pci_bus 0000:00: root bus resource [bus 00-3d]
> pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000cbfff]
> pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfedfffff]
> pci_bus 0000:00: root bus resource [mem 0xd0000000-0xf7ffffff]
> 
> Not sure how the OS would know it is safe to assign BARs above
> 4Gb here.
> 
> In any event, what you need is an equivalent of the frequently
> seen BIOS option controlling the size of the space to be reserved
> for MMIO (often allowing it to be 1, 2, or 3 Gb). I.e. an alternative
> (or extension) to the dynamic lowering of pci_mem_start in
> hvmloader.
> 

I agree with Jan. By using xl pci-attach you are effectively hotplugging
a PCI device (as in the bare metal case). The only way this will work
reliably is if you reserve some MMIO space for the device you are about
to attach. You cannot simply use space above the 4G boundary, because
the PCI device may have 32-bit-only BARs, and then its MMIO cannot be
placed at addresses above 4G.
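
For reference, whether a BAR can live above 4G at all is visible in its
low bits. A minimal sketch in plain C (the constant names are made up
for illustration, not taken from any particular header):

#include <stdbool.h>
#include <stdint.h>

#define BAR_SPACE_IO    0x1  /* bit 0: 1 = I/O space, 0 = memory space */
#define BAR_MEM_TYPE    0x6  /* bits 2:1: memory BAR type              */
#define BAR_MEM_TYPE_64 0x4  /* type 10b: 64-bit memory BAR            */

/* A BAR may only be placed above 4G if it is a 64-bit memory BAR. */
static bool bar_is_64bit_mem(uint32_t bar)
{
    return !(bar & BAR_SPACE_IO) &&
           (bar & BAR_MEM_TYPE) == BAR_MEM_TYPE_64;
}

Anything that isn't a 64-bit memory BAR has to be squeezed into the
below-4G hole.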

The problem you have is that you cannot predict how much MMIO space to
reserve, because you don't know in advance how many PCI devices you are
going to hotplug or how much MMIO space each device requires.
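
As a very rough sketch of what Jan's suggested knob might boil down to
in hvmloader - the names and constants below are assumptions for
illustration, not actual hvmloader code:

#include <stdint.h>

#define DEFAULT_HOLE_START 0xf0000000u  /* assumed default hole start */

/* Lower the start of the below-4G PCI hole by a guest-configurable
 * amount reserved for future hotplug.  RAM displaced from below the
 * hole would have to be relocated above 4G, and bounds checking is
 * omitted; both are elided here. */
static uint32_t pci_mem_start_for(uint64_t extra_hotplug_mmio)
{
    uint64_t start = DEFAULT_HOLE_START - extra_hotplug_mmio;

    /* Keep the hole start 256MB aligned so large, naturally aligned
     * BARs can still be placed at its bottom. */
    start &= ~(((uint64_t)1 << 28) - 1);

    return (uint32_t)start;
}

That still leaves the policy question of how big to make the
reservation, which only the admin can answer.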

As for the _CRS regions: these typically describe the BIOS-set limits
in the hardware configuration for the MMIO hole itself. On single-socket
systems anything which isn't RAM or another predefined region decodes to
MMIO, which is probably why Jan's Dell system has a _CRS region covering
the entire address space.

On multi-socket systems the _CRS is very important because the chipset
is configured to decode only certain regions to the PCI Express ports.
If you use an address outside of those regions, the access goes
"nowhere" and the machine will crash.

Typically you will see a separate high-MMIO _CRS region if 64-bit BAR
support is enabled in the BIOS.


To do HVM PCI hotplug properly we need to reserve MMIO space below 4G
and emulate a hotplug-capable PCI-PCI bridge device. The bridge device
knows the maximum size of the MMIO window behind it (as allocated at
boot time), so we can calculate whether the device we are hotplugging
fits. If it doesn't fit we fail the hotplug; otherwise we allow it and
the OS will correctly allocate the BARs behind the bridge.
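
A sketch of that fit check, assuming we already know the (power-of-two)
size of each memory BAR on the device being attached:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Sort BAR sizes largest-first; with natural alignment this packs them
 * tightly into the window. */
static int cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return (x < y) - (x > y);
}

static bool device_fits(uint64_t win_base, uint64_t win_size,
                        uint64_t *bar_size, unsigned int nr_bars)
{
    uint64_t addr = win_base;
    unsigned int i;

    qsort(bar_size, nr_bars, sizeof(*bar_size), cmp_desc);

    for ( i = 0; i < nr_bars; i++ )
    {
        /* Naturally align each BAR to its size. */
        addr = (addr + bar_size[i] - 1) & ~(bar_size[i] - 1);
        if ( addr + bar_size[i] > win_base + win_size )
            return false;
        addr += bar_size[i];
    }

    return true;
}

If device_fits() says no, we refuse the pci-attach instead of letting
the guest kernel fail to place the BARs later.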

BTW, calculating the required MMIO for a multi-BAR PCI device is not
easy, because every BAR needs to be aligned to its own size (naturally
aligned).
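
For completeness, the usual way to discover a BAR's size (and hence its
required alignment) is the write-all-ones probe. A sketch, with the
config space accessors left as placeholders for whatever is available
where this runs:

#include <stdint.h>

/* Placeholder accessors - substitute the real config space helpers. */
extern uint32_t pci_conf_read32(unsigned int bdf, unsigned int reg);
extern void pci_conf_write32(unsigned int bdf, unsigned int reg,
                             uint32_t val);

/* Size of a 32-bit memory BAR at config offset 'reg'.  In practice you
 * would disable memory decode around the probe, and a 64-bit BAR needs
 * the same trick applied to its upper dword as well. */
static uint64_t bar_size(unsigned int bdf, unsigned int reg)
{
    uint32_t orig = pci_conf_read32(bdf, reg);
    uint32_t probed;

    pci_conf_write32(bdf, reg, ~0u);
    probed = pci_conf_read32(bdf, reg);
    pci_conf_write32(bdf, reg, orig);           /* restore */

    /* Low 4 bits of a memory BAR are flags, not address bits. */
    return probed ? (uint64_t)(~(probed & ~0xfu) + 1) : 0;
}

The size that comes back is also the alignment the BAR needs, which is
what makes packing several of them into a window fiddly.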

Malcolm


> Jan
> 
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

