
Re: [Xen-devel] PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)



On Wed, Nov 12, 2014 at 10:01:28AM +0000, Malcolm Crossley wrote:
> On 12/11/14 09:24, Jan Beulich wrote:
> >>>> On 12.11.14 at 02:37, <konrad.wilk@xxxxxxxxxx> wrote:
> >> When we PCI insert a device, the BARs are not set at all - and hence
> >> the Linux kernel is the one that tries to set the BARs. The
> >> reason it cannot fit the device in the MMIO region is due to the
> >> _CRS only having certain ranges (even though the MMIO region can
> >> cover 2GB). See:
> >>
> >> Without any devices (and me doing PCI insertion after that):
> >> # dmesg | grep "bus resource"
> >> [    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> >> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]
> >>
> >> With the device (my GPU card) inserted so that hvmloader can enumerate it:
> >>  dmesg | grep 'resource'     
> >> [    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> >> [    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> >> [    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
> >>
> >> I chatted with Bjorn and Rafael on IRC about how PCI insertion works
> >> on bare metal, and it sounds like Thunderbolt device insertion is an
> >> interesting case. The SMM sets the BAR regions to fit within the MMIO
> >> (which is advertised by the _CRS) and then pokes the OS to enumerate
> >> the BARs. The OS is free to use what the firmware has set or to
> >> renumber the BARs itself. The end result is that since the SMM 'fits'
> >> the BARs inside the pre-set _CRS window, it all works. We do not do that.
> > 
> > Who does the BAR assignment is pretty much orthogonal to the
> > problem at hand: If the region reserved for MMIO is too small,
> > no-one will be able to fit a device in there. Plus, what is being
> > reported as root bus resource doesn't have to have a
> > connection to the ranges usable for MMIO at all, at least if I
> > assume that the (Dell) system I'm looking at right now isn't
> > completely screwed:
> > 
> > pci_bus 0000:00: root bus resource [bus 00-ff]
> > pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x00000000-0x3fffffffff]
> > 
> > (i.e. it simply reports the full usable 38-bit-wide address space)
> > 
> > Looking at another (Intel) one, there is no mention of regions
> > above the 4G boundary at all:
> > 
> > pci_bus 0000:00: root bus resource [bus 00-3d]
> > pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> > pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> > pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000cbfff]
> > pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfedfffff]
> > pci_bus 0000:00: root bus resource [mem 0xd0000000-0xf7ffffff]
> > 
> > Not sure how the OS would know it is safe to assign BARs above
> > 4GB here.
> > 
> > In any event, what you need is an equivalent of the frequently
> > seen BIOS option controlling the size of the space to be reserved
> > for MMIO (often allowing it to be 1, 2, or 3 GB), i.e. an alternative
> > (or extension) to the dynamic lowering of pci_mem_start in
> > hvmloader.
> > 
> 
> I agree with Jan. By using xl pci-attach you are effectively hotplugging
> a PCI device (in the bare metal case). The only way this will work
> reliably is if you reserve some MMIO space for the device you are about
> to attach. You cannot just use space above the 4G boundary because the
> PCI device may have 32-bit-only BARs and thus its MMIO cannot be
> placed at addresses above 4G.

Is it safe to split the BARs across different locations? Say, stash
all 64-bit BARs above 4GB and put all 32-bit ones under 4GB?

Looking at hvmloader, it appears to do exactly that once it has
exhausted mmio_total.
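
Roughly the logic I am thinking of (a minimal sketch, not the actual
hvmloader code; the hole boundaries and place_bar() are made up for the
illustration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative low MMIO hole, roughly matching the guest above. */
#define LOW_MMIO_START 0xf0000000ULL
#define LOW_MMIO_END   0xfc000000ULL
#define HIGH_MMIO_BASE (1ULL << 32)   /* first candidate address above 4GB */

/*
 * Place a BAR of 'size' bytes: try the low hole first, and spill
 * 64-bit-capable BARs above 4GB once the low hole is exhausted.
 * A 32-bit BAR that does not fit below 4GB cannot be placed at all.
 * Every BAR must be naturally aligned (aligned to its own size).
 */
static bool place_bar(uint64_t size, bool is_64bit,
                      uint64_t *low_cursor, uint64_t *high_cursor,
                      uint64_t *addr_out)
{
    uint64_t base = (*low_cursor + size - 1) & ~(size - 1);

    if (base + size <= LOW_MMIO_END) {
        *addr_out = base;
        *low_cursor = base + size;
        return true;
    }

    if (!is_64bit)
        return false;              /* would have to fail the attach */

    base = (*high_cursor + size - 1) & ~(size - 1);
    *addr_out = base;
    *high_cursor = base + size;
    return true;
}

int main(void)
{
    uint64_t low = LOW_MMIO_START, high = HIGH_MMIO_BASE, addr;

    /* A 64MB 64-bit BAR still fits in the 192MB low hole... */
    if (place_bar(64ULL << 20, true, &low, &high, &addr))
        printf("64MB BAR  at %#llx\n", (unsigned long long)addr);
    /* ...but a 128MB one no longer does, so it gets relocated above 4GB. */
    if (place_bar(128ULL << 20, true, &low, &high, &addr))
        printf("128MB BAR at %#llx\n", (unsigned long long)addr);
    return 0;
}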

> 
> The problem you have is that you cannot predict how much MMIO space to
> reserve because you don't know in advance how many PCI devices you are
> going to hotplug and how much MMIO space is required per device.

Perhaps, following Jan's advice, we could allow "bigger" MMIO ranges to
be predefined: 4GB, 8GB, 16GB, etc. The larger ranges would cover space
under 4GB (so at most, say, 3GB), while the rest spills above 4GB, past
the 'maxmem' range?
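
To make the arithmetic concrete, something like the sketch below (the
3GB cap on the low hole, the 1GB rounding and split_mmio() are
assumptions for the example, not existing code):

#include <stdint.h>
#include <stdio.h>

#define GB(x) ((uint64_t)(x) << 30)
#define LOW_MMIO_CAP GB(3)   /* assumed maximum hole below 4GB */

/*
 * Split a requested MMIO reservation into a low part (below 4GB, capped
 * so the guest keeps some low RAM) and a high part placed just past the
 * top of guest RAM, rounded up to a 1GB boundary.
 */
static void split_mmio(uint64_t requested, uint64_t ram_end,
                       uint64_t *low_size, uint64_t *high_start,
                       uint64_t *high_size)
{
    *low_size   = requested < LOW_MMIO_CAP ? requested : LOW_MMIO_CAP;
    *high_size  = requested - *low_size;
    *high_start = (ram_end + GB(1) - 1) & ~(GB(1) - 1);
}

int main(void)
{
    uint64_t low, high_start, high;

    /* An 8GB reservation for a guest whose RAM ends at the 8GB mark:
     * 3GB stays below 4GB, the remaining 5GB lands at 0x200000000. */
    split_mmio(GB(8), GB(8), &low, &high_start, &high);
    printf("low hole: %lluMB, high range: %lluMB starting at %#llx\n",
           (unsigned long long)(low >> 20),
           (unsigned long long)(high >> 20),
           (unsigned long long)high_start);
    return 0;
}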

> 
> As for the CRS regions: These typically describe the BIOS-set limits in
> the hardware configuration for the MMIO hole itself. On single-socket
> systems anything which isn't RAM or another predefined region decodes to
> MMIO. This is probably why Jan's Dell system has a CRS region which
> covers the entire address space.
> 
> On multi-socket systems the CRS is very important because the chipset is
> configured to only decode certain regions to the PCI Express ports; if
> you use an address outside of those regions then accesses to that address
> will go "nowhere" and the machine will crash.
> 
> Typically you will see a separate high MMIO CRS region if 64-bit BAR
> support is enabled in the BIOS.
> 
> 
> To do HVM PCI hotplug properly we need to reserve MMIO space below 4G
> and emulate a hotplug-capable PCI-PCI bridge device. The bridge
> device will know the maximum size of the MMIO behind it (as allocated at
> boot time) and so we can calculate whether the device we are hotplugging
> can fit. If it doesn't fit then we fail the hotplug; otherwise we allow
> it and the OS will correctly allocate the BARs behind the bridge.

I think that can be done right now for the MMIO and _CRS in hvmloader
and libxc/libxl. I wonder if that can all be done without introducing a
PCI-PCI bridge device?
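
Either way, the fit check itself looks simple enough whether or not a
bridge is emulated. A rough sketch of what I have in mind (device_fits()
and the window numbers are invented for the example; the caller is
assumed to sort the BARs largest-first):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Check whether a device's memory BARs fit in the remaining window
 * reserved for hotplug.  Each BAR must be naturally aligned, so placing
 * the largest BARs first wastes no space; 'sizes' is sorted largest-first.
 */
static bool device_fits(const uint64_t *sizes, int nr_bars,
                        uint64_t window_start, uint64_t window_end)
{
    uint64_t cursor = window_start;

    for (int i = 0; i < nr_bars; i++) {
        uint64_t base = (cursor + sizes[i] - 1) & ~(sizes[i] - 1);

        if (base + sizes[i] > window_end)
            return false;          /* reject the pci-attach up front */
        cursor = base + sizes[i];
    }
    return true;
}

int main(void)
{
    /* A GPU-like device: 256MB, 32MB and 16MB memory BARs. */
    uint64_t bars[] = { 256ULL << 20, 32ULL << 20, 16ULL << 20 };
    /* A 512MB window reserved at 0xe0000000 (illustrative numbers). */
    bool ok = device_fits(bars, 3, 0xe0000000ULL, 0x100000000ULL);

    printf("hotplug %s\n", ok ? "accepted" : "rejected");
    return 0;
}

That way we could fail the pci-attach cleanly instead of letting the
guest kernel discover that it cannot place the BARs.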

> 
> BTW, calculating the required MMIO for multi-BAR PCI devices is not
> easy because all the BARs need to be aligned to their size (naturally
> aligned).

Ouch. So two 512MB BARs and a 1GB BAR can't just be placed next to each
other but would need:

512MB BAR <-- 512MB gap --> | 1GB BAR | 512MB BAR

Or just put the 1GB one first:

1GB BAR | 512MB BAR | 512MB BAR

?
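
Working through the numbers as a quick sanity check (a rough sketch,
assuming the reserved window itself starts on a 1GB boundary;
span_needed() is made up for the illustration), ordering the BARs
largest-first avoids the padding entirely:

#include <stdint.h>
#include <stdio.h>

/* Total window needed to place BARs in the given order, each naturally
 * aligned, starting from offset 0 (i.e. a 1GB-aligned window base). */
static uint64_t span_needed(const uint64_t *sizes, int n)
{
    uint64_t cursor = 0;

    for (int i = 0; i < n; i++)
        cursor = ((cursor + sizes[i] - 1) & ~(sizes[i] - 1)) + sizes[i];
    return cursor;
}

int main(void)
{
    uint64_t small_first[] = { 512ULL << 20, 1024ULL << 20, 512ULL << 20 };
    uint64_t large_first[] = { 1024ULL << 20, 512ULL << 20, 512ULL << 20 };

    /* Prints 2560MB for 512MB/1GB/512MB versus 2048MB with the 1GB first. */
    printf("512MB,1GB,512MB -> %lluMB\n",
           (unsigned long long)(span_needed(small_first, 3) >> 20));
    printf("1GB,512MB,512MB -> %lluMB\n",
           (unsigned long long)(span_needed(large_first, 3) >> 20));
    return 0;
}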

> 
> Malcolm
> 
> 
> > Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

