Re: [Xen-devel] PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
On Wed, Nov 12, 2014 at 10:01:28AM +0000, Malcolm Crossley wrote:
> On 12/11/14 09:24, Jan Beulich wrote:
> >>>> On 12.11.14 at 02:37, <konrad.wilk@xxxxxxxxxx> wrote:
> >> When we PCI-insert a device, the BARs are not set at all - and hence
> >> the Linux kernel is the one that tries to set the BARs. The reason it
> >> cannot fit the device in the MMIO region is that the _CRS only
> >> advertises certain ranges (even though the MMIO region can cover 2GB).
> >> See:
> >>
> >> Without any devices (and me doing the PCI insertion after that):
> >> # dmesg | grep "bus resource"
> >> [ 0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [ 0.366000] pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7]
> >> [ 0.366000] pci_bus 0000:00: root bus resource [io 0x0d00-0xffff]
> >> [ 0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [ 0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]
> >>
> >> With the device (my GPU card) inserted so that hvmloader can enumerate it:
> >> dmesg | grep 'resource'
> >> [ 0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [ 0.459006] pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7]
> >> [ 0.462006] pci_bus 0000:00: root bus resource [io 0x0d00-0xffff]
> >> [ 0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [ 0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
> >>
> >> I chatted with Bjorn and Rafael on IRC about how PCI insertion works
> >> on bare metal, and it sounds like Thunderbolt device insertion is an
> >> interesting case. The SMM sets the BAR regions to fit within the MMIO
> >> window (which is advertised by the _CRS) and then pokes the OS to
> >> enumerate the BARs. The OS is free to use what the firmware has set or
> >> to renumber it. The end result is that, since the SMM 'fits' the BARs
> >> inside the pre-set _CRS window, it all works. We do not do that.
> >
> > Who does the BAR assignment is pretty much orthogonal to the problem
> > at hand: if the region reserved for MMIO is too small, no one will be
> > able to fit a device in there. Plus, what is being reported as a root
> > bus resource doesn't have to have any connection to the ranges usable
> > for MMIO at all, at least if I assume that the (Dell) system I'm
> > looking at right now isn't completely screwed:
> >
> > pci_bus 0000:00: root bus resource [bus 00-ff]
> > pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x00000000-0x3fffffffff]
> >
> > (i.e. it simply reports the full usable 38-bit wide address space)
> >
> > Looking at another (Intel) one, there is no mention of regions above
> > the 4G boundary at all:
> >
> > pci_bus 0000:00: root bus resource [bus 00-3d]
> > pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7]
> > pci_bus 0000:00: root bus resource [io 0x0d00-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> > pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000cbfff]
> > pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfedfffff]
> > pci_bus 0000:00: root bus resource [mem 0xd0000000-0xf7ffffff]
> >
> > Not sure how the OS would know it is safe to assign BARs above 4Gb here.
> >
> > In any event, what you need is an equivalent of the frequently seen
> > BIOS option controlling the size of the space to be reserved for MMIO
> > (often allowing it to be 1, 2, or 3 Gb), i.e. an alternative (or
> > extension) to the dynamic lowering of pci_mem_start in hvmloader.
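For illustration only, here is a minimal, self-contained C sketch of the kind of knob Jan describes: a fixed-size MMIO hole of 1, 2 or 3 GiB below the 4GiB boundary, instead of relying solely on hvmloader lowering pci_mem_start dynamically. This is not hvmloader or libxl code; "hole_gb" is a made-up guest-config parameter, and all names here are invented for the example.

/*
 * Illustrative sketch -- not hvmloader or libxl code.  A fixed-size MMIO
 * hole below 4GiB pins pci_mem_start at a known address, like the BIOS
 * option Jan refers to.  "hole_gb" is a hypothetical configuration knob.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define GB(x)    ((uint64_t)(x) << 30)
#define ADDR_4GB GB(4)

static uint64_t pci_mem_start_for_hole(unsigned int hole_gb)
{
    /* Clamp to the 1-3 GiB range typically offered by such BIOS options. */
    if (hole_gb < 1)
        hole_gb = 1;
    else if (hole_gb > 3)
        hole_gb = 3;
    return ADDR_4GB - GB(hole_gb);
}

int main(void)
{
    unsigned int gb;

    for (gb = 1; gb <= 3; gb++)
        printf("%u GiB hole below 4GiB -> pci_mem_start = 0x%" PRIx64 "\n",
               gb, pci_mem_start_for_hole(gb));
    return 0;
}

Such a knob would presumably be set by the toolstack at domain build time and handed to hvmloader, which could then reserve the whole range for MMIO regardless of how many BARs are present at boot.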
> >
>
> I agree with Jan. By using xl pci-attach you are effectively hotplugging
> a PCI device (in the bare metal case). The only way this will work
> reliably is if you reserve some MMIO space for the device you are about
> to attach. You cannot just use space above the 4G boundary, because the
> PCI device may have 32-bit-only BARs and thus its MMIO cannot be placed
> at addresses above 4G.

Is it safe to split the BARs so they end up in different locations? Say,
stash all 64-bit BARs above 4GB and put all 32-bit ones under 4GB?
Looking at hvmloader, it looks to be doing that once it has exhausted
mmio_total.

> The problem you have is that you cannot predict how much MMIO space to
> reserve, because you don't know in advance how many PCI devices you are
> going to hotplug and how much MMIO space is required per device.

Perhaps, following Jan's advice, allow "bigger" MMIO ranges to be
predefined: 4GB, 8GB, 16GB, etc.? The larger ranges would cover space
under 4GB (so, say, at most 3GB) while the rest spills past the 4GB
boundary, beyond the 'maxmem' range.

> As for the _CRS regions: these typically describe the BIOS-set limits in
> the hardware configuration for the MMIO hole itself. On single-socket
> systems anything which isn't RAM or another predefined region decodes to
> MMIO. This is probably why Jan's Dell system has a _CRS region which
> covers the entire address space.
>
> On multi-socket systems the _CRS is very important, because the chipset
> is configured to decode only certain regions to the PCI Express ports;
> if you use an address outside of those regions, then accesses to that
> address will go "nowhere" and the machine will crash.
>
> Typically you will see a separate high MMIO _CRS region if 64-bit BAR
> support is enabled in the BIOS.
>
> To do HVM PCI hotplug properly we need to reserve MMIO space below 4G
> and emulate a PCI-hotplug-capable PCI-PCI bridge device. The bridge
> device will know the maximum size of the MMIO behind it (as allocated at
> boot time), so we can calculate whether the device we are hotplugging
> can fit. If it doesn't fit, then we fail the hotplug; otherwise we allow
> it and the OS will correctly allocate the BARs behind the bridge.

I think that can be done right now for the MMIO and _CRS in hvmloader
and libxc/libxl. I wonder if that can all be done without having a
PCI-PCI bridge device introduced?

> BTW, calculating the required MMIO for multi-BAR PCI devices is not
> easy, because all the BARs need to be aligned to their size (naturally
> aligned).

Ouch. So two 512MB BARs and a 1GB BAR can't simply be placed next to each
other; you could end up needing: 512MB BAR | <-- 512MB of space --> | 1GB
BAR. Or just put the 1GB one first: 1GB BAR | 512MB BAR?

> Malcolm
>
> > Jan
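To illustrate Malcolm's point about natural alignment and why the ordering Konrad asks about matters, here is a small, self-contained C sketch. It is not hvmloader's allocator and the names are invented: a naive bump allocator that places the example BARs in the order given, rounding each BAR's address up to a multiple of its own size. Laying them out as 512MB, 1GB, 512MB leaves a 512MB alignment gap; placing the 1GB BAR first packs them with no gap.

/*
 * Illustrative only -- not hvmloader's BAR allocator.  A naive "bump"
 * allocator that places BARs in the order given, rounding each BAR's
 * address up to a multiple of its size (PCI BARs are naturally aligned).
 * It shows why sorting BARs largest-first avoids alignment gaps.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MB(x) ((uint64_t)(x) << 20)

static uint64_t mmio_span(const uint64_t *bar_size, unsigned int nr,
                          uint64_t base)
{
    uint64_t addr = base;
    unsigned int i;

    for (i = 0; i < nr; i++) {
        addr = (addr + bar_size[i] - 1) & ~(bar_size[i] - 1); /* align up */
        addr += bar_size[i];
    }
    return addr - base;   /* total space consumed, including any gaps */
}

int main(void)
{
    /* The example from the thread: two 512MB BARs and one 1GB BAR. */
    uint64_t unsorted[] = { MB(512), MB(1024), MB(512) };
    uint64_t sorted[]   = { MB(1024), MB(512), MB(512) };

    printf("512M,1G,512M -> %" PRIu64 " MB (512MB gap)\n",
           mmio_span(unsorted, 3, 0) >> 20);
    printf("1G,512M,512M -> %" PRIu64 " MB (no gap)\n",
           mmio_span(sorted, 3, 0) >> 20);
    return 0;
}

A real allocator could of course backfill the gap with a smaller BAR (first-fit rather than bump allocation), but the worst-case reservation for a hotplug window still has to assume size-aligned placement, which is what makes predicting the required MMIO space hard.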