[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Is: pci=assign-busses blows up Xen 4.4 Was:Re: [PATCH] x86/msi: Validate the guest-identified PCI devices in pci_prepare_msix()



> And sure enough if I boot Xen without 'pci=assign-busses' it works just
> fine.
> 
> Ugh.
> 
> I wonder how Xen 4.3 would actually do the PCI passthrough - it booted with
> the 'assign-busses' - but I hadn't tried to do PCI passthrough of the
> PF device (the I210).

I did not work very well. Especially with PCI devices.

> 
> If do pass in '05:00.0' (new bus number) I wonder if it will use IOMMU context
> with whatever '05:00.0' was _before_ the bus re-assigment  aka:
> 
> 05:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01) (prog-if 
> 01 [Subtractive decode])
>         Flags: bus master, fast devsel, latency 0
>         Bus: primary=05, secondary=06, subordinate=07, sec-latency=32
>         Memory behind bridge: f1500000-f16fffff
>         Capabilities: [60] Subsystem: Super Micro Computer Inc Device 0805
>         Capabilities: [a0] Power Management version 3
> 
> Which I think would confuse Xen as this is clearly labeled as bridge
> not a PCI device.
> 
> 
> The reason for me using 'pci=assign-busses' is that it looks to be
> the only option to use SR-IOV.

.. on this particular motherboard.

> 
> Which I suppose makes sense as it tries to create VFs right after its own bus 
> id:
> 
> 
>   +-01.1-[02-03]--+-[0000:03]-+-10.0  Intel Corporation 82576 Virtual Function
>            |               |           +-10.1  Intel Corporation 82576 
> Virtual Function
>            |               |           +-10.2  Intel Corporation 82576 
> Virtual Function
>            |               |           +-10.3  Intel Corporation 82576 
> Virtual Function
>            |               |           +-10.4  Intel Corporation 82576 
> Virtual Function
>            |               |           +-10.5  Intel Corporation 82576 
> Virtual Function
>            |               |           +-10.6  Intel Corporation 82576 
> Virtual Function
>            |               |           +-10.7  Intel Corporation 82576 
> Virtual Function
>            |               |           +-11.0  Intel Corporation 82576 
> Virtual Function
>            |               |           +-11.1  Intel Corporation 82576 
> Virtual Function
>            |               |           +-11.2  Intel Corporation 82576 
> Virtual Function
>            |               |           +-11.3  Intel Corporation 82576 
> Virtual Function
>            |               |           +-11.4  Intel Corporation 82576 
> Virtual Function
>            |               |           \-11.5  Intel Corporation 82576 
> Virtual Function
>            |               \-[0000:02]-+-00.0  Intel Corporation 82576 
> Gigabit Network Connection
>            |                           \-00.1  Intel Corporation 82576 
> Gigabit Network Connection
> 
> 
> But why does it have to have the bus _right_ after its own? Can't it
> use one at the end of the its bus-space? The bus is after it is occupied
> by another card (if I boot without 'pci=assign-busses').

Because you need to program the bridge to accept the bus requests for the
PF and VF bus numbers. And hence the need to program it in the bridge
to span more bus numbers.
> 
> I do recall using this particular SR-IOV card on a different hardware
> a year ago or so. And it did work. I think that might be because
> there were no PCI cards _after_ the SR-IOV card.

It was because it was a motherboard with an SRIOV aware BIOS. And a
server one while this is more geared towards .. budget-servers?

Anyhow, what I discovered was that the patch attached does allow me to
boot with Xen. It is not pretty.

But I was thinking to fix in the hypervisor and realized there are three
ways of fixing it:

 1). Do the hypercall to delete/add devices and let initial domain figure
     this out. (which the Linux attached patch does).

 2). Be more aware of the bus2bridge topology when removing a PCI bridge or
     device. I had one bug where we ended up with this bus2bridge structure:

      6 -> 06:00.0
      7 -> 06:00.0
      8 -> 07:01.0

Which meant that for devices on bus 8, 7 and 6 we would never find the 
upstream bridge. The reason is that 6 -> 06 points to itself so
find_upstream_bridge ends up looping 255 times around and returns -1.

Oddly enough the 06:00.0 device does get removed and then added (Via
the hypercalls) and the reason for the bus2bridge having stale information
is that it copies the data but does not invalidate that.

I am not entirely sure I undertand why we do that. In 'free_pdev' we do
this:

        for ( ; sec_bus <= sub_bus; sec_bus++ )
                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];

and then:
 list_del(&pdev->alldevs_list);
 xfree(pdev->msix);
 xfree(pdev);

so if the device that is being deleted is the bridge - we point the secondary
and subordinate to the deleted device. But if the deleted device is the
upstream bridge we end up leaving a stale bus2bridge context.

That is OK normally as 'alloc_pdev' would over-write it (if the secondary
and subordinate did not change). But in 'assign-busses' case they change so
we are left with an 'stale' one.

This means when the same device is added (but with a new bus value) we
end fixing up the secondary to subordinate busses to point to us (06).
But '06' which used to be a secondary bus, still retains the old value.

One way to fix this is to detect it and correct:

          spin_lock(&pseg->bus2bridge_lock);
            for ( ; sec_bus <= sub_bus; sec_bus++ )
                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
            /* Check for infinite recursion where bus2bridge would point to
             * itself, aka:
             *    6-> 06:00.0
             *    7-> 06:00.0
             * and we are removing 06:00.0, but may have 07:00.0 devices.
             * We invalidate the 6 as the upstream bridge is effectively
             * removed. We cannot remove the 07 as the 06:00.0 might be
             * added right back in. */
            if ( pseg->bus2bridge[pdev->bus].map )
            {    
                u8 bus, devfn;

                bus = pdev->bus;
                if ( __find_upstream_bridge( pseg->nr, &bus, &devfn, &sec_bus, 
1 ) < 0 )
                {
                    /* Recursion detected! Invalidate ourselves. */
                    printk("%04x:%02x:%02x.%u recursed clearing it",
                       pseg->nr, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
                    if (pdev->bus != bus || devfn != pdev->devfn)
                        printk(" %02x:%02x supplied", pdev->bus, pdev->devfn);
                                PCI_SLOT(pseg->bus2bridge[i].devfn),
                                PCI_FUNC(pseg->bus2bridge[i].devfn));
                    pseg->bus2bridge[pdev->bus].map = 0;
                }
            }
            spin_unlock(&pseg->bus2bridge_lock);

But I am not sure if that is the right way of doing it. Anyhow there
was another assumption made in which 'assign-busses' crippled Xen
(see second attachment).

3). Trap on PCI_SECONDARY_BUS and PCI_SUBORDINATE_BUS writes and
    fixup the structures.

    I hadn't attempted that but that could also be done. That way Xen
    is aware of those changes and can update its PCI structures.

Thoughts?

Attachment: 0001-DRAFT-xen-pci-Re-add-all-PCI-devices-if-pci-assign-b.patch
Description: Text document

Attachment: 0001-pci-Don-t-assume-the-removed-device-is-a-bridge.patch
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.