[Xen-users] The saga of Heterodyne's PCI Passthrough

Whrrr, xen-users!  Once more someone having trouble with the elusive, gnarly,
extensive device passthrough is here, and this time, that someone is I.


--- The background ---

Here's what I'm trying to do: I have a machine with an Asus P8B WS mainboard
and an Intel Xeon E3-1230 processor which I'd like to have running three Xen
domains: the dom0 Heterodyne, and the two domUs Quail and Furn.  (The software
identifiers are all lowercase.)  All domains are running Debian GNU/Linux with
at least Linux 3.0.0, though the userspace configuration varies considerably.
Furn is currently a PV domain and Quail an HVM domain, though I might change
them to both be HVM later on.  I'm planning to partition the CPU cores between
the domains by pinning the vcpus, in case that makes any difference.
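
A pinning plan of this kind could be written down as below.  This is a minimal sketch, assuming the xm toolstack's "xm vcpu-pin <domain> <vcpu> <cpus>" syntax.  print_pin_commands is a hypothetical helper, and the CPU numbers in the example are placeholders rather than a decided layout.  It only prints the commands so they can be reviewed before anything is applied.

```sh
#!/bin/sh
# print_pin_commands DOMAIN CPU...
# Prints one "xm vcpu-pin" command per vcpu, pinning vcpu N to the Nth
# physical CPU listed.  Nothing is executed; pipe the output to sh to apply.
print_pin_commands() {
    domain="$1"
    shift
    vcpu=0
    for cpu in "$@"; do
        echo "xm vcpu-pin $domain $vcpu $cpu"
        vcpu=$((vcpu + 1))
    done
}

# Example with a hypothetical layout (quail's two vcpus on CPUs 2 and 3):
#   print_pin_commands quail 2 3
```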

This machine has a number of PCI devices, both on the mainboard and offboard,
which I would like to divide up between the domains.  I'm aiming for the
following allocation based on dom0 lspci:

  Heterodyne:
    00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 05)
    05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
    06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
    07:00.0 USB Controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host
    [and everything else]
  Quail:
    00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
    01:00.0 VGA compatible controller: ATI Technologies Inc RV710 [Radeon HD
    01:00.1 Audio device: ATI Technologies Inc RV710/730
    08:00.0 Multimedia audio controller: C-Media Electronics Inc CMI8788 [Oxygen HD Audio]
    08:03.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0)
  Furn (optional):
    00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
    00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 05)

I do _not_ necessarily want to attach the VGA as one that Quail's HVM BIOS can
find; I'm happy with it being registered as a secondary graphics card with the
primary one being the emulated Cirrus VGA accessible over VNC, then having the
guest OS load drivers for the device and reinitialize it.  This would seem to
sidestep a lot of the work listed in the XenVGAPassthrough page on the wiki,
since that refers to making the VGA work for the BIOS.

I'm currently running Xen 4.1.1 (Debian 4.1.1-2).  The mainboard has VT-d
IOMMU support, and it is enabled in the BIOS (which is currently version 0605
from Asus, but I'm planning to upgrade it to 0704 since this supposedly fixes
the bogus ECC issue from the current version).  The Xen and dom0 Linux in the
plainest configuration are loaded with:

  /boot/xen-4.1-amd64.gz placeholder iommu=1 console=com1,vga com1=38400,8n1 dom0_mem=512M dom0_max_vcpus=2 dom0_vcpus_pin
  /boot/vmlinuz-3.0.0-1-amd64 placeholder root=UUID=<...> ro quiet console=tty0 console=hvc0 mem=512M

--- The saga ---

I'm going to reproduce these steps as I'm writing this, so it should be a
fairly accurate accounting as far as results, though it may not reflect
exactly what I went through earlier.  My first priorities are to get the
Radeon, one USB controller, and the PCI audio device attached to Quail, in
that order; once I can manage that, it seems likely that Furn's configuration
will fall into place.  All of the below is done with Furn absent as a Xen
domain.

Auxiliary files that are too large to include inline are available from

  ATTEMPT #1: "Hot Cross Plugs"

To avoid interference, I preëmptively blacklist the drivers radeon, radeonfb,
snd_hda_intel, and snd_virtuoso in the dom0 Linux configuration, using
/etc/modprobe.d/blacklist.conf.  (snd_hda_intel would also normally bind to
the secondary function of the Radeon HD 4350.)  The dom0 is running Debian
testing (wheezy), and the domU is running unstable (sid).
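
For reference, the blacklist entries could be generated as below.  This is a minimal sketch: blacklist_lines is a hypothetical helper, and it assumes the usual modprobe.d syntax of one "blacklist <module>" directive per line.  Note that such a directive only stops the module from being loaded automatically through its aliases; an explicit modprobe still loads it.

```sh
#!/bin/sh
# blacklist_lines MODULE...
# Prints one modprobe.d "blacklist" directive per module name given.
blacklist_lines() {
    for module in "$@"; do
        echo "blacklist $module"
    done
}

# Example: append the directives to the file named in the text (as root):
#   blacklist_lines radeon radeonfb snd_hda_intel snd_virtuoso \
#       >> /etc/modprobe.d/blacklist.conf
```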

Boot.  lsmod reveals that none of the aforementioned modules have been loaded;
there is still a VGA console visible through the Radeon.  dmesg output is in
01-dom0-dmesg.txt and 01-xen-dmesg.txt.  Note that at boot dom0 complains
about PCI address space overlapping video ROM.  The IOMMU is in fact
initialized according to Xen.

Boot domU, with VNC open and configuration from 01-quail-cfg.txt.  VNC has a
getty running on the emulated Cirrus VGA, and xm console attaches to the HVM
serial device properly.  Linux in the domU is loaded as:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1

Full dmesg output is in 01-quail-dmesg-boot.txt.

Run  stubify 0000:01:00.0  and  stubify 0000:01:00.1  on the dom0; stubify
is a shell script that rebinds a PCI device to the pci-stub driver.
xen-pcifront and xen-pciback are nowhere to be found on either the dom0 or
domU kernel, that I can detect---I suppose the Debian configuration doesn't
come with either of them.
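
The stubify script itself is not reproduced in this message.  A helper of that kind might look roughly like the following sketch, which assumes the standard sysfs interface (the new_id, unbind, and bind files under /sys/bus/pci).  The SYSFS_PCI override is an addition for dry runs against a scratch directory and is not something the real script is known to have.

```sh
#!/bin/sh
# Hypothetical reconstruction of a stubify-style helper; the real script is
# not shown in this message.  SYSFS_PCI may be overridden for dry runs.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

stubify() {
    dev="$1"                              # full address, e.g. 0000:00:1d.0
    devdir="$SYSFS_PCI/devices/$dev"
    if [ ! -d "$devdir" ]; then
        echo "stubify: no such PCI device: $dev" >&2
        return 1
    fi

    # sysfs reports the IDs as 0x8086 and 0x1c26; new_id expects them as
    # "8086 1c26", without the 0x prefixes.
    vendor="$(cat "$devdir/vendor")"
    device="$(cat "$devdir/device")"
    echo "${vendor#0x} ${device#0x}" > "$SYSFS_PCI/drivers/pci-stub/new_id"

    # Detach whatever driver currently owns the device, if any.
    if [ -e "$devdir/driver/unbind" ]; then
        echo "$dev" > "$devdir/driver/unbind"
    fi

    # Hand the device to pci-stub.  On a real system this write can report
    # an error if pci-stub already claimed the device after the new_id step.
    echo "$dev" > "$SYSFS_PCI/drivers/pci-stub/bind"
}
```

It would be run as root with a full address, as in  stubify 0000:01:00.0 .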

  root@heterodyne:~# xm pci-list-assignable-devices
  root@heterodyne:~# xm pci-attach quail 0000:01:00.0
  root@heterodyne:~# xm console quail

  [:: ... log in as root ... ::]  

  root@quail:~# modprobe pci-hotplug
  root@quail:~# echo 1 >/sys/bus/pci/rescan
  root@quail:~# [  450.499603] radeon 0000:00:04.0: Fatal error during GPU init
  [  450.518377] [TTM] Trying to take down uninitialized memory manager type 1

Quail's next dmesg entries (01-quail-dmesg-hotadd.txt) suggest that it can't
assign PCI resources for the video RAM, and the Radeon driver falls over with
a zero access window.  qemu-dm spits out messages in 01-qemu-log-hotadd.txt.

  ATTEMPT #2: "A Bridge over Stormy SeaBIOS"

My guess is that the HVM emulated PCI bridge isn't leaving enough window to
allocate MMIO space for the video RAM.  So I tentatively append pci=nocrs to
Quail's Linux command line to see whether that'll convince it to use better
window areas.  The full domU command line is now:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1 

dmesg output is in 02-quail-dmesg-boot.txt.  I do the same xm pci-attach and
rescan as before.  This time, Quail hangs hard for a while, then does this:

  [  172.050122] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:1686]
  [  172.053355] Stack:
  [  172.054119] Call Trace:
  [  172.054119] Code: 88 03 00 00 b8 ea ff ff ff 78 22 3b b7 70 03 00 00 77 1a 
c1 e6 03 48 81 e2 00 f0 ff ff 48 63 f6 48 83 ca 67 48 8d 34 31 48 89 16

 xm pci-detach quail 0000:01:00.0  yields "Error: Timed out waiting for
device model action", so I destroy the domain with  xm destroy quail .
02-qemu-log-hotadd.txt suggests that Quail tried to map the graphics card into
a high (>= 4 GiB) MMIO region and that qemu-dm's PCI emulation blew chunks all
over the memory map as a result.  /var/log/messages from Quail contains the
fragment in 02-quail-messages-hotadd.txt, which seems to confirm this.
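
When reading BAR assignments out of dmesg, the "high MMIO" condition comes down to a simple address check, sketched below.  fits_below_4g is a hypothetical helper.  It only does the arithmetic; it says nothing about whether this qemu-dm can actually handle a region above 4 GiB, which is the open question here.

```sh
#!/bin/sh
# fits_below_4g START END
# Succeeds if the inclusive range [START, END] lies entirely below 4 GiB,
# i.e. it is addressable through a 32-bit BAR.  Hex (0x...) is accepted.
fits_below_4g() {
    start=$(($1))
    end=$(($2))
    [ $(( start <= end && end < 0x100000000 )) -eq 1 ]
}

# Example, using a range of the form Linux prints in dmesg:
#   fits_below_4g 0xdc030000 0xdc030fff && echo "32-bit reachable"
```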

  ATTEMPT #3: "Boom!  Address Space Klotski"

I change Quail's configuration to set memory = '3500', leading to
03-quail-cfg.txt.  pci=nocrs is still in effect.  After boot and hot-add,
03-quail-dmesg.txt results; it seems to have mapped the GPU correctly and
initialized it.  Indeed at this point the DVI output of the Radeon card is
correctly showing Quail's Linux framebuffer console.

However, dom0 grimaces and spits out the following on the serial console:

  (XEN) vmsi.c:122:d32767 Unsupported delivery mode 3

Oops.  qemu-dm has meanwhile generated 03-qemu-log.txt.

I create quail:/etc/X11/xorg.conf with the following text:

  Section "Device"
    Identifier "primary screen"
    Driver "radeon"
    BusID "PCI:0:4:0"
  EndSection

to force it to use the Radeon card.  (I haven't forced pci-attach to use a
specific slot yet, so this isn't necessarily reliable between boots, but that
should be easy enough to fix ex post facto.)  Starting X as root yields an X
display on the Radeon!  glxinfo yields 03-glxinfo.txt, suggesting that the
Radeon DRI is working.
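
Since the virtual slot can change between boots, the BusID may need to be recomputed from whatever address the domU's lspci reports.  The sketch below assumes the usual convention that lspci prints bus, device, and function in hexadecimal while xorg.conf's BusID takes them in decimal; pci_to_busid is a hypothetical helper.

```sh
#!/bin/sh
# pci_to_busid ADDRESS
# Converts an lspci-style address ([domain:]bus:device.function, in hex)
# into the decimal "PCI:bus:device:function" form used by BusID.
pci_to_busid() {
    addr="${1#????:}"          # drop a leading 4-digit domain, if present
    bus="${addr%%:*}"
    rest="${addr#*:}"
    dev="${rest%%.*}"
    fn="${rest#*.}"
    echo "PCI:$((0x$bus)):$((0x$dev)):$((0x$fn))"
}

# Example:
#   pci_to_busid 0000:00:04.0
```

The difference only shows up for bus or device numbers of 0x0a and above, where the hexadecimal and decimal forms diverge.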

Running glxgears claims that it is synchronized to the vertical retrace, but
no output appears.  Woops.  2D output to X mostly works, but is horribly slow
and full of tearing artifacts.

This suggests to me that the vsync is broken.  The vmsi error above suggests
that this is because an MSI from the Radeon card is not being delivered.

Right then!

  ATTEMPT #4: "Friendly MSI have been destroy!"

I alter Quail's Linux boot line to include pci=nomsi (combining it with the
existing option).  The new full Linux command line is:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro  quiet console=ttyS0,38400n1 

dmesg after domU boot yields 04-quail-dmesg-boot.txt.  xm pci-attach, then
rescan PCI device in domU.  The Radeon is initialized successfully.  I start X
as root.  glxinfo returns the exact same results.  Running glxgears displays
some rotating gears:

  328 frames in 5.0 seconds = 65.401 FPS

This is indeed correct!

But then I attempt to attach the USB controller.  I run  stubify 0000:00:1d.0
 in the dom0, and get a correct-looking:

  [ 3068.112581] ehci_hcd 0000:00:1d.0: remove, state 4
  [ 3068.112588] usb usb2: USB disconnect, device number 1
  [ 3068.112590] usb 2-1: USB disconnect, device number 2
  [ 3068.131201] ehci_hcd 0000:00:1d.0: USB bus 2 deregistered
  [ 3068.131242] ehci_hcd 0000:00:1d.0: PCI INT A disabled
  [ 3068.131341] pci-stub 0000:00:1d.0: claimed by stub

And the moment of truth:

  root@heterodyne:~# xm pci-list-assignable-devices
  root@heterodyne:~# xm pci-attach quail 0000:00:1d.0

But the serial console coughs up:

  (XEN) physdev.c:182: dom7: no free pirq

Uh-oh.  Quail's dmesg reveals:

  [  259.241751] pci 0000:00:05.0: [8086:1c26] type 0 class 0x000c03
  [  259.242090] pci 0000:00:05.0: reg 10: [mem 0x00000000-0x00000fff]
  [  259.244192] pci 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus 
  [  259.244196] pci 0000:00:05.0: BAR 0: assigned [mem 0xdc030000-0xdc030fff]
  [  259.244276] pci 0000:00:05.0: BAR 0: set to [mem 0xdc030000-0xdc030fff] 
(PCI address [0xdc030000-0xdc030fff])
  [  259.304377] usbcore: registered new interface driver usbfs
  [  259.304392] usbcore: registered new interface driver hub
  [  259.304412] usbcore: registered new device driver usb
  [  259.305635] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
  [  259.305727] xen map irq failed -22
  [  259.305729] ehci_hcd 0000:00:05.0: PCI INT A: failed to register GSI

And indeed, qemu-dm complains of:

  dm-command: hot insert pass-through pci dev 
  register_real_device: Assigning real physical device 00:1d.0 ...
  register_real_device: Enable MSI translation via per device option
  register_real_device: Disable power management
  pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No 
such file or directory: 0x0:0x1d.0x0
  pt_register_regions: IO region registered (size=0x00000400 
  register_real_device: Error: Mapping irq failed, rc = -1
  register_real_device: Real physical device 00:1d.0 registered successfuly!
  IRQ type = INTx
  pt_pci_write_config: Warning: Guest attempt to set address to unused Base 
Address Register. [00:05.0][Offset:30h][Length:4]
  pt_iomem_map: e_phys=dc030000 maddr=fe706000 type=0 len=4096 index=0 

Ow!  (The full log is in 04-qemu-log.txt.)  The rescan also convinces Xen that:

  (XEN) irq.c:1817: dom7: invalid pirq -28 or emuirq 36

At this point I got sort of stumped and started paging through Xen source code
trying to figure out what in the world was going on, though I didn't get very
far.

--- The quandary ---

My best guess at this stage is that:

  - Xen is deriving the number of GSIs and MSIs available from the host APIC,
    and this is somehow carrying over to the domU interrupts.

  - For some reason, the MSI format being used is not supported by the PCI
    passthrough code, or else it's misdetecting what sort of MSI is coming
    down the line (I'm not yet familiar enough with PCI to know how this
    works).

  - Since Xen and qemu-dm only support attaching devices to separate virtual
    PCI buses in the domU, they can't share GSI interrupts.  (I wouldn't
    expect them to work on the same virtual bus anyway, given that some are
    PCI Express and expect a point-to-point link, but maybe it'd work if
    they're not too picky.)

  - On this modern machine, the number of GSIs available is severely
    diminished under the expectation that almost all mainboard and add-on
    devices will use MSI.  (E.g., there's only one PCI (as opposed to PCI-E)
    slot.)  This carries over to the domU.  Combined with the above, we run
    out of GSIs trying to attach the second device.
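
One rough way to gather evidence for or against the guesses above is to tally how many interrupt lines each interrupt controller is serving, in the dom0 and in the domU.  The sketch below is only that: tally_irq_chips is a hypothetical helper, and it assumes the common /proc/interrupts layout of a CPU header line followed by one count column per CPU and then a single-word controller name.  It shows what the kernel has set up, not what limit Xen enforces.

```sh
#!/bin/sh
# tally_irq_chips < /proc/interrupts
# Counts numbered interrupt lines per interrupt-controller name.
tally_irq_chips() {
    awk '
        NR == 1 { ncpu = NF; next }     # header: one CPUn column per CPU
        $1 ~ /^[0-9]+:$/ {              # skip NMI:, LOC:, ERR:, and so on
            chip = $(ncpu + 2)          # field after the per-CPU counts
            count[chip]++
        }
        END { for (c in count) print count[c], c }
    ' | LC_ALL=C sort -k2
}

# Example:
#   tally_irq_chips < /proc/interrupts
```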

Note that:

  - None of the devices involved support function-level reset that I know of
    (via  lspci -vv  in dom0).  This does not appear to actually be a
    problem in practice; I have successfully attached the USB controller by
    itself to Quail, or the PCI audio device by itself.  I can give  lspci
    -vv  output if it's useful, but not right now since I'd have to reboot
    the dom0 again to do it and I'm running out of time.

  - I'm not currently attaching the second function of the Radeon card.  This
    again does not appear to cause a problem even though theoretically one is
    supposed to attach both functions.  I think I tried attaching both of them
    simultaneously, and it didn't work due to this not being a bleeding-edge
    Xen, but I don't have the details at the moment; attaching them both as
    separate PCI passthrough devices didn't obviously do anything useful.

    (I'd actually prefer to avoid attaching that second function if possible;
    it'd be very convenient to not have to bother with a worthless second
    audio device gunking up Quail's ALSA configuration.)

  - Last I checked, trying to boot Quail with any PCI passthrough devices
    already attached caused a failure to find the root filesystem.  I haven't
    gone back through this to check again, but I believe what is happening is
    that the Xen domU platform PCI device that the PV-on-HVM drivers use(?) is
    found, and the driver disables the emulated disks as a result, but then
    its interrupt attachment goes haywire because the available interrupts
    have all been used up by the passthrough devices, so communication with
    the Xen backends breaks down, which grinds everything to a halt because
    various essential devices (such as the disks) are already gone.

    I can go back and try this again if more specific information on it would
    be useful.

  - Previously I was running the dom0 kernel with pci=nocrs as well, which
    avoided the PCI address space overlap with video ROM message present in
    01-dom0-dmesg.txt.  This doesn't seem to have made a significant
    difference in behavior at any of the steps.

My questions:

  - Does my tentative analysis seem sound?  If not, what have I miscalculated?
    Or am I on the wrong track from the very beginning?

  - Will upgrading to a bleeding-edge Xen make any difference?  I'd rather not
    do this unless necessary, since I'd prefer a stabler machine with a more
    well-tested hypervisor, but I'm well aware that what I'm trying to do is
    fairly obscure and may require jiggering software around.

    + A diff of xen/arch/x86/hvm/vmsi.c suggests that it will _not_ make the
      MSIs work if the delivery mode is being detected properly.  vmsi_deliver
      seems to have the same code regarding delivery mode 3 (i.e., none; it's
      dest__reserved_1 in xen/include/asm-x86/io_apic.h).

    + A grep of tools/ioemu-qemu-xen/hw/pass-through.c suggests that it
      _might_ allow using 5000 MB of memory for the domU rather than 3500 MB
      without bogotifying the PCI windows, since the "Guest attempt to set
      high MMIO Base Address" message is gone, suggesting that 64-bit BAR will
      work, but I haven't confirmed this.  Granting the domU more memory would
      be very good, but is not essential.

  - What is the best way to proceed that might allow me to attach these
    devices to an HVM domU at the same time?  Or am I merely hosed?  In
    particular:

    + Could I convince Xen and/or the device model to emulate the HVM domU's
      APIC with more available GSIs rather than copying the model from the
      host (if that is indeed what it's doing)?

    + Could I patch vmsi.c to make delivery mode 3 work so that MSIs from the
      Radeon card will be delivered?  Or is delivery mode 3 truly nonexistent,
      meaning that there is something else wrong with MSI passthrough?  If the
      latter, where might I look to find out more?

      Searching the mailing list archives yields various patches that have
      appeared in the past regarding vmsi.c, but nothing that looks both
      relevant and unapplied.

  - I would like to not have to run the domU kernel with pci=nocrs.  Is there
    a reasonable way to make the device model allocate the PCI host bridge
    resources differently to make this unnecessary?  It doesn't seem important
    by comparison to the above, but is continuing to run with pci=nocrs an
    indicator that I may run into problems later if I upgrade Xen or qemu-dm?

  - Similarly, would running the dom0 kernel with pci=nocrs help or hurt in
    any particular way?  It doesn't seem to have much effect, but maybe I'm
    missing something.

To anyone who made it this far: I am at least theoretically open to arranging
reasonable bribes for the first people to help me get this to work.  Inquire
privately if you wish to take advantage of this.  Otherwise, you still get my
thanks in advance.  :-)


   ---> Drake Wilson

Post Scriptum: BIOS 0704 has been unhelpful in either getting ECC enabled or
doing anything to the PCI passthrough.  Alas.  Also,  lspci -vv  output is
now available as 05-lspci-vv.txt.
