
Re: [Xen-users] Odd domU Reboot Bug (possibly VGA passthrough related)



On 07/02/2013 09:42 AM, Ian Campbell wrote:
On Mon, 2013-07-01 at 19:22 +0100, Gordan Bobic wrote:
The thing that bothers me is that NVRM seems to be what's complaining,
but the GPU being passed through is firmly under control of xen-pciback.

Do the xl -vvv logs or the logs under /var/log/xen/ say anything about
rebinding the device at all?

Nothing at all.

AIUI pci-assignable-add is supposed to unbind the original driver and
bind to pciback and nothing is supposed to rebind until
pci-assignable-remove, but perhaps something is (inadvertently)
happening on domain shutdown too?

If you examine /sys you should be able to see which driver is bound to
the device, which might give a clue.

I'm quite certain it never unbinds - lspci -vvv shows the device still being handled by the pciback driver.
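For reference, the /sys check suggested above can be sketched roughly as follows. This is only a minimal sketch: it assumes the standard Linux sysfs layout, where each PCI function has a "driver" symlink while a driver is bound, and the function name bound_driver is purely illustrative. The addresses are the ones that show up in the pciback messages further down.

```python
import os

def bound_driver(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI function, or None if unbound."""
    link = os.path.join(sysfs_root, bdf, "driver")
    # "driver" is a symlink to the driver's sysfs directory and is
    # absent when no driver is bound to the device.
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))

if __name__ == "__main__":
    # Addresses taken from the pciback log lines in this message.
    for bdf in ("0000:0b:00.0", "0000:0b:00.1", "0000:0d:00.0"):
        print(bdf, bound_driver(bdf))
```

Running this before the first domU start, after shutdown, and after each failed start would show whether the binding ever changes in between.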

If you just nuke the NV driver from dom0 altogether does that help? What
about if you hide the device via the kernel command line rather than
dynamically (assuming that works in your setup)?

I added the xen-pciback module to the initramfs and made sure it loads. I still have to add the USB controllers manually, though, because the USB driver appears to be built into my kernel. Either way, this doesn't change the situation: it still works fine after a fresh reboot, but not after a full VM shutdown.

The pattern of events is quite consistent:

1) Fresh boot - all works fine. Shut down the domU. See attached
qemu-dm-edi.log.3

2) Try booting the domU - it locks up during boot as soon as it tries to initialize the GPU (there's a flash of desktop background and the mouse pointer, but it goes black before the login screen shows up and never comes back). I have to terminate it using "xl destroy edi". See attached qemu-dm-edi.log.2

3) Try booting the domU again - it gets to the desktop in VNC, but only in 16-colour VGA mode, while still thinking it's running on the Quadro card. It shuts down cleanly. See attached qemu-dm-edi.log.1

4) Try booting the domU again - hard lock-up of the host, which I have to hard-reset (actually, I'm not sure it's a complete hard lock-up of the host; I haven't yet tried ssh-ing to it after that happens).

Just looking through /var/log/messages for clues, and I can see this on the 2nd domU start:

Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0018
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: device [8086:340a] error status/mask=00004000/00000000
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: [14] Completion Timeout (First)
Jul 2 21:13:46 normandy kernel: pciback 0000:0d:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.1: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0d:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.1: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0d:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.1: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: AER: Device recovery successful
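
As a quick sanity check on the messages above, here is a rough sketch that decodes the AER requester ID and counts messages per device. It assumes the usual PCIe requester ID layout (8 bits bus, 5 bits device, 3 bits function) and one syslog message per input line; the names decode_requester_id and summarize are illustrative only.

```python
import re
from collections import Counter

# Matches the "<driver> <domain:bus:dev.fn>: <message>" part of a kernel syslog line.
LINE_RE = re.compile(
    r"kernel: (\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (.*)$"
)

def decode_requester_id(rid):
    """Split a 16-bit PCIe requester ID into bus:device.function notation."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"

def summarize(lines):
    """Count kernel messages per (driver, PCI address) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            driver, bdf, _message = m.groups()
            counts[(driver, bdf)] += 1
    return counts

if __name__ == "__main__":
    # id=0018 is the value reported in the AER lines above.
    print(decode_requester_id(0x0018))
```

Under that layout, id=0018 decodes to 00:03.0, i.e. the error is attributed to the bridge port itself rather than to one of the devices behind it.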

lspci shows that device 00:03.0 is the Intel PCIe bridge behind which three of the passed-through devices sit:

1) Quadro 6000 (well, a modified GTX480, but close enough to make no difference)
2) Nvidia audio on the Nvidia card
3) Sound Blaster PCIe
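
The same topology can also be read out of sysfs instead of lspci. A minimal sketch, assuming the usual Linux behaviour where /sys/bus/pci/devices/<address> is a symlink whose resolved path lists every upstream bridge in order; bridge_chain and upstream_bridges are illustrative names, and the path in the usage note is hypothetical, not taken from this machine:

```python
import os
import re

# A full PCI address: domain:bus:device.function
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

def bridge_chain(resolved_path):
    """List the PCI functions on the path from the root bus down to a device."""
    # Components such as "pci0000:00" name the root bus, not a function,
    # so they are filtered out by the pattern.
    return [part for part in resolved_path.split("/") if BDF_RE.match(part)]

def upstream_bridges(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the bridges above a device, outermost first."""
    real = os.path.realpath(os.path.join(sysfs_root, bdf))
    return bridge_chain(real)[:-1]
```

For a hypothetical resolved path such as /sys/devices/pci0000:00/0000:00:03.0/0000:0b:00.0, bridge_chain returns the bridge 0000:00:03.0 followed by the device itself. On a box with intermediate bridges there would be extra components in between.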

So I'm wondering if this might be a problem with either:

1) another PCI memory stomp going on, since the symptoms are similar to what I was seeing before with > 2GB assigned to the domU. But why would it only happen on the second and subsequent domU startups? (domU restarts trigger it, too.)

or

2) a PCIe bridging anomaly due to the VGA card being on the same bridge as another device. Thinking about it, I did add the sound card to the machine recently, and not only is it behind the same Intel PCIe bridge -> Nvidia NF200 PCIe bridge, but the sound card has its own PCIe->PCI bridge on it, so it's doubly bridged for extra weirdness.

Time to start experimenting with different slots again, it seems...

Gordan

Attachment: qemu-dm-edi.log.3
Description: Unix manual page

Attachment: qemu-dm-edi.log
Description: Text document

Attachment: qemu-dm-edi.log.1
Description: Unix manual page

Attachment: qemu-dm-edi.log.2
Description: Unix manual page
