
PCI pass-through problem for SN570 NVME SSD



Hi everybody,

I ran into problems passing an SN570 NVMe SSD through to an HVM guest.
So far I cannot tell whether the problem lies with this specific SSD,
with the CPU + motherboard combination, or with the software stack.
I am looking for suggestions on how to troubleshoot it.

Build info:
CPU + motherboard: E-2146G + Gigabyte C246N-WU2
Xen version: 4.14.3
Dom0: Linux kernel 5.10 (built from the Debian 11.2 kernel source package)
The SN570 SSD sits here in the PCI tree:
           +-1d.0-[05]----00.0  Sandisk Corp Device 501a

Symptoms observed:
With ASPM enabled, pciback has problems seizing the device.

Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0:
xen_pciback: seizing device
...
Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0:
AER: enabled with IRQ 150
Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0:
DPC: enabled with IRQ 150
Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0:
DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+
SwTrigger+ RP PIO Log 4, DL_ActiveErr+
...
Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16
triggering 0 polarity 1
Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0:
DPC: containment event, status:0x1f11 source:0x0000
Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0:
DPC: unmasked uncorrectable error detected
Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0:
PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
Layer, (Receiver ID)
Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0:
device [8086:a330] error status/mask=00200000/00010000
Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0:
[21] ACSViol                (First)
Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0:
xen_pciback: device is not found/assigned
Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0:
xen_pciback: device is not found/assigned
Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0:
AER: device recovery successful
Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0:
DPC: containment event, status:0x1f01 source:0x0000
Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0:
DPC: unmasked uncorrectable error detected
Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0:
xen_pciback: device is not found/assigned
Jul  2 00:36:54 gaia kernel: [    1.938406] xen: registering gsi 16
triggering 0 polarity 1
Jul  2 00:36:54 gaia kernel: [    1.938408] Already setup the GSI :16
Jul  2 00:36:54 gaia kernel: [    1.938666] xen_pciback: backend is vpci
...
Jul  2 00:43:48 gaia kernel: [  420.231955] pcieport 0000:00:1d.0:
DPC: containment event, status:0x1f01 source:0x0000
Jul  2 00:43:48 gaia kernel: [  420.231961] pcieport 0000:00:1d.0:
DPC: unmasked uncorrectable error detected
Jul  2 00:43:48 gaia kernel: [  420.231993] pcieport 0000:00:1d.0:
PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
Layer, (Requester ID)
Jul  2 00:43:48 gaia kernel: [  420.235775] pcieport 0000:00:1d.0:
device [8086:a330] error status/mask=00100000/00010000
Jul  2 00:43:48 gaia kernel: [  420.235779] pcieport 0000:00:1d.0:
[20] UnsupReq               (First)
Jul  2 00:43:48 gaia kernel: [  420.235783] pcieport 0000:00:1d.0:
AER:   TLP Header: 34000000 05000010 00000000 88458845
Jul  2 00:43:48 gaia kernel: [  420.235819] pci 0000:05:00.0: AER:
can't recover (no error_detected callback)
Jul  2 00:43:48 gaia kernel: [  420.384349] pcieport 0000:00:1d.0:
AER: device recovery successful
... (The following may come from an attempt to assign the device
to the guest; I am not sure.)
Jul  2 00:46:06 gaia kernel: [  559.147333] pciback 0000:05:00.0:
xen_pciback: seizing device
Jul  2 00:46:06 gaia kernel: [  559.147435] pciback 0000:05:00.0:
enabling device (0000 -> 0002)
Jul  2 00:46:06 gaia kernel: [  559.147508] xen: registering gsi 16
triggering 0 polarity 1
Jul  2 00:46:06 gaia kernel: [  559.147511] Already setup the GSI :16
Jul  2 00:46:06 gaia kernel: [  559.147558] pciback 0000:05:00.0:
xen_pciback: MSI-X preparation failed (-6)
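
To make the AER lines above easier to read, here is a minimal Python
sketch that decodes the "error status/mask" values and the logged TLP
header. The bit names are a partial copy of the labels the Linux AER
driver prints; the helper names (decode_status, decode_tlp_header) are
my own, and the field layout is my reading of the PCIe TLP header
format, so please treat it as an assumption rather than a reference.

```python
# Partial table of AER uncorrectable-error bits, using the short labels
# that Linux prints (e.g. "[21] ACSViol"). Bits not listed are shown as bitN.
UNCOR_BITS = {
    4: "DLP", 5: "SDES", 12: "TLP", 13: "FCP", 14: "CmpltTO",
    15: "CmpltAbrt", 16: "UnxCmplt", 17: "RxOF", 18: "MalfTLP",
    19: "ECRC", 20: "UnsupReq", 21: "ACSViol",
}


def decode_status(value):
    """Return the names of all set bits in a status/mask value, lowest first."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(UNCOR_BITS.get(bit, "bit%d" % bit))
    return names


def decode_tlp_header(dwords):
    """Split the first two header DWORDs into their basic fields.

    Assumes the DWORDs are given in the order the kernel prints them,
    with header byte 0 in the top byte of the first DWORD.
    """
    dw0, dw1 = dwords[0], dwords[1]
    tlp_type = (dw0 >> 24) & 0x1F
    requester = (dw1 >> 16) & 0xFFFF
    return {
        "fmt": (dw0 >> 29) & 0x7,
        "type": tlp_type,
        # Type 0b10xxx is a message TLP; the low 3 bits are its routing.
        "is_message": (tlp_type >> 3) == 0b10,
        "requester": "%02x:%02x.%d" % (
            requester >> 8, (requester >> 3) & 0x1F, requester & 0x7),
        "tag": (dw1 >> 8) & 0xFF,
        # For a message TLP this byte is the message code; for memory
        # requests it holds the byte enables instead.
        "code": dw1 & 0xFF,
    }


if __name__ == "__main__":
    print(decode_status(0x00200000))  # status of the first event
    print(decode_status(0x00100000))  # status of the event at 420s
    print(decode_status(0x00010000))  # mask value in both events
    print(decode_tlp_header([0x34000000, 0x05000010, 0x00000000, 0x88458845]))
```

For the header logged at 420s this yields requester 05:00.0 (the SSD
itself) and a message TLP with code 0x10. If I read the PCIe message
code table correctly, 0x10 is a Latency Tolerance Reporting (LTR)
message, but that interpretation should be double-checked.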


With pcie_aspm=off, the pciback-related errors disappear from the log.
But I suspect some problems are still hidden: I no longer see any
'AER: enabled' messages, so errors may simply go unreported.
I have xen_pciback built directly into the kernel and assign the
SSD to it on the kernel command line.
However, the results of the 'xl pci-assignable-*' commands are not consistent:

root@gaia:~# xl pci-assignable-list
0000:00:17.0
0000:05:00.0
root@gaia:~# xl pci-assignable-remove 05:00.0
libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove:
failed to de-quarantine 0000:05:00.0 <===== Here!!!
root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add:
0000:05:00.0 already assigned to pciback <==== Here!!!
root@gaia:~# xl pci-assignable-remove 05:00.0
root@gaia:~# xl pci-assignable-list
0000:00:17.0
root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add:
0000:05:00.0 not bound to a driver, will not be rebound.
root@gaia:~# xl pci-assignable-list
0000:00:17.0
0000:05:00.0
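
One way to cross-check the state that xl reports is to look at which
driver the kernel itself has bound to the device in sysfs. The sketch
below is only an illustration: the helper name bound_driver and its
sysfs_root parameter are my own, and it assumes the usual
/sys/bus/pci/devices/<BDF>/driver symlink layout.

```python
import os


def bound_driver(bdf, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None.

    bdf is the full address, e.g. "0000:05:00.0". The sysfs_root
    parameter exists only so the function can be tried on a fake tree.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "driver")
    if not os.path.islink(link):
        # No "driver" symlink means no driver is currently bound.
        return None
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    print(bound_driver("0000:05:00.0"))
```

If this reports pciback while 'xl pci-assignable-list' omits the
device (or the other way round), the mismatch is in the toolstack's
bookkeeping rather than in the kernel binding.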


Even once 'xl pci-assignable-list' appears self-consistent,
creating a VM with the SSD assigned still leads to a guest crash.
From the qemu log:
[00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
qemu-system-i386: terminating on signal 1 from pid 1192 (xl)

From the 'xl dmesg' output:
(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
(XEN) domain_crash called from p2m.c:1301
(XEN) Domain 1 reported crashed by domain 0 on cpu#4:
(XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
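
To put addresses on the frame numbers above, here is a small Python
sketch that parses the "not permitted" line and converts the frame
numbers to byte addresses. It assumes 4 KiB pages, and it assumes the
first tuple describes the mapping already present at that GFN while
the second is the one being requested; that is my reading of the
message, not something I have confirmed in the Xen source. The helper
name parse_gfn_conflict is made up for this example.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages

LINE = "(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted"

PATTERN = re.compile(
    r"GFN (0x[0-9a-f]+) \((0x[0-9a-f]+),[^)]*\) -> \((0x[0-9a-f]+),")


def parse_gfn_conflict(line):
    """Extract the GFN and the two MFNs and convert them to addresses."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognised line: %r" % line)
    gfn, first_mfn, second_mfn = (int(g, 16) for g in match.groups())
    return {
        "guest_addr": gfn << PAGE_SHIFT,
        # Assumed meaning: mapping currently present at this GFN.
        "first_host_addr": first_mfn << PAGE_SHIFT,
        # Assumed meaning: mapping that was being requested.
        "second_host_addr": second_mfn << PAGE_SHIFT,
    }


if __name__ == "__main__":
    for name, addr in parse_gfn_conflict(LINE).items():
        print("%s = %#x" % (name, addr))
```

Under those assumptions, guest address 0xf3078000 is being pointed at
two different host pages (0xa2616000 and 0xa2504000), which could then
be compared against the BARs shown by 'lspci -vv -s 05:00.0' in dom0.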


Which of the three symptoms is the most fundamental?
1. The DPC / AER error log
2. The inconsistency in 'xl pci-assignable-list' state tracking
3. The GFN mapping failure on guest setup

Any suggestions for the next step?


Thanks,
G.R.



 

