
Re: PCI pass-through problem for SN570 NVME SSD



On Mon, Jul 4, 2022 at 5:53 PM Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
>
> On Sun, Jul 03, 2022 at 01:43:11AM +0800, G.R. wrote:
> > Hi everybody,
> >
> > I run into problems passing through a SN570 NVME SSD to a HVM guest.
> > So far I have no idea if the problem is with this specific SSD or with
> > the CPU + motherboard combination or the SW stack.
> > Looking for some suggestions on troubleshooting.
> >
> > List of build info:
> > CPU+motherboard: E-2146G + Gigabyte C246N-WU2
> > XEN version: 4.14.3
>
> Are you using a debug build of Xen? (if not it would be helpful to do
> so).
It's a release build at the moment. I can switch to a debug build
later when I have some free time.
BTW, I made a DEBUG build of the xen_pciback driver to see how it
interacts with the 'xl pci-assignable-xxx' commands.
You can find the log in my 2nd email in this thread.

>
> > Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
> > The SN570 SSD sits here in the PCI tree:
> >            +-1d.0-[05]----00.0  Sandisk Corp Device 501a
>
> Could be helpful to post the output with -vvv so we can see the
> capabilities of the device.
Sure, please find the -vvv output in the attachment.
The line above was only meant to show where the device sits in the
PCI tree, i.e. 05:00.0 is attached under 00:1d.0.

>
> > Syndromes observed:
> > With ASPM enabled, pciback has problem seizing the device.
> >
> > Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0:
> > xen_pciback: seizing device
> > ...
> > Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0:
> > AER: enabled with IRQ 150
> > Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0:
> > DPC: enabled with IRQ 150
> > Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0:
> > DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+
> > SwTrigger+ RP PIO Log 4, DL_ActiveErr+
>
> Is there a device reset involved here?  It's possible the device
> doesn't reset properly and hence the Uncorrectable Error Status
> Register ends up with inconsistent bits set.

xen_pciback appears to force an FLR whenever it attempts to seize an
FLR-capable device, as shown in pciback_dbg_xl-pci_assignable_XXX.log
attached to my 2nd mail:
[  323.448115] xen_pciback: wants to seize 0000:05:00.0
[  323.448136] pciback 0000:05:00.0: xen_pciback: probing...
[  323.448137] pciback 0000:05:00.0: xen_pciback: seizing device
[  323.448162] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  323.448162] pciback 0000:05:00.0: xen_pciback: initializing...
[  323.448163] pciback 0000:05:00.0: xen_pciback: initializing config
[  323.448344] pciback 0000:05:00.0: xen_pciback: enabling device
[  323.448425] xen: registering gsi 16 triggering 0 polarity 1
[  323.448428] Already setup the GSI :16
[  323.448497] pciback 0000:05:00.0: xen_pciback: save state of device
[  323.448642] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
etc) the device
[  323.448707] pcieport 0000:00:1d.0: DPC: containment event,
status:0x1f11 source:0x0000
[  323.448730] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  323.448760] pcieport 0000:00:1d.0: PCIe Bus Error:
severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
ID)
[  323.448786] pcieport 0000:00:1d.0:   device [8086:a330] error
status/mask=00200000/00010000
[  323.448813] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  324.690979] pciback 0000:05:00.0: not ready 1023ms after FLR;
waiting  <============ HERE
[  325.730706] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  327.997638] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  332.264251] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  340.584320] pciback 0000:05:00.0: not ready 16383ms after FLR;
waiting
[  357.010896] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  391.143951] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
[  392.249252] pciback 0000:05:00.0: xen_pciback: reset device
[  392.249392] pciback 0000:05:00.0: xen_pciback:
xen_pcibk_error_detected(bus:5,devfn:0)
[  392.249393] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397074] pciback 0000:05:00.0: xen_pciback:
xen_pcibk_error_resume(bus:5,devfn:0)
[  392.397080] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397284] pcieport 0000:00:1d.0: AER: device recovery successful
Note that I only see this during the FLR on the 1st attempt.
My SATA controller, which doesn't support FLR, appears to pass
through just fine...
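
The "not ready ...ms after FLR" values in the log above (1023, 2047, ... 65535) look like a poll delay that doubles on every retry. Below is a minimal sketch of that pattern, not the kernel's actual code. The function name, the 1000 ms reporting threshold and the 60000 ms timeout are my own assumptions, chosen only because they reproduce the logged values for a device that never becomes ready.

```python
def flr_wait_messages(timeout_ms=60000, report_after_ms=1000):
    """Model a doubling poll delay for a device that never becomes ready.

    Returns a list of (label_ms, action) tuples, where label_ms is the
    value that would be printed in "not ready <label_ms>ms after FLR".
    Both thresholds are assumptions inferred from the log, not taken
    from the kernel source.
    """
    messages = []
    delay = 1  # first sleep is 1 ms, then 2, 4, 8, ...
    while True:
        if delay > timeout_ms:
            # The next sleep would exceed the timeout: stop polling.
            messages.append((delay - 1, "giving up"))
            return messages
        if delay > report_after_ms:
            # Only start reporting once the accumulated wait is noticeable.
            messages.append((delay - 1, "waiting"))
        # The real code would sleep for `delay` ms and re-read config space here.
        delay *= 2


for label_ms, action in flr_wait_messages():
    print(f"not ready {label_ms}ms after FLR; {action}")
```

With these assumed thresholds the sketch yields six "waiting" lines followed by "giving up" at 65535ms, the same sequence as in the log, so the device most likely never responded at all after the reset.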

>
> > ...
> > Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16
> > triggering 0 polarity 1
> > Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
> > Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0:
> > DPC: containment event, status:0x1f11 source:0x0000
> > Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0:
> > DPC: unmasked uncorrectable error detected
> > Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0:
> > PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
> > Layer, (Receiver ID)
> > Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0:
> > device [8086:a330] error status/mask=00200000/00010000
> > Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0:
> > [21] ACSViol                (First)
> > Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0:
> > xen_pciback: device is not found/assigned
> > Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0:
> > xen_pciback: device is not found/assigned
> > Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0:
> > AER: device recovery successful
> > Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0:
> > DPC: containment event, status:0x1f01 source:0x0000
> > Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0:
> > DPC: unmasked uncorrectable error detected
> > Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0:
> > xen_pciback: device is not found/assigned
>
> That's from a different device (05:00.0).
00:1d.0 is the bridge port that 05:00.0 attaches to.
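
For reference, the "error status/mask=00200000/00010000" pair reported by the bridge port can be decoded bit by bit. This is a rough sketch: the helper names are made up for illustration, and the bit-name table is my reading of the AER Uncorrectable Error Status layout, using the short names Linux prints. Only "[21] ACSViol" is confirmed by the log itself.

```python
# Short names for AER Uncorrectable Error Status bits.
# Assumption: table written from my reading of the AER register layout;
# only bit 21 (ACSViol) is confirmed by the log above.
UNCOR_BIT_NAMES = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}


def parse_status_mask(field):
    """Split a 'status/mask' hex pair as printed in the AER log line."""
    status_hex, mask_hex = field.split("/")
    return int(status_hex, 16), int(mask_hex, 16)


def set_bits(value):
    """Return (bit, name) for every bit set in a 32-bit register value."""
    return [
        (bit, UNCOR_BIT_NAMES.get(bit, "unknown"))
        for bit in range(32)
        if (value >> bit) & 1
    ]


status, mask = parse_status_mask("00200000/00010000")
print("status bits:  ", set_bits(status))
print("masked bits:  ", set_bits(mask))
print("unmasked hits:", set_bits(status & ~mask))
```

The only status bit set is 21 (ACSViol), and it is not covered by the mask, which matches the "unmasked uncorrectable error detected" message from DPC.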


> >
> > After the 'xl pci-assignable-list' appears to be self-consistent,
> > creating VM with the SSD assigned still leads to a guest crash:
> > From qemu log:
> > [00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
> > qemu-system-i386: terminating on signal 1 from pid 1192 (xl)
> >
> > From the 'xl dmesg' output:
> > (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
>
> Seems like QEMU is attempting to remap a p2m_mmio_direct region.
>
> Can you paste the full output of `xl dmesg`? (as that will contain the
> memory map).
Attached.

>
> Would also be helpful if you could get the RMRR regions from that
> box. Booting with `iommu=verbose` on the Xen command line should print
> those.
Coming in my next reply...

Attachment: lspcivvv_cutdown.log
Description: Text Data

Attachment: xldmesg_full.log
Description: Text Data


 

