Xen project Mailing List

Re: PCI pass-through problem for SN570 NVME SSD

From: "G.R." <firemeteor@xxxxxxxxxxxxxxxxxxxxx>

Date: Thu, 7 Jul 2022 23:24:21 +0800

Cc: xen-devel <xen-devel@xxxxxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>

Delivery-date: Thu, 07 Jul 2022 15:24:51 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 06.07.2022 08:25, G.R. wrote:
> > On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >> Nothing useful in there. Yet independent of that I guess we need to
> >> separate the issues you're seeing. Otherwise it'll be impossible to
> >> know what piece of data belongs where.
> > Yep, I think I'm seeing several different issues here:
> > 1. The FLR related DPC / AER message seen on the 1st attempt only when
> > pciback tries to seize and release the SN570
> > - Later-on pciback operations appear just fine.
> > 2. MSI-X preparation failure message that shows up each time the SN570
> > is seized by pciback or when it's passed to domU.
> > 3. XEN tries to map BAR from two devices to the same page
> > 4. The "write-back to unknown field" message in QEMU log that goes
> > away with permissive=1 passthrough config.
> > 5. The "irq 16: nobody cared" message shows up *sometimes* in a
> > pattern that I haven't figured out (See attached)
> > 6. The FreeBSD domU sees the device but fails to use it because low
> > level commands sent to it are aborted.
> > 7. The device does not return to the pci-assignable-list when the domU
> > it was assigned shuts-down. (See attached)
> >
> > #3 appears to be a known issue that could be worked around with
> > patches from the list.
> > I suspect #1 may have something to do with the device itself. It's
> > still not clear if it's deadly or just annoying.
> > I was able to update the firmware to the latest version and confirmed
> > that the new firmware didn't make any noticeable difference.
> >
> > I suspect issue #2, #4, #5, #6, #7 may be related, and the
> > pass-through was not completely successful...
> >
> > Should I expect a debug build of XEN hypervisor to give better
> > diagnose messages, without the debug patch that Roger mentioned?
>
> Well, "expect" is perhaps too much to say, but with problems like
> yours (and even more so with multiple ones) using a debug
> hypervisor (or kernel, if there such a build mode existed) is imo
> always a good idea. As is using as up-to-date a version as
> possible.

I built both a 4.14.3 debug version and a 4.16.1 release version for
testing purposes.
Unfortunately they gave me no new information, since neither of them
can get past issue #1, the FLR related DPC / AER issue.

The 4.16.1 release build survives the 'xl pci-assignable-add' that
triggers the first AER failure, but the subsequent
'xl pci-assignable-remove' makes xl segfault:

[ 655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
[ 655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44

Since I need a couple of pci-assignable-add && pci-assignable-remove
cycles to reach a seemingly normal state, I cannot proceed from here.

With the 4.14.3 debug build, the hypervisor / dom0 reboots on
'xl pci-assignable-add'.
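Regarding the xl segfault line quoted above: the faulting instruction
pointer and the mapping reported in brackets are enough to compute where
inside the libxenlight mapping the crash happened, which may help when
matching it against a disassembly. The following is only a minimal
sketch. The regular expression and the function name are made up for
illustration, and the resulting offset is relative to the mapping shown
in the log line, which is not necessarily the same as an offset into the
ELF file.

```python
import re

# Hypothetical helper: pattern written for the segfault line format seen
# in the dom0 kernel log above, not taken from any library.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def segfault_offset(line):
    """Return (object name, offset of ip inside the reported mapping)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("ip lies outside the reported mapping")
    return m.group("obj"), offset


LINE = (
    "[ 655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f "
    "sp 00007ffd73a3d4d0 error 4 in "
    "libxenlight.so.4.16.0[7f2cccd92000+7c000]"
)

obj, off = segfault_offset(LINE)
print(obj, hex(off))
```

The "segfault at 0" part of the same line indicates that the faulting
access was to address 0, i.e. a NULL pointer dereference inside
libxenlight.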
[ 574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[ 574.623203] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[ 574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[ 574.623209] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 574.623240] pcieport 0000:00:1d.0: device [8086:a330] error status/mask=00200000/00010000
[ 574.623261] pcieport 0000:00:1d.0: [21] ACSViol (First)
[ 575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
[ 576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[ 579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[ 583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[ 591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
[ 609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[ 643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
//<=======The reboot happens somewhere here, not immediately, but after a while...
//Maybe I can get something from xl dmesg if I was quick enough and have connected from a second terminal...
[ 644.773922] pciback 0000:05:00.0: xen_pciback: reset device
[ 644.774050] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_detected(bus:5,devfn:0)
[ 644.774051] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[ 644.923432] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_resume(bus:5,devfn:0)
[ 644.923437] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[ 644.923616] pcieport 0000:00:1d.0: AER: device recovery successful

>
> Jan
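For reading the log above, two details can be reproduced with a small
script: which bits are set in the AER "error status/mask" words, and why
the "not ready ... after FLR" messages stop at 65535ms. This is a rough
sketch under stated assumptions, not the kernel's code. The bit names in
the table are the short names as I understand the Linux AER driver prints
them (only bit 21 is confirmed by the log itself), and the 1000 ms
reporting threshold and 60000 ms timeout are assumed values chosen
because they reproduce the sequence seen here.

```python
# Partial table of AER uncorrectable error bits. Assumption: short names
# as printed by the Linux AER driver. Only "[21] ACSViol" appears in the log.
AER_UNCOR_BITS = {
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    18: "MalfTLP",
    20: "UnsupReq",
    21: "ACSViol",
}


def decode_bits(value):
    """List the names (or 'bitN') of all bits set in a 32-bit word."""
    return [
        AER_UNCOR_BITS.get(bit, f"bit{bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


def flr_wait_messages(timeout_ms=60000, report_after_ms=1000):
    """Model a doubling poll delay that reports (delay - 1) each round.

    Both thresholds are assumptions picked to match the log, not values
    read from the kernel source.
    """
    messages = []
    delay = 1
    while True:
        if delay > timeout_ms:
            messages.append((delay - 1, "giving up"))
            return messages
        if delay > report_after_ms:
            messages.append((delay - 1, "waiting"))
        delay *= 2


status_names = decode_bits(0x00200000)
mask_names = decode_bits(0x00010000)
waits = flr_wait_messages()

print("status:", status_names)
print("mask:  ", mask_names)
for ms, what in waits:
    print(f"not ready {ms}ms after FLR; {what}")
```

Under these assumptions the status word has only the ACS violation bit
set, the masked bit is a different one, and the device never became ready
during the whole doubling sequence, so the roughly 69 seconds between the
first and last pciback message were spent entirely in that wait.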
