
Re: PCI pass-through problem for SN570 NVME SSD



On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 06.07.2022 08:25, G.R. wrote:
> > On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >> Nothing useful in there. Yet independent of that I guess we need to
> >> separate the issues you're seeing. Otherwise it'll be impossible to
> >> know what piece of data belongs where.
> > Yep, I think I'm seeing several different issues here:
> > 1. The FLR related DPC / AER message seen on the 1st attempt only when
> > pciback tries to seize and release the SN570
> >     - Later-on pciback operations appear just fine.
> > 2. MSI-X preparation failure message that shows up each time the SN570
> > is seized by pciback or when it's passed to domU.
> > 3. XEN tries to map BAR from two devices to the same page
> > 4. The "write-back to unknown field" message in QEMU log that goes
> > away with permissive=1 passthrough config.
> > 5. The "irq 16: nobody cared" message shows up *sometimes* in a
> > pattern that I haven't figured out  (See attached)
> > 6. The FreeBSD domU sees the device but fails to use it because low
> > level commands sent to it are aborted.
> > 7. The device does not return to the pci-assignable-list when the domU
> > it was assigned shuts-down. (See attached)
> >
> > #3 appears to be a known issue that could be worked around with
> > patches from the list.
> > I suspect #1 may have something to do with the device itself. It's
> > still not clear if it's deadly or just annoying.
> > I was able to update the firmware to the latest version and confirmed
> > that the new firmware didn't make any noticeable difference.
> >
> > I suspect issue #2, #4, #5, #6, #7 may be related, and the
> > pass-through was not completely successful...
> >
> > Should I expect a debug build of XEN hypervisor to give better
> > diagnose messages, without the debug patch that Roger mentioned?
>
> Well, "expect" is perhaps too much to say, but with problems like
> yours (and even more so with multiple ones) using a debug
> hypervisor (or kernel, if there such a build mode existed) is imo
> always a good idea. As is using as up-to-date a version as
> possible.

I built both a 4.14.3 debug version and a 4.16.1 release version for
testing purposes.
Unfortunately they gave me absolutely zero information, since neither of
them is able to get past issue #1,
the FLR-related DPC / AER issue.
With the 4.16.1 release, it actually survives the 'xl
pci-assignable-add' which triggers the first AER failure.
But the 'xl pci-assignable-remove' then leads to an xl segmentation fault:
>[  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>[  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
Since I need a couple of pci-assignable-add && pci-assignable-remove
cycles to get to a seemingly normal state, I cannot
proceed from here.
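
For anyone who wants to chase the xl crash, the segfault line above already contains enough to compute where in libxenlight the fault happened. The following is only a minimal sketch: the helper name `fault_offset` is made up for illustration, and it assumes the bracketed value is the start of the mapping that contains the faulting instruction. If that mapping does not begin at file offset 0, the result still needs adjusting before it is fed to a symbolizer. The "segfault at 0" with "error 4" part indicates a user-mode read from address 0, i.e. a NULL pointer dereference.

```python
import re

# Kernel segfault line copied from the log above (joined onto one line).
LINE = ("xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 "
        "error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]")

def fault_offset(line):
    """Return (library, offset of the faulting ip inside the mapping).

    Hypothetical helper, not part of any Xen tool.
    """
    m = re.search(
        r"ip ([0-9a-f]+) .* in ([^\[\s]+)\[([0-9a-f]+)\+([0-9a-f]+)\]", line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m.group(1), 16)
    base = int(m.group(3), 16)
    size = int(m.group(4), 16)
    offset = ip - base
    # Sanity check: the instruction pointer must lie inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("ip is outside the reported mapping")
    return m.group(2), offset

lib, off = fault_offset(LINE)
print(lib, hex(off))  # libxenlight.so.4.16.0 0x1d71f
```

With a libxenlight build that has debug info, the offset could then be tried with something like `addr2line -f -e libxenlight.so.4.16.0 0x1d71f`, subject to the mapping caveat above.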

With the 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
pci-assignable-add'.

[  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  574.623203] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00200000/00010000
[  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
[  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
[  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
//<======= The reboot happens somewhere here, not immediately, but
after a while...
//Maybe I can get something from xl dmesg if I am quick enough and
have already connected from a second terminal...
[  644.773922] pciback 0000:05:00.0: xen_pciback: reset device
[  644.774050] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_detected(bus:5,devfn:0)
[  644.774051] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  644.923432] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_resume(bus:5,devfn:0)
[  644.923437] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  644.923616] pcieport 0000:00:1d.0: AER: device recovery successful
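
As a cross-check of the "error status/mask=00200000/00010000" line in the log above, the two values can be decoded bit by bit. This is a sketch under assumptions: the table `UNCOR_BITS` and the function `decode` are illustrative names, and the bit labels are the short names Linux prints for the AER Uncorrectable Error Status register (only the commonly seen bits are listed).

```python
# Short names for bits of the PCIe AER Uncorrectable Error Status register,
# as printed by the Linux AER driver. Partial table, for illustration only.
UNCOR_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}

def decode(value):
    """List the names of all bits set in a 32-bit status or mask value."""
    return [UNCOR_BITS.get(bit, f"bit{bit}")
            for bit in range(32) if value & (1 << bit)]

status = 0x00200000  # from "error status/mask=00200000/00010000"
mask = 0x00010000

print(decode(status))          # ['ACSViol']
print(decode(mask))            # ['UnxCmplt']
print(decode(status & ~mask))  # ['ACSViol']
```

The only status bit set is bit 21, matching the "[21] ACSViol" line, and it is not covered by the mask, which is consistent with the "unmasked uncorrectable error detected" message from DPC.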



>
> Jan



 

