
Re: PCI pass-through problem for SN570 NVME SSD



On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemeteor@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >
> > > Should I expect a debug build of the Xen hypervisor to give better
> > > diagnostic messages, without the debug patch that Roger mentioned?
> >
> > Well, "expect" is perhaps too much to say, but with problems like
> > yours (and even more so with multiple ones) using a debug
> > hypervisor (or kernel, if such a build mode existed) is imo
> > always a good idea. As is using as up-to-date a version as
> > possible.
>
> I built both a 4.14.3 debug version and a 4.16.1 release version for
> testing purposes.
> Unfortunately they gave me no new information, since neither of
> them can get past issue #1,
> the FLR-related DPC / AER issue.
> With the 4.16.1 release build, the system does survive the 'xl
> pci-assignable-add' that triggers the first AER failure.
> But the following 'xl pci-assignable-remove' makes xl segfault:
> >[  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 
> >00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
> >[  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 
> >0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 
> >8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
> Since I need a couple of pci-assignable-add / pci-assignable-remove
> cycles to get to a seemingly normal state, I cannot
> proceed from here.
>
> With the 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
> pci-assignable-add'.
>
> [  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
> etc) the device
> [  574.623203] pcieport 0000:00:1d.0: DPC: containment event,
> status:0x1f11 source:0x0000
> [  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error 
> detected
> [  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> ID)
> [  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error
> status/mask=00200000/00010000
> [  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
> [  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
> [  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
> [  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
> [  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
> [  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
> [  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
> [  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
> //<=======The reboot happens somewhere here, not immediately, but
> after a while...
> //Maybe I can get something from xl dmesg if I am quick enough and
> connect from a second terminal...

Unfortunately I didn't see anything from xl dmesg...
I wish 'xl dmesg' supported a follow mode, as Linux dmesg does
with 'dmesg -w'.
As it is, I have to rerun the command manually. The machine suddenly
freezes after the 'giving up' message is printed.
I see nothing special in the log. Maybe I was just not lucky enough to
catch the output; I'm not sure.
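
Since 'xl dmesg' has no follow mode, a small polling wrapper can stand in
for one. The following is only a rough sketch, not something xl provides:
the names new_lines() and follow(), the '--follow' flag and the 0.5 s
interval are all made up for illustration. It assumes 'xl' is on PATH and
that the hypervisor console ring may wrap between polls, so it prints
whatever follows the longest overlap between the previous and the current
snapshot.

```python
#!/usr/bin/env python3
"""Rough stand-in for a follow mode: poll `xl dmesg`, print only new lines."""
import subprocess
import sys
import time


def new_lines(previous, current):
    """Return the lines of `current` that come after its overlap with `previous`.

    The console ring can drop old lines, so we look for the longest suffix
    of `previous` that is also a prefix of `current`. Repeated identical
    lines can fool this heuristic; it is good enough for a quick look.
    """
    for start in range(len(previous) + 1):
        tail = previous[start:]
        if current[:len(tail)] == tail:
            return current[len(tail):]
    return current


def follow(interval=0.5):
    """Poll `xl dmesg` every `interval` seconds (interval is an arbitrary choice)."""
    seen = []
    while True:
        result = subprocess.run(["xl", "dmesg"], capture_output=True, text=True)
        if result.returncode == 0:
            current = result.stdout.splitlines()
            for line in new_lines(seen, current):
                # Flush at once so output survives as long as possible.
                print(line, flush=True)
            seen = current
        time.sleep(interval)


# Hypothetical flag: only start polling when explicitly asked to.
if __name__ == "__main__" and sys.argv[1:] == ["--follow"]:
    try:
        follow()
    except KeyboardInterrupt:
        sys.exit(0)
```

Running this over ssh from another machine keeps the last printed lines
visible on the remote terminal even if dom0 freezes. It still cannot show
anything the hypervisor logs after the final successful poll, so a serial
console would remain the more reliable option.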



 

