Re: PCI pass-through problem for SN570 NVME SSD
On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemeteor@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >
> > > Should I expect a debug build of XEN hypervisor to give better
> > > diagnose messages, without the debug patch that Roger mentioned?
> >
> > Well, "expect" is perhaps too much to say, but with problems like
> > yours (and even more so with multiple ones) using a debug
> > hypervisor (or kernel, if there such a build mode existed) is imo
> > always a good idea. As is using as up-to-date a version as
> > possible.
>
> I built both 4.14.3 debug version and 4.16.1 release version for
> testing purposes.
> Unfortunately they gave me absolutely zero information, since both of
> them are not able to get through issue #1
> the FlR related DPC / AER issue.
> With 4.16.1 release, it actually can survive the 'xl
> pci-assignable-add' which triggers the first AER failure.
> But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
> >[ 655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp
> >00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
> >[ 655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c
> >0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48>
> >8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
> Since I'll need a couple of pci-assignable-add &&
> pci-assignable-remove to get to a seemingly normal state, I cannot
> proceed from here.
>
> With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
> pci-assignable-add'.
>
> [ 574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
> etc) the device
> [ 574.623203] pcieport 0000:00:1d.0: DPC: containment event,
> status:0x1f11 source:0x0000
> [ 574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error
> detected
> [ 574.623209] pcieport 0000:00:1d.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> ID)
> [ 574.623240] pcieport 0000:00:1d.0: device [8086:a330] error
> status/mask=00200000/00010000
> [ 574.623261] pcieport 0000:00:1d.0: [21] ACSViol (First)
> [ 575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
> [ 576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
> [ 579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
> [ 583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
> [ 591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
> [ 609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
> [ 643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
> //<=======The reboot happens somewhere here, not immediately, but
> after a while...
> //Maybe I can get something from xl dmesg if I was quick enough and
> have connected from a second terminal...

Unfortunately I didn't see anything in 'xl dmesg'...
I wish 'xl dmesg' supported a follow mode like the Linux 'dmesg -w'
does. Here I have to repeat the command manually.
The machine suddenly freezes after the 'giving up' message is printed.
I see nothing special in the log.
Maybe I just wasn't lucky enough to catch the output, I'm not sure.
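
For what it's worth, the manual re-running of 'xl dmesg' could be
approximated with a small polling loop. The following is only a rough
sketch, not an existing xl feature: the helper names (new_lines, follow),
the polling interval and the overlap heuristic are all my own assumptions.
It re-runs the command, compares the output with the previous snapshot and
prints only the lines that were not there before, so it can run in a second
terminal while the pass-through commands are issued in the first.

```python
import subprocess
import time


def new_lines(previous, current):
    """Return the lines of `current` that were not in `previous`.

    Assumption: the console ring buffer only drops old lines from the
    front and appends new ones at the end, so some tail of `previous`
    should reappear as the head of `current`.
    """
    for k in range(len(previous) + 1):
        tail = previous[k:]
        if current[:len(tail)] == tail:
            return current[len(tail):]
    return current  # not reached: the empty tail always matches


def follow(cmd=("xl", "dmesg"), interval=0.5, max_polls=None):
    """Poll `cmd` and print only newly appeared lines (hypothetical helper)."""
    previous = []
    polls = 0
    while max_polls is None or polls < max_polls:
        result = subprocess.run(list(cmd), capture_output=True,
                                text=True, check=False)
        current = result.stdout.splitlines()
        for line in new_lines(previous, current):
            # Flush right away, since the machine may freeze at any moment.
            print(line, flush=True)
        previous = current
        polls += 1
        time.sleep(interval)


# Example (as root in dom0): follow(("xl", "dmesg"), interval=0.2)
```

Polling can still miss whatever is logged between the last poll and the
freeze, so this only narrows the window compared with repeating the
command by hand.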