Re: PCI pass-through problem for SN570 NVME SSD
On 07.07.2022 17:36, G.R. wrote:
> On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemeteor@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>
>>>> Should I expect a debug build of XEN hypervisor to give better
>>>> diagnose messages, without the debug patch that Roger mentioned?
>>>
>>> Well, "expect" is perhaps too much to say, but with problems like
>>> yours (and even more so with multiple ones) using a debug
>>> hypervisor (or kernel, if there such a build mode existed) is imo
>>> always a good idea. As is using as up-to-date a version as
>>> possible.
>>
>> I built both 4.14.3 debug version and 4.16.1 release version for
>> testing purposes.
>> Unfortunately they gave me absolutely zero information, since both of
>> them are not able to get through issue #1
>> the FlR related DPC / AER issue.
>> With 4.16.1 release, it actually can survive the 'xl
>> pci-assignable-add' which triggers the first AER failure.
>> But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
>>> [ 655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp
>>> 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>>> [ 655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c
>>> 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2
>>> <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
>> Since I'll need a couple of pci-assignable-add &&
>> pci-assignable-remove to get to a seemingly normal state, I cannot
>> proceed from here.
>>
>> With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
>> pci-assignable-add'.
>>
>> [ 574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
>> etc) the device
>> [ 574.623203] pcieport 0000:00:1d.0: DPC: containment event,
>> status:0x1f11 source:0x0000
>> [ 574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error
>> detected
>> [ 574.623209] pcieport 0000:00:1d.0: PCIe Bus Error:
>> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
>> ID)
>> [ 574.623240] pcieport 0000:00:1d.0: device [8086:a330] error
>> status/mask=00200000/00010000
>> [ 574.623261] pcieport 0000:00:1d.0: [21] ACSViol (First)
>> [ 575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
>> [ 576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
>> [ 579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
>> [ 583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
>> [ 591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
>> [ 609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
>> [ 643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
>> //<=======The reboot happens somewhere here, not immediately, but
>> after a while...
>> //Maybe I can get something from xl dmesg if I was quick enough and
>> have connected from a second terminal...
>
> Unfortunately I didn't see anything from xl dmesg...
> I wish the 'xl dmesg' can support the follow mode (dmesg -w) that the
> Linux dmesg does.
> Here I have to manually repeat this command. The machine suddenly
> freezes after the 'giving up' message is out.
> I see nothing special in the log. Maybe I'm just not lucky enough to
> catch the output, not sure.

If the box reboots in the middle, I guess you really want to hook up a
serial console.

Jan
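
On the wish for a follow mode: a rough approximation is to poll 'xl dmesg' and print only the lines that were not in the previous snapshot. The sketch below is an illustration only and is not part of xl. The names new_lines and follow, and the 0.5 s interval, are made up for this example, and it assumes 'xl dmesg' with no options prints the current console ring. It will most likely still miss whatever is logged right before a hard freeze, so a serial console remains the more dependable route.

```python
import subprocess
import time


def new_lines(prev, curr):
    """Return the lines of curr that follow its overlap with prev.

    The console ring may have dropped old lines from its head between two
    snapshots, so look for the longest tail of prev that is also a head of
    curr and return everything after it.
    """
    for drop in range(len(prev) + 1):
        overlap = len(prev) - drop
        if overlap <= len(curr) and prev[drop:] == curr[:overlap]:
            return curr[overlap:]


def follow(interval=0.5, cmd=("xl", "dmesg")):
    """Poll cmd forever and print only output not seen in the last poll.

    Hypothetical helper: interval and cmd are assumptions for illustration.
    """
    prev = []
    while True:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        curr = result.stdout.splitlines()
        for line in new_lines(prev, curr):
            print(line, flush=True)
        prev = curr
        time.sleep(interval)
```

Calling follow() from a second terminal in dom0 (with the same privileges 'xl dmesg' needs) before running 'xl pci-assignable-add' would keep the output scrolling without retyping the command.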