
Re: PCI pass-through problem for SN570 NVME SSD


  • To: "G.R." <firemeteor@xxxxxxxxxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 7 Jul 2022 18:18:25 +0200
  • Cc: xen-devel <xen-devel@xxxxxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Thu, 07 Jul 2022 16:18:40 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 07.07.2022 17:36, G.R. wrote:
> On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemeteor@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>
>>>> Should I expect a debug build of the Xen hypervisor to give better
>>>> diagnostic messages, without the debug patch that Roger mentioned?
>>>
>>> Well, "expect" is perhaps too much to say, but with problems like
>>> yours (and even more so with multiple ones) using a debug
>>> hypervisor (or kernel, if such a build mode existed) is imo
>>> always a good idea. As is using as up-to-date a version as
>>> possible.
>>
>> I built both a 4.14.3 debug version and a 4.16.1 release version for
>> testing purposes.
>> Unfortunately they gave me absolutely zero information, since neither
>> of them can get past issue #1, the FLR-related DPC / AER issue.
>> With the 4.16.1 release build, it can actually survive the 'xl
>> pci-assignable-add' which triggers the first AER failure.
>> But 'xl pci-assignable-remove' then leads to an xl segmentation fault...
>>> [  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>>> [  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
>> Since I need a couple of pci-assignable-add / pci-assignable-remove
>> cycles to get to a seemingly normal state, I cannot proceed from
>> here.
>>
>> With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
>> pci-assignable-add'.
>>
>> [  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
>> [  574.623203] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
>> [  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
>> [  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
>> [  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00200000/00010000
>> [  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
>> [  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
>> [  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
>> [  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
>> [  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
>> [  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
>> [  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
>> [  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
>> //<======= The reboot happens somewhere here, not immediately, but
>> after a while...
>> // Maybe I can get something from xl dmesg if I am quick enough and
>> have a second terminal connected...
> 
> Unfortunately I didn't see anything from xl dmesg...
> I wish 'xl dmesg' supported a follow mode, like 'dmesg -w' does on
> Linux.
> As it is, I have to rerun the command manually. The machine suddenly
> freezes after the 'giving up' message is printed.
> I see nothing special in the log. Maybe I was just not lucky enough to
> catch the output, not sure.

If the box reboots in the middle, I guess you really want to hook up
a serial console.
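
Short of that, a polling loop can roughly stand in for the missing
follow mode of 'xl dmesg'. The Python sketch below is only an
illustration under a few assumptions: it runs in dom0 with enough
privilege to call 'xl dmesg', the helper names (new_lines, follow), the
log file name and the polling interval are all made up for the example,
and identical repeated lines in the ring buffer may confuse the overlap
detection. Anything Xen prints right before a hard reset can still be
lost, so this does not replace a serial console.

```python
import os
import subprocess
import time


def new_lines(previous, current):
    """Return the lines of `current` that were not already in `previous`.

    The Xen console ring has a fixed size, so old lines can scroll off
    the top. Find the longest suffix of `previous` that is a prefix of
    `current` and return whatever follows it.
    """
    max_overlap = min(len(previous), len(current))
    for overlap in range(max_overlap, 0, -1):
        if previous[-overlap:] == current[:overlap]:
            return current[overlap:]
    return current  # no overlap found: treat everything as new


def follow(interval=0.5, log_path="xl-dmesg.log"):
    """Poll 'xl dmesg' forever, printing and saving only new lines.

    `interval` and `log_path` are hypothetical defaults for this sketch.
    """
    previous = []
    with open(log_path, "a") as log:
        while True:
            result = subprocess.run(
                ["xl", "dmesg"], capture_output=True, text=True, check=False
            )
            current = result.stdout.splitlines()
            for line in new_lines(previous, current):
                print(line, flush=True)
                log.write(line + "\n")
            # Push the data to disk so it has a chance to survive a reboot.
            log.flush()
            os.fsync(log.fileno())
            previous = current
            time.sleep(interval)
```

Calling follow() from a second terminal before running 'xl
pci-assignable-add' would keep whatever was captured in xl-dmesg.log
across the reboot.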

Jan



 

