
Re: PCI pass-through problem for SN570 NVME SSD


  • To: "G.R." <firemeteor@xxxxxxxxxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 7 Jul 2022 18:23:21 +0200
  • Cc: xen-devel <xen-devel@xxxxxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony Perard <anthony.perard@xxxxxxxxxx>
  • Delivery-date: Thu, 07 Jul 2022 16:23:30 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 07.07.2022 17:24, G.R. wrote:
> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>
>> On 06.07.2022 08:25, G.R. wrote:
>>> On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>> Nothing useful in there. Yet independent of that I guess we need to
>>>> separate the issues you're seeing. Otherwise it'll be impossible to
>>>> know what piece of data belongs where.
>>> Yep, I think I'm seeing several different issues here:
>>> 1. The FLR related DPC / AER message seen on the 1st attempt only when
>>> pciback tries to seize and release the SN570
>>>     - Later-on pciback operations appear just fine.
>>> 2. MSI-X preparation failure message that shows up each time the SN570
>>> is seized by pciback or when it's passed to domU.
>>> 3. XEN tries to map BAR from two devices to the same page
>>> 4. The "write-back to unknown field" message in QEMU log that goes
>>> away with permissive=1 passthrough config.
>>> 5. The "irq 16: nobody cared" message shows up *sometimes* in a
>>> pattern that I haven't figured out  (See attached)
>>> 6. The FreeBSD domU sees the device but fails to use it because low
>>> level commands sent to it are aborted.
>>> 7. The device does not return to the pci-assignable-list when the domU
>>> it was assigned shuts-down. (See attached)
>>>
>>> #3 appears to be a known issue that could be worked around with
>>> patches from the list.
>>> I suspect #1 may have something to do with the device itself. It's
>>> still not clear if it's deadly or just annoying.
>>> I was able to update the firmware to the latest version and confirmed
>>> that the new firmware didn't make any noticeable difference.
>>>
>>> I suspect issue #2, #4, #5, #6, #7 may be related, and the
>>> pass-through was not completely successful...
>>>
>>> Should I expect a debug build of the XEN hypervisor to give better
>>> diagnostic messages, without the debug patch that Roger mentioned?
>>
>> Well, "expect" is perhaps too much to say, but with problems like
>> yours (and even more so with multiple ones) using a debug
>> hypervisor (or kernel, if such a build mode existed) is imo
>> always a good idea. As is using as up-to-date a version as
>> possible.
> 
> I built both 4.14.3 debug version and 4.16.1 release version for
> testing purposes.
> Unfortunately they gave me absolutely zero information, since neither
> of them is able to get past issue #1,
> the FLR related DPC / AER issue.
> With 4.16.1 release, it actually can survive the 'xl
> pci-assignable-add' which triggers the first AER failure.

Then that's what needs debugging first. Yet from all I've seen so
far I'm not sure who on the Xen side could be doing that, all the
more so without themselves being able to repro - this seems more like
a Linux side issue (and even outside of the pciback driver).

> But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
>> [  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>> [  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44

That'll need debugging. Cc-ing Anthony for awareness, but I'm sure
he'll need more data to actually stand a chance of doing something
about it.

Is there any chance you could be doing some debugging work yourself,
at the very least to figure out where this (apparent) NULL deref is
happening?
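
As a possible starting point, here is a minimal sketch (Python, not
part of any Xen tooling; the function name decode_segfault is made up
for illustration) that pulls the numbers out of the kernel's segfault
line quoted above. It assumes the usual x86 Linux message layout, with
all values in hex, and the usual x86 page fault error code bits
(bit 0 = page present, bit 1 = write, bit 2 = user mode).

```python
import re

# Layout assumed: "segfault at ADDR ip IP sp SP error ERR in OBJ[BASE+SIZE]"
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the faulting object, offset into its mapping, and fault type."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction from the start of the mapping.
        "offset": ip - base,
        "present": bool(err & 1),  # 0 means the page was not present
        "write": bool(err & 2),    # 0 means the access was a read
        "user": bool(err & 4),     # 1 means the fault came from user mode
    }


line = ("xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 "
        "error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]")
info = decode_segfault(line)
print(info["object"], hex(info["offset"]))
# prints: libxenlight.so.4.16.0 0x1d71f
```

For the line above this gives a user-mode read of address 0 at offset
0x1d71f into the mapping. With a libxenlight build that has debug
info, that offset could then be handed to something like
"addr2line -f -e <path to libxenlight.so.4.16.0> 0x1d71f". One caveat:
the bracketed base is the start of the mapping, which need not
correspond to file offset 0 of the library, so the value may first
need adjusting by the mapping's offset (visible in /proc/<pid>/maps of
a running xl, or via "readelf -lW" on the library).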

Jan



 

