[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] x86/msi: fix locking for SR-IOV devices


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • Date: Sun, 25 Aug 2024 14:03:26 -0400
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=suse.com smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0)
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=EbzNDj1Z9NQ0MfA/byv5edJIS37uK2LeuyIBCwbsJPc=; b=rPp3kdh3yTERgSKwZ6AIioW40jdyxBqOftWK04HhQxmctkvuoS5RGU4A8rrhuHjK54IR56R1d4Dcyr2tc826D2YHHxlTo/R6r+FcH1CG5Os8SJn8Fx7+Ux5gij9q3vtbuXW4kzzvUWKpX58GX2yBhE7/IzpOl7Cyl7Fw6g0yvrArJ3OCTC70gJCKQaTu1AsZicDxMKjqSSOk4bU7dEreJ8w8aBS/1c8xLD76cDUcRr5jwMWIjh0gPSNbL+nTm2Jw39JjgX3da51Hw0PkKOgxi4mGuSNQoSkGJKEKSypAygvrSYjYsDWONEneUhYAvf1oeroGUkiiFOtoZ+t9GsVYSw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=X3WGYEiRDFbOtAwiyqrBBToL/QqAUHLfeutGTmq5GmrKR66Svnj/QIy2Ezm0DpLhroMYfxvDiQZYcCOIx0Pg4uLPGgLOidF/kXD9/hgNN2jGaKh20MBrygUWpW1ZM+HSlM43q3ehzpc38IeGtEjom4AYT08lUHitPxS9N3Vyc3svqOGXgjZxw/9rFrRXikYVWanXUzSEA0Sqg4l89mDQGEwTI5JP/t3wN2j0Bp5pETyANpl5gb6O32TZ13916qoMr4iHNHZPftu7EHmEMNVzeX+oyfWUM7CZhYcw1spOpyueks4Ex5rO2kYfAIuNeOpVHl2iSKWucfLjrL3L+zT1/Q==
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Teddy Astie <teddy.astie@xxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Sun, 25 Aug 2024 18:03:59 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 8/13/24 10:01, Jan Beulich wrote:
> On 12.08.2024 22:39, Stewart Hildebrand wrote:
>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>> call path wasn't updated to reflect the change, leading to a failed
>> assertion observed under the following conditions:
>>
>> * PV dom0
>> * Debug build (debug=y) of Xen
>> * There is an SR-IOV device in the system with one or more VFs enabled
>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>
>> (XEN) Assertion 'd || pcidevs_locked()' failed at 
>> drivers/passthrough/pci.c:535
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>> (XEN)    [<ffff82d04034530e>] F 
>> arch/x86/msi.c#msix_capability_init+0x198/0x755
>> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>
>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>> associated PF to access the vf_rlen array. This array is initialized in
>> pci_add_device() and is only populated in the associated PF's struct
>> pci_dev.
>>
>> Add a link from the VF's struct pci_dev to the associated PF struct
>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>> VFs have gone away. Access the vf_rlen array via the new link to the PF,
>> and remove the troublesome call to pci_get_pdev().
>>
>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>> pci_add_device() to set up the link from VF to PF. In case the new
>> pci_add_device() invocation fails to find the associated PF (returning
>> NULL), we are no worse off than before: read_pci_mem_bar() will still
>> return 0 in that case.
>>
>> Note that currently the only way for Xen to know if a device is a VF is
>> if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
>> a VF is not a case that Xen handles.
> 
> How does the toolstack come into play here? It's still the Dom0 kernel to
> tell Xen, via PHYSDEVOP_pci_device_add (preferred) or
> PHYSDEVOP_manage_pci_add_ext (kind of deprecated; PHYSDEVOP_manage_pci_add
> is even more kind of deprecated).

This last bit of the commit description isn't adding much, I'll remove
it.

>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -662,34 +662,32 @@ static int msi_capability_init(struct pci_dev *dev,
>>      return 0;
>>  }
>>  
>> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int 
>> vf)
>> +static u64 read_pci_mem_bar(struct pf_info *pf_info, u16 seg, u8 bus, u8 
>> slot,
>> +                            u8 func, u8 bir, int vf)
>>  {
>> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>>      u8 limit;
>>      u32 addr, base = PCI_BASE_ADDRESS_0;
>>      u64 disp = 0;
>>  
>>      if ( vf >= 0 )
>>      {
>> -        struct pci_dev *pdev = pci_get_pdev(NULL,
>> -                                            PCI_SBDF(seg, bus, slot, func));
>>          unsigned int pos;
>>          uint16_t ctrl, num_vf, offset, stride;
>>  
>> -        if ( !pdev )
>> -            return 0;
>> -
>> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
>> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
>> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
>> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
>> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
>> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
>> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
>> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>>  
>>          if ( !pos ||
>>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>>               bir >= PCI_SRIOV_NUM_BARS ||
>> -             !pdev->vf_rlen[bir] )
>> +             !pf_info ||
> 
> See further down - I think it would be nice if we didn't need this new
> check.

OK

>> @@ -813,6 +811,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>          int vf;
>>          paddr_t pba_paddr;
>>          unsigned int pba_offset;
>> +        struct pf_info *pf_info = NULL;
> 
> Pointer-to-const?

OK

>> @@ -827,9 +826,12 @@ static int msix_capability_init(struct pci_dev *dev,
>>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>>              vf = dev->sbdf.bdf;
>> +            if ( dev->virtfn.pf_pdev )
>> +                pf_info = &dev->virtfn.pf_pdev->physfn;
>>          }
>>  
>> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
>> +        table_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, 
>> bir,
>> +                                       vf);
> 
> I don't think putting the new arg first is very nice. SBDF should remain
> first imo. Between the latter arguments I'm not as fussed.

OK

>> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct 
>> pci_dev *pdev)
>>  
>>      list_del(&pdev->alldevs_list);
>>      pdev_msi_deinit(pdev);
>> -    xfree(pdev);
>> +
>> +    if ( pdev->info.is_virtfn )
>> +    {
>> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
>> +
>> +        if ( pf_pdev )
>> +        {
>> +            list_del(&pdev->virtfn.next);
>> +            if ( pf_pdev->physfn.dealloc_pending &&
>> +                 list_empty(&pf_pdev->physfn.vf_list) )
>> +                xfree(pf_pdev);
>> +        }
>> +        xfree(pdev);
>> +    }
>> +    else
>> +    {
>> +        if ( list_empty(&pdev->physfn.vf_list) )
>> +            xfree(pdev);
>> +        else
>> +            pdev->physfn.dealloc_pending = true;
>> +    }
> 
> Could I talk you into un-indenting the last if/else by a level, by making
> the earlier else and "else if()"?

Yep. Actually, there will be no need for dealloc_pending after reworking
pci_remove_device() (see below).

> One aspect I didn't properly consider when making the suggestion: What if,
> without all VFs having gone away, the PF is re-added? In that case we
> would better recycle the existing structure. That's getting a little
> complicated, so maybe a better approach is to refuse the request (in
> pci_remove_device()) when the list isn't empty?

I set up a test case locally to remove a PF without removing the
associated VFs by hacking an SR-IOV PF driver. Although the PF driver
*should* remove the VFs first, it's completely up to the particular PF
driver how VFs/PFs are removed during hot-un-plug, in what order, or
whether at all to remove the VFs before removing the PF. Anyway, during
PF-only removal, at least the Linux PCI subsystem warns about it:

[  106.500334] igb 0000:01:00.0: driver left SR-IOV enabled after remove

Returning an error code in pci_remove_device() results in only a warning
from Linux:

[  106.507011] pci 0000:01:00.0: Failed to delete - passthrough or MSI/MSI-X 
might fail!

Despite the warning, Linux still proceeds to remove the PF, and we would
retain a stale PF in Xen. Re-adding (hotplugging) the just-removed PF
led to Xen crashing in another weird way.

To handle this more gracefully, I suggest removing the VFs right away
along with the PF in pci_remove_device() when a PF removal request comes
along. This would satisfy the test case described here without Xen
crashing.

>> @@ -700,10 +722,22 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> +        {
>> +            struct pci_dev *pf_pdev;
>> +
>>              pdev->info.is_extfn = pf_is_extfn;
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
>> +            if ( pf_pdev )
>> +            {
>> +                pdev->virtfn.pf_pdev = pf_pdev;
>> +                list_add(&pdev->virtfn.next, &pf_pdev->physfn.vf_list);
>> +            }
>> +        }
> 
> Hmm, yes, we can't really use the result of the earlier pci_get_pdev(), as
> we drop the lock intermediately. I wonder though if we wouldn't better fail
> the function if we can't find the PF instance.

Yes

>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -150,7 +150,17 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> -    u64 vf_rlen[6];
>> +    struct pf_info {
>> +        /* Only populated for PFs (info.is_virtfn == false) */
>> +        uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
>> +        struct list_head vf_list;             /* List head */
>> +        bool dealloc_pending;
>> +    } physfn;
>> +    struct {
>> +        /* Only populated for VFs (info.is_virtfn == true) */
>> +        struct pci_dev *pf_pdev;              /* Link from VF to PF */
>> +        struct list_head next;                /* Entry in vf_list */
> 
> For doubly-linked lists "next" isn't really a good name. Since both struct
> variants need such a member, why not use vf_list uniformly?

I'd like to retain some distinction between what's a list head and
what's an entry in the list. I suggest the name "entry".

> That'll then
> also lower the significance of my other question:
> 
>> +    } virtfn;
> 
> Should the two new struct-s perhaps be a union?

Yes



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.