[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Tue, 11 Apr 2023 23:41:04 +0000
  • Accept-language: en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=epam.com; dmarc=pass action=none header.from=epam.com; dkim=pass header.d=epam.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=eJtZlAWsizs0bUKr7ZMUM4O/+fMTn+dea7fKlJpYbe8=; b=e7P8mSSe41Yli0AZg94xUeIA2YW5jn9Pr09MdqiZ9umAjIoR+QjZww5r0trn/0HHlton1OhQn0FDiOaXcvANHQEiTE7SMaqsZEaWjVz8ZYraOVcKbFg9FXlHl+LMB6Bkl6ZqfIIl9SBVdF79pPiXl2WHrt9GUvKVZ5wOuPA9hsTpaiM/Ys3iobgt1SLcGBAJfc1OTJtbF3+3QBN5XNzNurnSMxx1k3XUgtb5TdI1SC9fmB04faDu2qa8N5xYWWpQnjydOmvXiSjDlzyl75vquHX115AJHvgzs5xd57Ius7RgEre2jUUcxhRfsYTHRyZcIoyI3ZoSjupO90aYUttbBw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=VSl54BxMIS/LRC689wNJe0FViLOMWlQrZSgRWINqjENzjdgda+8VVWBXYB1D05VIaHeXPqOiNOv6BVmb+jfSiCKhiPB3JkWzQJwBdg4//lhZwHMuAlGbEaVUW4xjuFSr2crBCYf08Ms9b7lYZNYzbrXZHTAu614QPjV8+mw617DWZ98h5dr8r5HoHR3FawmcRWSjW389Hh9hEakB3kWAG9JCZ8viNzGRySTULPpMtfR2mrL2SgRCh5k1zjI0ZOrMjeROxDbk/hh1jcG0PXpR+QQd//wySgbA1GX3ohtzVgMbb3Bx8YVLJHSASEbdQeepzsF8YFaBjlngxgWXA54t1g==
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>
  • Delivery-date: Tue, 11 Apr 2023 23:41:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHZVrdzsdckomMx4kauxHkZQ597Iq79mAMAgClI2AA=
  • Thread-topic: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev

Hi Roger,

Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:

> On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote:
>> Prior to this change, lifetime of pci_dev objects was protected by global
>> pcidevs_lock(). Long-term plan is to remove this log, so we need some
>                                                    ^ lock
>
> I wouldn't say remove, as one way or another we need a lock to protect
> concurrent accesses.
>

I'll write "replace this global lock with couple of more granular
locking devices"
if this is okay for you.

>> other mechanism to ensure that those objects will not disappear under
>> feet of code that access them. Reference counting is a good choice as
>> it provides easy to comprehend way to control object lifetime.
>> 
>> This patch adds two new helper functions: pcidev_get() and
>> pcidev_put(). pcidev_get() will increase reference counter, while
>> pcidev_put() will decrease it, destroying object when counter reaches
>> zero.
>> 
>> pcidev_get() should be used only when you already have a valid pointer
>> to the object or you are holding lock that protects one of the
>> lists (domain, pseg or ats) that store pci_dev structs.
>> 
>> pcidev_get() is rarely used directly, because there already are
>> functions that will provide valid pointer to pci_dev struct:
>> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list,
>> find needed object and increase its reference counter before returning
>> to the caller.
>> 
>> Naturally, pci_put() should be called after finishing working with a
>> received object. This is the reason why this patch have so many
>> pcidev_put()s and so little pcidev_get()s: existing calls to
>> pci_get_*() functions now will increase reference counter
>> automatically, we just need to decrease it back when we finished.
>
> After looking a bit into this, I would like to ask whether it's been
> considered the need to increase the refcount for each use of a pdev.
>

This is how Linux uses reference locking. It decreases cognitive load
and chance for an error, as there is a simple set of rules, which you
follow.

> For example I would consider the initial alloc_pdev() to take a
> refcount, and then pci_remove_device() _must_ be the function that
> removes the last refcount, so that it can return -EBUSY otherwise (see
> my comment below).

I tend to disagree there, as this ruins the very idea of reference
counting. We can't know who else holds reference right now. Okay, we
might know, but this requires additional lock to serialize
accesses. Which, in turn, makes refcount un-needed.

>
> I would also think that having the device assigned to a guest will take
> another refcount, and then any usage from further callers (ie: like
> vpci) will need some kind of protection from preventing the device
> from being deassigned from a domain while vPCI handlers are running,
> and the current refcount won't help with that.

Yes, idea of this refcounting is to ensure that a pdev object exists as an
valid object in memory if we are holding a long-term pointer to
it. Indeed, vPCI handlers should use some other mechanism to ensure that
pdev is not being re-assigned while handlers are running. I believe,
this is the task of vpci->lock. Should we call
vpci_remove_device/vpci_add_handlers each time we re-assign a PCI device?

>
> That makes me wonder if for example callers of pci_get_pdev(d, sbdf)
> do need to take an extra refcount, because such access is already
> protected from the pdev going away by the fact that the device is
> assigned to a guest.  But maybe it's too much work to separate users
> of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);.
>
> There's also a window when the refcount is dropped to 0, and the
> destruction function is called, but at the same time a concurrent
> thread could attempt to take a reference to the pdev still?

Last pcidev_put() would be called by pci_remove_device(), after removing
it from all lists. This should prevent other threads from obtaining a valid
reference to the pdev.

>
>> 
>> This patch removes "const" qualifier from some pdev pointers because
>> pcidev_put() technically alters the contents of pci_dev structure.
>> 
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@xxxxxxxx>
>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>> 
>> ---
>> 
>> v3:
>>  - Moved in from another patch series
>>  - Fixed code formatting (tabs -> spaces)
>>  - Removed erroneous pcidev_put in vga.c
>>  - Added missing pcidev_put in couple of places
>>  - removed mention of pci_get_pdev_by_domain()
>> ---
>>  xen/arch/x86/hvm/vmsi.c                  |   2 +-
>>  xen/arch/x86/irq.c                       |   4 +
>>  xen/arch/x86/msi.c                       |  44 +++++++-
>>  xen/arch/x86/pci.c                       |   3 +
>>  xen/arch/x86/physdev.c                   |  17 ++-
>>  xen/common/sysctl.c                      |   7 +-
>>  xen/drivers/passthrough/amd/iommu_init.c |  12 +-
>>  xen/drivers/passthrough/amd/iommu_map.c  |   6 +-
>>  xen/drivers/passthrough/pci.c            | 138 +++++++++++++++--------
>>  xen/drivers/passthrough/vtd/quirks.c     |   2 +
>>  xen/drivers/video/vga.c                  |   7 +-
>>  xen/drivers/vpci/vpci.c                  |  16 ++-
>>  xen/include/xen/pci.h                    |  18 +++
>>  13 files changed, 215 insertions(+), 61 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..8c3d673872 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -914,7 +914,7 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>>              process_pending_softirqs();
>> -            /* NB: we assume that pdev cannot go away for an alive domain. 
>> */
>> +
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>>              if ( pdev->vpci->msix != msix )
>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>> index 20150b1c7f..87464d82c8 100644
>> --- a/xen/arch/x86/irq.c
>> +++ b/xen/arch/x86/irq.c
>> @@ -2175,6 +2175,7 @@ int map_domain_pirq(
>>                  msi->entry_nr = ret;
>>                  ret = -ENFILE;
>>              }
>> +            pcidev_put(pdev);
>>              goto done;
>>          }
>>  
>> @@ -2189,6 +2190,7 @@ int map_domain_pirq(
>>              msi_desc->irq = -1;
>>              msi_free_irq(msi_desc);
>>              ret = -EBUSY;
>> +            pcidev_put(pdev);
>>              goto done;
>>          }
>>  
>> @@ -2273,10 +2275,12 @@ int map_domain_pirq(
>>              }
>>              msi_desc->irq = -1;
>>              msi_free_irq(msi_desc);
>> +            pcidev_put(pdev);
>>              goto done;
>>          }
>>  
>>          set_domain_irq_pirq(d, irq, info);
>> +        pcidev_put(pdev);
>>          spin_unlock_irqrestore(&desc->lock, flags);
>>      }
>>      else
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index d0bf63df1d..91926fce50 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -572,6 +572,10 @@ int msi_free_irq(struct msi_desc *entry)
>>                          virt_to_fix((unsigned long)entry->mask_base));
>>  
>>      list_del(&entry->list);
>> +
>> +    /* Corresponds to pcidev_get() in msi[x]_capability_init()  */
>> +    pcidev_put(entry->dev);
>> +
>>      xfree(entry);
>>      return 0;
>>  }
>> @@ -644,6 +648,7 @@ static int msi_capability_init(struct pci_dev *dev,
>>              entry[i].msi.mpos = mpos;
>>          entry[i].msi.nvec = 0;
>>          entry[i].dev = dev;
>> +        pcidev_get(dev);
>>      }
>>      entry->msi.nvec = nvec;
>>      entry->irq = irq;
>> @@ -703,22 +708,36 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, 
>> u8 func, u8 bir, int vf)
>>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>>               bir >= PCI_SRIOV_NUM_BARS ||
>>               !pdev->vf_rlen[bir] )
>> +        {
>> +            if ( pdev )
>> +                pcidev_put(pdev);
>>              return 0;
>> +        }
>>          base = pos + PCI_SRIOV_BAR;
>>          vf -= PCI_BDF(bus, slot, func) + offset;
>>          if ( vf < 0 )
>> +        {
>> +            pcidev_put(pdev);
>>              return 0;
>> +        }
>>          if ( stride )
>>          {
>>              if ( vf % stride )
>> +            {
>> +                pcidev_put(pdev);
>>                  return 0;
>> +            }
>>              vf /= stride;
>>          }
>>          if ( vf >= num_vf )
>> +        {
>> +            pcidev_put(pdev);
>>              return 0;
>> +        }
>>          BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
>>          disp = vf * pdev->vf_rlen[bir];
>>          limit = PCI_SRIOV_NUM_BARS;
>> +        pcidev_put(pdev);
>>      }
>>      else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
>>                                   PCI_HEADER_TYPE) & 0x7f )
>> @@ -925,6 +944,8 @@ static int msix_capability_init(struct pci_dev *dev,
>>          entry->dev = dev;
>>          entry->mask_base = base;
>>  
>> +        pcidev_get(dev);
>> +
>>          list_add_tail(&entry->list, &dev->msi_list);
>>          *desc = entry;
>>      }
>> @@ -999,6 +1020,7 @@ static int __pci_enable_msi(struct msi_info *msi, 
>> struct msi_desc **desc)
>>  {
>>      struct pci_dev *pdev;
>>      struct msi_desc *old_desc;
>> +    int ret;
>>  
>>      ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>> @@ -1010,6 +1032,7 @@ static int __pci_enable_msi(struct msi_info *msi, 
>> struct msi_desc **desc)
>>      {
>>          printk(XENLOG_ERR "irq %d already mapped to MSI on %pp\n",
>>                 msi->irq, &pdev->sbdf);
>> +        pcidev_put(pdev);
>>          return -EEXIST;
>>      }
>>  
>> @@ -1020,7 +1043,10 @@ static int __pci_enable_msi(struct msi_info *msi, 
>> struct msi_desc **desc)
>>          __pci_disable_msix(old_desc);
>>      }
>>  
>> -    return msi_capability_init(pdev, msi->irq, desc, msi->entry_nr);
>> +    ret = msi_capability_init(pdev, msi->irq, desc, msi->entry_nr);
>> +    pcidev_put(pdev);
>> +
>> +    return ret;
>>  }
>>  
>>  static void __pci_disable_msi(struct msi_desc *entry)
>> @@ -1054,20 +1080,29 @@ static int __pci_enable_msix(struct msi_info *msi, 
>> struct msi_desc **desc)
>>  {
>>      struct pci_dev *pdev;
>>      struct msi_desc *old_desc;
>> +    int ret;
>>  
>>      ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>>      if ( !pdev || !pdev->msix )
>> +    {
>> +        if ( pdev )
>> +            pcidev_put(pdev);
>>          return -ENODEV;
>> +    }
>>  
>>      if ( msi->entry_nr >= pdev->msix->nr_entries )
>> +    {
>> +        pcidev_put(pdev);
>>          return -EINVAL;
>> +    }
>>  
>>      old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSIX);
>>      if ( old_desc )
>>      {
>>          printk(XENLOG_ERR "irq %d already mapped to MSI-X on %pp\n",
>>                 msi->irq, &pdev->sbdf);
>> +        pcidev_put(pdev);
>>          return -EEXIST;
>>      }
>>  
>> @@ -1078,7 +1113,11 @@ static int __pci_enable_msix(struct msi_info *msi, 
>> struct msi_desc **desc)
>>          __pci_disable_msi(old_desc);
>>      }
>>  
>> -    return msix_capability_init(pdev, msi, desc);
>> +    ret = msix_capability_init(pdev, msi, desc);
>> +
>> +    pcidev_put(pdev);
>> +
>> +    return ret;
>>  }
>>  
>>  static void _pci_cleanup_msix(struct arch_msix *msix)
>> @@ -1159,6 +1198,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool 
>> off)
>>      }
>>      else
>>          rc = msix_capability_init(pdev, NULL, NULL);
>> +    pcidev_put(pdev);
>>      pcidevs_unlock();
>>  
>>      return rc;
>> diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
>> index 97b792e578..c1fcdf08d6 100644
>> --- a/xen/arch/x86/pci.c
>> +++ b/xen/arch/x86/pci.c
>> @@ -92,7 +92,10 @@ int pci_conf_write_intercept(unsigned int seg, unsigned 
>> int bdf,
>>  
>>      pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
>>      if ( pdev )
>> +    {
>>          rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
>> +        pcidev_put(pdev);
>> +    }
>>  
>>      pcidevs_unlock();
>>  
>> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
>> index 2f1d955a96..96214a3d40 100644
>> --- a/xen/arch/x86/physdev.c
>> +++ b/xen/arch/x86/physdev.c
>> @@ -533,7 +533,14 @@ ret_t do_physdev_op(int cmd, 
>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>          pcidevs_lock();
>>          pdev = pci_get_pdev(NULL,
>>                              PCI_SBDF(0, restore_msi.bus, 
>> restore_msi.devfn));
>> -        ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
>> +        if ( pdev )
>> +        {
>> +            ret = pci_restore_msi_state(pdev);
>> +            pcidev_put(pdev);
>> +        }
>> +        else
>> +            ret = -ENODEV;
>> +
>>          pcidevs_unlock();
>>          break;
>>      }
>> @@ -548,7 +555,13 @@ ret_t do_physdev_op(int cmd, 
>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>  
>>          pcidevs_lock();
>>          pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
>> -        ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
>> +        if ( pdev )
>> +        {
>> +            ret =  pci_restore_msi_state(pdev);
>> +            pcidev_put(pdev);
>> +        }
>> +        else
>> +            ret = -ENODEV;
>>          pcidevs_unlock();
>>          break;
>>      }
>> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
>> index 02505ab044..9af07fa92a 100644
>> --- a/xen/common/sysctl.c
>> +++ b/xen/common/sysctl.c
>> @@ -438,7 +438,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) 
>> u_sysctl)
>>          {
>>              physdev_pci_device_t dev;
>>              uint32_t node;
>> -            const struct pci_dev *pdev;
>> +            struct pci_dev *pdev;
>>  
>>              if ( copy_from_guest_offset(&dev, ti->devs, i, 1) )
>>              {
>> @@ -454,8 +454,11 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) 
>> u_sysctl)
>>                  node = XEN_INVALID_NODE_ID;
>>              else
>>                  node = pdev->node;
>> -            pcidevs_unlock();
>>  
>> +            if ( pdev )
>> +                pcidev_put(pdev);
>> +
>> +            pcidevs_unlock();
>>              if ( copy_to_guest_offset(ti->nodes, i, &node, 1) )
>>              {
>>                  ret = -EFAULT;
>> diff --git a/xen/drivers/passthrough/amd/iommu_init.c 
>> b/xen/drivers/passthrough/amd/iommu_init.c
>> index 9773ccfcb4..f90b1c1e58 100644
>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>> @@ -646,6 +646,7 @@ static void cf_check parse_ppr_log_entry(struct 
>> amd_iommu *iommu, u32 entry[])
>>  
>>      if ( pdev )
>>          guest_iommu_add_ppr_log(pdev->domain, entry);
>> +    pcidev_put(pdev);
>>  }
>>  
>>  static void iommu_check_ppr_log(struct amd_iommu *iommu)
>> @@ -749,6 +750,11 @@ static bool_t __init set_iommu_interrupt_handler(struct 
>> amd_iommu *iommu)
>>      }
>>  
>>      pcidevs_lock();
>> +    /*
>> +     * XXX: it is unclear if this device can be removed. Right now
>> +     * there is no code that clears msi.dev, so no one will decrease
>> +     * refcount on it.
>> +     */
>>      iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf));
>
> I don't think we can remove an IOMMU from the system, so this is
> fine as-is AFAICT.
>

Oh, thank you for the clarification. I'll remove the comment then.

>>      pcidevs_unlock();
>>      if ( !iommu->msi.dev )
>> @@ -1274,7 +1280,7 @@ static int __init cf_check 
>> amd_iommu_setup_device_table(
>>      {
>>          if ( ivrs_mappings[bdf].valid )
>>          {
>> -            const struct pci_dev *pdev = NULL;
>> +            struct pci_dev *pdev = NULL;
>>  
>>              /* add device table entry */
>>              iommu_dte_add_device_entry(&dt[bdf], &ivrs_mappings[bdf]);
>> @@ -1299,7 +1305,10 @@ static int __init cf_check 
>> amd_iommu_setup_device_table(
>>                          pdev->msix ? pdev->msix->nr_entries
>>                                     : pdev->msi_maxvec);
>>                  if ( !ivrs_mappings[bdf].intremap_table )
>> +                {
>> +                    pcidev_put(pdev);
>>                      return -ENOMEM;
>> +                }
>>  
>>                  if ( pdev->phantom_stride )
>>                  {
>> @@ -1317,6 +1326,7 @@ static int __init cf_check 
>> amd_iommu_setup_device_table(
>>                              ivrs_mappings[bdf].intremap_inuse;
>>                      }
>>                  }
>> +                pcidev_put(pdev);
>>              }
>>  
>>              amd_iommu_set_intremap_table(
>> diff --git a/xen/drivers/passthrough/amd/iommu_map.c 
>> b/xen/drivers/passthrough/amd/iommu_map.c
>> index 993bac6f88..9d621e3d36 100644
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -724,14 +724,18 @@ int cf_check amd_iommu_get_reserved_device_memory(
>>          if ( !iommu )
>>          {
>>              /* May need to trigger the workaround in 
>> find_iommu_for_device(). */
>> -            const struct pci_dev *pdev;
>> +            struct pci_dev *pdev;
>>  
>>              pcidevs_lock();
>>              pdev = pci_get_pdev(NULL, sbdf);
>>              pcidevs_unlock();
>>  
>>              if ( pdev )
>> +            {
>>                  iommu = find_iommu_for_device(seg, bdf);
>> +                /* XXX: Should we hold pdev reference till end of the loop? 
>> */
>> +                pcidev_put(pdev);
>
> I don't think you need to hold a reference to the device until the end
> of the loop, the data fetched there is from the ACPI tables.  If the
> func() helper also needs a pdev instance is it's task to get one.
>

Thank you for the clarification. I'll remove the comment.

>> +            }
>>              if ( !iommu )
>>                  continue;
>>          }
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index b42acb8d7c..b32382aca0 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -328,6 +328,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, 
>> u8 bus, u8 devfn)
>>      *((u8*) &pdev->bus) = bus;
>>      *((u8*) &pdev->devfn) = devfn;
>>      pdev->domain = NULL;
>> +    refcnt_init(&pdev->refcnt);
>>  
>>      arch_pci_init_pdev(pdev);
>>  
>> @@ -422,33 +423,6 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, 
>> u8 bus, u8 devfn)
>>      return pdev;
>>  }
>>  
>> -static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>> -{
>> -    /* update bus2bridge */
>> -    switch ( pdev->type )
>> -    {
>> -        unsigned int sec_bus, sub_bus;
>> -
>> -        case DEV_TYPE_PCIe2PCI_BRIDGE:
>> -        case DEV_TYPE_LEGACY_PCI_BRIDGE:
>> -            sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
>> -            sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
>> -
>> -            spin_lock(&pseg->bus2bridge_lock);
>> -            for ( ; sec_bus <= sub_bus; sec_bus++ )
>> -                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
>> -            spin_unlock(&pseg->bus2bridge_lock);
>> -            break;
>> -
>> -        default:
>> -            break;
>> -    }
>> -
>> -    list_del(&pdev->alldevs_list);
>> -    pdev_msi_deinit(pdev);
>> -    xfree(pdev);
>> -}
>> -
>>  static void __init _pci_hide_device(struct pci_dev *pdev)
>>  {
>>      if ( pdev->domain )
>> @@ -517,10 +491,14 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf)
>>      {
>>          if ( !(sbdf.devfn & stride) )
>>              continue;
>> +
>
> Unrelated change?  There are some of those in the patch, should be
> removed.

Yes, sorry for this.

>
>>          sbdf.devfn &= ~stride;
>>          pdev = pci_get_pdev(NULL, sbdf);
>>          if ( pdev && stride != pdev->phantom_stride )
>> +        {
>> +            pcidev_put(pdev);
>>              pdev = NULL;
>> +        }
>>      }
>>  
>>      return pdev;
>> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, 
>> pci_sbdf_t sbdf)
>>          list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>>              if ( pdev->sbdf.bdf == sbdf.bdf &&
>>                   (!d || pdev->domain == d) )
>> +            {
>> +                pcidev_get(pdev);
>>                  return pdev;
>> +            }
>>      }
>>      else
>>          list_for_each_entry ( pdev, &d->pdev_list, domain_list )
>>              if ( pdev->sbdf.bdf == sbdf.bdf )
>> +            {
>> +                pcidev_get(pdev);
>>                  return pdev;
>> -
>> +            }
>>      return NULL;
>>  }
>>  
>> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>                              PCI_SBDF(seg, info->physfn.bus,
>>                                       info->physfn.devfn));
>>          if ( pdev )
>> +        {
>>              pf_is_extfn = pdev->info.is_extfn;
>> +            pcidev_put(pdev);
>> +        }
>>          pcidevs_unlock();
>>          if ( !pdev )
>>              pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
>> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>              if ( pdev->domain )
>>                  list_del(&pdev->domain_list);
>>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>> -            free_pdev(pseg, pdev);
>> +            list_del(&pdev->alldevs_list);
>> +            pdev_msi_deinit(pdev);
>> +            pcidev_put(pdev);
>
> Hm, I think here we want to make sure that the device has been freed,
> or else you would have to return -EBUSY to the calls to notify that
> the device is still in use.

Why? As I can see, pdev object is still may potentially be accessed by
some other CPU right now. So pdev object will be freed after last
reference is dropped. As it is already removed from all the lists,
pci_dev_get() will not find it anymore.

Actually, I can't see how this can happen in reality, as VPCI, MSI and
IOMMU are already deactivated for this device. So, no one would touch it.

>
> I think we need an extra pcidev_put_final() or similar that can be
> used in pci_remove_device() to assert that the device has been
> actually removed.

Will something break if we don't do this? I can't see how this can
happen.

>
>>              break;
>>          }
>>  
>> @@ -848,7 +836,7 @@ static int deassign_device(struct domain *d, uint16_t 
>> seg, uint8_t bus,
>>      {
>>          ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
>>          if ( ret )
>> -           return ret;
>> +            goto out;
>>  
>>          target = dom_io;
>>      }
>> @@ -878,6 +866,7 @@ static int deassign_device(struct domain *d, uint16_t 
>> seg, uint8_t bus,
>>      pdev->fault.count = 0;
>>  
>>   out:
>> +    pcidev_put(pdev);
>>      if ( ret )
>>          printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n",
>>                 d, &PCI_SBDF(seg, bus, devfn), ret);
>> @@ -1011,7 +1000,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 
>> devfn)
>>              pdev->fault.count >>= 1;
>>          pdev->fault.time = now;
>>          if ( ++pdev->fault.count < PT_FAULT_THRESHOLD )
>> +        {
>> +            pcidev_put(pdev);
>>              pdev = NULL;
>> +        }
>>      }
>>      pcidevs_unlock();
>>  
>> @@ -1022,6 +1014,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 
>> devfn)
>>       * control it for us. */
>>      cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
>>      pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER);
>> +
>> +    pcidev_put(pdev);
>>  }
>>  
>>  /*
>> @@ -1138,6 +1132,7 @@ static int __hwdom_init cf_check 
>> _setup_hwdom_pci_devices(
>>                  printk(XENLOG_WARNING "Dom%d owning %pp?\n",
>>                         pdev->domain->domain_id, &pdev->sbdf);
>>  
>> +            pcidev_put(pdev);
>>              if ( iommu_verbose )
>>              {
>>                  pcidevs_unlock();
>> @@ -1385,33 +1380,28 @@ static int iommu_remove_device(struct pci_dev *pdev)
>>      return iommu_call(hd->platform_ops, remove_device, devfn, 
>> pci_to_dev(pdev));
>>  }
>>  
>> -static int device_assigned(u16 seg, u8 bus, u8 devfn)
>> +static int device_assigned(struct pci_dev *pdev)
>>  {
>> -    struct pci_dev *pdev;
>>      int rc = 0;
>>  
>>      ASSERT(pcidevs_locked());
>> -    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
>> -
>> -    if ( !pdev )
>> -        rc = -ENODEV;
>>      /*
>>       * If the device exists and it is not owned by either the hardware
>>       * domain or dom_io then it must be assigned to a guest, or be
>>       * hidden (owned by dom_xen).
>>       */
>> -    else if ( pdev->domain != hardware_domain &&
>> -              pdev->domain != dom_io )
>> +    if ( pdev->domain != hardware_domain &&
>> +         pdev->domain != dom_io )
>>          rc = -EBUSY;
>>  
>>      return rc;
>>  }
>>  
>>  /* Caller should hold the pcidevs_lock */
>
> I would assume the caller has taken an extra reference to the pdev, so
> holding the pcidevs_lock is no longer needed?

I am assumed that lock may be required by MSIX or IOMMU functions, that
are being called here. For example, I can see that reassign_device() in
pci_amd_iommu.c manipulates with some lists. I believe, it should be
protected with the lock.

>
>> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 
>> flag)
>> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
>>  {
>>      const struct domain_iommu *hd = dom_iommu(d);
>> -    struct pci_dev *pdev;
>> +    uint8_t devfn;
>>      int rc = 0;
>>  
>>      if ( !is_iommu_enabled(d) )
>> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, 
>> u8 bus, u8 devfn, u32 flag)
>>  
>>      /* device_assigned() should already have cleared the device for 
>> assignment */
>>      ASSERT(pcidevs_locked());
>> -    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
>>      ASSERT(pdev && (pdev->domain == hardware_domain ||
>>                      pdev->domain == dom_io));
>>  
>> +    devfn = pdev->devfn;
>> +
>>      /* Do not allow broken devices to be assigned to guests. */
>>      rc = -EBADF;
>>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 
>> bus, u8 devfn, u32 flag)
>>   done:
>>      if ( rc )
>>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>> -               d, &PCI_SBDF(seg, bus, devfn), rc);
>> +               d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc);
>>      /* The device is assigned to dom_io so mark it as quarantined */
>>      else if ( d == dom_io )
>>          pdev->quarantine = true;
>> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl(
>>          ASSERT(d);
>>          /* fall through */
>>      case XEN_DOMCTL_test_assign_device:
>> +    {
>> +        struct pci_dev *pdev;
>> +
>>          /* Don't support self-assignment of devices. */
>>          if ( d == current->domain )
>>          {
>> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl(
>>          seg = machine_sbdf >> 16;
>>          bus = PCI_BUS(machine_sbdf);
>>          devfn = PCI_DEVFN(machine_sbdf);
>> -
>>          pcidevs_lock();
>> -        ret = device_assigned(seg, bus, devfn);
>> +        pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
>> +        if ( !pdev )
>> +        {
>> +            printk(XENLOG_G_INFO "%pp non-existent\n",
>> +                   &PCI_SBDF(seg, bus, devfn));
>> +            ret = -EINVAL;
>> +            break;
>> +        }
>> +
>> +        ret = device_assigned(pdev);
>>          if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
>>          {
>>              if ( ret )
>>              {
>> -                printk(XENLOG_G_INFO "%pp already assigned, or 
>> non-existent\n",
>> +                printk(XENLOG_G_INFO "%pp already assigned\n",
>>                         &PCI_SBDF(seg, bus, devfn));
>>                  ret = -EINVAL;
>>              }
>>          }
>>          else if ( !ret )
>> -            ret = assign_device(d, seg, bus, devfn, flags);
>> +            ret = assign_device(d, pdev, flags);
>> +
>> +        pcidev_put(pdev);
>
> I would think you need to keep the refcount here if ret == 0, so that
> the device cannot be removed while assigned to a domain?

Looks like we are perceiving function of refcnt in a different
ways. For me, this is the mechanism to guarantee that if we have a valid
pointer to an object, this object will not disappear under our
feet. This is the main function of krefs in the linux kernel: if your
code holds a reference to an object, you can be sure that this object is
exists in memory.

On other hand, it seems that you are considering this refcnt as an usage
counter for an actual PCI device, not "struct pdev" that represent
it. Those are two related things, but not the same. So, I can see why
you are suggesting to get additional reference there. But for me, this
looks unnecessary: the very first refcount is obtained in
pci_add_device() and there is the corresponding function
pci_remove_device() that will drop this refcount. So, for me, if admin
wants to remove a PCI device which is assigned to a domain, they can do
this as they were able to do this prior this patches.

The main value of introducing refcnt is to be able to access pdev objects
without holding the global pcidevs_lock(). This does not mean that you
don't need locking at all. But this allows you to use pdev->lock (which
does not exists in this series, but was introduced in a RFC earlier), or
vpci->lock, or any other subsystem->lock.

>
>>          pcidevs_unlock();
>>          if ( ret == -ERESTART )
>>              ret = hypercall_create_continuation(__HYPERVISOR_domctl,
>>                                                  "h", u_domctl);
>>          break;
>> -
>> +    }
>>      case XEN_DOMCTL_deassign_device:
>>          /* Don't support self-deassignment of devices. */
>>          if ( d == current->domain )
>> @@ -1681,6 +1685,46 @@ int iommu_do_pci_domctl(
>>      return ret;
>>  }
>>  
>> +static void release_pdev(refcnt_t *refcnt)
>> +{
>> +    struct pci_dev *pdev = container_of(refcnt, struct pci_dev, refcnt);
>> +    struct pci_seg *pseg = get_pseg(pdev->seg);
>> +
>> +    printk(XENLOG_DEBUG "PCI release device %pp\n", &pdev->sbdf);
>> +
>> +    /* update bus2bridge */
>> +    switch ( pdev->type )
>> +    {
>> +        unsigned int sec_bus, sub_bus;
>> +
>> +        case DEV_TYPE_PCIe2PCI_BRIDGE:
>> +        case DEV_TYPE_LEGACY_PCI_BRIDGE:
>> +            sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
>> +            sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
>> +
>> +            spin_lock(&pseg->bus2bridge_lock);
>> +            for ( ; sec_bus <= sub_bus; sec_bus++ )
>> +                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
>> +            spin_unlock(&pseg->bus2bridge_lock);
>> +            break;
>> +
>> +        default:
>> +            break;
>> +    }
>> +
>> +    xfree(pdev);
>> +}
>> +
>> +void pcidev_get(struct pci_dev *pdev)
>> +{
>> +    refcnt_get(&pdev->refcnt);
>> +}
>> +
>> +void pcidev_put(struct pci_dev *pdev)
>> +{
>> +    refcnt_put(&pdev->refcnt, release_pdev);
>> +}
>> +
>>  /*
>>   * Local variables:
>>   * mode: C
>> diff --git a/xen/drivers/passthrough/vtd/quirks.c 
>> b/xen/drivers/passthrough/vtd/quirks.c
>> index fcc8f73e8b..d240da0416 100644
>> --- a/xen/drivers/passthrough/vtd/quirks.c
>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>> @@ -429,6 +429,8 @@ static int __must_check map_me_phantom_function(struct 
>> domain *domain,
>>          rc = domain_context_unmap_one(domain, drhd->iommu, 0,
>>                                        PCI_DEVFN(dev, 7));
>>  
>> +    pcidev_put(pdev);
>> +
>>      return rc;
>>  }
>>  
>> diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c
>> index 0a03508bee..1049d4da6d 100644
>> --- a/xen/drivers/video/vga.c
>> +++ b/xen/drivers/video/vga.c
>> @@ -114,7 +114,7 @@ void __init video_endboot(void)
>>          for ( bus = 0; bus < 256; ++bus )
>>              for ( devfn = 0; devfn < 256; ++devfn )
>>              {
>> -                const struct pci_dev *pdev;
>> +                struct pci_dev *pdev;
>>                  u8 b = bus, df = devfn, sb;
>>  
>>                  pcidevs_lock();
>> @@ -126,7 +126,11 @@ void __init video_endboot(void)
>>                                       PCI_CLASS_DEVICE) != 0x0300 ||
>>                       !(pci_conf_read16(PCI_SBDF(0, bus, devfn), 
>> PCI_COMMAND) &
>>                         (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) )
>> +                {
>> +                    if ( pdev )
>> +                        pcidev_put(pdev);
>>                      continue;
>> +                }
>>  
>>                  while ( b )
>>                  {
>> @@ -157,6 +161,7 @@ void __init video_endboot(void)
>>                             bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>                      pci_hide_device(0, bus, devfn);
>>                  }
>> +                pcidev_put(pdev);
>>              }
>>      }
>>  
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 6d48d496bb..5232f9605b 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -317,8 +317,8 @@ static uint32_t merge_result(uint32_t data, uint32_t 
>> new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> -    const struct pci_dev *pdev;
>> +    struct domain *d = current->domain;
>
> Why do you need to drop the const on domain here?
>

Looks like leftover from a previous version. Will remove.

>> +    struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>>      uint32_t data = ~(uint32_t)0;
>> @@ -332,7 +332,11 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, 
>> unsigned int size)
>>      /* Find the PCI dev matching the address. */
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev || !pdev->vpci )
>> +    {
>> +        if ( pdev )
>> +            pcidev_put(pdev);
>>          return vpci_read_hw(sbdf, reg, size);
>> +    }
>>  
>>      spin_lock(&pdev->vpci->lock);
>>  
>> @@ -378,6 +382,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, 
>> unsigned int size)
>>          ASSERT(data_offset < size);
>>      }
>>      spin_unlock(&pdev->vpci->lock);
>> +    pcidev_put(pdev);
>>  
>>      if ( data_offset < size )
>>      {
>> @@ -420,8 +425,8 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>                  uint32_t data)
>>  {
>> -    const struct domain *d = current->domain;
>> -    const struct pci_dev *pdev;
>> +    struct domain *d = current->domain;
>> +    struct pci_dev *pdev;

-- 
WBR, Volodymyr

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.