[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Wed, 12 Apr 2023 11:13:47 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=HK/G2jZPuDIzdh5segEY+bnvxm5SocI73p3sLSt9VtI=; b=cy+2WOhKdXKipqA+evdEDMNeh9KwuUqDgaX/I+DTvrcgqG7U7sk3/CasnZvA/qqCnh59Vy/os7ffDgBGK8n3ZBoetVXA1TSGfl9li0gc8k1SAWmOLvSQTQQpoEJevj9XrVVFkkOMgKUIp4K3us+QPadzPqmsmmUzHEWFab6zVWgQrOvnTnX9J6CFEl7iNxTIAOO4mftan3a1srmdaX8c7LLLIjS1KZaWulGEokpYOzZB59i6WMijhMy7CWzhE+eGq/L+HoCDSxaArP5PraagH6tAln1AtBNBDUDp4YmGGYor96t1t8dz9xJatsibn2iw4VRGMQ72XOcIJrffoQP1nw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=YRI/DNIdat9Yxq9em9NW7+zqMf/zi9mgU/w7GjDei02GUaDGCgVLceNJZRbX/4pwaw3CNsnkh4kogxvnCBjyUYZMd3tTOGX0QLrCFGqd92fB154AV9o8lxkQjrwwE/jlR5CujZcXsiWcXl/Ti79BxFXYsJPmAcaNqYXnahx2xc1KylYamfO61O+FeO2UfdE0eYZWsFcMJ8w7iwiHT47ywZsGWCLzcyKelEYu4AhYvrwHEeHSyC7moV20Za+Y469VgwhW7qV65I4DATCb9nRje9wxsg1AUOQXpBgmEru8pIrmjUnY3RxxXpUM3nodgCB1Mk+/njNSiiDL1SugOAZK3g==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>
  • Delivery-date: Wed, 12 Apr 2023 09:14:24 +0000
  • Ironport-data: A9a23:FLSQ7a2UyrwK8F7Z2/bD5V1wkn2cJEfYwER7XKvMYLTBsI5bp2MHn GQeWG2PMveKamSnfd4kYN+18B8Fu5CGzIcwTAdtpC1hF35El5HIVI+TRqvS04F+DeWYFR46s J9OAjXkBJppJpMJjk71atANlVEliefTAOK6ULWeUsxIbVcMYD87jh5+kPIOjIdtgNyoayuAo tq3qMDEULOf82cc3lk8tuTS+HuDgNyo4GlD5gBnP6gS1LPjvyJ94Kw3dPnZw0TQGuG4LsbiL 87fwbew+H/u/htFIrtJRZ6iLyXm6paLVeS/oiI+t5qK23CulQRrukoPD9IOaF8/ttm8t4sZJ OOhF3CHYVxB0qXkwIzxWvTDes10FfUuFLTveRBTvSEPpqFvnrSFL/hGVSkL0YMkFulfLHleq ewjAigxMEqK2M7p3bjjRsNwmZF2RCXrFNt3VnBI6xj8VapjZK+ZBqLA6JlfwSs6gd1IEbDGf c0FZDFzbRPGJRpSJlMQD5F4l+Ct7pX9W2QA9BTJ+uxouC6Kk1cZPLvFabI5fvSQQspYhACAr 3/u9GXlGBAKcteYzFJp91r13rCRxnqjBdN6+LuQx+VJmQSh7D0oEAA0WUe+pqOrkG2BcocKQ 6AT0m90xUQoz2SVSd36Uwy9sWSzlBcWUNpNEMU38AiIjKHT5m6xFmUCCzJMdtEinMs3XiAxk E+EmcvzAj5iu6HTTmiSnp+Wpz6vPSkeLUcZeDQJCwAC5rHLv4Ubnh/JCNF5H8adjMDxGDz26 yCHqm45nbp7pdUQy6yx8FTDgjStjpvEVAg44kPQRG3NxhtweYqNd4Gur1/B4p5oL4uHT1/Ho HkNneCf6vwDCdeGkynlfQkWNLSg5vLANSKGh1dqR8Ul7270pCXlep1M6jZjIksvKtwDZTLif E7Uv0VW+YNXO3ypK6RwZupdFvgX8EQpLvy9Pti8UzaESsIZmNOvlM22WXOt4g==
  • Ironport-hdrordr: A9a23:eWUF+6Np4prdvsBcTvmjsMiBIKoaSvp037Eqv3oBKiC9E/b1qy nKpp9w6faaslsssQ4b6LS90cW7L080l6QFhLX5TI3SPjUO0VHARL2KhrGC/9SPIU3DH+dmpM BdWqJjEsD3CVRgreuS2maFL+o=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Apr 11, 2023 at 11:41:04PM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:
> 
> > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote:
> >> Prior to this change, lifetime of pci_dev objects was protected by global
> >> pcidevs_lock(). Long-term plan is to remove this log, so we need some
> >                                                    ^ lock
> >
> > I wouldn't say remove, as one way or another we need a lock to protect
> > concurrent accesses.
> >
> 
> I'll write "replace this global lock with couple of more granular
> locking devices"
> if this is okay for you.
> 
> >> other mechanism to ensure that those objects will not disappear under
> >> feet of code that access them. Reference counting is a good choice as
> >> it provides easy to comprehend way to control object lifetime.
> >> 
> >> This patch adds two new helper functions: pcidev_get() and
> >> pcidev_put(). pcidev_get() will increase reference counter, while
> >> pcidev_put() will decrease it, destroying object when counter reaches
> >> zero.
> >> 
> >> pcidev_get() should be used only when you already have a valid pointer
> >> to the object or you are holding lock that protects one of the
> >> lists (domain, pseg or ats) that store pci_dev structs.
> >> 
> >> pcidev_get() is rarely used directly, because there already are
> >> functions that will provide valid pointer to pci_dev struct:
> >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list,
> >> find needed object and increase its reference counter before returning
> >> to the caller.
> >> 
> >> Naturally, pci_put() should be called after finishing working with a
> >> received object. This is the reason why this patch have so many
> >> pcidev_put()s and so little pcidev_get()s: existing calls to
> >> pci_get_*() functions now will increase reference counter
> >> automatically, we just need to decrease it back when we finished.
> >
> > After looking a bit into this, I would like to ask whether it's been
> > considered the need to increase the refcount for each use of a pdev.
> >
> 
> This is how Linux uses reference locking. It decreases cognitive load
> and chance for an error, as there is a simple set of rules, which you
> follow.
> 
> > For example I would consider the initial alloc_pdev() to take a
> > refcount, and then pci_remove_device() _must_ be the function that
> > removes the last refcount, so that it can return -EBUSY otherwise (see
> > my comment below).
> 
> I tend to disagree there, as this ruins the very idea of reference
> counting. We can't know who else holds reference right now. Okay, we
> might know, but this requires additional lock to serialize
> accesses. Which, in turn, makes refcount un-needed.

In principle pci_remove_device() must report whether the device is
ready to be physically removed from the system, so it must return
-EBUSY if there are still users accessing the device.

A user would use PHYSDEVOP_manage_pci_remove to signal Xen it's trying
to physically remove a PCI device from a system, so we must ensure
that when the hypervisor returns success the device is ready to be
physically removed.

Or at least that's my understanding of how this should work.

> >
> > I would also think that having the device assigned to a guest will take
> > another refcount, and then any usage from further callers (ie: like
> > vpci) will need some kind of protection from preventing the device
> > from being deassigned from a domain while vPCI handlers are running,
> > and the current refcount won't help with that.
> 
> Yes, idea of this refcounting is to ensure that a pdev object exists as an
> valid object in memory if we are holding a long-term pointer to
> it. Indeed, vPCI handlers should use some other mechanism to ensure that
> pdev is not being re-assigned while handlers are running. I believe,
> this is the task of vpci->lock. Should we call
> vpci_remove_device/vpci_add_handlers each time we re-assign a PCI device?

Yes, I think this was also part of a comment I've made on a different
patch.  The device state needs to be cleared when assigned to a
different guest (as the hardware domain will also perform a device
reset).

I think there are some points that needs to be part of the commit
message so the code can be properly evaluated:

 - The reference counting is only used to ensure the object cannot be
   removed while in use.  Users of the pci device object should
   implement whatever protections required in order to get mutual
   exclusion between them and device state changes.

> >
> > That makes me wonder if for example callers of pci_get_pdev(d, sbdf)
> > do need to take an extra refcount, because such access is already
> > protected from the pdev going away by the fact that the device is
> > assigned to a guest.  But maybe it's too much work to separate users
> > of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);.
> >
> > There's also a window when the refcount is dropped to 0, and the
> > destruction function is called, but at the same time a concurrent
> > thread could attempt to take a reference to the pdev still?
> 
> Last pcidev_put() would be called by pci_remove_device(), after removing
> it from all lists. This should prevent other threads from obtaining a valid
> reference to the pdev.

What if a concurrent user has taken a reference to the object before
pci_remove_device() has removed the device from the lists, and still
holds it when pci_remove_device() performs the supposedly last
pcidev_put() call?

> >
> >>          sbdf.devfn &= ~stride;
> >>          pdev = pci_get_pdev(NULL, sbdf);
> >>          if ( pdev && stride != pdev->phantom_stride )
> >> +        {
> >> +            pcidev_put(pdev);
> >>              pdev = NULL;
> >> +        }
> >>      }
> >>  
> >>      return pdev;
> >> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, 
> >> pci_sbdf_t sbdf)
> >>          list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
> >>              if ( pdev->sbdf.bdf == sbdf.bdf &&
> >>                   (!d || pdev->domain == d) )
> >> +            {
> >> +                pcidev_get(pdev);
> >>                  return pdev;
> >> +            }
> >>      }
> >>      else
> >>          list_for_each_entry ( pdev, &d->pdev_list, domain_list )
> >>              if ( pdev->sbdf.bdf == sbdf.bdf )
> >> +            {
> >> +                pcidev_get(pdev);
> >>                  return pdev;
> >> -
> >> +            }
> >>      return NULL;
> >>  }
> >>  
> >> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
> >>                              PCI_SBDF(seg, info->physfn.bus,
> >>                                       info->physfn.devfn));
> >>          if ( pdev )
> >> +        {
> >>              pf_is_extfn = pdev->info.is_extfn;
> >> +            pcidev_put(pdev);
> >> +        }
> >>          pcidevs_unlock();
> >>          if ( !pdev )
> >>              pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
> >> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
> >>              if ( pdev->domain )
> >>                  list_del(&pdev->domain_list);
> >>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
> >> -            free_pdev(pseg, pdev);
> >> +            list_del(&pdev->alldevs_list);
> >> +            pdev_msi_deinit(pdev);
> >> +            pcidev_put(pdev);
> >
> > Hm, I think here we want to make sure that the device has been freed,
> > or else you would have to return -EBUSY to the calls to notify that
> > the device is still in use.
> 
> Why? As I can see, pdev object is still may potentially be accessed by
> some other CPU right now. So pdev object will be freed after last
> reference is dropped. As it is already removed from all the lists,
> pci_dev_get() will not find it anymore.
> 
> Actually, I can't see how this can happen in reality, as VPCI, MSI and
> IOMMU are already deactivated for this device. So, no one would touch it.

Wouldn't it be possible for a concurrent user to hold a reference from
befoe the device has been 'deactivated'?

> >
> > I think we need an extra pcidev_put_final() or similar that can be
> > used in pci_remove_device() to assert that the device has been
> > actually removed.
> 
> Will something break if we don't do this? I can't see how this can
> happen.

As mentioned above, once pci_remove_device() returns 0 the admin
should be capable of physically removing the device from the system.

> >
> >>              break;
> >>          }
> >>  
> >> @@ -848,7 +836,7 @@ static int deassign_device(struct domain *d, uint16_t 
> >> seg, uint8_t bus,
> >>      {
> >>          ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
> >>          if ( ret )
> >> -           return ret;
> >> +            goto out;
> >>  
> >>          target = dom_io;
> >>      }
> >> @@ -878,6 +866,7 @@ static int deassign_device(struct domain *d, uint16_t 
> >> seg, uint8_t bus,
> >>      pdev->fault.count = 0;
> >>  
> >>   out:
> >> +    pcidev_put(pdev);
> >>      if ( ret )
> >>          printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n",
> >>                 d, &PCI_SBDF(seg, bus, devfn), ret);
> >> @@ -1011,7 +1000,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 
> >> devfn)
> >>              pdev->fault.count >>= 1;
> >>          pdev->fault.time = now;
> >>          if ( ++pdev->fault.count < PT_FAULT_THRESHOLD )
> >> +        {
> >> +            pcidev_put(pdev);
> >>              pdev = NULL;
> >> +        }
> >>      }
> >>      pcidevs_unlock();
> >>  
> >> @@ -1022,6 +1014,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 
> >> devfn)
> >>       * control it for us. */
> >>      cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
> >>      pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & 
> >> ~PCI_COMMAND_MASTER);
> >> +
> >> +    pcidev_put(pdev);
> >>  }
> >>  
> >>  /*
> >> @@ -1138,6 +1132,7 @@ static int __hwdom_init cf_check 
> >> _setup_hwdom_pci_devices(
> >>                  printk(XENLOG_WARNING "Dom%d owning %pp?\n",
> >>                         pdev->domain->domain_id, &pdev->sbdf);
> >>  
> >> +            pcidev_put(pdev);
> >>              if ( iommu_verbose )
> >>              {
> >>                  pcidevs_unlock();
> >> @@ -1385,33 +1380,28 @@ static int iommu_remove_device(struct pci_dev 
> >> *pdev)
> >>      return iommu_call(hd->platform_ops, remove_device, devfn, 
> >> pci_to_dev(pdev));
> >>  }
> >>  
> >> -static int device_assigned(u16 seg, u8 bus, u8 devfn)
> >> +static int device_assigned(struct pci_dev *pdev)
> >>  {
> >> -    struct pci_dev *pdev;
> >>      int rc = 0;
> >>  
> >>      ASSERT(pcidevs_locked());
> >> -    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
> >> -
> >> -    if ( !pdev )
> >> -        rc = -ENODEV;
> >>      /*
> >>       * If the device exists and it is not owned by either the hardware
> >>       * domain or dom_io then it must be assigned to a guest, or be
> >>       * hidden (owned by dom_xen).
> >>       */
> >> -    else if ( pdev->domain != hardware_domain &&
> >> -              pdev->domain != dom_io )
> >> +    if ( pdev->domain != hardware_domain &&
> >> +         pdev->domain != dom_io )
> >>          rc = -EBUSY;
> >>  
> >>      return rc;
> >>  }
> >>  
> >>  /* Caller should hold the pcidevs_lock */
> >
> > I would assume the caller has taken an extra reference to the pdev, so
> > holding the pcidevs_lock is no longer needed?
> 
> I am assumed that lock may be required by MSIX or IOMMU functions, that
> are being called here. For example, I can see that reassign_device() in
> pci_amd_iommu.c manipulates with some lists. I believe, it should be
> protected with the lock.

OK, so that's pcidevs_lock being used to protect something else that's
not strictly a pci device, but a related structure.

> >
> >> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 
> >> flag)
> >> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
> >>  {
> >>      const struct domain_iommu *hd = dom_iommu(d);
> >> -    struct pci_dev *pdev;
> >> +    uint8_t devfn;
> >>      int rc = 0;
> >>  
> >>      if ( !is_iommu_enabled(d) )
> >> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 
> >> seg, u8 bus, u8 devfn, u32 flag)
> >>  
> >>      /* device_assigned() should already have cleared the device for 
> >> assignment */
> >>      ASSERT(pcidevs_locked());
> >> -    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
> >>      ASSERT(pdev && (pdev->domain == hardware_domain ||
> >>                      pdev->domain == dom_io));
> >>  
> >> +    devfn = pdev->devfn;
> >> +
> >>      /* Do not allow broken devices to be assigned to guests. */
> >>      rc = -EBADF;
> >>      if ( pdev->broken && d != hardware_domain && d != dom_io )
> >> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, 
> >> u8 bus, u8 devfn, u32 flag)
> >>   done:
> >>      if ( rc )
> >>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
> >> -               d, &PCI_SBDF(seg, bus, devfn), rc);
> >> +               d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc);
> >>      /* The device is assigned to dom_io so mark it as quarantined */
> >>      else if ( d == dom_io )
> >>          pdev->quarantine = true;
> >> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl(
> >>          ASSERT(d);
> >>          /* fall through */
> >>      case XEN_DOMCTL_test_assign_device:
> >> +    {
> >> +        struct pci_dev *pdev;
> >> +
> >>          /* Don't support self-assignment of devices. */
> >>          if ( d == current->domain )
> >>          {
> >> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl(
> >>          seg = machine_sbdf >> 16;
> >>          bus = PCI_BUS(machine_sbdf);
> >>          devfn = PCI_DEVFN(machine_sbdf);
> >> -
> >>          pcidevs_lock();
> >> -        ret = device_assigned(seg, bus, devfn);
> >> +        pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
> >> +        if ( !pdev )
> >> +        {
> >> +            printk(XENLOG_G_INFO "%pp non-existent\n",
> >> +                   &PCI_SBDF(seg, bus, devfn));
> >> +            ret = -EINVAL;
> >> +            break;
> >> +        }
> >> +
> >> +        ret = device_assigned(pdev);
> >>          if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
> >>          {
> >>              if ( ret )
> >>              {
> >> -                printk(XENLOG_G_INFO "%pp already assigned, or 
> >> non-existent\n",
> >> +                printk(XENLOG_G_INFO "%pp already assigned\n",
> >>                         &PCI_SBDF(seg, bus, devfn));
> >>                  ret = -EINVAL;
> >>              }
> >>          }
> >>          else if ( !ret )
> >> -            ret = assign_device(d, seg, bus, devfn, flags);
> >> +            ret = assign_device(d, pdev, flags);
> >> +
> >> +        pcidev_put(pdev);
> >
> > I would think you need to keep the refcount here if ret == 0, so that
> > the device cannot be removed while assigned to a domain?
> 
> Looks like we are perceiving function of refcnt in a different
> ways. For me, this is the mechanism to guarantee that if we have a valid
> pointer to an object, this object will not disappear under our
> feet. This is the main function of krefs in the linux kernel: if your
> code holds a reference to an object, you can be sure that this object is
> exists in memory.
> 
> On other hand, it seems that you are considering this refcnt as an usage
> counter for an actual PCI device, not "struct pdev" that represent
> it. Those are two related things, but not the same. So, I can see why
> you are suggesting to get additional reference there. But for me, this
> looks unnecessary: the very first refcount is obtained in
> pci_add_device() and there is the corresponding function
> pci_remove_device() that will drop this refcount. So, for me, if admin
> wants to remove a PCI device which is assigned to a domain, they can do
> this as they were able to do this prior this patches.

This is all fine, but needs to be stated in the commit message.

> The main value of introducing refcnt is to be able to access pdev objects
> without holding the global pcidevs_lock(). This does not mean that you
> don't need locking at all. But this allows you to use pdev->lock (which
> does not exists in this series, but was introduced in a RFC earlier), or
> vpci->lock, or any other subsystem->lock.

I guess I was missing this other bit about introducing a
per-device lock, would it be possible to bundle all this together into
a single patch series?

It would be good to place this change together with any other locking
related change that you have pending.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.