
Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Wed, 21 Jun 2023 22:07:20 +0000
  • Accept-language: en-US
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>
  • Delivery-date: Wed, 21 Jun 2023 22:08:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure

Hi Roger,

Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:

> On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>> 
>> Introduce a per-domain read/write lock used to check whether vpci is
>> present, so we can be sure there are no accesses to the contents of the
>> vpci struct if it is not. This lock can be used (and in a few cases is
>> used right away) so that vpci removal can be performed while holding
>> the lock in write mode. Previously such removal could race with
>> vpci_read, for example.
>> 
>> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>> 
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for overlaps with existing BARs. The overlap check, if done
>> under the read lock, requires vpci->lock to be acquired on both devices
>> being compared, which may produce a deadlock, and it is not possible to
>> upgrade a read lock to a write lock in such a case. So, in order to
>> prevent the deadlock, check which registers are going to be written and
>> acquire the lock in the appropriate mode from the beginning.
>> 
>> All other code, which doesn't lead to pdev->vpci destruction and does not
>> access multiple pdevs at the same time, can still use a combination of the
>> read lock and pdev->vpci->lock.
>> 
>> 3. Optimize detection of whether a ROM BAR write requires the write
>> lock by caching the offset of the ROM BAR register in
>> vpci->header->rom_reg, which depends on the header's type.
>> 
>> 4. Reduce locked region in vpci_remove_device as it is now possible
>> to set pdev->vpci to NULL early right after the write lock is acquired.
>> 
>> 5. Reduce locked region in vpci_add_handlers as it is possible to
>> initialize many more fields of the struct vpci before assigning it to
>> pdev->vpci.
>> 
>> 6. vpci_{add|remove}_register are required to be called with the write
>> lock held, but it is not feasible to add an assert there as that would
>> require passing struct domain to them. So, add a comment about this
>> requirement to these and other functions with equivalent constraints.
>> 
>> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
>> 
>> 8. Do not call process_pending_softirqs with any locks held. For that,
>> unlock prior to the call and re-acquire the locks after it. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>> 
>> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
>> and, if it is NULL, read or write the hardware register directly. This
>> is acceptable as we only deal with Dom0 as of now. Once DomU support is
>> added, writes will need to be ignored and reads will need to return all
>> 0's for guests, while Dom0 can still access the registers directly.
>> 
>> 10. Introduce pcidevs_trylock, so it is possible to attempt taking the
>> pcidevs lock without blocking.
>> 
>> 11. Use the pcidevs lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>> 
>> 12. This is based on the discussion at [1].
>> 
>> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@xxxxxxxxx/
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>
> Thanks.
>
> I haven't looked in full detail, but I'm afraid there's an ABBA
> deadlock with the per-domain vpci lock and the pcidevs lock in
> modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>
> I've made some comments below.

Thank you for the review. I believe that it is a good idea to have a
per-domain pdev_list lock. See my answers below.


>> ---
>> This was tested on x86, both with and without PVH Dom0.
>> ---
>>  xen/arch/x86/hvm/vmsi.c       |   7 +++
>>  xen/common/domain.c           |   3 +
>>  xen/drivers/passthrough/pci.c |   5 ++
>>  xen/drivers/vpci/header.c     |  52 +++++++++++++++++
>>  xen/drivers/vpci/msi.c        |  25 +++++++-
>>  xen/drivers/vpci/msix.c       |  51 +++++++++++++---
>>  xen/drivers/vpci/vpci.c       | 107 +++++++++++++++++++++++++---------
>>  xen/include/xen/pci.h         |   1 +
>>  xen/include/xen/sched.h       |   3 +
>>  xen/include/xen/vpci.h        |   6 ++
>>  10 files changed, 223 insertions(+), 37 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..f188450b1b 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  {
>>      unsigned int i;
>>  
>> +    ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
>> +    ASSERT(pcidevs_locked());
>> +
>>      for ( i = 0; i < msix->max_entries; i++ )
>>      {
>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>              struct pci_dev *pdev = msix->pdev;
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>> +            pcidevs_unlock();
>> +            read_unlock(&pdev->domain->vpci_rwlock);
>>              process_pending_softirqs();
>> +            read_lock(&pdev->domain->vpci_rwlock);
>> +            pcidevs_lock();
>
> Why do you need the pcidevs_lock here?  Isn't it enough to have the
> per-domain rwlock taken in read mode in order to prevent devices from
> being de-assigned from the domain?
>

Yes, it looks like you are right. I'll remove pcidevs_lock().
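
Roughly, the resulting code would look like this (a sketch only, assuming
the per-domain read lock alone is enough to keep pdev alive):

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->vpci_rwlock);
    process_pending_softirqs();
    read_lock(&pdev->domain->vpci_rwlock);
    /* NB: we assume that pdev cannot go away for an alive domain. */
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
        return -EBUSY;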


>>              /* NB: we assume that pdev cannot go away for an alive domain. */
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 6a440590fe..a4640acb62 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
>>  
>>  #ifdef CONFIG_HAS_PCI
>>      INIT_LIST_HEAD(&d->pdev_list);
>> +#ifdef CONFIG_HAS_VPCI
>> +    rwlock_init(&d->vpci_rwlock);
>> +#endif
>>  #endif
>>  
>>      /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 07d1986d33..0afcb59af0 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>>      spin_lock_recursive(&_pcidevs_lock);
>>  }
>>  
>> +int pcidevs_trylock(void)
>> +{
>> +    return spin_trylock_recursive(&_pcidevs_lock);
>> +}
>> +
>>  void pcidevs_unlock(void)
>>  {
>>      spin_unlock_recursive(&_pcidevs_lock);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ec2e978a4e..23b5223adf 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        read_lock(&v->domain->vpci_rwlock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>>                          rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
>>                          !rc && v->vpci.rom_only);
>>          spin_unlock(&v->vpci.pdev->vpci->lock);
>> +        read_unlock(&v->domain->vpci_rwlock);
>
> I think you likely want to expand the scope of the domain vpci lock in
> order to also protect the data in v->vpci?  So that vPCI device
> removal can clear this data if required without racing with
> vpci_process_pending().  Otherwise you need to pause the domain when
> removing vPCI.

Yes, you are right.
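
So something along these lines, with the read lock covering the whole
v->vpci access (a sketch only; the -ERESTART path would of course need to
drop the lock before returning true):

    bool vpci_process_pending(struct vcpu *v)
    {
        read_lock(&v->domain->vpci_rwlock);

        if ( v->vpci.mem )
        {
            /* ... rangeset_consume_ranges() and modify_decoding() as
             * before, now fully covered by the read lock ... */
        }

        read_unlock(&v->domain->vpci_rwlock);
        ...
    }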

>
>>  
>>          rangeset_destroy(v->vpci.mem);
>>          v->vpci.mem = NULL;
>> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&d->vpci_rwlock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        write_unlock(&d->vpci_rwlock);
>>          process_pending_softirqs();
>> +        write_lock(&d->vpci_rwlock);
>> +    }
>> +
>>      rangeset_destroy(mem);
>>      if ( !rc )
>>          modify_decoding(pdev, cmd, false);
>> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>>      raise_softirq(SCHEDULE_SOFTIRQ);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>
> Why it must be in write mode?
>
> Is this to avoid ABBA deadlocks as not taking the per-domain lock in
> write mode would then force us to take each pdev->vpci->lock in order
> to prevent concurrent modifications of remote devices?

Yes, exactly this. This is mentioned in the commit message:

    2. Writing the command register and ROM BAR register may trigger
    modify_bars to run, which in turn may access multiple pdevs while
    checking for overlaps with existing BARs. The overlap check, if done
    under the read lock, requires vpci->lock to be acquired on both devices
    being compared, which may produce a deadlock, and it is not possible to
    upgrade a read lock to a write lock in such a case. So, in order to
    prevent the deadlock, check which registers are going to be written and
    acquire the lock in the appropriate mode from the beginning.
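
So the write path ends up selecting the mode up front, using the
vpci_header_need_write_lock() helper added by this patch (an abbreviated
sketch of the call site):

    bool write_locked = vpci_header_need_write_lock(pdev, reg, size);

    if ( write_locked )
        write_lock(&d->vpci_rwlock);
    else
        read_lock(&d->vpci_rwlock);

    /* ... emulate the access; modify_bars() may only run when the
     * lock is held in write mode ... */

    if ( write_locked )
        write_unlock(&d->vpci_rwlock);
    else
        read_unlock(&d->vpci_rwlock);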

>
>>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>  {
>>      struct vpci_header *header = &pdev->vpci->header;
>> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       * Check for overlaps with other BARs. Note that only BARs that are
>>       * currently mapped (enabled) are checked for overlaps.
>>       */
>> +    pcidevs_lock();
>
> This can be problematic I'm afraid, as it's a guest controlled path
> that allows applying unrestricted contention on the pcidevs lock.  I'm
> unsure whether multiple guests could be able to trigger the watchdog
> if given enough devices/vcpus to be able to perform enough
> simultaneous calls to modify_bars().
>
> Maybe you could expand the per-domain vpci lock to also protect
> modifications of domain->pdev_list?
>
> IOW: kind of a per-domain pdev_lock.

This might work, but I am not sure that we need to expand the vpci
lock. Maybe it is better to add another lock that protects
domain->pdev_list? I am afraid that a combined lock would be confusing,
as it would protect two semi-related entities.
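
To make that concrete, something like the below (a rough sketch; the
field name pdev_lock is a placeholder, not a final choice):

    /* xen/include/xen/sched.h */
    struct domain {
        ...
    #ifdef CONFIG_HAS_PCI
        struct list_head pdev_list;
        rwlock_t pdev_lock;         /* protects pdev_list membership */
    #endif
        ...
    };

    /* modify_bars(): walk the domain's devices under the new lock
     * instead of the global pcidevs lock. */
    read_lock(&pdev->domain->pdev_lock);
    for_each_pdev ( pdev->domain, tmp )
    {
        /* ... BAR overlap checks ... */
    }
    read_unlock(&pdev->domain->pdev_lock);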

>
>>      for_each_pdev ( pdev->domain, tmp )
>>      {
>>          if ( tmp == pdev )
>> @@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>                         start, end, rc);
>>                  rangeset_destroy(mem);
>> +                pcidevs_unlock();
>>                  return rc;
>>              }
>>          }
>>      }
>> +    pcidevs_unlock();
>>  
>>      ASSERT(dev);
>>  
>> @@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>      struct vpci_bar *bars = header->bars;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>      {
>>      case PCI_HEADER_TYPE_NORMAL:
>> @@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>          }
>>      }
>>  
>> +    ASSERT(!header->rom_reg);
>> +
>>      /* Check expansion ROM. */
>>      rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
>>      if ( rc > 0 && size )
>> @@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>                                 4, rom);
>>          if ( rc )
>>              rom->type = VPCI_BAR_EMPTY;
>> +
>> +        header->rom_reg = rom_reg;
>>      }
>>  
>>      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>>  }
>>  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>  
>> +static bool overlap(unsigned int r1_offset, unsigned int r1_size,
>> +                    unsigned int r2_offset, unsigned int r2_size)
>> +{
>> +    /* Return true if there is an overlap. */
>> +    return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;
>
> Hm, we already have a vpci_register_cmp(), which does a very similar
> comparison.  I wonder if it would be possible to somehow use that
> here.
>

Yeah, I'll revisit this part.
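
FWIW, if vpci_register_cmp() were made reachable from header.c, overlap()
could simply wrap it, since it already returns 0 exactly when the two
ranges intersect (a sketch):

    static bool overlap(unsigned int r1_offset, unsigned int r1_size,
                        unsigned int r2_offset, unsigned int r2_size)
    {
        const struct vpci_register r1 = { .offset = r1_offset, .size = r1_size };
        const struct vpci_register r2 = { .offset = r2_offset, .size = r2_size };

        return vpci_register_cmp(&r1, &r2) == 0;
    }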

>> +}
>> +
>> +bool vpci_header_need_write_lock(const struct pci_dev *pdev,
>> +                                 unsigned int start, unsigned int size)
>> +{
>> +    /*
>> +     * Writing the command register and ROM BAR register may trigger
>> +     * modify_bars to run, which in turn may access multiple pdevs while
>> +     * checking for the existing BAR's overlap. The overlapping check, if done
>> +     * under the read lock, requires vpci->lock to be acquired on both devices
>> +     * being compared, which may produce a deadlock. At the same time it is not
>> +     * possible to upgrade read lock to write lock in such a case.
>> +     * Check which registers are going to be written and return true if lock
>> +     * needs to be acquired in write mode.
>> +     */
>> +    if ( overlap(start, size, PCI_COMMAND, 2) ||
>> +         (pdev->vpci->header.rom_reg &&
>> +          overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
>> +        return true;
>> +
>> +    return false;
>> +}
>> +
>>  /*
>>   * Local variables:
>>   * mode: C
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..e7d1f916a0 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>>      uint16_t control;
>>      int ret;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      if ( !pos )
>>          return 0;
>>  
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>  
>>  void vpci_dump_msi(void)
>>  {
>> -    const struct domain *d;
>> +    struct domain *d;
>>  
>>      rcu_read_lock(&domlist_read_lock);
>>      for_each_domain ( d )
>> @@ -277,6 +279,15 @@ void vpci_dump_msi(void)
>>  
>>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>  
>> +        if ( !read_trylock(&d->vpci_rwlock) )
>> +            continue;
>> +
>> +        if ( !pcidevs_trylock() )
>> +        {
>> +            read_unlock(&d->vpci_rwlock);
>> +            continue;
>> +        }
>> +
>>          for_each_pdev ( d, pdev )
>>          {
>>              const struct vpci_msi *msi;
>> @@ -318,14 +329,22 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            pcidevs_unlock();
>> +            read_unlock(&d->vpci_rwlock);
>> +
>>              process_pending_softirqs();
>> +
>> +            read_lock(&d->vpci_rwlock);
>> +            pcidevs_lock();
>>          }
>> +        pcidevs_unlock();
>> +        read_unlock(&d->vpci_rwlock);
>>      }
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 25bde77586..b5a7dfdf9c 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -143,6 +143,7 @@ static void cf_check control_write(
>>          pci_conf_write16(pdev->sbdf, reg, val);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  {
>>      struct vpci_msix *msix;
>> @@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  
>>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>>  {
>> -    return !!msix_find(v->domain, addr);
>> +    int rc;
>> +
>> +    read_lock(&v->domain->vpci_rwlock);
>> +    rc = !!msix_find(v->domain, addr);
>> +    read_unlock(&v->domain->vpci_rwlock);
>> +
>> +    return rc;
>>  }
>>  
>>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_read(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      const struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>>      *data = ~0ul;
>>  
>> +    read_lock(&d->vpci_rwlock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_read(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_read(d, msix, addr, len, data);
>> +        read_unlock(&d->vpci_rwlock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -404,6 +424,7 @@ static int cf_check msix_read(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->vpci_rwlock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_write(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>> +    read_lock(&d->vpci_rwlock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_write(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_write(d, msix, addr, len, data);
>> +        read_unlock(&d->vpci_rwlock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -579,6 +613,7 @@ static int cf_check msix_write(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->vpci_rwlock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>>      struct vpci_msix *msix;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>>                                        PCI_CAP_ID_MSIX);
>>      if ( !msix_offset )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 652807a4a4..1270174e78 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> +    struct vpci *vpci;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>>          return;
>>  
>> -    spin_lock(&pdev->vpci->lock);
>> +    write_lock(&pdev->domain->vpci_rwlock);
>> +    if ( !pdev->vpci )
>> +    {
>> +        write_unlock(&pdev->domain->vpci_rwlock);
>> +        return;
>> +    }
>> +
>> +    vpci = pdev->vpci;
>> +    pdev->vpci = NULL;
>> +    write_unlock(&pdev->domain->vpci_rwlock);
>> +
>>      while ( !list_empty(&pdev->vpci->handlers) )
>>      {
>> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
>> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>>                                                     struct vpci_register,
>>                                                     node);
>>  
>>          list_del(&r->node);
>>          xfree(r);
>>      }
>> -    spin_unlock(&pdev->vpci->lock);
>> +
>>      if ( pdev->vpci->msix )
>>      {
>>          unsigned int i;
>> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>>              if ( pdev->vpci->msix->table[i] )
>>                  iounmap(pdev->vpci->msix->table[i]);
>>      }
>> -    xfree(pdev->vpci->msix);
>> -    xfree(pdev->vpci->msi);
>> -    xfree(pdev->vpci);
>> -    pdev->vpci = NULL;
>> +    xfree(vpci->msix);
>> +    xfree(vpci->msi);
>> +    xfree(vpci);
>>  }
>>  
>>  int vpci_add_handlers(struct pci_dev *pdev)
>>  {
>> +    struct vpci *vpci;
>>      unsigned int i;
>>      int rc = 0;
>>  
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> +    vpci = xzalloc(struct vpci);
>> +    if ( !vpci )
>> +        return -ENOMEM;
>> +
>> +    INIT_LIST_HEAD(&vpci->handlers);
>> +    spin_lock_init(&vpci->lock);
>> +
>> +    write_lock(&pdev->domain->vpci_rwlock);
>
> I think the usage of the vpci per-domain lock here and in
> vpci_remove_device() create an ABBA deadlock situation with the usage
> of it in modify_bars().
>
> Both vpci_add_handlers() and vpci_remove_device() get called with the
> pcidevs lock taken, and take the per-domain vpci lock afterwards,
> while modify_bars() does the inverse locking, gets called with the
> per-domain vpci lock taken and then take the pcidevs lock inside the
> function.

You suggested using a separate per-domain pdev_list lock in
modify_bars. It looks like this could help with the ABBA deadlock.
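
With that change the lock ordering becomes consistent between the two
paths (a sketch of the intended ordering, assuming the new per-domain
pdev_list lock):

    /*
     * vpci_add_handlers() / vpci_remove_device():
     *     pcidevs lock -> d->vpci_rwlock (write)
     *
     * modify_bars():
     *     d->vpci_rwlock (read or write) -> d->pdev_list lock (read)
     *
     * The global pcidevs lock is no longer taken inside modify_bars(),
     * so the inverted ordering that produced the ABBA deadlock is gone.
     */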

>
>> +
>>      /* We should not get here twice for the same device. */
>>      ASSERT(!pdev->vpci);
>>  
>> -    pdev->vpci = xzalloc(struct vpci);
>> -    if ( !pdev->vpci )
>> -        return -ENOMEM;
>> -
>> -    INIT_LIST_HEAD(&pdev->vpci->handlers);
>> -    spin_lock_init(&pdev->vpci->lock);
>> +    pdev->vpci = vpci;
>>  
>>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>>      {
>> @@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>          if ( rc )
>>              break;
>>      }
>> +    write_unlock(&pdev->domain->vpci_rwlock);
>>  
>>      if ( rc )
>>          vpci_remove_device(pdev);
>> @@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
>>      return pci_conf_read32(pdev->sbdf, reg);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>                        vpci_write_t *write_handler, unsigned int offset,
>>                        unsigned int size, void *data)
>> @@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>      r->offset = offset;
>>      r->private = data;
>>  
>> -    spin_lock(&vpci->lock);
>
> If you remove the lock here we need to assert that the per-domain vpci
> lock is taken in write mode.

Yep, an assert will be better than the comment above.
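
E.g. something like this (assuming struct vpci grows a back-pointer to
its pdev, which it does not have yet, so the domain can be reached):

    int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                          vpci_write_t *write_handler, unsigned int offset,
                          unsigned int size, void *data)
    {
        ASSERT(rw_is_write_locked(&vpci->pdev->domain->vpci_rwlock));
        ...
    }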

>
>> -
>>      /* The list of handlers must be kept sorted at all times. */
>>      list_for_each ( prev, &vpci->handlers )
>>      {
>> @@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>              break;
>>          if ( cmp == 0 )
>>          {
>> -            spin_unlock(&vpci->lock);
>>              xfree(r);
>>              return -EEXIST;
>>          }
>>      }
>>  
>>      list_add_tail(&r->node, prev);
>> -    spin_unlock(&vpci->lock);
>>  
>>      return 0;
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>                           unsigned int size)
>>  {
>>      const struct vpci_register r = { .offset = offset, .size = size };
>>      struct vpci_register *rm;
>>  
>> -    spin_lock(&vpci->lock);
>>      list_for_each_entry ( rm, &vpci->handlers, node )
>>      {
>>          int cmp = vpci_register_cmp(&r, rm);
>> @@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>          if ( !cmp && rm->offset == offset && rm->size == size )
>>          {
>>              list_del(&rm->node);
>> -            spin_unlock(&vpci->lock);
>>              xfree(rm);
>>              return 0;
>>          }
>>          if ( cmp <= 0 )
>>              break;
>>      }
>> -    spin_unlock(&vpci->lock);
>
> Same here about the per-domain lock being taken.
>
>>  
>>      return -ENOENT;
>>  }
>> @@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>> @@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>      }
>>  
>>      /* Find the PCI dev matching the address. */
>> +    pcidevs_lock();
>>      pdev = pci_get_pdev(d, sbdf);
>> -    if ( !pdev || !pdev->vpci )
>> +    pcidevs_unlock();
>
> I think it's worth looking into expanding the per-domain vpci-lock to
> a pdevs lock or similar in order to protect the pdev_list also if
> possible.  So that we can avoid taking the pcidevs lock here.

Agree
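
i.e. roughly (assuming the expanded per-domain lock also protects
d->pdev_list, so pdev cannot be removed while it is held):

    /* Find the PCI dev matching the address. */
    read_lock(&d->vpci_rwlock);
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci )
    {
        read_unlock(&d->vpci_rwlock);
        /* Pass the access through to the hardware (Dom0 only for now). */
        return vpci_read_hw(sbdf, reg, size);
    }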


-- 
WBR, Volodymyr
