[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Wed, 20 Sep 2023 13:35:06 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=j4MRD6gNFNJh61LVyeuvy3dTlHcFfDyJli0rfl97l1M=; b=Ong3YpJHv7+s9NkNPoB0683G+9zzN4PoZOTHyepiAlXOtYdc0TjYRjBvAlhq5BUwjyZHpu0U8ciAFYB+eJSkB+BSTSXb0o5/r/Yi32t0BZ8YB+c0HnkoOOyRZwCw720/pdZLMECynPl2NkJ8gLHUJwCPdAodlpUW5yCxeuPwwZavNvwB2NapuLQ/24+cNRx4IVc/I6oqPdX24wGvxWaY9N6UOXR05ereiEJMQzmhzUpLWZtokhilAtj1vh6+SSR7iIVVbXYvCsoH21CBQKEDcvU8b6G5f9Y34zhHMTx95A+SmMuKUpagf5UCWjP+8gYdttGg5q8UdpEsGGwjkoNuWA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=STH4Ypr+mbKUuu2f4nz99mxPcyFVl1QK0SKaA4jzC4BSg/zp+O4jF2nCHcr65TrOvrx6AVMucer8Vvy8GWUyRvj8mWLNONeDmoxhPBReboo2XO+H1jM8AFBEiaxRylESuI5Dt2Fnp3gG2YEOb6SoazS4nL2cFOm/mW/Hs1e4hVq5QN1WfljpIqtpOAa2vMSXEJIWqEb2FZpsH/7zAIFahRtW7WQH9sjQ05fS3NFWLurL+pxygdla7P7ZtcYqxFyGy/MffPArPIsLwN6CsqemPmducdsiPnLA48DGT0tFqRwfMuqWsjCKBWmKAqw3TyEbB9Uup4gAUTFzvAHf9jpmtQ==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Delivery-date: Wed, 20 Sep 2023 11:35:39 +0000
  • Ironport-data: A9a23:P8U6t6Lhk+d8UG0BFE+RCZQlxSXFcZb7ZxGr2PjKsXjdYENS3jBWy WMdC2GAPfeMMGf2Kd13O4jj80MCsMLRnd9gGVFlqX01Q3x08seUXt7xwmUcnc+xBpaaEB84t ZV2hv3odp1coqr0/0/1WlTZhSAhk/nOHvylULKs1hlZHWdMUD0mhQ9oh9k3i4tphcnRKw6Ws Jb5rta31GWNglaYCUpKrfrZwP9TlK6q4mhA7wZmPakjUGL2zBH5MrpOfcldEFOgKmVkNrbSb /rOyri/4lTY838FYj9yuu+mGqGiaue60Tmm0hK6aYD76vRxjnVaPpIAHOgdcS9qZwChxLid/ jnvWauYEm/FNoWU8AgUvoIx/ytWZcWq85efSZSzXFD6I+QrvBIAzt03ZHzaM7H09c5dK01r6 tIpdgktSVeJve2O+Zymd+tF05FLwMnDZOvzu1lG5BSAV7MDfsqGRK/Ho9hFwD03m8ZCW+7EY NYUYiZuaxKGZABTPlAQC9Q1m+LAanvXKmUE7g7K4/RppTSMpOBy+OGF3N79YNuFSN8Thk+Fj mnH4374ElcRM9n3JT+tqyj33LWXwHugMG4UPKWJ7sRng2CN+kUCNTwEV2Cb/uuXlWfrDrqzL GRRoELCt5MaykuvSdXsWgyil1SNtBUcRtl4HvUz7UeGza+8yxaUAC0IQyBMbPQitdQqXno62 1mRhdTrCDdz9rqPRhq17auZsjqoJW4VLGsOaCUeRA0Jy9DmrMc4iRenZvFnHa2uh9v5AwbZx TyQsTM+jLUei80M/6ij9FWBiDWpzqUlVSYw7wTTG2e6tAVwYdf/Y5TysQSEq/FdMIyeU1+N+ mAenNST5/wPCpfLkzGRROIKH/ei4PPt3CDgvGOD1qIJr1yFk0NPt6gKiN2iDC+F6vo5RAI=
  • Ironport-hdrordr: A9a23:6fauvqHW4bjMTwQVpLqEEseALOsnbusQ8zAXPiBKJCC9vPb5qy nOpoV+6faQslwssR4b9uxoVJPvfZq+z+8R3WByB8bAYOCOggLBQL2KhbGI/9SKIVydygcy78 Zdm6gVMqyMMbB55/yKnDVRxbwbsaa6GKPDv5ah8590JzsaDJ2Jd21Ce32m+ksdfnghObMJUK Cyy+BgvDSadXEefq2AdwM4t7iqnayzqHr+CyR2fyIa1A==
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Aug 29, 2023 at 11:19:44PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
> 
> Instead of handling a single range set, that contains all the memory
> regions of all the BARs and ROM, have them per BAR.
> As the range sets are now created when a PCI device is added and destroyed
> when it is removed so make them named and accounted.
> 
> Note that rangesets were chosen here despite there being only up to
> 3 separate ranges in each set (typically just 1). But rangeset per BAR
> was chosen for the ease of implementation and existing code re-usability.
> 
> This is in preparation of making non-identity mappings in p2m for the MMIOs.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
> 
> ---
> Since v9:
> - removed d->vpci.map_pending in favor of checking v->vpci.pdev !=
> NULL
> - printk -> gprintk
> - renamed bar variable to fix shadowing
> - fixed bug with iterating on remote device's BARs
> - relaxed lock in vpci_process_pending
> - removed stale comment
> Since v6:
> - update according to the new locking scheme
> - remove odd fail label in modify_bars
> Since v5:
> - fix comments
> - move rangeset allocation to init_bars and only allocate
>   for MAPPABLE BARs
> - check for overlap with the already setup BAR ranges
> Since v4:
> - use named range sets for BARs (Jan)
> - changes required by the new locking scheme
> - updated commit message (Jan)
> Since v3:
> - re-work vpci_cancel_pending accordingly to the per-BAR handling
> - s/num_mem_ranges/map_pending and s/uint8_t/bool
> - ASSERT(bar->mem) in modify_bars
> - create and destroy the rangesets on add/remove
> ---
>  xen/drivers/vpci/header.c | 252 ++++++++++++++++++++++++++------------
>  xen/drivers/vpci/vpci.c   |   6 +
>  xen/include/xen/vpci.h    |   2 +-
>  3 files changed, 180 insertions(+), 80 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index e96d7b2b37..3cc6a96849 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -161,63 +161,101 @@ static void modify_decoding(const struct pci_dev 
> *pdev, uint16_t cmd,
>  
>  bool vpci_process_pending(struct vcpu *v)
>  {
> -    if ( v->vpci.mem )
> +    struct pci_dev *pdev = v->vpci.pdev;
> +    struct map_data data = {
> +        .d = v->domain,
> +        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +    };
> +    struct vpci_header *header = NULL;
> +    unsigned int i;
> +
> +    if ( !pdev )
> +        return false;
> +
> +    read_lock(&v->domain->pci_lock);
> +    header = &pdev->vpci->header;

You should likely check that pdev->vpci != NULL before accessing it,
and that the device is still assigned to the domain, v->domain ==
pdev->domain.

> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        struct map_data data = {
> -            .d = v->domain,
> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -        };
> -        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> +        struct vpci_bar *bar = &header->bars[i];
> +        int rc;
> +
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
> +
> +        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
>  
>          if ( rc == -ERESTART )
> +        {
> +            read_unlock(&v->domain->pci_lock);
>              return true;
> +        }
>  
> -        write_lock(&v->domain->pci_lock);
> -        spin_lock(&v->vpci.pdev->vpci->lock);
> -        /* Disable memory decoding unconditionally on failure. */
> -        modify_decoding(v->vpci.pdev,
> -                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
> -                        !rc && v->vpci.rom_only);
> -        spin_unlock(&v->vpci.pdev->vpci->lock);
> -
> -        rangeset_destroy(v->vpci.mem);
> -        v->vpci.mem = NULL;
>          if ( rc )
> -            /*
> -             * FIXME: in case of failure remove the device from the domain.
> -             * Note that there might still be leftover mappings. While this 
> is
> -             * safe for Dom0, for DomUs the domain will likely need to be
> -             * killed in order to avoid leaking stale p2m mappings on
> -             * failure.
> -             */
> -            vpci_deassign_device(v->vpci.pdev);
> -        write_unlock(&v->domain->pci_lock);
> +        {
> +            spin_lock(&pdev->vpci->lock);
> +            /* Disable memory decoding unconditionally on failure. */
> +            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
> +                            false);
> +            spin_unlock(&pdev->vpci->lock);
> +
> +            v->vpci.pdev = NULL;
> +
> +            read_unlock(&v->domain->pci_lock);
> +
> +            if ( is_hardware_domain(v->domain) )
> +            {
> +                write_lock(&v->domain->pci_lock);

This unlock/lock dance is racy, and I'm not sure there's much point in
removing the vPCI handlers for the device, it's not likely to be
helpful to dom0.  It might be better to just unconditionally disable
memory decoding and empty all the rangesets.  Not sure whether there's
more cached state that would need dealing with in pdev->vpci.

Maybe as a bodge you could leave the current vpci_deassign_device()
call and check that pdev->domain == v->domain after having taken the
pci_lock.

> +                vpci_deassign_device(v->vpci.pdev);
> +                write_unlock(&v->domain->pci_lock);
> +            }
> +            else
> +            {
> +                domain_crash(v->domain);
> +            }
> +            return false;
> +        }
>      }
> +    read_unlock(&v->domain->pci_lock);
> +
> +    v->vpci.pdev = NULL;
> +
> +    spin_lock(&pdev->vpci->lock);
> +    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
> +    spin_unlock(&pdev->vpci->lock);

Why do you drop the pci_lock before calling modify_decoding()?  It
needs to stay locked until operations on pdev have finished, iow:
after modify_decoding(), or else accessing the contents of pdev->vpci
is not safe, and the device could be deassigned in the meantime.

>  
>      return false;
>  }
>  
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> -                            struct rangeset *mem, uint16_t cmd)
> +                            uint16_t cmd)
>  {
>      struct map_data data = { .d = d, .map = true };
> -    int rc;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    int rc = 0;
> +    unsigned int i;
>  
>      ASSERT(rw_is_locked(&d->pci_lock));
>  
> -    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == 
> -ERESTART )
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        /*
> -         * It's safe to drop and reacquire the lock in this context
> -         * without risking pdev disappearing because devices cannot be
> -         * removed until the initial domain has been started.
> -         */
> -        read_unlock(&d->pci_lock);
> -        process_pending_softirqs();
> -        read_lock(&d->pci_lock);
> -    }
> +        struct vpci_bar *bar = &header->bars[i];
>  
> -    rangeset_destroy(mem);
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
> +
> +        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
> +                                              &data)) == -ERESTART )
> +        {
> +            /*
> +             * It's safe to drop and reacquire the lock in this context
> +             * without risking pdev disappearing because devices cannot be
> +             * removed until the initial domain has been started.
> +             */
> +            write_unlock(&d->pci_lock);
> +            process_pending_softirqs();
> +            write_lock(&d->pci_lock);
> +        }
> +    }
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
>  
> @@ -225,10 +263,12 @@ static int __init apply_map(struct domain *d, const 
> struct pci_dev *pdev,
>  }
>  
>  static void defer_map(struct domain *d, struct pci_dev *pdev,
> -                      struct rangeset *mem, uint16_t cmd, bool rom_only)
> +                      uint16_t cmd, bool rom_only)
>  {
>      struct vcpu *curr = current;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      /*
>       * FIXME: when deferring the {un}map the state of the device should not
>       * be trusted. For example the enable bit is toggled after the device
> @@ -236,7 +276,6 @@ static void defer_map(struct domain *d, struct pci_dev 
> *pdev,
>       * started for the same device if the domain is not well-behaved.
>       */
>      curr->vpci.pdev = pdev;
> -    curr->vpci.mem = mem;
>      curr->vpci.cmd = cmd;
>      curr->vpci.rom_only = rom_only;
>      /*
> @@ -250,33 +289,33 @@ static void defer_map(struct domain *d, struct pci_dev 
> *pdev,
>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool 
> rom_only)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
> -    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
>      const struct domain *d;
>      const struct vpci_msix *msix = pdev->vpci->msix;
> -    unsigned int i;
> +    unsigned int i, j;
>      int rc;
>  
>      ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>  
> -    if ( !mem )
> -        return -ENOMEM;
> -
>      /*
> -     * Create a rangeset that represents the current device BARs memory 
> region
> -     * and compare it against all the currently active BAR memory regions. If
> -     * an overlap is found, subtract it from the region to be 
> mapped/unmapped.
> +     * Create a rangeset per BAR that represents the current device memory
> +     * region and compare it against all the currently active BAR memory
> +     * regions. If an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
>       *
> -     * First fill the rangeset with all the BARs of this device or with the 
> ROM
> +     * First fill the rangesets with the BAR of this device or with the ROM
>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR 
> register.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        const struct vpci_bar *bar = &header->bars[i];
> +        struct vpci_bar *bar = &header->bars[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>  
> +        if ( !bar->mem )
> +            continue;
> +
>          if ( !MAPPABLE_BAR(bar) ||
>               (rom_only ? bar->type != VPCI_BAR_ROM
>                         : (bar->type == VPCI_BAR_ROM && 
> !header->rom_enabled)) ||
> @@ -292,14 +331,31 @@ static int modify_bars(const struct pci_dev *pdev, 
> uint16_t cmd, bool rom_only)
>              continue;
>          }
>  
> -        rc = rangeset_add_range(mem, start, end);
> +        rc = rangeset_add_range(bar->mem, start, end);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
>                     start, end, rc);
> -            rangeset_destroy(mem);
>              return rc;
>          }
> +
> +        /* Check for overlap with the already setup BAR ranges. */
> +        for ( j = 0; j < i; j++ )
> +        {
> +            struct vpci_bar *prev_bar = &header->bars[j];
> +
> +            if ( rangeset_is_empty(prev_bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(prev_bar->mem, start, end);
> +            if ( rc )
> +            {
> +                gprintk(XENLOG_WARNING,
> +                       "%pp: failed to remove overlapping range [%lx, %lx]: 
> %d\n",
> +                        &pdev->sbdf, start, end, rc);
> +                return rc;
> +            }
> +        }
>      }
>  
>      /* Remove any MSIX regions if present. */
> @@ -309,14 +365,21 @@ static int modify_bars(const struct pci_dev *pdev, 
> uint16_t cmd, bool rom_only)
>          unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>                                       vmsix_table_size(pdev->vpci, i) - 1);
>  
> -        rc = rangeset_remove_range(mem, start, end);
> -        if ( rc )
> +        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
>          {
> -            printk(XENLOG_G_WARNING
> -                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
> -                   start, end, rc);
> -            rangeset_destroy(mem);
> -            return rc;
> +            const struct vpci_bar *bar = &header->bars[j];
> +
> +            if ( rangeset_is_empty(bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(bar->mem, start, end);
> +            if ( rc )
> +            {
> +                gprintk(XENLOG_WARNING,
> +                       "%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
> +                        &pdev->sbdf, start, end, rc);
> +                return rc;
> +            }
>          }
>      }
>  
> @@ -356,27 +419,34 @@ static int modify_bars(const struct pci_dev *pdev, 
> uint16_t cmd, bool rom_only)
>  
>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>              {
> -                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> -                unsigned long start = PFN_DOWN(bar->addr);
> -                unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> -
> -                if ( !bar->enabled ||
> -                     !rangeset_overlaps_range(mem, start, end) ||
> -                     /*
> -                      * If only the ROM enable bit is toggled check against
> -                      * other BARs in the same device for overlaps, but not
> -                      * against the same ROM BAR.
> -                      */
> -                     (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
> +                const struct vpci_bar *remote_bar = 
> &tmp->vpci->header.bars[i];
> +                unsigned long start = PFN_DOWN(remote_bar->addr);
> +                unsigned long end = PFN_DOWN(remote_bar->addr +
> +                                             remote_bar->size - 1);
> +
> +                if ( !remote_bar->enabled )
>                      continue;
>  
> -                rc = rangeset_remove_range(mem, start, end);
> -                if ( rc )
> +                for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
>                  {
> -                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: 
> %d\n",
> -                           start, end, rc);
> -                    rangeset_destroy(mem);
> -                    return rc;
> +                    const struct vpci_bar *bar = &header->bars[j];
> +                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||

Missing newline between local variable definition and code.

> +                         /*
> +                          * If only the ROM enable bit is toggled check 
> against
> +                          * other BARs in the same device for overlaps, but 
> not
> +                          * against the same ROM BAR.
> +                          */
> +                         (rom_only && tmp == pdev && bar->type == 
> VPCI_BAR_ROM) )
> +                        continue;
> +
> +                    rc = rangeset_remove_range(bar->mem, start, end);
> +                    if ( rc )
> +                    {
> +                        gprintk(XENLOG_WARNING,
> +                                "%pp: failed to remove [%lx, %lx]: %d\n",
> +                                &pdev->sbdf, start, end, rc);
> +                        return rc;
> +                    }
>                  }
>              }
>          }
> @@ -400,10 +470,10 @@ static int modify_bars(const struct pci_dev *pdev, 
> uint16_t cmd, bool rom_only)
>           * will always be to establish mappings and process all the BARs.
>           */
>          ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
> -        return apply_map(pdev->domain, pdev, mem, cmd);
> +        return apply_map(pdev->domain, pdev, cmd);
>      }
>  
> -    defer_map(dev->domain, dev, mem, cmd, rom_only);
> +    defer_map(dev->domain, dev, cmd, rom_only);
>  
>      return 0;
>  }
> @@ -595,6 +665,20 @@ static void cf_check rom_write(
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }
>  
> +static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
> +                            unsigned int i)
> +{
> +    char str[32];
> +
> +    snprintf(str, sizeof(str), "%pp:BAR%d", &pdev->sbdf, i);

%u for i.

> +
> +    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
> +    if ( !bar->mem )
> +        return -ENOMEM;
> +
> +    return 0;

Could be simplified as:

return !bar->mem ? -ENOMEM : 0;

But I don't have a strong opinion, I understand some people might find
this obscure.

> +}
> +
>  static int cf_check init_bars(struct pci_dev *pdev)
>  {
>      uint16_t cmd;
> @@ -675,6 +759,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          else
>              bars[i].type = VPCI_BAR_MEM32;
>  
> +        rc = bar_add_rangeset(pdev, &bars[i], i);
> +        if ( rc )
> +            return rc;

Don't you need to use the fail label in order to restore the previous
command register value on the device? (here and below)

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.