
Re: [RFC PATCH] pci: introduce per-domain PCI rwlock


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Tue, 11 Jul 2023 12:24:48 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Oleksandr Tyshchenko <olekstysh@xxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Tue, 11 Jul 2023 10:25:11 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 11.07.2023 02:46, Volodymyr Babchuk wrote:
> Add per-domain d->pci_lock that protects access to
> d->pdev_list. Purpose of this lock is to give guarantees to VPCI code
> that underlying pdev will not disappear under feet. Later it will also
> protect pdev->vpci structure and pdev access in modify_bars().
> 
> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@xxxxxxxx>
> 
> ---
> 
> This patch should be part of the VPCI series, but I am posting it as a
> single-patch RFC to discuss changes to x86 MM and IOMMU code.

To aid review / judgement, extending the commit message would help: outline
around which function invocations (and for what reason) the lock now needs
to be held. Furthermore, the lock nesting rules want writing down (perhaps
next to the declaration of the lock). Until then the comments below are
merely preliminary and likely incomplete.
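For instance (purely illustrative - the actual nesting rules are for you to
spell out), such a comment next to the declaration might look like:

```c
/*
 * d->pci_lock: protects d->pdev_list (and, later, pdev->vpci).
 *
 * Nesting (hypothetical sketch only, to be confirmed against the code):
 *   pcidevs_lock() may be held when acquiring d->pci_lock;
 *   d->pci_lock must never be held when acquiring pcidevs_lock().
 */
rwlock_t pci_lock;
```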

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2381,12 +2381,14 @@ int hvm_set_cr0(unsigned long value, bool may_defer)
>          }
>      }
>  
> +    read_lock(&d->pci_lock);
>      if ( ((value ^ old_value) & X86_CR0_CD) &&
>           is_iommu_enabled(d) && hvm_funcs.handle_cd &&
>           (!rangeset_is_empty(d->iomem_caps) ||
>            !rangeset_is_empty(d->arch.ioport_caps) ||
>            has_arch_pdevs(d)) )
>          alternative_vcall(hvm_funcs.handle_cd, v, value);
> +    read_unlock(&d->pci_lock);

handle_cd() is non-trivial - did you audit it for safety of holding a
lock around it?

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -858,12 +858,15 @@ get_page_from_l1e(
>          return 0;
>      }
>  
> +    read_lock(&l1e_owner->pci_lock);
>      if ( unlikely(l1f & l1_disallow_mask(l1e_owner)) )
>      {
>          gdprintk(XENLOG_WARNING, "Bad L1 flags %x\n",
>                   l1f & l1_disallow_mask(l1e_owner));
> +        read_unlock(&l1e_owner->pci_lock);

In cases like this one I think you want to avoid holding the lock
across the printk(). This can easily be arranged for by latching
l1_disallow_mask()'s return value into a new local variable.
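I.e. something along these lines (sketch only; the rest of the error path
as in the original):

```c
    unsigned int disallow;

    read_lock(&l1e_owner->pci_lock);
    disallow = l1_disallow_mask(l1e_owner);

    if ( unlikely(l1f & disallow) )
    {
        read_unlock(&l1e_owner->pci_lock);
        /* The printk() now happens without the lock held. */
        gdprintk(XENLOG_WARNING, "Bad L1 flags %x\n", l1f & disallow);
        ...
    }
```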

> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -349,10 +349,12 @@ p2m_pod_set_mem_target(struct domain *d, unsigned long target)
>  
>      ASSERT( pod_target >= p2m->pod.count );
>  
> +    read_lock(&d->pci_lock);
>      if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
>          ret = -ENOTEMPTY;
>      else
>          ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);
> +    read_unlock(&d->pci_lock);

Hmm, is it necessary to hold the lock across the function call?

> --- a/xen/arch/x86/mm/paging.c
> +++ b/xen/arch/x86/mm/paging.c
> @@ -205,21 +205,27 @@ static int paging_log_dirty_enable(struct domain *d)
>  {
>      int ret;
>  
> +    read_lock(&d->pci_lock);
>      if ( has_arch_pdevs(d) )
>      {
>          /*
>           * Refuse to turn on global log-dirty mode
>           * if the domain is sharing the P2M with the IOMMU.
>           */
> +        read_unlock(&d->pci_lock);
>          return -EINVAL;
>      }
>  
>      if ( paging_mode_log_dirty(d) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return -EINVAL;
> +    }
>  
>      domain_pause(d);
>      ret = d->arch.paging.log_dirty.ops->enable(d);
>      domain_unpause(d);
> +    read_unlock(&d->pci_lock);

This means a relatively long potential lock holding time. I wonder
whether lock release shouldn't be delegated to the ->enable() hook,
as it could do so immediately after setting the flag that would
then prevent assignment of devices.
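Roughly like this (sketch only; whether the flag-setting really can be
moved this early in every ->enable() implementation needs checking):

```c
    /* In paging_log_dirty_enable(): */
    read_lock(&d->pci_lock);
    if ( has_arch_pdevs(d) || paging_mode_log_dirty(d) )
    {
        read_unlock(&d->pci_lock);
        return -EINVAL;
    }

    domain_pause(d);
    /*
     * ->enable() drops d->pci_lock itself, right after setting the
     * log-dirty flag which then blocks assignment of devices.
     */
    ret = d->arch.paging.log_dirty.ops->enable(d);
    domain_unpause(d);
```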

> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -102,6 +102,8 @@ static bool any_pdev_behind_iommu(const struct domain *d,
>  {
>      const struct pci_dev *pdev;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      for_each_pdev ( d, pdev )
>      {
>          if ( pdev == exclude )
> @@ -467,17 +469,24 @@ static int cf_check reassign_device(
>  
>      if ( !QUARANTINE_SKIP(target, pdev) )
>      {
> +     read_lock(&target->pci_lock);
>          rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>          if ( rc )
>              return rc;
> +     read_unlock(&target->pci_lock);

You need to drop the lock before the if().

Also nit: No hard tabs here please.
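I.e.:

```c
    if ( !QUARANTINE_SKIP(target, pdev) )
    {
        read_lock(&target->pci_lock);
        rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
        read_unlock(&target->pci_lock);
        if ( rc )
            return rc;
    }
```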

>      }
>      else
>          amd_iommu_disable_domain_device(source, iommu, devfn, pdev);

Related to my initial comment at the top: It wants clarifying, for example,
why "setup" needs the lock held, but "disable" doesn't.

>      if ( devfn == pdev->devfn && pdev->domain != target )
>      {
> -        list_move(&pdev->domain_list, &target->pdev_list);
> -        pdev->domain = target;
> +        write_lock(&pdev->domain->pci_lock);

Shorter as write_lock(&source->pci_lock)? (Also in the VT-d counterpart
then.)

> @@ -748,7 +750,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      if ( !pdev->domain )
>      {
>          pdev->domain = hardware_domain;
> +        write_lock(&hardware_domain->pci_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> +        write_unlock(&hardware_domain->pci_lock);

What about the companion pci_remove_device()?

> @@ -887,26 +891,62 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>  
>  int pci_release_devices(struct domain *d)
>  {
> -    struct pci_dev *pdev, *tmp;
> -    u8 bus, devfn;
> -    int ret;
> +    int combined_ret;
> +    LIST_HEAD(failed_pdevs);
>  
>      pcidevs_lock();
> -    ret = arch_pci_clean_pirqs(d);
> -    if ( ret )
> +    write_lock(&d->pci_lock);
> +    combined_ret = arch_pci_clean_pirqs(d);
> +    if ( combined_ret )
>      {
>          pcidevs_unlock();
> -        return ret;
> +        write_unlock(&d->pci_lock);
> +        return combined_ret;
>      }
> -    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
> +
> +    while ( !list_empty(&d->pdev_list) )
>      {
> -        bus = pdev->bus;
> -        devfn = pdev->devfn;
> -        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
> +        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
> +                                                struct pci_dev,
> +                                                domain_list);
> +        uint16_t seg = pdev->seg;
> +        uint8_t bus = pdev->bus;
> +        uint8_t devfn = pdev->devfn;
> +        int ret;
> +
> +        write_unlock(&d->pci_lock);
> +        ret = deassign_device(d, seg, bus, devfn);
> +        write_lock(&d->pci_lock);
> +        if ( ret )
> +        {
> +            bool still_present = false;
> +            const struct pci_dev *tmp;
> +
> +            /*
> +             * We need to check if deassign_device() left our pdev in
> +             * domain's list. As we dropped the lock, we can't be sure
> +             * that list wasn't permutated in some random way, so we
> +             * need to traverse the whole list.
> +             */
> +            for_each_pdev ( d, tmp )
> +            {
> +                if ( tmp == pdev )
> +                {
> +                    still_present = true;
> +                    break;
> +                }
> +            }
> +            if ( still_present )
> +                list_move(&pdev->domain_list, &failed_pdevs);
> +            combined_ret = ret;

Elsewhere we aim at returning the first error that was encountered, not
the last one.
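I.e.:

```c
            if ( !combined_ret )
                combined_ret = ret;
```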

> @@ -2765,6 +2767,7 @@ static int cf_check reassign_device_ownership(
>  
>      if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
>      {
> +        read_lock(&target->pci_lock);
>          if ( !has_arch_pdevs(target) )
>              vmx_pi_hooks_assign(target);

I'm afraid this and the unhook side locking isn't sufficient to guarantee
no races. Things still depend on the domctl and/or pcidevs lock being
held around this. At which point acquiring the lock here (and below) is
of questionable value. In any event I think this warrants code comments.

Possibly the same also applies to check_cleanup_domid_map() and friends.

> @@ -2780,21 +2783,26 @@ static int cf_check reassign_device_ownership(
>  #endif
>  
>          ret = domain_context_mapping(target, devfn, pdev);
> +        read_unlock(&target->pci_lock);

Other calls to domain_context_mapping() aren't wrapped like this. Same
for domain_context_unmap(), wrapped exactly once below.

>          if ( !ret && pdev->devfn == devfn &&
>               !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) )
>          {
>              const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
>  
> +            read_lock(&source->pci_lock);
>              if ( drhd )
>                  check_cleanup_domid_map(source, pdev, drhd->iommu);
> +            read_unlock(&source->pci_lock);

Acquiring the lock inside the if() ought to suffice here.
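I.e.:

```c
            if ( drhd )
            {
                read_lock(&source->pci_lock);
                check_cleanup_domid_map(source, pdev, drhd->iommu);
                read_unlock(&source->pci_lock);
            }
```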

Jan
