
Re: [RFC PATCH] pci: introduce per-domain PCI rwlock


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Tue, 11 Jul 2023 18:40:11 +0000
  • Accept-language: en-US
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Oleksandr Tyshchenko <olekstysh@xxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Tue, 11 Jul 2023 18:40:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [RFC PATCH] pci: introduce per-domain PCI rwlock

Hi Jan,

Jan Beulich <jbeulich@xxxxxxxx> writes:

> On 11.07.2023 02:46, Volodymyr Babchuk wrote:
>> Add a per-domain d->pci_lock that protects access to
>> d->pdev_list. The purpose of this lock is to give the VPCI code a
>> guarantee that the underlying pdev will not disappear under its
>> feet. Later it will also protect the pdev->vpci structure and pdev
>> accesses in modify_bars().
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@xxxxxxxx>
>> 
>> ---
>> 
>> This patch should be part of the VPCI series, but I am posting it as a
>> single-patch RFC to discuss the changes to the x86 MM and IOMMU code.
>
> To aid review / judgement extending the commit message would help, to
> outline around which function invocations (and for what reason) the
> lock now needs to be held. Furthermore lock nesting rules want writing
> down (perhaps next to the declaration of the lock). Therefore comments
> below are merely preliminary and likely incomplete.

I added the lock in the places where the underlying code touches
d->pdev_list. My intention was to lock the parts of the code that might
depend on the list contents. This is straightforward when we are
traversing the list, but it is much more complicated (for me at least)
in the cases where the has_arch_pdevs() macro is involved. Prior to my
patch, uses of has_arch_pdevs() weren't protected by any PCI lock at
all. This begs the question: do we need to protect them now? And if we
do, which portion of the code needs to be protected? I did my best to
isolate the affected parts of the code.
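
For reference, this is roughly how I would document the lock next to
its declaration (a sketch; the nesting rule stated below is my
assumption and still needs to be agreed on):

    /* In struct domain (xen/include/xen/sched.h): */

    /*
     * Protects d->pdev_list (and, later, pdev->vpci and pdev accesses
     * in modify_bars()).  Nests inside pcidevs_lock: when both are
     * needed, acquire pcidevs_lock first, then d->pci_lock.
     */
    rwlock_t pci_lock;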

>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -2381,12 +2381,14 @@ int hvm_set_cr0(unsigned long value, bool may_defer)
>>          }
>>      }
>>  
>> +    read_lock(&d->pci_lock);
>>      if ( ((value ^ old_value) & X86_CR0_CD) &&
>>           is_iommu_enabled(d) && hvm_funcs.handle_cd &&
>>           (!rangeset_is_empty(d->iomem_caps) ||
>>            !rangeset_is_empty(d->arch.ioport_caps) ||
>>            has_arch_pdevs(d)) )
>>          alternative_vcall(hvm_funcs.handle_cd, v, value);
>> +    read_unlock(&d->pci_lock);
>
> handle_cd() is non-trivial - did you audit it for safety of
> holding a lock around it?

Only vmx_handle_cd() implements this hook. I scanned through it and
didn't find any other PCI-related things inside. It acquires
v->arch.hvm.vmx.vmcs_lock, but I didn't find any potential for a
deadlock.

On the other hand, do we really need to call it under d->pci_lock at
all? What bad thing would happen if has_arch_pdevs(d) became false
during handle_cd() execution?
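
If nothing bad happens, the condition could be latched under the lock
and the call made outside of it, roughly like this (a sketch against
the hunk above; the local variable name is mine):

    bool cd_changed;

    read_lock(&d->pci_lock);
    cd_changed = ((value ^ old_value) & X86_CR0_CD) &&
                 is_iommu_enabled(d) && hvm_funcs.handle_cd &&
                 (!rangeset_is_empty(d->iomem_caps) ||
                  !rangeset_is_empty(d->arch.ioport_caps) ||
                  has_arch_pdevs(d));
    read_unlock(&d->pci_lock);

    if ( cd_changed )
        alternative_vcall(hvm_funcs.handle_cd, v, value);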

>
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -858,12 +858,15 @@ get_page_from_l1e(
>>          return 0;
>>      }
>>  
>> +    read_lock(&l1e_owner->pci_lock);
>>      if ( unlikely(l1f & l1_disallow_mask(l1e_owner)) )
>>      {
>>          gdprintk(XENLOG_WARNING, "Bad L1 flags %x\n",
>>                   l1f & l1_disallow_mask(l1e_owner));
>> +        read_unlock(&l1e_owner->pci_lock);
>
> In cases like this one I think you want to avoid holding the lock
> across the printk(). This can easily be arranged for by latching
> l1_disallow_mask()'s return value into a new local variable.

Sure, will rework.
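
Something along these lines, I assume (a sketch; `disallow` is a new
local, and I am keeping the -EINVAL error path as in the current code):

    unsigned int disallow;

    read_lock(&l1e_owner->pci_lock);
    disallow = l1_disallow_mask(l1e_owner);

    if ( unlikely(l1f & disallow) )
    {
        read_unlock(&l1e_owner->pci_lock);
        gdprintk(XENLOG_WARNING, "Bad L1 flags %x\n", l1f & disallow);
        return -EINVAL;
    }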

>> --- a/xen/arch/x86/mm/p2m-pod.c
>> +++ b/xen/arch/x86/mm/p2m-pod.c
>> @@ -349,10 +349,12 @@ p2m_pod_set_mem_target(struct domain *d, unsigned long target)
>>  
>>      ASSERT( pod_target >= p2m->pod.count );
>>  
>> +    read_lock(&d->pci_lock);
>>      if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
>>          ret = -ENOTEMPTY;
>>      else
>>          ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);
>> +    read_unlock(&d->pci_lock);
>
> Hmm, is it necessary to hold the lock across the function call?

Well, I am not sure. Would it be okay to just check has_arch_pdevs()
while holding the lock? What if its result changes the very next
instant?
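
If a momentary snapshot is deemed sufficient, the call could move
outside the lock, e.g. (sketch; `blocked` is a name I made up):

    bool blocked;

    read_lock(&d->pci_lock);
    blocked = has_arch_pdevs(d) || cache_flush_permitted(d);
    read_unlock(&d->pci_lock);

    if ( blocked )
        ret = -ENOTEMPTY;
    else
        ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);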


>> --- a/xen/arch/x86/mm/paging.c
>> +++ b/xen/arch/x86/mm/paging.c
>> @@ -205,21 +205,27 @@ static int paging_log_dirty_enable(struct domain *d)
>>  {
>>      int ret;
>>  
>> +    read_lock(&d->pci_lock);
>>      if ( has_arch_pdevs(d) )
>>      {
>>          /*
>>           * Refuse to turn on global log-dirty mode
>>           * if the domain is sharing the P2M with the IOMMU.
>>           */
>> +        read_unlock(&d->pci_lock);
>>          return -EINVAL;
>>      }
>>  
>>      if ( paging_mode_log_dirty(d) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return -EINVAL;
>> +    }
>>  
>>      domain_pause(d);
>>      ret = d->arch.paging.log_dirty.ops->enable(d);
>>      domain_unpause(d);
>> +    read_unlock(&d->pci_lock);
>
> This means a relatively long potential lock holding time. I wonder
> whether lock release shouldn't be delegated to the ->enable() hook,
> as it could do so immediately after setting the flag that would
> then prevent assignment of devices.

To me this looks a bit fragile: we would rely on a hook to release a
lock that wasn't acquired by that hook. But I can do this. It should be
released right after setting PG_log_dirty, correct?
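
For HAP that would look roughly like this (a sketch; whether
PG_log_dirty alone is enough to block further device assignment is
exactly what I'd like to confirm):

    /* In the ->enable() hook, i.e. hap_enable_log_dirty(): */
    paging_lock(d);
    d->arch.paging.mode |= PG_log_dirty;
    paging_unlock(d);

    /* Lock was acquired by paging_log_dirty_enable(); from here on
     * PG_log_dirty is visible, so device assignment can be refused
     * without us holding d->pci_lock. */
    read_unlock(&d->pci_lock);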

BTW, I can see that hap_enable_log_dirty() uses
read_atomic(&p2m->ioreq.entry_count), but p2m_entry_modify() does plain
p2m->ioreq.entry_count++ and p2m->ioreq.entry_count--. This looks
inconsistent. Also, it looks like hap_enable_log_dirty() does not hold
p2m->ioreq.lock while accessing entry_count, so its value can change
right after the read_atomic().
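
Since the writers appear to be serialized by the p2m lock, the minimal
consistent variant would be to pair the lockless read_atomic() with
write_atomic() on the update side, e.g. (sketch; this only prevents
store tearing, it does not address the staleness question above):

    /* In p2m_entry_modify(), instead of p2m->ioreq.entry_count++: */
    write_atomic(&p2m->ioreq.entry_count, p2m->ioreq.entry_count + 1);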


>
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -102,6 +102,8 @@ static bool any_pdev_behind_iommu(const struct domain *d,
>>  {
>>      const struct pci_dev *pdev;
>>  
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>> +
>>      for_each_pdev ( d, pdev )
>>      {
>>          if ( pdev == exclude )
>> @@ -467,17 +469,24 @@ static int cf_check reassign_device(
>>  
>>      if ( !QUARANTINE_SKIP(target, pdev) )
>>      {
>> +    read_lock(&target->pci_lock);
>>          rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>>          if ( rc )
>>              return rc;
>> +    read_unlock(&target->pci_lock);
>
> You need to drop the lock before the if().

Yes, thanks.
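
So the hunk becomes (sketch):

    if ( !QUARANTINE_SKIP(target, pdev) )
    {
        read_lock(&target->pci_lock);
        rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
        read_unlock(&target->pci_lock);
        if ( rc )
            return rc;
    }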

>
> Also nit: No hard tabs here please.
>
>>      }
>>      else
>>          amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
>
> Related to my initial comment at the top: It wants clarifying for example
> why "setup" needs the lock held, but "disable" doesn't.
>

Because amd_iommu_disable_domain_device() does not access d->pdev_list,
while amd_iommu_setup_domain_device() does.

Anyway, I am interested in the AMD IOMMU maintainer's opinion here: what
is the correct scope for the lock?

>>      if ( devfn == pdev->devfn && pdev->domain != target )
>>      {
>> -        list_move(&pdev->domain_list, &target->pdev_list);
>> -        pdev->domain = target;
>> +        write_lock(&pdev->domain->pci_lock);
>
> Shorter as write_lock(&source->pci_lock)? (Also in the VT-d counterpart
> then.)

Ah yes, sure.
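
So the move would look roughly like this (sketch; splitting list_move()
into list_del()/list_add() is how I plan to keep each list under its own
lock):

    if ( devfn == pdev->devfn && pdev->domain != target )
    {
        write_lock(&source->pci_lock);
        list_del(&pdev->domain_list);
        write_unlock(&source->pci_lock);

        pdev->domain = target;

        write_lock(&target->pci_lock);
        list_add(&pdev->domain_list, &target->pdev_list);
        write_unlock(&target->pci_lock);
    }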

>
>> @@ -748,7 +750,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>      if ( !pdev->domain )
>>      {
>>          pdev->domain = hardware_domain;
>> +        write_lock(&hardware_domain->pci_lock);
>>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>> +        write_unlock(&hardware_domain->pci_lock);
>
> What about the companion pci_remove_device()?

Missed this. Thanks.
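
It needs the matching hunk around its list_del(), roughly (sketch):

    /* In pci_remove_device(): */
    write_lock(&pdev->domain->pci_lock);
    list_del(&pdev->domain_list);
    write_unlock(&pdev->domain->pci_lock);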

[...]

>> @@ -2765,6 +2767,7 @@ static int cf_check reassign_device_ownership(
>>  
>>      if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
>>      {
>> +        read_lock(&target->pci_lock);
>>          if ( !has_arch_pdevs(target) )
>>              vmx_pi_hooks_assign(target);
>
> I'm afraid this and the unhook side locking isn't sufficient to guarantee
> no races. Things still depend on the domctl and/or pcidevs lock being
> held around this.

I have no intention of dropping the pcidevs lock at this time.
Honestly, I am not sure we will be able to do that without a major
rework of the IOMMU code.

> At which point acquiring the lock here (and below) is
> of questionable value. In any event I think this warrants code comments.

Well, it would be good to take the lock for the first half of the
function, where we deal with `target`, but we are also accessing
`source` at the same time. To prevent an ABBA deadlock I opted for a
number of finer-grained lock acquisitions.

As for "questionable value", I am agree with you. But, if we want to
protect/serialize access to d->pdev_list, we need to use lock there.

> Possibly the same also applies to check_cleanup_domid_map() and friends.
>
>> @@ -2780,21 +2783,26 @@ static int cf_check reassign_device_ownership(
>>  #endif
>>  
>>          ret = domain_context_mapping(target, devfn, pdev);
>> +        read_unlock(&target->pci_lock);
>
> Other calls to domain_context_mapping() aren't wrapped like this. Same
> for domain_context_unmap(), wrapped exactly once below.
>

Will add.
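
I.e. the same pattern at each call site, e.g. for the unmap path
(sketch; which domain's lock is the right one at every site still needs
auditing):

    read_lock(&source->pci_lock);
    ret = domain_context_unmap(source, devfn, pdev);
    read_unlock(&source->pci_lock);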

>>          if ( !ret && pdev->devfn == devfn &&
>>               !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) )
>>          {
>>             const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
>>  
>> +            read_lock(&source->pci_lock);
>>              if ( drhd )
>>                  check_cleanup_domid_map(source, pdev, drhd->iommu);
>> +            read_unlock(&source->pci_lock);
>
> Acquiring the lock inside the if() ought to suffice here.
>
> Jan


Roger, what is your opinion on this? If you remember, you proposed
extending vpci_lock to protect d->pdev_list as well, to deal with the
potential ABBA issue in modify_bars(). But as you can see, this either
leads to another ABBA deadlock in reassign_device_ownership() or (as Jan
pointed out) makes the new lock of questionable value in some cases.

-- 
WBR, Volodymyr

 

