
Re: [RFC PATCH] pci: introduce per-domain PCI rwlock


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Tue, 11 Jul 2023 18:40:11 +0000
  • Accept-language: en-US
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Oleksandr Tyshchenko <olekstysh@xxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Tue, 11 Jul 2023 18:40:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [RFC PATCH] pci: introduce per-domain PCI rwlock

Hi Jan,

Jan Beulich <jbeulich@xxxxxxxx> writes:

> On 11.07.2023 02:46, Volodymyr Babchuk wrote:
>> Add a per-domain d->pci_lock that protects access to
>> d->pdev_list. The purpose of this lock is to give the VPCI code a
>> guarantee that the underlying pdev will not disappear under its
>> feet. Later it will also protect the pdev->vpci structure and pdev
>> accesses in modify_bars().
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@xxxxxxxx>
>> 
>> ---
>> 
>> This patch should be part of the VPCI series, but I am posting it as a
>> single-patch RFC to discuss the changes to the x86 MM and IOMMU code.
>
> To aid review / judgement extending the commit message would help, to
> outline around which function invocations (and for what reason) the
> lock now needs to be held. Furthermore lock nesting rules want writing
> down (perhaps next to the declaration of the lock). Therefore comments
> below are merely preliminary and likely incomplete.

I added the lock in the places where the underlying code touches
d->pdev_list. My intention was to lock the parts of the code that might
depend on the list contents. This is straightforward when we are
traversing the list, but it is much more complicated (for me at least)
in the cases where the has_arch_pdevs() macro is involved. Prior to my
patch, uses of has_arch_pdevs() weren't protected by any PCI lock at
all. This begs the question: do we need to protect them now? And if we
do, which portion of the code needs to be protected? I did my best to
isolate the affected parts of the code.
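
For reference, this is roughly how I would document the lock next to
its declaration (a sketch; the nesting rule stated below is my
assumption and still needs to be agreed on):

    /* In struct domain (xen/include/xen/sched.h): */

    /*
     * Protects d->pdev_list (and, later, pdev->vpci and pdev accesses
     * in modify_bars()).  Nests inside pcidevs_lock: when both are
     * needed, acquire pcidevs_lock first, then d->pci_lock.
     */
    rwlock_t pci_lock;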

>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -2381,12 +2381,14 @@ int hvm_set_cr0(unsigned long value, bool may_defer)
>>          }
>>      }
>>  
>> +    read_lock(&d->pci_lock);
>>      if ( ((value ^ old_value) & X86_CR0_CD) &&
>>           is_iommu_enabled(d) && hvm_funcs.handle_cd &&
>>           (!rangeset_is_empty(d->iomem_caps) ||
>>            !rangeset_is_empty(d->arch.ioport_caps) ||
>>            has_arch_pdevs(d)) )
>>          alternative_vcall(hvm_funcs.handle_cd, v, value);
>> +    read_unlock(&d->pci_lock);
>
> handle_cd() is non-trivial - did you audit it for safety of
> holding a lock around it?

Only vmx_handle_cd() implements this hook. I scanned through it and
didn't find any other PCI-related things inside. It acquires
v->arch.hvm.vmx.vmcs_lock, but I didn't find any potential for a
deadlock.

On the other hand, do we really need to call it under d->pci_lock at
all? What bad thing would happen if has_arch_pdevs(d) became false
during handle_cd() execution?
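
If nothing bad happens, the condition could be latched under the lock
and the call made outside of it, roughly like this (a sketch against
the hunk above; the local variable name is mine):

    bool cd_changed;

    read_lock(&d->pci_lock);
    cd_changed = ((value ^ old_value) & X86_CR0_CD) &&
                 is_iommu_enabled(d) && hvm_funcs.handle_cd &&
                 (!rangeset_is_empty(d->iomem_caps) ||
                  !rangeset_is_empty(d->arch.ioport_caps) ||
                  has_arch_pdevs(d));
    read_unlock(&d->pci_lock);

    if ( cd_changed )
        alternative_vcall(hvm_funcs.handle_cd, v, value);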

>
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -858,12 +858,15 @@ get_page_from_l1e(
>>          return 0;
>>      }
>>  
>> +    read_lock(&l1e_owner->pci_lock);
>>      if ( unlikely(l1f & l1_disallow_mask(l1e_owner)) )
>>      {
>>          gdprintk(XENLOG_WARNING, "Bad L1 flags %x\n",
>>                   l1f & l1_disallow_mask(l1e_owner));
>> +        read_unlock(&l1e_owner->pci_lock);
>
> In cases like this one I think you want to avoid holding the lock
> across the printk(). This can easily be arranged for by latching
> l1_disallow_mask()'s return value into a new local variable.

Sure, will rework.
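
Something along these lines, I assume (a sketch; `disallow` is a new
local, and I am keeping the -EINVAL error path as in the current code):

    unsigned int disallow;

    read_lock(&l1e_owner->pci_lock);
    disallow = l1_disallow_mask(l1e_owner);

    if ( unlikely(l1f & disallow) )
    {
        read_unlock(&l1e_owner->pci_lock);
        gdprintk(XENLOG_WARNING, "Bad L1 flags %x\n", l1f & disallow);
        return -EINVAL;
    }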

>> --- a/xen/arch/x86/mm/p2m-pod.c
>> +++ b/xen/arch/x86/mm/p2m-pod.c
>> @@ -349,10 +349,12 @@ p2m_pod_set_mem_target(struct domain *d, unsigned long target)
>>  
>>      ASSERT( pod_target >= p2m->pod.count );
>>  
>> +    read_lock(&d->pci_lock);
>>      if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
>>          ret = -ENOTEMPTY;
>>      else
>>          ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);
>> +    read_unlock(&d->pci_lock);
>
> Hmm, is it necessary to hold the lock across the function call?

Well, I am not sure. Would it be okay to just check has_arch_pdevs()
while holding the lock? What if its result changes the very next
instant?
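
If a momentary snapshot is deemed sufficient, the call could move
outside the lock, e.g. (sketch; `blocked` is a name I made up):

    bool blocked;

    read_lock(&d->pci_lock);
    blocked = has_arch_pdevs(d) || cache_flush_permitted(d);
    read_unlock(&d->pci_lock);

    if ( blocked )
        ret = -ENOTEMPTY;
    else
        ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);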


>> --- a/xen/arch/x86/mm/paging.c
>> +++ b/xen/arch/x86/mm/paging.c
>> @@ -205,21 +205,27 @@ static int paging_log_dirty_enable(struct domain *d)
>>  {
>>      int ret;
>>  
>> +    read_lock(&d->pci_lock);
>>      if ( has_arch_pdevs(d) )
>>      {
>>          /*
>>           * Refuse to turn on global log-dirty mode
>>           * if the domain is sharing the P2M with the IOMMU.
>>           */
>> +        read_unlock(&d->pci_lock);
>>          return -EINVAL;
>>      }
>>  
>>      if ( paging_mode_log_dirty(d) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return -EINVAL;
>> +    }
>>  
>>      domain_pause(d);
>>      ret = d->arch.paging.log_dirty.ops->enable(d);
>>      domain_unpause(d);
>> +    read_unlock(&d->pci_lock);
>
> This means a relatively long potential lock holding time. I wonder
> whether lock release shouldn't be delegated to the ->enable() hook,
> as it could do so immediately after setting the flag that would
> then prevent assignment of devices.

To me this looks a bit fragile: we would rely on a hook to release a
lock that wasn't acquired by that hook. But I can do this. It should be
released right after setting PG_log_dirty, correct?
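
For HAP that would look roughly like this (a sketch; whether
PG_log_dirty alone is enough to block further device assignment is
exactly what I'd like to confirm):

    /* In the ->enable() hook, i.e. hap_enable_log_dirty(): */
    paging_lock(d);
    d->arch.paging.mode |= PG_log_dirty;
    paging_unlock(d);

    /* Lock was acquired by paging_log_dirty_enable(); from here on
     * PG_log_dirty is visible, so device assignment can be refused
     * without us holding d->pci_lock. */
    read_unlock(&d->pci_lock);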

BTW, I can see that hap_enable_log_dirty() uses
read_atomic(&p2m->ioreq.entry_count), but p2m_entry_modify() does plain
p2m->ioreq.entry_count++ and p2m->ioreq.entry_count--. This looks
inconsistent. Also, it looks like hap_enable_log_dirty() does not hold
p2m->ioreq.lock while accessing entry_count, so its value can change
right after the read_atomic().
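
Since the writers appear to be serialized by the p2m lock, the minimal
consistent variant would be to pair the lockless read_atomic() with
write_atomic() on the update side, e.g. (sketch; this only prevents
store tearing, it does not address the staleness question above):

    /* In p2m_entry_modify(), instead of p2m->ioreq.entry_count++: */
    write_atomic(&p2m->ioreq.entry_count, p2m->ioreq.entry_count + 1);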


>
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -102,6 +102,8 @@ static bool any_pdev_behind_iommu(const struct domain *d,
>>  {
>>      const struct pci_dev *pdev;
>>  
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>> +
>>      for_each_pdev ( d, pdev )
>>      {
>>          if ( pdev == exclude )
>> @@ -467,17 +469,24 @@ static int cf_check reassign_device(
>>  
>>      if ( !QUARANTINE_SKIP(target, pdev) )
>>      {
>> +    read_lock(&target->pci_lock);
>>          rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>>          if ( rc )
>>              return rc;
>> +    read_unlock(&target->pci_lock);
>
> You need to drop the lock before the if().

Yes, thanks.
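
So the hunk becomes (sketch):

    if ( !QUARANTINE_SKIP(target, pdev) )
    {
        read_lock(&target->pci_lock);
        rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
        read_unlock(&target->pci_lock);
        if ( rc )
            return rc;
    }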

>
> Also nit: No hard tabs here please.
>
>>      }
>>      else
>>          amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
>
> Related to my initial comment at the top: It wants clarifying for example
> why "setup" needs the lock held, but "disable" doesn't.
>

Because amd_iommu_disable_domain_device() does not access d->pdev_list,
while amd_iommu_setup_domain_device() does.

Anyway, I am interested in the AMD IOMMU maintainer's opinion here: what
is the correct scope for the lock?

>>      if ( devfn == pdev->devfn && pdev->domain != target )
>>      {
>> -        list_move(&pdev->domain_list, &target->pdev_list);
>> -        pdev->domain = target;
>> +        write_lock(&pdev->domain->pci_lock);
>
> Shorter as write_lock(&source->pci_lock)? (Also in the VT-d counterpart
> then.)

Ah yes, sure.
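
So the move would look roughly like this (sketch; splitting list_move()
into list_del()/list_add() is how I plan to keep each list under its own
lock):

    if ( devfn == pdev->devfn && pdev->domain != target )
    {
        write_lock(&source->pci_lock);
        list_del(&pdev->domain_list);
        write_unlock(&source->pci_lock);

        pdev->domain = target;

        write_lock(&target->pci_lock);
        list_add(&pdev->domain_list, &target->pdev_list);
        write_unlock(&target->pci_lock);
    }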

>
>> @@ -748,7 +750,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>      if ( !pdev->domain )
>>      {
>>          pdev->domain = hardware_domain;
>> +        write_lock(&hardware_domain->pci_lock);
>>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>> +        write_unlock(&hardware_domain->pci_lock);
>
> What about the companion pci_remove_device()?

Missed this. Thanks.
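
It needs the matching hunk around its list_del(), roughly (sketch):

    /* In pci_remove_device(): */
    write_lock(&pdev->domain->pci_lock);
    list_del(&pdev->domain_list);
    write_unlock(&pdev->domain->pci_lock);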

[...]

>> @@ -2765,6 +2767,7 @@ static int cf_check reassign_device_ownership(
>>  
>>      if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
>>      {
>> +        read_lock(&target->pci_lock);
>>          if ( !has_arch_pdevs(target) )
>>              vmx_pi_hooks_assign(target);
>
> I'm afraid this and the unhook side locking isn't sufficient to guarantee
> no races. Things still depend on the domctl and/or pcidevs lock being
> held around this.

I have no intention of dropping the pcidevs lock at this time.
Honestly, I am not sure we will be able to do that without a major
rework of the IOMMU code.

> At which point acquiring the lock here (and below) is
> of questionable value. In any event I think this warrants code comments.

Well, it would be good to take the lock for the first half of the
function, where we deal with `target`, but we are also accessing
`source` at the same time. To prevent an ABBA deadlock I opted for a
number of finer-grained lock acquisitions.

As for "questionable value", I am agree with you. But, if we want to
protect/serialize access to d->pdev_list, we need to use lock there.

> Possibly the same also applies to check_cleanup_domid_map() and friends.
>
>> @@ -2780,21 +2783,26 @@ static int cf_check reassign_device_ownership(
>>  #endif
>>  
>>          ret = domain_context_mapping(target, devfn, pdev);
>> +        read_unlock(&target->pci_lock);
>
> Other calls to domain_context_mapping() aren't wrapped like this. Same
> for domain_context_unmap(), wrapped exactly once below.
>

Will add.
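
I.e. the same pattern at each call site, e.g. for the unmap path
(sketch; which domain's lock is the right one at every site still needs
auditing):

    read_lock(&source->pci_lock);
    ret = domain_context_unmap(source, devfn, pdev);
    read_unlock(&source->pci_lock);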

>>          if ( !ret && pdev->devfn == devfn &&
>>               !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) )
>>          {
>>             const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
>>  
>> +            read_lock(&source->pci_lock);
>>              if ( drhd )
>>                  check_cleanup_domid_map(source, pdev, drhd->iommu);
>> +            read_unlock(&source->pci_lock);
>
> Acquiring the lock inside the if() ought to suffice here.
>
> Jan


Roger, what is your opinion on this? If you remember, you proposed
extending vpci_lock to protect d->pdev_list as well, to deal with the
potential ABBA issue in modify_bars(). But as you can see, this either
leads to another ABBA deadlock in reassign_device_ownership() or (as Jan
pointed out) makes the new lock of questionable value in some cases.

-- 
WBR, Volodymyr

 

