
Re: [RFC PATCH] pci: introduce per-domain PCI rwlock


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 12 Jul 2023 11:15:04 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Oleksandr Tyshchenko <olekstysh@xxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 12 Jul 2023 09:15:21 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Up front remark: I'm sorry for some possibly unhelpful replies below. I,
for one, am of the opinion that some of the issues you ask about are to
be looked into by whoever wants / needs to rework the locking model.
After all, this (likely) is why nobody has dared to make an attempt before
the need became apparent.

On 11.07.2023 20:40, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@xxxxxxxx> writes:
>> On 11.07.2023 02:46, Volodymyr Babchuk wrote:
>>> Add per-domain d->pci_lock that protects access to
>>> d->pdev_list. Purpose of this lock is to give guarantees to VPCI code
>>> that underlying pdev will not disappear under feet. Later it will also
>>> protect pdev->vpci structure and pdev access in modify_bars().
>>>
>>> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@xxxxxxxx>
>>>
>>> ---
>>>
>>> This patch should be part of the VPCI series, but I am posting it as a
>>> single-patch RFC to discuss changes to x86 MM and IOMMU code.
>>
>> To aid review / judgement extending the commit message would help, to
>> outline around which function invocations (and for what reason) the
>> lock now needs to be held. Furthermore lock nesting rules want writing
>> down (perhaps next to the declaration of the lock). Therefore comments
>> below are merely preliminary and likely incomplete.
> 
> I added the lock in places where the underlying code touches d->pdev_list. My
> intention was to lock the parts of code that might depend on the list
> contents. This is straightforward in cases where we are traversing the list, but
> it is much more complicated (for me at least) in cases where the
> has_arch_pdevs() macro is involved. Prior to my patch, uses of
> has_arch_pdevs() weren't protected by the PCI lock at all. This begs the
> question: do we need to protect it now? And if we do, which portion
> of the code needs to be protected? I did my best trying to isolate the
> affected parts of the code.

Well, yes - these questions need answering. And since you're proposing
these code changes, your present understanding wants writing down, such
that (a) we can use that to make corrections to the (intended) model
and (b) we can match intentions with actual implementation.
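
To give an idea of the kind of write-up I'm after next to the lock's
declaration (merely a sketch - the actual rules are for you to fill in;
only the field name is taken from your patch):

    /*
     * Protects d->pdev_list (and later also pdev->vpci for devices on
     * that list).  Readers iterate / inspect the list, writers add or
     * remove entries.  Nesting: e.g. taken inside pcidevs_lock where
     * both are needed; the full set of rules wants spelling out here.
     */
    rwlock_t pci_lock;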

>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -2381,12 +2381,14 @@ int hvm_set_cr0(unsigned long value, bool may_defer)
>>>          }
>>>      }
>>>  
>>> +    read_lock(&d->pci_lock);
>>>      if ( ((value ^ old_value) & X86_CR0_CD) &&
>>>           is_iommu_enabled(d) && hvm_funcs.handle_cd &&
>>>           (!rangeset_is_empty(d->iomem_caps) ||
>>>            !rangeset_is_empty(d->arch.ioport_caps) ||
>>>            has_arch_pdevs(d)) )
>>>          alternative_vcall(hvm_funcs.handle_cd, v, value);
>>> +    read_unlock(&d->pci_lock);
>>
>> handle_cd() is non-trivial - did you audit it for safety of
>> holding a lock around it?
> 
> Well, only vmx_handle_cd() implements this call. I scanned through it
> and didn't find any other PCI-related things inside. It acquires
> v->arch.hvm.vmx.vmcs_lock, but I didn't find potential for deadlocks.

What about overall lock-holding time, which may affect other CPUs and
hence other security contexts?

> On the other hand - do we really need to call it under d->pci_lock? What bad
> things will happen if has_arch_pdevs(d) becomes false during handle_cd()
> execution?

Much like with log-dirty enabling, the main question is what existing
races there may be plus whether things are at least not being made worse.
(Ideally of course by introducing better locking, races would go away if
any exist.) IOW here it would certainly be better to drop the lock before
doing expensive work, but then guarantees are needed that
- the state checked can't change until after the operation is complete, or
- the state changing is benign.
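
Just to illustrate the shape this could take in hvm_set_cr0() (a sketch
only, and of course only valid if one of the two guarantees above can
actually be established):

    bool flush;

    read_lock(&d->pci_lock);
    flush = ((value ^ old_value) & X86_CR0_CD) &&
            is_iommu_enabled(d) && hvm_funcs.handle_cd &&
            (!rangeset_is_empty(d->iomem_caps) ||
             !rangeset_is_empty(d->arch.ioport_caps) ||
             has_arch_pdevs(d));
    read_unlock(&d->pci_lock);

    if ( flush )
        alternative_vcall(hvm_funcs.handle_cd, v, value);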

>>> --- a/xen/arch/x86/mm/p2m-pod.c
>>> +++ b/xen/arch/x86/mm/p2m-pod.c
>>> @@ -349,10 +349,12 @@ p2m_pod_set_mem_target(struct domain *d, unsigned long target)
>>>  
>>>      ASSERT( pod_target >= p2m->pod.count );
>>>  
>>> +    read_lock(&d->pci_lock);
>>>      if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
>>>          ret = -ENOTEMPTY;
>>>      else
>>>          ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);
>>> +    read_unlock(&d->pci_lock);
>>
>> Hmm, is it necessary to hold the lock across the function call?
> 
> Well, I am not sure. Will it be okay to just check has_arch_pdevs()
> while holding the lock? What if it changes its result in the next
> instant?

PoD and pass-through are incompatible with one another (just like
global log-dirty tracking is). Therefore this and the other side
(like also above for handle_cd(), and like for log-dirty below) need
to make sure that a state change either can't occur or (not
applicable here afaict) is benign. As outlined for log-dirty in the
earlier reply, this may involve doing part of the operation under
lock, until it is safe to release the lock (and yes, below for
log-dirty you validly say this is somewhat fragile, but what do you
do).
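
In other words, the assignment side would then also need to do its check
under the lock, along these lines (pod_in_use() is a purely illustrative
name here, and where exactly such a check would live is part of what
wants writing down):

    write_lock(&d->pci_lock);
    if ( pod_in_use(d) )                /* illustrative predicate only */
        rc = -EBUSY;
    else
        list_add(&pdev->domain_list, &d->pdev_list);
    write_unlock(&d->pci_lock);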

>>> --- a/xen/arch/x86/mm/paging.c
>>> +++ b/xen/arch/x86/mm/paging.c
>>> @@ -205,21 +205,27 @@ static int paging_log_dirty_enable(struct domain *d)
>>>  {
>>>      int ret;
>>>  
>>> +    read_lock(&d->pci_lock);
>>>      if ( has_arch_pdevs(d) )
>>>      {
>>>          /*
>>>           * Refuse to turn on global log-dirty mode
>>>           * if the domain is sharing the P2M with the IOMMU.
>>>           */
>>> +        read_unlock(&d->pci_lock);
>>>          return -EINVAL;
>>>      }
>>>  
>>>      if ( paging_mode_log_dirty(d) )
>>> +    {
>>> +        read_unlock(&d->pci_lock);
>>>          return -EINVAL;
>>> +    }
>>>  
>>>      domain_pause(d);
>>>      ret = d->arch.paging.log_dirty.ops->enable(d);
>>>      domain_unpause(d);
>>> +    read_unlock(&d->pci_lock);
>>
>> This means a relatively long potential lock holding time. I wonder
>> whether lock release shouldn't be delegated to the ->enable() hook,
>> as it could do so immediately after setting the flag that would
>> then prevent assignment of devices.
> 
> For me it looks a bit fragile: we need to rely on some hook to release a
> lock that wasn't acquired by said hook. But I can do this. It
> should be released after setting PG_log_dirty, correct?

Yes (s/should/could/).
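
I.e. (sketch only; error handling and the HAP vs shadow split left
aside) something along the lines of

    /* In the ->enable() hook: */
    paging_lock(d);
    d->arch.paging.mode |= PG_log_dirty;
    paging_unlock(d);

    /* Device assignment can observe log-dirty as enabled from here on. */
    read_unlock(&d->pci_lock);   /* acquired by paging_log_dirty_enable() */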

> BTW, I can see that hap_enable_log_dirty() uses
> read_atomic(&p2m->ioreq.entry_count), but p2m_entry_modify() does just
> p2m->ioreq.entry_count++ and p2m->ioreq.entry_count--;
> This looks inconsistent. Also, looks like hap_enable_log_dirty() does
> not hold &p2m->ioreq.lock while accessing entry_count, so its value can
> change right after read_atomic().

I'm afraid if you look closely you'll find many such inconsistencies.

>>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>> @@ -102,6 +102,8 @@ static bool any_pdev_behind_iommu(const struct domain *d,
>>>  {
>>>      const struct pci_dev *pdev;
>>>  
>>> +    ASSERT(rw_is_locked(&d->pci_lock));
>>> +
>>>      for_each_pdev ( d, pdev )
>>>      {
>>>          if ( pdev == exclude )
>>> @@ -467,17 +469,24 @@ static int cf_check reassign_device(
>>>  
>>>      if ( !QUARANTINE_SKIP(target, pdev) )
>>>      {
>>> +   read_lock(&target->pci_lock);
>>>          rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
>>>          if ( rc )
>>>              return rc;
>>> +   read_unlock(&target->pci_lock);
>>
>> You need to drop the lock before the if().
> 
> Yes, thanks.
> 
>>
>> Also nit: No hard tabs here please.
>>
>>>      }
>>>      else
>>>          amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
>>
>> Related to my initial comment at the top: It wants clarifying for example
>> why "setup" needs to lock held, but "disable" doesn't.
>>
> 
> Because amd_iommu_disable_domain_device() does not access d->pdev_list,
> while amd_iommu_setup_domain_device() does.

I was guessing that might be the reason, but to be honest while looking
at the function I can't spot that access. I clearly must be overlooking
something, which may be that the access is in a called function. Yet as
soon as this isn't obvious, a code comment can make a significant
difference.
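
E.g. something along the lines of (taking your explanation at face
value, and with the lock dropped before the if(), as noted earlier):

    read_lock(&target->pci_lock);
    /*
     * amd_iommu_setup_domain_device() (indirectly) iterates
     * target->pdev_list; amd_iommu_disable_domain_device() below doesn't,
     * hence no locking around that call.
     */
    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
    read_unlock(&target->pci_lock);
    if ( rc )
        return rc;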

> Anyway, I am interested in the AMD IOMMU maintainer's opinion here - what
> is the correct scope for the lock?

To determine that (and to save readers like me from re-doing the work
you must have done already) is why I gave the earlier comment.

>>> @@ -2765,6 +2767,7 @@ static int cf_check reassign_device_ownership(
>>>  
>>>      if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
>>>      {
>>> +        read_lock(&target->pci_lock);
>>>          if ( !has_arch_pdevs(target) )
>>>              vmx_pi_hooks_assign(target);
>>
>> I'm afraid this and the unhook side locking isn't sufficient to guarantee
>> no races. Things still depend on the domctl and/or pcidevs lock being
>> held around this.
> 
> I have no intention to drop the pcidevs lock at this time. Honestly, I am
> not sure that we will be able to do this without a major rework of the
> IOMMU code.

Of course, and my remark wasn't intended to hint in such a direction.
Instead ...

>> As which points acquiring the lock here (and below) is
>> of questionable value. In any event I think this warrants code comments.
> 
> Well, it would be good to take the lock for the first half of the function
> where we deal with `target`, but we are also accessing `source` at the same
> time. To prevent an ABBA deadlock I opted for a number of finer-grained lock
> acquisitions.
> 
> As for "questionable value", I agree with you. But, if we want to
> protect/serialize access to d->pdev_list, we need to use the lock there.

... I did ask to assess whether acquiring the lock (without other
locking being changed) is useful, and to put down the result of this in a
comment - whether or not the lock acquire is retained here. IOW in
one case the comment may say "lock not acquired here because ..."
whereas in the other case the comment might be "lock acquired here
despite ..."
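
For example (the exact wording of course depending on what the
assessment yields):

    /*
     * target->pci_lock is taken here even though pcidevs_lock already
     * serializes assignment: it additionally protects readers of
     * target->pdev_list which don't hold pcidevs_lock, e.g. the vPCI
     * handlers.
     */
    read_lock(&target->pci_lock);
    if ( !has_arch_pdevs(target) )
        vmx_pi_hooks_assign(target);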

Jan



 

