
Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci



Hello,

On 04.02.22 16:57, Roger Pau Monné wrote:
> On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 04.02.22 15:06, Roger Pau Monné wrote:
>>> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 14:47, Jan Beulich wrote:
>>>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 13:37, Jan Beulich wrote:
>>>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>                        continue;
>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>>>>>> +        {
>>>>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>> +            continue;
>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>                for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>>>>>                {
>>>>>>>>>>>>>>                    const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>                    rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>>>>>                    if ( rc )
>>>>>>>>>>>>>>                    {
>>>>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>                        printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>>>>>                               start, end, rc);
>>>>>>>>>>>>>>                        rangeset_destroy(mem);
>>>>>>>>>>>>>>                        return rc;
>>>>>>>>>>>>>>                    }
>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>            }
>>>>>>>>>>>>> At the first glance this simply looks like another unjustified 
>>>>>>>>>>>>> (in the
>>>>>>>>>>>>> description) change, as you're not converting anything here but 
>>>>>>>>>>>>> you
>>>>>>>>>>>>> actually add locking (and I realize this was there before, so I'm 
>>>>>>>>>>>>> sorry
>>>>>>>>>>>>> for not pointing this out earlier).
>>>>>>>>>>>> Well, I thought that the description already has "...the lock can 
>>>>>>>>>>>> be
>>>>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>>>>>>        But then I wonder whether you
>>>>>>>>>>>>> actually tested this, since I can't help getting the impression 
>>>>>>>>>>>>> that
>>>>>>>>>>>>> you're introducing a live-lock: The function is called from 
>>>>>>>>>>>>> cmd_write()
>>>>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). 
>>>>>>>>>>>>> Yet that
>>>>>>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - 
>>>>>>>>>>>>> otoh
>>>>>>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>>>>>>> then we'll deadlock.
>>>>>>>>>>>>
>>>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>>>>>> if tmp != pdev
>>>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock 
>>>>>>>>>>> potential
>>>>>>>>>>> between the two locks.
>>>>>>>>>> I am not sure I can suggest a better solution here
>>>>>>>>>> @Roger, @Jan, could you please help here?
>>>>>>>>> Well, first of all I'd like to mention that while it may have been 
>>>>>>>>> okay to
>>>>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when 
>>>>>>>>> dealing
>>>>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to 
>>>>>>>>> the
>>>>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>>>>>>> there it probably wants to be a try-lock.
>>>>>>>>>
>>>>>>>>> Next I'd like to point out that here we have the still pending issue 
>>>>>>>>> of
>>>>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC 
>>>>>>>>> patch
>>>>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the 
>>>>>>>>> solution
>>>>>>>>> here, I think it wants to at least account for the extra need there.
>>>>>>>> Yes, sorry, I should take care of that.
>>>>>>>>
>>>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with 
>>>>>>>>> avoiding
>>>>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would 
>>>>>>>>> be
>>>>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>>>>>>> would acquire it in read mode, and here you'd acquire it in write 
>>>>>>>>> mode (in
>>>>>>>>> the former case around the vpci lock, while in the latter case there 
>>>>>>>>> may
>>>>>>>>> then not be any need to acquire the individual vpci locks at all). 
>>>>>>>>> FTAOD:
>>>>>>>>> I haven't fully thought through all implications (and hence whether 
>>>>>>>>> this is
>>>>>>>>> viable in the first place); I expect you will, documenting what you've
>>>>>>>>> found in the resulting patch description. Of course the double lock
>>>>>>>>> acquire/release would then likely want hiding in helper functions.
>>>>>>>> I've been also thinking about this, and whether it's really worth to
>>>>>>>> have a per-device lock rather than a per-domain one that protects all
>>>>>>>> vpci regions of the devices assigned to the domain.
>>>>>>>>
>>>>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>>>>>>> and the only place I could see a benefit of having per-device locks is
>>>>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>>>>>>> likely very performance sensitive, so adding a per-domain lock there
>>>>>>>> could be a bottleneck.
>>>>>>> Hmm, with method 1 accesses serializing globally is basically
>>>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>>>>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>>>>>> only at device level. See our own pci_config_lock, which applies to
>>>>>>> only method 1 accesses; we don't look to be serializing MMCFG
>>>>>>> accesses at all.
>>>>>>>
>>>>>>>> We could alternatively do a per-domain rwlock for vpci and special case
>>>>>>>> the MSI-X area to also have a per-device specific lock. At which point
>>>>>>>> it becomes fairly similar to what you propose.
>>>>>> @Jan, @Roger
>>>>>>
>>>>>> 1. d->vpci_lock - rwlock <- this protects vpci
>>>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>>>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>>>>>> really depend on vPCI?
>>>>> If so, perhaps indeed better the latter. But as said in reply to Roger,
>>>>> I'm not convinced (yet) that doing away with the per-device lock is a
>>>>> good move. As said there - we're ourselves doing fully parallel MMCFG
>>>>> accesses, so OSes ought to be fine to do so, too.
>>>> But with pdev->vpci_lock we face ABBA...
>>> I think it would be easier to start with a per-domain rwlock that
>>> guarantees pdev->vpci cannot be removed under our feet. This would be
>>> taken in read mode in vpci_{read,write} and in write mode when
>>> removing a device from a domain.
>>>
>>> Then there are also other issues regarding vPCI locking that need to
>>> be fixed, but that lock would likely be a start.
>> Or let's see the problem at a different angle: this is the only place
>> which breaks the use of pdev->vpci_lock. Because all other places
>> do not try to acquire the lock of any two devices at a time.
>> So, what if we re-work the offending piece of code instead?
>> That way we do not break parallel access and have the lock per-device
>> which might also be a plus.
>>
>> By re-work I mean, that instead of reading already mapped regions
>> from tmp we can employ a d->pci_mapped_regions range set which
>> will hold all the already mapped ranges. And when it is needed to access
>> that range set we use pcidevs_lock which seems to be rare.
>> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
>> ABBA won't be possible at all.
> Sadly that won't replace the usage of the loop in modify_bars. This is
> not (exclusively) done in order to prevent mapping the same region
> multiple times, but rather to prevent unmapping of regions as long as
> there's an enabled BAR that's using it.
>
> If you wanted to use something like d->pci_mapped_regions it would
> have to keep reference counts to regions, in order to know when a
> mapping is no longer required by any BAR on the system with memory
> decoding enabled.
I missed this path, thank you

I tried to analyze the locking in pci/vpci.

First of all, some context to recall the goal: the rationale behind moving
pdev->vpci->lock outside of struct vpci is to be able to create and destroy
pdev->vpci dynamically. For that reason the lock needs to live outside of
pdev->vpci.
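
In code terms, a minimal sketch of the intended layout (other fields omitted;
the exact declarations in the patch may differ slightly):

    struct pci_dev {
        /* ... */
        spinlock_t vpci_lock;  /* moved out of struct vpci: stays valid while
                                  pdev->vpci is created/destroyed */
        struct vpci *vpci;     /* may now be allocated and freed at run time */
    };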

Some of the callers of the vPCI code and the locking they use:

======================================
vpci_mmio_read/vpci_mmcfg_read
======================================
   - vpci_ecam_read
   - vpci_read
    !!!!!!!! the pdev is looked up, then pdev->vpci_lock is taken !!!!!!!!
    - msix:
     - control_read
    - header:
     - guest_bar_read
    - msi:
     - control_read
     - address_read/address_hi_read
     - data_read
     - mask_read

======================================
vpci_mmio_write/vpci_mmcfg_write
======================================
   - vpci_ecam_write
   - vpci_write
    !!!!!!!! the pdev is looked up, then pdev->vpci_lock is taken !!!!!!!!
    - msix:
     - control_write
    - header:
     - bar_write/guest_bar_write
     - cmd_write/guest_cmd_write
     - rom_write
      - all of the above write handlers may call modify_bars
    - msi:
     - control_write
     - address_write/address_hi_write
     - data_write
     - mask_write

======================================
pci_add_device: locked with pcidevs_lock
======================================
   - vpci_add_handlers
    ++++++++ pdev->vpci_lock is used ++++++++

======================================
pci_remove_device: locked with pcidevs_lock
======================================
- vpci_remove_device
   ++++++++ pdev->vpci_lock is used ++++++++
- pci_cleanup_msi
- free_pdev

======================================
XEN_DOMCTL_assign_device: locked with pcidevs_lock
======================================
- assign_device
  - vpci_deassign_device
  - pdev_msix_assign
  - vpci_assign_device
   - vpci_add_handlers
     ++++++++ pdev->vpci_lock is used ++++++++

======================================
XEN_DOMCTL_deassign_device: locked with pcidevs_lock
======================================
- deassign_device
  - vpci_deassign_device
    ++++++++ pdev->vpci_lock is used ++++++++
   - vpci_remove_device


======================================
modify_bars is a special case: it is the only function which tries to lock
two pci_dev devices at a time. It does so to check for overlaps with other
BARs which may already have been mapped or unmapped.

So, this is the only case which may deadlock because of pci_dev->vpci_lock
(see the simplified call chain right after this section).
======================================
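
For illustration, the simplified call chain with the hunk quoted above is:

    vpci_write(sbdf, reg, size, data)
        spin_lock(&pdev->vpci_lock);              /* lock taken here */
        -> cmd_write() / rom_write()
           -> modify_bars(pdev, cmd, rom_only)
                for_each_pdev ( pdev->domain, tmp )
                    spin_lock(&tmp->vpci_lock);   /* deadlocks when tmp == pdev */
                    ...
                    spin_unlock(&tmp->vpci_lock);

Making the inner lock conditional on tmp != pdev avoids the self-deadlock, but
then two CPUs writing to two different devices of the same domain may take the
two locks in opposite order, which is the ABBA case mentioned above.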

Bottom line:
======================================

1. vpci_{read|write} are not protected with pcidevs_lock and can run in
parallel with pci_remove_device, which can free the pdev after vpci_{read|write}
has acquired the pdev pointer. This may lead to a use-after-free when the pdev
is dereferenced.

So, to protect the pdev dereference, vpci_{read|write} must also take
pcidevs_lock.
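
Roughly, the idea is (a sketch only; helper names follow the existing code but
may not be exact):

    /* In vpci_{read|write}: guard the pdev lookup and the vpci access with
     * pcidevs_lock so that pci_remove_device cannot free the pdev (and its
     * vpci) underneath us. */
    pcidevs_lock();
    pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
    if ( !pdev )
    {
        pcidevs_unlock();
        return vpci_read_hw(sbdf, reg, size); /* fall back to hw access */
    }
    spin_lock(&pdev->vpci_lock);
    /* ... dispatch to the per-register handlers ... */
    spin_unlock(&pdev->vpci_lock);
    pcidevs_unlock();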

2. The only offending place which stands in the way of pci_dev->vpci_lock is
modify_bars. If it can be re-worked to track the already mapped and unmapped
regions, then the possible deadlock is avoided and pci_dev->vpci_lock can be
used (plain rangesets won't help here, as reference counting would also need
to be implemented).

If pcidevs_lock is used in vpci_{read|write}, then no deadlock is possible,
but the modify_bars code must still be re-worked not to lock against itself
(i.e. not to take both pdev->vpci_lock and tmp->vpci_lock when pdev == tmp;
this is a minor change).

3. We may think about a per-domain rwlock combined with pdev->vpci_lock; this
solves modify_bars accessing two pdevs (a rough sketch of this option follows
below). But it doesn't solve the possible pdev de-reference in vpci_{read|write}
racing with pci_remove_device.
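
For completeness, the per-domain rwlock option would roughly look like this
(d->vpci_rwlock is a made-up name here):

    /* Per-device paths take the lock in read mode plus the per-device lock. */
    vpci_{read|write}:
        read_lock(&d->vpci_rwlock);
        spin_lock(&pdev->vpci_lock);
        /* ... access pdev->vpci ... */
        spin_unlock(&pdev->vpci_lock);
        read_unlock(&d->vpci_rwlock);

    /* modify_bars (and removing a device from the domain) takes it in write
     * mode, so all of the domain's pdev->vpci can be looked at without taking
     * the individual vpci locks. */
    modify_bars / vpci_remove_device:
        write_lock(&d->vpci_rwlock);
        /* ... */
        write_unlock(&d->vpci_rwlock);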

@Roger, @Jan, I would like to hear what you think about the above analysis
and how we can proceed with the locking re-work.

Thank you in advance,
Oleksandr

 

