Xen project Mailing List

Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci

To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>

From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>

Date: Mon, 7 Feb 2022 13:53:34 +0000

Accept-language: en-US

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=none; dmarc=none; dkim=none; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=i9BpWkJI0qKeVQk+iJq1PdNxbNzJBvxpV5JA6TXAlYQ=; b=CjLXIn5OfpvTwNoc5P6uXLF14mWY+Hgbpqn4GCBm0gMqjTb4cInh+oJg510MeJ3C0aY4TTva/o7s0FRQyQENedMrCr3/rQwuCsST4CAbUzFJe1vUFLA/MzwDyZ/U11rH4oL5Ojd5puAk38QAJNHO3+5wTkkohYtGvDj7IHCh7WJAN1wSXi6F5cbFCW8Qb7lRCI8FnPwLSDcldEtd/6fUHJ/CafHjN/jQRCMire/kryQFsgggw4DCgjv5NagkRPa3xBd7UOFBHL8ODeAZAMdbdBjJZAQgT1L05qoMIx44qbNiy/FcXVaoLUQ4zfmJIopJu05tBGphUwhvioycfKhCsA==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=mxtcviru0bND9VXyWEcTB6LEdiK6Ddy1lLopw7uDYn7B8x3ddCkacQbTYM7Ie6VLx906o2eEHA5iTX7I9OTvYcIO5VQ2CxJHBwoECze1Ian+7+tgpvKHkT8P/LW+hME4wkOu3oyEAWstgaobz+ihNlvt20f0ViHSBWc7gpTP9MxK66HhpdnaROc+Tg2w+gWmyUU9m3co8LOt3UoccaDyVF5wJIbyvgfCCzxHFdj0S9IBtEfoiTLGEJ/R14FYekHZeZSSMWSU32yy7Z7CcZf8UZGxDYnv95VALf2UsjUNlp4TCHpXgoGRvlhzQfAnDd1RxIDnTGiLjOJN+z7Fds0gKQ==

Cc: "julien@xxxxxxx" <julien@xxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Artem Mygaiev <Artem_Mygaiev@xxxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>, "george.dunlap@xxxxxxxxxx" <george.dunlap@xxxxxxxxxx>, "paul@xxxxxxx" <paul@xxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>

Delivery-date: Mon, 07 Feb 2022 13:53:53 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Thread-index: AQHYGZFc/MnzQOjwVEeBBUHLSW0md6yDBUkAgAASSACAAATYAIAAD/WAgAAKNgCAAAbfgIAABnuAgAAQvgCAAAMCAIAAAY4AgAADxICAABrnAIAABAgAgAR3CoCAABt5gIAAEpuA

Thread-topic: [PATCH v6 03/13] vpci: move lock outside of struct vpci

On 07.02.22 14:46, Roger Pau Monné wrote: > On Mon, Feb 07, 2022 at 11:08:39AM +0000, Oleksandr Andrushchenko wrote: >> Hello, >> >> On 04.02.22 16:57, Roger Pau Monné wrote: >>> On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote: >>>> On 04.02.22 15:06, Roger Pau Monné wrote: >>>>> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote: >>>>>> On 04.02.22 14:47, Jan Beulich wrote: >>>>>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote: >>>>>>>> On 04.02.22 13:37, Jan Beulich wrote: >>>>>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote: >>>>>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote: >>>>>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote: >>>>>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote: >>>>>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote: >>>>>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote: >>>>>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote: >>>>>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct >>>>>>>>>>>>>>>> pci_dev *pdev, uint16_t cmd, bool rom_only) >>>>>>>>>>>>>>>> continue; >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> + spin_lock(&tmp->vpci_lock); >>>>>>>>>>>>>>>> + if ( !tmp->vpci ) >>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>> + spin_unlock(&tmp->vpci_lock); >>>>>>>>>>>>>>>> + continue; >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> for ( i = 0; i < >>>>>>>>>>>>>>>> ARRAY_SIZE(tmp->vpci->header.bars); i++ ) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> const struct vpci_bar *bar = >>>>>>>>>>>>>>>> &tmp->vpci->header.bars[i]; >>>>>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct >>>>>>>>>>>>>>>> pci_dev *pdev, uint16_t cmd, bool rom_only) >>>>>>>>>>>>>>>> rc = rangeset_remove_range(mem, start, >>>>>>>>>>>>>>>> end); >>>>>>>>>>>>>>>> if ( rc ) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> + spin_unlock(&tmp->vpci_lock); >>>>>>>>>>>>>>>> printk(XENLOG_G_WARNING "Failed to >>>>>>>>>>>>>>>> remove [%lx, %lx]: %d\n", >>>>>>>>>>>>>>>> start, end, rc); >>>>>>>>>>>>>>>> rangeset_destroy(mem); >>>>>>>>>>>>>>>> return rc; >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> + spin_unlock(&tmp->vpci_lock); >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> At the first glance this simply looks like another unjustified >>>>>>>>>>>>>>> (in the >>>>>>>>>>>>>>> description) change, as you're not converting anything here but >>>>>>>>>>>>>>> you >>>>>>>>>>>>>>> actually add locking (and I realize this was there before, so >>>>>>>>>>>>>>> I'm sorry >>>>>>>>>>>>>>> for not pointing this out earlier). >>>>>>>>>>>>>> Well, I thought that the description already has "...the lock >>>>>>>>>>>>>> can be >>>>>>>>>>>>>> used (and in a few cases is used right away) to check whether >>>>>>>>>>>>>> vpci >>>>>>>>>>>>>> is present" and this is enough for such uses as here. >>>>>>>>>>>>>>> But then I wonder whether you >>>>>>>>>>>>>>> actually tested this, since I can't help getting the impression >>>>>>>>>>>>>>> that >>>>>>>>>>>>>>> you're introducing a live-lock: The function is called from >>>>>>>>>>>>>>> cmd_write() >>>>>>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). >>>>>>>>>>>>>>> Yet that >>>>>>>>>>>>>>> function already holds the lock, and the lock is not (currently) >>>>>>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - >>>>>>>>>>>>>>> otoh >>>>>>>>>>>>>>> the locking looks to be entirely unnecessary.) >>>>>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to >>>>>>>>>>>>>> acquire >>>>>>>>>>>>>> the lock. But if tmp == pdev and rom_only == true >>>>>>>>>>>>>> then we'll deadlock. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock >>>>>>>>>>>>>> if tmp != pdev >>>>>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock >>>>>>>>>>>>> potential >>>>>>>>>>>>> between the two locks. >>>>>>>>>>>> I am not sure I can suggest a better solution here >>>>>>>>>>>> @Roger, @Jan, could you please help here? >>>>>>>>>>> Well, first of all I'd like to mention that while it may have been >>>>>>>>>>> okay to >>>>>>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when >>>>>>>>>>> dealing >>>>>>>>>>> with DomU-s' lists of PCI devices. The requirement really applies >>>>>>>>>>> to the >>>>>>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except >>>>>>>>>>> that >>>>>>>>>>> there it probably wants to be a try-lock. >>>>>>>>>>> >>>>>>>>>>> Next I'd like to point out that here we have the still pending >>>>>>>>>>> issue of >>>>>>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC >>>>>>>>>>> patch >>>>>>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the >>>>>>>>>>> solution >>>>>>>>>>> here, I think it wants to at least account for the extra need there. >>>>>>>>>> Yes, sorry, I should take care of that. >>>>>>>>>> >>>>>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with >>>>>>>>>>> avoiding >>>>>>>>>>> the deadlock, as it's imo not an option at all to acquire that lock >>>>>>>>>>> everywhere else you access ->vpci (or else the vpci lock itself >>>>>>>>>>> would be >>>>>>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other >>>>>>>>>>> paths >>>>>>>>>>> would acquire it in read mode, and here you'd acquire it in write >>>>>>>>>>> mode (in >>>>>>>>>>> the former case around the vpci lock, while in the latter case >>>>>>>>>>> there may >>>>>>>>>>> then not be any need to acquire the individual vpci locks at all). >>>>>>>>>>> FTAOD: >>>>>>>>>>> I haven't fully thought through all implications (and hence whether >>>>>>>>>>> this is >>>>>>>>>>> viable in the first place); I expect you will, documenting what >>>>>>>>>>> you've >>>>>>>>>>> found in the resulting patch description. Of course the double lock >>>>>>>>>>> acquire/release would then likely want hiding in helper functions. >>>>>>>>>> I've been also thinking about this, and whether it's really worth to >>>>>>>>>> have a per-device lock rather than a per-domain one that protects all >>>>>>>>>> vpci regions of the devices assigned to the domain. >>>>>>>>>> >>>>>>>>>> The OS is likely to serialize accesses to the PCI config space >>>>>>>>>> anyway, >>>>>>>>>> and the only place I could see a benefit of having per-device locks >>>>>>>>>> is >>>>>>>>>> in the handling of MSI-X tables, as the handling of the mask bit is >>>>>>>>>> likely very performance sensitive, so adding a per-domain lock there >>>>>>>>>> could be a bottleneck. >>>>>>>>> Hmm, with method 1 accesses serializing globally is basically >>>>>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move >>>>>>>>> to) permit(ting) parallel accesses, with serialization perhaps done >>>>>>>>> only at device level. See our own pci_config_lock, which applies to >>>>>>>>> only method 1 accesses; we don't look to be serializing MMCFG >>>>>>>>> accesses at all. >>>>>>>>> >>>>>>>>>> We could alternatively do a per-domain rwlock for vpci and special >>>>>>>>>> case >>>>>>>>>> the MSI-X area to also have a per-device specific lock. At which >>>>>>>>>> point >>>>>>>>>> it becomes fairly similar to what you propose. >>>>>>>> @Jan, @Roger >>>>>>>> >>>>>>>> 1. d->vpci_lock - rwlock <- this protects vpci >>>>>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables >>>>>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't >>>>>>>> really depend on vPCI? >>>>>>> If so, perhaps indeed better the latter. But as said in reply to Roger, >>>>>>> I'm not convinced (yet) that doing away with the per-device lock is a >>>>>>> good move. As said there - we're ourselves doing fully parallel MMCFG >>>>>>> accesses, so OSes ought to be fine to do so, too. >>>>>> But with pdev->vpci_lock we face ABBA... >>>>> I think it would be easier to start with a per-domain rwlock that >>>>> guarantees pdev->vpci cannot be removed under our feet. This would be >>>>> taken in read mode in vpci_{read,write} and in write mode when >>>>> removing a device from a domain. >>>>> >>>>> Then there are also other issues regarding vPCI locking that need to >>>>> be fixed, but that lock would likely be a start. >>>> Or let's see the problem at a different angle: this is the only place >>>> which breaks the use of pdev->vpci_lock. Because all other places >>>> do not try to acquire the lock of any two devices at a time. >>>> So, what if we re-work the offending piece of code instead? >>>> That way we do not break parallel access and have the lock per-device >>>> which might also be a plus. >>>> >>>> By re-work I mean, that instead of reading already mapped regions >>>> from tmp we can employ a d->pci_mapped_regions range set which >>>> will hold all the already mapped ranges. And when it is needed to access >>>> that range set we use pcidevs_lock which seems to be rare. >>>> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and >>>> ABBA won't be possible at all. >>> Sadly that won't replace the usage of the loop in modify_bars. This is >>> not (exclusively) done in order to prevent mapping the same region >>> multiple times, but rather to prevent unmapping of regions as long as >>> there's an enabled BAR that's using it. >>> >>> If you wanted to use something like d->pci_mapped_regions it would >>> have to keep reference counts to regions, in order to know when a >>> mapping is no longer required by any BAR on the system with memory >>> decoding enabled. >> I missed this path, thank you >> >> I tried to analyze the locking in pci/vpci. >> >> First of all some context to refresh the target we want: >> the rationale behind moving pdev->vpci->lock outside >> is to be able dynamically create and destroy pdev->vpci. >> So, for that reason lock needs to be moved outside of the pdev->vpci. >> >> Some of the callers of the vPCI code and locking used: >> >> ====================================== >> vpci_mmio_read/vpci_mmcfg_read >> ====================================== >> - vpci_ecam_read >> - vpci_read >> !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!! >> - msix: >> - control_read >> - header: >> - guest_bar_read >> - msi: >> - control_read >> - address_read/address_hi_read >> - data_read >> - mask_read >> >> ====================================== >> vpci_mmio_write/vpci_mmcfg_write >> ====================================== >> - vpci_ecam_write >> - vpci_write >> !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!! >> - msix: >> - control_write >> - header: >> - bar_write/guest_bar_write >> - cmd_write/guest_cmd_write >> - rom_write >> - all write handlers may call modify_bars >> modify_bars >> - msi: >> - control_write >> - address_write/address_hi_write >> - data_write >> - mask_write >> >> ====================================== >> pci_add_device: locked with pcidevs_lock >> ====================================== >> - vpci_add_handlers >> ++++++++ pdev->vpci_lock is used ++++++++ >> >> ====================================== >> pci_remove_device: locked with pcidevs_lock >> ====================================== >> - vpci_remove_device >> ++++++++ pdev->vpci_lock is used ++++++++ >> - pci_cleanup_msi >> - free_pdev >> >> ====================================== >> XEN_DOMCTL_assign_device: locked with pcidevs_lock >> ====================================== >> - assign_device >> - vpci_deassign_device >> - pdev_msix_assign >> - vpci_assign_device >> - vpci_add_handlers >> ++++++++ pdev->vpci_lock is used ++++++++ >> >> ====================================== >> XEN_DOMCTL_deassign_device: locked with pcidevs_lock >> ====================================== >> - deassign_device >> - vpci_deassign_device >> ++++++++ pdev->vpci_lock is used ++++++++ >> - vpci_remove_device >> >> >> ====================================== >> modify_bars is a special case: this is the only function which tries to lock >> two pci_dev devices: it is done to check for overlaps with other BARs which >> may have been >> already mapped or unmapped. >> >> So, this is the only case which may deadlock because of pci_dev->vpci_lock. >> ====================================== >> >> Bottom line: >> ====================================== >> >> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in >> parallel with pci_remove_device which can remove pdev after vpci_{read|write} >> acquired the pdev pointer. This may lead to a fail due to pdev dereference. >> >> So, to protect pdev dereference vpci_{read|write} must also use pdevs_lock. > We would like to take the pcidevs_lock only while fetching the device > (ie: pci_get_pdev_by_domain), afterwards it should be fine to lock the > device using a vpci specific lock so calls to vpci_{read,write} can be > partially concurrent across multiple domains. This means this can't be done a pre-req patch, but as a part of the patch which changes locking. > > In fact I think Jan had already pointed out that the pci lock would > need taking while searching for the device in vpci_{read,write}. I was referring to the time after we found pdev and it is currently possible to free pdev while using it after the search > > It seems to me that if you implement option 3 below taking the > per-domain rwlock in read mode in vpci_{read|write} will already > protect you from the device being removed if the same per-domain lock > is taken in write mode in vpci_remove_device. Yes, it should. Again this can't be done as a pre-req patch because this relies on pdev->vpci_lock > >> 2. The only offending place which is in the way of pci_dev->vpci_lock is >> modify_bars. If it can be re-worked to track already mapped and unmapped >> regions then we can avoid having a possible deadlock and can use >> pci_dev->vpci_lock (rangesets won't help here as we also need refcounting be >> implemented). > I think a refcounting based solution will be very complex to > implement. I'm however happy to be proven wrong. I can't estimate, but I have a feeling that all these plays around locking is just because of this single piece of code. No other place suffer from pdev->vpci_lock and no d->lock > >> If pcidevs_lock is used for vpci_{read|write} then no deadlock is possible, >> but modify_bars code must be re-worked not to lock itself (pdev->vpci_lock >> and >> tmp->vpci_lock when pdev == tmp, this is minor). > Taking the pcidevs lock (a global lock) is out of the picture IMO, as > it's going to serialize all calls of vpci_{read|write}, and would > create too much contention on the pcidevs lock. I understand that. But if we would like to fix the existing code I see no other alternative. > >> 3. We may think about a per-domain rwlock and pdev->vpci_lock, so this solves >> modify_bars's two pdevs access. But this doesn't solve possible pdev >> de-reference in vpci_{read|write} vs pci_remove_device. > pci_remove device will call vpci_remove_device, so as long as > vpci_remove_device taken the per-domain lock in write (exclusive) mode > it should be fine. I think I need to see if there are any other places which similarly require the write lock > >> @Roger, @Jan, I would like to hear what do you think about the above analysis >> and how can we proceed with locking re-work? > I think the per-domain rwlock seems like a good option. I would do > that as a pre-patch. It is. But it seems it won't solve the thing we started this adventure for: With per-domain read lock and still ABBA in modify_bars (hope the below is correctly seen with a monospace font): cpu0: vpci_write-> d->RLock -> pdev1->lock -> rom_write -> modify_bars: tmp (pdev2) ->lock cpu1: vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock There is no API to upgrade read lock to write lock in modify_bars which could help, so in both cases vpci_write should take write lock. Am I missing something here? > > Thanks, Roger. Thank you, Oleksandr

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.