[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 7 Jul 2023 08:32:55 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=W5VxBTkIf+j4Tu2a5hY04lbMSs8PaSW52Z3IZ6PkoOs=; b=cO7xA3SwUVBpZ6EVGgFGA08br5dt0hno3fB5WmylnjtvN4SiTNo0ilXpNtQRovjte87nMi67dv5iiCWq6mcJacH7hUvLbo7MOgjd7HhGDXkhOKA2S+oqw1r7bsTdszcy61SO/rmYHqS5v/HR9uAGu1wsfTgnSVlX6jtENIuBrR05He4Ur2o/iJ5NOtyiTVqANKQ/nc7IWZ03hXoHyM+wKrfUlcYEo/jTRvHzRL+g70P/XXCyegxoFGyUy6ooIPdzgaI2nZB6R58NVQASllyQB7jSO8/6qST2XEhXxLEWXjljCa3HDhHEuOQx0jHYMcgwqHsQJFjrpIEipx+LAr0xZA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=MVm6sU2tDJulPpjwo2GkZhPHpdSloum47QDVX6gUo6AGYmp7oIn1FSbYz3/SSnnurJoy+n/2tldsHkcj8MrQGQ2ykLqxVd/fOiM/gJDSnFylg0bX+/dZbF+M5T2U3mthOBFtV6SbYaiiJlqunL3ik43PUM0WjbweSQu91ei31S++L1xKOaPFej9cslehbdzyxLYTsf/P99e9X/NwK/NBWlMDC4iz0TBAAvxPkXHS7Z5zIfw2MHEzs3uw9sT16UzgeB8sGxdgKCjC4mwUKIQnH3o2bQ2yJW7fQ4nMoZTsS3HuNJEtcFTeBN2ibeCYFSz5KYEN0IM9foq2uG/LsBrw6Q==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>
  • Delivery-date: Fri, 07 Jul 2023 06:33:47 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 07.07.2023 04:02, Volodymyr Babchuk wrote:
> 
> Hi Jan,
> 
> Jan Beulich <jbeulich@xxxxxxxx> writes:
> 
>> On 05.07.2023 10:59, Roger Pau Monné wrote:
>>> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>>>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>>>> I am currently implementing your proposal (along with Jan's
>>>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>>>> reassign_device() call, which has this piece of code:
>>>>>
>>>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>>>
>>>>> My immediate change was:
>>>>>
>>>>>         write_lock(&pdev->domain->pci_lock);
>>>>>         list_del(&pdev->domain_list);
>>>>>         write_unlock(&pdev->domain->pci_lock);
>>>>>
>>>>>         write_lock(&target->pci_lock);
>>>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>>>         write_unlock(&target->pci_lock);
>>>>>
>>>>> But this will not work because reassign_device is called from
>>>>> pci_release_devices() which iterates over d->pdev_list, so we need to
>>>>> take a d->pci_lock early.
>>>>>
>>>>> Any suggestions on how to fix this? My idea is to remove a device from a
>>>>> list one at time:
>>>>>
>>>>> int pci_release_devices(struct domain *d)
>>>>> {
>>>>>     struct pci_dev *pdev;
>>>>>     u8 bus, devfn;
>>>>>     int ret;
>>>>>
>>>>>     pcidevs_lock();
>>>>>     write_lock(&d->pci_lock);
>>>>>     ret = arch_pci_clean_pirqs(d);
>>>>>     if ( ret )
>>>>>     {
>>>>>         pcidevs_unlock();
>>>>>         write_unlock(&d->pci_lock);
>>>>>         return ret;
>>>>>     }
>>>>>
>>>>>     while ( !list_empty(&d->pdev_list) )
>>>>>     {
>>>>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>>         bus = pdev->bus;
>>>>>         devfn = pdev->devfn;
>>>>>         list_del(&pdev->domain_list);
>>>>>         write_unlock(&d->pci_lock);
>>>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>>>         write_lock(&d->pci_lock);
>>>>
>>>> I think it needs doing almost like this, but with two more tweaks and
>>>> no list_del() right here (first and foremost to avoid needing to
>>>> figure whether removing early isn't going to subtly break anything;
>>>> see below for an error case that would end up with changed behavior):
>>>>
>>>>     while ( !list_empty(&d->pdev_list) )
>>>>     {
>>>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, 
>>>> struct pci_dev, domain_list);
>>>>         uint16_t seg = pdev->seg;
>>>>         uint8_t bus = pdev->bus;
>>>>         uint8_t devfn = pdev->devfn;
>>>>
>>>>         write_unlock(&d->pci_lock);
>>>
>>> I think you need to remove the device from the pdev_list before
>>> dropping the lock, or else release could race with other operations.
>>>
>>> That's unlikely, but still if the lock is dropped and the routine
>>> needs to operate on the device it is better remove such device from
>>> the domain so other operations cannot get a reference to it.
>>>
>>> Otherwise you could modify reassign_device() implementations so they
>>> require the caller to hold the source->pci_lock when calling the
>>> routine, but that's ugly because the lock would need to be dropped in
>>> order to reassign the device from source to target domains.
>>>
>>> Another option would be to move the whole d->pdev_list to a local
>>> variable (so that d->pdev_list would be empty) and then iterate over
>>> it without the d->pci_lock.  On failure you would take the lock and
>>> add the failing device back into d->pdev_list.
>>
>> Conceptually I like this last variant, but like the individual list_del()
>> it requires auditing code for no dependency on the device still being on
>> that list. In fact deassign_device()'s use of pci_get_pdev() does. The
>> function would then need changing to have struct pci_dev * passed in.
>> Yet who knows where else there are uses of pci_get_pdev() lurking.
> 
> Okay, so I changed deassign_device() signature and reworked the loop
> in pci_release_devices() in a such way:
> 
>     INIT_LIST_HEAD(&tmp_list);
>     /* Move all entries to tmp_list, so we can drop d->pci_lock */
>     list_splice_init(&d->pdev_list, &tmp_list);
>     write_unlock(&d->pci_lock);
> 
>     list_for_each_entry_safe ( pdev, tmp, &tmp_list, domain_list )
>     {
>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>         rc = deassign_device(d, pdev);
>         if ( rc )
>         {
>             /* Return device back to the domain list */
>             write_lock(&d->pci_lock);
>             list_add(&pdev->domain_list, &d->pdev_list);
>             write_unlock(&d->pci_lock);
>             func_ret = rc;
>         }
>     }
> 
> 
> Also, I checked for all pci_get_pdev() calls and found that struct
> domain (the first parameter) is passed only in handful of places:
> 
> *** xen/drivers/vpci/vpci.c:
> vpci_read[504]                 pdev = pci_get_pdev(d, sbdf);
> vpci_read[506]                 pdev = pci_get_pdev(dom_xen, sbdf);
> vpci_write[621]                pdev = pci_get_pdev(d, sbdf);
> vpci_write[623]                pdev = pci_get_pdev(dom_xen, sbdf);
> 
> *** xen/arch/x86/irq.c:
> map_domain_pirq[2166]          pdev = pci_get_pdev(d, msi->sbdf);
> 
> *** xen/drivers/passthrough/pci.c:
> XEN_GUEST_HANDLE_PARAM[1712]   pdev = pci_get_pdev(d, machine_sbdf);
> 
> The last one is due to my change to deassign_device() signature.

And which is going to continue to return NULL when earlier on you've
emptied the list. The purpose of passing in struct pdev * was to
eliminate this call. (Yet there may be further reasons why eliminating
this call actually isn't correct.)

> ==============================
> 
> d->pdev_list can be accessed there:
> 
> *** xen/drivers/passthrough/amd/pci_amd_iommu.c:
> reassign_device[489]           list_add(&pdev->domain_list, 
> &target->pdev_list);
> 
> *** xen/drivers/passthrough/pci.c:
> _pci_hide_device[463]          list_add(&pdev->domain_list, 
> &dom_xen->pdev_list);
> pci_get_pdev[561]              list_for_each_entry ( pdev, &d->pdev_list, 
> domain_list )
> pci_add_device[759]            list_add(&pdev->domain_list, 
> &hardware_domain->pdev_list);
> pci_release_devices[917]       list_splice_init(&d->pdev_list, &tmp_list);
> pci_release_devices[922]       pdev = list_entry(&d->pdev_list, struct 
> pci_dev, domain_list);
> pci_release_devices[928]       list_add(&pdev->domain_list, &d->pdev_list);
> _setup_hwdom_pci_devices[1155] list_add(&pdev->domain_list, 
> &ctxt->d->pdev_list);
> 
> *** xen/drivers/passthrough/vtd/iommu.c:
> reassign_device_ownership[2819] list_add(&pdev->domain_list, 
> &target->pdev_list);
> 
> *** xen/include/xen/pci.h:
> for_each_pdev[149]             list_for_each_entry(pdev, 
> &(domain)->pdev_list, domain_list)
> has_arch_pdevs[151]            #define has_arch_pdevs(d) 
> (!list_empty(&(d)->pdev_list))
> 
> ==============================
> 
> And has_arch_pdevs() is used there:
> 
> *** xen/arch/x86/hvm/hvm.c:
> hvm_set_cr0[2388]              has_arch_pdevs(d)) )
> 
> *** xen/arch/x86/hvm/vmx/vmcs.c:
> vmx_do_resume[1892]            if ( has_arch_pdevs(v->domain) && !iommu_snoop
> 
> *** xen/arch/x86/mm.c:
> l1_disallow_mask[172]          !has_arch_pdevs(d) && \
> 
> *** xen/arch/x86/mm/p2m-pod.c:
> p2m_pod_set_mem_target[352]    if ( has_arch_pdevs(d) || 
> cache_flush_permitted(d) )
> guest_physmap_mark_populate_on_demand[1404] if ( has_arch_pdevs(d) || 
> cache_flush_permitted(d) )
> 
> *** xen/arch/x86/mm/paging.c:
> paging_log_dirty_enable[208]   if ( has_arch_pdevs(d) )
> 
> *** xen/drivers/passthrough/vtd/iommu.c:
> reassign_device_ownership[2773] if ( !has_arch_pdevs(target) )
> reassign_device_ownership[2807] if ( !has_arch_pdevs(target) )
> reassign_device_ownership[2825] if ( !has_arch_pdevs(source) )
> 
> 
> has_arch_pdevs() bothers me most, actually, because it is not always
> obvious how to add locking for the callers. I am planning to rework it
> in the following way:

Locking is only one aspect. As above, another is whether the function
might wrongly return "false" when you prematurely empty the list of
devices.

> #define has_arch_pdevs_unlocked(d) (!list_empty(&(d)->pdev_list))
> 
> static inline bool has_arch_pdevs(struct domain *d)
> {
>     bool ret;
> 
>     read_lock(&d->pci_lock);
>     ret = has_arch_pdevs_unlocked(d);
>     read_unlock(&d->pci_lock);
> 
>     return ret;
> }

No, this follows a pattern that earlier was said isn't acceptable:
The result of such a check is meaningless to the caller without holding
the lock past actually consuming the result. It simply is stale by the
time you return to the caller. (There may be special cases where other
constraints eliminate the concern, like maybe during domain
construction or domain cleanup, but such need carefully considering in
each individual case. Which in particular means there shouldn't be any
common-use helper functions doing what is unsafe in the general case.)

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.