Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
Hi Roger,
Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:
> On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>
>> Introduce a per-domain read/write lock used to check whether vpci is
>> present, so we can be sure there are no accesses to the contents of the
>> vpci struct if it is not. This lock can be used (and in a few cases is
>> used right away) so that vpci removal can be performed while holding the
>> lock in write mode. Previously such removal could race with vpci_read,
>> for example.
>>
>> 1. The per-domain vpci_rwlock is used to protect the pdev->vpci structure
>> from being removed.
>>
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if done
>> under the read lock, requires vpci->lock to be acquired on both devices
>> being compared, which may produce a deadlock. It is not possible to
>> upgrade read lock to write lock in such a case. So, in order to prevent
>> the deadlock, check which registers are going to be written and acquire
>> the lock in the appropriate mode from the beginning.
>>
>> All other code, which doesn't lead to pdev->vpci destruction and does not
>> access multiple pdevs at the same time, can still use a combination of the
>> read lock and pdev->vpci->lock.
>>
>> 3. Optimize the detection of whether a ROM BAR write requires the write
>> lock by caching the offset of the ROM BAR register in
>> vpci->header->rom_reg, which depends on the header's type.
>>
>> 4. Reduce locked region in vpci_remove_device as it is now possible
>> to set pdev->vpci to NULL early right after the write lock is acquired.
>>
>> 5. Reduce locked region in vpci_add_handlers as it is possible to
>> initialize many more fields of the struct vpci before assigning it to
>> pdev->vpci.
>>
>> 6. vpci_{add|remove}_register are required to be called with the write lock
>> held, but it is not feasible to add an assert there as that would require
>> struct domain to be passed in. So, add a comment about this requirement
>> to these and other functions with the equivalent constraints.
>>
>> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
>>
>> 8. Do not call process_pending_softirqs with any locks held. For that,
>> unlock prior to the call and re-acquire the locks afterwards. After
>> re-acquiring the lock there is no need to check whether pdev->vpci exists:
>> - in apply_map because of the context in which it is called (no race
>> condition is possible)
>> - for MSI/MSI-X debug code because it is called at the end of
>> pdev->vpci access and no further access to pdev->vpci is made
>>
>> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
>> and, if so, allow reading or writing the hardware register directly. This
>> is acceptable as we only deal with Dom0 as of now. Once DomU support is
>> added, writes will need to be ignored and reads return all 0's for the
>> guests, while Dom0 can still access the registers directly.
>>
>> 10. Introduce pcidevs_trylock, making it possible to attempt to acquire
>> the pcidevs lock without blocking.
>>
>> 11. Use the pcidevs lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>>
>> 12. This is based on the discussion at [1].
>>
>> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@xxxxxxxxx/
>>
>> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>
> Thanks.
>
> I haven't looked in full detail, but I'm afraid there's an ABBA
> deadlock with the per-domain vpci lock and the pcidevs lock in
> modify_bars() vs vpci_add_handlers() and vpci_remove_device().
>
> I've made some comments below.
Thank you for the review. I believe that it is a good idea to have a
per-domain pdev_list lock. See my answers below.
>> ---
>> This was checked on x86: with and without PVH Dom0.
>> ---
>> xen/arch/x86/hvm/vmsi.c | 7 +++
>> xen/common/domain.c | 3 +
>> xen/drivers/passthrough/pci.c | 5 ++
>> xen/drivers/vpci/header.c | 52 +++++++++++++++++
>> xen/drivers/vpci/msi.c | 25 +++++++-
>> xen/drivers/vpci/msix.c | 51 +++++++++++++---
>> xen/drivers/vpci/vpci.c | 107 +++++++++++++++++++++++++---------
>> xen/include/xen/pci.h | 1 +
>> xen/include/xen/sched.h | 3 +
>> xen/include/xen/vpci.h | 6 ++
>> 10 files changed, 223 insertions(+), 37 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..f188450b1b 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> {
>> unsigned int i;
>>
>> + ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
>> + ASSERT(pcidevs_locked());
>> +
>> for ( i = 0; i < msix->max_entries; i++ )
>> {
>> const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> struct pci_dev *pdev = msix->pdev;
>>
>> spin_unlock(&msix->pdev->vpci->lock);
>> + pcidevs_unlock();
>> + read_unlock(&pdev->domain->vpci_rwlock);
>> process_pending_softirqs();
>> + read_lock(&pdev->domain->vpci_rwlock);
>> + pcidevs_lock();
>
> Why do you need the pcidevs_lock here? Isn't it enough to have the
> per-domain rwlock taken in read mode in order to prevent devices from
> being de-assigned from the domain?
>
Yes, it looks like you are right. I'll remove pcidevs_lock().
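The relock sequence around process_pending_softirqs() would then shrink to
something like this (a sketch, assuming the read lock alone is enough to pin
the device for an alive domain):

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->vpci_rwlock);

    process_pending_softirqs();

    read_lock(&pdev->domain->vpci_rwlock);
    /* NB: we assume that pdev cannot go away for an alive domain. */
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
        return -EBUSY;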
>> /* NB: we assume that pdev cannot go away for an alive domain. */
>> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>> return -EBUSY;
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 6a440590fe..a4640acb62 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
>>
>> #ifdef CONFIG_HAS_PCI
>> INIT_LIST_HEAD(&d->pdev_list);
>> +#ifdef CONFIG_HAS_VPCI
>> + rwlock_init(&d->vpci_rwlock);
>> +#endif
>> #endif
>>
>> /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 07d1986d33..0afcb59af0 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>> spin_lock_recursive(&_pcidevs_lock);
>> }
>>
>> +int pcidevs_trylock(void)
>> +{
>> + return spin_trylock_recursive(&_pcidevs_lock);
>> +}
>> +
>> void pcidevs_unlock(void)
>> {
>> spin_unlock_recursive(&_pcidevs_lock);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ec2e978a4e..23b5223adf 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
>> if ( rc == -ERESTART )
>> return true;
>>
>> + read_lock(&v->domain->vpci_rwlock);
>> spin_lock(&v->vpci.pdev->vpci->lock);
>> /* Disable memory decoding unconditionally on failure. */
>> modify_decoding(v->vpci.pdev,
>> rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY :
>> v->vpci.cmd,
>> !rc && v->vpci.rom_only);
>> spin_unlock(&v->vpci.pdev->vpci->lock);
>> + read_unlock(&v->domain->vpci_rwlock);
>
> I think you likely want to expand the scope of the domain vpci lock in
> order to also protect the data in v->vpci? So that vPCI device
> removal can clear this data if required without racing with
> vpci_process_pending(). Otherwise you need to pause the domain when
> removing vPCI.
Yes, you are right.
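I will expand the scope roughly like this (a sketch only; the -ERESTART early
return would also need to drop the lock, which is elided here):

    bool vpci_process_pending(struct vcpu *v)
    {
        /*
         * Take the per-domain lock before v->vpci is first dereferenced,
         * so that vPCI removal (holding the lock in write mode) cannot
         * race with the deferred-map state used here.
         */
        read_lock(&v->domain->vpci_rwlock);

        if ( v->vpci.mem )
        {
            /* ... rangeset_consume_ranges() / modify_decoding() as today ... */
        }

        read_unlock(&v->domain->vpci_rwlock);

        return false;
    }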
>
>>
>> rangeset_destroy(v->vpci.mem);
>> v->vpci.mem = NULL;
>> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>> struct map_data data = { .d = d, .map = true };
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&d->vpci_rwlock));
>> +
>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> + {
>> + /*
>> + * It's safe to drop and reacquire the lock in this context
>> + * without risking pdev disappearing because devices cannot be
>> + * removed until the initial domain has been started.
>> + */
>> + write_unlock(&d->vpci_rwlock);
>> process_pending_softirqs();
>> + write_lock(&d->vpci_rwlock);
>> + }
>> +
>> rangeset_destroy(mem);
>> if ( !rc )
>> modify_decoding(pdev, cmd, false);
>> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>> raise_softirq(SCHEDULE_SOFTIRQ);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>
> Why it must be in write mode?
>
> Is this to avoid ABBA deadlocks as not taking the per-domain lock in
> write mode would then force us to take each pdev->vpci->lock in order
> to prevent concurrent modifications of remote devices?
Yes, exactly this. This is mentioned in the commit message:
2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if done
under the read lock, requires vpci->lock to be acquired on both devices
being compared, which may produce a deadlock. It is not possible to
upgrade read lock to write lock in such a case. So, in order to prevent
the deadlock, check which registers are going to be written and acquire
the lock in the appropriate mode from the beginning.
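To illustrate, with this approach vpci_write() ends up doing roughly the
following (a sketch; vpci_header_need_write_lock() is the helper added by
this patch, the surrounding logic is simplified):

    /* pdev has already been looked up at this point. */
    bool need_write = vpci_header_need_write_lock(pdev, reg, size);

    /*
     * Command/ROM BAR writes may reach modify_bars() and touch other
     * pdevs, hence they need the lock in write mode from the start.
     */
    if ( need_write )
        write_lock(&pdev->domain->vpci_rwlock);
    else
        read_lock(&pdev->domain->vpci_rwlock);

    /* ... dispatch to handlers under pdev->vpci->lock as before ... */

    if ( need_write )
        write_unlock(&pdev->domain->vpci_rwlock);
    else
        read_unlock(&pdev->domain->vpci_rwlock);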
>
>> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> {
>> struct vpci_header *header = &pdev->vpci->header;
>> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> * Check for overlaps with other BARs. Note that only BARs that are
>> * currently mapped (enabled) are checked for overlaps.
>> */
>> + pcidevs_lock();
>
> This can be problematic I'm afraid, as it's a guest controlled path
> that allows applying unrestricted contention on the pcidevs lock. I'm
> unsure whether multiple guests could be able to trigger the watchdog
> if given enough devices/vcpus to be able to perform enough
> simultaneous calls to modify_bars().
>
> Maybe you could expand the per-domain vpci lock to also protect
> modifications of domain->pdev_list?
>
> IOW: kind of a per-domain pdev_lock.
This might work, but I am not sure that we need to expand the vpci lock.
Maybe it is better to add another lock that protects domain->pdev_list? I
am afraid that a combined lock would be confusing, as it would protect two
semi-related entities.
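Roughly like this (field and helper names made up for illustration):

    struct domain {
        /* ... */
        struct list_head pdev_list;
        rwlock_t pdev_list_lock;     /* protects pdev_list membership only */
    };

    /* modify_bars() could then iterate without the global pcidevs lock: */
    read_lock(&pdev->domain->pdev_list_lock);
    for_each_pdev ( pdev->domain, tmp )
    {
        /* ... BAR overlap checks ... */
    }
    read_unlock(&pdev->domain->pdev_list_lock);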
>
>> for_each_pdev ( pdev->domain, tmp )
>> {
>> if ( tmp == pdev )
>> @@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>> start, end, rc);
>> rangeset_destroy(mem);
>> + pcidevs_unlock();
>> return rc;
>> }
>> }
>> }
>> + pcidevs_unlock();
>>
>> ASSERT(dev);
>>
>> @@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>> struct vpci_bar *bars = header->bars;
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>> {
>> case PCI_HEADER_TYPE_NORMAL:
>> @@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>> }
>> }
>>
>> + ASSERT(!header->rom_reg);
>> +
>> /* Check expansion ROM. */
>> rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
>> if ( rc > 0 && size )
>> @@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
>> 4, rom);
>> if ( rc )
>> rom->type = VPCI_BAR_EMPTY;
>> +
>> + header->rom_reg = rom_reg;
>> }
>>
>> return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>> }
>> REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>
>> +static bool overlap(unsigned int r1_offset, unsigned int r1_size,
>> + unsigned int r2_offset, unsigned int r2_size)
>> +{
>> + /* Return true if there is an overlap. */
>> + return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;
>
> Hm, we already have a vpci_register_cmp(), which does a very similar
> comparison. I wonder if it would be possible to somehow use that
> here.
>
Yeah, I'll revisit this part.
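For instance, overlap() could be expressed directly in terms of it (a
sketch; this assumes vpci_register_cmp() keeps returning 0 exactly when the
two ranges overlap):

    static bool overlap(unsigned int o1, unsigned int s1,
                        unsigned int o2, unsigned int s2)
    {
        const struct vpci_register r1 = { .offset = o1, .size = s1 };
        const struct vpci_register r2 = { .offset = o2, .size = s2 };

        /* vpci_register_cmp() returns 0 iff the two ranges overlap. */
        return vpci_register_cmp(&r1, &r2) == 0;
    }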
>> +}
>> +
>> +bool vpci_header_need_write_lock(const struct pci_dev *pdev,
>> + unsigned int start, unsigned int size)
>> +{
>> + /*
>> + * Writing the command register and ROM BAR register may trigger
>> + * modify_bars to run, which in turn may access multiple pdevs while
>> + * checking for the existing BAR's overlap. The overlapping check, if done
>> + * under the read lock, requires vpci->lock to be acquired on both devices
>> + * being compared, which may produce a deadlock. At the same time it is not
>> + * possible to upgrade read lock to write lock in such a case.
>> + * Check which registers are going to be written and return true if lock
>> + * needs to be acquired in write mode.
>> + */
>> + if ( overlap(start, size, PCI_COMMAND, 2) ||
>> + (pdev->vpci->header.rom_reg &&
>> + overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> /*
>> * Local variables:
>> * mode: C
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..e7d1f916a0 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>> uint16_t control;
>> int ret;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> if ( !pos )
>> return 0;
>>
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>
>> void vpci_dump_msi(void)
>> {
>> - const struct domain *d;
>> + struct domain *d;
>>
>> rcu_read_lock(&domlist_read_lock);
>> for_each_domain ( d )
>> @@ -277,6 +279,15 @@ void vpci_dump_msi(void)
>>
>> printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>
>> + if ( !read_trylock(&d->vpci_rwlock) )
>> + continue;
>> +
>> + if ( !pcidevs_trylock() )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> + continue;
>> + }
>> +
>> for_each_pdev ( d, pdev )
>> {
>> const struct vpci_msi *msi;
>> @@ -318,14 +329,22 @@ void vpci_dump_msi(void)
>> * holding the lock.
>> */
>> printk("unable to print all MSI-X entries: %d\n", rc);
>> - process_pending_softirqs();
>> - continue;
>> + goto pdev_done;
>> }
>> }
>>
>> spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> + pcidevs_unlock();
>> + read_unlock(&d->vpci_rwlock);
>> +
>> process_pending_softirqs();
>> +
>> + read_lock(&d->vpci_rwlock);
>> + pcidevs_lock();
>> }
>> + pcidevs_unlock();
>> + read_unlock(&d->vpci_rwlock);
>> }
>> rcu_read_unlock(&domlist_read_lock);
>> }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 25bde77586..b5a7dfdf9c 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -143,6 +143,7 @@ static void cf_check control_write(
>> pci_conf_write16(pdev->sbdf, reg, val);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>> {
>> struct vpci_msix *msix;
>> @@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>
>> static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>> {
>> - return !!msix_find(v->domain, addr);
>> + int rc;
>> +
>> + read_lock(&v->domain->vpci_rwlock);
>> + rc = !!msix_find(v->domain, addr);
>> + read_unlock(&v->domain->vpci_rwlock);
>> +
>> + return rc;
>> }
>>
>> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>> static int cf_check msix_read(
>> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>> {
>> - const struct domain *d = v->domain;
>> - struct vpci_msix *msix = msix_find(d, addr);
>> + struct domain *d = v->domain;
>> + struct vpci_msix *msix;
>> const struct vpci_msix_entry *entry;
>> unsigned int offset;
>>
>> *data = ~0ul;
>>
>> + read_lock(&d->vpci_rwlock);
>> +
>> + msix = msix_find(d, addr);
>> if ( !msix )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_RETRY;
>> + }
>>
>> if ( adjacent_handle(msix, addr) )
>> - return adjacent_read(d, msix, addr, len, data);
>> + {
>> + int rc = adjacent_read(d, msix, addr, len, data);
>> + read_unlock(&d->vpci_rwlock);
>> + return rc;
>> + }
>>
>> if ( !access_allowed(msix->pdev, addr, len) )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_OKAY;
>> + }
>>
>> spin_lock(&msix->pdev->vpci->lock);
>> entry = get_entry(msix, addr);
>> @@ -404,6 +424,7 @@ static int cf_check msix_read(
>> break;
>> }
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&d->vpci_rwlock);
>>
>> return X86EMUL_OKAY;
>> }
>> @@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>> static int cf_check msix_write(
>> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>> {
>> - const struct domain *d = v->domain;
>> - struct vpci_msix *msix = msix_find(d, addr);
>> + struct domain *d = v->domain;
>> + struct vpci_msix *msix;
>> struct vpci_msix_entry *entry;
>> unsigned int offset;
>>
>> + read_lock(&d->vpci_rwlock);
>> +
>> + msix = msix_find(d, addr);
>> if ( !msix )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_RETRY;
>> + }
>>
>> if ( adjacent_handle(msix, addr) )
>> - return adjacent_write(d, msix, addr, len, data);
>> + {
>> + int rc = adjacent_write(d, msix, addr, len, data);
>> + read_unlock(&d->vpci_rwlock);
>> + return rc;
>> + }
>>
>> if ( !access_allowed(msix->pdev, addr, len) )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_OKAY;
>> + }
>>
>> spin_lock(&msix->pdev->vpci->lock);
>> entry = get_entry(msix, addr);
>> @@ -579,6 +613,7 @@ static int cf_check msix_write(
>> break;
>> }
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&d->vpci_rwlock);
>>
>> return X86EMUL_OKAY;
>> }
>> @@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>> struct vpci_msix *msix;
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>> PCI_CAP_ID_MSIX);
>> if ( !msix_offset )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 652807a4a4..1270174e78 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>
>> void vpci_remove_device(struct pci_dev *pdev)
>> {
>> - if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> + struct vpci *vpci;
>> +
>> + if ( !has_vpci(pdev->domain) )
>> return;
>>
>> - spin_lock(&pdev->vpci->lock);
>> + write_lock(&pdev->domain->vpci_rwlock);
>> + if ( !pdev->vpci )
>> + {
>> + write_unlock(&pdev->domain->vpci_rwlock);
>> + return;
>> + }
>> +
>> + vpci = pdev->vpci;
>> + pdev->vpci = NULL;
>> + write_unlock(&pdev->domain->vpci_rwlock);
>> +
>> - while ( !list_empty(&pdev->vpci->handlers) )
>> + while ( !list_empty(&vpci->handlers) )
>> {
>> - struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
>> + struct vpci_register *r = list_first_entry(&vpci->handlers,
>> struct vpci_register,
>> node);
>>
>> list_del(&r->node);
>> xfree(r);
>> }
>> - spin_unlock(&pdev->vpci->lock);
>> +
>> - if ( pdev->vpci->msix )
>> + if ( vpci->msix )
>> {
>> unsigned int i;
>> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>> - if ( pdev->vpci->msix->table[i] )
>> - iounmap(pdev->vpci->msix->table[i]);
>> + if ( vpci->msix->table[i] )
>> + iounmap(vpci->msix->table[i]);
>> }
>> - xfree(pdev->vpci->msix);
>> - xfree(pdev->vpci->msi);
>> - xfree(pdev->vpci);
>> - pdev->vpci = NULL;
>> + xfree(vpci->msix);
>> + xfree(vpci->msi);
>> + xfree(vpci);
>> }
>>
>> int vpci_add_handlers(struct pci_dev *pdev)
>> {
>> + struct vpci *vpci;
>> unsigned int i;
>> int rc = 0;
>>
>> if ( !has_vpci(pdev->domain) )
>> return 0;
>>
>> + vpci = xzalloc(struct vpci);
>> + if ( !vpci )
>> + return -ENOMEM;
>> +
>> + INIT_LIST_HEAD(&vpci->handlers);
>> + spin_lock_init(&vpci->lock);
>> +
>> + write_lock(&pdev->domain->vpci_rwlock);
>
> I think the usage of the vpci per-domain lock here and in
> vpci_remove_device() create an ABBA deadlock situation with the usage
> of it in modify_bars().
>
> Both vpci_add_handlers() and vpci_remove_device() get called with the
> pcidevs lock taken, and take the per-domain vpci lock afterwards,
> while modify_bars() does the inverse locking, gets called with the
> per-domain vpci lock taken and then take the pcidevs lock inside the
> function.
You suggested using a separate per-domain pdev_list lock in modify_bars().
Looks like this can help with the ABBA deadlock.
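The resulting nesting would then be uniform across all paths, something like
this (pdev_list_lock being the hypothetical per-domain lock discussed above):

    /*
     * Intended lock order:
     *
     *   pcidevs_lock()             - device add/remove paths only
     *     -> d->vpci_rwlock        - write for add/remove, read otherwise
     *       -> d->pdev_list_lock   - pdev_list iteration (modify_bars())
     *         -> pdev->vpci->lock
     */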
>
>> +
>> /* We should not get here twice for the same device. */
>> ASSERT(!pdev->vpci);
>>
>> - pdev->vpci = xzalloc(struct vpci);
>> - if ( !pdev->vpci )
>> - return -ENOMEM;
>> -
>> - INIT_LIST_HEAD(&pdev->vpci->handlers);
>> - spin_lock_init(&pdev->vpci->lock);
>> + pdev->vpci = vpci;
>>
>> for ( i = 0; i < NUM_VPCI_INIT; i++ )
>> {
>> @@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
>> if ( rc )
>> break;
>> }
>> + write_unlock(&pdev->domain->vpci_rwlock);
>>
>> if ( rc )
>> vpci_remove_device(pdev);
>> @@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
>> return pci_conf_read32(pdev->sbdf, reg);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>> vpci_write_t *write_handler, unsigned int offset,
>> unsigned int size, void *data)
>> @@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>> r->offset = offset;
>> r->private = data;
>>
>> - spin_lock(&vpci->lock);
>
> If you remove the lock here we need to assert that the per-domain vpci
> lock is taken in write mode.
Yep, an assert will be better than the comment above.
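For reference, it could look like this, although it would require struct
vpci to gain a way back to the domain (a hypothetical pdev back-pointer, not
present in this patch):

    int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                          vpci_write_t *write_handler, unsigned int offset,
                          unsigned int size, void *data)
    {
        /* Hypothetical: needs a vpci->pdev back-pointer to reach the domain. */
        ASSERT(rw_is_write_locked(&vpci->pdev->domain->vpci_rwlock));
        /* ... */
    }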
>
>> -
>> /* The list of handlers must be kept sorted at all times. */
>> list_for_each ( prev, &vpci->handlers )
>> {
>> @@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>> break;
>> if ( cmp == 0 )
>> {
>> - spin_unlock(&vpci->lock);
>> xfree(r);
>> return -EEXIST;
>> }
>> }
>>
>> list_add_tail(&r->node, prev);
>> - spin_unlock(&vpci->lock);
>>
>> return 0;
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>> unsigned int size)
>> {
>> const struct vpci_register r = { .offset = offset, .size = size };
>> struct vpci_register *rm;
>>
>> - spin_lock(&vpci->lock);
>> list_for_each_entry ( rm, &vpci->handlers, node )
>> {
>> int cmp = vpci_register_cmp(&r, rm);
>> @@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>> if ( !cmp && rm->offset == offset && rm->size == size )
>> {
>> list_del(&rm->node);
>> - spin_unlock(&vpci->lock);
>> xfree(rm);
>> return 0;
>> }
>> if ( cmp <= 0 )
>> break;
>> }
>> - spin_unlock(&vpci->lock);
>
> Same here about the per-domain lock being taken.
>
>>
>> return -ENOENT;
>> }
>> @@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>
>> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>> {
>> - const struct domain *d = current->domain;
>> + struct domain *d = current->domain;
>> const struct pci_dev *pdev;
>> const struct vpci_register *r;
>> unsigned int data_offset = 0;
>> @@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>> }
>>
>> /* Find the PCI dev matching the address. */
>> + pcidevs_lock();
>> pdev = pci_get_pdev(d, sbdf);
>> - if ( !pdev || !pdev->vpci )
>> + pcidevs_unlock();
>
> I think it's worth looking into expanding the per-domain vpci-lock to
> a pdevs lock or similar in order to protect the pdev_list also if
> possible. So that we can avoid taking the pcidevs lock here.
Agreed.
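I.e. something like this (a sketch, with the widened per-domain lock also
standing in for pcidevs_lock() around the device lookup):

    read_lock(&d->vpci_rwlock);     /* also pins pdev_list membership */
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci )
    {
        read_unlock(&d->vpci_rwlock);
        return vpci_read_hw(sbdf, reg, size);
    }

    /* ... per-register handling under pdev->vpci->lock ... */

    read_unlock(&d->vpci_rwlock);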
--
WBR, Volodymyr