Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
Hi Roger,
Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:
> On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>
>> Introduce a per-domain read/write lock used to check whether vpci is
>> present, so we can be sure there are no accesses to the contents of the
>> vpci struct if it is not. This lock can be used (and in a few cases is
>> used right away) so that vpci removal can be performed while holding the
>> lock in write mode. Previously such removal could race with vpci_read,
>> for example.
>>
>> 1. The per-domain vpci_rwlock is used to protect the pdev->vpci structure
>> from being removed.
>>
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if done
>> under the read lock, requires vpci->lock to be acquired on both devices
>> being compared, which may produce a deadlock. It is not possible to
>> upgrade read lock to write lock in such a case. So, in order to prevent
>> the deadlock, check which registers are going to be written and acquire
>> the lock in the appropriate mode from the beginning.
>>
>> All other code, which doesn't lead to pdev->vpci destruction and does not
>> access multiple pdevs at the same time, can still use a combination of the
>> read lock and pdev->vpci->lock.
>>
>> 3. Optimize the detection of whether a ROM BAR write requires the write
>> lock by caching the offset of the ROM BAR register in
>> vpci->header->rom_reg, which depends on the header's type.
>>
>> 4. Reduce locked region in vpci_remove_device as it is now possible
>> to set pdev->vpci to NULL early right after the write lock is acquired.
>>
>> 5. Reduce locked region in vpci_add_handlers as it is possible to
>> initialize many more fields of the struct vpci before assigning it to
>> pdev->vpci.
>>
>> 6. vpci_{add|remove}_register are required to be called with the write lock
>> held, but it is not feasible to add an assert there as that would require
>> struct domain to be passed in. So, add a comment about this requirement
>> to these and other functions with the equivalent constraints.
>>
>> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
>>
>> 8. Do not call process_pending_softirqs with any locks held. For that,
>> unlock prior to the call and re-acquire the locks afterwards. After
>> re-acquiring the lock there is no need to check whether pdev->vpci exists:
>> - in apply_map because of the context in which it is called (no race
>> condition is possible)
>> - for MSI/MSI-X debug code because it is called at the end of
>> pdev->vpci access and no further access to pdev->vpci is made
>>
>> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
>> and, if so, allow reading or writing the hardware register directly. This
>> is acceptable as we only deal with Dom0 as of now. Once DomU support is
>> added, writes will need to be ignored and reads return all 0's for the
>> guests, while Dom0 can still access the registers directly.
>>
>> 10. Introduce pcidevs_trylock, making it possible to attempt to acquire
>> the pcidevs lock without blocking.
>>
>> 11. Use the pcidevs lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>>
>> 12. This is based on the discussion at [1].
>>
>> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@xxxxxxxxx/
>>
>> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>
> Thanks.
>
> I haven't looked in full detail, but I'm afraid there's an ABBA
> deadlock with the per-domain vpci lock and the pcidevs lock in
> modify_bars() vs vpci_add_handlers() and vpci_remove_device().
>
> I've made some comments below.
Thank you for the review. I believe that it is a good idea to have a
per-domain pdev_list lock. See my answers below.
>> ---
>> This was checked on x86: with and without PVH Dom0.
>> ---
>> xen/arch/x86/hvm/vmsi.c | 7 +++
>> xen/common/domain.c | 3 +
>> xen/drivers/passthrough/pci.c | 5 ++
>> xen/drivers/vpci/header.c | 52 +++++++++++++++++
>> xen/drivers/vpci/msi.c | 25 +++++++-
>> xen/drivers/vpci/msix.c | 51 +++++++++++++---
>> xen/drivers/vpci/vpci.c | 107 +++++++++++++++++++++++++---------
>> xen/include/xen/pci.h | 1 +
>> xen/include/xen/sched.h | 3 +
>> xen/include/xen/vpci.h | 6 ++
>> 10 files changed, 223 insertions(+), 37 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..f188450b1b 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> {
>> unsigned int i;
>>
>> + ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
>> + ASSERT(pcidevs_locked());
>> +
>> for ( i = 0; i < msix->max_entries; i++ )
>> {
>> const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> struct pci_dev *pdev = msix->pdev;
>>
>> spin_unlock(&msix->pdev->vpci->lock);
>> + pcidevs_unlock();
>> + read_unlock(&pdev->domain->vpci_rwlock);
>> process_pending_softirqs();
>> + read_lock(&pdev->domain->vpci_rwlock);
>> + pcidevs_lock();
>
> Why do you need the pcidevs_lock here? Isn't it enough to have the
> per-domain rwlock taken in read mode in order to prevent devices from
> being de-assigned from the domain?
>
Yes, it looks like you are right. I'll remove pcidevs_lock().
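The relock sequence around process_pending_softirqs() would then shrink to
something like this (a sketch, assuming the read lock alone is enough to pin
the device for an alive domain):

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->vpci_rwlock);

    process_pending_softirqs();

    read_lock(&pdev->domain->vpci_rwlock);
    /* NB: we assume that pdev cannot go away for an alive domain. */
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
        return -EBUSY;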
>> /* NB: we assume that pdev cannot go away for an alive domain. */
>> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>> return -EBUSY;
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 6a440590fe..a4640acb62 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
>>
>> #ifdef CONFIG_HAS_PCI
>> INIT_LIST_HEAD(&d->pdev_list);
>> +#ifdef CONFIG_HAS_VPCI
>> + rwlock_init(&d->vpci_rwlock);
>> +#endif
>> #endif
>>
>> /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 07d1986d33..0afcb59af0 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>> spin_lock_recursive(&_pcidevs_lock);
>> }
>>
>> +int pcidevs_trylock(void)
>> +{
>> + return spin_trylock_recursive(&_pcidevs_lock);
>> +}
>> +
>> void pcidevs_unlock(void)
>> {
>> spin_unlock_recursive(&_pcidevs_lock);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ec2e978a4e..23b5223adf 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
>> if ( rc == -ERESTART )
>> return true;
>>
>> + read_lock(&v->domain->vpci_rwlock);
>> spin_lock(&v->vpci.pdev->vpci->lock);
>> /* Disable memory decoding unconditionally on failure. */
>> modify_decoding(v->vpci.pdev,
>> rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY :
>> v->vpci.cmd,
>> !rc && v->vpci.rom_only);
>> spin_unlock(&v->vpci.pdev->vpci->lock);
>> + read_unlock(&v->domain->vpci_rwlock);
>
> I think you likely want to expand the scope of the domain vpci lock in
> order to also protect the data in v->vpci? So that vPCI device
> removal can clear this data if required without racing with
> vpci_process_pending(). Otherwise you need to pause the domain when
> removing vPCI.
Yes, you are right.
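I will expand the scope roughly like this (a sketch only; the -ERESTART early
return would also need to drop the lock, which is elided here):

    bool vpci_process_pending(struct vcpu *v)
    {
        /*
         * Take the per-domain lock before v->vpci is first dereferenced,
         * so that vPCI removal (holding the lock in write mode) cannot
         * race with the deferred-map state used here.
         */
        read_lock(&v->domain->vpci_rwlock);

        if ( v->vpci.mem )
        {
            /* ... rangeset_consume_ranges() / modify_decoding() as today ... */
        }

        read_unlock(&v->domain->vpci_rwlock);

        return false;
    }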
>
>>
>> rangeset_destroy(v->vpci.mem);
>> v->vpci.mem = NULL;
>> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>> struct map_data data = { .d = d, .map = true };
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&d->vpci_rwlock));
>> +
>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> + {
>> + /*
>> + * It's safe to drop and reacquire the lock in this context
>> + * without risking pdev disappearing because devices cannot be
>> + * removed until the initial domain has been started.
>> + */
>> + write_unlock(&d->vpci_rwlock);
>> process_pending_softirqs();
>> + write_lock(&d->vpci_rwlock);
>> + }
>> +
>> rangeset_destroy(mem);
>> if ( !rc )
>> modify_decoding(pdev, cmd, false);
>> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>> raise_softirq(SCHEDULE_SOFTIRQ);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>
> Why it must be in write mode?
>
> Is this to avoid ABBA deadlocks as not taking the per-domain lock in
> write mode would then force us to take each pdev->vpci->lock in order
> to prevent concurrent modifications of remote devices?
Yes, exactly this. This is mentioned in the commit message:
2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if done
under the read lock, requires vpci->lock to be acquired on both devices
being compared, which may produce a deadlock. It is not possible to
upgrade read lock to write lock in such a case. So, in order to prevent
the deadlock, check which registers are going to be written and acquire
the lock in the appropriate mode from the beginning.
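To illustrate, with this approach vpci_write() ends up doing roughly the
following (a sketch; vpci_header_need_write_lock() is the helper added by
this patch, the surrounding logic is simplified):

    /* pdev has already been looked up at this point. */
    bool need_write = vpci_header_need_write_lock(pdev, reg, size);

    /*
     * Command/ROM BAR writes may reach modify_bars() and touch other
     * pdevs, hence they need the lock in write mode from the start.
     */
    if ( need_write )
        write_lock(&pdev->domain->vpci_rwlock);
    else
        read_lock(&pdev->domain->vpci_rwlock);

    /* ... dispatch to handlers under pdev->vpci->lock as before ... */

    if ( need_write )
        write_unlock(&pdev->domain->vpci_rwlock);
    else
        read_unlock(&pdev->domain->vpci_rwlock);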
>
>> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> {
>> struct vpci_header *header = &pdev->vpci->header;
>> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> * Check for overlaps with other BARs. Note that only BARs that are
>> * currently mapped (enabled) are checked for overlaps.
>> */
>> + pcidevs_lock();
>
> This can be problematic I'm afraid, as it's a guest controlled path
> that allows applying unrestricted contention on the pcidevs lock. I'm
> unsure whether multiple guests could be able to trigger the watchdog
> if given enough devices/vcpus to be able to perform enough
> simultaneous calls to modify_bars().
>
> Maybe you could expand the per-domain vpci lock to also protect
> modifications of domain->pdev_list?
>
> IOW: kind of a per-domain pdev_lock.
This might work, but I am not sure that we need to expand the vpci lock.
Maybe it is better to add another lock that protects domain->pdev_list? I
am afraid that a combined lock would be confusing, as it would protect two
semi-related entities.
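Roughly like this (field and helper names made up for illustration):

    struct domain {
        /* ... */
        struct list_head pdev_list;
        rwlock_t pdev_list_lock;     /* protects pdev_list membership only */
    };

    /* modify_bars() could then iterate without the global pcidevs lock: */
    read_lock(&pdev->domain->pdev_list_lock);
    for_each_pdev ( pdev->domain, tmp )
    {
        /* ... BAR overlap checks ... */
    }
    read_unlock(&pdev->domain->pdev_list_lock);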
>
>> for_each_pdev ( pdev->domain, tmp )
>> {
>> if ( tmp == pdev )
>> @@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>> start, end, rc);
>> rangeset_destroy(mem);
>> + pcidevs_unlock();
>> return rc;
>> }
>> }
>> }
>> + pcidevs_unlock();
>>
>> ASSERT(dev);
>>
>> @@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>> struct vpci_bar *bars = header->bars;
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>> {
>> case PCI_HEADER_TYPE_NORMAL:
>> @@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>> }
>> }
>>
>> + ASSERT(!header->rom_reg);
>> +
>> /* Check expansion ROM. */
>> rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
>> if ( rc > 0 && size )
>> @@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
>> 4, rom);
>> if ( rc )
>> rom->type = VPCI_BAR_EMPTY;
>> +
>> + header->rom_reg = rom_reg;
>> }
>>
>> return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>> }
>> REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>
>> +static bool overlap(unsigned int r1_offset, unsigned int r1_size,
>> + unsigned int r2_offset, unsigned int r2_size)
>> +{
>> + /* Return true if there is an overlap. */
>> + return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;
>
> Hm, we already have a vpci_register_cmp(), which does a very similar
> comparison. I wonder if it would be possible to somehow use that
> here.
>
Yeah, I'll revisit this part.
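For instance, overlap() could be expressed directly in terms of it (a
sketch; this assumes vpci_register_cmp() keeps returning 0 exactly when the
two ranges overlap):

    static bool overlap(unsigned int o1, unsigned int s1,
                        unsigned int o2, unsigned int s2)
    {
        const struct vpci_register r1 = { .offset = o1, .size = s1 };
        const struct vpci_register r2 = { .offset = o2, .size = s2 };

        /* vpci_register_cmp() returns 0 iff the two ranges overlap. */
        return vpci_register_cmp(&r1, &r2) == 0;
    }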
>> +}
>> +
>> +bool vpci_header_need_write_lock(const struct pci_dev *pdev,
>> + unsigned int start, unsigned int size)
>> +{
>> + /*
>> + * Writing the command register and ROM BAR register may trigger
>> + * modify_bars to run, which in turn may access multiple pdevs while
>> + * checking for the existing BAR's overlap. The overlapping check, if done
>> + * under the read lock, requires vpci->lock to be acquired on both devices
>> + * being compared, which may produce a deadlock. At the same time it is not
>> + * possible to upgrade read lock to write lock in such a case.
>> + * Check which registers are going to be written and return true if lock
>> + * needs to be acquired in write mode.
>> + */
>> + if ( overlap(start, size, PCI_COMMAND, 2) ||
>> + (pdev->vpci->header.rom_reg &&
>> + overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> /*
>> * Local variables:
>> * mode: C
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..e7d1f916a0 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>> uint16_t control;
>> int ret;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> if ( !pos )
>> return 0;
>>
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>
>> void vpci_dump_msi(void)
>> {
>> - const struct domain *d;
>> + struct domain *d;
>>
>> rcu_read_lock(&domlist_read_lock);
>> for_each_domain ( d )
>> @@ -277,6 +279,15 @@ void vpci_dump_msi(void)
>>
>> printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>
>> + if ( !read_trylock(&d->vpci_rwlock) )
>> + continue;
>> +
>> + if ( !pcidevs_trylock() )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> + continue;
>> + }
>> +
>> for_each_pdev ( d, pdev )
>> {
>> const struct vpci_msi *msi;
>> @@ -318,14 +329,22 @@ void vpci_dump_msi(void)
>> * holding the lock.
>> */
>> printk("unable to print all MSI-X entries: %d\n", rc);
>> - process_pending_softirqs();
>> - continue;
>> + goto pdev_done;
>> }
>> }
>>
>> spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> + pcidevs_unlock();
>> + read_unlock(&d->vpci_rwlock);
>> +
>> process_pending_softirqs();
>> +
>> + read_lock(&d->vpci_rwlock);
>> + pcidevs_lock();
>> }
>> + pcidevs_unlock();
>> + read_unlock(&d->vpci_rwlock);
>> }
>> rcu_read_unlock(&domlist_read_lock);
>> }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 25bde77586..b5a7dfdf9c 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -143,6 +143,7 @@ static void cf_check control_write(
>> pci_conf_write16(pdev->sbdf, reg, val);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>> {
>> struct vpci_msix *msix;
>> @@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>
>> static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>> {
>> - return !!msix_find(v->domain, addr);
>> + int rc;
>> +
>> + read_lock(&v->domain->vpci_rwlock);
>> + rc = !!msix_find(v->domain, addr);
>> + read_unlock(&v->domain->vpci_rwlock);
>> +
>> + return rc;
>> }
>>
>> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>> static int cf_check msix_read(
>> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>> {
>> - const struct domain *d = v->domain;
>> - struct vpci_msix *msix = msix_find(d, addr);
>> + struct domain *d = v->domain;
>> + struct vpci_msix *msix;
>> const struct vpci_msix_entry *entry;
>> unsigned int offset;
>>
>> *data = ~0ul;
>>
>> + read_lock(&d->vpci_rwlock);
>> +
>> + msix = msix_find(d, addr);
>> if ( !msix )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_RETRY;
>> + }
>>
>> if ( adjacent_handle(msix, addr) )
>> - return adjacent_read(d, msix, addr, len, data);
>> + {
>> + int rc = adjacent_read(d, msix, addr, len, data);
>> + read_unlock(&d->vpci_rwlock);
>> + return rc;
>> + }
>>
>> if ( !access_allowed(msix->pdev, addr, len) )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_OKAY;
>> + }
>>
>> spin_lock(&msix->pdev->vpci->lock);
>> entry = get_entry(msix, addr);
>> @@ -404,6 +424,7 @@ static int cf_check msix_read(
>> break;
>> }
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&d->vpci_rwlock);
>>
>> return X86EMUL_OKAY;
>> }
>> @@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>> static int cf_check msix_write(
>> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>> {
>> - const struct domain *d = v->domain;
>> - struct vpci_msix *msix = msix_find(d, addr);
>> + struct domain *d = v->domain;
>> + struct vpci_msix *msix;
>> struct vpci_msix_entry *entry;
>> unsigned int offset;
>>
>> + read_lock(&d->vpci_rwlock);
>> +
>> + msix = msix_find(d, addr);
>> if ( !msix )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_RETRY;
>> + }
>>
>> if ( adjacent_handle(msix, addr) )
>> - return adjacent_write(d, msix, addr, len, data);
>> + {
>> + int rc = adjacent_write(d, msix, addr, len, data);
>> + read_unlock(&d->vpci_rwlock);
>> + return rc;
>> + }
>>
>> if ( !access_allowed(msix->pdev, addr, len) )
>> + {
>> + read_unlock(&d->vpci_rwlock);
>> return X86EMUL_OKAY;
>> + }
>>
>> spin_lock(&msix->pdev->vpci->lock);
>> entry = get_entry(msix, addr);
>> @@ -579,6 +613,7 @@ static int cf_check msix_write(
>> break;
>> }
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&d->vpci_rwlock);
>>
>> return X86EMUL_OKAY;
>> }
>> @@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>> struct vpci_msix *msix;
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>> PCI_CAP_ID_MSIX);
>> if ( !msix_offset )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 652807a4a4..1270174e78 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>
>> void vpci_remove_device(struct pci_dev *pdev)
>> {
>> - if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> + struct vpci *vpci;
>> +
>> + if ( !has_vpci(pdev->domain) )
>> return;
>>
>> - spin_lock(&pdev->vpci->lock);
>> + write_lock(&pdev->domain->vpci_rwlock);
>> + if ( !pdev->vpci )
>> + {
>> + write_unlock(&pdev->domain->vpci_rwlock);
>> + return;
>> + }
>> +
>> + vpci = pdev->vpci;
>> + pdev->vpci = NULL;
>> + write_unlock(&pdev->domain->vpci_rwlock);
>> +
>> - while ( !list_empty(&pdev->vpci->handlers) )
>> + while ( !list_empty(&vpci->handlers) )
>> {
>> - struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
>> + struct vpci_register *r = list_first_entry(&vpci->handlers,
>> struct vpci_register,
>> node);
>>
>> list_del(&r->node);
>> xfree(r);
>> }
>> - spin_unlock(&pdev->vpci->lock);
>> +
>> - if ( pdev->vpci->msix )
>> + if ( vpci->msix )
>> {
>> unsigned int i;
>> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>> - if ( pdev->vpci->msix->table[i] )
>> - iounmap(pdev->vpci->msix->table[i]);
>> + if ( vpci->msix->table[i] )
>> + iounmap(vpci->msix->table[i]);
>> }
>> - xfree(pdev->vpci->msix);
>> - xfree(pdev->vpci->msi);
>> - xfree(pdev->vpci);
>> - pdev->vpci = NULL;
>> + xfree(vpci->msix);
>> + xfree(vpci->msi);
>> + xfree(vpci);
>> }
>>
>> int vpci_add_handlers(struct pci_dev *pdev)
>> {
>> + struct vpci *vpci;
>> unsigned int i;
>> int rc = 0;
>>
>> if ( !has_vpci(pdev->domain) )
>> return 0;
>>
>> + vpci = xzalloc(struct vpci);
>> + if ( !vpci )
>> + return -ENOMEM;
>> +
>> + INIT_LIST_HEAD(&vpci->handlers);
>> + spin_lock_init(&vpci->lock);
>> +
>> + write_lock(&pdev->domain->vpci_rwlock);
>
> I think the usage of the vpci per-domain lock here and in
> vpci_remove_device() create an ABBA deadlock situation with the usage
> of it in modify_bars().
>
> Both vpci_add_handlers() and vpci_remove_device() get called with the
> pcidevs lock taken, and take the per-domain vpci lock afterwards,
> while modify_bars() does the inverse locking, gets called with the
> per-domain vpci lock taken and then take the pcidevs lock inside the
> function.
You suggested using a separate per-domain pdev_list lock in modify_bars().
Looks like this can help with the ABBA deadlock.
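The resulting nesting would then be uniform across all paths, something like
this (pdev_list_lock being the hypothetical per-domain lock discussed above):

    /*
     * Intended lock order:
     *
     *   pcidevs_lock()             - device add/remove paths only
     *     -> d->vpci_rwlock        - write for add/remove, read otherwise
     *       -> d->pdev_list_lock   - pdev_list iteration (modify_bars())
     *         -> pdev->vpci->lock
     */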
>
>> +
>> /* We should not get here twice for the same device. */
>> ASSERT(!pdev->vpci);
>>
>> - pdev->vpci = xzalloc(struct vpci);
>> - if ( !pdev->vpci )
>> - return -ENOMEM;
>> -
>> - INIT_LIST_HEAD(&pdev->vpci->handlers);
>> - spin_lock_init(&pdev->vpci->lock);
>> + pdev->vpci = vpci;
>>
>> for ( i = 0; i < NUM_VPCI_INIT; i++ )
>> {
>> @@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
>> if ( rc )
>> break;
>> }
>> + write_unlock(&pdev->domain->vpci_rwlock);
>>
>> if ( rc )
>> vpci_remove_device(pdev);
>> @@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
>> return pci_conf_read32(pdev->sbdf, reg);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>> vpci_write_t *write_handler, unsigned int offset,
>> unsigned int size, void *data)
>> @@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>> r->offset = offset;
>> r->private = data;
>>
>> - spin_lock(&vpci->lock);
>
> If you remove the lock here we need to assert that the per-domain vpci
> lock is taken in write mode.
Yep, an assert will be better than the comment above.
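For reference, it could look like this, although it would require struct
vpci to gain a way back to the domain (a hypothetical pdev back-pointer, not
present in this patch):

    int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                          vpci_write_t *write_handler, unsigned int offset,
                          unsigned int size, void *data)
    {
        /* Hypothetical: needs a vpci->pdev back-pointer to reach the domain. */
        ASSERT(rw_is_write_locked(&vpci->pdev->domain->vpci_rwlock));
        /* ... */
    }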
>
>> -
>> /* The list of handlers must be kept sorted at all times. */
>> list_for_each ( prev, &vpci->handlers )
>> {
>> @@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>> break;
>> if ( cmp == 0 )
>> {
>> - spin_unlock(&vpci->lock);
>> xfree(r);
>> return -EEXIST;
>> }
>> }
>>
>> list_add_tail(&r->node, prev);
>> - spin_unlock(&vpci->lock);
>>
>> return 0;
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>> unsigned int size)
>> {
>> const struct vpci_register r = { .offset = offset, .size = size };
>> struct vpci_register *rm;
>>
>> - spin_lock(&vpci->lock);
>> list_for_each_entry ( rm, &vpci->handlers, node )
>> {
>> int cmp = vpci_register_cmp(&r, rm);
>> @@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>> if ( !cmp && rm->offset == offset && rm->size == size )
>> {
>> list_del(&rm->node);
>> - spin_unlock(&vpci->lock);
>> xfree(rm);
>> return 0;
>> }
>> if ( cmp <= 0 )
>> break;
>> }
>> - spin_unlock(&vpci->lock);
>
> Same here about the per-domain lock being taken.
>
>>
>> return -ENOENT;
>> }
>> @@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>
>> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>> {
>> - const struct domain *d = current->domain;
>> + struct domain *d = current->domain;
>> const struct pci_dev *pdev;
>> const struct vpci_register *r;
>> unsigned int data_offset = 0;
>> @@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>> }
>>
>> /* Find the PCI dev matching the address. */
>> + pcidevs_lock();
>> pdev = pci_get_pdev(d, sbdf);
>> - if ( !pdev || !pdev->vpci )
>> + pcidevs_unlock();
>
> I think it's worth looking into expanding the per-domain vpci-lock to
> a pdevs lock or similar in order to protect the pdev_list also if
> possible. So that we can avoid taking the pcidevs lock here.
Agreed.
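I.e. something like this (a sketch, with the widened per-domain lock also
standing in for pcidevs_lock() around the device lookup):

    read_lock(&d->vpci_rwlock);     /* also pins pdev_list membership */
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci )
    {
        read_unlock(&d->vpci_rwlock);
        return vpci_read_hw(sbdf, reg, size);
    }

    /* ... per-register handling under pdev->vpci->lock ... */

    read_unlock(&d->vpci_rwlock);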
--
WBR, Volodymyr