Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not. This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.

This I think needs to state the locking order of the per-domain
pci_lock wrt the vpci->lock.  AFAICT that's d->pci_lock first, then
vpci->lock.
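Maybe also worth a comment next to the lock declaration.  Just a
sketch of the nesting I mean, modeled on what vpci_read() below ends
up doing (not code from the patch):

    read_lock(&d->pci_lock);          /* keeps pdev->vpci from being removed */
    if ( pdev->vpci )
    {
        spin_lock(&pdev->vpci->lock); /* protects the contents of the vpci */
        /* ... access pdev->vpci ... */
        spin_unlock(&pdev->vpci->lock);
    }
    read_unlock(&d->pci_lock);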
> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
> from being removed.
>
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead. To prevent
> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
> always lock hwdom first.
>
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
>
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
>
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made

I assume that's vpci_msix_arch_print(): there are further accesses to
pdev->vpci, but those use the msix local variable, which holds a copy
of the pointer in pdev->vpci->msix, so that last sentence is not true
I'm afraid.

However the code already tries to cater for the pdev going away, and
hence it's IMO fine.  IOW: your change doesn't make this any better
or worse.

> 5. Introduce pcidevs_trylock, so there is a possibility to try locking
> the pcidev's lock.

I'm confused by this addition, more so when it's not used anywhere.
Can you defer the addition until the patch that makes use of it?

> 6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
>
> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@xxxxxxxx>
>
> ---
>
> Changes in v8:
>  - changed d->vpci_lock to d->pci_lock
>  - introducing d->pci_lock in a separate patch
>  - extended locked region in vpci_process_pending
>  - removed pcidevs_lockis vpci_dump_msi()
>  - removed some changes as they are not needed with
>    the new locking scheme
>  - added handling for hwdom && dom_xen case
> ---
>  xen/arch/x86/hvm/vmsi.c       |  4 +++
>  xen/drivers/passthrough/pci.c |  7 +++++
>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>  xen/drivers/vpci/msi.c        | 14 ++++++++--
>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>  xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
>  xen/include/xen/pci.h         |  1 +
>  7 files changed, 129 insertions(+), 13 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 3cd4923060..8c1bd66b9c 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>
> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->pci_lock);

This should be a read_trylock(), much like the spin_trylock() below.
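Something along these lines is what I'd expect (sketch only, reusing
the -EBUSY exit path that's already present below):

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->pci_lock);
    process_pending_softirqs();
    if ( !read_trylock(&pdev->domain->pci_lock) )
        return -EBUSY;
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
    {
        read_unlock(&pdev->domain->pci_lock);
        return -EBUSY;
    }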
>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 5b4632ead2..6f8692cd9c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>      spin_lock_recursive(&_pcidevs_lock);
>  }
>
> +int pcidevs_trylock(void)
> +{
> +    return spin_trylock_recursive(&_pcidevs_lock);
> +}
> +
>  void pcidevs_unlock(void)
>  {
>      spin_unlock_recursive(&_pcidevs_lock);
> @@ -1144,7 +1149,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>
> +    write_lock(&ctxt->d->pci_lock);
>      err = vpci_add_handlers(pdev);
> +    write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>                 ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index b41556d007..2780fcae72 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -152,6 +152,7 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>
> +        write_lock(&v->domain->pci_lock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
> @@ -170,6 +171,7 @@ bool vpci_process_pending(struct vcpu *v)
>           * failure.
>           */
>          vpci_remove_device(v->vpci.pdev);
> +        write_unlock(&v->domain->pci_lock);
>      }

The handling in vpci_process_pending() wrt vpci_remove_device() is
racy and will need some thinking to get it solved.  Your change
doesn't make it any worse, but I would also be fine with adding a
note in the commit message that vpci_process_pending() is not
adjusted to use the new lock because it needs to be reworked first in
order to be safe against a concurrent vpci_remove_device() call.

>
>      return false;
> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) ==
>              -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        read_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        read_lock(&d->pci_lock);
> +    }

Since this is init-only code you could likely forego the usage of the
locks, but I guess that's more churn than just using them.  In any
case, as this gets called from modify_bars() the locks need to be
dropped/taken in write mode (see comment below).

>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -223,6 +237,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      unsigned int i;
>      int rc;
>
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));

The lock here needs to be taken in write mode I think, so the code
can safely iterate over the contents of each pdev->vpci assigned to
the domain.
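To illustrate why (hypothetical interleaving, not code from the
patch): with d->pci_lock only held in read mode, two concurrent
modify_bars() calls doing the overlap check would need to nest the
per-device vpci->lock, possibly in opposite order:

    /* CPU0: */ spin_lock(&pdevA->vpci->lock); spin_lock(&pdevB->vpci->lock);
    /* CPU1: */ spin_lock(&pdevB->vpci->lock); spin_lock(&pdevA->vpci->lock);

Holding d->pci_lock in write mode instead makes the iteration over
the domain devices exclusive, so no nesting of vpci->lock is needed
at all:

    write_lock(&pdev->domain->pci_lock);
    for_each_pdev ( pdev->domain, tmp )
    {
        /* ... check overlaps against tmp->vpci->header.bars ... */
    }
    write_unlock(&pdev->domain->pci_lock);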
> +
>      if ( !mem )
>          return -ENOMEM;
>
> @@ -502,6 +518,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..e63152c224 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>      uint16_t control;
>      int ret;
>
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

I'm confused by the difference in lock requirements between
init_bars() and init_msi().  In the former you assert for the lock
being taken in read mode, while the latter asserts for write mode.

We want to do initialization in write mode, so that modify_bars()
called by init_bars() has exclusive access to the contents of
pdev->vpci.

> +
>      if ( !pos )
>          return 0;
>
> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>
> +        if ( !read_trylock(&d->pci_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
>                   * holding the lock.
>                   */
>                  printk("unable to print all MSI-X entries: %d\n", rc);
> -                process_pending_softirqs();
> -                continue;
> +                goto pdev_done;
>              }
>          }
>
>          spin_unlock(&pdev->vpci->lock);
> + pdev_done:
>          read_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        read_lock(&d->pci_lock);

read_trylock().

This is not very safe, as the list could be modified while the lock
is dropped, but it's a debug key handler so I'm not very concerned.
However we should at least add a comment that this relies on the list
not being altered while the lock is dropped.

>          }
> +        read_unlock(&d->pci_lock);
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 25bde77586..9481274579 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>
> +    ASSERT(rw_is_locked(&d->pci_lock));

Hm, here you are iterating over pdev->vpci->header.bars for multiple
devices, so I think in addition to the pci_lock in read mode we
should also take the vpci->lock for each pdev.

I think I would like to rework msix_find() so it's msix_get() and
returns with the appropriate vpci->lock taken.  Anyway, that's for a
different patch; the usage of the lock in read mode seems correct,
albeit I might want to move the read_lock() call inside of msix_get()
in the future.
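The calling convention I have in mind would be roughly (just a
sketch; msix_put() is a hypothetical helper that would drop the
vpci->lock taken by msix_get()):

    read_lock(&d->pci_lock);
    msix = msix_get(d, addr);  /* returns with msix->pdev->vpci->lock held */
    if ( !msix )
    {
        read_unlock(&d->pci_lock);
        return X86EMUL_RETRY;
    }
    /* ... access the MSI-X entries ... */
    msix_put(msix);            /* drops msix->pdev->vpci->lock */
    read_unlock(&d->pci_lock);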
> +
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>      {
>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->pci_lock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->pci_lock);
> +
> +    return rc;
>  }
>
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>
>      *data = ~0ul;
>
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);

Nit: missing newline (here and below).

> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +425,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +513,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +614,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>
>      return X86EMUL_OKAY;
>  }
> @@ -665,6 +701,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>      struct vpci_msix *msix;
>      int rc;
>
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>                                        PCI_CAP_ID_MSIX);
>      if ( !msix_offset )
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index d73fa76302..f22cbf2112 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>          return;
>
> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      const unsigned long *ro_map;
>      int rc = 0;
>
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>
> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      uint32_t data = ~(uint32_t)0;
> +    rwlock_t *lock;
>
>      if ( !size )
>      {
> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       * Find the PCI dev matching the address, which for hwdom also requires
>       * consulting DomXEN. Passthrough everything that's not trapped.
>       */
> +    lock = &d->pci_lock;
> +    read_lock(lock);
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
> +    {
> +        read_unlock(lock);
> +        lock = &dom_xen->pci_lock;
> +        read_lock(lock);
>          pdev = pci_get_pdev(dom_xen, sbdf);
> +    }
>      if ( !pdev || !pdev->vpci )
> +    {
> +        read_unlock(lock);
>          return vpci_read_hw(sbdf, reg, size);
> +    }
>
>      spin_lock(&pdev->vpci->lock);
>
> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    read_unlock(lock);
>
>      if ( data_offset < size )
>      {
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>                   r->private);
>  }
>
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void unlock_locks(struct domain *d)
> +{
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
> +    if ( is_hardware_domain(d) )
> +    {
> +        ASSERT(rw_is_locked(&d->pci_lock));
> +        read_unlock(&dom_xen->pci_lock);
> +    }
> +    read_unlock(&d->pci_lock);
> +}
> +
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * If this is hwdom, we need to hold locks for both domain in case if
> +     * modify_bars is called()

Typo: the () wants to be at the end of modify_bars().

>       */
> +    read_lock(&d->pci_lock);
> +
> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> +    if ( is_hardware_domain(d) )
> +        read_lock(&dom_xen->pci_lock);

For modify_bars() we also want the locks to be in write mode (at
least the hw one), so that the position of the BARs can't be changed
while modify_bars() is iterating over them.

Is this something that will be done in a followup change?

> +
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
>          pdev = pci_get_pdev(dom_xen, sbdf);
> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>
>          if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
>              vpci_write_hw(sbdf, reg, size, data);
> +
> +        unlock_locks(d);
>          return;
>      }
>
> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    unlock_locks(d);

There's one issue here: some handlers will call pcidevs_lock(), which
will result in a lock order inversion, as in the previous patch we
agreed that the locking order was pcidevs_lock first, d->pci_lock
after.

For example the MSI control_write() handler will call
vpci_msi_arch_enable() which takes the pcidevs lock.
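IOW, just to illustrate the inverted ordering (not code from the
patch):

    /* Agreed order from the previous patch: */
    pcidevs_lock();                /* 1st */
    read_lock(&d->pci_lock);       /* 2nd */

    /* vpci_write() with this patch: */
    read_lock(&d->pci_lock);       /* taken 1st ... */
    /* handler: control_write() -> vpci_msi_arch_enable() ... */
    pcidevs_lock();                /* ... pcidevs taken 2nd -> inversion */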
I think I will have to look into using a dedicated lock for MSI
related handling, as that's the only place where I think we have this
pattern of taking the pcidevs_lock after the d->pci_lock.

Thanks, Roger.