
Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci



Hello,

On 04.02.22 16:57, Roger Pau Monné wrote:
> On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 04.02.22 15:06, Roger Pau Monné wrote:
>>> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 14:47, Jan Beulich wrote:
>>>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 13:37, Jan Beulich wrote:
>>>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>                        continue;
>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>>>>>> +        {
>>>>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>> +            continue;
>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>                for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>>>>>                {
>>>>>>>>>>>>>>                    const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>                    rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>>>>>                    if ( rc )
>>>>>>>>>>>>>>                    {
>>>>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>                        printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>>>>>                               start, end, rc);
>>>>>>>>>>>>>>                        rangeset_destroy(mem);
>>>>>>>>>>>>>>                        return rc;
>>>>>>>>>>>>>>                    }
>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>            }
>>>>>>>>>>>>> At the first glance this simply looks like another unjustified 
>>>>>>>>>>>>> (in the
>>>>>>>>>>>>> description) change, as you're not converting anything here but 
>>>>>>>>>>>>> you
>>>>>>>>>>>>> actually add locking (and I realize this was there before, so I'm 
>>>>>>>>>>>>> sorry
>>>>>>>>>>>>> for not pointing this out earlier).
>>>>>>>>>>>> Well, I thought that the description already has "...the lock can 
>>>>>>>>>>>> be
>>>>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>>>>>>        But then I wonder whether you
>>>>>>>>>>>>> actually tested this, since I can't help getting the impression 
>>>>>>>>>>>>> that
>>>>>>>>>>>>> you're introducing a live-lock: The function is called from 
>>>>>>>>>>>>> cmd_write()
>>>>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). 
>>>>>>>>>>>>> Yet that
>>>>>>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - 
>>>>>>>>>>>>> otoh
>>>>>>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>>>>>>> then we'll deadlock.
>>>>>>>>>>>>
>>>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>>>>>> if tmp != pdev
>>>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock 
>>>>>>>>>>> potential
>>>>>>>>>>> between the two locks.
>>>>>>>>>> I am not sure I can suggest a better solution here
>>>>>>>>>> @Roger, @Jan, could you please help here?
>>>>>>>>> Well, first of all I'd like to mention that while it may have been 
>>>>>>>>> okay to
>>>>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when 
>>>>>>>>> dealing
>>>>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to 
>>>>>>>>> the
>>>>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>>>>>>> there it probably wants to be a try-lock.
>>>>>>>>>
>>>>>>>>> Next I'd like to point out that here we have the still pending issue 
>>>>>>>>> of
>>>>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC 
>>>>>>>>> patch
>>>>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the 
>>>>>>>>> solution
>>>>>>>>> here, I think it wants to at least account for the extra need there.
>>>>>>>> Yes, sorry, I should take care of that.
>>>>>>>>
>>>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with 
>>>>>>>>> avoiding
>>>>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would 
>>>>>>>>> be
>>>>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>>>>>>> would acquire it in read mode, and here you'd acquire it in write 
>>>>>>>>> mode (in
>>>>>>>>> the former case around the vpci lock, while in the latter case there 
>>>>>>>>> may
>>>>>>>>> then not be any need to acquire the individual vpci locks at all). 
>>>>>>>>> FTAOD:
>>>>>>>>> I haven't fully thought through all implications (and hence whether 
>>>>>>>>> this is
>>>>>>>>> viable in the first place); I expect you will, documenting what you've
>>>>>>>>> found in the resulting patch description. Of course the double lock
>>>>>>>>> acquire/release would then likely want hiding in helper functions.
>>>>>>>> I've been also thinking about this, and whether it's really worth to
>>>>>>>> have a per-device lock rather than a per-domain one that protects all
>>>>>>>> vpci regions of the devices assigned to the domain.
>>>>>>>>
>>>>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>>>>>>> and the only place I could see a benefit of having per-device locks is
>>>>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>>>>>>> likely very performance sensitive, so adding a per-domain lock there
>>>>>>>> could be a bottleneck.
>>>>>>> Hmm, with method 1 accesses serializing globally is basically
>>>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>>>>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>>>>>> only at device level. See our own pci_config_lock, which applies to
>>>>>>> only method 1 accesses; we don't look to be serializing MMCFG
>>>>>>> accesses at all.
>>>>>>>
>>>>>>>> We could alternatively do a per-domain rwlock for vpci and special case
>>>>>>>> the MSI-X area to also have a per-device specific lock. At which point
>>>>>>>> it becomes fairly similar to what you propose.
>>>>>> @Jan, @Roger
>>>>>>
>>>>>> 1. d->vpci_lock - rwlock <- this protects vpci
>>>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>>>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>>>>>> really depend on vPCI?
>>>>> If so, perhaps indeed better the latter. But as said in reply to Roger,
>>>>> I'm not convinced (yet) that doing away with the per-device lock is a
>>>>> good move. As said there - we're ourselves doing fully parallel MMCFG
>>>>> accesses, so OSes ought to be fine to do so, too.
>>>> But with pdev->vpci_lock we face ABBA...
>>> I think it would be easier to start with a per-domain rwlock that
>>> guarantees pdev->vpci cannot be removed under our feet. This would be
>>> taken in read mode in vpci_{read,write} and in write mode when
>>> removing a device from a domain.
>>>
>>> Then there are also other issues regarding vPCI locking that need to
>>> be fixed, but that lock would likely be a start.
>> Or let's see the problem at a different angle: this is the only place
>> which breaks the use of pdev->vpci_lock. Because all other places
>> do not try to acquire the lock of any two devices at a time.
>> So, what if we re-work the offending piece of code instead?
>> That way we do not break parallel access and have the lock per-device
>> which might also be a plus.
>>
>> By re-work I mean, that instead of reading already mapped regions
>> from tmp we can employ a d->pci_mapped_regions range set which
>> will hold all the already mapped ranges. And when it is needed to access
>> that range set we use pcidevs_lock which seems to be rare.
>> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
>> ABBA won't be possible at all.
> Sadly that won't replace the usage of the loop in modify_bars. This is
> not (exclusively) done in order to prevent mapping the same region
> multiple times, but rather to prevent unmapping of regions as long as
> there's an enabled BAR that's using it.
>
> If you wanted to use something like d->pci_mapped_regions it would
> have to keep reference counts to regions, in order to know when a
> mapping is no longer required by any BAR on the system with memory
> decoding enabled.
I missed this path, thank you

I tried to analyze the locking in pci/vpci.

First of all, some context to recall the goal: the rationale behind moving
pdev->vpci->lock outside of struct vpci is to be able to create and destroy
pdev->vpci dynamically. For that reason the lock needs to live outside of
pdev->vpci.
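
In code terms, a minimal sketch of the intended layout (other fields omitted;
the exact declarations in the patch may differ slightly):

    struct pci_dev {
        /* ... */
        spinlock_t vpci_lock;  /* moved out of struct vpci: stays valid while
                                  pdev->vpci is created/destroyed */
        struct vpci *vpci;     /* may now be allocated and freed at run time */
    };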

Some of the callers of the vPCI code and the locking they use:

======================================
vpci_mmio_read/vpci_mmcfg_read
======================================
   - vpci_ecam_read
   - vpci_read
    !!!!!!!! the pdev is looked up, then pdev->vpci_lock is taken !!!!!!!!
    - msix:
     - control_read
    - header:
     - guest_bar_read
    - msi:
     - control_read
     - address_read/address_hi_read
     - data_read
     - mask_read

======================================
vpci_mmio_write/vpci_mmcfg_write
======================================
   - vpci_ecam_write
   - vpci_write
    !!!!!!!! the pdev is looked up, then pdev->vpci_lock is taken !!!!!!!!
    - msix:
     - control_write
    - header:
     - bar_write/guest_bar_write
     - cmd_write/guest_cmd_write
     - rom_write
      - all of the above write handlers may call modify_bars
    - msi:
     - control_write
     - address_write/address_hi_write
     - data_write
     - mask_write

======================================
pci_add_device: locked with pcidevs_lock
======================================
   - vpci_add_handlers
    ++++++++ pdev->vpci_lock is used ++++++++

======================================
pci_remove_device: locked with pcidevs_lock
======================================
- vpci_remove_device
   ++++++++ pdev->vpci_lock is used ++++++++
- pci_cleanup_msi
- free_pdev

======================================
XEN_DOMCTL_assign_device: locked with pcidevs_lock
======================================
- assign_device
  - vpci_deassign_device
  - pdev_msix_assign
  - vpci_assign_device
   - vpci_add_handlers
     ++++++++ pdev->vpci_lock is used ++++++++

======================================
XEN_DOMCTL_deassign_device: locked with pcidevs_lock
======================================
- deassign_device
  - vpci_deassign_device
    ++++++++ pdev->vpci_lock is used ++++++++
   - vpci_remove_device


======================================
modify_bars is a special case: it is the only function which tries to lock
two pci_dev devices at a time. It does so to check for overlaps with other
BARs which may already have been mapped or unmapped.

So, this is the only case which may deadlock because of pci_dev->vpci_lock
(see the simplified call chain right after this section).
======================================
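
For illustration, the simplified call chain with the hunk quoted above is:

    vpci_write(sbdf, reg, size, data)
        spin_lock(&pdev->vpci_lock);              /* lock taken here */
        -> cmd_write() / rom_write()
           -> modify_bars(pdev, cmd, rom_only)
                for_each_pdev ( pdev->domain, tmp )
                    spin_lock(&tmp->vpci_lock);   /* deadlocks when tmp == pdev */
                    ...
                    spin_unlock(&tmp->vpci_lock);

Making the inner lock conditional on tmp != pdev avoids the self-deadlock, but
then two CPUs writing to two different devices of the same domain may take the
two locks in opposite order, which is the ABBA case mentioned above.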

Bottom line:
======================================

1. vpci_{read|write} are not protected with pcidevs_lock and can run in
parallel with pci_remove_device, which can free the pdev after vpci_{read|write}
has acquired the pdev pointer. This may lead to a use-after-free when the pdev
is dereferenced.

So, to protect the pdev dereference, vpci_{read|write} must also take
pcidevs_lock.
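
Roughly, the idea is (a sketch only; helper names follow the existing code but
may not be exact):

    /* In vpci_{read|write}: guard the pdev lookup and the vpci access with
     * pcidevs_lock so that pci_remove_device cannot free the pdev (and its
     * vpci) underneath us. */
    pcidevs_lock();
    pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
    if ( !pdev )
    {
        pcidevs_unlock();
        return vpci_read_hw(sbdf, reg, size); /* fall back to hw access */
    }
    spin_lock(&pdev->vpci_lock);
    /* ... dispatch to the per-register handlers ... */
    spin_unlock(&pdev->vpci_lock);
    pcidevs_unlock();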

2. The only offending place which stands in the way of pci_dev->vpci_lock is
modify_bars. If it can be re-worked to track the already mapped and unmapped
regions, then the possible deadlock is avoided and pci_dev->vpci_lock can be
used (plain rangesets won't help here, as reference counting would also need
to be implemented).

If pcidevs_lock is used in vpci_{read|write}, then no deadlock is possible,
but the modify_bars code must still be re-worked not to lock against itself
(i.e. not to take both pdev->vpci_lock and tmp->vpci_lock when pdev == tmp;
this is a minor change).

3. We may think about a per-domain rwlock combined with pdev->vpci_lock; this
solves modify_bars accessing two pdevs (a rough sketch of this option follows
below). But it doesn't solve the possible pdev de-reference in vpci_{read|write}
racing with pci_remove_device.
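
For completeness, the per-domain rwlock option would roughly look like this
(d->vpci_rwlock is a made-up name here):

    /* Per-device paths take the lock in read mode plus the per-device lock. */
    vpci_{read|write}:
        read_lock(&d->vpci_rwlock);
        spin_lock(&pdev->vpci_lock);
        /* ... access pdev->vpci ... */
        spin_unlock(&pdev->vpci_lock);
        read_unlock(&d->vpci_rwlock);

    /* modify_bars (and removing a device from the domain) takes it in write
     * mode, so all of the domain's pdev->vpci can be looked at without taking
     * the individual vpci locks. */
    modify_bars / vpci_remove_device:
        write_lock(&d->vpci_rwlock);
        /* ... */
        write_unlock(&d->vpci_rwlock);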

@Roger, @Jan, I would like to hear what you think about the above analysis
and how we can proceed with the locking re-work.

Thank you in advance,
Oleksandr

 

