
Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci




On 07.02.22 14:46, Roger Pau Monné wrote:
> On Mon, Feb 07, 2022 at 11:08:39AM +0000, Oleksandr Andrushchenko wrote:
>> Hello,
>>
>> On 04.02.22 16:57, Roger Pau Monné wrote:
>>> On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 15:06, Roger Pau Monné wrote:
>>>>> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 14:47, Jan Beulich wrote:
>>>>>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04.02.22 13:37, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>>>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>>>                  continue;
>>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>>>>>>>> +        {
>>>>>>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>> +            continue;
>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>          for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>>>>>>>          {
>>>>>>>>>>>>>>>>              const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>>>              rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>>>>>>>              if ( rc )
>>>>>>>>>>>>>>>>              {
>>>>>>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>>>>>>>                         start, end, rc);
>>>>>>>>>>>>>>>>                  rangeset_destroy(mem);
>>>>>>>>>>>>>>>>                  return rc;
>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>> At the first glance this simply looks like another
>>>>>>>>>>>>>>> unjustified (in the description) change, as you're not
>>>>>>>>>>>>>>> converting anything here but you actually add locking (and I
>>>>>>>>>>>>>>> realize this was there before, so I'm sorry for not pointing
>>>>>>>>>>>>>>> this out earlier).
>>>>>>>>>>>>>> Well, I thought that the description already has "...the lock
>>>>>>>>>>>>>> can be used (and in a few cases is used right away) to check
>>>>>>>>>>>>>> whether vpci is present" and this is enough for such uses as
>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>> But then I wonder whether you actually tested this, since I
>>>>>>>>>>>>>>> can't help getting the impression that you're introducing a
>>>>>>>>>>>>>>> live-lock: The function is called from cmd_write() and
>>>>>>>>>>>>>>> rom_write(), which in turn are called out of vpci_write().
>>>>>>>>>>>>>>> Yet that function already holds the lock, and the lock is not
>>>>>>>>>>>>>>> (currently) recursive. (For the 3rd caller of the function -
>>>>>>>>>>>>>>> init_bars() - otoh the locking looks to be entirely
>>>>>>>>>>>>>>> unnecessary.)
>>>>>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to
>>>>>>>>>>>>>> acquire the lock. But if tmp == pdev and rom_only == true then
>>>>>>>>>>>>>> we'll deadlock.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>>>>>>>> if tmp != pdev
>>>>>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock
>>>>>>>>>>>>> potential between the two locks.
>>>>>>>>>>>> I am not sure I can suggest a better solution here
>>>>>>>>>>>> @Roger, @Jan, could you please help here?
>>>>>>>>>>> Well, first of all I'd like to mention that while it may have
>>>>>>>>>>> been okay to not hold pcidevs_lock here for Dom0, it surely needs
>>>>>>>>>>> acquiring when dealing with DomU-s' lists of PCI devices. The
>>>>>>>>>>> requirement really applies to the other use of for_each_pdev() as
>>>>>>>>>>> well (in vpci_dump_msi()), except that there it probably wants to
>>>>>>>>>>> be a try-lock.
>>>>>>>>>>>
>>>>>>>>>>> Next I'd like to point out that here we have the still pending
>>>>>>>>>>> issue of how to deal with hidden devices, which Dom0 can access.
>>>>>>>>>>> See my RFC patch "vPCI: account for hidden devices in
>>>>>>>>>>> modify_bars()". Whatever the solution here, I think it wants to
>>>>>>>>>>> at least account for the extra need there.
>>>>>>>>>> Yes, sorry, I should take care of that.
>>>>>>>>>>
>>>>>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with
>>>>>>>>>>> avoiding the deadlock, as it's imo not an option at all to
>>>>>>>>>>> acquire that lock everywhere else you access ->vpci (or else the
>>>>>>>>>>> vpci lock itself would be pointless). But a per-domain auxiliary
>>>>>>>>>>> r/w lock may help: Other paths would acquire it in read mode, and
>>>>>>>>>>> here you'd acquire it in write mode (in the former case around
>>>>>>>>>>> the vpci lock, while in the latter case there may then not be any
>>>>>>>>>>> need to acquire the individual vpci locks at all). FTAOD: I
>>>>>>>>>>> haven't fully thought through all implications (and hence whether
>>>>>>>>>>> this is viable in the first place); I expect you will,
>>>>>>>>>>> documenting what you've found in the resulting patch description.
>>>>>>>>>>> Of course the double lock acquire/release would then likely want
>>>>>>>>>>> hiding in helper functions.
>>>>>>>>>> I've been also thinking about this, and whether it's really worth
>>>>>>>>>> to have a per-device lock rather than a per-domain one that
>>>>>>>>>> protects all vpci regions of the devices assigned to the domain.
>>>>>>>>>>
>>>>>>>>>> The OS is likely to serialize accesses to the PCI config space
>>>>>>>>>> anyway, and the only place I could see a benefit of having
>>>>>>>>>> per-device locks is in the handling of MSI-X tables, as the
>>>>>>>>>> handling of the mask bit is likely very performance sensitive, so
>>>>>>>>>> adding a per-domain lock there could be a bottleneck.
>>>>>>>>> Hmm, with method 1 accesses serializing globally is basically
>>>>>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>>>>>>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>>>>>>>> only at device level. See our own pci_config_lock, which applies to
>>>>>>>>> only method 1 accesses; we don't look to be serializing MMCFG
>>>>>>>>> accesses at all.
>>>>>>>>>
>>>>>>>>>> We could alternatively do a per-domain rwlock for vpci and special
>>>>>>>>>> case the MSI-X area to also have a per-device specific lock. At
>>>>>>>>>> which point it becomes fairly similar to what you propose.
>>>>>>>> @Jan, @Roger
>>>>>>>>
>>>>>>>> 1. d->vpci_lock - rwlock <- this protects vpci
>>>>>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>>>>>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>>>>>>>> really depend on vPCI?
>>>>>>> If so, perhaps indeed better the latter. But as said in reply to Roger,
>>>>>>> I'm not convinced (yet) that doing away with the per-device lock is a
>>>>>>> good move. As said there - we're ourselves doing fully parallel MMCFG
>>>>>>> accesses, so OSes ought to be fine to do so, too.
>>>>>> But with pdev->vpci_lock we face ABBA...
>>>>> I think it would be easier to start with a per-domain rwlock that
>>>>> guarantees pdev->vpci cannot be removed under our feet. This would be
>>>>> taken in read mode in vpci_{read,write} and in write mode when
>>>>> removing a device from a domain.
>>>>>
>>>>> Then there are also other issues regarding vPCI locking that need to
>>>>> be fixed, but that lock would likely be a start.
>>>> Or let's look at the problem from a different angle: this is the only place
>>>> which breaks the use of pdev->vpci_lock. Because all other places
>>>> do not try to acquire the lock of any two devices at a time.
>>>> So, what if we re-work the offending piece of code instead?
>>>> That way we do not break parallel access and have the lock per-device
>>>> which might also be a plus.
>>>>
>>>> By re-work I mean, that instead of reading already mapped regions
>>>> from tmp we can employ a d->pci_mapped_regions range set which
>>>> will hold all the already mapped ranges. And when it is needed to access
>>>> that range set we use pcidevs_lock which seems to be rare.
>>>> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
>>>> ABBA won't be possible at all.
>>> Sadly that won't replace the usage of the loop in modify_bars. This is
>>> not (exclusively) done in order to prevent mapping the same region
>>> multiple times, but rather to prevent unmapping of regions as long as
>>> there's an enabled BAR that's using it.
>>>
>>> If you wanted to use something like d->pci_mapped_regions it would
>>> have to keep reference counts to regions, in order to know when a
>>> mapping is no longer required by any BAR on the system with memory
>>> decoding enabled.
>> I missed this path, thank you
>>
>> I tried to analyze the locking in pci/vpci.
>>
>> First of all some context to refresh the target we want:
>> the rationale behind moving pdev->vpci->lock outside
>> is to be able to dynamically create and destroy pdev->vpci.
>> So, for that reason the lock needs to be moved outside of the pdev->vpci.
>>
>> Some of the callers of the vPCI code and locking used:
>>
>> ======================================
>> vpci_mmio_read/vpci_mmcfg_read
>> ======================================
>>     - vpci_ecam_read
>>     - vpci_read
>>      !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
>>      - msix:
>>       - control_read
>>      - header:
>>       - guest_bar_read
>>      - msi:
>>       - control_read
>>       - address_read/address_hi_read
>>       - data_read
>>       - mask_read
>>
>> ======================================
>> vpci_mmio_write/vpci_mmcfg_write
>> ======================================
>>     - vpci_ecam_write
>>     - vpci_write
>>      !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
>>      - msix:
>>       - control_write
>>      - header:
>>       - bar_write/guest_bar_write
>>       - cmd_write/guest_cmd_write
>>       - rom_write
>>        - all write handlers may call modify_bars
>>         modify_bars
>>      - msi:
>>       - control_write
>>       - address_write/address_hi_write
>>       - data_write
>>       - mask_write
>>
>> ======================================
>> pci_add_device: locked with pcidevs_lock
>> ======================================
>>     - vpci_add_handlers
>>      ++++++++ pdev->vpci_lock is used ++++++++
>>
>> ======================================
>> pci_remove_device: locked with pcidevs_lock
>> ======================================
>> - vpci_remove_device
>>     ++++++++ pdev->vpci_lock is used ++++++++
>> - pci_cleanup_msi
>> - free_pdev
>>
>> ======================================
>> XEN_DOMCTL_assign_device: locked with pcidevs_lock
>> ======================================
>> - assign_device
>>    - vpci_deassign_device
>>    - pdev_msix_assign
>>    - vpci_assign_device
>>     - vpci_add_handlers
>>       ++++++++ pdev->vpci_lock is used ++++++++
>>
>> ======================================
>> XEN_DOMCTL_deassign_device: locked with pcidevs_lock
>> ======================================
>> - deassign_device
>>    - vpci_deassign_device
>>      ++++++++ pdev->vpci_lock is used ++++++++
>>     - vpci_remove_device
>>
>>
>> ======================================
>> modify_bars is a special case: this is the only function which tries to
>> lock two pci_dev devices: it is done to check for overlaps with other
>> BARs which may have been already mapped or unmapped.
>>
>> So, this is the only case which may deadlock because of pci_dev->vpci_lock.
>> ======================================
>>
>> Bottom line:
>> ======================================
>>
>> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
>> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
>> acquired the pdev pointer. This may lead to a failure due to a stale pdev
>> dereference.
>>
>> So, to protect the pdev dereference vpci_{read|write} must also use
>> pcidevs_lock.
> We would like to take the pcidevs_lock only while fetching the device
> (ie: pci_get_pdev_by_domain), afterwards it should be fine to lock the
> device using a vpci specific lock so calls to vpci_{read,write} can be
> partially concurrent across multiple domains.
This means this can't be done as a pre-req patch, but only as part of the
patch which changes the locking.
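
Just to check I understand the shape of what you suggest: roughly the below
(a sketch only; the helper name is mine, and pdev->vpci_lock is the
per-device lock introduced by this patch)?

static struct pci_dev *vpci_get_locked_pdev(struct domain *d, pci_sbdf_t sbdf)
{
    struct pci_dev *pdev;

    /* Global lock only for the lookup... */
    pcidevs_lock();
    pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
    if ( pdev )
        /* ...then the per-device lock keeps pdev->vpci stable... */
        spin_lock(&pdev->vpci_lock);
    /* ...and the global lock is dropped before the actual access. */
    pcidevs_unlock();

    return pdev; /* caller drops pdev->vpci_lock when done */
}

If so, then indeed only the lookup itself is serialized globally.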
>
> In fact I think Jan had already pointed out that the pci lock would
> need taking while searching for the device in vpci_{read,write}.
I was referring to the time after we have found pdev: it is currently
possible for pdev to be freed while we are still using it after the search.
>
> It seems to me that if you implement option 3 below taking the
> per-domain rwlock in read mode in vpci_{read|write} will already
> protect you from the device being removed if the same per-domain lock
> is taken in write mode in vpci_remove_device.
Yes, it should. Again, this can't be done as a pre-req patch because it
relies on pdev->vpci_lock.
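
So the pattern would be something like the below (just a sketch; the
function names are illustrative and d->vpci_rwlock is a placeholder name,
not existing code):

/* Reader side, e.g. what vpci_{read,write} would do once pdev is found. */
static void vpci_reader_side(struct domain *d, struct pci_dev *pdev)
{
    read_lock(&d->vpci_rwlock);      /* pdev->vpci cannot be freed under us */
    spin_lock(&pdev->vpci_lock);
    if ( pdev->vpci )
    {
        /* ... dispatch to the register handlers ... */
    }
    spin_unlock(&pdev->vpci_lock);
    read_unlock(&d->vpci_rwlock);
}

/* Writer side, e.g. what vpci_remove_device would do. */
static void vpci_writer_side(struct domain *d, struct pci_dev *pdev)
{
    write_lock(&d->vpci_rwlock);     /* waits until no reader is inside */
    xfree(pdev->vpci);
    pdev->vpci = NULL;
    write_unlock(&d->vpci_rwlock);
}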
>
>> 2. The only offending place which is in the way of pci_dev->vpci_lock is
>> modify_bars. If it can be re-worked to track already mapped and unmapped
>> regions then we can avoid having a possible deadlock and can use
>> pci_dev->vpci_lock (rangesets won't help here as we also need refcounting
>> to be implemented).
> I think a refcounting based solution will be very complex to
> implement. I'm however happy to be proven wrong.
I can't estimate, but I have a feeling that all this playing around with
locking is needed just because of this single piece of code. No other place
suffers from pdev->vpci_lock or needs a d->lock.
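
(For completeness, the refcounting I meant in 2. would be roughly the below,
all names mine; it indeed gets hairy once partially overlapping BARs are
considered.)

/* Sketch of a refcounted "mapped regions" tracker (hypothetical names). */
struct vpci_mapped_region {
    struct list_head node;
    unsigned long start, end;   /* GFN range currently mapped           */
    unsigned int refcnt;        /* enabled BARs relying on this range;  */
                                /* unmap only when this drops to zero   */
};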
>
>> If pcidevs_lock is used for vpci_{read|write} then no deadlock is possible,
>> but modify_bars code must be re-worked not to lock itself (pdev->vpci_lock
>> and tmp->vpci_lock when pdev == tmp, this is minor).
> Taking the pcidevs lock (a global lock) is out of the picture IMO, as
> it's going to serialize all calls of vpci_{read|write}, and would
> create too much contention on the pcidevs lock.
I understand that. But if we want to fix the existing code I see no other
alternative.
>
>> 3. We may think about a per-domain rwlock and pdev->vpci_lock, so this solves
>> modify_bars's two pdevs access. But this doesn't solve possible pdev
>> de-reference in vpci_{read|write} vs pci_remove_device.
> pci_remove_device will call vpci_remove_device, so as long as
> vpci_remove_device takes the per-domain lock in write (exclusive) mode
> it should be fine.
I think I need to see if there are any other places which similarly
require the write lock
>
>> @Roger, @Jan, I would like to hear what you think about the above analysis
>> and how we can proceed with the locking re-work?
> I think the per-domain rwlock seems like a good option. I would do
> that as a pre-patch.
It is. But it seems it won't solve the thing we started this adventure for:

With a per-domain read lock we still get ABBA in modify_bars (hope the below
is correctly seen with a monospace font):

cpu0: vpci_write -> d->RLock -> pdev1->lock -> rom_write -> modify_bars: tmp (pdev2)->lock
cpu1: vpci_write -> d->RLock -> pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1)->lock

There is no API to upgrade a read lock to a write lock in modify_bars which
could help, so in both cases vpci_write should take the write lock.
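
I.e. something along these lines (again only a sketch, with my placeholder
d->vpci_rwlock), so that modify_bars can then walk the domain's other
devices without taking tmp->vpci_lock:

void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                uint32_t data)
{
    struct domain *d = current->domain;

    /* ... look up pdev as today ... */

    /*
     * Exclusive: no other CPU can be inside any vpci handler of this
     * domain while a write (which may reach modify_bars()) is in flight.
     */
    write_lock(&d->vpci_rwlock);
    /* ... dispatch to the write handlers (cmd_write/rom_write/...) ... */
    write_unlock(&d->vpci_rwlock);
}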

Am I missing something here?
>
> Thanks, Roger.
Thank you,
Oleksandr

 

