Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
On 01.04.2026 16:40, Mykyta Poturai wrote:
>
>
> On 4/1/26 17:14, Jan Beulich wrote:
>> On 01.04.2026 16:07, Mykyta Poturai wrote:
>>> On 4/1/26 11:21, Jan Beulich wrote:
>>>> On 01.04.2026 09:59, Mykyta Poturai wrote:
>>>>> On 3/31/26 17:55, Jan Beulich wrote:
>>>>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>>>>
>>>>>>> bool vpci_process_pending(struct vcpu *v)
>>>>>>> {
>>>>>>> - const struct pci_dev *pdev = v->vpci.pdev;
>>>>>>> - struct vpci_header *header = NULL;
>>>>>>> - unsigned int i;
>>>>>>> -
>>>>>>> - if ( !pdev )
>>>>>>> - return false;
>>>>>>> -
>>>>>>> - read_lock(&v->domain->pci_lock);
>>>>>>> -
>>>>>>> - if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>>> + switch ( v->vpci.task )
>>>>>>> {
>>>>>>> - v->vpci.pdev = NULL;
>>>>>>> - read_unlock(&v->domain->pci_lock);
>>>>>>> - return false;
>>>>>>> - }
>>>>>>> -
>>>>>>> - header = &pdev->vpci->header;
>>>>>>> - for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>> + case MODIFY_MEMORY:
>>>>>>> {
>>>>>>> - struct vpci_bar *bar = &header->bars[i];
>>>>>>> - struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>>> - struct map_data data = {
>>>>>>> - .d = v->domain,
>>>>>>> - .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>>>>> - .bar = bar,
>>>>>>> - };
>>>>>>> - int rc;
>>>>>>> + const struct pci_dev *pdev = v->vpci.memory.pdev;
>>>>>>> + struct vpci_header *header = NULL;
>>>>>>> + unsigned int i;
>>>>>>>
>>>>>>> - if ( rangeset_is_empty(mem) )
>>>>>>> - continue;
>>>>>>> + if ( !pdev )
>>>>>>> + break;
>>>>>>>
>>>>>>> - rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>>> + read_lock(&v->domain->pci_lock);
>>>>>>>
>>>>>>> - if ( rc == -ERESTART )
>>>>>>> + if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>>> {
>>>>>>> + v->vpci.memory.pdev = NULL;
>>>>>>> read_unlock(&v->domain->pci_lock);
>>>>>>> - return true;
>>>>>>> + break;
>>>>>>> }
>>>>>>>
>>>>>>> - if ( rc )
>>>>>>> + header = &pdev->vpci->header;
>>>>>>> + for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>> {
>>>>>>> - spin_lock(&pdev->vpci->lock);
>>>>>>> - /* Disable memory decoding unconditionally on failure. */
>>>>>>> - modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>>>>>>> - false);
>>>>>>> - spin_unlock(&pdev->vpci->lock);
>>>>>>> + struct vpci_bar *bar = &header->bars[i];
>>>>>>> + struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>>> + struct map_data data = {
>>>>>>> + .d = v->domain,
>>>>>>> + .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>>>>>>> + .bar = bar,
>>>>>>> + };
>>>>>>> + int rc;
>>>>>>> +
>>>>>>> + if ( rangeset_is_empty(mem) )
>>>>>>> + continue;
>>>>>>>
>>>>>>> - /* Clean all the rangesets */
>>>>>>> - for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>> - if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>>> - rangeset_purge(v->vpci.bar_mem[i]);
>>>>>>> + rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>>>
>>>>>>> - v->vpci.pdev = NULL;
>>>>>>> + if ( rc == -ERESTART )
>>>>>>> + {
>>>>>>> + read_unlock(&v->domain->pci_lock);
>>>>>>> + return true;
>>>>>>> + }
>>>>>>>
>>>>>>> - read_unlock(&v->domain->pci_lock);
>>>>>>> + if ( rc )
>>>>>>> + {
>>>>>>> + spin_lock(&pdev->vpci->lock);
>>>>>>> +            /* Disable memory decoding unconditionally on failure. */
>>>>>>> +            modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>>>>>>> +                            false);
>>>>>>> + spin_unlock(&pdev->vpci->lock);
>>>>>>> +
>>>>>>> + /* Clean all the rangesets */
>>>>>>> + for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>> + if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>>> + rangeset_purge(v->vpci.bar_mem[i]);
>>>>>>> +
>>>>>>> + v->vpci.memory.pdev = NULL;
>>>>>>> +
>>>>>>> + read_unlock(&v->domain->pci_lock);
>>>>>>>
>>>>>>> - if ( !is_hardware_domain(v->domain) )
>>>>>>> - domain_crash(v->domain);
>>>>>>> + if ( !is_hardware_domain(v->domain) )
>>>>>>> + domain_crash(v->domain);
>>>>>>>
>>>>>>> - return false;
>>>>>>> + break;
>>>>>>> + }
>>>>>>> }
>>>>>>> - }
>>>>>>> - v->vpci.pdev = NULL;
>>>>>>> + v->vpci.memory.pdev = NULL;
>>>>>>>
>>>>>>> - spin_lock(&pdev->vpci->lock);
>>>>>>> - modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>>>>>>> - spin_unlock(&pdev->vpci->lock);
>>>>>>> + spin_lock(&pdev->vpci->lock);
>>>>>>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>>>>>>> + spin_unlock(&pdev->vpci->lock);
>>>>>>>
>>>>>>> - read_unlock(&v->domain->pci_lock);
>>>>>>> + read_unlock(&v->domain->pci_lock);
>>>>>>> +
>>>>>>> + break;
>>>>>>> + }
>>>>>>> + case WAIT:
>>>>>>> + if ( NOW() < v->vpci.wait.end )
>>>>>>> + return true;
>>>>>>> + v->vpci.wait.callback(v->vpci.wait.data);
>>>>>>> + break;
>>>>>>
>>>>>> As just indicated in reply to patch 6, busy waiting isn't really
>>>>>> acceptable. This is even more so when the waiting exceeds the typical
>>>>>> length of a scheduling timeslice.
>>>>>>
>>>>>> In that other reply I said to put the vCPU to sleep, but you need to
>>>>>> be careful there too: The domain may not expect its vCPU to not make
>>>>>> any progress for such an extended period of time. This may need doing
>>>>>> entirely differently: Once the command register was written, you may
>>>>>> want to record the time after which accesses to the VF registers are
>>>>>> permitted. Earlier accesses would simply be terminated. You may still
>>>>>> additionally need a timer, in order to kick off BAR mapping after that
>>>>>> time. (Yet better would be if the BAR mapping could be done during
>>>>>> those 100ms. After all that may be a reason why this long a delay is
>>>>>> specified: Firmware on the device may also require some time to set up
>>>>>> the BARs accordingly.)
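[The approach suggested above — record, at the command-register write, the time after which VF register accesses are permitted, and arm a timer to kick off BAR mapping — could be sketched roughly as below. This is a plain-C model with hypothetical names; the real Xen code would use NOW(), MILLISECS() and struct timer rather than these stand-ins.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; names are illustrative, not Xen's. */
struct vf_enable_state {
    uint64_t access_ok_after;   /* ns timestamp: VF registers usable */
    bool bars_mapped;           /* set once the deferred mapping ran */
};

#define VF_ENABLE_DELAY_NS (100ULL * 1000 * 1000)   /* 100ms SR-IOV delay */

/* Called when VFE/MSE are written: record the deadline, arm the "timer". */
static void vf_enable_start(struct vf_enable_state *s, uint64_t now)
{
    s->access_ok_after = now + VF_ENABLE_DELAY_NS;
    s->bars_mapped = false;
}

/* Timer-callback stand-in: map BARs once the delay has elapsed. */
static void vf_enable_timer(struct vf_enable_state *s, uint64_t now)
{
    if ( now >= s->access_ok_after )
        s->bars_mapped = true;      /* would invoke the BAR-mapping logic */
}

/* Access handler: premature accesses are simply terminated. */
static bool vf_access_permitted(const struct vf_enable_state *s, uint64_t now)
{
    return now >= s->access_ok_after;
}
```

[The point of the sketch is that no vCPU ever spins: the access path only compares timestamps, and the timer callback does the deferred mapping work.]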
>>>>>
>>>>> I am not sure it would work that way. If we look at how Linux
>>>>> initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms and
>>>>> then expects the VFs to be operational. If they are not operational at
>>>>> that moment, it considers the operation failed and removes all VFs. If
>>>>> we also wait 100ms before enabling access, the probability of a guest
>>>>> trying to access something before we allow it would be very high.
>>>>
>>>> Well, not really. Our counting of the 100ms necessarily starts before
>>>> Dom0's. Furthermore it may be acceptable (or even appropriate) to stall
>>>> premature accesses (because they shouldn't occur in the first place),
>>>> by blocking the vCPU at that point. A middle route may be possible:
>>>> Terminate accesses in, say, the first 90ms, and stall the vCPU for any
>>>> access past that, but before the 100ms expired.
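[The middle route described above amounts to a three-phase window after the command-register write. A minimal sketch of the per-access decision, using the suggested 90ms/100ms thresholds (the enum and names are hypothetical, chosen here only to make the phases explicit):]

```c
#include <stdint.h>

/* Disposition of a guest access to VF registers; names are illustrative. */
enum vf_access_action {
    VF_TERMINATE,   /* early access: complete it as if the VF were absent */
    VF_STALL,       /* close to the deadline: block the vCPU until it passes */
    VF_ALLOW,       /* delay elapsed: handle the access normally */
};

#define VF_TERMINATE_WINDOW_MS  90   /* terminate accesses in the first 90ms */
#define VF_ENABLE_DELAY_MS     100   /* SR-IOV mandated enable delay */

static enum vf_access_action vf_access_phase(uint64_t elapsed_ms)
{
    if ( elapsed_ms < VF_TERMINATE_WINDOW_MS )
        return VF_TERMINATE;
    if ( elapsed_ms < VF_ENABLE_DELAY_MS )
        return VF_STALL;
    return VF_ALLOW;
}
```

[The design rationale is that accesses well before the deadline shouldn't happen at all, so terminating them is harmless, while an access just short of the deadline is likely a race against our slightly earlier start of the 100ms count, so a brief stall is preferable to a spurious failure.]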
>>>
>>> Is there any real benefit to doing all this work instead of just waiting
>>> for Dom0 to register the VFs? Implementing it the way you described
>>> would require a relatively complex state machine and two timers per
>>> SR-IOV-capable device. It will also probably require some hacks to
>>> handle partially initialized VFs in Xen. This adds a lot of work and
>>> many possible bugs for not a lot of benefit, in my opinion.
>>
>> Odd that you ask me this question. If there was no benefit, why did you do
>> it this way?
>
> Roger asked for this approach in V1, and I saw that it could be done in a
> relatively straightforward way, so I implemented it. I didn't exactly
> understand what the benefits were, but I assumed that there were some and
> that the effort was small enough to just do it if the maintainers were
> asking for it.
>
> Now, with the possibility of redoing everything *again* and making it much
> more complex, I've started to really question whether it's needed. So I
> want to know your and others' opinions on registering VFs with Dom0.
Well, before I voice any view here I'd need to understand why Roger wanted
it done like that. (Apparently you must have agreed sufficiently to not
ask back.)
Jan