
Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Mykyta Poturai <Mykyta_Poturai@xxxxxxxx>
  • Date: Wed, 1 Apr 2026 14:40:38 +0000
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 01 Apr 2026 14:59:48 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


On 4/1/26 17:14, Jan Beulich wrote:
> On 01.04.2026 16:07, Mykyta Poturai wrote:
>> On 4/1/26 11:21, Jan Beulich wrote:
>>> On 01.04.2026 09:59, Mykyta Poturai wrote:
>>>> On 3/31/26 17:55, Jan Beulich wrote:
>>>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>>>     
>>>>>>     bool vpci_process_pending(struct vcpu *v)
>>>>>>     {
>>>>>> -    const struct pci_dev *pdev = v->vpci.pdev;
>>>>>> -    struct vpci_header *header = NULL;
>>>>>> -    unsigned int i;
>>>>>> -
>>>>>> -    if ( !pdev )
>>>>>> -        return false;
>>>>>> -
>>>>>> -    read_lock(&v->domain->pci_lock);
>>>>>> -
>>>>>> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>> +    switch ( v->vpci.task )
>>>>>>         {
>>>>>> -        v->vpci.pdev = NULL;
>>>>>> -        read_unlock(&v->domain->pci_lock);
>>>>>> -        return false;
>>>>>> -    }
>>>>>> -
>>>>>> -    header = &pdev->vpci->header;
>>>>>> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>> +    case MODIFY_MEMORY:
>>>>>>         {
>>>>>> -        struct vpci_bar *bar = &header->bars[i];
>>>>>> -        struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>> -        struct map_data data = {
>>>>>> -            .d = v->domain,
>>>>>> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>>>> -            .bar = bar,
>>>>>> -        };
>>>>>> -        int rc;
>>>>>> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
>>>>>> +        struct vpci_header *header = NULL;
>>>>>> +        unsigned int i;
>>>>>>     
>>>>>> -        if ( rangeset_is_empty(mem) )
>>>>>> -            continue;
>>>>>> +        if ( !pdev )
>>>>>> +            break;
>>>>>>     
>>>>>> -        rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>> +        read_lock(&v->domain->pci_lock);
>>>>>>     
>>>>>> -        if ( rc == -ERESTART )
>>>>>> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>>             {
>>>>>> +            v->vpci.memory.pdev = NULL;
>>>>>>                 read_unlock(&v->domain->pci_lock);
>>>>>> -            return true;
>>>>>> +            break;
>>>>>>             }
>>>>>>     
>>>>>> -        if ( rc )
>>>>>> +        header = &pdev->vpci->header;
>>>>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>             {
>>>>>> -            spin_lock(&pdev->vpci->lock);
>>>>>> -            /* Disable memory decoding unconditionally on failure. */
>>>>>> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>>>>>> -                            false);
>>>>>> -            spin_unlock(&pdev->vpci->lock);
>>>>>> +            struct vpci_bar *bar = &header->bars[i];
>>>>>> +            struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>> +            struct map_data data = {
>>>>>> +                .d = v->domain,
>>>>>> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>>>>>> +                .bar = bar,
>>>>>> +            };
>>>>>> +            int rc;
>>>>>> +
>>>>>> +            if ( rangeset_is_empty(mem) )
>>>>>> +                continue;
>>>>>>     
>>>>>> -            /* Clean all the rangesets */
>>>>>> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>> -                     rangeset_purge(v->vpci.bar_mem[i]);
>>>>>> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>>     
>>>>>> -            v->vpci.pdev = NULL;
>>>>>> +            if ( rc == -ERESTART )
>>>>>> +            {
>>>>>> +                read_unlock(&v->domain->pci_lock);
>>>>>> +                return true;
>>>>>> +            }
>>>>>>     
>>>>>> -            read_unlock(&v->domain->pci_lock);
>>>>>> +            if ( rc )
>>>>>> +            {
>>>>>> +                spin_lock(&pdev->vpci->lock);
>>>>>> +                /* Disable memory decoding unconditionally on failure. */
>>>>>> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>>>>>> +                                false);
>>>>>> +                spin_unlock(&pdev->vpci->lock);
>>>>>> +
>>>>>> +                /* Clean all the rangesets */
>>>>>> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>> +                        rangeset_purge(v->vpci.bar_mem[i]);
>>>>>> +
>>>>>> +                v->vpci.memory.pdev = NULL;
>>>>>> +
>>>>>> +                read_unlock(&v->domain->pci_lock);
>>>>>>     
>>>>>> -            if ( !is_hardware_domain(v->domain) )
>>>>>> -                domain_crash(v->domain);
>>>>>> +                if ( !is_hardware_domain(v->domain) )
>>>>>> +                    domain_crash(v->domain);
>>>>>>     
>>>>>> -            return false;
>>>>>> +                break;
>>>>>> +            }
>>>>>>             }
>>>>>> -    }
>>>>>> -    v->vpci.pdev = NULL;
>>>>>> +        v->vpci.memory.pdev = NULL;
>>>>>>     
>>>>>> -    spin_lock(&pdev->vpci->lock);
>>>>>> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>>>>>> -    spin_unlock(&pdev->vpci->lock);
>>>>>> +        spin_lock(&pdev->vpci->lock);
>>>>>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>>>>>> +        spin_unlock(&pdev->vpci->lock);
>>>>>>     
>>>>>> -    read_unlock(&v->domain->pci_lock);
>>>>>> +        read_unlock(&v->domain->pci_lock);
>>>>>> +
>>>>>> +        break;
>>>>>> +    }
>>>>>> +    case WAIT:
>>>>>> +        if ( NOW() < v->vpci.wait.end )
>>>>>> +            return true;
>>>>>> +        v->vpci.wait.callback(v->vpci.wait.data);
>>>>>> +        break;
>>>>>
>>>>> As just indicated in reply to patch 6, busy waiting isn't really 
>>>>> acceptable.
>>>>> This is even more so when the waiting exceeds the typical length of a
>>>>> scheduling timeslice.
>>>>>
>>>>> In that other reply I said to put the vCPU to sleep, but you need to be 
>>>>> careful
>>>>> there too: The domain may not expect its vCPU to not make any progress 
>>>>> for such
>>>>> an extended period of time. This may need doing entirely differently: 
>>>>> Once the
>>>>> command register was written, you may want to record the time after which
>>>>> accesses to the VF registers are permitted. Earlier accesses would simply 
>>>>> be
>>>>> terminated. You may still additionally need a timer, in order to kick off 
>>>>> BAR
>>>>> mapping after that time. (Yet better would  be if the BAR mapping could be
>>>>> done during those 100ms. After all that may be a reason why this long a 
>>>>> delay
>>>>> is specified: Firmware on the device may also require some time to set up 
>>>>> the
>>>>> BARs accordingly.)
>>>>
>>>> I am not sure it would work that way. If we look at how Linux
>>>> initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms, and then
>>>> expects the VFs to be operational. If they are not operational at that
>>>> moment, it considers the operation failed and removes all VFs. If
>>>> we also wait 100ms before enabling access, the probability of a guest
>>>> trying to access something before we allow it would be very high.
>>>
>>> Well, not really. Our counting of the 100ms necessarily starts before 
>>> Dom0's.
>>> Furthermore it may be acceptable (or even appropriate) to stall premature
>>> accesses (because they shouldn't occur in the first place), by blocking the
>>> vCPU at that point. A middle route may be possible: Terminate accesses in,
>>> say, the first 90ms, and stall the vCPU for any access past that, but before
>>> the 100ms expired.
>>
>> Is there any real benefit to doing all this work instead of just waiting
>> for Dom0 to register the VFs? Implementing it the way you described
>> would require a relatively complex state machine and two timers per
>> SR-IOV-capable device. It would also probably require some hacks to
>> handle partially initialized VFs in Xen. This adds a lot of work and
>> many possible bugs for not a lot of benefit, in my opinion.
> 
> Odd that you ask me this question. If there was no benefit, why did you do
> it this way?
> 

Roger asked for this approach in v1, and I saw that it could be done in a 
relatively straightforward way, so I implemented it. I didn't fully 
understand what the benefits were, but I assumed there were some, and the 
effort seemed small enough to just do it since the maintainers were asking 
for it.

Now, with the possibility of redoing everything *again* and making it much 
more complex, I have started to seriously question whether it is really 
needed. So I would like to know your and others' opinions on registering 
VFs with Dom0.


-- 
Mykyta
