
Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Mykyta Poturai <Mykyta_Poturai@xxxxxxxx>
  • Date: Wed, 1 Apr 2026 14:07:15 +0000
  • Accept-language: en-US
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 01 Apr 2026 14:07:23 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


On 4/1/26 11:21, Jan Beulich wrote:
> On 01.04.2026 09:59, Mykyta Poturai wrote:
>> On 3/31/26 17:55, Jan Beulich wrote:
>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>    
>>>>    bool vpci_process_pending(struct vcpu *v)
>>>>    {
>>>> -    const struct pci_dev *pdev = v->vpci.pdev;
>>>> -    struct vpci_header *header = NULL;
>>>> -    unsigned int i;
>>>> -
>>>> -    if ( !pdev )
>>>> -        return false;
>>>> -
>>>> -    read_lock(&v->domain->pci_lock);
>>>> -
>>>> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>> +    switch ( v->vpci.task )
>>>>        {
>>>> -        v->vpci.pdev = NULL;
>>>> -        read_unlock(&v->domain->pci_lock);
>>>> -        return false;
>>>> -    }
>>>> -
>>>> -    header = &pdev->vpci->header;
>>>> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> +    case MODIFY_MEMORY:
>>>>        {
>>>> -        struct vpci_bar *bar = &header->bars[i];
>>>> -        struct rangeset *mem = v->vpci.bar_mem[i];
>>>> -        struct map_data data = {
>>>> -            .d = v->domain,
>>>> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>> -            .bar = bar,
>>>> -        };
>>>> -        int rc;
>>>> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
>>>> +        struct vpci_header *header = NULL;
>>>> +        unsigned int i;
>>>>    
>>>> -        if ( rangeset_is_empty(mem) )
>>>> -            continue;
>>>> +        if ( !pdev )
>>>> +            break;
>>>>    
>>>> -        rc = rangeset_consume_ranges(mem, map_range, &data);
>>>> +        read_lock(&v->domain->pci_lock);
>>>>    
>>>> -        if ( rc == -ERESTART )
>>>> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>            {
>>>> +            v->vpci.memory.pdev = NULL;
>>>>                read_unlock(&v->domain->pci_lock);
>>>> -            return true;
>>>> +            break;
>>>>            }
>>>>    
>>>> -        if ( rc )
>>>> +        header = &pdev->vpci->header;
>>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>            {
>>>> -            spin_lock(&pdev->vpci->lock);
>>>> -            /* Disable memory decoding unconditionally on failure. */
>>>> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>>>> -                            false);
>>>> -            spin_unlock(&pdev->vpci->lock);
>>>> +            struct vpci_bar *bar = &header->bars[i];
>>>> +            struct rangeset *mem = v->vpci.bar_mem[i];
>>>> +            struct map_data data = {
>>>> +                .d = v->domain,
>>>> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>>>> +                .bar = bar,
>>>> +            };
>>>> +            int rc;
>>>> +
>>>> +            if ( rangeset_is_empty(mem) )
>>>> +                continue;
>>>>    
>>>> -            /* Clean all the rangesets */
>>>> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>> -                     rangeset_purge(v->vpci.bar_mem[i]);
>>>> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>    
>>>> -            v->vpci.pdev = NULL;
>>>> +            if ( rc == -ERESTART )
>>>> +            {
>>>> +                read_unlock(&v->domain->pci_lock);
>>>> +                return true;
>>>> +            }
>>>>    
>>>> -            read_unlock(&v->domain->pci_lock);
>>>> +            if ( rc )
>>>> +            {
>>>> +                spin_lock(&pdev->vpci->lock);
>>>> +                /* Disable memory decoding unconditionally on failure. */
>>>> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>>>> +                                false);
>>>> +                spin_unlock(&pdev->vpci->lock);
>>>> +
>>>> +                /* Clean all the rangesets */
>>>> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>> +                        rangeset_purge(v->vpci.bar_mem[i]);
>>>> +
>>>> +                v->vpci.memory.pdev = NULL;
>>>> +
>>>> +                read_unlock(&v->domain->pci_lock);
>>>>    
>>>> -            if ( !is_hardware_domain(v->domain) )
>>>> -                domain_crash(v->domain);
>>>> +                if ( !is_hardware_domain(v->domain) )
>>>> +                    domain_crash(v->domain);
>>>>    
>>>> -            return false;
>>>> +                break;
>>>> +            }
>>>>            }
>>>> -    }
>>>> -    v->vpci.pdev = NULL;
>>>> +        v->vpci.memory.pdev = NULL;
>>>>    
>>>> -    spin_lock(&pdev->vpci->lock);
>>>> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>>>> -    spin_unlock(&pdev->vpci->lock);
>>>> +        spin_lock(&pdev->vpci->lock);
>>>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>>>> +        spin_unlock(&pdev->vpci->lock);
>>>>    
>>>> -    read_unlock(&v->domain->pci_lock);
>>>> +        read_unlock(&v->domain->pci_lock);
>>>> +
>>>> +        break;
>>>> +    }
>>>> +    case WAIT:
>>>> +        if ( NOW() < v->vpci.wait.end )
>>>> +            return true;
>>>> +        v->vpci.wait.callback(v->vpci.wait.data);
>>>> +        break;
>>>
>>> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
>>> This is even more so when the waiting exceeds the typical length of a
>>> scheduling timeslice.
>>>
>>> In that other reply I said to put the vCPU to sleep, but you need to be
>>> careful there too: The domain may not expect its vCPU to not make any
>>> progress for such an extended period of time. This may need doing entirely
>>> differently: Once the command register was written, you may want to record
>>> the time after which accesses to the VF registers are permitted. Earlier
>>> accesses would simply be terminated. You may still additionally need a
>>> timer, in order to kick off BAR mapping after that time. (Yet better would
>>> be if the BAR mapping could be done during those 100ms. After all that may
>>> be a reason why this long a delay is specified: Firmware on the device may
>>> also require some time to set up the BARs accordingly.)
>>
>> I am not sure it would work that way. If we look at how Linux
>> initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms, and
>> then expects the VFs to be operational. If they are not operational at
>> that moment, it considers the operation failed and removes all VFs. If
>> we also wait 100ms before enabling access, the probability of a guest
>> trying to access something before we allow it would be very high.
> 
> Well, not really. Our counting of the 100ms necessarily starts before Dom0's.
> Furthermore it may be acceptable (or even appropriate) to stall premature
> accesses (because they shouldn't occur in the first place), by blocking the
> vCPU at that point. A middle route may be possible: Terminate accesses in,
> say, the first 90ms, and stall the vCPU for any access past that, but before
> the 100ms expired.
> 
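The middle route described above could be reduced to a pure policy decision keyed off the time the VF Enable bit was written. A minimal sketch (the enum, helper name and the 90ms/100ms split are illustrative assumptions, not existing Xen code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Disposition of a guest access to VF registers, relative to the time
 * the VF Enable (VFE) bit was written.  Hypothetical helper: names and
 * thresholds are assumptions for illustration only.
 */
typedef enum {
    ACCESS_TERMINATE,  /* premature access: terminate it outright */
    ACCESS_STALL,      /* nearly ready: block the vCPU until 100ms elapse */
    ACCESS_ALLOW,      /* 100ms elapsed: VFs expected to be operational */
} vf_access_t;

static vf_access_t vf_access_policy(uint64_t now_ms, uint64_t vfe_write_ms)
{
    uint64_t elapsed = now_ms - vfe_write_ms;

    if ( elapsed >= 100 )
        return ACCESS_ALLOW;
    if ( elapsed >= 90 )
        return ACCESS_STALL;
    return ACCESS_TERMINATE;
}
```

A separate timer firing at the 100ms mark would then kick off the deferred BAR mapping independently of any guest access.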

Is there any real benefit to doing all this work instead of just waiting 
for Dom0 to register the VFs? Implementing it the way you described 
would require a relatively complex state machine and two timers per 
SR-IOV-capable device. It would also probably require some hacks to 
handle partially initialized VFs in Xen. In my opinion, that adds a lot 
of work and many possible bugs for not much benefit.

-- 
Mykyta

 

