
Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Mykyta Poturai <Mykyta_Poturai@xxxxxxxx>
  • Date: Wed, 1 Apr 2026 14:07:15 +0000
  • Accept-language: en-US
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 01 Apr 2026 14:07:23 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


On 4/1/26 11:21, Jan Beulich wrote:
> On 01.04.2026 09:59, Mykyta Poturai wrote:
>> On 3/31/26 17:55, Jan Beulich wrote:
>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>    
>>>>    bool vpci_process_pending(struct vcpu *v)
>>>>    {
>>>> -    const struct pci_dev *pdev = v->vpci.pdev;
>>>> -    struct vpci_header *header = NULL;
>>>> -    unsigned int i;
>>>> -
>>>> -    if ( !pdev )
>>>> -        return false;
>>>> -
>>>> -    read_lock(&v->domain->pci_lock);
>>>> -
>>>> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>> +    switch ( v->vpci.task )
>>>>        {
>>>> -        v->vpci.pdev = NULL;
>>>> -        read_unlock(&v->domain->pci_lock);
>>>> -        return false;
>>>> -    }
>>>> -
>>>> -    header = &pdev->vpci->header;
>>>> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> +    case MODIFY_MEMORY:
>>>>        {
>>>> -        struct vpci_bar *bar = &header->bars[i];
>>>> -        struct rangeset *mem = v->vpci.bar_mem[i];
>>>> -        struct map_data data = {
>>>> -            .d = v->domain,
>>>> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>> -            .bar = bar,
>>>> -        };
>>>> -        int rc;
>>>> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
>>>> +        struct vpci_header *header = NULL;
>>>> +        unsigned int i;
>>>>    
>>>> -        if ( rangeset_is_empty(mem) )
>>>> -            continue;
>>>> +        if ( !pdev )
>>>> +            break;
>>>>    
>>>> -        rc = rangeset_consume_ranges(mem, map_range, &data);
>>>> +        read_lock(&v->domain->pci_lock);
>>>>    
>>>> -        if ( rc == -ERESTART )
>>>> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>            {
>>>> +            v->vpci.memory.pdev = NULL;
>>>>                read_unlock(&v->domain->pci_lock);
>>>> -            return true;
>>>> +            break;
>>>>            }
>>>>    
>>>> -        if ( rc )
>>>> +        header = &pdev->vpci->header;
>>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>            {
>>>> -            spin_lock(&pdev->vpci->lock);
>>>> -            /* Disable memory decoding unconditionally on failure. */
>>>> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>>>> -                            false);
>>>> -            spin_unlock(&pdev->vpci->lock);
>>>> +            struct vpci_bar *bar = &header->bars[i];
>>>> +            struct rangeset *mem = v->vpci.bar_mem[i];
>>>> +            struct map_data data = {
>>>> +                .d = v->domain,
>>>> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>>>> +                .bar = bar,
>>>> +            };
>>>> +            int rc;
>>>> +
>>>> +            if ( rangeset_is_empty(mem) )
>>>> +                continue;
>>>>    
>>>> -            /* Clean all the rangesets */
>>>> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>> -                     rangeset_purge(v->vpci.bar_mem[i]);
>>>> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>    
>>>> -            v->vpci.pdev = NULL;
>>>> +            if ( rc == -ERESTART )
>>>> +            {
>>>> +                read_unlock(&v->domain->pci_lock);
>>>> +                return true;
>>>> +            }
>>>>    
>>>> -            read_unlock(&v->domain->pci_lock);
>>>> +            if ( rc )
>>>> +            {
>>>> +                spin_lock(&pdev->vpci->lock);
>>>> +                /* Disable memory decoding unconditionally on failure. */
>>>> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>>>> +                                false);
>>>> +                spin_unlock(&pdev->vpci->lock);
>>>> +
>>>> +                /* Clean all the rangesets */
>>>> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>> +                        rangeset_purge(v->vpci.bar_mem[i]);
>>>> +
>>>> +                v->vpci.memory.pdev = NULL;
>>>> +
>>>> +                read_unlock(&v->domain->pci_lock);
>>>>    
>>>> -            if ( !is_hardware_domain(v->domain) )
>>>> -                domain_crash(v->domain);
>>>> +                if ( !is_hardware_domain(v->domain) )
>>>> +                    domain_crash(v->domain);
>>>>    
>>>> -            return false;
>>>> +                break;
>>>> +            }
>>>>            }
>>>> -    }
>>>> -    v->vpci.pdev = NULL;
>>>> +        v->vpci.memory.pdev = NULL;
>>>>    
>>>> -    spin_lock(&pdev->vpci->lock);
>>>> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>>>> -    spin_unlock(&pdev->vpci->lock);
>>>> +        spin_lock(&pdev->vpci->lock);
>>>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>>>> +        spin_unlock(&pdev->vpci->lock);
>>>>    
>>>> -    read_unlock(&v->domain->pci_lock);
>>>> +        read_unlock(&v->domain->pci_lock);
>>>> +
>>>> +        break;
>>>> +    }
>>>> +    case WAIT:
>>>> +        if ( NOW() < v->vpci.wait.end )
>>>> +            return true;
>>>> +        v->vpci.wait.callback(v->vpci.wait.data);
>>>> +        break;
>>>
>>> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
>>> This is even more so when the waiting exceeds the typical length of a
>>> scheduling timeslice.
>>>
>>> In that other reply I said to put the vCPU to sleep, but you need to be
>>> careful there too: The domain may not expect its vCPU to not make any
>>> progress for such an extended period of time. This may need doing entirely
>>> differently: Once the command register was written, you may want to record
>>> the time after which accesses to the VF registers are permitted. Earlier
>>> accesses would simply be terminated. You may still additionally need a
>>> timer, in order to kick off BAR mapping after that time. (Yet better would
>>> be if the BAR mapping could be done during those 100ms. After all that may
>>> be a reason why this long a delay is specified: Firmware on the device may
>>> also require some time to set up the BARs accordingly.)
>>
>> I am not sure it would work that way. If we look at how Linux
>> initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms, and
>> then expects the VFs to be operational. If they are not operational at
>> that moment, it considers the operation failed and removes all VFs. If
>> we also wait 100ms before enabling access, the probability of a guest
>> trying to access something before we allow it would be very high.
> 
> Well, not really. Our counting of the 100ms necessarily starts before Dom0's.
> Furthermore it may be acceptable (or even appropriate) to stall premature
> accesses (because they shouldn't occur in the first place), by blocking the
> vCPU at that point. A middle route may be possible: Terminate accesses in,
> say, the first 90ms, and stall the vCPU for any access past that, but before
> the 100ms expired.
> 
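The middle route described above could be reduced to a pure policy decision keyed off the time the VF Enable bit was written. A minimal sketch (the enum, helper name and the 90ms/100ms split are illustrative assumptions, not existing Xen code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Disposition of a guest access to VF registers, relative to the time
 * the VF Enable (VFE) bit was written.  Hypothetical helper: names and
 * thresholds are assumptions for illustration only.
 */
typedef enum {
    ACCESS_TERMINATE,  /* premature access: terminate it outright */
    ACCESS_STALL,      /* nearly ready: block the vCPU until 100ms elapse */
    ACCESS_ALLOW,      /* 100ms elapsed: VFs expected to be operational */
} vf_access_t;

static vf_access_t vf_access_policy(uint64_t now_ms, uint64_t vfe_write_ms)
{
    uint64_t elapsed = now_ms - vfe_write_ms;

    if ( elapsed >= 100 )
        return ACCESS_ALLOW;
    if ( elapsed >= 90 )
        return ACCESS_STALL;
    return ACCESS_TERMINATE;
}
```

A separate timer firing at the 100ms mark would then kick off the deferred BAR mapping independently of any guest access.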

Is there any real benefit to doing all this work instead of just waiting 
for Dom0 to register the VFs? Implementing it the way you described 
would require a relatively complex state machine and two timers per 
SR-IOV-capable device. It would also probably require some hacks to 
handle partially initialized VFs in Xen. In my opinion, that adds a lot 
of work and many possible bugs for not much benefit.

-- 
Mykyta

 

