
Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Mykyta Poturai <Mykyta_Poturai@xxxxxxxx>
  • Date: Wed, 1 Apr 2026 14:40:38 +0000
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 01 Apr 2026 14:59:48 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions


On 4/1/26 17:14, Jan Beulich wrote:
> On 01.04.2026 16:07, Mykyta Poturai wrote:
>> On 4/1/26 11:21, Jan Beulich wrote:
>>> On 01.04.2026 09:59, Mykyta Poturai wrote:
>>>> On 3/31/26 17:55, Jan Beulich wrote:
>>>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>>>     
>>>>>>     bool vpci_process_pending(struct vcpu *v)
>>>>>>     {
>>>>>> -    const struct pci_dev *pdev = v->vpci.pdev;
>>>>>> -    struct vpci_header *header = NULL;
>>>>>> -    unsigned int i;
>>>>>> -
>>>>>> -    if ( !pdev )
>>>>>> -        return false;
>>>>>> -
>>>>>> -    read_lock(&v->domain->pci_lock);
>>>>>> -
>>>>>> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>> +    switch ( v->vpci.task )
>>>>>>         {
>>>>>> -        v->vpci.pdev = NULL;
>>>>>> -        read_unlock(&v->domain->pci_lock);
>>>>>> -        return false;
>>>>>> -    }
>>>>>> -
>>>>>> -    header = &pdev->vpci->header;
>>>>>> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>> +    case MODIFY_MEMORY:
>>>>>>         {
>>>>>> -        struct vpci_bar *bar = &header->bars[i];
>>>>>> -        struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>> -        struct map_data data = {
>>>>>> -            .d = v->domain,
>>>>>> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>>>> -            .bar = bar,
>>>>>> -        };
>>>>>> -        int rc;
>>>>>> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
>>>>>> +        struct vpci_header *header = NULL;
>>>>>> +        unsigned int i;
>>>>>>     
>>>>>> -        if ( rangeset_is_empty(mem) )
>>>>>> -            continue;
>>>>>> +        if ( !pdev )
>>>>>> +            break;
>>>>>>     
>>>>>> -        rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>> +        read_lock(&v->domain->pci_lock);
>>>>>>     
>>>>>> -        if ( rc == -ERESTART )
>>>>>> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>>             {
>>>>>> +            v->vpci.memory.pdev = NULL;
>>>>>>                 read_unlock(&v->domain->pci_lock);
>>>>>> -            return true;
>>>>>> +            break;
>>>>>>             }
>>>>>>     
>>>>>> -        if ( rc )
>>>>>> +        header = &pdev->vpci->header;
>>>>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>             {
>>>>>> -            spin_lock(&pdev->vpci->lock);
>>>>>> -            /* Disable memory decoding unconditionally on failure. */
>>>>>> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>>>>>> -                            false);
>>>>>> -            spin_unlock(&pdev->vpci->lock);
>>>>>> +            struct vpci_bar *bar = &header->bars[i];
>>>>>> +            struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>> +            struct map_data data = {
>>>>>> +                .d = v->domain,
>>>>>> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>>>>>> +                .bar = bar,
>>>>>> +            };
>>>>>> +            int rc;
>>>>>> +
>>>>>> +            if ( rangeset_is_empty(mem) )
>>>>>> +                continue;
>>>>>>     
>>>>>> -            /* Clean all the rangesets */
>>>>>> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>> -                     rangeset_purge(v->vpci.bar_mem[i]);
>>>>>> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>>     
>>>>>> -            v->vpci.pdev = NULL;
>>>>>> +            if ( rc == -ERESTART )
>>>>>> +            {
>>>>>> +                read_unlock(&v->domain->pci_lock);
>>>>>> +                return true;
>>>>>> +            }
>>>>>>     
>>>>>> -            read_unlock(&v->domain->pci_lock);
>>>>>> +            if ( rc )
>>>>>> +            {
>>>>>> +                spin_lock(&pdev->vpci->lock);
>>>>>> +                /* Disable memory decoding unconditionally on failure. */
>>>>>> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>>>>>> +                                false);
>>>>>> +                spin_unlock(&pdev->vpci->lock);
>>>>>> +
>>>>>> +                /* Clean all the rangesets */
>>>>>> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>> +                        rangeset_purge(v->vpci.bar_mem[i]);
>>>>>> +
>>>>>> +                v->vpci.memory.pdev = NULL;
>>>>>> +
>>>>>> +                read_unlock(&v->domain->pci_lock);
>>>>>>     
>>>>>> -            if ( !is_hardware_domain(v->domain) )
>>>>>> -                domain_crash(v->domain);
>>>>>> +                if ( !is_hardware_domain(v->domain) )
>>>>>> +                    domain_crash(v->domain);
>>>>>>     
>>>>>> -            return false;
>>>>>> +                break;
>>>>>> +            }
>>>>>>             }
>>>>>> -    }
>>>>>> -    v->vpci.pdev = NULL;
>>>>>> +        v->vpci.memory.pdev = NULL;
>>>>>>     
>>>>>> -    spin_lock(&pdev->vpci->lock);
>>>>>> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>>>>>> -    spin_unlock(&pdev->vpci->lock);
>>>>>> +        spin_lock(&pdev->vpci->lock);
>>>>>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>>>>>> +        spin_unlock(&pdev->vpci->lock);
>>>>>>     
>>>>>> -    read_unlock(&v->domain->pci_lock);
>>>>>> +        read_unlock(&v->domain->pci_lock);
>>>>>> +
>>>>>> +        break;
>>>>>> +    }
>>>>>> +    case WAIT:
>>>>>> +        if ( NOW() < v->vpci.wait.end )
>>>>>> +            return true;
>>>>>> +        v->vpci.wait.callback(v->vpci.wait.data);
>>>>>> +        break;
>>>>>
>>>>> As just indicated in reply to patch 6, busy waiting isn't really 
>>>>> acceptable.
>>>>> This is even more so when the waiting exceeds the typical length of a
>>>>> scheduling timeslice.
>>>>>
>>>>> In that other reply I said to put the vCPU to sleep, but you need to be 
>>>>> careful
>>>>> there too: The domain may not expect its vCPU to not make any progress 
>>>>> for such
>>>>> an extended period of time. This may need doing entirely differently: 
>>>>> Once the
>>>>> command register was written, you may want to record the time after which
>>>>> accesses to the VF registers are permitted. Earlier accesses would simply 
>>>>> be
>>>>> terminated. You may still additionally need a timer, in order to kick off 
>>>>> BAR
>>>>> mapping after that time. (Yet better would  be if the BAR mapping could be
>>>>> done during those 100ms. After all that may be a reason why this long a 
>>>>> delay
>>>>> is specified: Firmware on the device may also require some time to set up 
>>>>> the
>>>>> BARs accordingly.)
>>>>
>>>> I am not sure it would work that way. If we look at how Linux
>>>> initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms, and then
>>>> expects the VFs to be operational. If they are not operational at that
>>>> moment, it considers the operation failed and removes all VFs. If
>>>> we also wait 100ms before enabling access, the probability of a guest
>>>> trying to access something before we allow it would be very high.
>>>
>>> Well, not really. Our counting of the 100ms necessarily starts before 
>>> Dom0's.
>>> Furthermore it may be acceptable (or even appropriate) to stall premature
>>> accesses (because they shouldn't occur in the first place), by blocking the
>>> vCPU at that point. A middle route may be possible: Terminate accesses in,
>>> say, the first 90ms, and stall the vCPU for any access past that, but before
>>> the 100ms expired.
>>
>> Is there any real benefit to doing all this work instead of just waiting
>> for Dom0 to register the VFs? Implementing it the way you described
>> would require a relatively complex state machine and two timers per
>> SR-IOV-capable device. It would also probably require some hacks to
>> handle partially initialized VFs in Xen. This adds a lot of work and
>> many possible bugs for not a lot of benefit, in my opinion.
> 
> Odd that you ask me this question. If there was no benefit, why did you do
> it this way?
> 

Roger asked for this approach in v1, and I saw that it could be done in a 
relatively straightforward way, so I implemented it. I didn't fully 
understand what the benefits were, but I assumed there were some, and the 
effort seemed small enough to just do it since the maintainers were asking 
for it.

Now, with the possibility of redoing everything *again* and making it much 
more complex, I have started to seriously question whether it is really 
needed. So I would like to know your and others' opinions on registering 
VFs with Dom0.


-- 
Mykyta
