
Re: Issues with the device eject path in XenVif


  • To: "Paul Durrant" <xadimgnik@xxxxxxxxx>, win-pv-devel@xxxxxxxxxxxxxxxxxxxx
  • From: "Tu Dinh" <ngoc-tu.dinh@xxxxxxxxxx>
  • Date: Mon, 13 Apr 2026 13:43:49 +0000
  • Delivery-date: Mon, 13 Apr 2026 13:43:56 +0000
  • List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>

On 13/04/2026 15:35, Paul Durrant wrote:
> On 13/04/2026 14:22, Tu Dinh wrote:
>> On 13/04/2026 14:41, Paul Durrant wrote:
>>> On 09/04/2026 16:29, Tu Dinh wrote:
>>>> Hi all,
>>>>
>>>> I'm currently trying to fix some lingering issues with VIF unplug, which
>>>> will let me replace the MRSW lock with a simpler and faster
>>>> implementation.
>>>>
>>>> Pdo->Eject/PdoRequestEject (e.g. in XenVif) is signaled by the
>>>> FrontendEject worker thread, which watches backend/vif/DOMID/X/online
>>>> among a few other things. I've run into several issues with this code
>>>> path:
>>>>
>>>> - When removing the VIF using `xe vif-unplug force=true`, the entire
>>>> xenstore key of the backend is removed without a chance to tear down the
>>>> connection. However, the watch on BACKEND/online is triggered before
>>>> the watch on device/vif, which causes the PDO to be marked as ejected,
>>>> so it goes through the QUERY_REMOVE_DEVICE/REMOVE_DEVICE path instead
>>>> of being handled as a surprise removal.
>>>> - In the REMOVE_DEVICE case, NDIS will wait for packets to be returned
>>>> before continuing. Yet we cannot make progress because the backend has
>>>> already disappeared, so the system will hang. This can be reproduced by
>>>> doing an unplug with force=true while having some outbound traffic, but
>>>> the timing is quite tight with the current code.
>>>> - BACKEND/online is an internal, backend-specific value that is not
>>>> documented in xenstore-paths or netif.h. So frontends should not use
>>>> this value. I also find converting a VIF unplug to a query remove based
>>>> on reading BACKEND/online somewhat dubious.
>>>>
>>>> I've considered several options for a fix, which I have documented below:
>>>>
>>>> 1. Make FrontendIsBackendOnline return a status code if BACKEND/online
>>>> doesn't exist, and treat a failure to read the key as a surprise removal.
>>>>      - This ends up being unworkable, since QEMU will always first set
>>>> BACKEND/online to 0 even if the VIF is being force-unplugged.
>>>
>>> I still think this is the right way to deal with force unplug. Is there
>>> a tell-tale you can look for to see if it is forced? (E.g. has the
>>> frontend xenstore area completely gone?)
>>>
>>
>> What I observe during a force VIF unplug is an unplug request
>> (BACKEND/online=0 / PdoRequestEject) shortly followed by the backend
>> being wiped out. I couldn't find any tell I could use to distinguish the
>> force unplug case from the normal one.
>>
>> Maybe it can be fixed by attaching the watchdog thread's event to a
>> watch on the backend, then (for transmitters) faking responses in the
>> watchdog thread if we detect that the backend has disappeared.
>>
>
> The PdoRequestEject is triggered off the state change watch though, isn't
> it? In the case of force, does the state still change to 'closing'? I'd
> have thought the node would be removed, in which case the state would go
> to 'unknown' instead.
>

There's no watchdog thread waiting for BACKEND/state, and NDIS waits for
packet return during initial handling of IRP_MN_REMOVE_DEVICE before
xennet/xenvif is entered. So for now there's no opportunity for
FrontendClose/FrontendWaitForBackendXenbusStateChange to be called in
order to update the state to Unknown.
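
To make the decision under discussion concrete, here is a minimal
standalone sketch of how a frontend could classify a removal from what it
sees in xenstore when a watch fires. This is not XenVif code:
`backend_snapshot`, `classify_removal` and the enum names are hypothetical,
and the sketch assumes the frontend can observe both whether the backend
key still exists and the current value of BACKEND/online. Treating an
unreadable BACKEND/online on a still-present backend as "no action" is also
an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical snapshot of what the frontend sees when a watch fires. */
struct backend_snapshot {
    bool        backend_exists; /* backend key still present in xenstore */
    const char *online;         /* value of BACKEND/online, NULL if unreadable */
};

enum removal_kind {
    REMOVAL_NONE,     /* nothing to do */
    REMOVAL_EJECT,    /* orderly unplug: request eject (query remove path) */
    REMOVAL_SURPRISE  /* backend gone: treat as surprise removal */
};

/* Classify a removal from a single snapshot (illustrative only). */
static enum removal_kind
classify_removal(const struct backend_snapshot *snap)
{
    assert(snap != NULL);

    /* Backend wiped out: nobody is left to complete our requests. */
    if (!snap->backend_exists)
        return REMOVAL_SURPRISE;

    /* Assumption: an unreadable key on a live backend means no action. */
    if (snap->online == NULL)
        return REMOVAL_NONE;

    /* Backend present and asking us to go away: orderly eject. */
    if (strcmp(snap->online, "0") == 0)
        return REMOVAL_EJECT;

    return REMOVAL_NONE;
}
```

Note that a forced unplug still presents the `online == "0"` snapshot first
and the missing-backend snapshot only afterwards, so a classification like
this yields an eject on the first event. It only helps if the removal can
be reclassified once the backend is seen to have disappeared.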

>>>>
>>>> 2. Make FrontendIsBackendOnline check the backend's existence (i.e.
>>>> reading the backend key instead of backend/online).
>>>>      - This changes the unplug order slightly, but looks like the cleanest
>>>> solution. Though I'm not sure if it breaks cancelling of device removal
>>>> requests.
>>>>
>>>> 3. Remove the eject codepath and rely on FdoScan instead.
>>>>      - This might break a few things that assume the presence of this
>>>> codepath.
>>>>
>>>> I'd be glad to hear your opinions on this matter.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> --
>>>> Ngoc Tu Dinh | Vates XCP-ng Developer
>>>>
>>>> XCP-ng & Xen Orchestra - Vates solutions
>>>>
>>>> web: https://vates.tech
>>>>
>>>>



--
Ngoc Tu Dinh | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





 

