Re: Issues with the device eject path in XenVif
- To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx
- From: Paul Durrant <xadimgnik@xxxxxxxxx>
- Date: Mon, 13 Apr 2026 13:41:01 +0100
- Authentication-results: eu.smtp.expurgate.cloud; dkim=pass header.s=20251104 header.d=gmail.com header.i="@gmail.com" header.h="Content-Transfer-Encoding:In-Reply-To:Content-Language:References:To:Subject:User-Agent:MIME-Version:Date:Message-ID:From"
- Delivery-date: Mon, 13 Apr 2026 12:41:08 +0000
- List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>
On 09/04/2026 16:29, Tu Dinh wrote:
Hi all,
I'm currently trying to fix some lingering issues with VIF unplug, which
will let me replace the MRSW lock with a simpler and faster implementation.
Pdo->Eject/PdoRequestEject (e.g. in XenVif) is signaled by the
FrontendEject worker thread, which watches backend/vif/DOMID/X/online
among a few other things. I've run into several issues with this code path:
- When removing the VIF using `xe vif-unplug force=true`, the entire
xenstore key of the backend is removed without a chance to tear down the
connection. However, the watch on BACKEND/online will be triggered
before the watch on device/vif, which causes the PDO to be marked as
ejected, so the removal goes through the QUERY_REMOVE_DEVICE/REMOVE_DEVICE
path instead of being handled as a surprise removal.
- In the REMOVE_DEVICE case, NDIS will wait for packets to be returned
before continuing. Yet we cannot make progress because the backend has
already disappeared, so the system will hang. This can be reproduced by
doing an unplug with force=true while having some outbound traffic, but
the timing is quite tight with the current code.
- BACKEND/online is an internal, backend-specific value that is not
documented in xenstore-paths or netif.h. So frontends should not use
this value. I also find converting a VIF unplug to a query remove based
on reading BACKEND/online somewhat dubious.
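To make the ordering problem in the first point concrete, here is a minimal C model. It is not XenVif code: the enum and function names are hypothetical, and it assumes the simplification that whichever watch fires first fixes the removal path for the PDO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical removal paths a PDO can end up on. */
typedef enum {
    PATH_NONE,            /* no relevant watch has fired */
    PATH_ORDERLY_REMOVE,  /* QUERY_REMOVE_DEVICE / REMOVE_DEVICE */
    PATH_SURPRISE_REMOVE  /* surprise removal */
} removal_path;

/* Hypothetical watch events, named after the xenstore areas above. */
typedef enum {
    WATCH_BACKEND_ONLINE, /* watch on BACKEND/online fired */
    WATCH_DEVICE_VIF      /* watch on device/vif fired */
} watch_event;

/*
 * Assumption: the first watch to fire decides the path, and later
 * events cannot change it once the PDO has been marked as ejected.
 */
removal_path classify_removal(const watch_event *events, size_t count)
{
    assert(events != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (events[i] == WATCH_BACKEND_ONLINE)
            return PATH_ORDERLY_REMOVE;
        if (events[i] == WATCH_DEVICE_VIF)
            return PATH_SURPRISE_REMOVE;
    }
    return PATH_NONE;
}
```

Under this model, the forced unplug described above delivers WATCH_BACKEND_ONLINE first and therefore lands on the orderly path, even though the backend is already gone.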
I've considered several options for a fix, which I have documented below:
1. Make FrontendIsBackendOnline return a status code if BACKEND/online
doesn't exist, and treat a failure to read the key as a surprise removal.
- This ends up being unworkable, since QEMU will always first set
BACKEND/online to 0 even if the VIF is being force-unplugged.
I still think this is the right way to deal with force unplug. Is there
a tell-tale you can look for to see if it is forced? (E.g. has the
frontend xenstore area completely gone?)
2. Make FrontendIsBackendOnline check the backend's existence (i.e.
reading the backend key instead of backend/online).
- This changes the unplug order slightly, but looks like the cleanest
solution, though I'm not sure whether it breaks cancelling of device
removal requests.
3. Remove the eject codepath and rely on FdoScan instead.
- This might break a few things that assume the presence of this
codepath.
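As a rough illustration of how options 1 and 2 differ, the sketch below runs both checks against a toy in-memory store. Everything here is hypothetical: the store type, the function names, and the shortened paths "backend" and "backend/online" stand in for the real backend area, and none of it uses the XENBUS_STORE interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a xenstore area: a flat list of path/value pairs. */
typedef struct {
    const char *path;
    const char *value;
} store_entry;

typedef enum { BACKEND_ONLINE, BACKEND_OFFLINE, BACKEND_GONE } backend_state;

/* Returns the value stored at path, or NULL if the key does not exist. */
const char *store_read(const store_entry *store, size_t count, const char *path)
{
    assert(path != NULL);

    for (size_t i = 0; i < count; i++) {
        if (strcmp(store[i].path, path) == 0)
            return store[i].value;
    }
    return NULL;
}

/* Option 1: a missing "online" key means the backend is gone. */
backend_state check_option1(const store_entry *store, size_t count)
{
    const char *online = store_read(store, count, "backend/online");

    if (online == NULL)
        return BACKEND_GONE;
    return strtol(online, NULL, 10) != 0 ? BACKEND_ONLINE : BACKEND_OFFLINE;
}

/* Option 2: only the existence of the backend key itself is checked. */
backend_state check_option2(const store_entry *store, size_t count)
{
    return store_read(store, count, "backend") != NULL
        ? BACKEND_ONLINE
        : BACKEND_GONE;
}
```

With a store holding "backend" and "backend/online" = "0", which is the intermediate state described for QEMU under option 1, check_option1 reports BACKEND_OFFLINE and so cannot tell a forced unplug from an orderly one, while check_option2 keeps reporting BACKEND_ONLINE until the backend key itself disappears.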
I'd be glad to hear your opinions on this matter.
Thanks,
--
Ngoc Tu Dinh | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech