
Issues with the device eject path in XenVif


  • To: win-pv-devel <win-pv-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: "Tu Dinh" <ngoc-tu.dinh@xxxxxxxxxx>
  • Date: Thu, 09 Apr 2026 15:29:09 +0000
  • Cc: "Owen Smith" <owen.smith@xxxxxxxxxx>
  • Delivery-date: Thu, 09 Apr 2026 15:29:14 +0000
  • List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>

Hi all,

I'm currently trying to fix some lingering issues with VIF unplug, which 
will let me replace the MRSW lock with a simpler and faster implementation.

Pdo->Eject/PdoRequestEject (e.g. in XenVif) is signaled by the 
FrontendEject worker thread, which watches backend/vif/DOMID/X/online 
among a few other things. I've run into several issues with this code path:

- When removing the VIF using `xe vif-unplug force=true`, the entire 
xenstore key of the backend is removed without a chance to tear down the 
connection. However, the watch on BACKEND/online is triggered before 
the watch on device/vif, so the PDO is marked as ejected and goes 
through QUERY_REMOVE_DEVICE/REMOVE_DEVICE instead of being handled as a 
surprise removal.
- In the REMOVE_DEVICE case, NDIS will wait for packets to be returned 
before continuing. Yet we cannot make progress because the backend has 
already disappeared, so the system will hang. This can be reproduced by 
doing an unplug with force=true while having some outbound traffic, but 
the timing is quite tight with the current code.
- BACKEND/online is an internal, backend-specific value that is not 
documented in xenstore-paths or netif.h, so frontends should not rely 
on it. I also find converting a VIF unplug into a query remove based 
on reading BACKEND/online somewhat dubious.

I've considered several options for a fix, which I have documented below:

1. Make FrontendIsBackendOnline return a status code if BACKEND/online 
doesn't exist, and treat a failure to read the key as a surprise removal.
   - This ends up being unworkable, since QEMU will always first set 
BACKEND/online to 0 even if the VIF is being force-unplugged.

2. Make FrontendIsBackendOnline check the backend's existence (i.e. 
reading the backend key instead of backend/online).
   - This changes the unplug order slightly, but looks like the cleanest 
solution, though I'm not sure whether it breaks cancelling of device 
removal requests.

3. Remove the eject codepath and rely on FdoScan instead.
   - This might break a few things that assume the presence of this 
codepath.
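
To make the difference between options 1 and 2 more concrete, here is a 
minimal sketch against a hypothetical in-memory xenstore. The names 
mock_store, mock_read, option1_read_online and option2_backend_exists are 
made up for illustration; this is not the XenVif code or the XENBUS_STORE 
interface, and the status codes are plain ints rather than NTSTATUS.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for xenstore: a flat list of path/value pairs. */
struct mock_entry { const char *path; const char *value; };
struct mock_store { const struct mock_entry *entries; size_t count; };

/* Returns the value stored at 'path', or NULL if the key does not exist. */
static const char *mock_read(const struct mock_store *store, const char *path)
{
    for (size_t i = 0; i < store->count; i++) {
        if (strcmp(store->entries[i].path, path) == 0)
            return store->entries[i].value;
    }
    return NULL;
}

/*
 * Option 1 (sketch): report a status code so the caller can tell
 * "BACKEND/online is missing" apart from "BACKEND/online is 0".
 * Returns 0 and fills *online on success, -1 if the key cannot be read.
 */
static int option1_read_online(const struct mock_store *store,
                               const char *backend, bool *online)
{
    char path[256];
    int len = snprintf(path, sizeof(path), "%s/online", backend);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;

    const char *value = mock_read(store, path);
    if (value == NULL)
        return -1;

    *online = strtol(value, NULL, 10) != 0;
    return 0;
}

/*
 * Option 2 (sketch): ignore BACKEND/online entirely and only check
 * whether the backend key itself still exists.
 */
static bool option2_backend_exists(const struct mock_store *store,
                                   const char *backend)
{
    return mock_read(store, backend) != NULL;
}
```

With a store that still contains the backend key and has online set to 
"0", option1_read_online succeeds and reports offline, which is the same 
answer an orderly unplug would give; that matches the problem noted under 
option 1. option2_backend_exists keeps returning true in that state and 
only flips once the backend key itself is gone.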

I'd be glad to hear your opinions on this matter.

Thanks,


--
Ngoc Tu Dinh | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




 

