[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Issues with the device eject path in XenVif


  • To: "Paul Durrant" <xadimgnik@xxxxxxxxx>, win-pv-devel@xxxxxxxxxxxxxxxxxxxx
  • From: "Tu Dinh" <ngoc-tu.dinh@xxxxxxxxxx>
  • Date: Mon, 13 Apr 2026 13:56:50 +0000
  • Authentication-results: eu.smtp.expurgate.cloud; dkim=pass header.s=mte1 header.d=mandrillapp.com header.i="@mandrillapp.com" header.h="From:Subject:Message-Id:To:References:In-Reply-To:Feedback-ID:Date:MIME-Version:Content-Type:Content-Transfer-Encoding"; dkim=pass header.s=mte1 header.d=vates.tech header.i="ngoc-tu.dinh@xxxxxxxxxx" header.h="From:Subject:Message-Id:To:References:In-Reply-To:Feedback-ID:Date:MIME-Version:Content-Type:Content-Transfer-Encoding"
  • Delivery-date: Mon, 13 Apr 2026 13:57:00 +0000
  • Feedback-id: 30504962:30504962.20260413:md
  • List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>

On 13/04/2026 15:48, Paul Durrant wrote:
> On 13/04/2026 14:43, Tu Dinh wrote:
>> On 13/04/2026 15:35, Paul Durrant wrote:
>>> On 13/04/2026 14:22, Tu Dinh wrote:
>>>> On 13/04/2026 14:41, Paul Durrant wrote:
>>>>> On 09/04/2026 16:29, Tu Dinh wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I'm currently trying to fix some lingering issues with VIF unplug,
>>>>>> which
>>>>>> will let me replace the MRSW lock with a simpler and faster
>>>>>> implementation.
>>>>>>
>>>>>> Pdo->Eject/PdoRequestEject (e.g. in XenVif) is signaled by the
>>>>>> FrontendEject worker thread, which watches backend/vif/DOMID/X/online
>>>>>> among a few other things. I've run into several issues with this code
>>>>>> path:
>>>>>>
>>>>>> - When removing the VIF using `xe vif-unplug force=true`, the entire
>>>>>> xenstore key of the backend is removed without a chance to tear down
>>>>>> the
>>>>>> connection. However, the watch on BACKEND/online will be triggered
>>>>>> before the watch on device/vif, which causes the PDO to be marked as
>>>>>> ejected, and so goes through the QUERY_REMOVE_DEVICE/REMOVE_DEVICE
>>>>>> instead of being a surprise removal.
>>>>>> - In the REMOVE_DEVICE case, NDIS will wait for packets to be
>>>>>> returned
>>>>>> before continuing. Yet we cannot make progress because the backend
>>>>>> has
>>>>>> already disappeared, so the system will hang. This can be
>>>>>> reproduced by
>>>>>> doing an unplug with force=true while having some outbound
>>>>>> traffic, but
>>>>>> the timing is quite tight with the current code.
>>>>>> - BACKEND/online is an internal, backend-specific value that is not
>>>>>> documented in xenstore-paths or netif.h. So frontends should not use
>>>>>> this value. I also find converting a VIF unplug to a query remove
>>>>>> based
>>>>>> on reading BACKEND/online somewhat dubious.
>>>>>>
>>>>>> I've considered several options for a fix, which I have documented
>>>>>> below:
>>>>>>
>>>>>> 1. Make FrontendIsBackendOnline return a status code if BACKEND/
>>>>>> online
>>>>>> doesn't exist, and treat an error to read the key as a surprise
>>>>>> removal.
>>>>>>       - This ends up being unworkable, since QEMU will always
>>>>>> first set
>>>>>> BACKEND/online to 0 even if the VIF is being force-unplugged.
>>>>>
>>>>> I still think this is the right way to deal with force unplug. Is
>>>>> there
>>>>> a tell-tale you can look for to see if it is forced? (E.g. has the
>>>>> frontend xenstore area completely gone?)
>>>>>
>>>>
>>>> What I observe during a force VIF unplug is an unplug request
>>>> (BACKEND/online=0 / PdoRequestEject) shortly followed by the backend
>>>> being wiped out. I couldn't find any tell I could use to distinguish
>>>> the
>>>> force unplug case from the normal one.
>>>>
>>>> Maybe it can be fixed by attaching the watchdog thread's event to a
>>>> watch on the backend, then (for transmitters) faking responses in the
>>>> watchdog thread if we detect that the backend has disappeared.
>>>>
>>>
>>> The PdoRequestEject is trigger off the state change watch though isn't
>>> it. In the case of force does the state still change to 'closing'? I'd
>>> have thought the node would be removed, in which case the state would go
>>> to 'unknown' instead.
>>>
>>
>> There's no watchdog thread waiting for BACKEND/state, and NDIS waits for
>> packet return during initial handling of IRP_MN_REMOVE_DEVICE before
>> xennet/xenvif is entered. So for now there's no opportunity for
>> FrontendClose/FrontendWaitForBackendXenbusStateChange to be called in
>> order to update the state to Unknown.
>>
>
> Ok, that's the problem then. IIRC the eject protocol is setting online -
>  > 0 and then setting state -> closing. So the eject should be driven by
> a watch on 'state' rather than a watch on 'online'.
>

Thanks, I'll give that a try.

>>>>>>
>>>>>> 2. Make FrontendIsBackendOnline check the backend's existence (i.e.
>>>>>> reading the backend key instead of backend/online).
>>>>>>       - This changes the unplug order slightly, but looks like the
>>>>>> cleanest
>>>>>> solution. Though I'm not sure if it breaks cancelling of device
>>>>>> removal
>>>>>> requests.
>>>>>>
>>>>>> 3. Remove the eject codepath and rely on FdoScan instead.
>>>>>>       - This might break a few things that assume the presence of
>>>>>> this
>>>>>> codepath.
>>>>>>
>>>>>> I'd be glad to hear your opinions on this matter.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ngoc Tu Dinh | Vates XCP-ng Developer
>>>>>>
>>>>>> XCP-ng & Xen Orchestra - Vates solutions
>>>>>>
>>>>>> web: https://vates.tech
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ngoc Tu Dinh | Vates XCP-ng Developer
>>>>
>>>> XCP-ng & Xen Orchestra - Vates solutions
>>>>
>>>> web: https://vates.tech
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Ngoc Tu Dinh | Vates XCP-ng Developer
>>
>> XCP-ng & Xen Orchestra - Vates solutions
>>
>> web: https://vates.tech
>>
>>
>
>



--
Ngoc Tu Dinh | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.