|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: Issues with the device eject path in XenVif
On 13/04/2026 15:48, Paul Durrant wrote: > On 13/04/2026 14:43, Tu Dinh wrote: >> On 13/04/2026 15:35, Paul Durrant wrote: >>> On 13/04/2026 14:22, Tu Dinh wrote: >>>> On 13/04/2026 14:41, Paul Durrant wrote: >>>>> On 09/04/2026 16:29, Tu Dinh wrote: >>>>>> Hi all, >>>>>> >>>>>> I'm currently trying to fix some lingering issues with VIF unplug, >>>>>> which >>>>>> will let me replace the MRSW lock with a simpler and faster >>>>>> implementation. >>>>>> >>>>>> Pdo->Eject/PdoRequestEject (e.g. in XenVif) is signaled by the >>>>>> FrontendEject worker thread, which watches backend/vif/DOMID/X/online >>>>>> among a few other things. I've run into several issues with this code >>>>>> path: >>>>>> >>>>>> - When removing the VIF using `xe vif-unplug force=true`, the entire >>>>>> xenstore key of the backend is removed without a chance to tear down >>>>>> the >>>>>> connection. However, the watch on BACKEND/online will be triggered >>>>>> before the watch on device/vif, which causes the PDO to be marked as >>>>>> ejected, and so goes through the QUERY_REMOVE_DEVICE/REMOVE_DEVICE >>>>>> instead of being a surprise removal. >>>>>> - In the REMOVE_DEVICE case, NDIS will wait for packets to be >>>>>> returned >>>>>> before continuing. Yet we cannot make progress because the backend >>>>>> has >>>>>> already disappeared, so the system will hang. This can be >>>>>> reproduced by >>>>>> doing an unplug with force=true while having some outbound >>>>>> traffic, but >>>>>> the timing is quite tight with the current code. >>>>>> - BACKEND/online is an internal, backend-specific value that is not >>>>>> documented in xenstore-paths or netif.h. So frontends should not use >>>>>> this value. I also find converting a VIF unplug to a query remove >>>>>> based >>>>>> on reading BACKEND/online somewhat dubious. >>>>>> >>>>>> I've considered several options for a fix, which I have documented >>>>>> below: >>>>>> >>>>>> 1. Make FrontendIsBackendOnline return a status code if BACKEND/ >>>>>> online >>>>>> doesn't exist, and treat an error to read the key as a surprise >>>>>> removal. >>>>>> - This ends up being unworkable, since QEMU will always >>>>>> first set >>>>>> BACKEND/online to 0 even if the VIF is being force-unplugged. >>>>> >>>>> I still think this is the right way to deal with force unplug. Is >>>>> there >>>>> a tell-tale you can look for to see if it is forced? (E.g. has the >>>>> frontend xenstore area completely gone?) >>>>> >>>> >>>> What I observe during a force VIF unplug is an unplug request >>>> (BACKEND/online=0 / PdoRequestEject) shortly followed by the backend >>>> being wiped out. I couldn't find any tell I could use to distinguish >>>> the >>>> force unplug case from the normal one. >>>> >>>> Maybe it can be fixed by attaching the watchdog thread's event to a >>>> watch on the backend, then (for transmitters) faking responses in the >>>> watchdog thread if we detect that the backend has disappeared. >>>> >>> >>> The PdoRequestEject is trigger off the state change watch though isn't >>> it. In the case of force does the state still change to 'closing'? I'd >>> have thought the node would be removed, in which case the state would go >>> to 'unknown' instead. >>> >> >> There's no watchdog thread waiting for BACKEND/state, and NDIS waits for >> packet return during initial handling of IRP_MN_REMOVE_DEVICE before >> xennet/xenvif is entered. So for now there's no opportunity for >> FrontendClose/FrontendWaitForBackendXenbusStateChange to be called in >> order to update the state to Unknown. >> > > Ok, that's the problem then. IIRC the eject protocol is setting online - > > 0 and then setting state -> closing. So the eject should be driven by > a watch on 'state' rather than a watch on 'online'. > Thanks, I'll give that a try. >>>>>> >>>>>> 2. Make FrontendIsBackendOnline check the backend's existence (i.e. >>>>>> reading the backend key instead of backend/online). >>>>>> - This changes the unplug order slightly, but looks like the >>>>>> cleanest >>>>>> solution. Though I'm not sure if it breaks cancelling of device >>>>>> removal >>>>>> requests. >>>>>> >>>>>> 3. Remove the eject codepath and rely on FdoScan instead. >>>>>> - This might break a few things that assume the presence of >>>>>> this >>>>>> codepath. >>>>>> >>>>>> I'd be glad to hear your opinions on this matter. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> >>>>>> -- >>>>>> Ngoc Tu Dinh | Vates XCP-ng Developer >>>>>> >>>>>> XCP-ng & Xen Orchestra - Vates solutions >>>>>> >>>>>> web: https://vates.tech >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> -- >>>> Ngoc Tu Dinh | Vates XCP-ng Developer >>>> >>>> XCP-ng & Xen Orchestra - Vates solutions >>>> >>>> web: https://vates.tech >>>> >>>> >>> >>> >> >> >> >> -- >> Ngoc Tu Dinh | Vates XCP-ng Developer >> >> XCP-ng & Xen Orchestra - Vates solutions >> >> web: https://vates.tech >> >> > > -- Ngoc Tu Dinh | Vates XCP-ng Developer XCP-ng & Xen Orchestra - Vates solutions web: https://vates.tech
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |