
Re: race condition when re-connecting vif after backend died



On Wed, Oct 08, 2025 at 04:04:58PM +0200, Marek Marczykowski-Górecki wrote:
> On Wed, Oct 08, 2025 at 02:32:02PM +0200, Jürgen Groß wrote:
> > On 08.10.25 13:22, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > > 
> > > I have the following scenario:
> > > 1. Start backend domain (call it netvm1)
> > > 2. Start frontend domain (call it vm1), with
> > > vif=['backend=netvm2,mac=00:16:3e:5e:6c:00,script=vif-route-qubes,ip=10.138.17.244']
> > > 3. Pause vm1 (not strictly required, but makes reproducing much easier)
> > > 4. Crash/shutdown/destroy netvm1
> > > 5. Start another backend domain (call it netvm2)
> > > 6. In quick succession:
> > >     6.1. unpause vm1
> > >     6.2. detach (or actually clean up) the vif from vm1 (connected to the
> > >          now-dead netvm1)
> > >     6.3. attach a similar vif with backend=netvm2
> 
> The way it's described above, it's tricky to reproduce (1/20 or even
> less often). But if I move the unpause after 6.3, it happens reliably.
> I hope it's not too different a scenario...
> 
> > > Sometimes it ends up with eth0 being present in vm1, but its xenstore
> > > state key is still XenbusStateInitialising, and the backend state is at
> > > XenbusStateInitWait.
> > > In step 6.2, libxl normally waits for the backend to transition to
> > > XenbusStateClosed, and IIUC the backend waits for the frontend to do the
> > > same. But when the backend is gone, libxl seems to simply remove the
> > > frontend xenstore entries without any coordination with the frontend
> > > domain itself.
> > > What I suspect happens is that the xenstore events generated at 6.2 are
> > > handled by the frontend's kernel only after 6.3. At that stage, the
> > > frontend sees a device that was in XenbusStateConnected transition to
> > > XenbusStateInitialising (the frontend doesn't really expect somebody
> > > else to change its state key) and (I guess) doesn't notice that the
> > > device vanished for a moment (xenbus_dev_changed() doesn't hit the
> > > !exists path). I haven't verified it, but I guess it also doesn't notice
> > > the backend path change, so it's still watching the old one (gone at
> > > this point).
> > > 
> > > If my diagnosis is correct, what should the solution be here? Add
> > > handling for XenbusStateUnknown in xen-netfront.c:netback_changed()? If
> > > so, it should probably carefully clean up the old device while not
> > > touching the xenstore entries (which already belong to the new instance)
> > > and then re-initialize the device (a xennet_connect() call?).
> > > Or maybe it should be done in a generic way in xenbus_probe.c, in
> > > xenbus_dev_changed()? Not sure how exactly - maybe by checking whether
> > > the backend path (or just backend-id?) changed? And then calling
> > > device_unregister() (again, being careful not to touch xenstore,
> > > especially not to set XenbusStateClosed) followed by xenbus_probe_node()?
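> > > 
> > > To make the second option a bit more concrete, here is a rough, untested
> > > sketch of what I mean - something xenbus_dev_changed() could perhaps call
> > > for an already-known node. The helper name is made up, and I'm
> > > hand-waving over how to keep the teardown from writing the state key
> > > (the normal remove path may still touch it):
> > > 
> > > static void reprobe_if_backend_changed(struct xen_bus_type *bus,
> > >                                        struct xenbus_device *dev)
> > > {
> > >         char *backend, *nodename = NULL, *type = NULL;
> > > 
> > >         /* What does the frontend node point at right now? */
> > >         backend = xenbus_read(XBT_NIL, dev->nodename, "backend", NULL);
> > >         if (IS_ERR(backend))
> > >                 return;
> > > 
> > >         if (strcmp(backend, dev->otherend)) {
> > >                 /* Keep copies - unregistering may free the device. */
> > >                 nodename = kstrdup(dev->nodename, GFP_KERNEL);
> > >                 type = kstrdup(dev->devicetype, GFP_KERNEL);
> > > 
> > >                 /*
> > >                  * The device we have is still bound to the old (gone)
> > >                  * backend, while the xenstore nodes already describe the
> > >                  * new vif: drop the stale device, then probe the node
> > >                  * again so the frontend re-initializes against the new
> > >                  * backend.
> > >                  */
> > >                 device_unregister(&dev->dev);
> > >                 if (nodename && type)
> > >                         xenbus_probe_node(bus, type, nodename);
> > >         }
> > > 
> > >         kfree(nodename);
> > >         kfree(type);
> > >         kfree(backend);
> > > }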
> > > 
> > 
> > I think we need to know what is going on here.
> > 
> > Can you repeat the test with Xenstore tracing enabled? Just do:
> > 
> > xenstore-control logfile /tmp/xs-trace
> > 
> > before point 3. in your list above and then perform steps 3. - 6.3. and
> > then send the logfile. Please make sure not to have any additional actions
> > causing Xenstore traffic in between, as this would make it much harder to
> > analyze the log.
> 
> I can't completely avoid other xenstore activity, but I tried to reduce
> it as much as possible...
> 
> I'm attaching the reproduction script, its output, and the xenstore
> traces. Note I split the xenstore trace into two parts, hopefully making
> it easier to analyze.

Ok, I think I managed to fix it. There were two cases: the frontend
overriding the state key of an already re-connected device, and the
frontend re-creating the state key of a device forcefully removed by the
toolstack. I'll post the patch in a moment.
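
To give an idea of the shape of the fix before the patch itself: in both
cases the frontend needs to notice, before writing its state key, that the
frontend node in xenstore no longer belongs to it. A rough, untested sketch
of the kind of check I mean (the helper name is made up, and whether it
ends up in xenbus_switch_state() or in the driver may well differ in the
actual patch):

/* Made-up helper: does the frontend node in xenstore still belong to
 * this device instance?
 */
static bool frontend_node_is_ours(struct xenbus_device *dev)
{
        char *backend;
        bool ours;

        backend = xenbus_read(XBT_NIL, dev->nodename, "backend", NULL);
        if (IS_ERR(backend))
                return false;   /* node gone: toolstack removed the device */

        /* A node re-created for a new vif points at a different backend. */
        ours = !strcmp(backend, dev->otherend);
        kfree(backend);
        return ours;
}

Skipping the state write when that returns false would cover both cases
above, but again, the actual patch may do it differently.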

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
