
Re: [PATCH] xen-netback: Check for hotplug-status existence before watching

On 17/05/2021 22:43, Marek Marczykowski-Górecki wrote:
On Tue, May 11, 2021 at 12:46:38PM +0000, Durrant, Paul wrote:
-----Original Message-----
From: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
Sent: 11 May 2021 11:45
To: Durrant, Paul <pdurrant@xxxxxxxxxxxx>
Cc: Michael Brown <mbrown@xxxxxxxxxxxxxxxx>; paul@xxxxxxx; 
netdev@xxxxxxxxxxxxxxx; wei.liu@xxxxxxxxxx
Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence 
before watching

On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote:
On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
-----Original Message-----
From: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
Sent: 10 May 2021 20:43
To: Michael Brown <mbrown@xxxxxxxxxxxxxxxx>; paul@xxxxxxx
Cc: paul@xxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; 
Paul <pdurrant@xxxxxxxxxxxx>
Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence 
before watching

On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
If you have a suggested patch, I'm happy to test that it doesn't reintroduce
the regression bug that was fixed by this commit.

Actually, I've just tested by simply reloading the xen-netfront module. It
seems in this case the hotplug script is not re-executed. In fact, I
think it should not be re-executed at all, since the vif interface
remains in place (it just gets the NO-CARRIER flag).

This raises the question: why remove 'hotplug-status' in the first place?
After all, the interface remains correctly configured by the hotplug
script. From the commit message:

     xen-netback: remove 'hotplug-status' once it has served its purpose

     Removing the 'hotplug-status' node in netback_remove() is wrong; the script
     may not have completed. Only remove the node once the watch has fired and
     has been unregistered.

I think the intention was to remove 'hotplug-status' node _later_ in
case of quickly adding and removing the interface. Is that right, Paul?

The removal was done to allow unbind/bind to function correctly. IIRC before 
the original patch
doing a bind would stall forever waiting for the hotplug status to change, 
which would never happen.

Hmm, in that case maybe don't remove it at all in the backend, and let
it be cleaned up by the toolstack when it removes the other backend-related nodes?

No, unbind/bind _does_ require hotplug script to be called again.

Yes, sorry I was misremembering. My memory is hazy but there was definitely a 
problem at the time with leaving the node in place.

When exactly does the vif interface appear in the system (i.e. become
available to the hotplug script)? Maybe remove 'hotplug-status' just before that point?

I really can't remember any detail. Perhaps try reverting both patches then and 
check that the unbind/rmmod/modprobe/bind sequence still works (and the backend 
actually makes it into connected state).

Ok, I've tried this. I've reverted both commits, then used your test
script from commit 9476654bd5e8ad42abe8ee9f9e90069ff8e60c17:
This has been tested by running iperf as a server in the test VM and
     then running a client against it in a continuous loop, whilst also
     running:

     while true;
       do echo vif-$DOMID-$VIF >unbind;
       echo down;
       rmmod xen-netback;
       echo unloaded;
       modprobe xen-netback;
       cd $(pwd);
       brctl addif xenbr0 vif$DOMID.$VIF;
       ip link set vif$DOMID.$VIF up;
       echo up;
       sleep 5;
       done

     in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
     unload, re-load, re-bind and re-plumb the backend.
In fact, the need to call `brctl` and `ip link` manually is exactly
because the hotplug script isn't executed. When I execute it manually,
the backend properly gets back to a working state. So, removing
'hotplug-status' was done in the correct place (netback_remove). The missing
part is the toolstack calling the hotplug script on xen-netback re-bind.

Why is that missing? We're going behind the back of the toolstack to do the unbind and bind so why should we expect it to re-execute a hotplug script?

In this case, I'm not sure what the proper way is. If I restart the
xendriverdomain service (I run the backend in a domU), it properly
executes the hotplug script and the backend interface gets properly
configured. But it doesn't do it on its own. It seems to be related to
the device "state" in xenstore. The relevant part of libxl is:
     ddev = search_for_device(dguest, dev);
     if (ddev == NULL && state == XenbusStateClosed) {
         /*
          * Spurious state change, device has already been disconnected
          * or never attached.
          */
         goto skip;
     } else if (ddev == NULL) {
         rc = add_device(egc, nested_ao, dguest, dev);
         if (rc > 0)
             free_ao = true;
     } else if (state == XenbusStateClosed && online == 0) {
         rc = remove_device(egc, nested_ao, dguest, ddev);
         if (rc > 0)
             free_ao = true;
         check_and_maybe_remove_guest(gc, ddomain, dguest);
     }

In short: if the device gets XenbusStateInitWait for the first time (the
ddev == NULL case), libxl goes to add_device(), which executes the
hotplug script and stores the device.
Then, if the device goes to the XenbusStateClosed + online==0 state,
libxl executes the hotplug script again (with the "offline" parameter)
and forgets the device. If you unbind the driver, the device stays in
the XenbusStateConnected state (in xenstore), and after you bind it
again, it goes to XenbusStateInitWait. I don't think it goes through
XenbusStateClosed, and online stays at 1 too, so libxl doesn't execute
the hotplug script again.

This is pretty key. The frontend should not notice an unbind/bind i.e. there should be no evidence of it happening by examining states in xenstore (from the guest side).


One solution could be to add an extra case at the end, like "ddev !=
NULL && state == XenbusStateInitWait && hotplug-status != connected".
And make sure xl devd won't call the same hotplug script multiple times
for the same device _at the same time_ (I'm not sure about the async
machinery here).

But even if xl devd (aka the xendriverdomain service) gets fixed to
execute the hotplug script in that case, I don't think it would work in
the backend-in-dom0 case - there, I think nothing watches already
configured vif interfaces (there is no xl devd daemon in dom0, and the
xl background process watches only domain death and cdrom eject events).

I'm adding toolstack maintainers, maybe they'll have some idea...

In any case, the issue is the hotplug script (responsible for
configuring the newly created vif interface) not being called, not the
kernel waiting for it. So, I think both commits should still be reverted.


