[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] xen-netback: Check for hotplug-status existence before watching



On Tue, May 11, 2021 at 12:46:38PM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> > Sent: 11 May 2021 11:45
> > To: Durrant, Paul <pdurrant@xxxxxxxxxxxx>
> > Cc: Michael Brown <mbrown@xxxxxxxxxxxxxxxx>; paul@xxxxxxx; 
> > xen-devel@xxxxxxxxxxxxxxxxxxxx;
> > netdev@xxxxxxxxxxxxxxx; wei.liu@xxxxxxxxxx
> > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status 
> > existence before watching
> > 
> > On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
> > > > > -----Original Message-----
> > > > > From: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> > > > > Sent: 10 May 2021 20:43
> > > > > To: Michael Brown <mbrown@xxxxxxxxxxxxxxxx>; paul@xxxxxxx
> > > > > Cc: paul@xxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxxx; 
> > > > > netdev@xxxxxxxxxxxxxxx; wei.liu@xxxxxxxxxx;
> > Durrant,
> > > > > Paul <pdurrant@xxxxxxxxxxxx>
> > > > > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status 
> > > > > existence before watching
> > > > >
> > > > > On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
> > > > > > If you have a suggested patch, I'm happy to test that it doesn't 
> > > > > > reintroduce
> > > > > > the regression bug that was fixed by this commit.
> > > > >
> > > > > Actually, I've just tested with a simple reloading xen-netfront 
> > > > > module. It
> > > > > seems in this case, the hotplug script is not re-executed. In fact, I
> > > > > think it should not be re-executed at all, since the vif interface
> > > > > remains in place (it just gets NO-CARRIER flag).
> > > > >
> > > > > This brings a question, why removing hotplug-status in the first 
> > > > > place?
> > > > > The interface remains correctly configured by the hotplug script after
> > > > > all. From the commit message:
> > > > >
> > > > >     xen-netback: remove 'hotplug-status' once it has served its 
> > > > > purpose
> > > > >
> > > > >     Removing the 'hotplug-status' node in netback_remove() is wrong; 
> > > > > the script
> > > > >     may not have completed. Only remove the node once the watch has 
> > > > > fired and
> > > > >     has been unregistered.
> > > > >
> > > > > I think the intention was to remove 'hotplug-status' node _later_ in
> > > > > case of quickly adding and removing the interface. Is that right, 
> > > > > Paul?
> > > >
> > > > The removal was done to allow unbind/bind to function correctly. IIRC 
> > > > before the original patch
> > doing a bind would stall forever waiting for the hotplug status to change, 
> > which would never happen.
> > >
> > > Hmm, in that case maybe don't remove it at all in the backend, and let
> > > it be cleaned up by the toolstack, when it removes other backend-related
> > > nodes?
> > 
> > No, unbind/bind _does_ require hotplug script to be called again.
> > 
> 
> Yes, sorry I was misremembering. My memory is hazy but there was definitely a 
> problem at the time with leaving the node in place.
> 
> > When exactly vif interface appears in the system (starts to be available
> > for the hotplug script)? Maybe remove 'hotplug-status' just before that
> > point?
> > 
> 
> I really can't remember any detail. Perhaps try reverting both patches then 
> and check that the unbind/rmmod/modprobe/bind sequence still works (and the 
> backend actually makes it into connected state).

Ok, I've tried this. I've reverted both commits, then used your test
script from the 9476654bd5e8ad42abe8ee9f9e90069ff8e60c17:
    
    This has been tested by running iperf as a server in the test VM and
    then running a client against it in a continuous loop, whilst also
    running:
    
    while true;
      do echo vif-$DOMID-$VIF >unbind;
      echo down;
      rmmod xen-netback;
      echo unloaded;
      modprobe xen-netback;
      cd $(pwd);
      brctl addif xenbr0 vif$DOMID.$VIF;
      ip link set vif$DOMID.$VIF up;
      echo up;
      sleep 5;
      done
    
    in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
    unload, re-load, re-bind and re-plumb the backend.
    
In fact, the need to call `brctl` and `ip link` manually is exactly
because the hotplug script isn't executed. When I execute it manually,
the backend properly gets back to working. So, removing 'hotplug-status'
was in the correct place (netback_remove). The missing part is the toolstack
calling the hotplug script on xen-netback re-bind.

In this case, I'm not sure what is the proper way. If I restart
xendriverdomain service (I do run the backend in domU), it properly
executes hotplug script and the backend interface gets properly
configured. But it doesn't do it on its own. It seems to be related to
device "state" in xenstore. The specific part of the libxl is
backend_watch_callback():
https://github.com/xen-project/xen/blob/master/tools/libs/light/libxl_device.c#L1664

    ddev = search_for_device(dguest, dev);
    if (ddev == NULL && state == XenbusStateClosed) {
        /*
         * Spurious state change, device has already been disconnected
         * or never attached.
         */
        goto skip;
    } else if (ddev == NULL) {
        rc = add_device(egc, nested_ao, dguest, dev);
        if (rc > 0)
            free_ao = true;
    } else if (state == XenbusStateClosed && online == 0) {
        rc = remove_device(egc, nested_ao, dguest, ddev);
        if (rc > 0)
            free_ao = true;
        check_and_maybe_remove_guest(gc, ddomain, dguest);
    }

In short: if device gets XenbusStateInitWait for the first time (ddev ==
NULL case), it goes to add_device() which executes the hotplug script
and stores the device.
Then, if device goes to XenbusStateClosed + online==0 state, then it
executes hotplug script again (with "offline" parameter) and forgets the
device. If you unbind the driver, the device stays in
XenbusStateConnected state (in xenstore), and after you bind it again,
it goes to XenbusStateInitWait. It don't think it goes through
XenbusStateClosed, and online stays at 1 too, so libxl doesn't execute
the hotplug script again.

Some solution could be to add an extra case at the end, like "ddev !=
NULL && state == XenbusStateInitWait && hotplug-status != connected".
And make sure xl devd won't call the same hotplug script multiple times
for the same device _at the same time_ (I'm not sure about the async
machinery here).

But even if xl devd (aka xendriverdomain service) gets "fixes" to
execute hotplug script in that case, I don't think it would work in
backend in dom0 case - there, I think nothing watches already configured
vif interfaces (there is no xl devd daemon in dom0, and xl background
process watches only domain death and cdrom eject events). 

I'm adding toolstack maintainers, maybe they'll have some idea...

In any case, the issue is not calling the hotplug script, responsible
for configuring newly created vif interface. Not kernel waiting for it.
So, I think both commits should still be reverted.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: signature.asc
Description: PGP signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.