[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [BUG] xl devd segmentation fault on xl block-detach



On Thu, May 04, 2017 at 10:51:01AM +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [Xen-devel] [BUG] xl devd segmentation fault on xl 
> block-detach"):
> > Can you give the following patch a try? This applies to 4.8.
> > 
> > Not sure if there is a better way to fix it though. Ian and Roger?
> 
> I find the logic here rather awkward.  I do remember reviewing it and
> becoming a bit confused at the time and it seems that even though I
> eventually convinced myself it was OK, I was wrong.
> 
> > From: Wei Liu <wei.liu2@xxxxxxxxxx>
> > Date: Wed, 3 May 2017 17:55:42 +0100
> > Subject: [PATCH] libxl: fix backend_watch_callback
> > 
> > That function needs to cope with spurious events. The original "skip"
> > path blindly freed dguest even when it needed to stay in ddomain list.
> > 
> > Free dguest iff it is newly added to the list. That way we don't free
> > the one that should stay on the list and we don't unnecessarily add a
> > stale dguest entry to ddomain list.
> 
> AFAICT right now you are right.  But I see another possible way of
> fixing it:
> 
> How about moving the num_devs == 0 check, and associated cleanup, to
> the exit path ?  That way a if new guest struct is spuriously
> allocated, it will automatically be freed.  It would mean that the
> freeing of dguest would depend only on other invariants already in the
> code, rather than on explicit tracking.
> 
> The invariants are, I think:
> 
>  * Any libxl__ddomain_device is either
>      * on some list libxl__ddomain_guest->devices
>      * being processed for removal, and referenced by a device
>        remove async call initiated by remove_device and which will
>        call device_complete() when done
>    but not both!
> 
>  * Any libxl__domain_guest is on the list libxl__ddomain->guests.
> 
>  The above apply even within any function, except very briefly when
>  transitioning from one state to another (eg creation, destruction).
> 
>  * SUM(libxl__domain_guest->num_*) != 0, when we return from the
>    outermost callback.  (Ie, there are no leftover empty guest
>    structs.)

Yes, that seems better so that there's no code duplication. Will send a patch
shortly.

> 
> Thinking about this like this, and observing the control flow, leads
> me to think I have found another bug.
> 
> Consider what happens if a device is removed while it is still being
> added.  That is, an event comes in which causes us to call add_device.
> add_device sets up the callback and starts doing work (eg hotplug
> scripts).  Before that finishes, the device is removed again.
> backend_watch_callback will tear the device down and free dev.
> 
> But dev is still referenced by the add_device operation, and when it
> completes, device_complete will call
>   libxl__device_backend_path(gc, aodev->dev)
> 
> There ought to be a (perhaps implicit) invariant that
>  * Any dev referenced by an aodev call is legit

Right, maybe an easier solution would be to not pass the stored libxl__device
to the async functions, and instead copy it to a temporary one that's GC'ed
afterwards. AFAICT the async operations only rely on the libxl__device, so
passing a device tracked by the GC should solve this without refcounting, will
send a patch for this also.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.