[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Blktap fixes and kernel patch.



On Tue, 2012-07-31 at 22:41 +0100, Dr. Greg Wettstein wrote:

> After becoming far more familiar with the XEN vbd infrastructure then
> I had intended I was finally able to run down the problem with the
> blktap2 driver which has plagued 4.1.2.  Most specifically the
> orphaning of the tapdisk2 driver process and after Ian's fix for that,
> the hang and orphaning of a tapdisk minor which occurs when libxl
> attempts to shutdown the driver process.

Thanks for digging into this Greg.

> In the process I updated the blktap2 kernel driver to patch cleanly
> into the Linux 3.4 kernel.  These fixes have been validated against
> the 3.4 kernel as well as the 3.2 kernel.

Just to be clear this is just a straight forward port, there's no part
of the deadlock fix in here?

> The first patch is one which was done by Ian for the development tree
> with minor corrections for 4.1.2.  I'm including it for completeness
> for those who want a trouble free patch set for a 4.1.2 distribution.
> This patch fixes the orphaning of the tapdisk2 driver process when xl
> shuts down.

This is a fairly straight backport of a patch in unstable?

If you send a mail with a subject "Xen 4.1.x backport request
<commit-id>" explaining which commit it is and CC keir@xxxxxxx &
ian.jackson@xxxxxxxxxxxxx then we can see about getting this into a
future 4.1.x (perhaps even 4.1.3, not sure which stage of rcs we are at
there).

If the backport is reasonably trivial then there is often no need to
include it but since you have done so you might as well include the
patch for reference.

> The second patch corrects the deadlock which occurs between the
> blktap2 kernel driver and the blktap2 userspace control plane.  The
> deadlock causes a delay in the shutdown of a XEN guest and results in
> the 'orphaning' of tapdisk minor number allocations.  As seems to be
> typical with these types of things the fix was trivially straight
> forward once I finally figured out what was going on.

Thanks for this.

Am I right that the important functional change here is that the xs_rm
needs to come after we read the params node but before tap_ctl_destroy?
Obviously removing the node before calling libxl__device_destroy_tapdisk
is wrong since libxl__device_destroy_tapdisk reads from be_path!

Looking at 4.2.0-rc1 I see that libxl__device_destroy removes the
backend before calling libxl__device_destroy_tapdisk, so I think that a
fix is needed there too.

I'm less sure about the usage in libxl__initiate_device_remove. I wonder
if the call to libxl__device_destroy_tapdisk there needs to move to
device_hotplug_done right after the transaction which cleans up the
backend back?

The code which used to be in libxl__devices_destroy is now in
libxl__initiate_device_remove so I expect that fixing that would be
sufficient.

I'd really appreciate it if you could validate whether 4.2.0-rc1 works
for you or not, I suspect not. We would usually want fix the development
version before considering fixes for the stable branches (even if the
actual patch ends up looking totally different) otherwise we run the
risk of regressions in the next version.

Is there a simple command which will list the leaked tap devices? If so
we can consider adding it to the leak-check phase of the automated tests
(although I'm not sure how much use these make of blktap)

For future reference if you intend for a patch to be applied it is best
to submit it in the form described in
http://wiki.xen.org/wiki/Submitting_Xen_Patches, that is one patch per
email, with a changelog specific to that change and a Signed-off-by. In
this sort of scenario (a patch going to 4.1 which isn't a backport) the
changelog should also mention why the patch isn't a backport.

> Ian for your reference the following change which you introduced to
> address this issue:
> 
> 79e3dbe4b659e78408a9eea76c51a601bd4a383a
> tapdisk: respond to destroy request before tearing down the commuication 
> channel
> 
> Is not needed and does not provide formally correct behavior in the
> presence of the two patches noted above.

Is it incorrect (i.e. should be reverted) or is it just incomplete/not
helpful?

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.