
[Xen-devel] blocking Xen 3.X production use: soft lockup bugs



Hi All,

I hate to say it, but it's starting to look like soft lockup bug(s)
are turning into a serious roadblock for general production use of Xen
3.X, on a wide range of hardware.  I've been using Xen since the 1.0
days, and I have to say that this is the most serious showstopper bug
I've ever hit -- it usually manifests itself during the first
significant network and/or disk I/O after starting a second or third
domU on the same box, and is the only bug I've ever hit that has
caused permanent damage -- it tends to corrupt guest filesystems.  In
my case it's stopped a deployment dead in its tracks, and our only
options at this point are to go back to Xen 2.X or (horrors) to native
Linux kernels.

The problem (or something that looks identical) is described in
several tickets, status currently NEW or REOPENED, no clear
resolution:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705

In our own shop, we consistently hit soft lockups while running on
both IBM x330's and older Netengines (similar to an IBM 4000R).  We've
found no workaround.  We're on xen-3.0-testing, changeset 9732, kernel
2.6.16.13.  On April 6th, Keir posted a note saying this was fixed as
of a blkif_schedule() fix, which we already have because that was way
back in changeset 9587...
http://lists.xensource.com/archives/html/xen-devel/2006-04/msg00121.html

The most recent devel list traffic I've found which covers this is
from July 7th:
http://lists.xensource.com/archives/html/xen-users/2006-07/msg00134.html
...this message referred back to Keir's comment as describing a fix,
but that doesn't look true; while Keir's 9587 checkin may have fixed a
soft lockup problem, there appear to be more out there, or else
there's been a regression.

Do we have any consensus that this bug is fixed at all in
xen-3.0-testing, or even unstable?  Is anyone who was hitting soft
lockups in testing *not* hitting them any more on the same hardware?
If so, what changeset are you on now?

If anyone needs any more information, just let me know.  As usual, if
anyone wants login and console server access to one of these boxes to
chase this down, I'm more than happy to provide that.

Thanks, 

Steve
-- 
Stephen G. Traugott  (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@xxxxxxxxxxxxx 
http://www.stevegt.com -- http://Infrastructures.Org

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

