Re: [Xen-devel] help with xenstored 'hang'
On 1 July 2010 14:30, Jim Fehlig <jfehlig@xxxxxxxxxx> wrote:
> Patrick Colp wrote:
>> I was recently struggling with what sounds like a not-too-dissimilar
>> problem while working with a disaggregated version of xenstore. The
>> ultimate solution for me was to disable pthreads in xenstore/libxs. I
>> just commented out the following line in tools/xenstore/Makefile:
>>
>> xs.opic: CFLAGS += -DUSE_PTHREAD
>>
>> After I removed that line and rebuilt and installed xenstore, it
>> worked just fine. I would be curious to know if this also solves your
>> problem.
>>
>
> After more thought, this seems like it could cause problems in xend,
> which is multi-threaded. This change essentially makes the xenstore
> client library thread-unsafe, correct?

I don't think so. I think it just makes the xenstore library
single-threaded. In my case, I was using a single-threaded application
and still ran into this problem, as the xenstore library seems to have
multiple threads. But the description of your problem sounds a lot
like what was happening with me, where it seemed like messages were
disappearing. I can't say if what worked for me would work for you,
though. It just seemed similar enough to me.

Patrick

>
> Regards,
> Jim
>
>>
>> Patrick
>>
>>
>> On 30 June 2010 15:15, Jim Fehlig <jfehlig@xxxxxxxxxx> wrote:
>>
>>> I'm trying to debug an 'xm list' hang on a large (~700 hosts) Xen 3.2
>>> production installation. The hang occurs randomly, on a random host.
>>> User has provided cores of xend and xenstored processes when hang
>>> occurs. After poking at these cores I have discovered
>>>
>>> In xend process, a thread is blocked on a cond variable, waiting for a
>>> response to XS_TRANSACTION_START from xenstored. A reader thread
>>> responsible for reading from xenstored is blocked on read(2).
>>>
>>> In the xenstored process, the lone thread is blocked on select(2),
>>> waiting for IO.
>>> I examined the connections list and see that it contains
>>> a connection for the XS_TRANSACTION_START request. Dumping the
>>> connection object:
>>>
>>> (gdb) p *(struct connection *)0x526c70
>>> $48 = {list = {next = 0x517c30, prev = 0x5151f0}, fd = 13, id = 0,
>>>   can_write = true, in = 0x523600,
>>>   out_list = {next = 0x526c98, prev = 0x526c98}, transaction = 0x0,
>>>   transaction_list = {next = 0x523560, prev = 0x523560},
>>>   next_transaction_id = 60231445, transaction_started = 1,
>>>   domain = 0x0, watches = {next = 0x51daa0, prev = 0x5267b0},
>>>   write = 0x402460 <writefd>, read = 0x405180 <readfd>}
>>>
>>> Notice transaction_started is set to 1, but out_list is empty. AFAICT,
>>> that means the reply has been sent to xend. The reader thread in xend
>>> should have received the response and signaled the cond variable -
>>> allowing execution to progress. Ultimately, xend would send a
>>> XS_TRANSACTION_END message, freeing the connection object in xenstored
>>> and removing it from connections list.
>>>
>>> Does my understanding of this code sound correct? Anyone have
>>> suggestions or further debugging tips? Examining cores is about my only
>>> debug option as user does not want to deploy debug patches, enable
>>> tracing, etc. across 700 hosts.
>>>
>>> Interestingly, when user strace's or attaches to xenstored process with
>>> gdb, xenstored "awakes", the hung 'xm list' returns, and xenstored
>>> continues normally. A new connection to xenstored (e.g. running xmtop)
>>> seems to poke it along as well. Would a timeout on select(2) in main
>>> loop of xenstored help at all?
>>>
>>> Thanks for any insights!
>>> Jim
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>>> http://lists.xensource.com/xen-devel
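
On the closing question about a timeout on select(2): the following is a minimal sketch of what a select loop with a timeout looks like, written in Python for brevity rather than in xenstored's C. It is not xenstored's main loop. The names poll_once, run_loop and on_idle are hypothetical, and the sketch assumes a POSIX system where select works on pipe descriptors. It only shows the mechanism: with a timeout, the loop wakes up periodically even when no descriptor is reported readable, so it gets a chance to re-examine its state. Whether that would fix or merely mask the hang described above is not established by this thread.

```python
import os
import select


def poll_once(read_fds, timeout_s):
    """Wait up to timeout_s seconds for any fd in read_fds to become readable.

    Returns the list of readable fds; an empty list means the timeout expired.
    """
    readable, _, _ = select.select(read_fds, [], [], timeout_s)
    return readable


def run_loop(read_fds, timeout_s, max_iterations, on_idle):
    """Hypothetical main loop: read from ready fds, call on_idle on each timeout."""
    handled = []
    for _ in range(max_iterations):
        ready = poll_once(read_fds, timeout_s)
        if not ready:
            # Timeout expired with nothing to read: the loop still wakes up
            # and can re-check pending work instead of blocking forever.
            on_idle()
            continue
        for fd in ready:
            handled.append(os.read(fd, 4096))
    return handled
```

Without the timeout argument, select blocks until some descriptor changes state, which matches the "blocked on select(2), waiting for IO" state seen in the xenstored core.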