
Re: [Xen-devel] help with xenstored 'hang'



Patrick Colp wrote:
> I was recently struggling with what sounds like a not-too-dissimilar
> problem while working with a disaggregated version of xenstore. The
> ultimate solution for me was to disable pthreads in xenstore/libxs. I
> just commented out the following line in tools/xenstore/Makefile:
>
> xs.opic: CFLAGS += -DUSE_PTHREAD
>
> After I removed that line and rebuilt and installed xenstore, it
> worked just fine. I would be curious to know if this also solves your
> problem.
>   

After more thought, this seems like it could cause problems in xend,
which is multi-threaded.  This change essentially makes the xenstore
client library thread-unsafe, correct?
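
To make the concern concrete, here is a minimal sketch of how a build-time
flag such as USE_PTHREAD typically gates locking in a C client library. I
have not checked this against the xenstore tree: struct handle,
handle_next_id() and the lib_* macros are hypothetical names for
illustration only. Without the flag the lock calls compile to nothing, so
two threads sharing one handle can race on its state.

```c
#ifdef USE_PTHREAD
#include <pthread.h>
typedef pthread_mutex_t lib_mutex_t;
#define LIB_MUTEX_INIT PTHREAD_MUTEX_INITIALIZER
#define lib_lock(m)    pthread_mutex_lock(m)
#define lib_unlock(m)  pthread_mutex_unlock(m)
#else
/* Built without USE_PTHREAD: locking becomes a no-op. */
typedef int lib_mutex_t;
#define LIB_MUTEX_INIT 0
#define lib_lock(m)    ((void)(m))
#define lib_unlock(m)  ((void)(m))
#endif

/* Hypothetical per-connection state shared by all callers. */
struct handle {
    lib_mutex_t lock;
    unsigned next_req_id;
};

/* Hand out the next request id. The read-modify-write is only
 * atomic with respect to other threads when USE_PTHREAD is defined. */
unsigned handle_next_id(struct handle *h)
{
    unsigned id;

    lib_lock(&h->lock);
    id = h->next_req_id++;
    lib_unlock(&h->lock);
    return id;
}
```

A single-threaded caller sees the same results either way; the difference
only shows up when several threads share one handle, as they would in xend.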

Regards,
Jim

>
> Patrick
>
>
> On 30 June 2010 15:15, Jim Fehlig <jfehlig@xxxxxxxxxx> wrote:
>   
>> I'm trying to debug an 'xm list' hang on a large (~700 hosts) Xen 3.2
>> production installation.  The hang occurs randomly, on a random host.
>> The user has provided cores of the xend and xenstored processes taken
>> when the hang occurs.  After poking at these cores I have found the
>> following.
>>
>> In xend process, a thread is blocked on a cond variable, waiting for a
>> response to XS_TRANSACTION_START from xenstored. A reader thread
>> responsible for reading from xenstored is blocked on read(2).
>>
>> In the xenstored process, the lone thread is blocked on select(2),
>> waiting for IO. I examined the connections list and see that it contains
>> a connection for the XS_TRANSACTION_START request.  Dumping the
>> connection object:
>>
>> (gdb) p *(struct connection *)0x526c70
>> $48 = {list = {next = 0x517c30, prev = 0x5151f0}, fd = 13, id = 0,
>>   can_write = true, in = 0x523600,
>>   out_list = {next = 0x526c98, prev = 0x526c98}, transaction = 0x0,
>>   transaction_list = {next = 0x523560, prev = 0x523560},
>>   next_transaction_id = 60231445, transaction_started = 1, domain = 0x0,
>>   watches = {next = 0x51daa0, prev = 0x5267b0},
>>   write = 0x402460 <writefd>, read = 0x405180 <readfd>}
>>
>> Notice transaction_started is set to 1, but out_list is empty. AFAICT,
>> that means the reply has been sent to xend. The reader thread in xend
>> should have received the response and signaled the cond variable -
>> allowing execution to progress. Ultimately, xend would send an
>> XS_TRANSACTION_END message, freeing the connection object in xenstored
>> and removing it from connections list.
>>
>> Does my understanding of this code sound correct?  Anyone have
>> suggestions or further debugging tips?  Examining cores is about my only
>> debug option as user does not want to deploy debug patches, enable
>> tracing, etc. across 700 hosts.
>>
>> Interestingly, when the user runs strace on the xenstored process or
>> attaches to it with gdb, xenstored "awakes", the hung 'xm list' returns,
>> and xenstored
>> continues normally.  A new connection to xenstored (e.g. running xmtop)
>> seems to poke it along as well.  Would a timeout on select(2) in main
>> loop of xenstored help at all?
>>
>> Thanks for any insights!
>> Jim
>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel
>>
>>
>>     
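
On the select(2) timeout question quoted above, this is a rough sketch of
what a bounded wait looks like. wait_readable() is a hypothetical helper,
not a function from xenstored, and the timeout value is arbitrary. A
timeout of this kind would presumably only turn an indefinite stall into
a bounded one by forcing the main loop to run again; it would not explain
why the wakeup is being missed in the first place.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wait until fd is readable or timeout_ms milliseconds elapse.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error
 * (the raw select(2) result for a single descriptor). */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* Passing NULL instead of &tv would block indefinitely. */
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}
```

A main loop built on this would treat a return of 0 as a cue to re-check
its connection list and pending output before waiting again.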
