
Re: [Xen-devel] xenstored crashes with SIGSEGV



On Fri, 2014-12-12 at 17:58 +0000, Ian Campbell wrote:
> (adding Ian J who knows a bit more about C xenstored than me...)
> 
>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> > Hello Ian,
> > 
> > On 12.12.2014 17:56, Ian Campbell wrote:
> > > On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> > >> On 12.12.2014 17:32, Ian Campbell wrote:
> > >>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> > >>>> We did enable tracing and now have the xenstored-trace.log of one 
> > >>>> crash:
> > >>>> It contains 1.6 billion lines and is 83 GiB.
> > >>>> It just shows xenstored crashing on TRANSACTION_START.
> > >>>>
> > >>>> Is there some tool to feed that trace back into a newly launched 
> > >>>> xenstored?
> > >>>
> > >>> Not that I know of I'm afraid.
> > >>
> > >> Okay, then I have to continue with my own tool.
> > > 
> > > If you do end up developing a tool to replay a xenstore trace then I
> > > think that'd be something great to have in tree!
> > 
> > I just need to figure out how to talk to xenstored on the wire: for some
> > strange reason xenstored is closing the connection to the UNIX socket on
> > the first write inside a transaction.
> > Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
> > 
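
FWIW the wire format is fairly simple: every request and reply is a
struct xsd_sockmsg header (type, req_id, tx_id, len, all uint32_t -- see
xen/include/public/io/xs_wire.h) followed by len bytes of payload, and
the only extra requirement for a write inside a transaction is that the
header's tx_id carries the id which the TRANSACTION_START reply returned
as a decimal string. One thing worth double-checking is the len field:
IIRC xenstored treats an over-long request as a protocol error and simply
drops the client, which would look exactly like the connection being
closed on you. Rough sketch of the sequence below -- untested, constants
copied from my reading of xs_wire.h and with no real error handling, so
treat it as illustration rather than the beginnings of the replay tool:

/* Minimal sketch: start a transaction and do one write inside it over
 * the xenstored UNIX socket.  Header layout and message type numbers
 * are copied from my reading of xen/include/public/io/xs_wire.h --
 * please double-check against your tree.  Short reads/writes and error
 * handling are ignored. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define XS_SOCKET            "/var/run/xenstored/socket"
#define XS_TRANSACTION_START 6    /* enum xsd_sockmsg_type */
#define XS_TRANSACTION_END   7
#define XS_WRITE             11

struct xsd_sockmsg {
    uint32_t type;      /* XS_??? */
    uint32_t req_id;    /* echoed back in the reply */
    uint32_t tx_id;     /* 0 if not inside a transaction */
    uint32_t len;       /* length of the payload which follows */
};

static void xs_request(int fd, uint32_t type, uint32_t tx_id,
                       const void *payload, uint32_t len,
                       char *reply, uint32_t space)
{
    struct xsd_sockmsg hdr = { .type = type, .tx_id = tx_id, .len = len };

    write(fd, &hdr, sizeof(hdr));
    write(fd, payload, len);

    /* Reply is a header followed by exactly hdr.len bytes of payload. */
    read(fd, &hdr, sizeof(hdr));
    if (hdr.len >= space)
        hdr.len = space - 1;
    read(fd, reply, hdr.len);
    reply[hdr.len] = '\0';
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX, .sun_path = XS_SOCKET };
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    char reply[4096];

    connect(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* TRANSACTION_START: payload is a single nul byte; the reply payload
     * is the new transaction id as a decimal string. */
    xs_request(fd, XS_TRANSACTION_START, 0, "", 1, reply, sizeof(reply));
    uint32_t tx = strtoul(reply, NULL, 10);

    /* WRITE inside the transaction: payload is "<path>\0<data>", and the
     * header's tx_id must be the id we were just given. */
    const char path[] = "/local/domain/0/test";   /* made-up example path */
    const char data[] = "hello";
    char payload[sizeof(path) + sizeof(data) - 1];
    memcpy(payload, path, sizeof(path));              /* includes the nul */
    memcpy(payload + sizeof(path), data, sizeof(data) - 1);
    xs_request(fd, XS_WRITE, tx, payload, sizeof(payload),
               reply, sizeof(reply));
    printf("write reply: %s\n", reply);

    /* Commit ("T") or abort ("F") the transaction. */
    xs_request(fd, XS_TRANSACTION_END, tx, "T", 2, reply, sizeof(reply));

    close(fd);
    return 0;
}
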
> > >>> Do you get a core dump when this happens? You might need to fiddle with
> > >>> ulimits (some distros disable by default). IIRC there is also some /proc
> > >>> knob which controls where core dumps go on the filesystem.
> > >>
> > >> Not for that specific trace: We first enabled generating core files, but
> > >> only then discovered that this is not enough.
> > > 
> > > How wasn't it enough? You mean you couldn't use gdb to extract a
> > > backtrace from the core file? Or was something else wrong?
> > 
> > The 1st and 2nd traces look like this: the ptr in frame #2 looks very bogus.
> > 
> > (gdb) bt full
> > #0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
> >         tc = <value optimized out>
> > #1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
> >         tc = <value optimized out>
> > #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
> > "/var/lib/xenstored/tdb.0x1935bb0",
> 
> I've timed out for tonight; I will try and have another look next week.

I've had another dig, and have instrumented all of the error paths out of
this function, and I can't see any way for an invalid pointer to be
produced, let alone freed. I've also been running under valgrind, which
should have caught any uninitialised-memory errors.
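
For anyone following along: the reason the SIGSEGV lands in
talloc_chunk_from_ptr() rather than in the "bad talloc magic" abort is
that talloc keeps its bookkeeping header immediately before the pointer
it hands out, so the very first thing talloc_free() does is step back
over that header and read the flags word -- with a wild pointer like
0xff00000000 it is that read which faults. Paraphrased below, not quoting
our talloc.c; the struct layout and magic value are illustrative:

/* Paraphrase of the talloc internals in the crashing frame; see
 * tools/xenstore/talloc.c for the real code. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define TALLOC_MAGIC 0xe814ec70u    /* illustrative value */
#define TALLOC_ABORT(reason) do { fprintf(stderr, "%s\n", reason); abort(); } while (0)

struct talloc_chunk {
    struct talloc_chunk *next, *prev, *parent, *child;
    size_t size;
    unsigned flags;          /* holds TALLOC_MAGIC while the chunk is live */
};

#define TC_HDR_SIZE sizeof(struct talloc_chunk)

static struct talloc_chunk *talloc_chunk_from_ptr(const void *ptr)
{
    /* The pointer talloc hands out points just past its own header, so
     * step back over the header ... */
    struct talloc_chunk *tc =
        (struct talloc_chunk *)((char *)ptr - TC_HDR_SIZE);

    /* ... and this is the first dereference of the caller's pointer.
     * With a wild value such as 0xff00000000 this read takes the SIGSEGV
     * before the magic check can abort with a useful message. */
    if (tc->flags != TALLOC_MAGIC)
        TALLOC_ABORT("Bad talloc magic value");

    return tc;
}

int main(void)
{
    /* Reproduces the shape of frames #0/#1 of the backtrace. */
    talloc_chunk_from_ptr((void *)0xff00000000ULL);
    return 0;
}
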

> >     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
> > optimized out>, mode=<value optimized out>,
> >     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
> > tdb.c:1958

Please can you confirm what is at line 1958 of your copy of tdb.c. I
think it will be tdb->locked, but I'd like to be sure.

You are running a 64-bit dom0, correct? I've only just noticed that
0xff00000000 doesn't fit in 32 bits. My testing so far has been 32-bit; I
don't think that should matter wrt use of uninitialised data etc.
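
Just to spell the value out: the low 32 bits are all zero and only the
byte at bits 32-39 is set, so it really can't be a truncated or
zero-extended 32-bit quantity. E.g.:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t ptr = 0xff00000000ULL;               /* the bogus pointer */

    printf("%d\n", ptr == (0xffULL << 32));       /* 1: only byte 4 is set */
    printf("low 32 bits : 0x%08x\n", (uint32_t)ptr);          /* 0x00000000 */
    printf("high 32 bits: 0x%08x\n", (uint32_t)(ptr >> 32));  /* 0x000000ff */
    return 0;
}
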

I can't help feeling that 0xff00000000 must be some sort of magic
sentinel value to someone. I can't figure out what though.

Have you observed the xenstored processes growing especially large
before this happens? I'm wondering if there might be a leak somewhere
which after a time is resulting in an allocation failure.

I'm about to send out a patch which plumbs tdb's logging into
xenstored's logging, in the hopes that next time you see this it might
say something as it dies.
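
Roughly what I have in mind is the shape below -- only a sketch, the
helper name and where the message finally ends up (trace file, syslog,
...) are placeholders, but the callback signature is the same one
tdb_open_ex() already takes (it's the log_fn=<null_log_fn> argument
visible in your backtrace):

/* Sketch only: forward tdb's internal complaints into xenstored's
 * logging instead of discarding them in null_log_fn. */
#include <stdarg.h>
#include <stdio.h>
#include <syslog.h>

#include "tdb.h"        /* TDB_CONTEXT, tdb_log_func, tdb_open_ex() */

static void tdb_logger(TDB_CONTEXT *tdb, int level, const char *fmt, ...)
{
    va_list ap;
    char msg[256];

    (void)tdb;          /* not needed for this sketch */

    va_start(ap, fmt);
    vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);

    /* Placeholder destination; the point is just that the text survives. */
    syslog(LOG_ERR, "xenstored: tdb(level %d): %s", level, msg);
}

/* ... and then pass tdb_logger instead of null_log_fn in the
 * tdb_open_ex() calls, so anything tdb complains about before dying
 * ends up somewhere we can see it. */
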

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

