
Re: [Xen-devel] xenstored crashes with SIGSEGV



(adding Ian J who knows a bit more about C xenstored than me...)

 On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 12.12.2014 17:56, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> >> On 12.12.2014 17:32, Ian Campbell wrote:
> >>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> >>>> We did enable tracing and now have the xenstored-trace.log of one crash:
> >>>> It contains 1.6 billion lines and is 83 GiB.
> >>>> It just shows xenstored crashing on TRANSACTION_START.
> >>>>
> >>>> Is there some tool to feed that trace back into a newly launched 
> >>>> xenstored?
> >>>
> >>> Not that I know of I'm afraid.
> >>
> >> Okay, then I have to continue with my own tool.
> > 
> > If you do end up developing a tool to replay a xenstore trace then I
> > think that'd be something great to have in tree!
> 
> I just need to figure out how to talk to xenstored on the wire: for some
> strange reason xenstored is closing the connection to the UNIX socket on
> the first write inside a transaction.
> Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
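
FWIW the wire format is just a fixed header plus payload (see
xen/include/public/io/xs_wire.h), so a replay tool doesn't need much.
Below is a minimal client sketch which starts a transaction and does one
write inside it; the socket path is an assumption (it moves around
between distros), and error handling, short reads and reply-type
checking are all elided. If xenstored is dropping your connection, my
first guess would be a header len that disagrees with the bytes actually
sent, or a payload over the 4096-byte limit.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

struct xsd_sockmsg {
	uint32_t type;		/* XS_??? constant */
	uint32_t req_id;	/* echoed back in the reply */
	uint32_t tx_id;		/* 0 when outside a transaction */
	uint32_t len;		/* payload bytes following the header */
};

#define XS_TRANSACTION_START	6
#define XS_WRITE		11

static void xs_request(int fd, uint32_t type, uint32_t tx_id,
		       const void *payload, uint32_t len,
		       char *reply, size_t reply_max)
{
	struct xsd_sockmsg hdr = { .type = type, .tx_id = tx_id, .len = len };

	/* If hdr.len disagrees with what actually follows, xenstored
	 * gives up on the connection -- one way to get exactly the
	 * "socket closed on first write" symptom. */
	write(fd, &hdr, sizeof(hdr));
	write(fd, payload, len);

	read(fd, &hdr, sizeof(hdr));
	if (hdr.len >= reply_max)
		hdr.len = reply_max - 1;
	read(fd, reply, hdr.len);
	reply[hdr.len] = '\0';
}

int main(void)
{
	struct sockaddr_un sun = { .sun_family = AF_UNIX };
	char reply[64];
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	uint32_t tx;

	strcpy(sun.sun_path, "/var/run/xenstored/socket");	/* assumption */
	if (fd < 0 || connect(fd, (struct sockaddr *)&sun, sizeof(sun)))
		return 1;

	/* TRANSACTION_START takes a single nul byte as payload; the reply
	 * payload is the new transaction id as a decimal string. */
	xs_request(fd, XS_TRANSACTION_START, 0, "", 1, reply, sizeof(reply));
	tx = strtoul(reply, NULL, 10);

	/* WRITE payload is path\0data; the data part is NOT nul-terminated,
	 * so len = strlen(path) + 1 + datalen. */
	xs_request(fd, XS_WRITE, tx, "/test\0hello", sizeof("/test") + 5,
		   reply, sizeof(reply));
	printf("tx %u: write replied \"%s\"\n", tx, reply);
	return 0;
}
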
> 
> >>> Do you get a core dump when this happens? You might need to fiddle with
> >>> ulimits (some distros disable them by default). IIRC there is also some
> >>> /proc knob which controls where core dumps go on the filesystem.
> >>
> >> Not for that specific trace: we first enabled generating core files, but
> >> only later discovered that this was not enough.
> > 
> > How wasn't it enough? You mean you couldn't use gdb to extract a
> > backtrace from the core file? Or was something else wrong?
> 
> The 1st and 2nd traces look like this; the ptr passed in from frame #2 looks very bogus.
> 
> (gdb) bt full
> #0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
>         tc = <value optimized out>
> #1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
>         tc = <value optimized out>
> #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0",

This is interesting actually. There are only a small number of calls to
talloc_free in tdb_open_ex (all wrapped in "SAFE_FREE") and they all sit
on the "fail:" error exit path. So I think the reason the crash is rare
is that you have to hit some other failure first.
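
To make that concrete, here is a compilable paraphrase of the shape of
that exit path (a stand-in struct and plain free(); the real code is in
tdb.c and frees via talloc):

#include <stdlib.h>
#include <string.h>

/* Illustration only: a pared-down stand-in for TDB_CONTEXT, just enough
 * to show the shape of the exit path (the real struct lives in tdb.h). */
typedef struct tdb_context {
	void *map_ptr;
	char *name;
	unsigned char *locked;
} TDB_CONTEXT;

/* SAFE_FREE in spirit: guarded free-and-clear (the xenstored copy goes
 * through talloc_free rather than plain free). */
#define SAFE_FREE(p) do { if (p) { free(p); (p) = NULL; } } while (0)

TDB_CONTEXT *tdb_open_sketch(const char *name)
{
	TDB_CONTEXT *tdb = calloc(1, sizeof(*tdb));

	if (!tdb)
		goto fail;
	/* ... several early validity checks can "goto fail" here ... */
	if (!(tdb->name = strdup(name)))
		goto fail;
	/* ... mmap, locking setup: more "goto fail" sites ... */
	return tdb;		/* success: the frees below never run */

 fail:
	/* Only reached after some *other* failure first, hence the rarity.
	 * With a zeroed struct these frees are all safe, so a value like
	 * 0xff00000000 getting in here suggests the struct (or the heap
	 * around it) has been stomped on. */
	if (tdb) {
		SAFE_FREE(tdb->map_ptr);
		SAFE_FREE(tdb->name);
		SAFE_FREE(tdb->locked);
		SAFE_FREE(tdb);
	}
	return NULL;
}
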

About half of the "goto fail" statements are preceded by a TDB_LOG
statement, but given the presence of log_fn=<null_log_fn> in the trace
that isn't likely to be much help right now.

It might be worth splurging some debug output of your own before each of
those failure points and/or wiring the tdb log function up to xenstored's
logging.
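
Wiring the logger up is only a few lines; a sketch, assuming the
tdb_log_func signature in xenstored's tdb.h, with syslog purely as a
stand-in destination:

#include <stdarg.h>
#include <stdio.h>
#include <syslog.h>
#include "tdb.h"	/* for TDB_CONTEXT */

/* Forward tdb's complaints somewhere visible.  The exact signature is
 * an assumption based on tdb.h's tdb_log_func; adjust to taste. */
static void tdb_logger(TDB_CONTEXT *tdb, int level, const char *fmt, ...)
{
	char msg[256];
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(msg, sizeof(msg), fmt, ap);
	va_end(ap);

	syslog(LOG_DAEMON | LOG_ERR, "tdb level %d: %s", level, msg);
}

... and then pass tdb_logger instead of NULL at the tdb_open_ex() call
sites, so that log_fn stops being null_log_fn.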

The calls to SAFE_FREE are
        SAFE_FREE(tdb->map_ptr);
        SAFE_FREE(tdb->name);
        SAFE_FREE(tdb->locked);
        SAFE_FREE(tdb);

I think those should all have been allocated by the time we get to
"fail:" though, so I'm not sure where the 0xff00000000 in the trace comes
from.
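
Since you have cores, gdb on frame #2 should at least show which member
held the bogus value, e.g.:

(gdb) frame 2
(gdb) print *tdb
(gdb) print tdb->map_ptr
(gdb) print tdb->name
(gdb) print tdb->locked

whichever member prints as 0xff00000000 tells you which SAFE_FREE blew
up, and whether the rest of the struct looks sane or stomped-on.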

I've timed out for tonight; I'll try and have another look next week.

Ian.

>     hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>,
>     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958
>         tdb = 0x1921270
>         st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode = 33184, st_uid = 0, st_gid = 0, __pad0 = 0,
>           st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks = 592, st_atim = {tv_sec = 1415748063,
>             tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec = 87562634}, st_ctim = {
>             tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}}
>         rev = <value optimized out>
>         locked = 4232112
>         vp = <value optimized out>
> #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address 0xff00000000 out of bounds>, hash_size=0,
>     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> No locals.
> #4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0")
>     at tdb.c:2124
>         fd = <value optimized out>
>         saved_errno = <value optimized out>
>         copy = 0x0
> #5  0x0000000000406c2d in do_transaction_start (conn=0x1939550, in=<value optimized out>)
>     at xenstored_transaction.c:164
>         trans = 0x1935bb0
>         exists = <value optimized out>
>         id_str = "\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001"
> #6  0x00000000004045ca in process_message (conn=0x1939550) at xenstored_core.c:1214
>         trans = <value optimized out>
> #7  consider_message (conn=0x1939550) at xenstored_core.c:1261
> No locals.
> #8  handle_input (conn=0x1939550) at xenstored_core.c:1308
>         bytes = <value optimized out>
>         in = <value optimized out>
> #9  0x0000000000405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964
> 
> A 3rd trace is somewhere completely different:
> (gdb) bt
> #0  0x00007fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010, format=<value optimized out>, ap=0x7fff46ac3170)
>     at vfprintf.c:1617
> #1  0x00007fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "", maxlen=<value optimized out>,
>     format=0x40d4a4 "%.*s", args=0x7fff46ac3170) at vsnprintf.c:120
> #2  0x000000000040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4 "%.*s", ap=0x7fff46ac31d0) at talloc.c:1104
> #3  0x0000000000408666 in talloc_asprintf (t=0x1f, fmt=0xffffe938 <Address 0xffffe938 out of bounds>)
>     at talloc.c:1129
> #4  0x0000000000403a38 in ask_parents (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel",
>     perm=XS_PERM_READ) at xenstored_core.c:492
> #5  errno_from_parents (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
>     at xenstored_core.c:516
> #6  get_node (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
>     at xenstored_core.c:543
> #7  0x000000000040481d in do_read (conn=0x177a1f0) at xenstored_core.c:744
> #8  process_message (conn=0x177a1f0) at xenstored_core.c:1178
> #9  consider_message (conn=0x177a1f0) at xenstored_core.c:1261
> #10 handle_input (conn=0x177a1f0) at xenstored_core.c:1308
> #11 0x0000000000405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964
> 
> 
> >> It might be interesting to see what happens if you preserve the db and
> >> reboot arranging for the new xenstored to start with the old file. If
> >> the corruption is part of the file then maybe it can be induced to crash
> >> again more quickly.
> > 
> > Thanks for the pointer, will try.
> 
> Didn't crash immediately.
> Now running /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py for
> the weekend.
> 
> Thanks again.
> Philipp
> 


