[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] xenstored crashes with SIGSEGV



Hello Ian,

On 12.12.2014 17:56, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
>> On 12.12.2014 17:32, Ian Campbell wrote:
>>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
>>>> We did enable tracing and now have the xenstored-trace.log of one crash:
>>>> It contains 1.6 billion lines and is 83 GiB.
>>>> It just shows xenstored to crash on TRANSACTION_START.
>>>>
>>>> Is there some tool to feed that trace back into a newly launched xenstored?
>>>
>>> Not that I know of I'm afraid.
>>
>> Okay, then I have to continue with my own tool.
> 
> If you do end up developing a tool to replay a xenstore trace then I
> think that'd be something great to have in tree!

I just need to figure out how to talk to xenstored on the wire: for some
strange reason xenstored is closing the connection to the UNIX socket on
the first write inside a transaction.
Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...

>>> Do you get a core dump when this happens? You might need to fiddle with
>>> ulimits (some distros disable by default). IIRC there is also some /proc
>>> nob which controls where core dumps go on the filesystem.
>>
>> Not for that specific trace: We first enabled generating core files, but
>> only then discovered that this is not enough.
> 
> How wasn't it enough? You mean you couldn't use gdb to extract a
> backtrace from the core file? Or was something else wrong?

The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.

(gdb) bt full
#0  talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116
        tc = <value optimized out>
#1  0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551
        tc = <value optimized out>
#2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
"/var/lib/xenstored/tdb.0x1935bb0",
    hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
optimized out>, mode=<value optimized out>,
    log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
tdb.c:1958
        tdb = 0x1921270
        st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode =
33184, st_uid = 0, st_gid = 0, __pad0 = 0,
          st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks =
592, st_atim = {tv_sec = 1415748063,
            tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec
= 87562634}, st_ctim = {
            tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}}
        rev = <value optimized out>
        locked = 4232112
        vp = <value optimized out>
#3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
0xff00000000 out of bounds>, hash_size=0,
    tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
No locals.
#4  0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
"/var/lib/xenstored/tdb.0x1935bb0")
    at tdb.c:2124
        fd = <value optimized out>
        saved_errno = <value optimized out>
        copy = 0x0
#5  0x0000000000406c2d in do_transaction_start (conn=0x1939550,
in=<value optimized out>)
    at xenstored_transaction.c:164
        trans = 0x1935bb0
        exists = <value optimized out>
        id_str =
"\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001"
#6  0x00000000004045ca in process_message (conn=0x1939550) at
xenstored_core.c:1214
        trans = <value optimized out>
#7  consider_message (conn=0x1939550) at xenstored_core.c:1261
No locals.
#8  handle_input (conn=0x1939550) at xenstored_core.c:1308
        bytes = <value optimized out>
        in = <value optimized out>
#9  0x0000000000405170 in main (argc=<value optimized out>, argv=<value
optimized out>) at xenstored_core.c:1964

A 3rd trace is somewhere completely different:
(gdb) bt
#0  0x00007fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010,
format=<value optimized out>, ap=0x7fff46ac3170)
    at vfprintf.c:1617
#1  0x00007fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "",
maxlen=<value optimized out>,
    format=0x40d4a4 "%.*s", args=0x7fff46ac3170) at vsnprintf.c:120
#2  0x000000000040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4
"%.*s", ap=0x7fff46ac31d0) at talloc.c:1104
#3  0x0000000000408666 in talloc_asprintf (t=0x1f, fmt=0xffffe938
<Address 0xffffe938 out of bounds>)
    at talloc.c:1129
#4  0x0000000000403a38 in ask_parents (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel",
    perm=XS_PERM_READ) at xenstored_core.c:492
#5  errno_from_parents (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
    at xenstored_core.c:516
#6  get_node (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
    at xenstored_core.c:543
#7  0x000000000040481d in do_read (conn=0x177a1f0) at xenstored_core.c:744
#8  process_message (conn=0x177a1f0) at xenstored_core.c:1178
#9  consider_message (conn=0x177a1f0) at xenstored_core.c:1261
#10 handle_input (conn=0x177a1f0) at xenstored_core.c:1308
#11 0x0000000000405170 in main (argc=<value optimized out>, argv=<value
optimized out>) at xenstored_core.c:1964


>> It might be interesting to see what happens if you preserve the db and
>> reboot arranging for the new xenstored to start with the old file. If
>> the corruption is part of the file then maybe it can be induced to crash
>> again more quickly.
> 
> Thanks for the pointer, will try.

Didn't crash immediately.
Now running /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py for
the weekend.

Thanks again.
Philipp

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.