
Re: [Xen-devel] xenstored crashes with SIGSEGV



2014-12-16 12:23 GMT+00:00 Ian Campbell <Ian.Campbell@xxxxxxxxxx>:
> On Tue, 2014-12-16 at 11:30 +0000, Frediano Ziglio wrote:
>> 2014-12-16 11:06 GMT+00:00 Ian Campbell <Ian.Campbell@xxxxxxxxxx>:
>> > On Tue, 2014-12-16 at 10:45 +0000, Ian Campbell wrote:
>> >> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>> >> > > I notice in your bugzilla (for a different occurrence, I think):
>> >> > >> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip 
>> >> > >> 000000000045e238 sp 00007ffff68dfa30 error 6 in 
>> >> > >> python2.6[400000+21e000]
>> >> > >
>> >> > > Which appears to have faulted accessing 0xff00000000 too. It looks
>> >> > > like this process is a python thing; it's nothing to do with
>> >> > > xenstored, I assume?
>> >> >
>> >> > Yes, that's one univention-config, which is completely independent of
>> >> > xen(stored).
>> >> >
>> >> > > It seems rather coincidental that it should be accessing the
>> >> > > same sort of address and be faulting.
>> >> >
>> >> > Yes, good catch. I'll have another look at those core dumps.
>> >>
>> >> With this in mind, please can you confirm what model of machines you've
>> >> seen this on, and in particular whether they are all the same class of
>> >> machine or whether they are significantly different.
>> >>
>> >> The reason being that randomly placed 0xff values in a field of 0x00
>> >> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
>> >> memory pages.
>> >
>> > Thanks for giving me access to the core files. This is very suspicious:
>> > (gdb) frame 2
>> > #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0 
>> > "/var/lib/xenstored/tdb.0x1935bb0", hash_size=<value optimized out>, 
>> > tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>,
>> >     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at 
>> > tdb.c:1958
>> > 1958            SAFE_FREE(tdb->locked);
>> >
>> > (gdb) x/96x tdb
>> > 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
>> > 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921300:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921310:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921320:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921330:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921340:      0x00000000      0x00000000      0x0000ff00      0x000000ff
>> > 0x1921350:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921360:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921370:      0x004093b0      0x00000000      0x004092f0      0x00000000
>> > 0x1921380:      0x00000002      0x00000000      0x00000091      0x00000000
>> > 0x1921390:      0x0193de70      0x00000000      0x01963600      0x00000000
>> > 0x19213a0:      0x00000000      0x00000000      0x0193fbb0      0x00000000
>> > 0x19213b0:      0x00000000      0x00000000      0x00000000      0x00000000
>> > 0x19213c0:      0x00405870      0x00000000      0x0040e3e0      0x00000000
>> > 0x19213d0:      0x00000038      0x00000000      0xe814ec70      0x6f2f6567
>> > 0x19213e0:      0x01963650      0x00000000      0x0193dec0      0x00000000
>> >
>> > Something has clearly done a number on the RAM of this process.
>> > 0x1921270 through 0x192136f is 256 bytes...
>> >
>> > Since it appears to be happening to other processes too I would hazard
>> > that this is not a xenstored issue.
>> >
>> > Ian.
>> >
>>
>> Good catch, Ian!
>>
>> Strange corruption. Probably not related to xenstored, as you
>> suggested. I would be curious to see what's before the tdb pointer
>> and where the corruption starts.
>
> (gdb) print tdb
> $2 = (TDB_CONTEXT *) 0x1921270
> (gdb) x/64x 0x1921200
> 0x1921200:      0x01921174      0x00000000      0x00000000      0x00000000
> 0x1921210:      0x01921174      0x00000000      0x00000171      0x00000000
> 0x1921220:      0x00000000      0x00000000      0x00000000      0x00000000

0x0 next (u64)
0x0 prev (u64)

> 0x1921230:      0x01941f60      0x00000000      0x00000000      0x00000000

0x01941f60 parent (u64), makes sense that it's not NULL
0x0 child (u64)

> 0x1921240:      0x00000000      0x00000000      0x00000000      0x6f630065

0x0 refs (u64)
0x0 null_refs (u32)
0x6f630065 pad, garbage (u32)

> 0x1921250:      0x00000000      0x00000000      0x0040e8a7      0x00000000

0x0 destructor (u64)
0x0040e8a7 name (u64)

> 0x1921260:      0x00000118      0x00000000      0xe814ec70      0x00000000

0x118 size (u64)
0xe814ec70 magic (u32)
0x0 pad (u32)

Well... the whole talloc header looks fine to me.
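
For reference, the decoding above corresponds to a talloc chunk header
roughly like this (a sketch reconstructed from the fields decoded
above; the names are mine and not checked against the exact talloc.c
in this tree):

/* sketch: talloc header that precedes every allocation */
struct talloc_chunk {
    struct talloc_chunk *next;   /* 0x1921220: NULL */
    struct talloc_chunk *prev;   /* 0x1921228: NULL */
    struct talloc_chunk *parent; /* 0x1921230: 0x01941f60 */
    struct talloc_chunk *child;  /* 0x1921238: NULL */
    void *refs;                  /* 0x1921240: NULL */
    unsigned null_refs;          /* 0x1921248: 0, then 4 bytes of
                                    padding: the 0x6f630065 garbage */
    int (*destructor)(void *);   /* 0x1921250: NULL */
    const char *name;            /* 0x1921258: 0x0040e8a7 */
    size_t size;                 /* 0x1921260: 0x118 */
    unsigned magic;              /* 0x1921268: 0xe814ec70 (+4 pad) */
};                               /* user data (the TDB_CONTEXT)
                                    starts at 0x1921270 */

With that layout, the corruption begins exactly at the first byte of
the user data, i.e. the region that talloc_zero() would have memset()
to zero.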


> 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
> 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
> 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>
> So it appears to start at 0x1921270 or maybe ...6c.
>

It looks like there is a repeating pattern:

 0x00000000      0x000000ff      0x0000ff00      0x000000ff

The only exceptions are where a field was set after talloc_zero (fd,
flags, function pointers). It is as if the memset inside talloc_zero
filled the memory with this pattern instead of zeroes. Note that a
16-byte pattern matches the SSE register size. Could this be a bug in
the save/restore of SSE registers, or a bug in SSE emulation?
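
To illustrate the idea (a minimal sketch, not the actual libc memset):
if an SSE-based memset zeroed memory with 16-byte stores from a
corrupted xmm register, the register contents would be replicated
across the buffer, reproducing exactly the rows seen in the dump:

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* hypothetical corrupted "zero" register, dwords as in the dump */
    const uint32_t bad[4] = { 0x00000000, 0x000000ff,
                              0x0000ff00, 0x000000ff };
    __m128i v = _mm_loadu_si128((const __m128i *)bad);

    uint32_t buf[16];
    for (int i = 0; i < 4; i++)  /* one 16-byte SSE store per pass */
        _mm_storeu_si128((__m128i *)&buf[i * 4], v);

    for (int i = 0; i < 16; i += 4)
        printf("0x%08x      0x%08x      0x%08x      0x%08x\n",
               buf[i], buf[i + 1], buf[i + 2], buf[i + 3]);
    return 0;
}

Built with gcc -msse2, this prints four rows identical to the
corrupted lines above.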

What does the "info all-registers" gdb command say about the SSE registers?
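
For example (standard gdb commands; which xmm register a memset uses
depends on the implementation):

(gdb) info all-registers
(gdb) print/x $xmm0

If any xmm register in the core still held the repeating 0x00/0xff
dwords, that would support the corrupted-SSE-state theory, although
the registers may well have been reused between the memset and the
crash.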

Do we have a bug in Xen that affects SSE instructions (possibly
already fixed since Philipp's version)?

>>  I also don't understand where the "fd = 47" in a previous mail
>> came from. 0x1f is 31, not 47 (which is 0x2f).
>
> I must have been using a different coredump to the original report
> (there are several).
>
> In the one which corresponds to the above:
>
> (gdb) print *tdb
> $3 = {name = 0x0, map_ptr = 0x0, fd = 31, map_size = 255,
>   read_only = 65280, locked = 0xff00000000, ecode = 65280, header = {
>     magic_food = 
> "\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000",
>  version = 255, hash_size = 0, rwlocks = 255, reserved = {65280,
>       255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280,
>       255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280,
>       255, 0, 255, 65280, 255, 0}}, flags = 0, travlocks = {
>     next = 0xff0000ff00, off = 0, hash = 255}, next = 0xff0000ff00,
>   device = 1095216660480, inode = 1095216725760,
>   log_fn = 0x4093b0 <null_log_fn>,
>   hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}
> (gdb) print/x *tdb
> $4 = {name = 0x0, map_ptr = 0x0, fd = 0x1f, map_size = 0xff,
>   read_only = 0xff00, locked = 0xff00000000, ecode = 0xff00,
>   header = {magic_food = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>       0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0,
>       0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0},
>     version = 0xff, hash_size = 0x0, rwlocks = 0xff, reserved = {
>       0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00,
>       0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0,
>       0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff,
>       0xff00, 0xff, 0x0}}, flags = 0x0, travlocks = {
>     next = 0xff0000ff00, off = 0x0, hash = 0xff},
>   next = 0xff0000ff00, device = 0xff00000000, inode = 0xff0000ff00,
>   log_fn = 0x4093b0, hash_fn = 0x4092f0, open_flags = 0x2}
>
> which is consistent.
>
>> I would not be surprised by a strange bug in libc or the kernel.
>
> Or even Xen itself, or the h/w.
>
> Ian.
>

Frediano

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

