
Re: [Xen-devel] Xen blktap driver for Ceph RBD : Anybody wants to test ? :p



> >> > > tapdisk[9180]: segfault at 7f7e3a5c8c10 ip 00007f7e387532d4 sp
> >> > 00007f7e3a5c8c10 error 4 in libpthread-2.13.so[7f7e38748000+17000]
> >> > > tapdisk:9180 blocked for more than 120 seconds.
> >> > > tapdisk         D ffff88043fc13540     0  9180      1 0x00000000
> 
> You can try generating a core file by changing the ulimit on the running
> process
> 
> A backtrace would be useful :)
> 
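
For reference, the limit can be bumped on the already-running tapdisk without 
restarting it. A rough sketch using Linux's prlimit(2) wrapper (glibc 2.13 or 
later), with the PID taken from the log above:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>

/* Sketch only: raise RLIMIT_CORE on the running tapdisk (PID 9180 from the
 * log above) so its next segfault leaves a core file, without restarting it. */
int main(void)
{
    struct rlimit unlimited = { RLIM_INFINITY, RLIM_INFINITY };

    if (prlimit(9180, RLIMIT_CORE, &unlimited, NULL) != 0) {
        perror("prlimit");
        return 1;
    }
    return 0;
}

Where the core ends up is then down to the kernel's core_pattern.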

I found it was actually dumping core in /, but gdb doesn't seem to work nicely 
and all I get is this:

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Cannot find new threads: generic error
Core was generated by `tapdisk'.
Program terminated with signal 11, Segmentation fault.
#0  pthread_cond_wait@@GLIBC_2.3.2 () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:163
163     ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such 
file or directory.

Even when I attach to a running process.

One VM segfaults on startup pretty much every time, except never when I attach 
strace to it, which suggests it's probably a race condition and may not actually 
be in your code...

> 
> > Actually maybe not. What I was reading only applies to large numbers of
> > bytes written to the pipe, and even then I got confused by the double
> > negatives. Sorry for the noise.
> 
> Yes, as you discovered, for sizes < PIPE_BUF the writes should be atomic
> even in non-blocking mode. But I could still add an assert() there to make
> sure it is.

Nah, I got that completely backwards. I see now that you are only passing a 
pointer, so yes, the write should never be non-atomic.
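
To spell out what we're both describing (a rough sketch of the pattern only, 
not the actual tapdisk code - queue_request and struct request are made-up 
names): only the pointer goes down the pipe, and sizeof(void *) is far below 
PIPE_BUF, so even with O_NONBLOCK the write either transfers the whole pointer 
or fails with EAGAIN - it is never split.

#include <errno.h>
#include <limits.h>   /* PIPE_BUF */
#include <unistd.h>

struct request;       /* stands in for the real request type */

static int queue_request(int pipe_wfd, struct request *req)
{
    ssize_t n = write(pipe_wfd, &req, sizeof(req));

    if (n == (ssize_t)sizeof(req))
        return 0;     /* whole pointer written */

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -1;    /* pipe full: nothing at all was written */

    /* A short positive count can't happen here: writes of at most PIPE_BUF
     * bytes to a pipe are atomic (see pipe(7)). */
    return -1;
}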

> I did find a bug where it could "leak" requests, which may lead to a hang.
> But it shouldn't crash ...
> 
> Here's an (as yet untested) patch in the rbd error path:
> 

I'll try that later this morning when I get a minute.

I've done the poor man's debugger thing and riddled the code with printfs, but 
as far as I can determine every routine starts and ends. My thinking at the 
moment is that it's either a race (the VMs most likely to crash have multiple 
disks), or a buffer overflow that trips things up either immediately or later.
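
In case it helps anyone doing the same: what I've been scattering around is 
roughly the macro below (DBG is my own name, nothing from the tree). The main 
point is tagging each line with pid/tid and writing to stderr, so output isn't 
left sitting in a stdio buffer when the process dies.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Tag each message with pid/tid and the source location; stderr is
 * unbuffered, so nothing is lost if the process segfaults right after. */
#define DBG(fmt, ...)                                              \
    do {                                                           \
        fprintf(stderr, "[%d/%ld] %s:%d: " fmt "\n",               \
                getpid(), (long)syscall(SYS_gettid),               \
                __func__, __LINE__, ##__VA_ARGS__);                \
    } while (0)

With that, the interleaving between the threads handling the different disks 
is at least visible.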

I have definitely observed multiple VMs crash when something in ceph hiccups 
(e.g. when I bring a mon up or down), if that helps.

I also followed through the rbd_aio_release idea over the weekend - I can see 
that if the read returns failure, it means the callback was never called, so 
the release is then the responsibility of the caller.
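
To spell out how I read that ownership rule (a sketch only - the librbd calls 
are the real C API names, but submit_read/read_done and the request plumbing 
are made up, this isn't the actual driver code):

#include <rbd/librbd.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

static void read_done(rbd_completion_t comp, void *arg)
{
    ssize_t ret = rbd_aio_get_return_value(comp);

    (void)arg;
    if (ret < 0)
        fprintf(stderr, "aio read failed: %zd\n", ret);
    /* ... complete the request that owns 'arg' ... */

    rbd_aio_release(comp);    /* once submission succeeded, the callback owns
                               * the completion and releases it here */
}

static int submit_read(rbd_image_t image, uint64_t off, size_t len,
                       char *buf, void *req)
{
    rbd_completion_t comp;
    int rc = rbd_aio_create_completion(req, read_done, &comp);

    if (rc < 0)
        return rc;

    rc = rbd_aio_read(image, off, len, buf, comp);
    if (rc < 0) {
        /* Submission failed, so read_done() will never run; if we don't
         * release here, the completion (and the request tracking it) leaks. */
        rbd_aio_release(comp);
        return rc;
    }
    return 0;
}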

Thanks

James


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

