[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 04/14] aio: make aio_context_acquire()/aio_context_release() a no-op



Am 19.12.2023 um 16:28 hat Kevin Wolf geschrieben:
> Am 05.12.2023 um 19:20 hat Stefan Hajnoczi geschrieben:
> > aio_context_acquire()/aio_context_release() has been replaced by
> > fine-grained locking to protect state shared by multiple threads. The
> > AioContext lock still plays the role of balancing locking in
> > AIO_WAIT_WHILE() and many functions in QEMU either require that the
> > AioContext lock is held or not held for this reason. In other words, the
> > AioContext lock is purely there for consistency with itself and serves
> > no real purpose anymore.
> > 
> > Stop actually acquiring/releasing the lock in
> > aio_context_acquire()/aio_context_release() so that subsequent patches
> > can remove callers across the codebase incrementally.
> > 
> > I have performed "make check" and qemu-iotests stress tests across
> > x86-64, ppc64le, and aarch64 to confirm that there are no failures as a
> > result of eliminating the lock.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>
> > Reviewed-by: Eric Blake <eblake@xxxxxxxxxx>
> > Acked-by: Kevin Wolf <kwolf@xxxxxxxxxx>
> 
> I knew why I wasn't confident enough to give a R-b... This crashes
> qemu-storage-daemon in the qemu-iotests case graph-changes-while-io.
> 
> qemu-storage-daemon: ../nbd/server.c:2542: nbd_co_receive_request: Assertion 
> `client->recv_coroutine == qemu_coroutine_self()' failed.
> 
> (gdb) bt
> #0  0x00007fdb00529884 in __pthread_kill_implementation () from 
> /lib64/libc.so.6
> #1  0x00007fdb004d8afe in raise () from /lib64/libc.so.6
> #2  0x00007fdb004c187f in abort () from /lib64/libc.so.6
> #3  0x00007fdb004c179b in __assert_fail_base.cold () from /lib64/libc.so.6
> #4  0x00007fdb004d1187 in __assert_fail () from /lib64/libc.so.6
> #5  0x0000557f9f9534eb in nbd_co_receive_request (errp=0x7fdafc25eec0, 
> request=0x7fdafc25ef10, req=0x7fdaf00159c0) at ../nbd/server.c:2542
> #6  nbd_trip (opaque=0x557fa0b33fa0) at ../nbd/server.c:2962
> #7  0x0000557f9faa416b in coroutine_trampoline (i0=<optimized out>, 
> i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #8  0x00007fdb004efe90 in ?? () from /lib64/libc.so.6
> #9  0x00007fdafc35f680 in ?? ()
> #10 0x0000000000000000 in ?? ()
> (gdb) p *client
> $2 = {refcount = 4, close_fn = 0x557f9f95dc40 <nbd_blockdev_client_closed>, 
> exp = 0x557fa0b30590, tlscreds = 0x0, tlsauthz = 0x0, sioc = 0x557fa0b33d90, 
> ioc = 0x557fa0b33d90,
>   recv_coroutine = 0x7fdaf0015eb0, send_lock = {locked = 0, ctx = 0x0, 
> from_push = {slh_first = 0x0}, to_pop = {slh_first = 0x0}, handoff = 0, 
> sequence = 0, holder = 0x0},
>   send_coroutine = 0x0, read_yielding = false, quiescing = false, next = 
> {tqe_next = 0x0, tqe_circ = {tql_next = 0x0, tql_prev = 0x557fa0b305e8}}, 
> nb_requests = 1, closing = false,
>   check_align = 1, mode = NBD_MODE_EXTENDED, contexts = {exp = 
> 0x557fa0b30590, count = 1, base_allocation = true, allocation_depth = false, 
> bitmaps = 0x0}, opt = 7, optlen = 0}
> (gdb) p co_tls_current
> $3 = (Coroutine *) 0x7fdaf00061d0

This one isn't easy to debug...

The first problem here is that two nbd_trip() coroutines are scheduled
in the same iothread, and creating the second one overwrites
client->recv_coroutine, which triggers the assertion in the first one.

This can be fixed by introducing a new mutex in NBDClient and taking it
in nbd_client_receive_next_request() so that there is no race between
checking client->recv_coroutine != NULL and setting it to a new
coroutine. (Not entirely sure why two different threads are doing this,
maybe the main thread reentering in drained_end and the iothread waiting
for the next request?)

However, I'm seeing new assertion failures when I do that:
client->quiescing isn't set in the -EAGAIN case in nbd_trip(). I haven't
really figured out yet where this comes from. Taking the new NBDClient
lock in the drain functions and in nbd_trip() doesn't seem to be enough
to fix it anyway. Or maybe I didn't quite find the right places to take
it.

I'm not sure if the list of NBD clients needs a lock, too, or if it is
only ever accessed in the main thread, but that should be unrelated to
the assertion failures I'm seeing.

Kevin




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.