
[qemu-xen staging] block/nbd: allow drain during reconnect attempt



commit dd1ec1a4afe190e030edfa052d95c9e6e065438c
Author:     Vladimir Sementsov-Ogievskiy <vsementsov@xxxxxxxxxxxxx>
AuthorDate: Mon Jul 27 21:47:48 2020 +0300
Commit:     Eric Blake <eblake@xxxxxxxxxx>
CommitDate: Tue Jul 28 09:54:43 2020 -0500

    block/nbd: allow drain during reconnect attempt
    
    It should be safe to reenter qio_channel_yield() on the io/channel
    read/write path, so it is safe to reduce in_flight and allow attaching a
    new aio context. Allowing the drain itself is also not a problem: a
    connection attempt is not a guest request. Moreover, if the remote server
    is down, we can hang in negotiation, blocking the drain section and
    provoking a deadlock.
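
    In code terms, the change boils down to the following pattern around the
    handshake in nbd_reconnect_attempt() (a condensed sketch of the diff
    below; error handling and surrounding code omitted):

        bdrv_dec_in_flight(s->bs);        /* let a pending drain complete */

        ret = nbd_client_handshake(s->bs, sioc, &local_err);

        if (s->drained) {
            s->wait_drained_end = true;
            while (s->drained) {
                /* re-entered from the aio-context attach BH and/or drain end */
                qemu_coroutine_yield();
            }
        }
        bdrv_inc_in_flight(s->bs);        /* resume normal request accounting */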
    
    How to reproduce the deadlock:
    
    1. Create nbd-fault-injector.conf with the following contents:
    
    [inject-error "mega1"]
    event=data
    io=readwrite
    when=before
    
    2. In one terminal run nbd-fault-injector in a loop, like this:
    
    n=1; while true; do
        echo $n; ((n++));
        ./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
    done
    
    3. In another terminal run qemu-io in a loop, like this:
    
    n=1; while true; do
        echo $n; ((n++));
        ./qemu-io -c 'read 0 512' nbd://127.0.0.1:10000;
    done
    
    After some time, qemu-io will hang while trying to drain, for example
    like this:
    
     #3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
        util/aio-posix.c:600
     #4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
        parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
     #5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
     #6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
     #7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
     #8 bdrv_open_inherit (filename=0x7fffba1563bc
        "nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
        flags=24578, parent=0x0, child_class=0x0, child_role=0,
        errp=0x7fffba154620) at block.c:3491
     #9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
        reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
        block.c:3513
     #10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
        reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
        block/block-backend.c:421
    
    And the connection_co stack looks like this:
    
     #0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
        action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
     #1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
     #2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
        io/channel.c:472
     #3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
        niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
     #4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
        niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
     #5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
        "\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
        io/channel.c:247
     #6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
        desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
        /work/src/qemu/master/include/block/nbd.h:365
     #7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
        desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
        /work/src/qemu/master/include/block/nbd.h:391
     #8 nbd_start_negotiate (aio_context=0x55f006bdd890,
        ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
        outioc=0x55f006bf19f8, structured_reply=true,
        zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
     #9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
        ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
        outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
        nbd/client.c:1032
     #10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
        block/nbd.c:1460
     #11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
     #12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
     #13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
     #14 coroutine_trampoline (i0=113190480, i1=22000) at
        util/coroutine-ucontext.c:173
    
    Note that the hang may be triggered by another bug, so the whole case is
    only fixed together with the commit "block/nbd: on shutdown terminate
    connection attempt".
    
    Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@xxxxxxxxxxxxx>
    Message-Id: <20200727184751.15704-3-vsementsov@xxxxxxxxxxxxx>
    Reviewed-by: Eric Blake <eblake@xxxxxxxxxx>
    Signed-off-by: Eric Blake <eblake@xxxxxxxxxx>
---
 block/nbd.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 3558c173e3..ee9ab7512b 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -291,8 +291,22 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
         goto out;
     }
 
+    bdrv_dec_in_flight(s->bs);
+
     ret = nbd_client_handshake(s->bs, sioc, &local_err);
 
+    if (s->drained) {
+        s->wait_drained_end = true;
+        while (s->drained) {
+            /*
+             * We may be entered once from nbd_client_attach_aio_context_bh
+             * and then from nbd_client_co_drain_end. So here is a loop.
+             */
+            qemu_coroutine_yield();
+        }
+    }
+    bdrv_inc_in_flight(s->bs);
+
 out:
     s->connect_status = ret;
     error_free(s->connect_err);
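
For reference, and not part of this patch: the s->drained and
s->wait_drained_end fields used above are managed by the NBD driver's
existing drain callbacks. A rough sketch, paraphrased from the block/nbd.c
of this period (names and details may differ slightly):

    static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
    {
        BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

        s->drained = true;
        /* Wake connection_co if it is sleeping between reconnect attempts. */
        if (s->connection_co_sleep_ns_state) {
            qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
        }
    }

    static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
    {
        BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

        s->drained = false;
        if (s->wait_drained_end) {
            s->wait_drained_end = false;
            /* Re-enter the coroutine that yielded in nbd_reconnect_attempt(). */
            aio_co_wake(s->connection_co);
        }
    }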
--
generated by git-patchbot for /home/xen/git/qemu-xen.git#staging



 

