
[Xen-devel] xen-block: race condition when stopping the device (WAS: Re: [xen-4.13-testing test] 144736: regressions - FAIL)



Hi Paul,

On 13/12/2019 15:55, Durrant, Paul wrote:
-----Original Message-----
From: Xen-devel <xen-devel-bounces@xxxxxxxxxxxxxxxxxxxx> On Behalf Of
Julien Grall
Sent: 13 December 2019 15:37
To: Ian Jackson <ian.jackson@xxxxxxxxxx>
Cc: Jürgen Groß <jgross@xxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx; Stefano
Stabellini <sstabellini@xxxxxxxxxx>; osstest service owner <osstest-
admin@xxxxxxxxxxxxxx>; Anthony Perard <anthony.perard@xxxxxxxxxx>
Subject: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions -
FAIL

+Anthony

On 13/12/2019 11:40, Ian Jackson wrote:
Julien Grall writes ("Re: [Xen-devel] [xen-4.13-testing test] 144736:
regressions - FAIL"):
AMD Seattle boards (laxton*) are known to fail booting from time to time
because of a PCI training issue. We have a workaround for it (involving
a longer power cycle) but this is not 100% reliable.

This wasn't a power cycle.  It was a software-initiated reboot.  It
does appear to hang in the firmware somewhere.  Do we expect the PCI
training issue to occur in this case?

The PCI training happens at every reset (including software resets). So I
may have confused the workaround for firmware corruption with the PCI
training. We definitely have a workaround for the former.

For the latter, I can't remember whether we switched to a new firmware or
just hoped it would not happen often.

I think we had a thread on infra@ about the workaround some time last
year. Sadly this was sent to my Arm e-mail address and I didn't archive
it before leaving :(. Could you check whether you can find the thread?


    test-armhf-armhf-xl-vhd      18 leak-check/check         fail REGR. vs. 144673

That one is strange. A qemu process seems to have died producing
a core file, but I couldn't find any log containing any other
indication of a crashed program.

I haven't found anything interesting in the log. @Ian could you set up
a repro for this?

There is some heisenbug where qemu crashes with very low probability.
(I forget whether only on arm or on x86 too).  This has been around
for a little while.  I doubt this particular failure will be
reproducible.

I can't remember such a bug being reported on Arm before. Anyway, I
managed to get the stack trace from gdb:

Core was generated by `/usr/local/lib/xen/bin/qemu-system-i386 -xen-domid 1 -chardev socket,id=libxl-c'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:531
531     /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c: No such file or directory.
[Current thread is 1 (LWP 1987)]
(gdb) bt
#0  0x006342be in xen_block_handle_requests (dataplane=0x108e600) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:531
#1  0x0063447c in xen_block_dataplane_event (opaque=0x108e600) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/block/dataplane/xen-block.c:626
#2  0x008d005c in xen_device_poll (opaque=0x107a3b0) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/hw/xen/xen-bus.c:1077
#3  0x00a4175c in run_poll_handlers_once (ctx=0x1079708, timeout=0xb1ba17f8) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:520
#4  0x00a41826 in run_poll_handlers (ctx=0x1079708, max_ns=8000, timeout=0xb1ba17f8) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:562
#5  0x00a41956 in try_poll_mode (ctx=0x1079708, timeout=0xb1ba17f8) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:597
#6  0x00a41a2c in aio_poll (ctx=0x1079708, blocking=true) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/aio-posix.c:639
#7  0x0071dc16 in iothread_run (opaque=0x107d328) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/iothread.c:75
#8  0x00a44c80 in qemu_thread_start (args=0x1079538) at /home/osstest/build.144736.build-armhf/xen/tools/qemu-xen-dir/util/qemu-thread-posix.c:502
#9  0xb67ae5d8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

This feels like a race condition between the init/free code and the
handler. Anthony, does it ring a bell?


From that backtrace it looks like an iothread managed to run after the sring
was NULLed. This should not be able to happen, as the dataplane should have been
moved back onto QEMU's main thread context before the ring is unmapped.

My knowledge of this code is fairly limited, so correct me if I am wrong.

blk_set_aio_context() sets the context for the block aio. AFAICT, the only aio for the block is xen_block_complete_aio().

In the stack above, we are not dealing with a block aio but with an aio tied to the event channel (see the call from xen_device_poll). So I don't think blk_set_aio_context() would affect that aio.

So it would be possible for the iothread to run because we received a notification on the event channel while we are stopping the block device (i.e. in xen_block_dataplane_stop()).

If xen_block_dataplane_stop() grabs the context lock first, then the iothread dealing with the event may wait on the lock until it is released.

By the time the iothread acquires the lock, we may have freed all the resources (including the srings). So the event iothread will end up dereferencing a NULL pointer.

It feels to me that we need a way to quiesce all the iothreads (blk, event, ...) before continuing. But I am a bit unsure how to do this in QEMU.
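
To illustrate the ordering I have in mind, here is a minimal standalone sketch using plain pthreads rather than QEMU's AioContext API. Every name in it (struct dataplane, dataplane_start(), dataplane_stop(), ...) is made up for the example and is not taken from xen-block.c. The stop path first stops and joins the thread running the event handler, and only then frees the ring, so the handler can never see a NULL sring.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from QEMU. */
struct ring {
    int req_prod;
};

struct dataplane {
    pthread_mutex_t lock;   /* plays the role of the context lock */
    struct ring *sring;     /* shared ring, freed when stopping */
    atomic_bool running;    /* cleared to ask the poller to exit */
    atomic_long handled;    /* number of events handled so far */
    pthread_t poller;       /* plays the role of the event iothread */
};

/* Event handler: dereferences the ring while holding the lock. */
static void handle_requests(struct dataplane *dp)
{
    pthread_mutex_lock(&dp->lock);
    assert(dp->sring != NULL);  /* holds only if stop quiesces us first */
    dp->sring->req_prod++;
    pthread_mutex_unlock(&dp->lock);
    atomic_fetch_add(&dp->handled, 1);
}

/* Poller thread: keeps handling events until asked to stop. */
static void *poller_run(void *opaque)
{
    struct dataplane *dp = opaque;

    while (atomic_load(&dp->running)) {
        handle_requests(dp);
    }
    return NULL;
}

static int dataplane_start(struct dataplane *dp)
{
    dp->sring = calloc(1, sizeof(*dp->sring));
    if (dp->sring == NULL) {
        return -1;
    }
    pthread_mutex_init(&dp->lock, NULL);
    atomic_init(&dp->running, true);
    atomic_init(&dp->handled, 0);
    if (pthread_create(&dp->poller, NULL, poller_run, dp) != 0) {
        pthread_mutex_destroy(&dp->lock);
        free(dp->sring);
        dp->sring = NULL;
        return -1;
    }
    return 0;
}

static void dataplane_stop(struct dataplane *dp)
{
    /* Step 1: quiesce. After the join, the handler cannot run again. */
    atomic_store(&dp->running, false);
    pthread_join(dp->poller, NULL);

    /* Step 2: only now is it safe to release the ring. */
    free(dp->sring);
    dp->sring = NULL;
    pthread_mutex_destroy(&dp->lock);
}
```

Swapping the two steps in dataplane_stop() (freeing the ring before the join) gives the failure mode described above: the poller can take the lock after the ring is gone. Whether the equivalent in QEMU is unregistering the event channel handler, draining the context, or something else is exactly the part I am unsure about.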

Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

