Xen project Mailing List

Re: [MirageOS-devel] Xenstore_ring deadlocks in Mirage Xenstore

To: James Bielman <jamesjb@xxxxxxxxxx>

From: Dave Scott <Dave.Scott@xxxxxxxxxx>

Date: Thu, 10 Jul 2014 20:47:06 +0000

Accept-language: en-GB, en-US

Cc: "mirageos-devel@xxxxxxxxxxxxxxxxxxxx" <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Thu, 10 Jul 2014 20:47:36 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

Thread-index: AQHPnHx3Ul12mze37Eek0OXftSEgqJuZpbYA

Thread-topic: [MirageOS-devel] Xenstore_ring deadlocks in Mirage Xenstore

Hi James,

On 10 Jul 2014, at 21:20, James Bielman <jamesjb@xxxxxxxxxx> wrote:

> Hi all,
>
> I'm doing some testing of our Xenstore/Flask patches and I'm noticing an
> intermittent problem with the interdomain communication locking up.
>
> We are using "ocaml-xenstore-xen" forked from:
>
> https://github.com/djs55/ocaml-xenstore-xen
>
> into our "xenstore_mac" branch at:
>
> https://github.com/GaloisInc/ocaml-xenstore-xen
>
> I'm starting the Xenstore kernel with the "init-xenstore-domain" tool.
>
> Most of the time, this works perfectly, but occasionally Xenstore sees
> an event notification that data is available for reading, but the
> Xenstore ring contains no data (it seems to always contain all zeros).

Does the ring recover after this or has it become stuck? Does the problem affect other domain rings or is it just dom0?

>
> The problem doesn't seem to be with event signalling, because I can
> replace the "block" with a "sleep" and it repeatedly sleeps reading an
> empty queue.
>
> It feels like a race condition to me, perhaps related to the fact that
> dom0 is writing to the Xenstore ring before the Xenstore domain is
> unpaused (during IOCTL_XENBUS_BACKEND_SETUP).
>
> Has anyone encountered this issue before, or have any insight into
> tracking it down?

It should be ok for dom0 to write to the ring before the domain is unpaused, although in this situation I would worry more about events going missing than spurious extra events turning up :-) I can’t remember off the top of my head what the event channel bitmap is set to when the domain is built. It’s worth checking: perhaps the bitmap is being left undefined and sometimes contains a ‘1’ in the critical spot.

I’ve had problems before when I replaced the code which updates the consumer and producer pointers with a Cstruct.set_uint32. Internally this used ocplib-endian, which used a C function that executed unaligned byte-by-byte loads and stores when compiled with particular compiler versions (ocplib-endian conditionally compiles code depending on the compiler version). These byte-by-byte operations exposed intermediate/corrupted values to the client. Occasionally a pointer would be seen incrementing 0xff -> 0x1ff -> 0x100, causing the client to read 0xff extra bytes which were often zero. By the time I got in there with a diagnostic tool, the initial corruption was gone. It’s worth checking that your code has the fix (which was to use custom C stubs to perform regular loads and stores via uint32_t*).

If the notification is being seen too early (before the ring data has been updated), perhaps there’s a bug in the Linux xenstore client in dom0? I know of one other bug in the Linux client which causes watch events and watch responses to become permuted (especially if xenstored responds quickly). This manifests in libxenstore’s watch code failing.

Cheers,
Dave

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
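
To make the 0xff -> 0x1ff -> 0x100 sequence concrete, here is a minimal single-threaded sketch in C. It is not the ocplib-endian code or the actual ocaml-xenstore stubs; the names `record`, `reset_log`, `load_le32`, `store_bytewise` and `store_word` are hypothetical, and the most-significant-byte-first store order is an assumption chosen because it reproduces the intermediate value described above. A "reader" is simulated by logging the value visible in memory after every store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log of the distinct values a concurrent reader could observe. */
#define MAX_SEEN 8
static uint32_t seen[MAX_SEEN];
static size_t n_seen;

static void reset_log(void) { n_seen = 0; }

/* Record a value, skipping consecutive duplicates. */
static void record(uint32_t v) {
    if (n_seen > 0 && seen[n_seen - 1] == v)
        return;
    assert(n_seen < MAX_SEEN);
    seen[n_seen++] = v;
}

/* Decode the little-endian 32-bit index currently held in the four bytes. */
static uint32_t load_le32(const volatile uint8_t *b) {
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Byte-by-byte store, most significant byte first (assumed order).
 * After each byte the partially updated index is visible to a reader. */
static void store_bytewise(volatile uint8_t *b, uint32_t v) {
    for (int i = 3; i >= 0; i--) {
        b[i] = (uint8_t)(v >> (8 * i));
        record(load_le32(b));
    }
}

/* One aligned 32-bit store. On common hardware this is not torn, although
 * plain C does not promise atomicity; real ring code also needs barriers. */
static void store_word(volatile uint32_t *p, uint32_t v) {
    *p = v;
    record(*p);
}
```

Starting from an index of 0xff and storing 0x100 with `store_bytewise` logs 0xff, 0x1ff, 0x100: a reader that samples the index at the wrong moment believes 0xff extra bytes are available. The same update through `store_word` logs only 0xff and 0x100.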
