Re: [MirageOS-devel] Xenstore_ring deadlocks in Mirage Xenstore
Hi James,

On 10 Jul 2014, at 21:20, James Bielman <jamesjb@xxxxxxxxxx> wrote:

> Hi all,
>
> I'm doing some testing of our Xenstore/Flask patches and I'm noticing an
> intermittent problem with the interdomain communication locking up.
>
> We are using "ocaml-xenstore-xen" forked from:
>
>   https://github.com/djs55/ocaml-xenstore-xen
>
> into our "xenstore_mac" branch at:
>
>   https://github.com/GaloisInc/ocaml-xenstore-xen
>
> I'm starting the Xenstore kernel with the "init-xenstore-domain" tool.
>
> Most of the time, this works perfectly, but occasionally Xenstore sees
> an event notification that data is available for reading, but the
> Xenstore ring contains no data (it seems to always contain all zeros).

Does the ring recover after this or has it become stuck? Does the problem affect other domain rings or is it just dom0?

> The problem doesn't seem to be with event signalling, because I can
> replace the "block" with a "sleep" and it repeatedly sleeps reading an
> empty queue.
>
> It feels like a race condition to me, perhaps related to the fact that
> dom0 is writing to the Xenstore ring before the Xenstore domain is
> unpaused (during IOCTL_XENBUS_BACKEND_SETUP).
>
> Has anyone encountered this issue before, or have any insight into
> tracking it down?

It should be ok for dom0 to write to the ring before the domain is unpaused, although in this situation I would worry more about events going missing than spurious extra events turning up :-) I can’t remember off the top of my head what the event channel bitmap is set to when the domain is built. It's worth checking: perhaps the bitmap is being left undefined and sometimes contains a ‘1’ in the critical spot.

I’ve had problems before when I replaced the code which updates the consumer and producer pointers with a call to Cstruct.set_uint32. Internally this used ocplib-endian, which used a C function that executed unaligned byte-by-byte loads and stores when compiled with particular compiler versions (ocplib-endian conditionally compiles code depending on the compiler version). These byte-by-byte operations exposed intermediate/corrupted values to the client. Occasionally a pointer would be seen incrementing 0xff -> 0x1ff -> 0x100, causing the client to read 0xff extra bytes which were often zero. By the time I got in there with a diagnostic tool, the initial corruption was gone. It’s worth checking that your code has the fix (which was to use custom C stubs to perform regular loads and stores via uint32_t*).

If the notification is being seen too early (before the ring data has been updated), perhaps there’s a bug in the Linux xenstore client in dom0? I know of one other bug in the Linux client which causes watch events and watch responses to become permuted (especially if xenstored responds quickly). This manifests in libxenstore’s watch code failing.

Cheers,
Dave
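
To illustrate the torn-pointer problem described above, here is a minimal C sketch. It is not the actual ocplib-endian code or the actual fix from the stubs; the function names (peek_le32, store_bytewise, store_word) are hypothetical, and the high-byte-first store order is an assumption chosen because it reproduces the 0xff -> 0x1ff -> 0x100 sequence mentioned in the message. The seen[] array stands in for what a concurrent reader on the other side of the ring could observe after each individual byte store.

```c
#include <assert.h>
#include <stdint.h>

/* Read four bytes as a little-endian 32-bit value: what a peer would see
   if it loaded the index at this instant. */
static uint32_t peek_le32(const volatile uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Problematic pattern: four separate byte stores, most significant byte
   first. seen[n] records the value visible after the n-th store. */
static void store_bytewise(volatile uint8_t *p, uint32_t v, uint32_t seen[4])
{
    for (int i = 3, n = 0; i >= 0; i--, n++) {
        p[i] = (uint8_t)(v >> (8 * i));
        seen[n] = peek_le32(p);
    }
}

/* Pattern described as the fix: a single aligned 32-bit store through a
   uint32_t pointer, so no partially updated value is written. */
static void store_word(volatile uint32_t *p, uint32_t v)
{
    assert(((uintptr_t)p % sizeof(uint32_t)) == 0); /* must be aligned */
    *p = v;
}
```

Updating an index from 0xff to 0x100 with store_bytewise leaves 0x1ff visible after the third byte store, which is the bogus value a client could pick up and treat as 0xff extra bytes of data. With store_word the index goes from 0xff to 0x100 in one store. Real ring code also needs memory barriers around the index update, which this sketch leaves out.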