
Re: [MirageOS-devel] Xenstore_ring deadlocks in Mirage Xenstore



Hi James,

On 10 Jul 2014, at 21:20, James Bielman <jamesjb@xxxxxxxxxx> wrote:

> Hi all,
> 
> I'm doing some testing of our Xenstore/Flask patches and I'm noticing an
> intermittent problem with the interdomain communication locking up.
> 
> We are using "ocaml-xenstore-xen" forked from:
> 
>  https://github.com/djs55/ocaml-xenstore-xen
> 
> into our "xenstore_mac" branch at:
> 
>  https://github.com/GaloisInc/ocaml-xenstore-xen
> 
> I'm starting the Xenstore kernel with the "init-xenstore-domain" tool.
> 
> Most of the time, this works perfectly, but occasionally Xenstore sees
> an event notification that data is available for reading, but the
> Xenstore ring contains no data (it seems to always contain all zeros).

Does the ring recover after this or has it become stuck? Does the problem 
affect other domain rings or is it just dom0?

> 
> The problem doesn't seem to be with event signalling, because I can
> replace the "block" with a "sleep" and it repeatedly sleeps reading an
> empty queue.
> 
> It feels like a race condition to me, perhaps related to the fact that
> dom0 is writing to the Xenstore ring before the Xenstore domain is
> unpaused (during IOCTL_XENBUS_BACKEND_SETUP).
> 
> Has anyone encountered this issue before, or have any insight into
> tracking it down?

It should be OK for dom0 to write to the ring before the domain is unpaused, 
although in this situation I would worry more about events going missing than 
spurious extra events turning up :-) I can’t remember off the top of my head 
what the event channel bitmap is set to when the domain is built. It's worth 
checking: perhaps the bitmap is being left undefined and sometimes contains a 
‘1’ in the critical spot.

I’ve had problems before when I replaced the code which updates the consumer 
and producer pointers with a Cstruct.set_uint32. Internally this used 
ocplib-endian, whose C function executed unaligned byte-by-byte loads and 
stores when compiled with particular compiler versions (ocplib-endian 
conditionally compiles code depending on the compiler version). These 
byte-by-byte operations exposed intermediate/corrupted values to the client. 
Occasionally a pointer would be seen incrementing 0xff -> 0x1ff -> 0x100, 
causing the client to read 0xff extra bytes which were often zero. By the time 
I got in there with a diagnostic tool, the initial corruption was gone. It’s 
worth checking that your code has the fix (which was to use custom C stubs to 
perform regular loads and stores via uint32_t*).
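
To make the failure mode concrete, here is a minimal C sketch. It is not the 
actual ocplib-endian or ocaml-xenstore code: the function names are made up 
for illustration, and the byte order of the stores (most significant byte 
first) is an assumption chosen to reproduce the 0xff -> 0x1ff -> 0x100 
sequence. The last function shows the shape of the fix, a single aligned 
store through a uint32_t*; it leaves out the memory barriers a real ring 
implementation also needs.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a little-endian 32-bit value from four bytes, i.e. what a reader
 * on the other side of the ring would see at this instant. */
static uint32_t load_le32(const volatile uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical byte-by-byte store, most significant byte first.
 * seen[k] records the value a concurrent reader could observe after
 * the k-th byte has been written. */
void store_bytewise(volatile uint8_t *p, uint32_t v, uint32_t seen[4])
{
    for (int i = 3; i >= 0; i--) {
        p[i] = (uint8_t)(v >> (8 * i));
        seen[3 - i] = load_le32(p);
    }
}

/* Shape of the fix: one regular 32-bit store through an aligned pointer,
 * so no partially updated value is ever visible in memory. */
void store_word(volatile uint32_t *p, uint32_t v)
{
    *p = v;
}

/* Advance a pointer field from 0xff to 0x100 both ways and check what
 * an observer could have seen. */
void self_check(void)
{
    volatile uint8_t field[4] = { 0xff, 0x00, 0x00, 0x00 };
    uint32_t seen[4];

    store_bytewise(field, 0x100, seen);
    assert(seen[2] == 0x1ff);   /* torn intermediate value */
    assert(seen[3] == 0x100);   /* final value */

    volatile uint32_t word = 0xff;
    store_word(&word, 0x100);
    assert(word == 0x100);
}
```

A reader that samples the pointer while it holds the torn value 0x1ff 
believes there are 0xff more bytes available than were actually written, 
which matches the "event arrived but the ring is all zeros" symptom.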

If the notification is being seen too early (before the ring data has been 
updated), perhaps there’s a bug in the Linux xenstore client in dom0? I know of 
one other bug in the Linux client which causes watch events and watch responses 
to become permuted (especially if xenstored responds quickly). This manifests 
in libxenstore’s watch code failing.

Cheers,
Dave




 

