Re: [Xen-devel] Issues with the xen-access.c model
On 04/05/2016 07:10 PM, Andrew Cooper wrote:
> On 05/04/16 16:55, Razvan Cojocaru wrote:
>> On 04/05/16 18:35, Tamas K Lengyel wrote:
>>>
>>> On Tue, Apr 5, 2016 at 6:13 AM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx
>>> <mailto:andrew.cooper3@xxxxxxxxxx>> wrote:
>>>
>>> On 05/04/16 11:55, Razvan Cojocaru wrote:
>>> > Hello,
>>> >
>>> > xen-access.c does roughly this:
>>> >
>>> > for (;;) {
>>> >     poll_evt_channel();
>>> >
>>> >     if (new_events) {
>>> >         while (ring_buffer_has_requests) {
>>> >             pull_request();
>>> >             process_request();
>>> >             put_response();
>>> >         }
>>> >
>>> >         signal_evt_channel();
>>> >     }
>>> > }
>>> >
>>> > The problems are these:
>>> >
>>> > 1. vm_event_put_request() does notify_via_xen_event_channel(d, ved->xen_port); at the very end, and we seem to be using a FIFO event channel (if I'm reading the code right, it's generic event channel code that pre-dates the vm_event code that sits on top of it).
>>> >
>>> > This means that for a guest that's running two VCPUs, the following scenario is possible:
>>> >
>>> > * VCPU1 puts an event in the ring buffer
>>> > * VCPU1 signals the event channel
>>> > * poll_evt_channel() wakes up
>>> > * VCPU2 puts an event in the ring buffer
>>> > * the loop processes the first event
>>> > * VCPU2 signals the event channel
>>> > * the loop processes the second event (since there _are_ more requests in the ring buffer)
>>> > * the loop signals the event channel that requests have been processed
>>> > * poll_evt_channel() wakes up again, because of the second event channel signal, but this is pointless since all events have already been processed
>>> >
>>> > This can be avoided by counting the requests processed vs. the number of successful poll()s, but it still does not solve the second problem:
>>> >
>>> > 2. Race conditions. At least in theory, there's nothing to stop the second iteration of the request processing loop from reading a partially written request from the ring buffer, or garbage left over from a previous request, if it hits it "just right" (i.e. the "VCPU2 puts an event in the ring buffer" step can run on top of "the loop processes the second event").
>>> >
>>> > All of this doesn't really show up unless we're in heavy processing scenarios, but it can occur, especially in synced vm_event scenarios.
>>> >
>>> > The problems seem to be alleviated by changing the loop to this:
>>> >
>>> > for (;;) {
>>> >     poll_evt_channel();
>>> >
>>> >     if (new_events) {
>>> >         while (ring_buffer_has_requests) {
>>> >             pull_request();
>>> >             process_request();
>>> >             put_response();
>>> >
>>> >             /* Moved here. */
>>> >             signal_evt_channel();
>>>
>>>
>>> That was the original setup; I changed it a while back by moving the signal outside the loop. My reasoning was that there is no point in going lockstep with events and issuing a hypercall after each one is processed. With only a handful of events on the ring, the context switch is probably more taxing than having all requests and responses processed in one sweep. Of course, depending on the application this may not be the optimal case, and the signal can be moved back into the loop (with lots of vCPUs it may be the case that the first vCPU remains paused while all events are processed for the other vCPUs, which of course would be bad).
>>>
>>>
>>> >         }
>>> >     }
>>> > }
>>> >
>>> > but it's obviously not foolproof, the main problem remaining the synchronization of the ring buffer from both the hypervisor side and userspace.
>>> >
>>> > How should we proceed to fix this?
>>>
>>> The vm_event code in Xen must increment the producer index as the final action of putting a request on the ring. If it doesn't, then it is simply broken and needs fixing.
>>>
>>> This will ensure that the consumer never sees a partial request on the ring; by the time it observes a newer producer index, all data is already written.
>>>
>>>
>>> It may be worth double-checking, but I'm pretty sure that's how it works right now, as that part of the code has pretty much just been recycled from the other Xen ring implementations.
>> I understand. We shouldn't forget that the consumer also modifies the ring in userspace, and the hypervisor then becomes a consumer of responses, and the same issues apply to it (would compiler optimizations be a problem here?).
>
> Quite possibly. See XSA-155.
>
>> But unless that's the case, then I'm probably seeing a different issue and Andrew is right.
>>
>> Thank you both for the prompt replies!
>
> You need to make certain that for both producer/consumer pairs, the following is true:
>
> The producer and consumer indices are read using some form of atomic operation (ACCESS_ONCE() most likely, or read_atomic()), followed by an smp_rmb() to guarantee that the read completes and is in a local variable before any calculation is performed based on the value. (This prevents the compiler "optimising" to multiple reads of the shared ring.)
>
> Then, memcpy() (or a suitable alternative) to get the contents of the request in/out of the ring.
>
> For the producer side, an smp_wmb() between the memcpy() into the ring and the write to the producer index.
> For the consumer side, an smp_mb() between the memcpy() out of the ring and the write to the consumer index.

Did all of that, to no avail - but the issue may indeed be somewhere else.

What I'm seeing is that on Windows 7 x86 boot, at the point of loading some drivers (most often around usbport.sys), if I get page faults in quick succession and try to emulate the offending instruction with code similar to the xen-access.c "outside-the-loop resume" model, the domU appears to freeze (although it's still running and nothing crashes). x64 guests seem unaffected by this, and x86 guests almost never hang if I put the resume code inside the loop.

This behaviour is always accompanied by qemu-dm going to 100% CPU for a few minutes. It eventually subsides, but the guest remains unusable, although it's not dead.

This goes away if I just run the instruction (with a single-step-like model) inside the hypervisor instead of trying to emulate it, this being the only change. Unfortunately this is not trivial to demonstrate with small modifications to the Xen in-tree code (see the CLI emulation issue, for example).

Would trying to emulate an instruction from a driver working with qemu cause this? No emulation failures are reported, but the attempt apparently leads to a(n IO?) loop of some kind.

Thanks,
Razvan
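To make the ordering rules Andrew describes above concrete, here is a minimal sketch of a single-producer/single-consumer ring that follows them: atomic reads of both indices, a read barrier before using them, memcpy() of the payload, and a write/full barrier before publishing the updated index. This is an illustration only, not the actual vm_event or xen-access.c code; the demo_ring layout, the demo_put_request()/demo_get_request() helpers and the barrier macros are stand-ins for the real Xen ring definitions and whichever smp_rmb()/smp_wmb()/smp_mb() primitives the environment provides.

    /*
     * Minimal sketch only: demo_ring, demo_req_t and the barrier macros are
     * illustrative stand-ins, not the vm_event/xen-access.c implementation.
     */
    #include <stdint.h>
    #include <string.h>

    #define DEMO_RING_SIZE 64        /* assumed power of two */
    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))
    #define demo_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
    #define demo_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
    #define demo_mb()  __atomic_thread_fence(__ATOMIC_SEQ_CST)

    typedef struct { uint8_t data[64]; } demo_req_t;

    struct demo_ring {
        uint32_t prod;               /* written by the producer only */
        uint32_t cons;               /* written by the consumer only */
        demo_req_t slot[DEMO_RING_SIZE];
    };

    /* Producer: copy the request into the ring, then publish the index. */
    static int demo_put_request(struct demo_ring *r, const demo_req_t *req)
    {
        uint32_t prod = ACCESS_ONCE(r->prod);
        uint32_t cons = ACCESS_ONCE(r->cons);

        demo_rmb();                  /* indices in locals before any use */

        if ( prod - cons == DEMO_RING_SIZE )
            return -1;               /* ring full */

        memcpy(&r->slot[prod % DEMO_RING_SIZE], req, sizeof(*req));
        demo_wmb();                  /* payload visible before the new prod */
        ACCESS_ONCE(r->prod) = prod + 1;
        return 0;
    }

    /* Consumer: copy the request out, then release the slot. */
    static int demo_get_request(struct demo_ring *r, demo_req_t *req)
    {
        uint32_t prod = ACCESS_ONCE(r->prod);
        uint32_t cons = ACCESS_ONCE(r->cons);

        demo_rmb();                  /* don't read the slot before seeing prod */

        if ( prod == cons )
            return -1;               /* ring empty */

        memcpy(req, &r->slot[cons % DEMO_RING_SIZE], sizeof(*req));
        demo_mb();                   /* copy finished before the slot is reused */
        ACCESS_ONCE(r->cons) = cons + 1;
        return 0;
    }

The point is the pairing: the producer's write barrier before bumping prod means the consumer can never see a partially written request, and the consumer's full barrier before bumping cons means a slot is never handed back for reuse while it is still being copied out.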