
[Xen-devel] Fast inter-VM signaling using monitor/mwait


  • To: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: Michael Abd-El-Malek <mabdelmalek@xxxxxxx>
  • Date: Mon, 20 Apr 2009 12:32:31 -0400
  • Delivery-date: Fri, 01 May 2009 12:26:59 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

I've implemented a fast inter-VM signaling mechanism using the x86 monitor/mwait instructions. One-way event notification takes ~0.5us, compared to ~8us when using Xen's event channels. If there's interest in this code, I'm willing to clean it up and/or share it with others.
A little bit of background...  For my dissertation work, I'm enabling  
portable file system implementations by running a file system in a  
VM.  Small file system-agnostic modules in the kernel pass all VFS  
operations from the user OS (running user applications) to the file  
system VM (running the preferred OS for the file system).  In contrast  
to user-level file systems, my approach leverages unmodified file  
system implementations and provides better isolation for the FS from  
the myriad OSs that a user may be running.  I've implemented a unified  
buffer caching mechanism between VMs that requires very few changes
to the OSs: fewer than a dozen lines.  Additionally, we've
modified Xen's migration mechanism to support atomic migration of two  
VMs.  We currently have NetBSD and Linux (2.6.18 and 2.6.28) ports.   
I've implemented an IPC layer that's very similar to the one in the  
block and network PV drivers (i.e., uses shared memory for data  
transfer and event channels for signaling).
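To give a rough picture of the layout (the structure and field names
below are purely illustrative, not my actual code), the shared page
between the two VMs holds something like:

#include <stdint.h>

/* Illustrative single-slot channel shared between the user OS and the
 * file system VM via a granted page.  The real IPC layer looks much
 * more like the PV block/net rings; only the fields relevant to
 * signaling are shown here. */
struct fsvm_ipc_slot {
    volatile uint32_t req_prod;      /* bumped by the user OS when a VFS request is ready */
    volatile uint32_t rsp_prod;      /* bumped by the FS VM when its response is ready    */
    uint32_t          op;            /* VFS operation code                                */
    uint8_t           payload[4084]; /* marshalled arguments/results (page-sized slot)    */
};

The producer counters are what the signaling mechanism discussed below
has to make the other side notice quickly.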
Unfortunately, Xen's event channels were too slow for my purposes.   
For the remainder of this email, assume that each VM has a dedicated  
core -- I'm trying to optimize latency for this case.  The culprit is  
the overhead for context switching to the guest OS interrupt handler  
(~3.5us for x86_64 2.6.28) and another context switch to a worker  
thread (~3us).  In addition, there's a ~2us cost for making a "send  
event" hypercall; this includes the cost of a hypercall and for  
sending an x86 inter-process-interrupt (IPI).  Thus, a one-way event  
notification costs ~8us.  Thus, an IPC takes ~16us for a request and a  
response notification.  This cost hasn't been problematic for the  
block and network drivers primarily since the hardware access cost for  
the underlying operations is typically in the millisecond range.  An  
extra 16us is noise.
Our design goal of preserving file system semantics without modifying  
the file system necessitates that all VFS operations are sent to the  
file system VM.  In other words, there is no client caching.  Thus,  
there is a high frequency of IPCs among the VMs.  For example, we pass  
all in-cache data and metadata accesses, and permission checks and  
directory entry validation callbacks.  These VFS operations can often  
cost less than 1us.  Adding a 16us signaling cost is thus a big  
overhead, slowing macrobenchmarks by ~20%.
I implemented a polling mechanism that spins on a shared memory  
location to check for requests/responses.  Its performance overhead  
was minimal (<1us).  But it had an adverse effect on power consumption  
during idle time.  Fortunately, since the Pentium 4, x86 has
included two instructions for efficiently (power-wise) implementing  
this type of inter-processor polling.  A processor executes a monitor  
instruction with a memory address to be monitored, then executes an  
mwait instruction.  The mwait instruction returns when a write occurs  
to that memory location, or when an interrupt occurs.
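As a concrete sketch (assuming the code runs somewhere MWAIT is
permitted; in my case that is inside Xen, and the function names here
are illustrative), the wait side looks roughly like this:

#include <stdint.h>

/* Sketch of the wait side.  The loop re-checks the flag because MWAIT
 * can also return on interrupts and other wakeup events. */
static inline void monitor(const volatile void *addr)
{
    asm volatile("monitor" :: "a"(addr), "c"(0UL), "d"(0UL));
}

static inline void mwait(void)
{
    asm volatile("mwait" :: "a"(0UL), "c"(0UL));
}

static void wait_for_write(volatile uint32_t *flag, uint32_t old)
{
    while (*flag == old) {
        monitor(flag);       /* arm the monitor on the cache line holding *flag */
        if (*flag != old)    /* close the race between the check and mwait      */
            break;
        mwait();             /* low-power wait until a write or an interrupt    */
    }
}

The re-check between monitor and mwait handles the case where the other
VM writes the flag after the loop's load but before the monitor is
armed.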
The mwait instruction is privileged.  So I added a new hypercall that  
wraps access to the mwait instruction.  Thus, my code has a Xen  
component (the new hypercall) and a guest kernel component (code for  
executing the hypercall and for turning off/on the timer interrupts  
around the hypercall).  For this code to be merged into Xen, it would
need additional security checks and a check that the processor
supports this feature.
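In rough outline (the hypercall name and the timer helpers below are
made up for illustration), the guest side looks like:

#include <stdint.h>

/* Stand-ins for: suppressing the guest's periodic timer tick so it
 * does not keep waking the mwait, and the new "monitor this address
 * and mwait" hypercall added to Xen. */
void fsvm_stop_periodic_timer(void);
void fsvm_restart_periodic_timer(void);
long HYPERVISOR_fsvm_mwait(const volatile void *addr);

static void fsvm_wait_for_signal(volatile uint32_t *flag, uint32_t old)
{
    fsvm_stop_periodic_timer();
    while (*flag == old)
        HYPERVISOR_fsvm_mwait(flag);   /* returns on a write to *flag,
                                          or on any other wakeup event */
    fsvm_restart_periodic_timer();
}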
Are any folks interested in this code?  Would it make sense to  
integrate this into Xen?  I've implemented the guest code in Linux  
2.6.28, but I can easily port it to 2.6.30 or 2.6.18.  I'm also happy  
to provide my benchmarking code.
Cheers,
Mike

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

