[Xen-devel] [RFC] Erratic mouse in HVM guest

  • To: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Ross Maxfield" <rmaxfiel@xxxxxxxxxx>
  • Date: Wed, 28 Jun 2006 16:42:05 -0600
  • Delivery-date: Wed, 28 Jun 2006 15:42:56 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

To whom it may concern,

For many months, some of us at Novell working on and testing Xen have contended 
with chaotic mouse behavior in HVM Linux guests.  This ill-mannered mouse, 
however, appears to be sensitive to certain hardware.  Although I have seen the 
mouse jump around the screen occasionally on diverse machines, I see it 
continuously on the Harwich Twin Castle Paxville (3GHz, 8GB, x86_64, 8 way 
dual-core).  On that machine the mouse is completely unusable in the guest: the 
slightest mouse event produces wild results, either erratic mouse movement or 
spurious button presses.

Bug 167187, “Erratic mouse behavior with HVM Linux guest and SDL”, was entered 
into Novell's Bugzilla on April 17th, 2006, and Intel was informed of the issue.  
Because Novell's first release of Xen with SLES offers full support for 
para-virtualized guests, this HVM guest issue was put aside until recently, 
when I began to explore the cause of the mouse problem.  Here's what I've 
learned.

First, the mouse behaves erratically because the data coming out of 
/dev/input/mice is out of order.  This was rather perplexing because I had been 
able to determine that qemu was delivering the data in the proper order and, in 
fact, i8042_interrupt() of linux-2.6.16/drivers/input/serio/i8042.c executing 
in the HVM guest was also reporting that the data had been read in the proper 
order, yet the data was processed out of order.
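
To illustrate why out-of-order bytes show up as both wild movement and phantom 
button presses, here is a minimal sketch of decoding the standard 3-byte PS/2 
mouse packet (flags, X, Y).  It assumes the guest mouse is using the basic 
3-byte protocol; struct ps2_event and ps2_decode() are hypothetical names for 
this illustration, not kernel code.

```c
#include <stdint.h>

/* Decoded form of one 3-byte PS/2 mouse packet (hypothetical helper type). */
struct ps2_event {
    int left, right, middle;  /* button states, 0 or 1 */
    int dx, dy;               /* signed relative movement */
};

/* Decode a standard 3-byte PS/2 packet: byte 0 = flags, byte 1 = X, byte 2 = Y. */
static struct ps2_event ps2_decode(const uint8_t p[3])
{
    struct ps2_event e;

    e.left   = (p[0] >> 0) & 1;
    e.right  = (p[0] >> 1) & 1;
    e.middle = (p[0] >> 2) & 1;

    /* Bits 4 and 5 of the flags byte are the sign bits of X and Y. */
    e.dx = (int)p[1] - ((p[0] & 0x10) ? 256 : 0);
    e.dy = (int)p[2] - ((p[0] & 0x20) ? 256 : 0);
    return e;
}
```

With the bytes 0x08 0x02 0x01 in order, this decodes to no buttons and a small 
move (dx = 2, dy = 1).  If the first two bytes trade places (0x02 0x08 0x01), 
the same data decodes as a right-button press with dx = 8.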

After exploring a number of possible causes for this behavior I discovered an 
assumption in the kernel code that is true when the kernel is running natively 
but not necessarily true when hosted by the hypervisor.

I learned that i8042_interrupt() will be polled by the timer interrupt if 
HZ/20 jiffies have elapsed since the last 8042 interrupt.  So here's what I 
believe is happening.  Each mouse event generates at least three bytes of data, 
and each byte of data generates an interrupt.  When the first interrupt is 
injected into the guest, the kernel, as it does for every interrupt, masks the 
interrupt vector in the PIC and then EOIs the PIC before actually handling the 
interrupt.  This, of course, allows ANY other interrupt to occur save the one 
currently being serviced.  When i8042_interrupt() is called, it first calls 
mod_timer() to delay the timer callback another HZ/20, takes a 
spin_lock_irqsave() disabling interrupts (interrupts are enabled prior to 
i8042_interrupt() being called), reads the 8042 obtaining the first byte of 
data from qemu, and then releases the spinlock.  Immediately after releasing 
the spinlock, this isr is interrupted by a timer interrupt, which discovers 
that the 8042's HZ/20 timer has expired.  i8042_interrupt() is reentered and 
runs to completion, as there is not a pending timer interrupt.  When the timer 
interrupt completes, the previously interrupted isr resumes and continues to 
process what was to be the first byte but now is not.  I have been able to 
determine that the timer is indeed calling i8042_interrupt() and causing the 
mis-ordered data.
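
The sequence described above can be modeled in user space.  The following is 
only a sketch of the suspected reentrancy, not the kernel code: fake_8042, 
delivery_log, deliver() and fake_isr() are hypothetical names, deliver() stands 
in for the hand-off to the input layer, and the timer poll is simulated as a 
nested call made after the byte has been read but before it is delivered.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BYTES 16

/* Hypothetical stand-in for the emulated 8042 output buffer. */
struct fake_8042 {
    const uint8_t *data;  /* bytes queued by the device model */
    size_t len, next;     /* total bytes and index of the next unread byte */
};

/* Bytes in the order the input layer finally sees them. */
struct delivery_log {
    uint8_t seen[MAX_BYTES];
    size_t count;
};

/* Stand-in for the hand-off to the input layer: records the byte. */
static void deliver(struct delivery_log *log, uint8_t byte)
{
    if (log->count < MAX_BYTES)
        log->seen[log->count++] = byte;
}

/*
 * Model of the handler.  If poll_fires is nonzero, the timer poll is
 * simulated as re-entering the handler after the byte has been read
 * (lock released) but before it has been delivered.
 */
static void fake_isr(struct fake_8042 *dev, struct delivery_log *log,
                     int poll_fires)
{
    if (dev->next >= dev->len)
        return;                      /* nothing pending */

    /* "Locked" region: read one byte from the device. */
    uint8_t byte = dev->data[dev->next++];
    /* Lock released here; the window opens. */

    if (poll_fires)
        fake_isr(dev, log, 0);       /* nested run completes first */

    deliver(log, byte);              /* outer run resumes too late */
}
```

Feeding the bytes 0x08 0x02 0x01 through fake_isr() with poll_fires set on the 
first call delivers them as 0x02 0x08 0x01; with poll_fires always zero they 
arrive in their original order.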

For the timer interrupt handler to believe that HZ/20 jiffies had expired, at 
least that much time must have elapsed between i8042_interrupt() releasing the 
spinlock and calling serio_interrupt() a dozen lines later, suggesting a 
lengthy hypervisor preemption followed by a timer isr before resuming from the 
point of preemption.  Or, a considerable amount of time, > HZ/20, expired 
reading the data from qemu's emulation of port 0x60, followed by a timer isr 
after the spin_unlock_irqrestore() in i8042_interrupt().  Whichever the case 
may be, i8042_interrupt() is _assuming_ that HZ/20 jiffies are not going to 
elapse before its isr completes.  This assumption is probably fair enough for 
running natively, but not a good assumption when hosted by the current 
implementation of the hypervisor.
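
To put a number on that window, here is a small sketch of the HZ/20 arithmetic.  
The helper names are hypothetical, and the HZ values mentioned afterwards (100, 
250, 1000) are common kernel configuration choices assumed for illustration; 
the guest's actual HZ is not stated here.

```c
/* Poll period in jiffies, mirroring the HZ/20 expression (integer division). */
static unsigned poll_period_jiffies(unsigned hz)
{
    return hz / 20;
}

/* The same period converted to milliseconds: one jiffy lasts 1000/hz ms. */
static unsigned poll_period_ms(unsigned hz)
{
    return poll_period_jiffies(hz) * 1000 / hz;
}
```

For HZ = 100 this gives 5 jiffies (50 ms), for HZ = 250 it gives 12 jiffies 
(48 ms), and for HZ = 1000 it gives 50 jiffies (50 ms).  In each case the isr 
would have to be stalled for roughly a twentieth of a second.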

The question now is, does the hypervisor change to accommodate the assumption, 
or is the assumption removed from the kernel, or is there some other fiendish 
time-consuming bug yet to be discovered?
