[Xen-devel] Kernel 2.6.39+ hangs when running as HVM guest under Xen

Since kernel 2.6.39 we have been seeing strange hangs when booting such kernels
as HVM guests under Xen (similar hangs, but in different places, with CentOS
5.4 + Xen 3.4.3 as well as with Xen 4.1 and a 3.0 based dom0). The problem only
happens when running with more than one vcpu.

I was able to examine some dumps[1] and each one showed a weird situation. In
one example (booting 3.0 HVM under Xen 3.4.3/2.6.18 dom0) the lockup always
seemed to occur when the delayed mtrr init took place. Cpu#0 seemed to have
started the rendezvous (stop_cpu) but was then interrupted, and the other cpu
(I was using vcpu=2 for simplicity) was idling somewhere else but had the mtrr
rendezvous handler queued up (it just never seemed to get started).

Things seemed to indicate some IPI problem, but to be sure I bisected to find
when the problem started. I ended up with the following patch which, when reverted,
allows me to bring up a 3.0 HVM guest with more than one CPU without any 

commit 99bbb3a84a99cd04ab16b998b20f01a72cfa9f4f
Author: Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>
Date:   Thu Dec 2 17:55:10 2010 +0000

    xen: PV on HVM: support PV spinlocks and IPIs

    Initialize PV spinlocks on boot CPU right after native_smp_prepare_cpus
    (that switch to APIC mode and initialize APIC routing); on secondary

    Enable the usage of event channels to send and receive IPIs when
    running as a PV on HVM guest.

Though I have not yet really understood why exactly this happens, I thought I
would post the results so far. It feels like an IPI signalled through the
event channel either does not come through or goes to the wrong CPU. The failure
did not always occur in exactly the same place. As mentioned, the 3.0 guest
running in the CentOS dom0 was locking up early, right after all CPUs were
brought up. While
during the bisect (using a kernel between 2.6.38 and .39-rc1) the lockup was 

Maybe someone immediately has a clue. I will dig a bit deeper into the dumps in
the meantime. Looking at the description, it sounds like using event channels
was only intended for PV on HVM guests, so it may be wrong in the first place to
set the xen ipi functions on the HVM side...
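
To make the suspected failure mode concrete, here is a toy model of the
rendezvous. It is not Xen or kernel code, and every name in it
(run_rendezvous, ipi_route and so on) is made up for illustration. It only
assumes what the dumps suggested: the initiating CPU queues a handler on the
other CPU and then waits, and the other CPU runs that handler only once a
notification actually reaches it.

```python
# Toy model only: not Xen or Linux code. All names are hypothetical.

def run_rendezvous(ipi_route, initiator=0, ncpus=2, max_ticks=5):
    """Simulate a stop-machine style rendezvous.

    ipi_route maps the intended target CPU to the CPU that actually
    receives the notification. Returns True if every CPU checks in,
    False if the initiator would keep waiting (modelled as a timeout).
    """
    pending = {cpu: [] for cpu in range(ncpus)}  # per-CPU queued handlers
    checked_in = {initiator}
    notified = set()  # CPUs that actually received a wake-up

    # The initiator queues the handler on every other CPU and kicks it.
    for target in range(ncpus):
        if target == initiator:
            continue
        pending[target].append("rendezvous_handler")
        notified.add(ipi_route.get(target, target))

    for _ in range(max_ticks):
        for cpu in range(ncpus):
            # An idle CPU only runs queued work once it has been notified.
            if cpu in notified and pending[cpu]:
                pending[cpu].pop()
                checked_in.add(cpu)
        if len(checked_in) == ncpus:
            return True
    return False


print(run_rendezvous({1: 1}))  # notification reaches CPU 1
print(run_rendezvous({1: 0}))  # notification lands on CPU 0 instead
```

In the second call the handler stays queued on CPU 1 and the initiator never
sees it check in, which is at least consistent with what the dumps looked
like. Whether the real cause is a misrouted or a lost notification is still
open.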


[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/791850

