
RE: [Xen-devel] Debugging a weird hardware fault.



________________________________________
From: Keir Fraser [keir.xen@xxxxxxxxx]
Sent: 28 July 2011 21:42
To: Andrew Cooper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] Debugging a weird hardware fault.

On 28/07/2011 20:53, "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx> wrote:

> My questions to the Xen community are:
>
> what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is
> in action, and more generally, how can I go about debugging which tasks
> are being run.

By the time you get to time_suspend(), you are running on CPU0, all other
CPUs are offline, all domUs are suspended, and IRQs are disabled. There's
not much scope for unexpected interruptions unless it's an NMI or SMI.

By that point the serial subsystem is in synchronous mode, rather than
interrupt-driven, so it's no wonder it continues to work.

 -- Keir


Initially, an SMI was what I was thinking, but the triple fault occurs whether
or not you start bringing down CPUs.  While waiting 10 seconds in the
platform_op select statement, the fault still occurs when all CPUs are still up,
all IRQs are still enabled and domUs are potentially still up.  (Also, from
studying the Xen 3.4 code, I believe that interrupts are actually still enabled
during time_suspend(), and are only brought down later by lapic_suspend() in
device_power_down().)

Conversely, in the hacked-up case where I ditched most of the shared S3/S5
codepath and just hit the PM1A, the server correctly shut down and stayed shut
down, implying that the fault is caused by software (be it BIOS or OS) rather
than hardware.  From what I understand of the ACPI spec (and I claim very
little knowledge), there are a multitude of hardware events which could bring
the server out of S5, appearing as a triple fault, and these would not be
affected by whether you had hit the PM1A register.
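
For reference, "hitting the PM1A" here means writing the sleep type and the
SLP_EN bit to the PM1a control register.  Below is a minimal sketch in C of how
that value is composed, assuming the ACPI-defined bit layout (SLP_TYP in bits
10-12, SLP_EN in bit 13).  The helper name pm1_sleep_value() is hypothetical
and is not taken from the Xen source, and the SLP_TYP value for S5 is
platform-specific (it comes from the \_S5 object in the DSDT), so it is passed
in rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of the PM1 control register as defined by the ACPI spec. */
#define PM1_SLP_TYP_SHIFT 10
#define PM1_SLP_TYP_MASK  (0x7u << PM1_SLP_TYP_SHIFT)
#define PM1_SLP_EN        (1u << 13)

/*
 * Hypothetical helper: compute the value to write to PM1a_CNT in order to
 * enter a sleep state.  'current' is the value read back from the register,
 * 'slp_typ' is the 3-bit sleep type taken from the platform's \_Sx object.
 * All bits outside SLP_TYP and SLP_EN are preserved.
 */
static uint16_t pm1_sleep_value(uint16_t current, uint8_t slp_typ)
{
    uint32_t v = current;

    assert(slp_typ <= 7);                        /* SLP_TYP is only 3 bits wide */

    v &= ~PM1_SLP_TYP_MASK;                      /* clear any previous sleep type */
    v |= (uint32_t)slp_typ << PM1_SLP_TYP_SHIFT; /* select the target sleep state */
    v |= PM1_SLP_EN;                             /* setting SLP_EN triggers the transition */

    return (uint16_t)v;
}
```

The resulting value would then be written with a 16-bit port write to the PM1a
control block address reported in the FADT; that step is omitted here because
it needs real hardware access.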

In this specific example, dom0's regular shutdown code had already brought down
the domUs (of which there were none because we never started any), and we were
running with only 1 CPU, so no others were up.  This opens up a whole host of
other possibilities which could be having an effect between the
XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.

~Andrew
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
