Xen project Mailing List

RE: [Xen-devel] Debugging a weird hardware fault.

To: Keir Fraser <keir.xen@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>

From: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>

Date: Thu, 28 Jul 2011 23:45:07 +0100

Accept-language: en-US

Cc:

Delivery-date: Thu, 28 Jul 2011 15:45:56 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: AcxNZu1KSOfHQqC8kEeWRl0hBxZ7qQADlzEK

________________________________________
From: Keir Fraser [keir.xen@xxxxxxxxx]
Sent: 28 July 2011 21:42
To: Andrew Cooper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] Debugging a weird hardware fault.

On 28/07/2011 20:53, "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx> wrote:

> My questions to the Xen community are:
>
> what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is
> in action, and more generally, how can I go about debugging which tasks
> are being run.

By the time you get to time_suspend(), you are running on CPU0, all other CPUs are offline, all domUs are suspended, and IRQs are disabled. There's not much scope for unexpected interruptions unless it's an NMI or SMI. By that point the serial subsystem is in synchronous mode, rather than interrupt-driven, so it's no wonder it continues to work.

 -- Keir

Initially, an SMI was what I was thinking of, but the triple fault occurs whether or not you start bringing down CPUs. When waiting 10 seconds in the platform_op select statement, the fault still occurs while all CPUs are still up, all IRQs are still enabled, and domUs are potentially still running. (Also, from studying the Xen 3.4 code, I believe that interrupts are actually still enabled during time_suspend(), and are only brought down shortly afterwards by lapic_suspend(), later in device_power_down().)

Conversely, in the hacked-up case where I ditched most of the shared S3/S5 code path and just hit the PM1A register directly, the server shut down correctly and stayed shut down, implying that the fault is caused by software (be it BIOS or OS) rather than hardware. From what I understand of the ACPI spec (and I claim very little knowledge), there are a multitude of hardware events which could bring the server out of S5, appearing as a triple fault, and these would not be affected by whether or not you had hit the PM1A register.
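For readers unfamiliar with what "hitting the PM1A register" involves: per the ACPI specification, the OS enters a sleep state (S5 soft-off included) by writing that state's SLP_TYPa value into bits 10-12 of the PM1a control register (PM1a_CNT) together with the SLP_EN bit (bit 13). The following is only a minimal sketch that computes that register value under those assumptions; it performs no port I/O and is not Xen's code. The function name pm1_cnt_sleep_value is hypothetical, and the SLP_TYPa value is platform-specific (it comes from the \_S5 package in the DSDT), so any value passed in here is purely illustrative.

```c
#include <stdint.h>

/* PM1 control register fields as laid out in the ACPI specification. */
#define ACPI_SLP_TYP_SHIFT 10
#define ACPI_SLP_TYP_MASK  (0x7u << ACPI_SLP_TYP_SHIFT) /* bits 10-12 */
#define ACPI_SLP_EN        (1u << 13)                   /* bit 13 */

/*
 * Hypothetical helper: given the current PM1a_CNT contents and the
 * platform's SLP_TYPa value for the target state, return the value to
 * write to enter that state. Other bits (e.g. SCI_EN, bit 0) are kept.
 */
static uint16_t pm1_cnt_sleep_value(uint16_t cur, uint8_t slp_typ)
{
    /* Clear any stale sleep type and enable bit. */
    uint16_t v = cur & (uint16_t)~(ACPI_SLP_TYP_MASK | ACPI_SLP_EN);

    /* Insert the 3-bit sleep type and request the transition. */
    v |= (uint16_t)((slp_typ & 0x7u) << ACPI_SLP_TYP_SHIFT);
    v |= ACPI_SLP_EN;
    return v;
}
```

For example, with SCI_EN set and an illustrative SLP_TYPa of 5, pm1_cnt_sleep_value(0x0001, 5) yields 0x3401; that value would then be written to the PM1a_CNT I/O port.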
In this specific example, dom0's regular shutdown code had already brought down the domUs (of which there were none, because we never started any), and we were running with only one CPU, so no others were up. This opens up a whole host of other possibilities which could be having an effect between the XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.