
Re: [Xen-devel] Debugging a weird hardware fault.




On 02/08/11 15:26, Keir Fraser wrote:
> On 02/08/2011 07:14, "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx> wrote:
>
>> Just for information, this turned out to be a BIOS bug.  It was setting
>> a 6 second timer when executing _PTS, which hit the system reset if
>> PM1{a,b} had not been hit when the timer expired.  As Xen does all of
>> its shutdown after the call to _PTS and before PM1{a,b}, there is a
>> significant time gap, which was falling foul of the timer in most cases.
> Six seconds though, that's quite a long time! Is it a big box?

It is a NetScaler SDX box, designed to have 24 logical pcpus, 96GB RAM, and
320 PCI-passed-through ixgbe virtual functions (claiming 3 IRQs per VF).

It seems that Xen spends a fair amount of time doing freeze_domains
(even though dom0 has already shut down all domUs, albeit forcibly if
they haven't shut down nicely within 15 seconds), and bringing down the
other CPUs (in particular, it spends ages fiddling around with irq
affinities).

Overall, there is probably quite a bit of optimization which could be
done, but that still doesn't excuse a BIOS deciding that "a long time"
as per the ACPI spec is "less than 6 seconds".
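
To make the timing argument concrete, here is a minimal sketch of the race,
assuming the BIOS arms its reset timer when _PTS runs and only disarms it
on the PM1{a,b} write. The step durations are hypothetical placeholders,
not measurements from this box, and the names Step and
reset_fires_before_pm1 are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One piece of shutdown work done between _PTS and the PM1 write."""
    name: str
    seconds: float


def reset_fires_before_pm1(steps, watchdog_seconds=6.0):
    """Return True if the work after _PTS outlasts the BIOS reset timer."""
    elapsed = sum(step.seconds for step in steps)
    return elapsed > watchdog_seconds


# Hypothetical durations, for illustration only (not measured values).
steps = [
    Step("freeze_domains", 2.5),
    Step("bring down other CPUs and fix up irq affinities", 4.0),
]

if __name__ == "__main__":
    if reset_fires_before_pm1(steps):
        print("timer expires first: the box resets instead of powering off")
    else:
        print("PM1 write happens in time: clean power off")
```

With these made-up numbers the work takes longer than the 6 second timer,
which matches the behaviour seen in most shutdowns; shortening any step
below the budget would hide the problem rather than fix it.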

~Andrew

>> In this case, it seems likely that a BIOS fix can be done, as Supermicro
>> do provide a custom BIOS for the NetScaler box in question.
>>
>> However, if anyone else comes across this issue, we did make a software
>> solution.  You can replace /etc/init.d/halt (or equivalent for your
>> chosen dom0 distro) to KEXEC reboot into a native kernel which listens
>> for a special command line parameter and calls pm_power_off_prepare()
>> and pm_power_off() after the ACPI module has initialized[1].
>>
>> This issue does however show that Xen itself is in breach of the ACPI
>> spec, which is a dangerous situation to be in given the fragility of
>> ACPI at the best of times.  In due course, I will put my mind to solving
>> the dom0-Xen ACPI interaction problems if the question is still open.
> Yes, this is ultimately the issue. It's going to be a pain to fix properly,
> unfortunately.
>
>  -- Keir
>
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

