
Re: [Xen-users] MCE logs and CPU issue

Hi both,

you might wanna throw 30 minutes into setting up an OMD Nagios instance
(www.omdistro.org), adding the affected servers to the check_mk config,
and grabbing my Linux ECC error check plugin from the community exchange.

I *really really hope* I got everything right and it will be able to
detect 1-bit and 2-bit ECC errors once the CPUs report them.
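
For anyone who wants a quick check without the full setup, here is a minimal
sketch of a Nagios-style ECC check. It is not the plugin mentioned above. It
assumes the kernel EDAC driver is loaded and exposes ce_count and ue_count
under /sys/devices/system/edac/mc, which is not guaranteed on every board or
on a Xen dom0. The function names and the "any corrected error is a WARNING"
threshold are illustrative choices.

```python
import glob
import os

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_counter(path):
    """Read a single integer counter from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def check_edac(base="/sys/devices/system/edac/mc"):
    """Sum corrected (ce) and uncorrected (ue) error counts over all
    memory controllers and map them to a Nagios status and message."""
    controllers = sorted(glob.glob(os.path.join(base, "mc[0-9]*")))
    if not controllers:
        return UNKNOWN, "UNKNOWN - no EDAC memory controllers under %s" % base

    ce = ue = 0
    for mc in controllers:
        ce += read_counter(os.path.join(mc, "ce_count"))
        ue += read_counter(os.path.join(mc, "ue_count"))

    if ue > 0:
        return CRITICAL, "CRITICAL - %d uncorrected, %d corrected ECC errors" % (ue, ce)
    if ce > 0:
        # Assumption: any corrected error is worth a warning.
        return WARNING, "WARNING - %d corrected ECC errors" % ce
    return OK, "OK - no ECC errors reported"
```

To use it as a plugin, print the message returned by check_edac() and exit
with the returned status code.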

The error
>> Feb  2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr
>> cf839a00, state cc0035400001009f]
is as descriptive as anything that isn't a real big iron Unix box can get.
(Of course, then you'd have better ECC and a page deallocation table
anyway and all this would not be causing problems)
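
The state word can also be picked apart by hand. A minimal sketch, assuming
the value is a raw machine-check bank status register with the architectural
bit layout from the Intel SDM (IA32_MCi_STATUS). Whether this particular CPU
fills in the corrected-error count in bits 52:38 is an assumption, and the
names decode_mci_status and MEM_OPS are made up for illustration.

```python
# Memory transaction types (MMM field) of a memory controller error code.
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status word into its architectural fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER: earlier errors were lost
        "uncorrected": bool((status >> 61) & 1),      # UC
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        # Assumption: bits 52:38 hold a corrected-error count on this CPU.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors use the compound code 000F 0000 1MMM CCCC;
    # bit 12 (F) is a filtering flag and is ignored for the match.
    if (mca_code & 0xEF80) == 0x0080:
        fields["memory_error"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return fields


fields = decode_mci_status(0xCC0035400001009F)
```

Under those assumptions, the value from the log decodes to a valid, corrected
(UC clear) error with the overflow bit set, and the error code matches the
memory controller pattern for a memory read with no channel specified. That
fits the failing RAM theory.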

My assumption is that Xen properly forwards MCEs. There was a
presentation by Intel on the topic at one of the recent Xen Summits. I
wasn't there but read through it some time ago. I guess you'll be able
to find it.

If needed I can do a short walkthrough of the setup. I just wanna
avoid this looking like an advertisement. It's not my fault there's no
other good ECC check plugin for Nagios :)

2012/2/3 Luke S. Crawford <lsc@xxxxxxxxx>:
> On Fri, Feb 03, 2012 at 11:37:27AM +0800, Sylvain Chevalier wrote:
>> Hi,
>> On one of our servers running xen, we see many instances like this in
>> /var/log/messages on dom0:
>> Feb  2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter
>> dom0 mce vIRQ handler
>> Feb  2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more
>> urgent data
>> Feb  2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr
>> cf839a00, state cc0035400001009f]
>> Feb  2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more
>> nonurgent data
>> it is always CPU8, BANK12. And the server will sometimes just abruptly
>> reboot after logging this.
>> Does it mean that MCE messages are logged by xen in /var/log/messages
>> and that there is a problem with this cpu? Do you know how I can dig
>> further and find what the problem is?
> Betcha it is the ram in that bank.
> I'm getting similar errors in a server that I just swapped out, only my
> MCE errors say:
> (XEN) MCE: The hardware reports a non fatal, correctable incident occured on 
> CPU 0.
> (XEN) Bank 4: dc0c4000fe080813[c008000401000000] at        363fe9000
> (this is on my serial console, not /var/log/messages)
> 'non-fatal, correctable incident on cpu0, Bank 4'  sure sounds a lot
> like it's a correctable ECC error.   The crash would then be explained
> by an uncorrectable ecc error (commonly in failing ram, you get correctable
> errors, then an uncorrectable error.)

bingo :>

> Now, this was on an ancient garbage nvidia mcp55 motherboard and nothing
> like the kernel EDAC/bluesmoke module works with it, xen or no.
> The counter evidence to that theory is that the motherboard system event
> log (accessed through the bios setup screen)  doesn't show any errors.

MCEs are often seen while nothing shows up in iLO or similar management
logs. I guess this is because Intel / AMD decide when the CPU sends out
an MCE/EDAC event, whereas the HW vendors might even be slightly inclined
to not immediately replace stuff because of a single PCI CRC error
(which isn't even checked in Linux by default... lol).


the purpose of libvirt is to provide an abstraction layer hiding all
xen features added since 2006 until they were finally understood and
copied by the kvm devs.
