
Re: [Xen-users] MCE logs and CPU issue



Hi both,

you might wanna spend 30 minutes setting up an OMD Nagios instance
(www.omdistro.org), adding the affected servers to the check_mk config,
and grabbing my Linux ECC error check plugin from the community exchange
(http://exchange.check-mk.org).

I *really really hope* I got everything right and it will be able to
detect single-bit and double-bit ECC errors once the CPUs report them.
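
For anyone who wants to see roughly what such a check involves, here is a
minimal sketch of a check_mk local check. It is not the plugin mentioned
above. It assumes a kernel EDAC driver is loaded and exposes its counters
under /sys/devices/system/edac/mc, which may well not be the case in a Xen
dom0. The service name ECC_errors and the function names are made up for
illustration.

```python
from pathlib import Path

# Standard EDAC sysfs location; only populated when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_int(path):
    """Return the integer stored in a sysfs file, or 0 if it is unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def read_counts(root=EDAC_ROOT):
    """Sum corrected (ce) and uncorrected (ue) counts over all controllers."""
    root = Path(root)
    if not root.is_dir():
        return 0, 0
    ce = ue = 0
    for mc in sorted(root.glob("mc[0-9]*")):
        ce += _read_int(mc / "ce_count")
        ue += _read_int(mc / "ue_count")
    return ce, ue


def local_check_line(ce, ue):
    """Format one local check line: status, service name, perfdata, text."""
    if ue > 0:
        state, word = 2, "CRIT"
    elif ce > 0:
        state, word = 1, "WARN"
    else:
        state, word = 0, "OK"
    return (f"{state} ECC_errors ce={ce}|ue={ue} "
            f"{word} - {ce} corrected, {ue} uncorrected errors")


if __name__ == "__main__":
    print(local_check_line(*read_counts()))
```

Treating any corrected error as WARN and any uncorrected error as CRIT is
just one possible policy. A real check would more likely alert on the rate
of new errors than on the absolute counters.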

The error
>> Feb  2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr
>> cf839a00, state cc0035400001009f]
is about as descriptive as it gets on anything that isn't a real
big-iron Unix box. (Of course, then you'd have better ECC and a page
deallocation table anyway, and none of this would be causing problems.)
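
The state word in that line can be picked apart by hand. The sketch below
assumes it is a raw MCi_STATUS register value using the Intel architectural
layout (flag bits 63..57, MCA error code in bits 15..0); the thread does not
say which CPU this is, so treat the decoding as a guess. The regex is built
from the single sample line above, and the function names are made up for
illustration.

```python
import re

# Matches "[CPU8, BANK12, addr cf839a00, state cc0035400001009f]".
MCE_LINE = re.compile(
    r"\[CPU(?P<cpu>\d+), BANK(?P<bank>\d+), addr\s+(?P<addr>[0-9a-fA-F]+), "
    r"state\s+(?P<state>[0-9a-fA-F]+)\]"
)

# Architectural flag bits of MCi_STATUS (assumed Intel layout).
FLAG_BITS = {
    "valid": 63,            # VAL: register holds a valid error
    "overflow": 62,         # OVER: another error arrived before this was read
    "uncorrected": 61,      # UC: error was not corrected by hardware
    "enabled": 60,          # EN: error reporting was enabled for this error
    "misc_valid": 59,       # MISCV
    "addr_valid": 58,       # ADDRV: the logged address is valid
    "context_corrupt": 57,  # PCC: processor context corrupt
}

# MMM field of a memory controller error code (000F 0000 1MMM CCCC).
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_status(status):
    """Decode flag bits and, if present, a memory controller error code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF
    memory_error = None
    if (code & 0xEF80) == 0x0080:  # ignore the filter bit (bit 12)
        op = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # Channel 0xF means "channel not specified".
        memory_error = (op, None if channel == 0xF else channel)
    return {"flags": flags, "mca_code": code, "memory_error": memory_error}


def parse_mce_line(line):
    """Extract CPU, bank, address and decoded state from one dom0 log line."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    info = decode_status(int(m["state"], 16))
    info.update(cpu=int(m["cpu"]), bank=int(m["bank"]), addr=int(m["addr"], 16))
    return info
```

If the Intel layout does apply, cc0035400001009f decodes as a valid,
corrected (UC clear) memory read error with the overflow bit set and no
channel specified, which would fit the failing RAM theory below.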

My assumption is that Xen properly forwards MCEs. There was a
presentation by Intel on the topic at one of the last XenSummits. I
wasn't there, but I read through it some time ago. I guess you'll be
able to find it.

If needed I can do a short walkthrough of the setup. I just wanna
avoid this looking like an advertisement. It's not my fault there's no
other good ECC check plugin for Nagios :)

2012/2/3 Luke S. Crawford <lsc@xxxxxxxxx>:
> On Fri, Feb 03, 2012 at 11:37:27AM +0800, Sylvain Chevalier wrote:
>> Hi,
>>
>> On one of our servers running xen, we see many instances like this in
>> /var/log/messages on dom0:
>>
>> Feb  2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter
>> dom0 mce vIRQ handler
>> Feb  2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more
>> urgent data
>> Feb  2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr
>> cf839a00, state cc0035400001009f]
>> Feb  2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more
>> nonurgent data
>>
>> it is always CPU8, BANK12. And the server will sometimes just abruptly
>> reboot after logging this.
>
>> Does it mean that MCE messages are logged by xen in /var/log/messages
>> and that there is a problem with this cpu? Do you know how I can dig
>> further and find what the problem is?
>
> Betcha it is the ram in that bank.
>
> I'm getting similar errors in a server that I just swapped out, only my
> MCE errors say:
>
> (XEN) MCE: The hardware reports a non fatal, correctable incident occured on 
> CPU 0.
> (XEN) Bank 4: dc0c4000fe080813[c008000401000000] at        363fe9000
>
> (this is on my serial console, not /var/log/messages)
>
> 'non-fatal, correctable incident on cpu0, Bank 4'  sure sounds a lot
> like it's a correctable ECC error.   The crash would then be explained
> by an uncorrectable ecc error (commonly in failing ram, you get correctable
> errors, then an uncorrectable error.)

bingo :>

> Now, this was on an ancient garbage nvidia mcp55 motherboard and nothing
> like the kernel EDAC/bluesmoke module works with it, xen or no.
>
> The counter evidence to that theory is that the motherboard system event
> log (accessed through the bios setup screen)  doesn't show any errors.

MCEs often show up while nothing appears in iLO or similar management
logs. I guess this is because Intel / AMD decide when the CPU raises an
MCE/EDAC event, whereas the HW vendors might even be slightly inclined
not to replace stuff immediately because of a single PCI CRC error
(which Linux doesn't even check by default... lol).


Flo

-- 
the purpose of libvirt is to provide an abstraction layer hiding all
xen features added since 2006 until they were finally understood and
copied by the kvm devs.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users


 

