
Re: [Xen-users] MCE logs and CPU issue

On Fri, Feb 03, 2012 at 11:37:27AM +0800, Sylvain Chevalier wrote:
> Hi,
> On one of our servers running xen, we see many instances like this in
> /var/log/messages on dom0:
> Feb  2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter
> dom0 mce vIRQ handler
> Feb  2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more
> urgent data
> Feb  2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr
> cf839a00, state cc0035400001009f]
> Feb  2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more
> nonurgent data
> it is always CPU8, BANK12. And the server will sometimes just abruptly
> reboot after logging this.

> Does it mean that MCE messages are logged by xen in /var/log/messages
> and that there is a problem with this cpu? Do you know how I can dig
> further and find what the problem is?

Betcha it is the RAM in that bank.

I'm getting similar errors in a server that I just swapped out, only my
MCE errors say:

(XEN) MCE: The hardware reports a non fatal, correctable incident occured on 
CPU 0.
(XEN) Bank 4: dc0c4000fe080813[c008000401000000] at        363fe9000

(this is on my serial console, not /var/log/messages)

'non-fatal, correctable incident on cpu0, Bank 4' sure sounds a lot
like a correctable ECC error. The crash would then be explained
by an uncorrectable ECC error. (Commonly with failing RAM, you get
correctable errors first, then an uncorrectable one.)
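
One way to sanity-check that reading is to decode the status words
themselves. The sketch below is only an illustration and rests on
assumptions: that the 'state' value in the dom0 log and the first hex
value on the (XEN) 'Bank 4' line are raw MCi_STATUS registers, and that
only the flag bits 63 to 57 and the low 16-bit MCA error code are
examined, since Intel and AMD lay those out the same way. The names
decode_mci_status and summarize are made up for this example.

```python
# Flag bits at the top of MCi_STATUS (same positions on Intel and AMD).
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # overflow: another error arrived before this one was cleared
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # MCi_MISC holds extra information
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit status word into flags and error-code fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    fields["mca_code"] = status & 0xFFFF            # bits 15:0
    fields["model_code"] = (status >> 16) & 0xFFFF  # bits 31:16, vendor-specific
    return fields


def summarize(status):
    """One-line, vendor-neutral reading of a status word."""
    f = decode_mci_status(status)
    if not f["VAL"]:
        return "no valid error logged"
    kind = "uncorrected" if f["UC"] else "corrected"
    extra = ", overflow flag set" if f["OVER"] else ""
    return f"{kind} error, MCA code 0x{f['mca_code']:04x}{extra}"


if __name__ == "__main__":
    # The two status values quoted in this thread.
    for s in (0xCC0035400001009F, 0xDC0C4000FE080813):
        print(f"{s:016x}: {summarize(s)}")
```

Under those assumptions, both values in this thread have UC clear, so
the hardware is reporting corrected errors, which fits the
correctable-ECC theory. What the MCA code itself means is
vendor-specific and has to be looked up in the CPU manual.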

Now, this was on an ancient garbage NVIDIA MCP55 motherboard, and nothing
like the kernel EDAC/bluesmoke module works with it, Xen or no.

The counter-evidence to that theory is that the motherboard system event
log (accessed through the BIOS setup screen) doesn't show any errors.

Now, like I said, this server was in production, so I drove a spare in
to the co-lo, swapped the hard drives and brought it back up. (It took a
lot longer than it should have, as this server hadn't been touched in
years, and somehow the good disks didn't end up with bootloaders. By
'somehow' I mean "I am an idiot and did not install bootloaders when I
replaced bad disks". Then I didn't bring my rescue CD, and the DHCP/TFTP
PXE server I would have used to boot it into rescue mode was on the
server that was down. It took all day when it should have taken about
as long as it takes to get up to the 14th floor of Market Post Tower.)

Anyhow, I'm delaying diagnostics on my bad server until tomorrow; I'd
bet lunch that if I turn ECC off and run memtest, I'll find a bad RAM
