Xen project Mailing List

Re: [Xen-users] MCE logs and CPU issue

To: Sylvain Chevalier <sylvain.chevalier@xxxxxxxxx>

From: "Luke S. Crawford" <lsc@xxxxxxxxx>

Date: Fri, 3 Feb 2012 00:02:39 -0500

Delivery-date: Fri, 03 Feb 2012 05:04:22 +0000

List-id: Xen user discussion <xen-users.lists.xensource.com>

On Fri, Feb 03, 2012 at 11:37:27AM +0800, Sylvain Chevalier wrote:
> Hi,
>
> On one of our servers running xen, we see many instances like this in
> /var/log/messages on dom0:
>
> Feb 2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter dom0 mce vIRQ handler
> Feb 2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more urgent data
> Feb 2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr cf839a00, state cc0035400001009f]
> Feb 2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more nonurgent data
>
> it is always CPU8, BANK12. And the server will sometimes just abruptly
> reboot after logging this.
> Does it mean that MCE messages are logged by xen in /var/log/messages
> and that there is a problem with this cpu? Do you know how I can dig
> further and find what the problem is?

Betcha it is the ram in that bank.

I'm getting similar errors in a server that I just swapped out, only
my MCE errors say:

(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.
(XEN) Bank 4: dc0c4000fe080813[c008000401000000] at 363fe9000

(this is on my serial console, not /var/log/messages)

'non-fatal, correctable incident on cpu0, Bank 4' sure sounds a lot
like it's a correctable ECC error. The crash would then be explained
by an uncorrectable ECC error (commonly in failing ram, you get
correctable errors first, then an uncorrectable error.)

Now, this was on an ancient garbage nvidia mcp55 motherboard, and
nothing like the kernel EDAC/bluesmoke module works with it, xen or no.

The counterevidence to that theory is that the motherboard system
event log (accessed through the bios setup screen) doesn't show any
errors.

Now, like I said, this server was in production, so I drove a spare
in to the co-lo, swapped the hard drives and brought it back up.
(It took a lot longer than it should have, as this server hadn't been
touched in years, and somehow the good disks didn't end up with
bootloaders. By 'somehow' I mean, "I am an idiot and did not install
bootloaders when I replaced bad disks" - then I didn't bring my rescue
cd, and the DHCP/tftp PXE server I would have used to boot it into
rescue mode was on the server that was down. It took all day when it
should have taken about as long as it takes to get up to the 14th
floor of market post tower.)

Anyhow, I'm delaying diagnostics on my bad server until tomorrow; I'd
bet lunch that if I turn ECC off and run memtest, I'll find a bad ram
module.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
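
One way to dig further into the two status words quoted in this thread is to decode their top flag bits by hand. The sketch below is not from the original message. It assumes that the "state" value in the dom0 log line and the first hex value on the "(XEN) Bank 4:" line are raw machine-check bank status registers, and it only decodes the flag bits that the Intel and AMD layouts have in common (VAL, OVER, UC, EN, MISCV, ADDRV, PCC) plus the low 16-bit error code. The names `FLAG_BITS` and `decode_mci_status` are made up for illustration, and everything below bit 57 other than the error code is vendor-specific and left alone.

```python
# Hypothetical helper: decode the vendor-neutral flag bits of a
# machine-check bank status register (MCi_STATUS).
# Assumption: the hex words in the logs are raw MCi_STATUS values.

FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # the bank's MISC register holds valid data
    "ADDRV": 58,  # the bank's ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Return the common flag bits and the low 16-bit MCA error code."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["mca_error_code"] = status & 0xFFFF
    return decoded


if __name__ == "__main__":
    # The two status words quoted in the thread.
    for raw in ("cc0035400001009f", "dc0c4000fe080813"):
        fields = decode_mci_status(int(raw, 16))
        set_flags = [name for name in FLAG_BITS if fields[name]]
        print(raw, set_flags, hex(fields["mca_error_code"]))
```

Under those assumptions, both words have VAL set and UC clear, which is at least consistent with the "correctable" reading above; it does not by itself prove that a DIMM is at fault. Interpreting the error code and the remaining bits needs the vendor's documentation for the specific CPU.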
