We have been trying to find documentation on how to tell Xen to forward MCE information to the Linux kernel in dom0, so that a system administrator can be notified when the system has bad memory. However, from what I can tell, this has not been documented anywhere.
If anyone knows of documentation (or knows the answer) on how to monitor corrected and uncorrected errors when running a modern Xen, it would be appreciated.
To clarify (and for people not familiar with the background):
When running an old Xen (for example, Xen 4.1) on a system, Linux in dom0 would load the EDAC modules, for example amd64_edac_mod, edac_mce_amd, and edac_core.
Once the modules were loaded, the error counts for ECC memory and PCI could be found in these "files":
/sys/devices/system/edac/mc/mc0/ce_count
/sys/devices/system/edac/mc/mc0/ue_count
/sys/devices/system/edac/pci/pci0/npe_count
/sys/devices/system/edac/pci/pci0/pe_count
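
For anyone who wants to poll these counters from a script, here is a minimal sketch in Python. It only assumes the four paths listed above; the function name read_edac_counts and the choice to report a missing file as None are my own, not anything from Xen or the kernel. On a dom0 where the EDAC modules do not load, I would expect every entry to come back as None.

```python
from pathlib import Path

# Default sysfs location, taken from the paths listed above.
EDAC_ROOT = Path("/sys/devices/system/edac")

# Counter files relative to the EDAC root, exactly as listed above.
COUNTER_FILES = (
    "mc/mc0/ce_count",
    "mc/mc0/ue_count",
    "pci/pci0/npe_count",
    "pci/pci0/pe_count",
)


def read_edac_counts(root=EDAC_ROOT):
    """Return {relative_path: count}; the count is None when the file
    is missing, unreadable, or does not contain an integer."""
    counts = {}
    for rel in COUNTER_FILES:
        try:
            counts[rel] = int((Path(root) / rel).read_text().strip())
        except (OSError, ValueError):
            counts[rel] = None
    return counts


if __name__ == "__main__":
    for rel, value in read_edac_counts().items():
        print(f"{rel}: {'not available' if value is None else value}")
```

Passing a different root directory makes it possible to try the function against a copy of the sysfs tree taken from another machine.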
However, in 2009-02, "cegger" wrote MCA/MCE_in_Xen, a proposal for having Xen start checking this information itself.
At some point after that, Xen started accessing the EDAC information (now called "MCE"), which blocks the Linux kernel in dom0 from accessing it.
Now, the Linux kernel compile option CONFIG_XEN_MCE_LOG=y is documented as: "Allow kernel fetching MCE error from Xen platform and converting it into Linux mcelog format for mcelog tools".
I imagine there must be some way to make this work on the Xen side, given that CONFIG_XEN_MCE_LOG made it into the Linux kernel and is enabled by default in distributions.
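
To confirm that a given dom0 kernel was actually built with this option, something like the following Python sketch could be used. It assumes the distribution installs the kernel configuration as /boot/config-<release>, which is common but not universal; the helper name config_option_state is hypothetical.

```python
import platform
from pathlib import Path


def config_option_state(config_text, option="CONFIG_XEN_MCE_LOG"):
    """Return the value assigned to `option` (for example 'y' or 'm'),
    or None if the option is not set in the given config text."""
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" do not match the prefix.
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


if __name__ == "__main__":
    # Assumption: the config of the running kernel lives here.
    path = Path("/boot") / f"config-{platform.release()}"
    try:
        state = config_option_state(path.read_text())
    except OSError:
        print(f"could not read {path}")
    else:
        if state is None:
            print("CONFIG_XEN_MCE_LOG is not set")
        else:
            print(f"CONFIG_XEN_MCE_LOG={state}")
```

This only shows how the kernel was built. It says nothing about whether Xen is actually handing MCE data to dom0, which is the part I cannot find documented.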
(Notes: mcelog seems to have been replaced by rasdaemon, but I believe rasdaemon pulls its information through the same kernel API as mcelog did. So I believe the final report of whether you are having memory errors now comes from running "ras-mc-ctl --errors" instead of looking in /sys/devices/system/edac/mc and /sys/devices/system/edac/pci.)
I suspect that, to check whether it is working on a modern system, one would run "ras-mc-ctl --status" and get output implying that the Xen MCE interface is working, instead of just "ras-mc-ctl: drivers not loaded."
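
As a rough way to automate that check, here is a Python sketch that runs "ras-mc-ctl --status" and looks for the failure message quoted above. It assumes ras-mc-ctl is on the PATH, and it only recognises that one message. I do not know what the output looks like when the Xen interface is working, so any other output just means "not the known failure" and is not proof of success. The helper name reports_drivers_not_loaded is made up for this example.

```python
import subprocess

# Failure text quoted above; the only output this sketch recognises.
NOT_LOADED_MESSAGE = "drivers not loaded"


def reports_drivers_not_loaded(status_output):
    """True if the given ras-mc-ctl output contains the known failure text."""
    return NOT_LOADED_MESSAGE in status_output


if __name__ == "__main__":
    try:
        proc = subprocess.run(
            ["ras-mc-ctl", "--status"],
            capture_output=True,
            text=True,
        )
    except OSError:
        # Covers the tool being absent or not executable.
        print("could not run ras-mc-ctl")
    else:
        # Check both streams, since I am not sure which one the tool uses.
        output = proc.stdout + proc.stderr
        if reports_drivers_not_loaded(output):
            print("EDAC drivers are reported as not loaded")
        else:
            print("no 'drivers not loaded' message; raw output follows:")
            print(output.strip())
```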
Somewhere it was said that adding the Xen boot parameter "mce=1" in grub would cause Xen to forward the information to the Linux kernel, but that conflicts with recent changes to the documentation. I also tested setting "mce=1", and nothing appeared to change.
Any help is appreciated.