
Xen's handling of MCE


  • To: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Development <D@xxxxxxxxxx>
  • Date: Thu, 31 Aug 2023 19:52:05 +0000
  • Accept-language: en-US
  • Delivery-date: Thu, 31 Aug 2023 19:51:49 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHZ3EO9T2w+QWWFXkyFcitIPlY3Pg==
  • Thread-topic: Xens handling of MCE

We have been trying to find documentation on how to tell Xen to forward MCE information to the Linux kernel in Dom0, so that a system administrator can be notified when the system has bad memory.  However, from what I can tell, this has not been documented anywhere.

If anyone knows of documentation (or knows the answer) on what one is supposed to do to monitor corrected and uncorrected errors when running modern Xen, it would be appreciated.


To clarify (and for people not familiar):

    When running old Xen (for example, Xen 4.1) on a system, Linux in dom0 would load the EDAC modules, for example amd64_edac_mod, edac_mce_amd, and edac_core.
    Once the modules were loaded, the error counts for ECC memory and PCI could be found in these "files":
               /sys/devices/system/edac/mc/mc0/ce_count
               /sys/devices/system/edac/mc/mc0/ue_count
               /sys/devices/system/edac/pci/pci0/npe_count
               /sys/devices/system/edac/pci/pci0/pe_count
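
For anyone who wants to check these counters quickly, here is a minimal Python sketch that reads the four files listed above. The function name read_edac_counters and the choice to report a missing file as None are my own assumptions for illustration; only the sysfs paths come from the list above.

```python
from pathlib import Path

# Counter files named in the list above, relative to the EDAC sysfs root.
EDAC_COUNTERS = (
    "mc/mc0/ce_count",
    "mc/mc0/ue_count",
    "pci/pci0/npe_count",
    "pci/pci0/pe_count",
)


def read_edac_counters(root="/sys/devices/system/edac"):
    """Return {relative_path: int or None}.

    None means the file is absent or unreadable, which is what one
    would expect when the EDAC modules are not loaded.
    (Hypothetical helper, not part of any existing tool.)
    """
    results = {}
    for rel in EDAC_COUNTERS:
        try:
            results[rel] = int((Path(root) / rel).read_text().strip())
        except (OSError, ValueError):
            results[rel] = None
    return results


if __name__ == "__main__":
    for name, value in read_edac_counters().items():
        print(f"{name}: {'unavailable' if value is None else value}")
```

If every counter shows "unavailable" in dom0, the EDAC sysfs entries are simply not there, which would match the situation described in this message.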
    
    However, in 2009-02, "cegger" wrote MCA/MCE_in_Xen, a proposal for having Xen start checking the information.
    At some point after that, Xen started accessing the EDAC information (now called "MCE"), which blocks the Linux kernel in dom0 from accessing it.
    (I also found what appear to be related slides from a 2012 presentation at: https://lkml.iu.edu/hypermail/linux/kernel/1206.3/01304/xen_vMCE_design_%28v0_2%29.pdf )
    
    Now, the Linux kernel compile option CONFIG_XEN_MCE_LOG=y is documented as: "Allow kernel fetching MCE error from Xen platform and converting it into Linux mcelog format for mcelog tools".
       I imagine there must be some way to make this work on the Xen side, given that CONFIG_XEN_MCE_LOG got into the Linux kernel and is enabled by default in distributions.
       (Notes: mcelog seems to have been replaced with "rasdaemon", but I believe it pulls information using the same kernel API as "mcelog". So I believe the final answer on whether you are having memory errors now comes from running "ras-mc-ctl --errors" instead of looking in /sys/devices/system/edac/mc and /sys/devices/system/edac/pci.)
    I suspect that to check whether it is working on a modern system, one would run "ras-mc-ctl --status" and get something implying that the Xen MCE interface is working, instead of just "ras-mc-ctl: drivers not loaded."
    Somewhere it was said that adding the Xen boot parameter "mce=1" in grub would cause Xen to forward the info to the Linux kernel, but that conflicts with recent changes to the documentation.  I also tested setting "mce=1", and nothing appeared to change.
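
As a first sanity check on the dom0 side, here is a minimal Python sketch that looks for CONFIG_XEN_MCE_LOG in the running kernel's configuration. It assumes the config is available at /boot/config-<release> or /proc/config.gz, which is common but not guaranteed on every distribution; the function names are hypothetical. It only shows that the option was compiled in, not that Xen is actually forwarding anything.

```python
import gzip
import os
from pathlib import Path


def option_enabled(config_text, option="CONFIG_XEN_MCE_LOG"):
    """True if the kernel config text sets the option to y or m."""
    wanted = (f"{option}=y", f"{option}=m")
    return any(line.strip() in wanted for line in config_text.splitlines())


def load_kernel_config():
    """Return the running kernel's config text, or None if not found.

    Both locations are assumptions about the distribution's layout.
    """
    try:
        boot_cfg = Path("/boot") / f"config-{os.uname().release}"
        if boot_cfg.is_file():
            return boot_cfg.read_text()
        proc_cfg = Path("/proc/config.gz")
        if proc_cfg.is_file():
            with gzip.open(proc_cfg, "rt") as fh:
                return fh.read()
    except OSError:
        pass
    return None


if __name__ == "__main__":
    text = load_kernel_config()
    if text is None:
        print("kernel config not found in the assumed locations")
    else:
        state = "enabled" if option_enabled(text) else "not enabled"
        print(f"CONFIG_XEN_MCE_LOG is {state}")
```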


Any help is appreciated.


 

