
Re: Thoughts on current Xen EDAC/MCE situation



On Wed, Jan 24, 2024 at 07:20:56AM -0800, Elliott Mitchell wrote:
> On Wed, Jan 24, 2024 at 08:23:15AM +0100, Jan Beulich wrote:
> > 
> > Third, as to Dom0's purposes of having the address: If all it is to use
> > it for is to pass it back to Xen, paths in the respective drivers will
> > necessarily be entirely different for the Xen vs the native cases.
> 
> I'm less than certain of the best place for Xen to intercept MCE events.
> For UE memory events, the simplest approach on Linux might be to wrap the
> memory_failure() function.  Yet for Linux/x86,
> mce_register_decode_chain() also looks like a very good candidate.

I did hope to get some response.

It really does look like, aside from being x86-only,
mce_register_decode_chain() is the ideal hook point.  Xen could forward
NMIs to Domain 0, then intercept them from the decode chain.  For UEs
Xen would mark the event handled, then create a new event for whichever
domain (if any) was affected.
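
To make the idea concrete, here is a rough userspace model of a priority-ordered decode chain. It is not kernel code: every `sketch_*` name is hypothetical, and the real mce_register_decode_chain() takes a `struct notifier_block` with a different callback signature. It assumes a Xen notifier registered at high priority that consumes UEs and stops the chain, while other events fall through to the ordinary decoders.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins loosely modelled on notifier return values (hypothetical). */
enum { SKETCH_NOTIFY_DONE = 0, SKETCH_NOTIFY_STOP = 1 };

/* Simplified event record; the real struct mce carries far more. */
struct sketch_mce {
    unsigned long long addr; /* address reported with the error */
    bool uncorrected;        /* true for an uncorrected error (UE) */
    bool handled;            /* set once a handler has consumed the event */
};

struct sketch_notifier {
    int (*call)(struct sketch_notifier *nb, struct sketch_mce *m);
    int priority;            /* higher priority runs first */
    struct sketch_notifier *next;
};

static struct sketch_notifier *sketch_chain;

/* Insert a notifier, keeping the chain in descending priority order. */
static void sketch_register_decode_chain(struct sketch_notifier *nb)
{
    struct sketch_notifier **p = &sketch_chain;

    while (*p && (*p)->priority >= nb->priority)
        p = &(*p)->next;
    nb->next = *p;
    *p = nb;
}

/* Walk the chain until a notifier asks to stop; return how many ran. */
static int sketch_call_chain(struct sketch_mce *m)
{
    int calls = 0;

    for (struct sketch_notifier *nb = sketch_chain; nb; nb = nb->next) {
        calls++;
        if (nb->call(nb, m) == SKETCH_NOTIFY_STOP)
            break;
    }
    return calls;
}

/* Counts events a hypothetical Xen hook would hand to the hypervisor. */
static int sketch_forwarded;

/* Hypothetical Xen hook: consume UEs, let everything else pass through. */
static int sketch_xen_notifier(struct sketch_notifier *nb, struct sketch_mce *m)
{
    (void)nb;
    if (!m->uncorrected)
        return SKETCH_NOTIFY_DONE;
    m->handled = true;   /* mark the event handled for Domain 0 */
    sketch_forwarded++;  /* Xen would re-inject for the affected domain */
    return SKETCH_NOTIFY_STOP;
}

/* Counts events that reach the stand-in native decoder. */
static int sketch_edac_calls;

/* Stand-in for an ordinary EDAC decoder further down the chain. */
static int sketch_edac_notifier(struct sketch_notifier *nb, struct sketch_mce *m)
{
    (void)nb;
    (void)m;
    sketch_edac_calls++;
    return SKETCH_NOTIFY_DONE;
}
```

Registering sketch_xen_notifier at a higher priority than sketch_edac_notifier means a UE never reaches the native decoder, while a corrected error still does.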


Right now my main concern is that several of the Linux MCE/EDAC drivers
are growing `if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) return -ENODEV;`
checks.

Those checks poison this approach, which will become quite difficult if
the trend isn't stopped.  The justification found for one instance was
that it "removed one message", with no other useful information.  I
cannot help suspecting it involved a hypervisor from Redmond, WA and that
their engineers are encouraged to poison interfaces used by others.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
