Thoughts on current Xen EDAC/MCE situation
I've been mentioning this on a regular basis, but the state of MCE handling with Xen seems poor.

I find the present handling of MCE in Xen an odd choice. Having Xen do most of the handling of MCE events is a behavior matching a traditional stand-alone hypervisor. Yet Xen was originally pushing any task not requiring hypervisor action onto Domain 0.

MCE seems a perfect match for sharing responsibility with Domain 0. Domain 0 needs to know about any MCE event; this is where system administrators will expect to find logs. In fact, if the event is a Correctable Error (CE), then *only* Domain 0 needs to know. For a CE, Xen may need no action at all (an implementation could need help) and the affected domain would need no action. It is strictly for Uncorrectable Errors (UE) that action besides logging is needed.

For a UE memory error, the best approach might be for Domain 0 to decode the error. Once Domain 0 determines it is a UE, it would invoke a hypercall to pass the GPFN to Xen. Xen would then forcibly unmap the page (similar to what Linux does to userspace for corrupted pages). Xen would then identify what the page was used for, alert the affected domain, and return that information to Domain 0.

The key advantage of this approach is that it makes MCE handling with Xen very similar to MCE handling without Xen. Documentation about how MCEs are reported and decoded would apply equally to Xen. Another rather important point is that it means less maintenance work to keep MCE handling working with cutting-edge hardware. I've noticed one vendor being sluggish about getting patches into Linux and I fear similar issues may apply more severely to Xen.

-- 
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@xxxxxxx PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
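
To make the proposed UE flow concrete, here is a toy model of it in Python. It is only a sketch of the division of labor described in the message: Domain 0 logs every event and, for a UE only, hands the GPFN to Xen, which unmaps the page, alerts the owner and reports the owner back. Every name here (`ToyXen`, `ToyDom0`, `report_ue_page`, `handle_mce`) is hypothetical; none of this corresponds to an existing Xen hypercall or Linux interface.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class Severity(Enum):
    CE = "correctable"
    UE = "uncorrectable"


@dataclass
class ToyXen:
    """Toy stand-in for the hypervisor; not a real Xen interface."""
    page_owner: Dict[int, int]                       # GPFN -> owning domain ID
    unmapped: Set[int] = field(default_factory=set)  # pages forcibly unmapped
    alerted: List[int] = field(default_factory=list) # domains told about a UE

    def report_ue_page(self, gpfn: int) -> Optional[int]:
        """Hypothetical hypercall: unmap the page, alert its owner, return the owner."""
        self.unmapped.add(gpfn)
        owner = self.page_owner.get(gpfn)
        if owner is not None:
            self.alerted.append(owner)
        return owner


@dataclass
class ToyDom0:
    """Toy stand-in for Domain 0, which decodes and logs every MCE."""
    xen: ToyXen
    log: List[str] = field(default_factory=list)

    def handle_mce(self, severity: Severity, gpfn: int) -> Optional[int]:
        """Log every event; involve Xen only for an uncorrectable error."""
        self.log.append(f"{severity.name} at gpfn {gpfn:#x}")
        if severity is Severity.CE:
            return None  # CE: logging in Domain 0 is the only action
        return self.xen.report_ue_page(gpfn)
```

In this model, `ToyDom0(ToyXen({0x1000: 3})).handle_mce(Severity.UE, 0x1000)` returns the owning domain ID, while the same call with `Severity.CE` only appends a log entry and leaves the page mapped.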