Xen project Mailing List

RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler

To: "JBeulich@xxxxxxxx" <JBeulich@xxxxxxxx>, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>

From: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>

Date: Sat, 8 Oct 2011 16:29:13 +0800

Accept-language: en-US

Acceptlanguage: en-US

Cc: "keir.xen@xxxxxxxxx" <keir.xen@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>

Delivery-date: Sat, 08 Oct 2011 01:30:38 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: Acx/Qhd6SSxYVjF3SFeFyfeM4atJZAGTM3kQ

Thread-topic: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler

> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@xxxxxxxx] > Sent: Friday, September 30, 2011 3:25 PM > To: Liu, Jinsong; Jiang, Yunhong > Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx > Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler > > >>> On 30.09.11 at 04:51, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote: > > > > >> -----Original Message----- > >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx] > >> Sent: Thursday, September 29, 2011 11:42 PM > >> To: Liu, Jinsong > >> Cc: Keir Fraser; Jiang, Yunhong; xen-devel@xxxxxxxxxxxxxxxxxxx > >> Subject: Re: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler > >> > >> >>> On 29.09.11 at 17:20, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote: > >> > @@ -782,8 +821,12 @@ static void intel_default_mce_uhandler( > >> > > >> > switch (type) > >> > { > >> > - /* Panic if no handler for SRAR error */ > >> > case intel_mce_ucr_srar: > >> > + if ( !guest_mode(regs) ) > >> > + *result = MCER_RESET; > >> > + else > >> > + *result = MCER_CONTINUE; > >> > + break; > >> > case intel_mce_fatal: > >> > *result = MCER_RESET; > >> > break; > >> > >> Using the stack based registers for any decision in an MCE handler > >> seems bogus to me - without knowing that the error occurred > >> synchronously, the result is meaningless. Unfortunately I wasn't > > > > I think the usage of stack in MCE handler should be case by case. > > For example, it's ok to use it at data load instruction since the EIPV is > > valid for it. > > According to the table in the manual, this is only the case on the local > thread. Sorry for long delay response because of china holiday . That's about the so-called "affected processors". For the un-affected processor, it should not be impact. But current patch does not handle the share bank situation correctly and need some changes. According to the SDM, only affected processor should clear the bank, un-affected process should not clear it, current patch does not handle such situation. > > > For the instruction load, not that sure and I will check it internally. > > > > But agree that we should not do this depends on the error type (like SRAR), > > but should depends on the specific error code. > > > >> able to spot - throughout your patch - what SRAR actually stands > >> for, and the manual is no help in that respect either. It does state, > >> however, that EIPV in three of four cases would be clear for these, > >> so using the registers on stack is likely wrong here. > >> > >> This made me look at the current source, and there I see in > >> mce_urgent_action() > >> > >> if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs)) > >> return -1; > >> > >> which I think should say ... _EIPV and use || instead. Thoughts? > > > > I think this code means, if the error happens in hypervisor mode (i.e. > > !guest_mode()), and RIPV indicate the RIP in stack can't be restarted, we > > have to panic. > > Then the guest_mode() check still lacks an extra check of EIPV, like > > if ( !(gstatus & MCG_STATUS_RIPV) && > (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs))) > return -1; > The RIPV is not related to the EIPV. RIPV means the context saved in the stack can't be restarted anymore. According to the SDM, RIPV means "execution can be restarted reliably at the instruction pointed to by the instruction pointer pushed on the stack". It's not about error happened synchronously or asynchronously. The point is, if the program is running in hypervisor context, and RIPV tells us that the program can't be restarted, we can't do anything but panic, because we can't switch context while we are in xen. So this code have nothing to do with EIPV. If Xen is pre-emptible and can be switched to a new context, possibly sometimes (like in shadow handling) we can switch to a new context and don't need such detection, but at least currently we don't want to involve so complex handling. The boundary condition of syscall/sysret is something interesting. If a Instruction Fetch Error happens in hypervisor's syscall entry point, it will cause system panic because in the end, we should find it's a xen page broken by checking the MCi_ADDR MSR. And if guest's sysret rip is broken, it will overkill xen hypervisor for a guest error. But possibly the handling result is acceptable considering the low possibility and there is no potential data pollution. Thanks --jyh > Jan

Attachment: smime.p7s
Description: S/MIME cryptographic signature

_______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.