[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler




> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Friday, September 30, 2011 3:25 PM
> To: Liu, Jinsong; Jiang, Yunhong
> Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
> 
> >>> On 30.09.11 at 04:51, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> >> Sent: Thursday, September 29, 2011 11:42 PM
> >> To: Liu, Jinsong
> >> Cc: Keir Fraser; Jiang, Yunhong; xen-devel@xxxxxxxxxxxxxxxxxxx
> >> Subject: Re: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
> >>
> >> >>> On 29.09.11 at 17:20, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote:
> >> > @@ -782,8 +821,12 @@ static void intel_default_mce_uhandler(
> >> >
> >> >      switch (type)
> >> >      {
> >> > -    /* Panic if no handler for SRAR error */
> >> >      case intel_mce_ucr_srar:
> >> > +        if ( !guest_mode(regs) )
> >> > +            *result = MCER_RESET;
> >> > +        else
> >> > +            *result = MCER_CONTINUE;
> >> > +        break;
> >> >      case intel_mce_fatal:
> >> >          *result = MCER_RESET;
> >> >          break;
> >>
> >> Using the stack based registers for any decision in an MCE handler
> >> seems bogus to me - without knowing that the error occurred
> >> synchronously, the result is meaningless. Unfortunately I wasn't
> >
> > I think the usage of stack in MCE handler should be case by case.
> > For example, it's ok to use it at data load instruction since the EIPV
is
> > valid for it.
> 
> According to the table in the manual, this is only the case on the local
> thread.

Sorry for long delay response because of china holiday
.
That's about the so-called "affected processors". For the un-affected
processor, it should not be impact.
But current patch does not handle the share bank situation correctly and
need some changes. According to the SDM, only affected processor should
clear the bank, un-affected process should not clear it, current patch does
not handle such situation.

> 
> > For the instruction load, not that sure and I will check it internally.
> >
> > But agree that we should not do this depends on the error type (like
SRAR),
> > but should depends on the specific error code.
> >
> >> able to spot - throughout your patch - what SRAR actually stands
> >> for, and the manual is no help in that respect either. It does state,
> >> however, that EIPV in three of four cases would be clear for these,
> >> so using the registers on stack is likely wrong here.
> >>
> >> This made me look at the current source, and there I see in
> >> mce_urgent_action()
> >>
> >>     if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))
> >>         return -1;
> >>
> >> which I think should say ... _EIPV and use || instead. Thoughts?
> >
> > I think this code means, if the error happens in hypervisor mode (i.e.
> > !guest_mode()), and RIPV indicate the RIP in stack can't be restarted,
we
> > have to panic.
> 
> Then the guest_mode() check still lacks an extra check of EIPV, like
> 
>      if ( !(gstatus & MCG_STATUS_RIPV) &&
>           (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs)))
>          return -1;
> 

The RIPV is not related to the EIPV. RIPV means the context saved in the
stack can't be restarted anymore. According to the SDM, RIPV means
"execution can be restarted reliably at the instruction pointed to by the
instruction pointer pushed on the stack". It's not about error happened
synchronously or asynchronously. The point is, if the program is running in
hypervisor context, and RIPV tells us that the program can't be restarted,
we can't do anything but panic, because we can't switch context while we are
in xen. So this code have nothing to do with EIPV. 

If Xen is pre-emptible and can be switched to a new context, possibly
sometimes (like in shadow handling) we can switch to a new context and don't
need such detection, but at least currently we don't want to involve so
complex handling. 

The boundary condition of syscall/sysret is something interesting. If a
Instruction Fetch Error happens in hypervisor's syscall entry point, it will
cause system panic because in the end, we should find it's a xen page broken
by checking the MCi_ADDR MSR. And if guest's sysret rip is broken, it will
overkill xen hypervisor for a guest error. But possibly the handling result
is acceptable considering the low possibility and there is no potential data
pollution.
 
Thanks
--jyh
> Jan

Attachment: smime.p7s
Description: S/MIME cryptographic signature

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.