Re: [Xen-devel] RFC: MCA/MCE concept



On Wednesday 30 May 2007 10:49:40 Jan Beulich wrote:
> >>> "Christoph Egger" <Christoph.Egger@xxxxxxx> 30.05.07 09:45 >>>
> >
> >On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >> >case I) - Xen receives an MCE from the CPU
> >> >
> >> >1) Xen MCE handler figures out if error is a correctable error (CE)
> >> >    or uncorrectable error (UE)
> >> >2a) error == CE:
> >> >     Xen notifies Dom0 if Dom0 installed an MCA event handler
> >> >     for statistical purposes
> >> >2b) error == UE and UE impacts Xen or Dom0:
> >>
> >> A very important aspect here is how you want to classify what impact an
> >> uncorrectable error has - generally, I can see very few situations where you
> >> could confine the impact to a sub-portion of the system (i.e. a single
> >> domU, dom0, or Xen). The general rule in my opinion must be to halt the
> >> system; the question is just how likely it is that you can get a
> >> meaningful message out (to screen, serial, or logs) that can help
> >> analyze the problem afterwards. If it is somewhat likely, then dom0
> >> should be involved, otherwise Xen should just shut down the system.
> >
> >Here, hardware features for error handling help the most.
> >AMD CPUs have featured online-spare RAM and Chipkill since K8 RevF.
> >
> >CPUs such as SPARC feature data poisoning. That would be the
> >handiest technique to use here.
>
> But that assumes the error is recoverable (i.e. no other data got
> corrupted). You still haven't clarified how you intend to determine the
> impact an uncorrectable error had.

I know. I have no good idea for that yet.
That's why I am discussing this here before writing code that goes nowhere.
Does anyone here have a flash of genius? :-)


> >> >3a) DomU is a PV guest:
> >> >       if DomU installed MCA event handler, it gets notified to perform
> >> >          self-healing
> >> >       if DomU did not install MCA event handler, notify Dom0 to do
> >> >          some operations on DomU (case II)
> >> >       if neither DomU nor Dom0 installed an MCA event handler,
> >> >          then Xen kills DomU
> >> >3b) DomU is a HVM guest:
> >> >       if DomU features a PV driver then behave as in 3a)
> >>
> >> What significance do pv drivers have here? Or do you mean a pv MCA
> >> driver?
> >
> >Yes, I mean the pv MCA driver.
> >
> >> >       if DomU enabled MCA/MCE via MSR, inject MCE into guest
> >> >       if DomU did not enable MCA/MCE via MSR, notify Dom0
> >> >            to do some operations on DomU (case II)
> >> >       if neither DomU enabled MCA/MCE nor Dom0 installed an
> >> >            MCA event handler, Xen kills DomU
> >>
> >> Injecting an MCE into an HVM guest seems at least questionable. It can't
> >> really do anything about it (it doesn't even know the real topology of
> >> the system it's running on, so addresses stored in MSRs are meaningless
> >> - either you allow them to be read untranslated [in which case the guest
> >> cannot make sense of them] or you do translation for the guest [in which
> >> case it might make assumptions about co-locality of other nearby pages
> >> which will be wrong]).
> >
> >Yes, Xen should do the translation for the guest. The assumptions must
> >be fixed then. I know that's easier said than done.
>
> Exactly - you are proposing to fix all possible OSes, including
> sufficiently old ones. That's impossible. And I can't even see why an OS
> intended to run on native hardware would care to try to deal with
> virtualization aspects like this.

I think it was not obvious that Xen should not inject errors into
a DomU that has no fault management. In that case, either Dom0
tells Xen what to do with the DomU, or Xen just kills the DomU.

<snippet from above>
> >> >3a) DomU is a PV guest:
                    ....
> >> >       if DomU did not install MCA event handler, notify Dom0 to do
> >> >          some operations on DomU (case II)
> >> >       if neither DomU nor Dom0 installed an MCA event handler,
> >> >          then Xen kills DomU

> >> >3b) DomU is a HVM guest:
                    ....
> >> >       if DomU did not enable MCA/MCE via MSR, notify Dom0
> >> >            to do some operations on DomU (case II)
> >> >       if neither DomU enabled MCA/MCE nor Dom0 installed an
> >> >            MCA event handler, Xen kills DomU
</snippet>
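
To make the fallback order in 3a) and 3b) concrete, here is a minimal
sketch of the routing decision. It only illustrates the order proposed
above and is not Xen code: the names DomU, Action and
route_uncorrectable_error are hypothetical, and it assumes that a pv MCA
driver in an HVM guest is treated exactly like a PV guest's MCA event
handler. The INJECT_MCE branch is the part that is still disputed in
this thread.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    NOTIFY_DOMU = auto()  # DomU's own MCA event handler does the self-healing
    INJECT_MCE = auto()   # HVM only: an MCE is injected into the guest
    NOTIFY_DOM0 = auto()  # Dom0 does some operations on the DomU (case II)
    KILL_DOMU = auto()    # nobody is able to handle the error


@dataclass
class DomU:
    # Hypothetical per-guest state; none of these fields are real Xen names.
    is_hvm: bool
    has_mca_handler: bool = False      # PV MCA event handler or pv MCA driver
    mce_enabled_via_msr: bool = False  # only meaningful for HVM guests


def route_uncorrectable_error(domu: DomU, dom0_has_handler: bool) -> Action:
    """Pick who deals with an uncorrectable error that impacts one DomU."""
    # 3a), and 3b) with a pv MCA driver: the guest handles it itself.
    if domu.has_mca_handler:
        return Action.NOTIFY_DOMU
    # 3b) without pv MCA driver: inject only if the guest enabled MCA/MCE.
    if domu.is_hvm and domu.mce_enabled_via_msr:
        return Action.INJECT_MCE
    # The guest has no fault management: fall back to Dom0 if it listens.
    if dom0_has_handler:
        return Action.NOTIFY_DOM0
    # Neither the DomU nor Dom0 can handle it.
    return Action.KILL_DOMU
```

For example, an HVM guest that has neither a pv MCA driver nor MCA/MCE
enabled via MSR never gets anything injected: the event goes to Dom0 if
Dom0 installed a handler, and otherwise Xen kills the DomU.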


Christoph

-- 
AMD Saxony, Dresden Germany
Operating System Research Center

Legal Information:
AMD Saxony Limited Liability Company & Co. KG
Registered office (business address):
   Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register court Dresden: HRA 4896
General partner authorized to represent:
   AMD Saxony LLC (registered office Wilmington, Delaware, USA)
Managing directors of AMD Saxony LLC:
   Dr. Hans-R. Deppe, Thomas McCoy


