
Re: [Xen-devel] RFC: MCA/MCE concept



On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >case I) - Xen receives an MCE from the CPU
> >
> >1) Xen MCE handler figures out if error is a correctable error (CE)
> >    or uncorrectable error (UE)
> >2a) error == CE:
> >     Xen notifies Dom0 if Dom0 installed an MCA event handler
> >     for statistical purpose
> >2b) error == UE and UE impacts Xen or Dom0:
>
> A very important aspect here is how you want to classify what impact an
> uncorrectable error has - generally, I can see very few situations where you
> could confine the impact to a sub-portion of the system (i.e. a single
> domU, dom0, or Xen). The general rule in my opinion must be to halt the
> system, the question just is how likely it is that you can get a meaningful
> message out (to screen, serial, or logs) that can help analyze the problem
> afterwards. If it is somewhat likely, then dom0 should be involved,
> otherwise Xen should just shut down the system.

This is where hardware error-handling features help most.
AMD CPUs have featured online-spare RAM and Chipkill since K8 Rev. F.

CPUs such as SPARC feature data poisoning. That would be the
handiest technique to use here.

Maybe this line:

> >     Xen does some self-healing

should be this:

            Xen *tries* to do some self-healing
> >         and notifies Dom0 on success if Dom0 installed MCA event handler
> >         or Xen panics on failure

The first implementation can just panic here. The self-healing will be
implemented and improved over time.
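
As a rough sketch of the 2b) flow described above (Python used as
pseudocode; the function, the enum and the `try_self_heal` callback are
made-up names for illustration, not existing Xen interfaces):

```python
from enum import Enum


class Outcome(Enum):
    NOTIFIED_DOM0 = "healed, Dom0 notified"
    HEALED_SILENT = "healed, Dom0 has no MCA event handler"
    PANIC = "Xen panics"


def handle_ue_in_xen_or_dom0(try_self_heal, dom0_has_handler):
    """Step 2b): a UE that impacts Xen or Dom0.

    try_self_heal is a hypothetical callable returning True on success.
    A first implementation can pass `lambda: False`, which always panics.
    """
    if not try_self_heal():
        return Outcome.PANIC
    if dom0_has_handler:
        return Outcome.NOTIFIED_DOM0
    return Outcome.HEALED_SILENT
```

With this shape the panic-only first version and a later self-healing
version differ only in the callback that is plugged in.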

> >2c)  error == UE and UE impacts DomU:
> >      In case of Dom0 installed MCA event handler:
> >          Xen notifies Dom0 and Dom0 tells Xen whether
> >              to also notify DomU and/or does some operations
> >              on the DomU (case II)
> >       In case Dom0 did not install MCA event handler,
> >           Xen notifies DomU
> >3a) DomU is a PV guest:
> >       if DomU installed MCA event handler, it gets notified to perform
> >          self-healing
> >       if DomU did not install MCA event handler, notify Dom0 to do
> >          some operations on DomU (case II)
> >       if neither DomU nor Dom0 installed MCA event handlers,
> >          then Xen kills DomU
> >3b) DomU is a HVM guest:
> >       if DomU features a PV driver then behave as in 3a)
>
> What significance do pv drivers have here? Or do you mean a pv MCA
> driver?

Yes, I mean the pv MCA driver.

>
> >       if DomU enabled MCA/MCE via MSR, inject MCE into guest
> >       if DomU did not enable MCA/MCE via MSR, notify Dom0
> >            to do some operations on DomU (case II)
> >       if neither DomU enabled MCA/MCE nor Dom0 installed
> >            MCA event handler, Xen kills DomU
>
> Injecting an MCE to a hvm guest seems at least questionable. It can't
> really do anything about it (it doesn't even know the real topology of the
> system it's running on, so addresses stored in MSRs are meaningless -
> either you allow them to be read untranslated [in which case the guest
> cannot make sense of them] or you do translation for the guest [in which
> case it might make assumptions about co-locality of other nearby pages
> which will be wrong]).

Yes, Xen should do the translation for the guest. The assumptions must
be fixed then. I know that's easier said than done.
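
To illustrate what such a translation could look like, here is a minimal
sketch (Python as pseudocode). It assumes a hypothetical per-guest table
`m2p` mapping machine frame numbers to guest frame numbers; the function
name and the choice to hide addresses of frames the guest does not own
are my assumptions, not an existing Xen interface. The only hardware
detail used is that bit 58 (ADDRV) of MCi_STATUS marks MCi_ADDR as valid.

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1
MCI_STATUS_ADDRV = 1 << 58  # MCi_STATUS bit 58: MCi_ADDR holds a valid address


def translate_for_guest(status, machine_addr, m2p):
    """Return (status, addr) as the guest should see them.

    m2p is a hypothetical per-guest dict: machine frame -> guest frame.
    """
    if not status & MCI_STATUS_ADDRV:
        # The hardware reported no address, so there is nothing to translate.
        return status, 0
    mfn = machine_addr >> PAGE_SHIFT
    gfn = m2p.get(mfn)
    if gfn is None:
        # The frame does not belong to this guest: clear ADDRV rather
        # than expose a machine address the guest cannot interpret.
        return status & ~MCI_STATUS_ADDRV, 0
    # Keep the offset within the page, swap the frame number.
    return status, (gfn << PAGE_SHIFT) | (machine_addr & PAGE_MASK)
```

Note that the translation is done per frame, so it does not preserve any
co-locality: neighbouring guest pages may sit in unrelated machine
frames, which is exactly the assumption a guest would have to drop.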

> Doing this to a pv domU for purely notification purposes (where the guest
> knows it's running virtualized) is clearly a different matter.

Yes, I agree with you here. The general idea behind informing a DomU
is to let its own fault management handle the error. It is always better to
let it kill a screen saver process and keep the word processor running than
to kill the whole guest. The DomU should crash itself if it thinks that is
best.
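
The 3a)/3b) decision for a UE that hits a DomU can be sketched as follows
(Python as pseudocode; all names are hypothetical). The sketch assumes
that "pv MCA driver" in an HVM guest and "MCA event handler" in a PV
guest are the same thing from Xen's point of view, and it deliberately
leaves out the 2c) step where Dom0 decides whether the DomU is notified
at all.

```python
from enum import Enum


class Action(Enum):
    NOTIFY_DOMU_PV = "notify DomU through its PV MCA event handler"
    INJECT_MCE = "inject MCE into the HVM guest"
    NOTIFY_DOM0 = "notify Dom0 to operate on the DomU (case II)"
    KILL_DOMU = "Xen kills the DomU"


def action_for_domu_ue(is_hvm, pv_mca_handler, mce_enabled_via_msr,
                       dom0_has_handler):
    """Pick the action for an uncorrectable error that impacts a DomU."""
    if pv_mca_handler:
        # 3a), and 3b) when the HVM guest runs a pv MCA driver.
        return Action.NOTIFY_DOMU_PV
    if is_hvm and mce_enabled_via_msr:
        # 3b): the guest enabled MCA/MCE through the MSRs.
        return Action.INJECT_MCE
    if dom0_has_handler:
        # The DomU cannot help itself, so let Dom0 act on it.
        return Action.NOTIFY_DOM0
    # Neither the DomU nor Dom0 can take the event.
    return Action.KILL_DOMU
```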


> >case II) - Xen receives Dom0 instructions via Hypercall
> >
> >There are different reasons why Xen should do something.
> >
> >   - Dom0 got enough CEs that UEs are very likely to happen, and acts
> >      in order to "circumvent" UEs.
> >   - Possible operations on a DomU
> >        - save/restore DomU
> >        - (live-)migrate DomU to a different physical machine
> >        - etc.
>
> Very heavy-weight operations, which I think are unlikely to succeed if
> you already suspect the system's going to suffer a UE soon.

Yes, they are heavy-weight operations. Do you have any ideas about what
Dom0 could do?

The idea here is that the Dom0's fault management helps guests to
survive as well as possible.
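
To make "Dom0 got enough CEs" a bit more concrete, here is a minimal
sketch of how Dom0's fault management might count correctable errors
(Python as pseudocode). The class, the per-component keys, and the
threshold and window values are all assumptions for illustration.

```python
from collections import defaultdict, deque


class CeTracker:
    """Count correctable errors per component in a sliding time window."""

    def __init__(self, threshold=3, window=60.0):
        self.threshold = threshold  # CEs within the window that trigger action
        self.window = window        # window length in seconds
        self.events = defaultdict(deque)

    def record(self, component, now):
        """Record one CE; return True once the component looks likely to fail."""
        q = self.events[component]
        q.append(now)
        # Drop events that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) >= self.threshold
```

When `record()` returns True, Dom0 would issue the hypercall of case II.
Which operation it should request at that point is the open question.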

Christoph

-- 
AMD Saxony, Dresden Germany
Operating System Research Center

Legal Information:
AMD Saxony Limited Liability Company & Co. KG
Registered office (business address):
   Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register court Dresden: HRA 4896
General partner authorized to represent the company:
   AMD Saxony LLC (registered office Wilmington, Delaware, USA)
Managing directors of AMD Saxony LLC:
   Dr. Hans-R. Deppe, Thomas McCoy



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

