Re: [Xen-devel] RFC: MCA/MCE concept
On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >case I) - Xen receives a MCE from the CPU
> >
> >1) Xen MCE handler figures out if error is a correctable error (CE)
> >   or uncorrectable error (UE)
> >2a) error == CE:
> >    Xen notifies Dom0 if Dom0 installed an MCA event handler
> >    for statistical purpose
> >2b) error == UE and UE impacts Xen or Dom0:
>
> A very important aspect here is how you want to classify what impact an
> uncorrectable has - generally, I can see very few situations where you
> could confine the impact to a sub-portion of the system (i.e. a single
> domU, dom0, or Xen). The general rule in my opinion must be to halt the
> system, the question just is how likely it is that you can get a meaningful
> message out (to screen, serial, or logs) that can help analyze the problem
> afterwards. If it is somewhat likely, then dom0 should be involved,
> otherwise Xen should just shut down the system.

Here it helps most to use HW features to handle errors.
AMD CPUs have featured online-spare RAM and Chipkill since K8 RevF.
CPUs such as the SPARC feature Data Poisoning. That would be the
most handy technique that can be used here.

Maybe this line:

> >    Xen does some self-healing

should be this:

     Xen *tries* to do some self-healing

> >    and notifies Dom0 on success if Dom0 installed MCA event handler
> >    or Xen panics on failure

The first implementation can just panic here. The self-healing will be
implemented and improved over time.
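To make the revised 2a)/2b) wording concrete, here is a minimal sketch of the decision in Python. It is only a model of the flow discussed above, not Xen code: the names `Severity`, `Action`, `never_heals` and `handle_xen_or_dom0_error` are hypothetical, and the `NO_NOTIFICATION` outcome (no Dom0 handler installed) is an assumption, since the proposal does not say what happens in that case.

```python
from enum import Enum, auto
from typing import Callable


class Severity(Enum):
    CORRECTABLE = auto()    # CE
    UNCORRECTABLE = auto()  # UE


class Action(Enum):
    NOTIFY_DOM0 = auto()
    NO_NOTIFICATION = auto()  # assumption: nothing is sent without a Dom0 handler
    PANIC = auto()


def never_heals() -> bool:
    """Models the first implementation: no self-healing, so a UE always panics."""
    return False


def handle_xen_or_dom0_error(
    severity: Severity,
    dom0_has_handler: bool,
    try_self_heal: Callable[[], bool] = never_heals,
) -> Action:
    """Decide what Xen does for an error that impacts Xen or Dom0."""
    if severity is Severity.UNCORRECTABLE and not try_self_heal():
        # 2b) self-healing failed (or is not implemented yet)
        return Action.PANIC
    # 2a) CE, or 2b) UE that was healed: tell Dom0 only if it asked for events
    return Action.NOTIFY_DOM0 if dom0_has_handler else Action.NO_NOTIFICATION
```

Passing the healing step in as a callable keeps the panic-only first implementation and a later self-healing one behind the same decision function.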
> >2c) error == UE and UE impacts DomU:
> >    In case of Dom0 installed MCA event handler:
> >       Xen notifies Dom0 and Dom0 tells Xen whether
> >       to also notify DomU and/or does some operations
> >       on the DomU (case II)
> >    In case Dom0 did not install MCA event handler,
> >       Xen notifies DomU
> >3a) DomU is a PV guest:
> >    if DomU installed MCA event handler, it gets notified to perform
> >       self-healing
> >    if DomU did not install MCA event handler, notify Dom0 to do
> >       some operations on DomU (case II)
> >    if neither DomU nor Dom0 installed MCA event handlers,
> >       then Xen kills DomU
> >3b) DomU is a HVM guest:
> >    if DomU features a PV driver then behave as in 3a)
>
> What significance do pv drivers have here? Or do you mean a pv MCA
> driver?

Yes, I mean the pv MCA driver.

> >    if DomU enabled MCA/MCE via MSR, inject MCE into guest
> >    if DomU did not enable MCA/MCE via MSR, notify Dom0
> >       to do some operations on DomU (case II)
> >    if neither DomU enabled MCA/MCE nor Dom0 installed
> >       MCA event handler, Xen kills DomU
>
> Injecting an MCE to a hvm guest seems at least questionable. It can't
> really do anything about it (it doesn't even know the real topology of the
> system it's running on, so addresses stored in MSRs are meaningless -
> either you allow them to be read untranslated [in which case the guest
> cannot make sense of them] or you do translation for the guest [in which
> case it might make assumptions about co-locality of other nearby pages
> which will be wrong]).

Yes, Xen should do the translation for the guest. The assumptions must
be fixed then. I know that's easier said than done.

> Doing this to a pv domU for purely notification purposes (where the guest
> knows it's running virtualized) is clearly a different matter.

Yes, I agree with you here. The general idea behind informing a DomU is
to let its own fault management handle the error.
It is always better to let it kill a screen saver process and keep the
word processor running than to kill the whole guest. The DomU should
crash itself if it thinks that's the best.

> >case II) - Xen receives Dom0 instructions via Hypercall
> >
> >There are different reasons, why Xen should do something.
> >
> >  - Dom0 got enough CEs so that UEs are very likely to happen in order
> >    to "circumvent" UEs.
> >  - Possible operations on a DomU
> >      - save/restore DomU
> >      - (live-)migrate DomU to a different physical machine
> >      - etc.
>
> Very heavy-weight operations, which I think are unlikely to succeed if
> you already suspect the system's going to suffer a UE soon.

Yes, they are heavy-weight operations. Do you have some ideas about what
Dom0 can do? The idea here is that Dom0's fault management helps
guests to survive as best as possible.

Christoph

--
AMD Saxony, Dresden, Germany
Operating System Research Center
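For reference, the fallback order from cases 3a) and 3b) can be written down as one small decision function. This is a sketch under assumptions, not an implementation: `DomUAction` and `domu_ue_action` are hypothetical names, and the function only models what happens once Xen has decided to deal with the DomU directly. It leaves out the Dom0-first step of case 2c).

```python
from enum import Enum, auto


class DomUAction(Enum):
    NOTIFY_DOMU = auto()  # pv MCA handler in the guest performs self-healing
    INJECT_MCE = auto()   # HVM guest that enabled MCA/MCE itself
    NOTIFY_DOM0 = auto()  # Dom0 does some operations on the DomU (case II)
    KILL_DOMU = auto()    # nobody can handle the error


def domu_ue_action(
    is_hvm: bool,
    domu_has_pv_mca_handler: bool,
    hvm_mce_enabled: bool,
    dom0_has_handler: bool,
) -> DomUAction:
    """Pick the action for a UE that impacts a DomU (cases 3a and 3b)."""
    if domu_has_pv_mca_handler:
        # 3a), and 3b) with a pv MCA driver: the guest handles it itself
        return DomUAction.NOTIFY_DOMU
    if is_hvm and hvm_mce_enabled:
        # 3b) HVM guest enabled MCA/MCE via MSR
        return DomUAction.INJECT_MCE
    if dom0_has_handler:
        # the guest cannot handle it, so Dom0 operates on the DomU
        return DomUAction.NOTIFY_DOM0
    # neither the DomU nor Dom0 can handle the error
    return DomUAction.KILL_DOMU
```

A PV guest never receives an injected MCE in this sketch: `hvm_mce_enabled` is ignored unless `is_hvm` is set.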