RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
> On Tuesday 17 February 2009 07:41:29 Jiang, Yunhong wrote:
>> I think the major differences include: a) How to handle the #MC, i.e.
>> reset the system, decide the impacted components, take recovery actions
>> like page offline etc. b) How to handle errors that impact a guest. As to
>> other items like log/telemetry, I think our implementation is not much
>> different from the current implementation.
>
> The hardware doesn't know what recovery actions the software can take.
> If page A is faulty, and software maintains a copy in page B, then
> software can turn an uncorrectable error into a correctable one.
> If the hardware is aware of that copy (memory mirroring done by the memory
> controller), then the hardware itself turns the uncorrectable error
> into a correctable one and reports a correctable error.
>
> Therefore, I don't see why flags other than correctable and uncorrectable
> are needed at all.

Christoph, thanks for your reply. I think "recoverable" means the VMM/OS can take a recovery action such as page offlining, while "unrecoverable" means the VMM/OS can't do anything and we have to reboot.

The main reason we need these flags is that MCA handling takes several steps. For example, when multiple MCEs hit multiple CPUs, first each CPU checks its own severity, and second we need to find the CPU with the highest severity and take action. For example, CPU A may get unrecoverable while CPU B gets recoverable; they will check the information and the result, and the final outcome will be unrecoverable.

>
> After some thinking on taking some quick actions, I can
> agree on it if it meets the condition below. Be aware, error analysis
> is highly CPU vendor and even CPU family/model specific. Doing a
> complete analysis as Solaris does blows Xen up a *lot*.

I didn't check the Solaris code, so can Gavin or Frank give us more information?
At least currently it will not be large AFAIK, and if we do need model-specific support (I don't know of such a requirement now, and I suppose it will not be common if it exists, please correct me if I'm wrong), dom0 can inform Xen of it.

>
> Therefore, a *cheap* error analysis must be enough to figure out
> if recovery actions like page-offlining or cpu offlining
> are *obviously* the only right thing to do.

Currently we only plan to support these two types. Do you have plans for other recovery actions? And would such an action be better done in Dom0 than in Xen?

Thanks
-- Yunhong Jiang

>
> If this is not the case, then let Dom0 decide what to do.
>
> Christoph
>
>
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
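
To illustrate the two-step severity check described in the reply above (each CPU classifies its own error, then the worst result across all CPUs decides the action), here is a minimal C sketch. All names (`mc_severity`, `mc_action`, `mc_worst_severity`, `mc_action_for`) are hypothetical and are not taken from the Xen source. The mapping from severity to action is only an assumption based on this discussion, not the proposed implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered so that a larger value is worse. */
enum mc_severity {
    MC_SEV_NONE = 0,       /* no error logged on this CPU */
    MC_SEV_CORRECTED,      /* hardware already corrected the error */
    MC_SEV_RECOVERABLE,    /* VMM/OS can act, e.g. offline a page or CPU */
    MC_SEV_UNRECOVERABLE   /* nothing can be done, the system must reboot */
};

/* Hypothetical follow-up actions (assumed mapping, for illustration only). */
enum mc_action {
    MC_ACT_NONE,
    MC_ACT_LOG,            /* log/telemetry only */
    MC_ACT_RECOVER,        /* page offline or CPU offline */
    MC_ACT_RESET           /* reset the system */
};

/*
 * Step 2 of the scheme: given the severity each CPU computed for itself
 * (step 1), return the worst one. One unrecoverable CPU makes the whole
 * event unrecoverable, as in the "CPU A / CPU B" example.
 */
enum mc_severity mc_worst_severity(const enum mc_severity *per_cpu,
                                   size_t ncpus)
{
    enum mc_severity worst = MC_SEV_NONE;
    size_t i;

    assert(per_cpu != NULL || ncpus == 0);

    for (i = 0; i < ncpus; i++) {
        if (per_cpu[i] > worst)
            worst = per_cpu[i];
    }
    return worst;
}

/* Map the merged severity to a single system-wide action. */
enum mc_action mc_action_for(enum mc_severity sev)
{
    switch (sev) {
    case MC_SEV_UNRECOVERABLE:
        return MC_ACT_RESET;
    case MC_SEV_RECOVERABLE:
        return MC_ACT_RECOVER;
    case MC_SEV_CORRECTED:
        return MC_ACT_LOG;
    default:
        return MC_ACT_NONE;
    }
}
```

With CPU A reporting `MC_SEV_UNRECOVERABLE` and CPU B reporting `MC_SEV_RECOVERABLE`, `mc_worst_severity` returns `MC_SEV_UNRECOVERABLE` and `mc_action_for` selects `MC_ACT_RESET`. Ordering the enum by severity keeps the merge step a plain maximum, which fits the "cheap analysis" constraint raised in the quoted text.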