RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
Christoph Egger <mailto:Christoph.Egger@xxxxxxx> wrote:
> Ok, here is a different interpretation of what is correctable and
> uncorrectable. Uncorrectable in your interpretation means neither
> hardware nor software can do anything.
> Uncorrectable in my interpretation means the hardware can't correct it,
> but software may have more information and correct it.

Yes. Maybe "fatal" is a more appropriate name here.

>
>> The main reason we need these flags is that several steps are required
>> for MCA handling. For example, when multiple MCEs happen on multiple
>> CPUs, firstly each CPU checks its own severity, secondly we need to
>> find the most severe CPU and take action. For example, CPU A may get
>> unrecoverable while CPU B gets recoverable; they will check the
>> information and the result, and the final solution will be
>> unrecoverable.
>
> I brought up an example of a broken memory page for my argumentation,
> you bring up a broken CPU for your argumentation.
>
> We need to find a common denominator to compare.
>
> If a CPU is completely broken and you are on UP, then game is over. Not
> even a reboot can help. On a SMP system, offline the CPU and inform Dom0.

Sorry, I didn't get the relationship between the flags and the comparison
of the two examples :$

>> Currently we only plan to support these two types. Do you have plans
>> for other recovery actions? And is that action better done in Dom0
>> than in Xen?
>
> Yes!! Solaris maintains a list of broken pages which is even persistent
> across reboot when the serial number of the DIMM didn't change.
> For doing page offlining properly, SUN should design a hypercall
> allowing the Dom0 to give Xen this list as early as possible at boot
> time.
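For reference, the two-step severity check quoted earlier in this message (each CPU classifies its own error, then the worst result across all CPUs decides the action) could be sketched roughly as below. This is only an illustration under stated assumptions: the enum values and the function name worst_severity are hypothetical and are not taken from the patch, and the sketch assumes severities can be placed in a single total order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical severity levels (not from the patch), ordered from
 * least to most severe so that a plain comparison picks the worst.
 */
enum mce_severity {
    MCE_SEV_NONE = 0,       /* no error logged on this CPU */
    MCE_SEV_CORRECTED,      /* hardware already corrected it */
    MCE_SEV_RECOVERABLE,    /* software may still recover */
    MCE_SEV_UNRECOVERABLE   /* nothing can be done; system-fatal */
};

/*
 * Step 2 of the check: given the severity each CPU computed for
 * itself in step 1, return the most severe one. The caller would
 * then act on that single system-wide result.
 */
enum mce_severity worst_severity(const enum mce_severity *per_cpu,
                                 size_t ncpus)
{
    enum mce_severity worst = MCE_SEV_NONE;
    size_t i;

    /* A NULL array is only acceptable when there is nothing to scan. */
    assert(per_cpu != NULL || ncpus == 0);

    for (i = 0; i < ncpus; i++) {
        if (per_cpu[i] > worst)
            worst = per_cpu[i];
    }
    return worst;
}
```

With CPU A reporting MCE_SEV_UNRECOVERABLE and CPU B reporting MCE_SEV_RECOVERABLE, worst_severity returns MCE_SEV_UNRECOVERABLE, which matches the CPU A / CPU B example in the quoted text.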

We have a patch to support page offline (sent as an RFC to the mailing
list), and it already exports a hypercall for Dom0 to ask Xen to offline
pages (this is for proactive action on CE errors from Dom0). Also, as
Frank suggested, we will add a hypercall for Dom0 to get a page's offline
status, so it should be OK.

> Further, with our Shanghai CPU, we can disable certain parts of its L3
> cache. Instead of offlining that broken CPU completely, just disable
> the broken part of it. The registers for this are in PCI config space.
> Since Xen delegates PCI access to Dom0, Dom0 can do that.

Sorry that I have no idea about Shanghai, but I'm a bit surprised that
when an error happens in the cache, we would transfer control to Dom0 and
wait for Dom0's MCA handler to take action to disable the cache; it is a
really long code path. Per my understanding, if there is an issue in the
cache, we should clear/disable the cache ASAP to avoid more severe
results, and this is an extreme example for letting Xen handle the MCA.
Or maybe I missed something important in this feature?

BTW, I want to clarify that this patch is for #MC handling (i.e. the
"uncorrected" error in your mind). For hardware-correctable errors (i.e.
"correctable"), Xen will do nothing but pass them to Dom0 as a vIRQ, as
our previous patch
(http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00970.html)
showed, because a CE will not impact the system. So if the "cache index
disable" is meant to disable part of the cache after too many CEs
(Correctable Errors) as a proactive action, I think we are on the same
page.

I attached two foils that are part of our Xen summit presentation. Page 1
is mainly for #MC handling; page 2 is for CE handling (through CMCI or
polling). Page 1 is described clearly in the patch. Page 2 is what our
previous patch did.

Thanks
-- Yunhong Jiang

>
> Christoph
>
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632

Attachment: MCA.pdf

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel