
RE: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA


  • To: Frank van der Linden <Frank.Vanderlinden@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Ke, Liping" <liping.ke@xxxxxxxxx>
  • Date: Tue, 17 Mar 2009 14:59:20 +0800
  • Accept-language: en-US
  • Cc:
  • Delivery-date: Tue, 17 Mar 2009 00:00:44 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: AcmmjuuLNBJ4wd8/QmiEPfqXHcdbYwAOEIywAAFyffA=
  • Thread-topic: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA

Hi, Frank

We ran a small test here and found that the CMCI problem is caused by how the
"mce_banks_owned" bitmap parameter is passed.
When a CMCI happens, we print the bitmap value in smp_cmci_interrupt, and it is
correct (for cpu0, 16c).
By the time it is passed into mcheck_mca_logout, it has turned into
"0xFFFF~~FFF", which is wrong.
Just for your info -:)
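
To illustrate why an all-ones mask produces the symptom, here is a minimal sketch in C. It is not the Xen code: bank_owned and reporters_for_bank are hypothetical helpers, and the assumption that CPUs 1-3 own no banks is made up for the example. Only the value 0x16c for cpu0 comes from the test above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: is `bank` set in this CPU's ownership mask? */
static int bank_owned(uint64_t owned_mask, unsigned int bank)
{
    assert(bank < 64);
    return (int)((owned_mask >> bank) & 1u);
}

/*
 * Count how many CPUs would report an error in `bank`, given one
 * ownership mask per CPU. With correct masks this should be 1.
 */
static unsigned int reporters_for_bank(const uint64_t *masks,
                                       unsigned int ncpus,
                                       unsigned int bank)
{
    unsigned int n = 0;

    for (unsigned int cpu = 0; cpu < ncpus; cpu++)
        if (bank_owned(masks[cpu], bank))
            n++;
    return n;
}
```

With masks { 0x16c, 0, 0, 0 }, bit 8 is set only for cpu0, so a single CPU reports bank 8. If every CPU instead sees an all-ones mask, all four CPUs treat bank 8 as their own and all four report it.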

Still, I suggest splitting the patch, since it is really big  -:)

Thanks a lot for your help!
Criping

-----Original Message-----
From: Ke, Liping 
Sent: March 17, 2009 14:26
To: 'Frank van der Linden'; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common 
code for Intel/AMD MCA

Hi, Frank

I am now running some tests of this patch on the latest Intel platform,
since CMCI needs owner checking and only the CPU that owns a bank should
report the error.

Without the patch, when a CMCI happens, CPU0 is the owner of bank8,
so after the ownership check only CPU0 reports the error.

Below is the correct log:
(XEN) CMCI: cmci_intr happen on CPU3
[root@lke-ep inject]# (XEN) CMCI: cmci_intr happen on CPU2
(XEN) CMCI: cmci_intr happen on CPU0
(XEN) CMCI: cmci_intr happen on CPU1
(XEN) mcheck_poll: bank8 CPU0 status[cc0000800001009f]
(XEN) mcheck_poll: CPU0, SOCKET0, CORE0, APICID[0], thread[0]
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.

After applying your patch, I found that all CPUs report the error.
Below is the log:
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 2.
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 3.
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 1.
(XEN) Bank 8: cc0000c00001009f<1>Bank 8: 8c0000400001009f<1>Bank 8: cc0001c00001009f<1>MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.

I noticed that your patch does pass in the cmci_owner mask, so I can't yet see
why this happens, since this is really a big patch. I need some time to figure
it out.


We also found that the polling mechanism has changed. My feeling is that this
patch is really too big.
We can't easily figure out its impact on our checked-in code right now.
I just wonder whether you could split this big patch into two parts :-)
part1: the mce log telem mechanism and the required mce_intel interface
changes, so that we can easily verify
whether the new interfaces work fine for our CMCI as well as non-fatal
polling. I guess this should not be a lot of work;
could you just adapt machine_check_poll to the new telem interfaces?
part2: the common handler part (including both the CMCI and non-fatal polling
parts).

What do you think about it :-)

Thanks a lot for your help!
Criping



-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Frank van der Linden
Sent: March 17, 2009 7:28
To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code 
for Intel/AMD MCA

The following patch reworks the MCA error telemetry handling inside Xen,
and shares code between the Intel and AMD implementations as much as
possible.

I've had this patch sitting around for a while, but it wasn't ported to 
-unstable yet. I finished porting and testing it, and am submitting it 
now, because the Intel folks want to go ahead and submit their new 
changes, so we agreed that I should push our changes first.

Brief explanation of the telemetry part: previously, the telemetry was 
accessed in a global array, with index variables used to access it. 
There were some issues with that: race conditions with regard to new 
machine checks (or CMCIs) coming in while handling the telemetry, and 
interaction with domains having been notified or not, which was a bit 
hairy. Our changes (I should say: Gavin Maltby's changes, as he did the 
bulk of this work for our 3.1 based tree, I merely ported/extended it to 
3.3 and beyond) make telemetry access transactional (think of a 
database). Also, the internal database updates are atomic, since the 
final commit is done by a pointer swap. There is a brief explanation of 
the mechanism in mctelem.h. This patch also removes dom0->domU 
notification, which is ok, since Intel's upcoming changes will replace 
domU notification with a vMCE mechanism anyway.
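
For readers who have not seen this pattern, a minimal sketch of "reserve, fill privately, commit with one pointer swap" in C11 follows. It is not the interface from mctelem.h: every name here (telem_entry, telem_reserve, telem_commit, telem_consume_all) is hypothetical, and the commit is modelled as a compare-and-swap push onto a list head, which stands in for whatever pointer swap the patch actually performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical telemetry record; the real layout is not shown in this mail. */
struct telem_entry {
    struct telem_entry *next;
    unsigned int bank;
    uint64_t status;
};

#define POOL_SIZE 4
static struct telem_entry pool[POOL_SIZE];
static atomic_uint pool_next;                    /* next never-used slot */
static _Atomic(struct telem_entry *) committed;  /* head of committed list */

/* Reserve a private entry. Consumers cannot see it yet. */
static struct telem_entry *telem_reserve(void)
{
    unsigned int i = atomic_fetch_add(&pool_next, 1);

    return i < POOL_SIZE ? &pool[i] : NULL;
}

/* Commit: publish the filled entry with a single atomic pointer update. */
static void telem_commit(struct telem_entry *e)
{
    struct telem_entry *head = atomic_load(&committed);

    do {
        e->next = head;  /* head is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak(&committed, &head, e));
}

/* Consumer: detach everything committed so far in one atomic exchange. */
static struct telem_entry *telem_consume_all(void)
{
    return atomic_exchange(&committed, (struct telem_entry *)NULL);
}
```

The point of the sketch is that a new machine check or CMCI arriving while an entry is being filled cannot observe a half-written record, because nothing is reachable from the shared head until the commit step succeeds.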

The common code part is pretty much what it says. It defines a common 
MCE handler, with a few hooks for the special needs of the specific CPUs.
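
As a rough illustration of the "common handler plus hooks" shape, here is a small C sketch. The names mca_vendor_hooks, mca_common_scan and example_skip_unowned are hypothetical and are not taken from the patch; the only hardware fact used is that bit 63 of an MCi_STATUS value is the valid bit. The example hook simply pretends that CPU0 owns every bank.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)  /* valid bit of an MCi_STATUS value */

/* Hypothetical per-vendor hooks; either pointer may be NULL. */
struct mca_vendor_hooks {
    /* Return nonzero if this CPU should skip the bank. */
    int (*skip_bank)(unsigned int cpu, unsigned int bank);
    /* Vendor-specific extra work for a bank that is being logged. */
    void (*extra)(unsigned int bank, uint64_t status);
};

/* Example hook (assumption): only CPU0 owns banks, all others skip them. */
static int example_skip_unowned(unsigned int cpu, unsigned int bank)
{
    (void)bank;
    return cpu != 0;
}

/*
 * Common scan loop shared by all vendors. Returns the number of banks
 * this CPU would log. `status` holds one saved status value per bank.
 */
static unsigned int mca_common_scan(unsigned int cpu,
                                    const uint64_t *status,
                                    unsigned int nbanks,
                                    const struct mca_vendor_hooks *hooks)
{
    unsigned int logged = 0;

    for (unsigned int bank = 0; bank < nbanks; bank++) {
        if (!(status[bank] & MCI_STATUS_VAL))
            continue;  /* nothing recorded in this bank */
        if (hooks && hooks->skip_bank && hooks->skip_bank(cpu, bank))
            continue;  /* vendor hook says this CPU must not report it */
        if (hooks && hooks->extra)
            hooks->extra(bank, status[bank]);
        logged++;
    }
    return logged;
}
```

Calling mca_common_scan with a NULL hooks pointer gives the plain behaviour where every CPU logs every valid bank; supplying a skip_bank hook is one way a vendor could restrict reporting to a single CPU.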

I've been told that Intel's upcoming patch will need to make some parts 
of the common code specific to the Intel CPU again, but we'll work 
together to use as much common code as possible.

- Frank
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

