[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA



xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
> Ke, Liping wrote:
>> Hi, Frank
>> 
>> We did some small test here, found the CMCI problem is
> caused by the "mce_banks_owned" bitmap param passing.
>> When CMCI happened, we print the bitmap (in
> smp_cmci_interrupt) value, it's correct (For cpu0, 16c).
>> When passing into mcheck_mca_logout, it turned to be "0xFFFF~~FFF" which
>> is wrong. Only for your info -:) 
>> 
>> Still, I suggest split the patch since this patch is realy big  -:)
>> 
>> Thanks a lot for your help!
>> Criping
> 
> Ok, since the patch is already in (thanks Keir!), let's work out any issues
> you may see with it. 
> 
> When you say that the mce_banks_owned value is wrong in
> mcheck_mca_logout, do you mean that the parameter passing somehow went
> wrong? Is the value ok right before the call, but not in
> mcheck_mca_logout itself? That would be strange.
> 
> You mentioned the polling code: it didn't actually change
> much, in fact,
> the general polling code now is pretty much the same as the
> polling code
> that was originally in mce_intel.c. That code is very similar.
> Some AMD
> code contained AMD-specific things, and has not been touched (it still has
> its own poll function). 
> 
> In general, the change is:
> 
> * machine_check_poll (going over the banks to look for
> non-fatal errors)
> and the loops inside the mcheck handlers that also walk the MC banks,
> have been combined in one function, which is mcheck_mca_logout.
> mcheck_mca_logout still takes a context argument, as machine_check_poll did.
> 
> * The mcheck handlers themselves mostly just call mcheck_cmn_handler,
> which calls mcheck_mca_logout, looks at the context, and then decides what
> to do. 
> 
> * The mctelem structure as returned by mcheck_mca_logout, contains the
> relevant telemetry for the current error. If it should be put in the
> "database" for dom0 retrieval, use mctelem_commit to put it there.
> Otherwise, the usual course of action is to print the values and uses
> mctelem_dismiss to discard the information.
> 
> There is one thing that needs to be addressed: in your Intel-specific
> code, you do not reset the per-bank status in the case of a
> fatal error,
> so that the BIOS may look at the MSRs. mcheck_mca_logout does always
> reset the values. The best way to fix this is probably to pass
> in a flag
> to mcheck_mca_logout to tell it whether it should reset the
> status MSRs
> or not. The caller can then decide if they should be reset later.
> 
> Let me know if you have any more issues, I'll be happy to help merge your
> code and test it. 

We will do more test for it and also check the per-bank status. Originally we 
hope:

a) settle down the interface in 3.4 release for MCA, so that it will not be 
dramatically changed anymore, including the telemetry interface for CE and the 
interface betwwen MCA handler and softIRQ handler (i.e. what we get consensused 
in the mailing list), and your telemetry changes that already in. since usually 
people based their work on some stable release. 

b) If we can get consensus, finished the softIRQ mechanism so that it will get 
a solid base. The spin_lock() and "current" usage in MCA handler is really 
confusing.

But feature freeze now :-(

Or maybe someone can lobby Keir to accept these changes if Ke Liping and I can 
finish these patches this week. Hope it should not be big if we can remove the 
vMCE implementation cleanly.

> 
> - Frank
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.