
RE: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs

>-----Original Message-----
>From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
>[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Keir Fraser
>Sent: 23 December 2008 17:00
>To: Ke, Liping
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>Subject: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected 
>Machine Check Error Interrupt) for Intel CPUs
>On 23/12/2008 08:40, "Keir Fraser" <keir.fraser@xxxxxxxxxxxxx> wrote:
>>> Moving *cmci_owner_set* out of stopmachine_run is basically OK for us.
>>> Just one thing:
>>> A CMCI might occur and be lost during the very small window (the old
>>> owner is cleared while the new owner is not yet set). To make sure that
>>> CMCI can still be triggered on the new owner, we need to clear the
>>> [Corrected Error Counter] field of the MSR Bank(i) status register (we
>>> normally do this in the CMCI interrupt handler; according to the spec,
>>> if the counter is not cleared, CMCI will not be triggered any more).
>>> I made a small patch for it in the attachment. What do you think?
>> I don't know very much about CMCI. If you think this is required I will
>> certainly check it in.
>Actually I think this is a good idea, even if we'd stayed with your original
>CMCI patches. I will apply it.
>One thing -- if you want to reduce the window between release of a bank by
>its old owner and acquisition by a new owner, we could do the whole lot
>before stop_machine_run()? Maybe cmci_cpu_down(cpu) which would IPI 'cpu' to
>clear its CMCI state and then IPI all other CPUs to pick up the released
>banks. This would be neatly hooked off CPU_DOWN_PREPARE or similar in Linux,
>but Xen doesn't have cpu notifiers. :-) You'd have to call explicitly in
>cpu_down(). Or perhaps we should have cpu notifier chains in Xen too...

Yes, we discussed this when working on the patch.
There are two goals for CMCI when a CPU goes offline: a) we would rather not
lose a CMCI interrupt; b) if we do lose a CMCI interrupt, further CMCI
interrupts must not be blocked.
Since CMCI reports corrected errors, goal a) is not a strong requirement.

When we place __cpu_clear_cmci() in the stop_machine_run() logic, interrupts
are disabled when __cpu_clear_cmci() is called, so it is certain that no CMCI
interrupt is lost and no blocking will happen. In the current Xen
implementation, cpu_mcheck_distribute_cmci() is called after
stop_machine_run(), so a CMCI interrupt may be lost, but with Criping's patch
CMCI will no longer be blocked, and the solution is very clear.
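
To illustrate the difference between a lost CMCI and a blocked CMCI, here is a
toy model in C. It is only a sketch and not Xen code: the toy_* names, the
struct layout and the rule "an interrupt is signalled only when the counter
equals the threshold" are assumptions made for illustration.

```c
#define TOY_NO_OWNER (-1)

/* Toy model of one MCA bank (hypothetical, not a Xen structure). */
struct toy_bank {
    unsigned int count;      /* corrected error counter */
    unsigned int threshold;  /* assumed: CMCI fires when count == threshold */
    int owner;               /* CPU handling CMCI for this bank, or none */
    unsigned int delivered;  /* number of CMCIs actually handled */
};

/* Log one corrected error; return 1 if a CMCI was handled. */
int toy_corrected_error(struct toy_bank *b)
{
    b->count++;
    if (b->count != b->threshold)
        return 0;            /* nothing is signalled */
    if (b->owner == TOY_NO_OWNER)
        return 0;            /* signalled, but no owner: the CMCI is lost */
    b->count = 0;            /* the handler clears the counter */
    b->delivered++;
    return 1;
}

/* A new owner claims the bank, optionally clearing the stale counter. */
void toy_claim(struct toy_bank *b, int cpu, int clear_counter)
{
    b->owner = cpu;
    if (clear_counter)
        b->count = 0;
}

/* Offline scenario; returns the CMCIs handled after the handover. */
unsigned int toy_run(int clear_counter)
{
    struct toy_bank b = { 0, 1, 0, 0 };   /* threshold 1, owned by CPU 0 */
    unsigned int before;

    toy_corrected_error(&b);              /* handled by CPU 0 */
    b.owner = TOY_NO_OWNER;               /* CPU 0 releases the bank */
    toy_corrected_error(&b);              /* lost inside the window */
    toy_claim(&b, 1, clear_counter);      /* CPU 1 becomes the owner */

    before = b.delivered;
    toy_corrected_error(&b);
    toy_corrected_error(&b);
    return b.delivered - before;
}
```

In this model toy_run(0) returns 0, because the stale counter never equals the
threshold again, while toy_run(1) returns 2: the error inside the window is
still lost, but later CMCIs are delivered to the new owner.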

As for your proposal to do it before stop_machine_run(), it may reduce the
window, but it still cannot eliminate the window in which a CMCI interrupt can
be lost, unless we do something similar in cmci_cpu_down() (i.e. all CPUs have
IRQs disabled before the CMCI status is updated). The same applies if we bring
notifier chains to Xen. What do you think?

>If we do the above I don't think we need to re-introduce your rollback
>logic. If you think about it, there's no reason to prefer the old owner over
>the new owner, so no reason to roll back. I believe?

Yes, currently we don't need rollback anymore. But if we do it before
stop_machine_run(), we may still need rollback for banks owned exclusively by
the CPU going down (not all banks are shared between CPUs).
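
A rough sketch of the exclusive-bank case, again as a toy model in C rather
than Xen code. The toy_* names, the per-bank visibility bitmask and the
"lowest-numbered online CPU wins" rule are all hypothetical and chosen only
to show why a shared bank needs no rollback while an exclusive one does.

```c
#define TOY_NR_BANKS 4
#define TOY_UNOWNED  (-1)

/* Hypothetical bookkeeping: which CPUs can access each bank, and who owns it. */
struct toy_mca {
    unsigned int visible[TOY_NR_BANKS];  /* bit n set: CPU n can access bank */
    int owner[TOY_NR_BANKS];             /* owning CPU or TOY_UNOWNED */
};

/*
 * Release the banks owned by 'cpu' and hand each one to the lowest-numbered
 * other online CPU that can access it. Returns the number of banks left
 * unowned, i.e. the banks that were exclusive to 'cpu'.
 */
int toy_cpu_down(struct toy_mca *m, int cpu, unsigned int online)
{
    int orphaned = 0;

    online &= ~(1u << cpu);              /* 'cpu' is no longer a candidate */
    for (int i = 0; i < TOY_NR_BANKS; i++) {
        unsigned int cand;
        int n = 0;

        if (m->owner[i] != cpu)
            continue;
        m->owner[i] = TOY_UNOWNED;
        cand = m->visible[i] & online;
        if (cand == 0) {                 /* exclusive bank: nobody can take it */
            orphaned++;
            continue;
        }
        while (!(cand & (1u << n)))
            n++;
        m->owner[i] = n;
    }
    return orphaned;
}

/*
 * Rollback when taking 'cpu' down fails: it reclaims only the banks that are
 * still unowned. Shared banks stay with their new owner.
 */
int toy_cpu_down_failed(struct toy_mca *m, int cpu)
{
    int reclaimed = 0;

    for (int i = 0; i < TOY_NR_BANKS; i++) {
        if (m->owner[i] == TOY_UNOWNED && (m->visible[i] & (1u << cpu))) {
            m->owner[i] = cpu;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

With two CPUs, a bank visible to both simply moves to the other CPU and stays
there, while a bank visible only to the CPU going down is left unowned until
toy_cpu_down_failed() hands it back.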

Yunhong Jiang

> -- Keir
>Xen-devel mailing list


