
Re: [Xen-devel] Re: [PATCH 1/4] CPU online/offline support in Xen



Christoph Egger wrote:
On Thursday 11 September 2008 16:15:14 Keir Fraser wrote:
I applied the patch with the following changes:
 * I rewrote your changes to fixup_irqs(). We should force lazy EOIs
*after* we have serviced any straggling interrupts. Also we should actually
clear the EOI stack so it is empty next time the CPU comes online.
 * I simplified your changes to schedule.c in light of the fact we run in
stop_machine context. Hence we can be quite relaxed about locking, for
example.
 * I removed your change to __csched_vcpu_is_migrateable() and instead put
a similar check in csched_load_balance(). I think this is clearer and also
cheaper.
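
As a rough illustration of why one check in the balancing path can be cheaper than a check made per candidate vcpu, here is a toy Python model. It is not Xen source: the function names, the run-queue representation, and the assumption that the check tests whether the local CPU is still online are all hypothetical.

```python
# Toy model, not Xen code. All names and data structures are hypothetical.

def balance_per_vcpu_check(local_cpu, online, peer_runqs):
    """Test the local CPU's online state once per candidate vcpu."""
    checks = 0
    for runq in peer_runqs.values():
        for vcpu in runq:
            checks += 1
            if local_cpu in online:
                return vcpu, checks   # steal the first eligible vcpu
    return None, checks


def balance_single_check(local_cpu, online, peer_runqs):
    """Test the local CPU's online state once, before scanning any peer."""
    checks = 1
    if local_cpu not in online:
        return None, checks           # an offlining CPU must not steal work
    for runq in peer_runqs.values():
        for vcpu in runq:
            return vcpu, checks
    return None, checks


online = {0, 1, 2}                    # assume CPU 3 is being taken offline
peer_runqs = {0: ["v0", "v1"], 1: ["v2"], 2: []}

print(balance_per_vcpu_check(3, online, peer_runqs))
print(balance_single_check(3, online, peer_runqs))
```

In this model both variants refuse to steal work for the offlining CPU, but the per-vcpu variant repeats the same test for every queued vcpu while the single-check variant performs it once.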

I note that the VCPU currently running on the offlined CPU continues to run
there even after __cpu_disable(), and until that CPU does a final run
through the scheduler soon after. I hope it does not matter that there is
one vcpu with v->processor == offlined_cpu for a short while.

This is not acceptable with regard to machine check. When Dom0 offlines a
defective CPU, nothing may continue to run on it, or silent data corruption
can occur.

I don't see this as a problem for machine check correctness.

If dom0 asks to offline a cpu (because it believes the cpu is busted and
a threat to uptime), that decision is fundamentally asynchronous
to the actual error handling that occurred at machine check exception
time:

 - running in whatever context
 - MCE occurs
 - trap to hypervisor MCE handler
        . this decides on hypervisor panic, or other appropriate
          immediate (in handler) response
        . telemetry forwarded to dom0 for logging and analysis
 - assume no hypervisor panic
 - eons pass during which any unconstrained bad data remaining
   after initial handling may go anywhere
 - dom0 gets telemetry and let's say diagnoses a fault and
   decides to call back into the hypervisor to offline the
   offending cpu

Note the "eons pass" bit: tonnes of instructions may run on the
bad cpu in this time, and a few more during some offline delay won't
hurt.
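
To put rough numbers on the "eons pass" argument, here is a back-of-envelope sketch in Python. Every figure is an assumption chosen for illustration (instruction rate, diagnosis delay, offline delay); none comes from measurement.

```python
# Back-of-envelope model; every constant below is an illustrative assumption.
ASSUMED_INSTR_PER_US = 1_000             # roughly 1e9 instructions per second
ASSUMED_DIAGNOSIS_DELAY_US = 1_000_000   # MCE until dom0 requests the offline (1 s)
ASSUMED_OFFLINE_DELAY_US = 1_000         # request until the final scheduler pass (1 ms)


def instructions_during(delay_us, instr_per_us=ASSUMED_INSTR_PER_US):
    """Instructions the suspect CPU executes over delay_us microseconds."""
    return delay_us * instr_per_us


before_request = instructions_during(ASSUMED_DIAGNOSIS_DELAY_US)
during_offline = instructions_during(ASSUMED_OFFLINE_DELAY_US)
extra_fraction = during_offline / before_request

print(f"before offline request: {before_request:,} instructions")
print(f"during offline delay:   {during_offline:,} instructions")
print(f"relative increase:      {extra_fraction:.1%}")
```

Under these assumed figures the offline delay adds about 0.1% to the instructions the suspect cpu has already run since the error. A real diagnosis delay in dom0 is likely longer than one second, which would make the fraction smaller still.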

Gavin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

