
RE: [Xen-devel] cpuidle causing Dom0 soft lockups



Hi Jan,

Could you try the following debugging patch? It can help narrow down the
root cause:

diff -r ea02c95af387 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -228,7 +228,6 @@ static void acpi_processor_idle(void)

     cpufreq_dbs_timer_suspend();

-    sched_tick_suspend();
     /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();

Regards
Ke
>-----Original Message-----
>From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
>[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Jan Beulich
>Sent: Thursday, January 21, 2010 5:52 PM
>To: xen-devel@xxxxxxxxxxxxxxxxxxx
>Subject: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>On large systems, with Dom0 booting with (significantly) more than
>32 vCPU-s, we have received multiple reports that C-state management,
>now enabled by default, is causing soft lockups, usually preventing
>the boot from completing.
>
>The observations are:
>
>Reducing the number of vCPU-s (or pCPU-s) sufficiently makes
>the systems work.
>
>max_cstate=0 makes the systems work.
>
>max_cstate=1 makes the problem less severe on one (bigger) system,
>and eliminates it completely on another (smaller) one.
>
>When appearing to hang, all vCPU-s are in Dom0's timer_interrupt(),
>and all (sometimes all but one) are attempting to acquire xtime_lock.
>However, due to our use of ticket locks, we can verify that this is not
>a deadlock (repeatedly sending '0' shows forward progress, as the
>tickets [visible on the stack] continue to increase). Additionally, there
>is always one vCPU that has its polling event channel (used for
>waking the next waiting vCPU when a lock becomes available)
>signaled.
>
>In one case (but not in the other) it is always the same vCPU that
>is apparently taking very long to wake up from the polling request.
>This may be a coincidence, but the output after sending 'c' also indicates
>a significantly higher (about 3 times) usage value for C2 than the
>second highest one; the printed duration is roughly the same for
>all CPUs.
>
>While I don't know this code well, it would seem that we're suffering
>from extremely long wakeup times. This suggests that there is likely
>a (performance) problem even with smaller numbers of vCPU-s.
>Hence, unless it can be fixed before 4.0 is released, I would suggest
>disabling C-state management by default again.
>
>I can provide full logs in case needed.
>
>Jan
>
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@xxxxxxxxxxxxxxxxxxx
>http://lists.xensource.com/xen-devel
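Jan's point that the tickets visible on the stack keep increasing is what distinguishes a slow lock from a deadlocked one. The following is a minimal C11 sketch of a generic ticket lock, assuming a simple two-counter (`next`/`owner`) layout. It is not the Dom0 kernel's or Xen's actual implementation, and all names in it (`ticket_lock_t`, `ticket_take()`, `ticket_handoffs()` and so on) are hypothetical. It also busy-waits where a paravirtualised lock would poll an event channel.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock: 'next' is the ticket handed to the next
 * arriving CPU, 'owner' is the ticket that may currently hold the lock. */
typedef struct {
    atomic_uint next;
    atomic_uint owner;
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next, 0u);
    atomic_init(&l->owner, 0u);
}

/* Draw a ticket; this is the per-waiter value that shows up on the stack. */
static unsigned int ticket_take(ticket_lock_t *l)
{
    return atomic_fetch_add(&l->next, 1u);
}

/* Non-blocking check: is it the turn of the holder of 'ticket'? */
static int ticket_is_turn(ticket_lock_t *l, unsigned int ticket)
{
    return atomic_load(&l->owner) == ticket;
}

static unsigned int ticket_lock(ticket_lock_t *l)
{
    unsigned int t = ticket_take(l);
    while (!ticket_is_turn(l, t)) {
        /* Busy-wait. A paravirtualised lock would instead block, polling
         * an event channel until the releasing CPU signals it. */
    }
    return t;
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Sanity check: someone must hold the lock (owner != next). */
    assert(atomic_load(&l->owner) != atomic_load(&l->next));
    /* Strict FIFO handoff: only the holder of the next ticket may enter. */
    atomic_fetch_add(&l->owner, 1u);
}

/* Number of handoffs between two snapshots of 'owner'. Unsigned
 * subtraction stays correct across counter wraparound. A non-zero
 * result means the lock made progress, i.e. it is not deadlocked. */
static unsigned int ticket_handoffs(unsigned int owner_before,
                                    unsigned int owner_after)
{
    return owner_after - owner_before;
}
```

If two snapshots of `owner` (for instance from two successive '0' dumps) give a non-zero `ticket_handoffs()`, the lock has changed hands in between, so there is forward progress rather than a deadlock. Because handoff is strictly FIFO, a waiter that is slow to wake from its poll also stalls every ticket queued behind it, which would be consistent with the lockups getting worse as the vCPU count grows.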
