RE: [Xen-devel] dom0 hang
>-----Original Message-----
>From: Mukesh Rathor [mailto:mukesh.rathor@xxxxxxxxxx]
>Sent: Tuesday, July 07, 2009 11:47 AM
>To: Yu, Ke
>Cc: George Dunlap; Tian, Kevin; xen-devel@xxxxxxxxxxxxxxxxxxx; Kurt C. Hackel
>Subject: Re: [Xen-devel] dom0 hang
>
>Well, the problem takes long to reproduce (only on certain boxes). And then it
>may not always happen. So I want to make sure I understand the fix, as it
>was pretty hard to debug.

OK, looking forward to your update.

>While the fix will still allow softirqs pending, I guess, functionally
>it's OK because after irq disable, it'll check for pending softirqs, and
>just return. I think the comment about expecting no softirq pending
>should be fixed.

Right, the comment will also be fixed.

>BTW, why can't the tick be suspended when csched_schedule() concludes
>it's an idle vcpu, before returning? Wouldn't that make it less intrusive?

The tick suspend can be put in csched_schedule, but the suspend/resume logic
is still needed in acpi_processor_idle anyway for the dbs_timer
suspend/resume. The intention here is to make acpi_processor_idle the central
place for timers that can be stopped during an idle period. If there are
other stoppable timers in the future, they can easily be added to
acpi_processor_idle, so it is cleaner to keep the current logic. And as long
as we are careful not to overdo the softirq handling, it does not look so
intrusive. What do you think?

Best Regards
Ke

>
>thanks,
>Mukesh
>
>
>Yu, Ke wrote:
>> Hi Mukesh,
>>
>> Could you please try the following patch, to see if it can resolve the
>> issue you observed? Thanks.
>>
>> Best Regards
>> Ke
>>
>> diff -r d461c4d8af17 xen/arch/x86/acpi/cpu_idle.c
>> --- a/xen/arch/x86/acpi/cpu_idle.c
>> +++ b/xen/arch/x86/acpi/cpu_idle.c
>> @@ -228,10 +228,10 @@ static void acpi_processor_idle(void)
>>      /*
>>       * sched_tick_suspend may raise TIMER_SOFTIRQ by __stop_timer,
>>       * which will break the later assumption of no softirq pending,
>> -     * so add do_softirq
>> +     * so process the pending timers
>>       */
>> -    if ( softirq_pending(smp_processor_id()) )
>> -        do_softirq();
>> +
>> +    process_pending_timers();
>>
>>      /*
>>       * Interrupts must be disabled during bus mastering calculations and
>>
>>> -----Original Message-----
>>> From: Mukesh Rathor [mailto:mukesh.rathor@xxxxxxxxxx]
>>> Sent: Friday, July 03, 2009 9:19 AM
>>> To: mukesh.rathor@xxxxxxxxxx
>>> Cc: George Dunlap; Tian, Kevin; xen-devel@xxxxxxxxxxxxxxxxxxx; Yu, Ke;
>>> Kurt C. Hackel
>>> Subject: Re: [Xen-devel] dom0 hang
>>>
>>>
>>> Hi Kevin/Yu:
>>>
>>> acpi_processor_idle()
>>> {
>>>     sched_tick_suspend();
>>>     /*
>>>      * sched_tick_suspend may raise TIMER_SOFTIRQ by __stop_timer,
>>>      * which will break the later assumption of no softirq pending,
>>>      * so add do_softirq
>>>      */
>>>     if ( softirq_pending(smp_processor_id()) )
>>>         do_softirq();            <===============
>>>
>>>     local_irq_disable();
>>>     if ( softirq_pending(smp_processor_id()) )
>>>     {
>>>         local_irq_enable();
>>>         sched_tick_resume();
>>>         cpufreq_dbs_timer_resume();
>>>         return;
>>>     }
>>>
>>> Wouldn't the do_softirq() call the scheduler with the tick suspended, and
>>> the scheduler then context-switch to another vcpu (with *_BOOST), which
>>> would result in the stuck vcpu I described?
>>>
>>> thanks
>>> Mukesh
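For reference, a minimal sketch of the idle-entry ordering the patch above
aims for. The helper names (sched_tick_suspend(), process_pending_timers(),
cpufreq_dbs_timer_suspend()) are the ones quoted in this thread; the body is
an illustration of the intended flow, not the verbatim cpu_idle.c code:

    static void acpi_processor_idle(void)
    {
        cpufreq_dbs_timer_suspend();
        sched_tick_suspend();

        /*
         * sched_tick_suspend() may raise TIMER_SOFTIRQ via __stop_timer().
         * Running only the timer softirq here, rather than do_softirq(),
         * cannot enter SCHEDULE_SOFTIRQ, so this cpu cannot be context-
         * switched away while its ticker is suspended (the race Mukesh
         * points at above).
         */
        process_pending_timers();

        local_irq_disable();
        if ( softirq_pending(smp_processor_id()) )
        {
            /* Real work became pending after all: back out of idle. */
            local_irq_enable();
            sched_tick_resume();
            cpufreq_dbs_timer_resume();
            return;
        }

        /* ... select and enter a C-state with interrupts disabled ... */
    }

If a timer handler run by process_pending_timers() does raise
SCHEDULE_SOFTIRQ, the softirq_pending() check after local_irq_disable()
catches it and the function backs out with the tickers resumed.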
>>>
>>> Mukesh Rathor wrote:
>>>> ah, i totally missed csched_tick():
>>>>     if ( !is_idle_vcpu(current) )
>>>>         csched_vcpu_acct(cpu);
>>>>
>>>> yeah, looks like that's what is going on. i'm still waiting to
>>>> reproduce. at first glance, looking at c/s 19460, seems like
>>>> suspend/resume, well at least the resume, should happen in
>>>> csched_schedule().....
>>>>
>>>> thanks,
>>>> Mukesh
>>>>
>>>>
>>>> George Dunlap wrote:
>>>>> [Oops, adding back in distro list, also adding Kevin Tian and Yu Ke,
>>>>> who wrote c/s 19460]
>>>>>
>>>>> The functionality I was talking about, subtracting credits and
>>>>> clearing BOOST, happens in csched_vcpu_acct() (which is different from
>>>>> csched_acct()). csched_vcpu_acct() is called from csched_tick(), which
>>>>> should still happen every 10ms on every cpu.
>>>>>
>>>>> The patch I referred to (c/s 19460) disables and re-enables tickers in
>>>>> xen/arch/x86/acpi/cpu_idle.c:acpi_processor_idle() every time the
>>>>> processor idles. I can't see anywhere else that tickers are disabled,
>>>>> so it's probably something not properly re-enabling them again.
>>>>>
>>>>> Try applying the attached patch to see if that changes anything. (I'm
>>>>> on the road, so I can't repro the lockup issue.) If that doesn't
>>>>> work, try disabling C-states and see if that helps. Then at least
>>>>> we'll know where the problem lies.
>>>>>
>>>>> -George
>>>>>
>>>>> On Thu, Jul 2, 2009 at 10:10 PM, Mukesh Rathor<mukesh.rathor@xxxxxxxxxx> wrote:
>>>>>> that seems to only suspend csched_pcpu.ticker, which is csched_tick,
>>>>>> and that only sorts the local runq.
>>>>>>
>>>>>> again, we are concerned about csched_priv.master_ticker that calls
>>>>>> csched_acct? correct, so i can trace that?
>>>>>>
>>>>>> thanks,
>>>>>> mukesh
>>>>>>
>>>>>>
>>>>>> George Dunlap wrote:
>>>>>>> Ah, I see that there have been some changes to the tick stuff with
>>>>>>> the C-state work (e.g., c/s 19460). It looks like the tickers are
>>>>>>> supposed to keep going, but perhaps tick_suspend() and tick_resume()
>>>>>>> aren't being called properly. Let me take a closer look.
>>>>>>>
>>>>>>> -George
>>>>>>>
>>>>>>> On Thu, Jul 2, 2009 at 8:14 PM, Mukesh Rathor<mukesh.rathor@xxxxxxxxxx> wrote:
>>>>>>>> George Dunlap wrote:
>>>>>>>>> On Thu, Jul 2, 2009 at 4:19 AM, Mukesh Rathor<mukesh.rathor@xxxxxxxxxx> wrote:
>>>>>>>>>> dom0 hang:
>>>>>>>>>> vcpu0 is trying to wake up a task and in try_to_wake_up() calls
>>>>>>>>>> task_rq_lock(). Since the task has its cpu set to 1, it gets the
>>>>>>>>>> runq lock for vcpu1. Next it calls resched_task(), which results
>>>>>>>>>> in sending an IPI to vcpu1. For that, vcpu0 gets into the
>>>>>>>>>> HYPERVISOR_event_channel_op HCALL and is waiting to return.
>>>>>>>>>> Meanwhile, vcpu1 got running, and is spinning on its runq lock in
>>>>>>>>>> "schedule(): spin_lock_irq(&rq->lock);", which vcpu0 is holding
>>>>>>>>>> (and is waiting to return from the HCALL).
>>>>>>>>>>
>>>>>>>>>> As I had noticed before, vcpu0 never gets scheduled in xen. So
>>>>>>>>>> looking further into xen:
>>>>>>>>>>
>>>>>>>>>> xen:
>>>>>>>>>> Both vcpus are on the same runq, in this case cpu1. But the
>>>>>>>>>> priority of vcpu1 has been set to CSCHED_PRI_TS_BOOST. As a
>>>>>>>>>> result, the scheduler always picks vcpu1, and vcpu0 is starved.
>>>>>>>>>> Also, I see in kdb that the scheduler timer is not set on cpu0.
>>>>>>>>>> That would've allowed csched_load_balance() to kick in on cpu0.
>>>>>>>>>> [Also, on cpu1, the accounting timer, csched_tick, is not set.
>>>>>>>>>> Although csched_tick() is running on cpu0, it only checks the
>>>>>>>>>> runq for cpu0.]
>>>>>>>>>>
>>>>>>>>>> Looks like c/s 19500 changed csched_schedule():
>>>>>>>>>>
>>>>>>>>>> -    ret.time = MILLISECS(CSCHED_MSECS_PER_TSLICE);
>>>>>>>>>> +    ret.time = (is_idle_vcpu(snext->vcpu) ?
>>>>>>>>>> +                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
>>>>>>>>>>
>>>>>>>>>> The quickest fix for us would be to just back that out.
>>>>>>>>>>
>>>>>>>>>> BTW, just a comment on the following (all in sched_credit.c):
>>>>>>>>>>
>>>>>>>>>>     if ( svc->pri == CSCHED_PRI_TS_UNDER &&
>>>>>>>>>>          !(svc->flags & CSCHED_FLAG_VCPU_PARKED) )
>>>>>>>>>>     {
>>>>>>>>>>         svc->pri = CSCHED_PRI_TS_BOOST;
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> combined with
>>>>>>>>>>
>>>>>>>>>>     if ( snext->pri > CSCHED_PRI_TS_OVER )
>>>>>>>>>>         __runq_remove(snext);
>>>>>>>>>>
>>>>>>>>>> Setting CSCHED_PRI_TS_BOOST as the pri of a vcpu seems dangerous.
>>>>>>>>>> To me, since csched_schedule() never checks the time accumulated
>>>>>>>>>> by a vcpu at pri CSCHED_PRI_TS_BOOST, that is the same as pinning
>>>>>>>>>> a vcpu to a pcpu. If that vcpu never makes progress, essentially,
>>>>>>>>>> the system has lost a physical cpu. Optionally, csched_schedule()
>>>>>>>>>> should always check the cpu time accumulated and reduce the
>>>>>>>>>> priority over time. I can't tell right off if it already does
>>>>>>>>>> that. or something like that :)... my 2 cents.
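George's reply below describes the tick-driven accounting that is supposed
to break BOOST. As a condensed illustration of that path: csched_tick(),
csched_vcpu_acct(), CSCHED_VCPU() and the priority constants are the real
names discussed in this thread, but the bodies and the credit constant are
simplified stand-ins, not the actual sched_credit.c code:

    /* Simplified: debit one tick's worth of credit from the running vcpu. */
    static void csched_vcpu_acct(unsigned int cpu)
    {
        struct csched_vcpu * const svc = CSCHED_VCPU(current);

        /* Once credit is exhausted the vcpu drops to OVER, which also
         * means it can no longer sit at CSCHED_PRI_TS_BOOST. */
        svc->credit -= CSCHED_CREDITS_PER_TICK;    /* stand-in for the
                                                      real debit logic */
        if ( svc->credit < 0 )
            svc->pri = CSCHED_PRI_TS_OVER;
    }

    /* Per-pcpu ticker, normally firing every 10ms. It only fires while
     * the ticker is armed; sched_tick_suspend() stops it on idle entry. */
    static void csched_tick(void *_cpu)
    {
        unsigned int cpu = (unsigned long)_cpu;

        /* No tick => no debit => a CSCHED_PRI_TS_BOOST vcpu keeps its
         * priority forever, csched_schedule() keeps picking it, and every
         * other vcpu on that runq starves: the hang described above. */
        if ( !is_idle_vcpu(current) )
            csched_vcpu_acct(cpu);
    }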
>>>>>>>>> Hmm... what's supposed to happen is that eventually a timer tick
>>>>>>>>> will interrupt vcpu1. If cpu1 is set to be "active", then it will
>>>>>>>>> be debited 10ms worth of credit. Eventually, it will go into OVER,
>>>>>>>>> and lose BOOST. If it's "inactive", then when the tick happens, it
>>>>>>>>> will be set to "active" and be debited 10ms again, setting it
>>>>>>>>> directly into OVER (and thus also losing BOOST).
>>>>>>>>>
>>>>>>>>> Can you see if the timer ticks are still happening, and perhaps
>>>>>>>>> put some tracing in to verify that what I described above is
>>>>>>>>> happening?
>>>>>>>>>
>>>>>>>>> -George
>>>>>>>>
>>>>>>>> George,
>>>>>>>>
>>>>>>>> Is that in csched_acct()? Looks like that's somehow gotten removed.
>>>>>>>> If true, then maybe that's the fundamental problem to chase.
>>>>>>>>
>>>>>>>> Here's what the trq looks like when hung, not in any schedule
>>>>>>>> function:
>>>>>>>>
>>>>>>>> [0]xkdb> dtrq
>>>>>>>> CPU[00]: NOW:0x00003f2db9af369e
>>>>>>>>   1: exp=0x00003ee31cb32200 fn:csched_tick        data:0000000000000000
>>>>>>>>   2: exp=0x00003ee347ece164 fn:time_calibration   data:0000000000000000
>>>>>>>>   3: exp=0x00003ee69a28f04b fn:mce_work_fn        data:0000000000000000
>>>>>>>>   4: exp=0x00003f055895e25f fn:plt_overflow       data:0000000000000000
>>>>>>>>   5: exp=0x00003ee353810216 fn:rtc_update_second  data:ffff83007f0226d8
>>>>>>>> CPU[01]: NOW:0x00003f2db9af369e
>>>>>>>>   1: exp=0x00003ee30b847988 fn:s_timer_fn         data:0000000000000000
>>>>>>>>   2: exp=0x00003f1b309ebd45 fn:pmt_timer_callback data:ffff83007f022a68
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> Mukesh

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel