[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin
On Wed, 2018-04-11 at 00:59 +0200, Dario Faggioli wrote: > [Adding Andrew, not because I expect anything, but just because > we've chatted about this issue on IRC :-) ] > Except, I did not add it. :-P Anyway... > On Tue, 2018-04-10 at 22:37 +0200, Olaf Hering wrote: > > On Tue, Apr 10, Dario Faggioli wrote: > > > > BUG_ON(__vcpu_on_runq(CSCHED_VCPU(vc))); > > ... patch attached. Olaf, can you give it a try? It should be fine to run it on top of the last debug patch (the one that produced this crash). Regards, Dario > > (XEN) Xen BUG at sched_credit.c:876 > > (XEN) ----[ Xen-4.11.20180410T125709.50f8ba84a5- > > 3.bug1087289_411 x86_64 debug=y Not tainted ]---- > > (XEN) CPU: 118 > > (XEN) RIP: e008:[<ffff82d080229ab4>] > > sched_credit.c#csched_vcpu_migrate+0x27/0x51 > > ... > > (XEN) Xen call trace: > > (XEN) [<ffff82d080229ab4>] > > sched_credit.c#csched_vcpu_migrate+0x27/0x51 > > (XEN) [<ffff82d080236348>] schedule.c#vcpu_move_locked+0xbb/0xc2 > > (XEN) [<ffff82d08023764c>] schedule.c#vcpu_migrate+0x226/0x25b > > (XEN) [<ffff82d08023935f>] context_saved+0x8d/0x94 > > (XEN) [<ffff82d08027797d>] context_switch+0xe66/0xeb0 > > (XEN) [<ffff82d080236943>] schedule.c#schedule+0x5f4/0x627 > > (XEN) [<ffff82d080239f15>] softirq.c#__do_softirq+0x85/0x90 > > (XEN) [<ffff82d080239f6a>] do_softirq+0x13/0x15 > > (XEN) [<ffff82d08031f5db>] vmx_asm_do_vmentry+0x2b/0x30 > > > > Hey... unless I've really put there a totally bogous BUG_ON(), this > looks interesting and potentially useful. > > It says that the vcpu which is being context switched out, and on > which > we are calling vcpu_migrate() on, because we found it to be > VPF_migrating, is actually in the runqueue already when we actually > get > to execute vcpu_migrate()->vcpu_move_locked(). > > Mmm... let's see. > > CPU A CPU B > . . > schedule(current == v) vcpu_set_affinity(v) > prev = current // == v . > schedule_lock(CPU A) . > csched_schedule() schedule_lock(CPU A) > if (runnable(v)) //YES x > runq_insert(v) x > return next != v x > schedule_unlock(CPU A) x // takes the lock > context_switch(prev,next) set_bit(v, > VPF_migrating) [*] > context_saved(prev) // still == v . > v->is_running = 0 schedule_unlock(CPU A) > SMP_MB . > if (test_bit(v, VPF_migrating)) // YES!! > vcpu_migrate(v) . > for { . > schedule_lock(CPU A) . > SCHED_OP(v, pick_cpu) . > set_bit(v, CSCHED_MIGRATING) . > return CPU C . > pick_called = 1 . > schedule_unlock(CPU A) . > schedule_lock(CPU A + CPU C) . > if (pick_called && ...) // YES . > break . > } . > // v->is_running is 0 . > //!test_and_clear(v, VPF_migrating)) is false!! > clear_bit(v, VPF_migrating) . > vcpu_move_locked(v, CPU C) . > BUG_ON(__vcpu_on_runq(v)) . > > [*] after this point, and until someone manages to call > vcpu_sleep(), > v sits in CPU A's runqueue with the VPF_migrating pause flag > set > > So, basically, the race is between context_saved() and > vcpu_set_affinity(). Basically, vcpu_set_affinity() sets the > VPF_migrating pause flags on a vcpu in a runqueue, with the intent of > letting either a vcpu_sleep_nosync() or a reschedule remove it from > there, but context_saved() manage to see the flag, before the removal > can happen. > > And I think this explains also the original BUG at > sched_credit.c:1694 > (it's just a bit more involved). > > As it can be seen above (and also in the code comment in ) there is a > barrier (which further testify that this is indeed a tricky passage), > but I guess it is not that effective! :-/ > > TBH, I have actually never fully understood what that comment really > meant, what the barrier was protecting, and how... e.g., isn't it > missing its paired one? In fact, there's another comment, clearly > related, right in vcpu_set_affinity(). But again I'm a bit at loss at > properly figuring out what the big idea is. > > George, what do you think? Does this make sense? > > Well, I'll think more about this, and to a possible fix, tomorrow > morning. > > Regards, > Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Software Engineer @ SUSE https://www.suse.com/ Attachment:
xen-sched-debug-vcpumigrate-race.patch Attachment:
signature.asc _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |