Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin
On Fri, 2018-04-13 at 09:03 +0000, George Dunlap wrote:
> > On Apr 12, 2018, at 6:25 PM, Dario Faggioli <dfaggioli@xxxxxxxx>
> > wrote:
> > 
> > On the "other CPU", we might be around here [**]:
> > 
> >   static void vcpu_migrate(struct vcpu *v)
> >   {
> >       ...
> >       if ( v->is_running ||
> >            !test_and_clear_bit(_VPF_migrating, &v->pause_flags) )
> 
> I think the bottom line is, for this test to be valid, then at this
> point test_bit(VPF_migrating) *must* imply !vcpu_on_runqueue(v), but
> at this point it doesn't: If someone else has come by and cleared the
> bit, done migration, and woken it up, and then someone *else* set the
> bit again without taking it off the runqueue, it may still be on the
> runqueue.

BTW, I suddenly realized that Olaf, in his reproducer, is changing
both hard and soft affinity. That means two calls to
vcpu_set_affinity().

And here's the race (beware, it's a bit of a long chain of events! :-P):

      CPU A                                  CPU B
        .                                      .
  schedule(current == v)                 vcpu_set_affinity(v) <-- for hard affinity
    prev = current // == v                     .
    schedule_lock(CPU A)                       .
    csched_schedule()                    schedule_lock(CPU A)
      if (runnable(v)) // YES                  x
        runq_insert(v)                         x
      return next != v                         x
    schedule_unlock(CPU A)                     x // takes the lock
    context_switch(prev, next)           set_bit(v, VPF_migrating)
    context_saved(prev) // still == v          .
      v->is_running = 0                  schedule_unlock(CPU A)
      SMP_MB                             domain_update_node_affinity(v->d)
        .                                if (test_bit(v, VPF_migrating)) // YES
        .                                  vcpu_sleep_nosync(v)
        .                                    schedule_lock(CPU A)
        .                                    if (!vcpu_runnable(v)) // YES
        .                                      SCHED_OP(v, sleep)
        .                                        if (curr_on_cpu(v, CPU A)) // NO
        .                                          ---
        .                                        else if (__vcpu_on_runq(v)) // YES
        .                                          runq_remove(v)
        .                                    schedule_unlock(CPU A)
        .                                  vcpu_migrate(v)
        .                                    for {
        .                                      schedule_lock(CPU A)
        .                                      SCHED_OP(v, pick_cpu)
        .                                        set_bit(v, CSCHED_MIGRATING)
        .                                        return CPU D
        .                                      pick_called = 1
        .                                      schedule_unlock(CPU A)
  if (test_bit(v, VPF_migrating)) // YES     schedule_lock(CPU A + CPU D)
    vcpu_sleep_nosync(v)                     if (pick_called && ...) // YES
      schedule_lock(CPU A)                     break
        x                                    }
        x                                    // CPU B clears VPF_migrating!
        x                                    if (v->is_running ||
        x                                         !test_and_clear(v, VPF_migrating)) // NO
        x                                      ---
        x                                    vcpu_move_locked(v)
        x                                      v->processor = CPU D
        x                                    schedule_unlock(CPU A + CPU D)
        x // takes *CPU D* lock                .
      if (!vcpu_runnable(v))                   .
      // FALSE, as VPF_migrating is now clear  .
        ---                              vcpu_wake(v)
      schedule_unlock(CPU D)                   .
    vcpu_migrate(v)                        schedule_lock(CPU D)
      for {                                if (vcpu_runnable(v)) // YES
        schedule_lock(CPU D)                 SCHED_OP(v, wake)
        x                                      runq_insert(v) // v is now in CPU D's runqueue
        x                                      runq_tickle(v)
        x                                  schedule_unlock(CPU D)
        x // takes the lock                    .
        SCHED_OP(v, pick_cpu)                  .
          set_bit(v, CSCHED_MIGRATING)         .
          return CPU C                         .
        pick_called = 1                        .
        schedule_unlock(CPU D)                 .
        .                                vcpu_set_affinity(v) <-- for soft affinity
        .                                  schedule_lock(CPU D)
        schedule_lock(CPU D + CPU C)       set_bit(v, VPF_migrating)
        x                                  schedule_unlock(CPU D)
        x // takes the lock                    .
        if (pick_called && ...) // YES         .
          break                                .
      }                                        .
      if ( v->is_running ||                    .
           !test_and_clear(v, VPF_migrating)) // FALSE !!
      vcpu_move_locked(v, CPU C)               .
        BUG_ON(__vcpu_on_runq(v))              .

It appears that changing hard affinity only does not trigger the bug,
which would mean this analysis is correct. Also, as Olaf just reported,
running with your series (and changing both hard and soft affinity)
also works.

Now we have to decide whether to take your series and backport it
(which is what I'm leaning toward), or do something else. But if you
don't mind, we'd have to do it on Monday, as I have to run right
now. :-P

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE  https://www.suse.com/

Attachment:
signature.asc

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel