I have two (non-exclusive) ideas here:
1. If old_cpu is actually still available, do not move it at all.
2. Use sched_migrate() instead of sched_set_res().
Other possibilities:
3. Make sure that svc->rqd is set to null when the affinity is broken.
Currently, on vcpu creation, sched_init_vcpu() expects to set the pcpu; and it looks like for credit2, svc->rqd may not be set until the first time the vcpu is woken up (that's the 'if' part of the 'if/else' clause whose 'else' contains the ASSERT() you're hitting). So if we set svc->rqd to NULL when breaking the CPU affinity on suspend, then on wake the vcpu would "take" the runqueue assigned by restore_vcpu_affinity().
4. Make csched2_unit_wake() tolerant of pcpus changing under its feet.
#3 would potentially make things more robust, but would require adding some sort of callback to notify schedulers that affinity had been broken. ATM this might only be used by credit2.
#4 would potentially be dangerous: some other bit of credit2 code may assume that svc->rqd is valid.
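To make #3 concrete, here is a minimal, self-contained toy model of the idea (all names -- vcpu_state, runq, affinity_broken_cb, unit_wake -- are illustrative stand-ins, not Xen's actual structures or hooks):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of option #3: forget the stale runqueue when affinity is
 * broken, and let the wake path pick it up again. */
struct runq { int id; };
struct vcpu_state { struct runq *rqd; };

/* The hypothetical callback the core would invoke when breaking
 * affinity on suspend: clear the now-stale runqueue pointer. */
static void affinity_broken_cb(struct vcpu_state *svc)
{
    svc->rqd = NULL;
}

/* On wake, a NULL rqd means "not assigned yet", so take the runqueue
 * implied by whatever pcpu restore_vcpu_affinity() picked; a non-NULL
 * rqd must still match it (this assert models the ASSERT() being hit
 * today). */
static void unit_wake(struct vcpu_state *svc, struct runq *cpu_rqd)
{
    if ( svc->rqd == NULL )
        svc->rqd = cpu_rqd;
    else
        assert(svc->rqd == cpu_rqd);
}
```

The point is just that with the NULL-on-break convention, a vcpu moved to a different pcpu across suspend goes through the benign first-wake path rather than tripping the assertion.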
Here is the patch that fixes it for me:
---8<---
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 83455fbde1c8..dcf202d8b307 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -1246,19 +1246,29 @@ void restore_vcpu_affinity(struct domain *d)
}
}
- res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
+ /* Prefer old cpu if available. */
+ if ( cpumask_test_cpu(old_cpu, cpumask_scratch_cpu(cpu)) )
+ res = get_sched_res(old_cpu);
+ else
+ res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
sched_set_res(unit, res);
spin_unlock_irq(lock);
- /* v->processor might have changed, so reacquire the lock. */
- lock = unit_schedule_lock_irq(unit);
- res = sched_pick_resource(unit_scheduler(unit), unit);
- sched_set_res(unit, res);
- spin_unlock_irq(lock);
-
+ /*
+ * If a different cpu was chosen, it was random; let the scheduler
+ * make a proper decision.
+ */
if ( old_cpu != sched_unit_master(unit) )
+ {
+ /* v->processor might have changed, so reacquire the lock. */
+ lock = unit_schedule_lock_irq(unit);
+ res = sched_pick_resource(unit_scheduler(unit), unit);
+ sched_migrate(unit_scheduler(unit), unit, res->master_cpu);
+ spin_unlock_irq(lock);
+
sched_move_irqs(unit);
+ }
}
rcu_read_unlock(&sched_res_rculock);
---8<---
I have several doubts here:
1. If old_cpu is available, is sched_set_res() needed at all?
2. Should both calls be changed to sched_migrate()? Currently I changed
   only the second one, in case the scheduler could be confused by
   old_cpu not being available anymore.
3. Are there any extra locking requirements for sched_migrate() at this
stage? The long comment above sched_unit_migrate_start() suggests
there might be, but I'm not sure if that's really the case during
resume.
4. Related to the above - should thaw_domains() be modified to call
restore_vcpu_affinity() for all domains first, and unpause only
later? That could reduce locking requirements, I guess.
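On doubt #4, a tiny self-contained toy model of the two-pass shape I have in mind (the _toy names and the flag arrays are purely illustrative; the real code would iterate with for_each_domain() and take the appropriate locks):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_DOMAINS 3

/* Toy model of a two-pass thaw_domains(). */
static bool affinity_restored[NR_DOMAINS];
static bool unpaused[NR_DOMAINS];

static void restore_vcpu_affinity_toy(int d)
{
    int i;

    /* The invariant the two-pass shape buys us: no domain is running
     * while any affinity is still being fixed up. */
    for ( i = 0; i < NR_DOMAINS; i++ )
        assert(!unpaused[i]);
    affinity_restored[d] = true;
}

static void domain_unpause_toy(int d)
{
    assert(affinity_restored[d]);
    unpaused[d] = true;
}

static void thaw_domains_toy(void)
{
    int d;

    /* Pass 1: restore affinity for every domain first... */
    for ( d = 0; d < NR_DOMAINS; d++ )
        restore_vcpu_affinity_toy(d);

    /* Pass 2: ...and only then let anything run. */
    for ( d = 0; d < NR_DOMAINS; d++ )
        domain_unpause_toy(d);
}
```

Whether that actually relaxes the locking requirements for sched_migrate() during resume is exactly the question; the sketch only shows the ordering guarantee.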
Unfortunately this code has had a lot of churn since the last time I really engaged with it; I'm going to have to come back to this on Monday.
Jürgen / Dario, any thoughts?
-George