
Re: [Xen-devel] [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs



On 06/04/16 18:23, Dario Faggioli wrote:
> In short, the point is making sure that the actual switch
> of scheduler and the remapping of the scheduler's runqueue
> lock occur in the same critical section, protected by the
> "old" scheduler's lock (and not, e.g., in the free_pdata
> hook, as it is now for Credit2 and RTDS).
> 
> Not doing so is (at least) racy. For instance,
> if we switch cpu X from Credit2 to Credit, we do:
> 
>  schedule_cpu_switch(x, csched2 --> csched):
>    //scheduler[x] is csched2
>    //schedule_lock[x] is csched2_lock
>    csched_alloc_pdata(x)
>    csched_init_pdata(x)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    [1]
>    csched2_free_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched2_lock
>      schedule_lock[x] = csched_lock
>      spin_unlock(csched2_lock)
> 
> While, if we switch cpu X from Credit to Credit2, we do:
> 
>  schedule_cpu_switch(X, csched --> csched2):
>    //scheduler[x] is csched
>    //schedule_lock[x] is csched_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(csched_lock)
>    [2]
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    csched_free_pdata(x)
> 
> And if we switch cpu X from RTDS to Credit2, we do:
> 
>  schedule_cpu_switch(X, RTDS --> csched2):
>    //scheduler[x] is rtds
>    //schedule_lock[x] is rtds_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes rtds_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(rtds_lock)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[x] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    rtds_free_pdata(x)
>      spin_lock(rtds_lock)
>      ASSERT(schedule_lock[x] == rtds_lock) [3]
>      schedule_lock[x] = DEFAULT_SCHEDULE_LOCK [4]
>      spin_unlock(rtds_lock)
> 
> So, the first problem is that, if anything related to
> scheduling, and involving the CPU, happens at [1] or [2], we:
>  - take csched2_lock,
>  - operate on Credit1 functions and data structures,
> which is no good!
> 
> The second problem is that the ASSERT at [3] triggers, and
> the third is that, at [4], we screw up the lock remapping
> we've done for ourselves in csched2_init_pdata()!
> 
> The first problem arises because there is a window during
> which the lock is already the new one, but the scheduler is
> still the old one. The other two arise because we let
> schedulers mess with the lock (re)mapping done by others.
> 
> This patch, therefore, introduces a new hook in the scheduler
> interface, called switch_sched, meant to be used when
> switching scheduler on a CPU, and implements it for the
> various schedulers that need it (i.e., all except ARINC653),
> so that things are done in the proper order and under the
> protection of the best suited (set of) lock(s). It is
> necessary to add the hook (as compared to keeping things
> in generic code), because different schedulers may have
> different locking schemes.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Hey Dario! Everything here looks good, except for one thing: the
scheduler lock for the arinc653 scheduler. :-) What happens now if you
assign a cpu to credit2, and then assign it to arinc653? Since arinc653
doesn't implement the switch_sched() functionality, the per-cpu
scheduler lock will still point to the credit2 lock, won't it?

Which will *work*, although it will add unnecessary contention on the
credit2 lock; until that lock goes away, at which point
vcpu_schedule_lock*() will essentially be using a wild pointer.

 -George

> ---
> Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> Cc: Meng Xu <mengxu@xxxxxxxxxxxxx>
> Cc: Tianyang Chen <tiche@xxxxxxxxxxxxxx>
> ---
> Changes from v1:
> 
> new patch, basically, coming from squashing what were
> 4 patches in v1. In any case, with respect to those 4
> patches:
>  - the runqueue lock is back to being taken in schedule_cpu_switch(),
>    as suggested during review;
>  - add barriers for making sure all initialization is done
>    when the new lock is assigned, as suggested during review;
>  - add comments and ASSERT-s about how and why the adopted
>    locking scheme is safe, as suggested during review.
> ---
>  xen/common/sched_credit.c  |   44 ++++++++++++++++++++++++
>  xen/common/sched_credit2.c |   81 +++++++++++++++++++++++++++++++++-----------
>  xen/common/sched_rt.c      |   45 +++++++++++++++++-------
>  xen/common/schedule.c      |   41 +++++++++++++++++-----
>  xen/include/xen/sched-if.h |    3 ++
>  5 files changed, 172 insertions(+), 42 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 96a245d..540d515 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -578,12 +578,55 @@ csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>  {
>      unsigned long flags;
>      struct csched_private *prv = CSCHED_PRIV(ops);
> +    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> +
> +    /*
> +     * This is called either during boot, resume or hotplug, in
> +     * case Credit1 is the scheduler chosen at boot. In such cases, the
> +     * scheduler lock for cpu is already pointing to the default per-cpu
> +     * spinlock, as Credit1 needs it, so there is no remapping to be done.
> +     */
> +    ASSERT(sd->schedule_lock == &sd->_lock && !spin_is_locked(&sd->_lock));
>  
>      spin_lock_irqsave(&prv->lock, flags);
>      init_pdata(prv, pdata, cpu);
>      spin_unlock_irqrestore(&prv->lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (Credit). */
> +static void
> +csched_switch_sched(struct scheduler *ops, unsigned int cpu,
> +                    void *pdata, void *vdata)
> +{
> +    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> +    struct csched_private *prv = CSCHED_PRIV(ops);
> +    struct csched_vcpu *svc = vdata;
> +
> +    ASSERT(svc && is_idle_vcpu(svc->vcpu));
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +
> +    /*
> +     * We are holding the runqueue lock already (it's been taken in
> +     * schedule_cpu_switch()). It actually may or may not be the 'right'
> +     * one for this cpu, but that is ok for preventing races.
> +     */
> +    spin_lock(&prv->lock);
> +    init_pdata(prv, pdata, cpu);
> +    spin_unlock(&prv->lock);
> +
> +    per_cpu(scheduler, cpu) = ops;
> +    per_cpu(schedule_data, cpu).sched_priv = pdata;
> +
> +    /*
> +     * (Re?)route the lock to the per pCPU lock as the /last/ thing. In
> +     * fact, if it is free (and it can be) we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    sd->schedule_lock = &sd->_lock;
> +}
> +
>  #ifndef NDEBUG
>  static inline void
>  __csched_vcpu_check(struct vcpu *vc)
> @@ -2067,6 +2110,7 @@ static const struct scheduler sched_credit_def = {
>      .alloc_pdata    = csched_alloc_pdata,
>      .init_pdata     = csched_init_pdata,
>      .free_pdata     = csched_free_pdata,
> +    .switch_sched   = csched_switch_sched,
>      .alloc_domdata  = csched_alloc_domdata,
>      .free_domdata   = csched_free_domdata,
>  
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 8989eea..60c6f5b 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -1971,12 +1971,12 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>      cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>  
> -static void
> +/* Returns the ID of the runqueue the cpu is assigned to. */
> +static unsigned
>  init_pdata(struct csched2_private *prv, unsigned int cpu)
>  {
>      unsigned rqi;
>      struct csched2_runqueue_data *rqd;
> -    spinlock_t *old_lock;
>  
>      ASSERT(spin_is_locked(&prv->lock));
>      ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
> @@ -2007,44 +2007,89 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
>          activate_runqueue(prv, rqi);
>      }
>      
> -    /* IRQs already disabled */
> -    old_lock = pcpu_schedule_lock(cpu);
> -
> -    /* Move spinlock to new runq lock.  */
> -    per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
> -
>      /* Set the runqueue map */
>      prv->runq_map[cpu] = rqi;
>      
>      cpumask_set_cpu(cpu, &rqd->idle);
>      cpumask_set_cpu(cpu, &rqd->active);
> -
> -    /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
> -    spin_unlock(old_lock);
> -
>      cpumask_set_cpu(cpu, &prv->initialized);
>  
> -    return;
> +    return rqi;
>  }
>  
>  static void
>  csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>  {
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
> +    spinlock_t *old_lock;
>      unsigned long flags;
> +    unsigned rqi;
>  
>      spin_lock_irqsave(&prv->lock, flags);
> -    init_pdata(prv, cpu);
> +    old_lock = pcpu_schedule_lock(cpu);
> +
> +    rqi = init_pdata(prv, cpu);
> +    /* Move the scheduler lock to the new runq lock. */
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
> +
> +    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
> +    spin_unlock(old_lock);
>      spin_unlock_irqrestore(&prv->lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (Credit2). */
> +static void
> +csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
> +                     void *pdata, void *vdata)
> +{
> +    struct csched2_private *prv = CSCHED2_PRIV(new_ops);
> +    struct csched2_vcpu *svc = vdata;
> +    unsigned rqi;
> +
> +    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
> +
> +    /*
> +     * We own one runqueue lock already (from schedule_cpu_switch()). This
> +     * looks like it violates this scheduler's locking rules, but it does
> +     * not, as what we own is the lock of another scheduler, that hence has
> +     * no particular (ordering) relationship with our private global lock.
> +     * And owning exactly that one (the lock of the old scheduler of this
> +     * cpu) is what is necessary to prevent races.
> +     */
> +    spin_lock_irq(&prv->lock);
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +
> +    rqi = init_pdata(prv, cpu);
> +
> +    /*
> +     * Now that we know what runqueue we'll go in, double check what's said
> +     * above: the lock we already hold is not the one of this runqueue of
> +     * this scheduler, and so it's safe to have taken it /before/ our
> +     * private global lock.
> +     */
> +    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->rqd[rqi].lock);
> +
> +    per_cpu(scheduler, cpu) = new_ops;
> +    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
> +
> +    /*
> +     * (Re?)route the lock to this runqueue's lock as the /last/ thing. In
> +     * fact, if it is free (and it can be) we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
> +
> +    spin_unlock_irq(&prv->lock);
> +}
> +
>  static void
>  csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  {
>      unsigned long flags;
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      struct csched2_runqueue_data *rqd;
> -    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
>      int rqi;
>  
>      spin_lock_irqsave(&prv->lock, flags);
> @@ -2072,11 +2117,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>          deactivate_runqueue(prv, rqi);
>      }
>  
> -    /* Move spinlock to the original lock.  */
> -    ASSERT(sd->schedule_lock == &rqd->lock);
> -    ASSERT(!spin_is_locked(&sd->_lock));
> -    sd->schedule_lock = &sd->_lock;
> -
>      spin_unlock(&rqd->lock);
>  
>      cpumask_clear_cpu(cpu, &prv->initialized);
> @@ -2170,6 +2210,7 @@ static const struct scheduler sched_credit2_def = {
>      .free_vdata     = csched2_free_vdata,
>      .init_pdata     = csched2_init_pdata,
>      .free_pdata     = csched2_free_pdata,
> +    .switch_sched   = csched2_switch_sched,
>      .alloc_domdata  = csched2_alloc_domdata,
>      .free_domdata   = csched2_free_domdata,
>  };
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index b96bd93..3bb8c71 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c
> @@ -682,6 +682,37 @@ rt_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>      spin_unlock_irqrestore(old_lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (RTDS). */
> +static void
> +rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
> +                void *pdata, void *vdata)
> +{
> +    struct rt_private *prv = rt_priv(new_ops);
> +    struct rt_vcpu *svc = vdata;
> +
> +    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
> +
> +    /*
> +     * We are holding the runqueue lock already (it's been taken in
> +     * schedule_cpu_switch()). It's actually the runqueue lock of
> +     * another scheduler, but that is how things need to be, for
> +     * preventing races.
> +     */
> +    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->lock);
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +    per_cpu(scheduler, cpu) = new_ops;
> +    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
> +
> +    /*
> +     * (Re?)route the lock to RTDS's global lock as the /last/ thing. In
> +     * fact, if it is free (and it can be) we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> +}
> +
>  static void *
>  rt_alloc_pdata(const struct scheduler *ops, int cpu)
>  {
> @@ -707,19 +738,6 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
>  static void
>  rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  {
> -    struct rt_private *prv = rt_priv(ops);
> -    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> -    unsigned long flags;
> -
> -    spin_lock_irqsave(&prv->lock, flags);
> -
> -    /* Move spinlock back to the default lock */
> -    ASSERT(sd->schedule_lock == &prv->lock);
> -    ASSERT(!spin_is_locked(&sd->_lock));
> -    sd->schedule_lock = &sd->_lock;
> -
> -    spin_unlock_irqrestore(&prv->lock, flags);
> -
>      free_cpumask_var(_cpumask_scratch[cpu]);
>  }
>  
> @@ -1468,6 +1486,7 @@ static const struct scheduler sched_rtds_def = {
>      .alloc_pdata    = rt_alloc_pdata,
>      .free_pdata     = rt_free_pdata,
>      .init_pdata     = rt_init_pdata,
> +    .switch_sched   = rt_switch_sched,
>      .alloc_domdata  = rt_alloc_domdata,
>      .free_domdata   = rt_free_domdata,
>      .init_domain    = rt_dom_init,
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 1941613..5559aa1 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1635,11 +1635,11 @@ void __init scheduler_init(void)
>  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>  {
>      struct vcpu *idle;
> -    spinlock_t *lock;
>      void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
>      struct scheduler *old_ops = per_cpu(scheduler, cpu);
>      struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
>      struct cpupool *old_pool = per_cpu(cpupool, cpu);
> +    spinlock_t *old_lock;
>  
>      /*
>       * pCPUs only move from a valid cpupool to free (i.e., out of any pool),
> @@ -1658,11 +1658,21 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>      if ( old_ops == new_ops )
>          goto out;
>  
> +    /*
> +     * To set up the cpu for the new scheduler we need:
> +     *  - a valid instance of per-CPU scheduler specific data, as it is
> +     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
> +     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
> +     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
> +     *    in proper ordering and with locking.
> +     *  - a valid instance of per-vCPU scheduler specific data, for the idle
> +     *    vCPU of cpu. That is what the target scheduler will use for the
> +     *    sched_priv field of the per-vCPU info of the idle domain.
> +     */
>      idle = idle_vcpu[cpu];
>      ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
>      if ( IS_ERR(ppriv) )
>          return PTR_ERR(ppriv);
> -    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
>      vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
>      if ( vpriv == NULL )
>      {
> @@ -1670,17 +1680,30 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>          return -ENOMEM;
>      }
>  
> -    lock = pcpu_schedule_lock_irq(cpu);
> -
>      SCHED_OP(old_ops, tick_suspend, cpu);
> +
> +    /*
> +     * The actual switch, including (if necessary) the rerouting of the
> +     * scheduler lock to whatever new_ops prefers, needs to happen in one
> +     * critical section, protected by old_ops' lock, or races are possible.
> +     * It is, in fact, the lock of another scheduler that we are taking (the
> +     * scheduler of the cpupool that cpu still belongs to). But that is ok,
> +     * as anyone trying to schedule on this cpu will spin until we release
> +     * that lock (bottom of this function). When it does get the lock
> +     * --thanks to the retry loop inside the *_schedule_lock() functions--
> +     * it will notice that the lock itself has changed, and will retry
> +     * acquiring the new one (which will be the correct, remapped one).
> +     */
> +    old_lock = pcpu_schedule_lock(cpu);
> +
>      vpriv_old = idle->sched_priv;
> -    idle->sched_priv = vpriv;
> -    per_cpu(scheduler, cpu) = new_ops;
>      ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> -    per_cpu(schedule_data, cpu).sched_priv = ppriv;
> -    SCHED_OP(new_ops, tick_resume, cpu);
> +    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
>  
> -    pcpu_schedule_unlock_irq(lock, cpu);
> +    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
> +    spin_unlock_irq(old_lock);
> +
> +    SCHED_OP(new_ops, tick_resume, cpu);
>  
>      SCHED_OP(old_ops, free_vdata, vpriv_old);
>      SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> index 70c08c6..9cebe41 100644
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -137,6 +137,9 @@ struct scheduler {
>      void         (*free_domdata)   (const struct scheduler *, void *);
>      void *       (*alloc_domdata)  (const struct scheduler *, struct domain *);
>  
> +    void         (*switch_sched)   (struct scheduler *, unsigned int,
> +                                    void *, void *);
> +
>      int          (*init_domain)    (const struct scheduler *, struct domain *);
>      void         (*destroy_domain) (const struct scheduler *, struct domain *);
>  
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

