
Re: [Xen-devel] [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs



On 06/04/16 18:23, Dario Faggioli wrote:
> In short, the point is making sure that the actual switch
> of scheduler and the remapping of the scheduler's runqueue
> lock occur in the same critical section, protected by the
> "old" scheduler's lock (and not, e.g., in the free_pdata
> hook, as it is now for Credit2 and RTDS).
> 
> Not doing so is (at least) racy. For instance,
> if we switch cpu X from Credit2 to Credit, we do:
> 
>  schedule_cpu_switch(x, csched2 --> csched):
>    //scheduler[x] is csched2
>    //schedule_lock[x] is csched2_lock
>    csched_alloc_pdata(x)
>    csched_init_pdata(x)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    [1]
>    csched2_free_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched2_lock
>      schedule_lock[x] = csched_lock
>      spin_unlock(csched2_lock)
> 
> While, if we switch cpu X from Credit to Credit2, we do:
> 
>  schedule_cpu_switch(X, csched --> csched2):
>    //scheduler[x] is csched
>    //schedule_lock[x] is csched_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(csched_lock)
>    [2]
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    csched_free_pdata(x)
> 
> And if we switch cpu X from RTDS to Credit2, we do:
> 
>  schedule_cpu_switch(X, RTDS --> csched2):
>    //scheduler[x] is rtds
>    //schedule_lock[x] is rtds_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes rtds_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(rtds_lock)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[x] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    rtds_free_pdata(x)
>      spin_lock(rtds_lock)
>      ASSERT(schedule_lock[x] == rtds_lock) [3]
>      schedule_lock[x] = DEFAULT_SCHEDULE_LOCK [4]
>      spin_unlock(rtds_lock)
> 
> So, the first problem is that, if anything related to
> scheduling, and involving the CPU, happens at [1] or [2], we:
>  - take csched2_lock,
>  - operate on Credit1 functions and data structures,
> which is no good!
> 
> The second problem is that the ASSERT at [3] triggers, and
> the third is that, at [4], we screw up the lock remapping
> we've done for ourselves in csched2_init_pdata()!
> 
> The first problem arises because there is a window during
> which the lock is already the new one, but the scheduler is
> still the old one. The other two arise because we let
> schedulers mess with the lock (re)mapping done by others.
> 
> This patch, therefore, introduces a new hook in the scheduler
> interface, called switch_sched, meant to be used when
> switching scheduler on a CPU, and implements it for the
> various schedulers that need it (i.e., all except ARINC653),
> so that things are done in the proper order and under the
> protection of the best suited (set of) lock(s). It is
> necessary to add the hook (as compared to keeping things
> in generic code), because different schedulers may have
> different locking schemes.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Hey Dario! Everything here looks good, except for one thing: the
scheduler lock for the arinc653 scheduler. :-) What happens now if you
assign a cpu to credit2, and then assign it to arinc653? Since arinc653
doesn't implement the switch_sched() functionality, the per-cpu
scheduler lock will still point to the credit2 lock, won't it?

Which will *work*, although it will add unnecessary contention on the
credit2 lock; until that lock goes away, at which point
vcpu_schedule_lock*() will essentially be using a wild pointer.

 -George

> ---
> Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> Cc: Meng Xu <mengxu@xxxxxxxxxxxxx>
> Cc: Tianyang Chen <tiche@xxxxxxxxxxxxxx>
> ---
> Changes from v1:
> 
> new patch, basically, coming from squashing what were
> 4 patches in v1. In any case, with respect to those 4
> patches:
>  - the runqueue lock is back to being taken in schedule_cpu_switch(),
>    as suggested during review;
>  - add barriers for making sure all initialization is done
>    when the new lock is assigned, as suggested during review;
>  - add comments and ASSERT-s about how and why the adopted
>    locking scheme is safe, as suggested during review.
> ---
>  xen/common/sched_credit.c  |   44 ++++++++++++++++++++++++
>  xen/common/sched_credit2.c |   81 +++++++++++++++++++++++++++++++++-----------
>  xen/common/sched_rt.c      |   45 +++++++++++++++++-------
>  xen/common/schedule.c      |   41 +++++++++++++++++-----
>  xen/include/xen/sched-if.h |    3 ++
>  5 files changed, 172 insertions(+), 42 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 96a245d..540d515 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -578,12 +578,55 @@ csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>  {
>      unsigned long flags;
>      struct csched_private *prv = CSCHED_PRIV(ops);
> +    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> +
> +    /*
> +     * This is called either during boot, resume or hotplug, in
> +     * case Credit1 is the scheduler chosen at boot. In such cases, the
> +     * scheduler lock for cpu is already pointing to the default per-cpu
> +     * spinlock, as Credit1 needs it, so there is no remapping to be done.
> +     */
> +    ASSERT(sd->schedule_lock == &sd->_lock && !spin_is_locked(&sd->_lock));
>  
>      spin_lock_irqsave(&prv->lock, flags);
>      init_pdata(prv, pdata, cpu);
>      spin_unlock_irqrestore(&prv->lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (Credit). */
> +static void
> +csched_switch_sched(struct scheduler *ops, unsigned int cpu,
> +                    void *pdata, void *vdata)
> +{
> +    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> +    struct csched_private *prv = CSCHED_PRIV(ops);
> +    struct csched_vcpu *svc = vdata;
> +
> +    ASSERT(svc && is_idle_vcpu(svc->vcpu));
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +
> +    /*
> +     * We are holding the runqueue lock already (it's been taken in
> +     * schedule_cpu_switch()). It actually may or may not be the 'right'
> +     * one for this cpu, but that is ok for preventing races.
> +     */
> +    spin_lock(&prv->lock);
> +    init_pdata(prv, pdata, cpu);
> +    spin_unlock(&prv->lock);
> +
> +    per_cpu(scheduler, cpu) = ops;
> +    per_cpu(schedule_data, cpu).sched_priv = pdata;
> +
> +    /*
> +     * (Re?)route the lock to the per pCPU lock as the /last/ thing. In
> +     * fact, if it is free (and it can be) we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    sd->schedule_lock = &sd->_lock;
> +}
> +
>  #ifndef NDEBUG
>  static inline void
>  __csched_vcpu_check(struct vcpu *vc)
> @@ -2067,6 +2110,7 @@ static const struct scheduler sched_credit_def = {
>      .alloc_pdata    = csched_alloc_pdata,
>      .init_pdata     = csched_init_pdata,
>      .free_pdata     = csched_free_pdata,
> +    .switch_sched   = csched_switch_sched,
>      .alloc_domdata  = csched_alloc_domdata,
>      .free_domdata   = csched_free_domdata,
>  
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 8989eea..60c6f5b 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -1971,12 +1971,12 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>      cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>  
> -static void
> +/* Returns the ID of the runqueue the cpu is assigned to. */
> +static unsigned
>  init_pdata(struct csched2_private *prv, unsigned int cpu)
>  {
>      unsigned rqi;
>      struct csched2_runqueue_data *rqd;
> -    spinlock_t *old_lock;
>  
>      ASSERT(spin_is_locked(&prv->lock));
>      ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
> @@ -2007,44 +2007,89 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
>          activate_runqueue(prv, rqi);
>      }
>      
> -    /* IRQs already disabled */
> -    old_lock = pcpu_schedule_lock(cpu);
> -
> -    /* Move spinlock to new runq lock.  */
> -    per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
> -
>      /* Set the runqueue map */
>      prv->runq_map[cpu] = rqi;
>      
>      cpumask_set_cpu(cpu, &rqd->idle);
>      cpumask_set_cpu(cpu, &rqd->active);
> -
> -    /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
> -    spin_unlock(old_lock);
> -
>      cpumask_set_cpu(cpu, &prv->initialized);
>  
> -    return;
> +    return rqi;
>  }
>  
>  static void
>  csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>  {
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
> +    spinlock_t *old_lock;
>      unsigned long flags;
> +    unsigned rqi;
>  
>      spin_lock_irqsave(&prv->lock, flags);
> -    init_pdata(prv, cpu);
> +    old_lock = pcpu_schedule_lock(cpu);
> +
> +    rqi = init_pdata(prv, cpu);
> +    /* Move the scheduler lock to the new runq lock. */
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
> +
> +    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
> +    spin_unlock(old_lock);
>      spin_unlock_irqrestore(&prv->lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (Credit2). */
> +static void
> +csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
> +                     void *pdata, void *vdata)
> +{
> +    struct csched2_private *prv = CSCHED2_PRIV(new_ops);
> +    struct csched2_vcpu *svc = vdata;
> +    unsigned rqi;
> +
> +    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
> +
> +    /*
> +     * We own one runqueue lock already (from schedule_cpu_switch()). This
> +     * looks like it violates this scheduler's locking rules, but it does
> +     * not, as what we own is the lock of another scheduler, that hence has
> +     * no particular (ordering) relationship with our private global lock.
> +     * And owning exactly that one (the lock of the old scheduler of this
> +     * cpu) is what is necessary to prevent races.
> +     */
> +    spin_lock_irq(&prv->lock);
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +
> +    rqi = init_pdata(prv, cpu);
> +
> +    /*
> +     * Now that we know what runqueue we'll go in, double check what's said
> +     * above: the lock we already hold is not the one of this runqueue of
> +     * this scheduler, and so it's safe to have taken it /before/ our
> +     * private global lock.
> +     */
> +    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->rqd[rqi].lock);
> +
> +    per_cpu(scheduler, cpu) = new_ops;
> +    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
> +
> +    /*
> +     * (Re?)route the lock to this runqueue's lock as the /last/ thing. In
> +     * fact, if it is free (and it can be) we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
> +
> +    spin_unlock_irq(&prv->lock);
> +}
> +
>  static void
>  csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  {
>      unsigned long flags;
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      struct csched2_runqueue_data *rqd;
> -    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
>      int rqi;
>  
>      spin_lock_irqsave(&prv->lock, flags);
> @@ -2072,11 +2117,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>          deactivate_runqueue(prv, rqi);
>      }
>  
> -    /* Move spinlock to the original lock.  */
> -    ASSERT(sd->schedule_lock == &rqd->lock);
> -    ASSERT(!spin_is_locked(&sd->_lock));
> -    sd->schedule_lock = &sd->_lock;
> -
>      spin_unlock(&rqd->lock);
>  
>      cpumask_clear_cpu(cpu, &prv->initialized);
> @@ -2170,6 +2210,7 @@ static const struct scheduler sched_credit2_def = {
>      .free_vdata     = csched2_free_vdata,
>      .init_pdata     = csched2_init_pdata,
>      .free_pdata     = csched2_free_pdata,
> +    .switch_sched   = csched2_switch_sched,
>      .alloc_domdata  = csched2_alloc_domdata,
>      .free_domdata   = csched2_free_domdata,
>  };
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index b96bd93..3bb8c71 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c
> @@ -682,6 +682,37 @@ rt_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>      spin_unlock_irqrestore(old_lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (RTDS). */
> +static void
> +rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
> +                void *pdata, void *vdata)
> +{
> +    struct rt_private *prv = rt_priv(new_ops);
> +    struct rt_vcpu *svc = vdata;
> +
> +    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
> +
> +    /*
> +     * We are holding the runqueue lock already (it's been taken in
> +     * schedule_cpu_switch()). It's actually the runqueue lock of
> +     * another scheduler, but that is how things need to be, for
> +     * preventing races.
> +     */
> +    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->lock);
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +    per_cpu(scheduler, cpu) = new_ops;
> +    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
> +
> +    /*
> +     * (Re?)route the lock to RTDS's global lock as the /last/ thing. In
> +     * fact, if it is free (and it can be) we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> +}
> +
>  static void *
>  rt_alloc_pdata(const struct scheduler *ops, int cpu)
>  {
> @@ -707,19 +738,6 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
>  static void
>  rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  {
> -    struct rt_private *prv = rt_priv(ops);
> -    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> -    unsigned long flags;
> -
> -    spin_lock_irqsave(&prv->lock, flags);
> -
> -    /* Move spinlock back to the default lock */
> -    ASSERT(sd->schedule_lock == &prv->lock);
> -    ASSERT(!spin_is_locked(&sd->_lock));
> -    sd->schedule_lock = &sd->_lock;
> -
> -    spin_unlock_irqrestore(&prv->lock, flags);
> -
>      free_cpumask_var(_cpumask_scratch[cpu]);
>  }
>  
> @@ -1468,6 +1486,7 @@ static const struct scheduler sched_rtds_def = {
>      .alloc_pdata    = rt_alloc_pdata,
>      .free_pdata     = rt_free_pdata,
>      .init_pdata     = rt_init_pdata,
> +    .switch_sched   = rt_switch_sched,
>      .alloc_domdata  = rt_alloc_domdata,
>      .free_domdata   = rt_free_domdata,
>      .init_domain    = rt_dom_init,
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 1941613..5559aa1 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1635,11 +1635,11 @@ void __init scheduler_init(void)
>  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>  {
>      struct vcpu *idle;
> -    spinlock_t *lock;
>      void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
>      struct scheduler *old_ops = per_cpu(scheduler, cpu);
>      struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
>      struct cpupool *old_pool = per_cpu(cpupool, cpu);
> +    spinlock_t *old_lock;
>  
>      /*
>       * pCPUs only move from a valid cpupool to free (i.e., out of any pool),
> @@ -1658,11 +1658,21 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>      if ( old_ops == new_ops )
>          goto out;
>  
> +    /*
> +     * To set up the cpu for the new scheduler we need:
> +     *  - a valid instance of per-CPU scheduler specific data, as it is
> +     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
> +     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
> +     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
> +     *    in proper ordering and with locking.
> +     *  - a valid instance of per-vCPU scheduler specific data, for the idle
> +     *    vCPU of cpu. That is what the target scheduler will use for the
> +     *    sched_priv field of the per-vCPU info of the idle domain.
> +     */
>      idle = idle_vcpu[cpu];
>      ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
>      if ( IS_ERR(ppriv) )
>          return PTR_ERR(ppriv);
> -    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
>      vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
>      if ( vpriv == NULL )
>      {
> @@ -1670,17 +1680,30 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>          return -ENOMEM;
>      }
>  
> -    lock = pcpu_schedule_lock_irq(cpu);
> -
>      SCHED_OP(old_ops, tick_suspend, cpu);
> +
> +    /*
> +     * The actual switch, including (if necessary) the rerouting of the
> +     * scheduler lock to whatever new_ops prefers, needs to happen in one
> +     * critical section, protected by old_ops' lock, or races are possible.
> +     * It is, in fact, the lock of another scheduler that we are taking (the
> +     * scheduler of the cpupool that cpu still belongs to). But that is ok,
> +     * as anyone trying to schedule on this cpu will spin until we release
> +     * that lock (bottom of this function). When it does get the lock
> +     * --thanks to the retry loop inside the *_schedule_lock() functions--
> +     * it will notice that the lock itself has changed, and will retry
> +     * acquiring the new one (which will be the correct, remapped one).
> +     */
> +    old_lock = pcpu_schedule_lock(cpu);
> +
>      vpriv_old = idle->sched_priv;
> -    idle->sched_priv = vpriv;
> -    per_cpu(scheduler, cpu) = new_ops;
>      ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> -    per_cpu(schedule_data, cpu).sched_priv = ppriv;
> -    SCHED_OP(new_ops, tick_resume, cpu);
> +    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
>  
> -    pcpu_schedule_unlock_irq(lock, cpu);
> +    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
> +    spin_unlock_irq(old_lock);
> +
> +    SCHED_OP(new_ops, tick_resume, cpu);
>  
>      SCHED_OP(old_ops, free_vdata, vpriv_old);
>      SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> index 70c08c6..9cebe41 100644
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -137,6 +137,9 @@ struct scheduler {
>      void         (*free_domdata)   (const struct scheduler *, void *);
>      void *       (*alloc_domdata)  (const struct scheduler *, struct domain *);
>  
> +    void         (*switch_sched)   (struct scheduler *, unsigned int,
> +                                    void *, void *);
> +
>      int          (*init_domain)    (const struct scheduler *, struct domain *);
>      void         (*destroy_domain) (const struct scheduler *, struct domain *);
>  
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

