
Re: [Xen-devel] [PATCH 3/4] xen: sched: reorganize cpu_disable_scheduler()



On Fri, Jul 3, 2015 at 4:49 PM, Dario Faggioli
<dario.faggioli@xxxxxxxxxx> wrote:
> The function is called both when we want to remove a cpu
> from a cpupool, and during cpu teardown, for suspend or
> shutdown. If, however, the boot cpu (cpu 0, most of the
> times) is not present in the default cpupool, during
> suspend or shutdown, Xen crashes like this:
>
>   root@Zhaman:~# xl cpupool-cpu-remove Pool-0 0
>   root@Zhaman:~# shutdown -h now
>   (XEN) ----[ Xen-4.6-unstable  x86_64  debug=y  Tainted:    C ]----
>   ...
>   (XEN) Xen call trace:
>   (XEN)    [<ffff82d0801238de>] _csched_cpu_pick+0x156/0x61f
>   (XEN)    [<ffff82d080123db5>] csched_cpu_pick+0xe/0x10
>   (XEN)    [<ffff82d08012de3c>] vcpu_migrate+0x18e/0x321
>   (XEN)    [<ffff82d08012e4f8>] cpu_disable_scheduler+0x1cf/0x2ac
>   (XEN)    [<ffff82d08018bb8d>] __cpu_disable+0x313/0x36e
>   (XEN)    [<ffff82d080101424>] take_cpu_down+0x34/0x3b
>   (XEN)    [<ffff82d08013097a>] stopmachine_action+0x70/0x99
>   (XEN)    [<ffff82d0801325f0>] do_tasklet_work+0x78/0xab
>   (XEN)    [<ffff82d080132926>] do_tasklet+0x5e/0x8a
>   (XEN)    [<ffff82d08016478c>] idle_loop+0x56/0x6b
>   (XEN)
>   (XEN)
>   (XEN) ****************************************
>   (XEN) Panic on CPU 15:
>   (XEN) Assertion 'cpu < nr_cpu_ids' failed at ...URCES/xen/xen/xen.git/xen/include/xen/cpumask.h:97
>   (XEN) ****************************************
>
> There also are problems when we try to suspend or shutdown
> with a cpupool configured with just one cpu (no matter, in
> this case, whether that is the boot cpu or not):
>
>   root@Zhaman:~# xl create /etc/xen/test.cfg
>   root@Zhaman:~# xl cpupool-migrate test Pool-1
>   root@Zhaman:~# xl cpupool-list -c
>   Name               CPU list
>   Pool-0             0,1,2,3,4,5,6,7,8,9,10,11,13,14,15
>   Pool-1             12
>   root@Zhaman:~# shutdown -h now
>   (XEN) ----[ Xen-4.6-unstable  x86_64  debug=y  Tainted:    C ]----
>   (XEN) CPU:    12
>   ...
>   (XEN) Xen call trace:
>   (XEN)    [<ffff82d08018bb91>] __cpu_disable+0x317/0x36e
>   (XEN)    [<ffff82d080101424>] take_cpu_down+0x34/0x3b
>   (XEN)    [<ffff82d08013097a>] stopmachine_action+0x70/0x99
>   (XEN)    [<ffff82d0801325f0>] do_tasklet_work+0x78/0xab
>   (XEN)    [<ffff82d080132926>] do_tasklet+0x5e/0x8a
>   (XEN)    [<ffff82d08016478c>] idle_loop+0x56/0x6b
>   (XEN)
>   (XEN)
>   (XEN) ****************************************
>   (XEN) Panic on CPU 12:
>   (XEN) Xen BUG at smpboot.c:895
>   (XEN) ****************************************
>
> In both cases, the problem is the scheduler not being able
> to:
>  - move all the vcpus to the boot cpu (as the boot cpu is
>    not in the cpupool), in the former;
>  - move the vcpus away from a cpu at all (as that is the
>    only one cpu in the cpupool), in the latter.
>
> The solution is to distinguish, inside cpu_disable_scheduler(),
> between the two cases of cpupool manipulation and teardown. For
> cpupool manipulation, it is correct to ask the scheduler to
> take an action, as pathological situations (like there not
> being any cpu in the pool to send vcpus to) are taken care
> of (i.e., forbidden!) already. For suspend and shutdown,
> we don't want the scheduler to be involved at all, as the
> final goal is pretty simple: "send all the vcpus to the
> boot cpu ASAP", so we just go for it.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> ---
> Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> Cc: Juergen Gross <jgross@xxxxxxxx>
> ---
>  xen/common/schedule.c |  109 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 93 insertions(+), 16 deletions(-)
>
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index e83c666..eac8804 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -455,8 +455,8 @@ void vcpu_unblock(struct vcpu *v)
>   * Do the actual movement of a vcpu from old to new CPU. Locks for *both*
>   * CPUs need to have been taken already when calling this!
>   */
> -static void vcpu_move(struct vcpu *v, unsigned int old_cpu,
> -                      unsigned int new_cpu)
> +static void _vcpu_move(struct vcpu *v, unsigned int old_cpu,
> +                       unsigned int new_cpu)
>  {
>      /*
>       * Transfer urgency status to new CPU before switching CPUs, as
> @@ -479,6 +479,35 @@ static void vcpu_move(struct vcpu *v, unsigned int old_cpu,
>          v->processor = new_cpu;
>  }
>
> +static void vcpu_move(struct vcpu *v, unsigned int new_cpu)

Not quite a fan of the naming scheme here.  Perhaps vcpu_move and
vcpu_move_locked?

Actually -- looking at this again, is there a reason to pass old_cpu
into _vcpu_move?  It looks like old_cpu should always be equal to
v->processor.  That would make the function prototypes the same.

> +{
> +    unsigned long flags;
> +    unsigned int cpu = v->processor;
> +    spinlock_t *lock, *new_lock;
> +
> +    /*
> +     * When here, the vcpu should be ready for being moved. This means:
> +     *  - both its original and target processor must be quiet;
> +     *  - it must not be marked as currently running;
> +     *  - the proper flag must be set (i.e., no one must have had any
> +     *    chance to reset it).
> +     */
> +    ASSERT(is_idle_vcpu(curr_on_cpu(cpu)) &&
> +           is_idle_vcpu(curr_on_cpu(new_cpu)));
> +    ASSERT(!v->is_running && test_bit(_VPF_migrating, &v->pause_flags));
> +
> +    lock = per_cpu(schedule_data, v->processor).schedule_lock;
> +    new_lock = per_cpu(schedule_data, new_cpu).schedule_lock;
> +
> +    sched_spin_lock_double(lock, new_lock, &flags);
> +    ASSERT(new_cpu != v->processor);

Don't we need to do the "schedule lock dance" here as well --
double-check to make sure that v->processor hasn't changed since the
time we grabbed the lock?

> +    _vcpu_move(v, cpu, new_cpu);
> +    sched_spin_unlock_double(lock, new_lock, flags);
> +
> +    sched_move_irqs(v);
> +    vcpu_wake(v);
> +}
> +
>  static void vcpu_migrate(struct vcpu *v)
>  {
>      unsigned long flags;
> @@ -543,7 +572,7 @@ static void vcpu_migrate(struct vcpu *v)
>          return;
>      }
>
> -    vcpu_move(v, old_cpu, new_cpu);
> +    _vcpu_move(v, old_cpu, new_cpu);
>
>      sched_spin_unlock_double(old_lock, new_lock, flags);
>
> @@ -616,7 +645,8 @@ int cpu_disable_scheduler(unsigned int cpu)
>      struct vcpu *v;
>      struct cpupool *c;
>      cpumask_t online_affinity;
> -    int    ret = 0;
> +    unsigned int new_cpu;
> +    int ret = 0;
>
>      c = per_cpu(cpupool, cpu);
>      if ( c == NULL )
> @@ -645,25 +675,72 @@ int cpu_disable_scheduler(unsigned int cpu)
>                  cpumask_setall(v->cpu_hard_affinity);
>              }
>
> -            if ( v->processor == cpu )
> +            if ( v->processor != cpu )
>              {
> -                set_bit(_VPF_migrating, &v->pause_flags);
> +                /* This vcpu is not on cpu, so we can move on. */
>                  vcpu_schedule_unlock_irqrestore(lock, flags, v);
> -                vcpu_sleep_nosync(v);
> -                vcpu_migrate(v);
> +                continue;
>              }
> -            else
> -                vcpu_schedule_unlock_irqrestore(lock, flags, v);
>
>              /*
> -             * A vcpu active in the hypervisor will not be migratable.
> -             * The caller should try again after releasing and reaquiring
> -             * all locks.
> +             * If we're here, it means that the vcpu is on cpu. Let's see how
> +             * it's best to send it away, depending on whether we are shutting
> +             * down/suspending, or doing cpupool manipulations.
>               */
> -            if ( v->processor == cpu )
> -                ret = -EAGAIN;
> -        }
> +            set_bit(_VPF_migrating, &v->pause_flags);
> +            vcpu_schedule_unlock_irqrestore(lock, flags, v);
> +            vcpu_sleep_nosync(v);

I don't quite understand this.  By calling _nosync(), you're not
guaranteed that the vcpu has actually stopped running when you call
vcpu_move() down below; and yet inside vcpu_move(), you assert
!v->is_running.

So either at this point, when doing suspend, the vcpu has already been
paused; in which case this is unnecessary; or it hasn't been paused,
in which case we're potentially racing the IPI / deschedule, and will
trip over the ASSERT if we "win".

Did I miss something?  (I'm perfectly ready to believe that I have...)

 -George

> +
> +            /*
> +             * In case of shutdown/suspend, it is not necessary to ask the
> +             * scheduler to chime in. In fact:
> +             *  * there is no reason for it: the end result we are after is
> +             *    just 'all the vcpus on the boot pcpu, and no vcpu anywhere
> +             *    else', so let's just go for it;
> +             *  * it's wrong, when dealing with a cpupool with only non-boot pcpus,
> +             *    as the scheduler will always fail to send the vcpus away
> +             *    from the last online (non boot) pcpu!
> +             *
> +             * Therefore, in the shutdown/suspend case, let's just pick one
> +             * of the still online pcpus, and send everyone there. Ideally,
> +             * we would pick up the boot pcpu directly, but we don't know
> +             * which one it is.
> +             *
> +             * OTOH, if the system is still live, and we are here because of
> +             * cpupool manipulations:
> +             *  * it would be best to call the scheduler, as that would serve
> +             *    as a re-evaluation of the placement of the vcpu, taking into
> +             *    account the modified cpupool configuration;
> +             *  * the scheduler will always find a suitable solution, or
> +             *    things would have failed before getting in here.
> +             *
> +             * Therefore, in the cpupool manipulation case, let's just ask the
> +             * scheduler to do its job, via calling vcpu_migrate().
> +             */
> +            if ( unlikely(system_state == SYS_STATE_suspend) )
> +            {
> +                /*
> +                 * The boot pcpu is, usually, pcpu #0, so using cpumask_first()
> +                 * actually helps us to achieve our ultimate goal quicker.
> +                 */
> +                cpumask_andnot(&online_affinity, &cpu_online_map, cpumask_of(cpu));
> +                new_cpu = cpumask_first(&online_affinity);
> +                vcpu_move(v, new_cpu);
> +            }
> +            else
> +            {
> +                /*
> +                 * The only caveat, in this case, is that if the vcpu is active
> +                 * in the hypervisor, it won't be migratable. In this case,
> +                 * the caller should try again after releasing and reacquiring
> +                 * all locks.
> +                 */
> +                vcpu_migrate(v);
>
> +                if ( v->processor == cpu )
> +                    ret = -EAGAIN;
> +            }
> +        }
>          domain_update_node_affinity(d);
>      }
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel

