
Re: [Xen-devel] [PATCH 44/60] xen/sched: add fall back to idle vcpu when scheduling unit



On 03.07.2019 09:13, Juergen Gross wrote:
> On 02.07.19 17:01, Jan Beulich wrote:
>>>>> On 28.05.19 at 12:32, <jgross@xxxxxxxx> wrote:
>>> When running an idle vcpu in a non-idle scheduling unit use a specific
>>> guest idle loop not performing any tasklets, memory scrubbing and
>>> livepatching in order to avoid populating the cpu caches with memory
>>> used by other domains (as far as possible). Softirqs are considered to
>>> be safe (timers might want to be excluded, but this can be fine-tuned
>>> later).
>>
>> How could timers be legitimately excluded? And how are softirqs
>> (which similarly can't be excluded here) any less risky than e.g.
>> tasklets?
> 
> At least some timers are for other guests. I can drop mentioning
> timers.
> 
> Tasklets are sometimes used for deferred processing of guest specific
> actions, like continue_hypercall_on_cpu(). This is something we really
> don't want here.

Yet what softirqs act on you won't know either.

>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -159,6 +159,23 @@ static void idle_loop(void)
>>>       }
>>>   }
>>> +/*
>>> + * Idle loop for siblings of active schedule units.
>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>> + * livepatching.
>>> + */
>>> +static void guest_idle_loop(void)
>>> +{
>>> +    unsigned int cpu = smp_processor_id();
>>> +
>>> +    for ( ; ; )
>>> +    {
>>> +        if ( !softirq_pending(cpu) )
>>> +            sched_guest_idle(pm_idle, cpu);
>>> +        do_softirq();
>>> +    }
>>> +}
>>
>> In the comment I think you mean "siblings of <whatever> in
>> active schedule units"?
> 
> Is "siblings of cpus in guest mode" fine?

Sure. As would perhaps be "siblings in active schedule units".

>>> @@ -172,6 +189,10 @@ void startup_cpu_idle_loop(void)
>>>   static void noreturn continue_idle_domain(struct vcpu *v)
>>>   {
>>> +    /* Idle vcpus might be attached to non-idle units! */
>>> +    if ( !is_idle_domain(v->sched_unit->domain) )
>>> +        reset_stack_and_jump(guest_idle_loop);
>>> +
>>>       reset_stack_and_jump(idle_loop);
>>>   }
>>
>> You're aware that there's a call to check_for_livepatch_work() hidden
>> in reset_stack_and_jump(), which you say you don't want to allow in
>> this context?
> 
> Good point.
> 
> IMO it would be best to have a "no-livepatch" variant of
> reset_stack_and_jump().

That's what I too was thinking.
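The split being discussed could look roughly like this, sketched in plain C. The real reset_stack_and_jump() is an asm macro that resets the stack pointer before jumping; only the placement of the livepatch hook matters here, and all names besides check_for_livepatch_work() are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state so the sketch is observable; not part of Xen. */
static bool livepatch_checked;
static int jumps;

static void check_for_livepatch_work(void) { livepatch_checked = true; }
static void target_fn(void) { jumps++; }

/* Common part: optionally run the livepatch check, then "jump" to fn. */
#define switch_stack_and_jump(fn, check_lp) do {  \
    if ( check_lp )                               \
        check_for_livepatch_work();               \
    (fn)();                                       \
} while ( false )

/* Existing behaviour keeps the livepatch check ... */
#define reset_stack_and_jump(fn)       switch_stack_and_jump(fn, true)
/* ... while the guest idle path would use a "no livepatch" variant. */
#define reset_stack_and_jump_nolp(fn)  switch_stack_and_jump(fn, false)
```

With such a variant, continue_idle_domain() could use reset_stack_and_jump_nolp(guest_idle_loop) for non-idle units and keep reset_stack_and_jump(idle_loop) otherwise.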

>>> --- a/xen/common/schedule.c
>>> +++ b/xen/common/schedule.c
>>> @@ -82,7 +82,18 @@ static struct scheduler __read_mostly ops;
>>>   static inline struct vcpu *sched_unit2vcpu_cpu(struct sched_unit *unit,
>>>                                                  unsigned int cpu)
>>>   {
>>> -    return unit->domain->vcpu[unit->unit_id + per_cpu(sched_res_idx, cpu)];
>>> +    unsigned int idx = unit->unit_id + per_cpu(sched_res_idx, cpu);
>>> +    const struct domain *d = unit->domain;
>>> +    struct vcpu *v;
>>> +
>>> +    if ( idx < d->max_vcpus && d->vcpu[idx] )
>>> +    {
>>> +        v = d->vcpu[idx];
>>> +        if ( v->new_state == RUNSTATE_running )
>>> +            return v;
>>
>> Isn't this enough of the cache-fill half of a gadget to warrant use of
>> array_index_nospec() or the like?
> 
> The input data is in no way user controlled. Do we really want to add
> barriers before each array access?

array_index_nospec() does not involve any barriers. Indeed I'd have
been more hesitant to suggest a change here if that would involve
adding a barrier.

As to "in no way user controlled" - did you perhaps see the
discussion regarding speculative NULL derefs that Norbert and I had
on his grant table changes? If NULL can plausibly be found in
d->vcpu[idx] (after all you check for it), then a PV guest could
control what a speculative deref would produce as data, and hence
what might be used for further indirection.
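For reference, the technique being requested can be sketched in the spirit of array_index_mask_nospec(): a branchless clamp, no barrier involved (the point made above). This is a generic illustration, not the exact Xen implementation, and it assumes arithmetic right shift on signed longs, as the generic Linux/Xen versions do:

```c
#include <assert.h>

/*
 * Branchless mask: all ones when index < size, zero otherwise.
 * Assumes arithmetic right shift on signed longs.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return ~(long)(index | (size - 1 - index)) >> (sizeof(long) * 8 - 1);
}

/*
 * Clamp the index so that even a mispredicted bounds check can only
 * load element 0, never attacker-chosen out-of-bounds data.
 */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}
```

In sched_unit2vcpu_cpu() above this would amount to reading d->vcpu[array_index_nospec(idx, d->max_vcpus)] instead of d->vcpu[idx].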

>>> @@ -209,19 +223,11 @@ static inline void vcpu_runstate_change(
>>>       v->runstate.state = new_state;
>>>   }
>>> -static inline void sched_unit_runstate_change(struct sched_unit *unit,
>>> -    bool running, s_time_t new_entry_time)
>>> +void sched_guest_idle(void (*idle) (void), unsigned int cpu)
>>>   {
>>> -    struct vcpu *v;
>>> -
>>> -    for_each_sched_unit_vcpu ( unit, v )
>>> -        if ( running )
>>> -            vcpu_runstate_change(v, v->new_state, new_entry_time);
>>> -        else
>>> -            vcpu_runstate_change(v,
>>> -                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
>>> -                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
>>> -                new_entry_time);
>>> +    atomic_inc(&get_sched_res(cpu)->urgent_count);
>>> +    idle();
>>> +    atomic_dec(&get_sched_res(cpu)->urgent_count);
>>>   }
>>
>> What is "urgent" about an idle vCPU filling an empty sched unit slot?
>> That is, why do you need to prevent the thread from sleeping as
>> power efficiently as possible (potentially allowing the sibling thread
>> to even use more resources)?
> 
> The deeper the thread is sleeping the longer it will take to wake it up
> for synchronized context switching. I'd like to avoid additional
> latencies.

Whether to trade latencies for power savings imo is a decision to be made
by an admin, not us. Use of C-states incurs higher latencies elsewhere as
well, so if they're deemed a problem, use of deeper C-states could be
disallowed via already available command line options.
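The pattern under discussion, stripped to its essentials: hold an atomic "urgent" count across the idle call so cpuidle governors avoid deep C-states and the sibling wakes quickly for synchronized context switches. A minimal sketch with stand-in types (the stub struct and probe are illustrative, not Xen's real layout):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-in for the per-CPU struct sched_res of the series. */
static struct {
    atomic_int urgent_count;  /* nonzero: cpuidle should avoid deep C-states */
} sched_res_stub;

static int count_during_idle;

/* Stand-in idle routine recording the count it observes. */
static void probe_idle(void)
{
    count_during_idle = atomic_load(&sched_res_stub.urgent_count);
}

/*
 * Shape of sched_guest_idle(): the count is held exactly for the
 * duration of the idle call -- the latency/power trade-off at issue.
 */
static void sched_guest_idle_sketch(void (*idle)(void))
{
    atomic_fetch_add(&sched_res_stub.urgent_count, 1);
    idle();
    atomic_fetch_sub(&sched_res_stub.urgent_count, 1);
}
```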

>>> @@ -1719,25 +1760,25 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
>>>       if ( prev->next_time >= 0 ) /* -ve means no limit */
>>>           set_timer(&sd->s_timer, now + prev->next_time);
>>> -    if ( likely(prev != next) )
>>> -        sched_switch_units(sd, next, prev, now);
>>> +    sched_switch_units(sd, next, prev, now);
>>>       return next;
>>>   }
>>> -static void context_saved(struct vcpu *prev)
>>> +static void context_saved(struct sched_unit *unit)
>>>   {
>>> -    struct sched_unit *unit = prev->sched_unit;
>>> -
>>>       unit->is_running = 0;
>>>       unit->state_entry_time = NOW();
>>> +    get_sched_res(smp_processor_id())->prev = NULL;
>>>       /* Check for migration request /after/ clearing running flag. */
>>>       smp_mb();
>>> -    sched_context_saved(vcpu_scheduler(prev), unit);
>>> +    sched_context_saved(vcpu_scheduler(unit->vcpu), unit);
>>
>> An example of it unlikely being just one of the vCPU-s in a unit that
>> you actually want to deal with.
> 
> No. All vcpus of a unit share the same scheduler.

Oh, right.

> OTOH I think there is some room for tuning here: vcpu_scheduler() does
> quite some work to find the correct struct scheduler. Replacing it with
> unit_scheduler() might be a good idea.

Yes please, if such is available; otherwise I'd have suggested to have
it available. If all vCPU-s of a unit share the same scheduler (which is
to be expected, i.e. I should have paid more attention to the further
context above), why would vcpu_scheduler() be any more expensive than
unit_scheduler() anyway? It ought to simply be unit_scheduler(v->unit).
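The suggested shape, with minimal stand-in types (field names here are illustrative, not the exact Xen layout): since all vcpus of a unit share one scheduler, vcpu_scheduler() reduces to a single pointer chase through the unit.

```c
#include <assert.h>

struct scheduler { const char *name; };

struct sched_unit {
    struct scheduler *scheduler;  /* assumed cached per unit */
};

struct vcpu {
    struct sched_unit *sched_unit;
};

static struct scheduler *unit_scheduler(const struct sched_unit *unit)
{
    return unit->scheduler;
}

/* Cheap: one pointer chase, no per-vcpu lookup work. */
static struct scheduler *vcpu_scheduler(const struct vcpu *v)
{
    return unit_scheduler(v->sched_unit);
}
```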

Jan

 

