
Re: [Xen-devel] [PATCH 44/60] xen/sched: add fall back to idle vcpu when scheduling unit



On 03.07.2019 09:13, Juergen Gross wrote:
> On 02.07.19 17:01, Jan Beulich wrote:
>>>>> On 28.05.19 at 12:32, <jgross@xxxxxxxx> wrote:
>>> When running an idle vcpu in a non-idle scheduling unit use a specific
>>> guest idle loop not performing any tasklets, memory scrubbing and
>>> livepatching in order to avoid populating the cpu caches with memory
>>> used by other domains (as far as possible). Softirqs are considered to
>>> be safe (timers might want to be excluded, but this can be fine-tuned
>>> later).
>>
>> How could timers be legitimately excluded? And how are softirqs
>> (which similarly can't be excluded here) any less risky than e.g.
>> tasklets?
> 
> At least some timers are for other guests. I can drop mentioning
> timers.
> 
> Tasklets are sometimes used for deferred processing of guest specific
> actions, like continue_hypercall_on_cpu(). This is something we really
> don't want here.

Yet what softirqs act on you won't know either.

>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -159,6 +159,23 @@ static void idle_loop(void)
>>>       }
>>>   }
>>> +/*
>>> + * Idle loop for siblings of active schedule units.
>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>> + * livepatching.
>>> + */
>>> +static void guest_idle_loop(void)
>>> +{
>>> +    unsigned int cpu = smp_processor_id();
>>> +
>>> +    for ( ; ; )
>>> +    {
>>> +        if ( !softirq_pending(cpu) )
>>> +            sched_guest_idle(pm_idle, cpu);
>>> +        do_softirq();
>>> +    }
>>> +}
>>
>> In the comment I think you mean "siblings of <whatever> in
>> active schedule units"?
> 
> Is "siblings of cpus in guest mode" fine?

Sure. As would perhaps be "siblings in active schedule units".

>>> @@ -172,6 +189,10 @@ void startup_cpu_idle_loop(void)
>>>   static void noreturn continue_idle_domain(struct vcpu *v)
>>>   {
>>> +    /* Idle vcpus might be attached to non-idle units! */
>>> +    if ( !is_idle_domain(v->sched_unit->domain) )
>>> +        reset_stack_and_jump(guest_idle_loop);
>>> +
>>>       reset_stack_and_jump(idle_loop);
>>>   }
>>
>> You're aware that there's a call to check_for_livepatch_work() hidden
>> in reset_stack_and_jump(), which you say you don't want to allow in
>> this context?
> 
> Good point.
> 
> IMO it would be best to have a "no-livepatch" variant of
> reset_stack_and_jump().

That's what I too was thinking.
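The split being discussed could look roughly like this, sketched in plain C. The real reset_stack_and_jump() is an asm macro that resets the stack pointer before jumping; only the placement of the livepatch hook matters here, and all names besides check_for_livepatch_work() are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state so the sketch is observable; not part of Xen. */
static bool livepatch_checked;
static int jumps;

static void check_for_livepatch_work(void) { livepatch_checked = true; }
static void target_fn(void) { jumps++; }

/* Common part: optionally run the livepatch check, then "jump" to fn. */
#define switch_stack_and_jump(fn, check_lp) do {  \
    if ( check_lp )                               \
        check_for_livepatch_work();               \
    (fn)();                                       \
} while ( false )

/* Existing behaviour keeps the livepatch check ... */
#define reset_stack_and_jump(fn)       switch_stack_and_jump(fn, true)
/* ... while the guest idle path would use a "no livepatch" variant. */
#define reset_stack_and_jump_nolp(fn)  switch_stack_and_jump(fn, false)
```

With such a variant, continue_idle_domain() could use reset_stack_and_jump_nolp(guest_idle_loop) for non-idle units and keep reset_stack_and_jump(idle_loop) otherwise.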

>>> --- a/xen/common/schedule.c
>>> +++ b/xen/common/schedule.c
>>> @@ -82,7 +82,18 @@ static struct scheduler __read_mostly ops;
>>>   static inline struct vcpu *sched_unit2vcpu_cpu(struct sched_unit *unit,
>>>                                                  unsigned int cpu)
>>>   {
>>> -    return unit->domain->vcpu[unit->unit_id + per_cpu(sched_res_idx, cpu)];
>>> +    unsigned int idx = unit->unit_id + per_cpu(sched_res_idx, cpu);
>>> +    const struct domain *d = unit->domain;
>>> +    struct vcpu *v;
>>> +
>>> +    if ( idx < d->max_vcpus && d->vcpu[idx] )
>>> +    {
>>> +        v = d->vcpu[idx];
>>> +        if ( v->new_state == RUNSTATE_running )
>>> +            return v;
>>
>> Isn't this enough of the cache-fill half of a gadget to warrant use of
>> array_index_nospec() or the like?
> 
> The input data is in no way user controlled. Do we really want to add
> barriers before each array access?

array_index_nospec() does not involve any barriers. Indeed I'd have
been more hesitant to suggest a change here if that would involve
adding a barrier.

As to "in no way user controlled" - did you perhaps see the
discussion regarding speculative NULL derefs that Norbert and I had
on his grant table changes? If NULL can plausibly be found in
d->vcpu[idx] (after all you check for it), then a PV guest could
control what a speculative deref would produce as data, and hence
what might be used for further indirection.
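For reference, the technique being requested can be sketched in the spirit of array_index_mask_nospec(): a branchless clamp, no barrier involved (the point made above). This is a generic illustration, not the exact Xen implementation, and it assumes arithmetic right shift on signed longs, as the generic Linux/Xen versions do:

```c
#include <assert.h>

/*
 * Branchless mask: all ones when index < size, zero otherwise.
 * Assumes arithmetic right shift on signed longs.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return ~(long)(index | (size - 1 - index)) >> (sizeof(long) * 8 - 1);
}

/*
 * Clamp the index so that even a mispredicted bounds check can only
 * load element 0, never attacker-chosen out-of-bounds data.
 */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}
```

In sched_unit2vcpu_cpu() above this would amount to reading d->vcpu[array_index_nospec(idx, d->max_vcpus)] instead of d->vcpu[idx].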

>>> @@ -209,19 +223,11 @@ static inline void vcpu_runstate_change(
>>>       v->runstate.state = new_state;
>>>   }
>>> -static inline void sched_unit_runstate_change(struct sched_unit *unit,
>>> -    bool running, s_time_t new_entry_time)
>>> +void sched_guest_idle(void (*idle) (void), unsigned int cpu)
>>>   {
>>> -    struct vcpu *v;
>>> -
>>> -    for_each_sched_unit_vcpu ( unit, v )
>>> -        if ( running )
>>> -            vcpu_runstate_change(v, v->new_state, new_entry_time);
>>> -        else
>>> -            vcpu_runstate_change(v,
>>> -                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
>>> -                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
>>> -                new_entry_time);
>>> +    atomic_inc(&get_sched_res(cpu)->urgent_count);
>>> +    idle();
>>> +    atomic_dec(&get_sched_res(cpu)->urgent_count);
>>>   }
>>
>> What is "urgent" about an idle vCPU filling an empty sched unit slot?
>> That is, why do you need to prevent the thread from sleeping as
>> power efficiently as possible (potentially allowing the sibling thread
>> to even use more resources)?
> 
> The deeper the thread is sleeping the longer it will take to wake it up
> for synchronized context switching. I'd like to avoid additional
> latencies.

Whether to trade latencies for power savings imo is a decision to be made
by an admin, not us. Use of C-states incurs higher latencies elsewhere as
well, so if they're deemed a problem, use of deeper C-states could be
disallowed via already available command line options.
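The pattern under discussion, stripped to its essentials: hold an atomic "urgent" count across the idle call so cpuidle governors avoid deep C-states and the sibling wakes quickly for synchronized context switches. A minimal sketch with stand-in types (the stub struct and probe are illustrative, not Xen's real layout):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-in for the per-CPU struct sched_res of the series. */
static struct {
    atomic_int urgent_count;  /* nonzero: cpuidle should avoid deep C-states */
} sched_res_stub;

static int count_during_idle;

/* Stand-in idle routine recording the count it observes. */
static void probe_idle(void)
{
    count_during_idle = atomic_load(&sched_res_stub.urgent_count);
}

/*
 * Shape of sched_guest_idle(): the count is held exactly for the
 * duration of the idle call -- the latency/power trade-off at issue.
 */
static void sched_guest_idle_sketch(void (*idle)(void))
{
    atomic_fetch_add(&sched_res_stub.urgent_count, 1);
    idle();
    atomic_fetch_sub(&sched_res_stub.urgent_count, 1);
}
```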

>>> @@ -1719,25 +1760,25 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
>>>       if ( prev->next_time >= 0 ) /* -ve means no limit */
>>>           set_timer(&sd->s_timer, now + prev->next_time);
>>> -    if ( likely(prev != next) )
>>> -        sched_switch_units(sd, next, prev, now);
>>> +    sched_switch_units(sd, next, prev, now);
>>>       return next;
>>>   }
>>> -static void context_saved(struct vcpu *prev)
>>> +static void context_saved(struct sched_unit *unit)
>>>   {
>>> -    struct sched_unit *unit = prev->sched_unit;
>>> -
>>>       unit->is_running = 0;
>>>       unit->state_entry_time = NOW();
>>> +    get_sched_res(smp_processor_id())->prev = NULL;
>>>       /* Check for migration request /after/ clearing running flag. */
>>>       smp_mb();
>>> -    sched_context_saved(vcpu_scheduler(prev), unit);
>>> +    sched_context_saved(vcpu_scheduler(unit->vcpu), unit);
>>
>> An example of it unlikely being just one of the vCPU-s in a unit that
>> you actually want to deal with.
> 
> No. All vcpus of a unit share the same scheduler.

Oh, right.

> OTOH I think there is some room for tuning here: vcpu_scheduler() does
> quite some work to find the correct struct scheduler. Replacing it with
> unit_scheduler() might be a good idea.

Yes please, if such is available; otherwise I'd have suggested to have
it available. If all vCPU-s of a unit share the same scheduler (which is
to be expected, i.e. I should have paid more attention to the further
context above), why would vcpu_scheduler() be any more expensive than
unit_scheduler() anyway? It ought to simply be unit_scheduler(v->unit).
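The suggested shape, with minimal stand-in types (field names here are illustrative, not the exact Xen layout): since all vcpus of a unit share one scheduler, vcpu_scheduler() reduces to a single pointer chase through the unit.

```c
#include <assert.h>

struct scheduler { const char *name; };

struct sched_unit {
    struct scheduler *scheduler;  /* assumed cached per unit */
};

struct vcpu {
    struct sched_unit *sched_unit;
};

static struct scheduler *unit_scheduler(const struct sched_unit *unit)
{
    return unit->scheduler;
}

/* Cheap: one pointer chase, no per-vcpu lookup work. */
static struct scheduler *vcpu_scheduler(const struct vcpu *v)
{
    return unit_scheduler(v->sched_unit);
}
```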

Jan

 

