
Re: [Xen-devel] [PATCH v4] xen: sched: convert RTDS from time to event driven model



On Tue, 2016-02-02 at 13:09 -0500, Tianyang Chen wrote:
> On 2/2/2016 10:08 AM, Dario Faggioli wrote:
> > On Sun, 2016-01-31 at 23:32 -0500, Tianyang Chen wrote:
> > >
> > > +    struct rt_private *prv = rt_priv(ops);
> > > +    struct list_head *replq = rt_replq(ops);
> > > +    struct list_head *iter;
> > > +
> > > +    ASSERT( spin_is_locked(&prv->lock) );
> > > +
> > Is this really useful? Considering where this is called, I don't
> > think
> > it is.
> > 
> 
> This basically comes from __q_insert(). 
>
I see. Well it's still not necessary, IMO, so don't add it. At some
point, we may want to remove it from __runq_insert() too.

The bottom line is: prv->lock is, currently, the runqueue lock (it's
the lock for everything! :-D). It is pretty obvious that you need the
runq lock to manipulate the runqueue.

It is not the runqueue that you are manipulating here, it is the
replenishment queue, but as a matter of fact, prv->lock serializes that
too. So, if anything, rather than an assert like this one, it would be
more useful to add a comment somewhere, noting (reinforcing, I should
probably say) the point that this new replenishment queue is also
serialized by prv->lock.

> I have been searching for the
> definitions of vcpu_schedule_lock_irq() and pcpu_schedule_lock_irq().
> Do
> you know where they are defined? I did some name search in the
> entire
> xen directory but still nothing.
> 
They are auto-generated. Look in xen/include/xen/sched-if.h.

> > > +    ASSERT( !__vcpu_on_p(svc) );
> > > +
> > > +    list_for_each(iter, replq)
> > > +    {
> > > +        struct rt_vcpu * iter_svc = __p_elem(iter);
> > > +        if ( svc->cur_deadline <= iter_svc->cur_deadline )
> > > +                break;
> > > +    }
> > > +
> > > +    list_add_tail(&svc->p_elem, iter);
> > > +}
> > This looks ok. The list_for_each()+list_add_tail() part would be
> > basically identical to what we have inside __runq_insert(),
> > though,
> > wouldn't it?
> > 
> > It may make sense to define an _ordered_queue_insert() (or
> > _deadline_queue_insert(), since it's always the deadline that is
> > compared) utility function that we can then call in both cases.
> > 
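To make the idea concrete, here is a rough, self-contained sketch of such a common helper. The name deadline_queue_insert() follows the suggestion above, but the exact signature is an assumption, and a minimal list implementation stands in for Xen's xen/list.h:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for Xen's struct list_head (xen/list.h). */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

/* Insert n immediately before pos (list_add_tail() semantics). */
static void list_add_tail(struct list_head *n, struct list_head *pos)
{
    n->prev = pos->prev;
    n->next = pos;
    pos->prev->next = n;
    pos->prev = n;
}

/* Toy rt_vcpu: only the fields this sketch needs. */
struct rt_vcpu {
    long cur_deadline;
    struct list_head q_elem;   /* hook into runq or replq */
};

#define q_entry(ptr) \
    ((struct rt_vcpu *)((char *)(ptr) - offsetof(struct rt_vcpu, q_elem)))

/*
 * Walk the queue until we find an element with a later deadline, then
 * insert before it (i.e. keep the queue sorted by cur_deadline). Both
 * __runq_insert() and __replq_insert() could call this.
 */
static void deadline_queue_insert(struct list_head *queue, struct rt_vcpu *svc)
{
    struct list_head *iter;

    for ( iter = queue->next; iter != queue; iter = iter->next )
        if ( svc->cur_deadline <= q_entry(iter)->cur_deadline )
            break;

    list_add_tail(&svc->q_elem, iter);
}
```

In the actual patch, of course, this would use the real list_head primitives and the q_elem/p_elem hooks of struct rt_vcpu.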
> > > @@ -449,6 +507,7 @@ rt_init(struct scheduler *ops)
> > >      INIT_LIST_HEAD(&prv->sdom);
> > >      INIT_LIST_HEAD(&prv->runq);
> > >      INIT_LIST_HEAD(&prv->depletedq);
> > > +    INIT_LIST_HEAD(&prv->replq);
> > > 
> > Is there a reason why you don't do at least the allocation of the
> > timer
> > here? Because, to me, this looks the best place for that.
> > 
> > >      cpumask_clear(&prv->tickled);
> > > 
> > > @@ -473,6 +532,9 @@ rt_deinit(const struct scheduler *ops)
> > >          xfree(_cpumask_scratch);
> > >          _cpumask_scratch = NULL;
> > >      }
> > > +
> > > +    kill_timer(prv->repl_timer);
> > > +
> > >      xfree(prv);
> > > 
> > And here, you're freeing prv without having freed prv->repl_timer,
> > which is
> > therefore leaked.
> > 
> > > @@ -632,6 +699,17 @@ rt_vcpu_insert(const struct scheduler *ops,
> > > struct vcpu *vc)
> > >      vcpu_schedule_unlock_irq(lock, vc);
> > > 
> > >      SCHED_STAT_CRANK(vcpu_insert);
> > > +
> > > +    if( prv->repl_timer == NULL )
> > > +    {
> > > +        /* vc->processor has been set in schedule.c */
> > > +        cpu = vc->processor;
> > > +
> > > +        repl_timer = xzalloc(struct timer);
> > > +
> > > +        prv->repl_timer = repl_timer;
> > > +        init_timer(repl_timer, repl_handler, (void *)ops, cpu);
> > > +    }
> > >  }
> > > 
> > This belongs in rt_init(), IMO.
> > 
> 
> When a new pool is created, the corresponding scheduler is also
> initialized. But there is no pcpu assigned to that pool. 
>
Sure, and in fact, above, when commenting on rt_init() I said "Is there
a reason why you don't do at least the allocation of the timer here?"

But you're right, here I put it like all should belong to rt_init(), I
should have been more clear.

> If init_timer()
> happens in rt_init, how does it know which cpu to pick?
> When move_domain() is called, there must be some pcpus in the
> destination pool and then vcpu_insert happens.
> 
Allocating memory for the rt_priv copy of this particular scheduler
instance definitely belongs in rt_init().

Then, in this case, we could do the allocation in rt_init() and do the
initialization later, perhaps here, or the first time that the timer
needs to be started, or when the first pCPU is assigned to this
scheduler (by being assigned to the cpupool this scheduler services).

I honestly prefer the latter. That is to say, in rt_alloc_pdata(), you
check whether prv->repl_timer is NULL, and you allocate and initialize
it for the pCPU being brought up. I also would put an explicit
'prv->repl_timer = NULL' in rt_init(), with a comment that proper
allocation and initialization will happen later, and the reason why
that is the case.

What do you think?
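Just to illustrate the split I have in mind, here is a compilable toy sketch of the pattern. The stub types stand in for Xen's struct timer, xzalloc() and init_timer(); function shapes here are simplified assumptions, not the actual RTDS interfaces:

```c
#include <assert.h>
#include <stdlib.h>

/* Stubs standing in for Xen types/APIs, just to show the pattern. */
struct timer { int cpu; };
struct rt_private { struct timer *repl_timer; };

static struct timer *xzalloc_timer(void)
{
    return calloc(1, sizeof(struct timer));   /* xzalloc(struct timer) */
}

/* rt_init(): allocate the scheduler private data, but NOT the timer... */
static struct rt_private *rt_init(void)
{
    struct rt_private *prv = calloc(1, sizeof(*prv));

    /*
     * ...because no pCPU belongs to this scheduler yet, so we don't
     * know where the timer should run. Proper allocation and
     * initialization happen when the first pCPU is brought up, in
     * rt_alloc_pdata().
     */
    prv->repl_timer = NULL;
    return prv;
}

/* rt_alloc_pdata(): lazily set up the timer on the first pCPU. */
static void rt_alloc_pdata(struct rt_private *prv, int cpu)
{
    if ( prv->repl_timer == NULL )
    {
        prv->repl_timer = xzalloc_timer();
        prv->repl_timer->cpu = cpu;   /* init_timer(..., cpu) in Xen */
    }
}
```

The timer thus gets bound to the first pCPU assigned to the cpupool this scheduler services, and later pCPUs leave it alone.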

> > > @@ -841,7 +881,6 @@ rt_schedule(const struct scheduler *ops,
> > > s_time_t
> > > now, bool_t tasklet_work_sched
> > >      /* burn_budget would return for IDLE VCPU */
> > >      burn_budget(ops, scurr, now);
> > > 
> > > -    __repl_update(ops, now);
> > > 
> > __repl_update() going away is of course fine. But we need to enact
> > the
> > budget enforcement logic somewhere in here, don't we?
> > 
> 
> Sorry about that, I meant to add it in this version.
> 
More on this later. But then I'm curious: with this patch applied, if
you set b=400/p=1000 for a domain, was the 40% utilization being
enforced properly?

> > > @@ -1058,25 +1103,39 @@ rt_vcpu_wake(const struct scheduler *ops,
> > > struct vcpu *vc)
> > >      if ( unlikely(svc->flags & RTDS_scheduled) )
> > >      {
> > >          set_bit(__RTDS_delayed_runq_add, &svc->flags);
> > > -        return;
> > > +        goto out;
> > >      }
> > > 
> > In this case, I'm less sure. I think this should just return as
> > well,
> > but it may depend on how other stuff are done (like budget
> > enforcement), so it's hard to think at it right now, but consider
> > this
> > when working on next version.
> > 
> 
> OK. For the second case, I guess there is nothing preventing a vcpu
> that
> sleeps for a long time on runq to be picked before other vcpus.
>
By "the second case" you mean this one about __RTDS_delayed_runq_add?
If yes, I'm not following what you mean with "a vcpu sleeps for a long
time on runq etc."

This is what we are talking about here. A vCPU is selected to be
preempted, during rt_schedule(), but it is still runnable, so we need
to put it back to the runqueue. You can't just do that (put it back in
the runqueue), because, in RTDS, like in Credit2, the runqueue is
shared between multiple pCPUs (see commit be9a0806e197f4fd3fa6).

So what we do is raise the *_delayed_runq_add flag, continue, and
leave the vcpu alone. This happens, for instance, when there is a race
between scheduling out a vcpu that is going to sleep, and the vcpu in
question being woken back up already! It's not frequent, but we need to
take care of that. At some point (sooner rather than later, most
likely) the context switch will actually happen, and the vcpu that is
waking up is put in the runqueue.

> And also, repl_handler removes vcpus that are not runnable and if
> they
> are not added back here, where should we start tracking them? This
> goes
> back to if it's necessary to remove a un-runnable vcpu.
> 
I also don't get what you mean by "tracking" (here and in other
places).

Anyway, what I meant when I said that what to do in the "delayed add"
case depends on other things is, for instance, this: the vcpu is waking
up right now, but it is going to be inserted in the runqueue later.
When (as in "where in the code", "in what function") do you want to
program the timer: here in rt_vcpu_wake(), or in __replq_insert(), as
said somewhere else, or in rt_context_saved(), where the vcpu is
actually added to the runq? This kind of thing...


All this being said, taking a bit of a step back, and considering, not
only the replenishment event and/or timer (re)programming issue, but
also the deadline that needs updating, I think you actually need to
find a way to check for both, in this patch.

In fact, if we aim at simplifying rt_context_saved() too --and we
certainly are-- in this case where a vcpu is actually being woken up,
but is not being put back in the runqueue until context switch, when is
it that we check whether its deadline has passed, and hence refresh its
scheduling parameters?


So, big recap, I think a possible variant of rt_vcpu_wake() would look
as follows:

 - if vcpu is running or in a q     ==> /* nothing. */

 - else if vcpu is delayed_runq_add ==> if (now >= deadline?) {
                                            update_deadline();
                                            if ( __vcpu_on_p(svc) )
                                                __replq_remove();
                                        }
                                        __replq_insert();

 - else                             ==> if (now >= deadline?) {
                                            update_deadline();
                                            if ( __vcpu_on_p(svc) )
                                                __replq_remove();
                                        }
                                        __runq_insert();
                                        __replq_insert();
                                        runq_tickle();
This would mean something like this:

static void
rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
{
    struct rt_vcpu * const svc = rt_vcpu(vc);
    s_time_t now = NOW();
    struct rt_private *prv = rt_priv(ops);

    BUG_ON( is_idle_vcpu(vc) );

    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
    {
        SCHED_STAT_CRANK(vcpu_wake_running);
        return;
    }

    /* on RunQ/DepletedQ, just update info is ok */
    if ( unlikely(__vcpu_on_q(svc)) )
    {
        SCHED_STAT_CRANK(vcpu_wake_onrunq);
        return;
    }

    if ( likely(vcpu_runnable(vc)) )
        SCHED_STAT_CRANK(vcpu_wake_runnable);
    else
        SCHED_STAT_CRANK(vcpu_wake_not_runnable);

    if ( now >= svc->cur_deadline )
    {
        rt_update_deadline(now, svc);
        __replq_remove(svc);
    }

    __replq_insert(svc);

    /*
     * If context hasn't been saved for this vcpu yet, we can't put it on
     * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
     * put on the Runqueue/DepletedQ after the context has been saved.
     */
    if ( unlikely(svc->flags & RTDS_scheduled) )
    {
        set_bit(__RTDS_delayed_runq_add, &svc->flags);
        return;
    }

    /* insert svc to runq/depletedq because svc is not in queue now */
    __runq_insert(ops, svc);
    runq_tickle(ops, svc);
}

And, at this point, I'm starting thinking again that we want to call
runq_tickle() in rt_context_saved(), at least in the case where
__RTDS_delayed_runq_add was set, as, if we don't, we miss tickling
for the vcpu being inserted because of that.
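In other words, the tail of rt_context_saved() could look something along these lines. This is a toy, compilable sketch only: the flag name follows the patch, but the stub helpers and the tickle-on-delayed-add part are just the suggestion above, not existing code:

```c
#include <assert.h>

#define RTDS_delayed_runq_add  (1 << 0)

struct rt_vcpu { unsigned flags; int on_q; };

static int tickles;  /* counts runq_tickle() calls, for this sketch */

static void runq_insert(struct rt_vcpu *svc) { svc->on_q = 1; }
static void runq_tickle(struct rt_vcpu *svc) { (void)svc; tickles++; }
static int vcpu_runnable(struct rt_vcpu *svc) { (void)svc; return 1; }

/*
 * Sketch of the tail of rt_context_saved(): if the wake-up was
 * deferred (delayed_runq_add), do the queue insertion now, and also
 * tickle, since rt_vcpu_wake() skipped that.
 */
static void rt_context_saved(struct rt_vcpu *svc)
{
    if ( (svc->flags & RTDS_delayed_runq_add) && vcpu_runnable(svc) )
    {
        svc->flags &= ~RTDS_delayed_runq_add;
        runq_insert(svc);
        runq_tickle(svc);   /* don't miss the tickle for this vcpu */
    }
}
```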

Of course, all the above, taken with a little bit of salt... i.e., I do
not expect to have got all the details right, especially in the code,
but it should at least be effective in showing the idea.

Thoughts?


> > > +            /* when the replenishment happens
> > > +             * svc is either on a pcpu or on
> > > +             * runq/depletedq
> > > +             */
> > > +            if( __vcpu_on_q(svc) )
> > > +            {
> > > +                /* put back to runq */
> > > +                __q_remove(svc);
> > > +                __runq_insert(ops, svc);
> > > +                runq_tickle(ops, svc);
> > > +            }
> > > +
> > Aha! So, it looks the budget enforcement logic ended up here?
> > Mmm...
> > no, I think that doing it in rt_schedule(), as we agreed, would
> > make
> > things simpler and better looking, both here and there.
> > 
> > This is the replenishment timer handler, and it really should do
> > one
> > thing: handle replenishment events. That's it! ;-P
> > 
> 
> Oh I might be misunderstanding what should be put in rt_schedule().
> Was
> it specifically for handling a vcpu that shouldn't be taken off a
> core?
> Why are you referring this piece of code as budget enforcement?
> 
> Isn't it necessary to modify the runq here after budget
> replenishment
> has happened? If runq goes out of order here, will rt_schedule() be
> sorting it?
> 
Oops, yes, you are right, this is not budget enforcement. So, budget
enforcement looks still to be missing from this patch (and that is why
I asked whether it is actually working).

And this chunk of code (and perhaps this whole function) is probably
ok, it's just a bit hard to understand what it does...


So, do you mind if we take a step back again, and analyze the
situation? When the timer fires, and this handler is executed and a
replenishment event taken off the replenishment queue, the following
situations are possible:

 1) the replenishment is for a running vcpu
 2) the replenishment is for a runnable vcpu in the runq
 3) the replenishment is for a runnable vcpu in the depletedq
 4) the replenishment is for a sleeping vcpu

So, 4) is never going to happen, right? As we're stopping the timer
when a vcpu blocks. If we go for this, let's put an ASSERT() and a
comment as a guard for that, at least if we do the optimization in
this very patch. If we leave it for another patch, we need to handle
this case, I guess.

In all 1), 2) and 3) cases, we need to remove the replenishment event
from the replenishment queue, so just (in the code) do it upfront.

What other actions need to happen in each case, 1), 2) and 3)? Can you
summarize them in here, so we can decide how to handle each situation?

I'd say that, in both case 2) and case 3), the vcpu needs to be taken
out of the queue where it is, and (re-)inserted in the runqueue, and
then we need to tickle... Is that correct?

I'd also say that, in case 1), we also need to tickle, because the
deadline may now have become bigger than that of some other vcpu that
is in the runqueue?

Also, under what conditions we need, as soon as replenishment for vcpu
X happened, to queue another replenishment for the same vcpu, at its
next deadline? Always, I would say (unless the vcpu is sleeping,
because of the optimization you introduced)?
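To make the case analysis concrete, here is a compilable toy simulation of one replenishment event, following the 1)/2)/3) breakdown above, with an assert guarding case 4). All the types and helpers are stand-ins, and the exact replenishment arithmetic is the usual deferrable-server style assumption, not the literal patch code:

```c
#include <assert.h>

enum vcpu_state { VCPU_RUNNING, VCPU_ON_RUNQ, VCPU_ON_DEPLETEDQ, VCPU_SLEEPING };

struct rt_vcpu {
    enum vcpu_state state;
    long cur_deadline, period, budget_left, budget;
};

static int tickles;  /* counts runq_tickle() calls, for this sketch */

static void runq_tickle(struct rt_vcpu *svc) { (void)svc; tickles++; }

/*
 * Handle one replenishment event, per the case analysis:
 * 1) running, 2) on runq, 3) on depletedq; case 4) (sleeping) is
 * assumed impossible because the timer was stopped at block time.
 */
static void do_replenishment(struct rt_vcpu *svc)
{
    assert(svc->state != VCPU_SLEEPING);  /* the ASSERT() guard for 4) */

    /* The replenishment itself: new budget, deadline pushed forward. */
    svc->budget_left = svc->budget;
    svc->cur_deadline += svc->period;

    /* Cases 2) and 3): re-insert in the runq, sorted by new deadline. */
    if ( svc->state == VCPU_ON_RUNQ || svc->state == VCPU_ON_DEPLETEDQ )
        svc->state = VCPU_ON_RUNQ;

    /*
     * In all of 1), 2), 3) we tickle: even for a running vcpu, its
     * deadline just grew, so a queued vcpu may now deserve its pCPU.
     */
    runq_tickle(svc);
}
```

After each event, a new replenishment for the same vcpu at its next deadline would be queued, as discussed.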

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

