[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH V2 1/1] Improved RTDS scheduler



On Mon, 2016-01-25 at 18:00 -0500, Meng Xu wrote:
> On Mon, Jan 25, 2016 at 5:04 PM, Tianyang Chen <tiche@xxxxxxxxxxxxxx>
> wrote:
> > I have removed some of the Ccs so they won't get bothered as we
> > discussed
> > previously.
> > 
> > On 1/25/2016 4:00 AM, Dario Faggioli wrote:
> > > 
> > > On Thu, 2015-12-31 at 05:20 -0500, Tianyang Chen wrote:
> > > > 
> > > > 
> > > > @@ -147,6 +148,16 @@ static unsigned int nr_rt_ops;
> > > > ÂÂÂ* Global lock is referenced by schedule_data.schedule_lock
> > > > from all
> > > > ÂÂÂ* physical cpus. It can be grabbed via
> > > > vcpu_schedule_lock_irq()
> > > > ÂÂÂ*/
> > > > +
> > > > +/* dedicated timer for replenishment */
> > > > +static struct timer repl_timer;
> > > > +
> > > 
> > > So, there's always only one timer... Even if we have multiple
> > > cpupool
> > > with RTDS as their scheduler, they share the replenishment timer?
> > > I
> > > think it makes more sense to make this per-scheduler.
> > > 
> > Yeah, I totally ignored the case for cpu-pools. It looks like when
> > a
> > cpu-pool is created, it copies the scheduler struct and calls
> > rt_init()
> > where a private field is initialized. So I assume the timer should
> > be put
> > inside the scheduler private struct? Now that I think about it, the
> > timer is
> > hard-coded to run on cpu0. If there're lots of cpu-pools but the
> > replenishment can only be done on the same pcpu, would that be a
> > problem?
> > Should we keep track of all instances of schedulers (nr_rt_ops
> > counts how
> > many) and just put times on different pcpus?
> > 
> > > > +/* controls when to first start the timer*/
> > > > +static int timer_started;
> > > > +
> > > 
> > > I don't like this, and I don't think we need it. In fact, you
> > > removed
> > > it yourself from v3, AFAICT.
> > > 
> > > > @@ -635,6 +652,13 @@ rt_vcpu_insert(const struct scheduler
> > > > *ops,
> > > > struct vcpu *vc)
> > > > 
> > > > ÂÂÂÂÂÂ/* add rt_vcpu svc to scheduler-specific vcpu list of the
> > > > dom */
> > > > ÂÂÂÂÂÂlist_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);
> > > > +
> > > > +ÂÂÂÂif(!timer_started)
> > > > +ÂÂÂÂ{
> > > > +ÂÂÂÂÂÂÂÂ/* the first vcpu starts the timer for the first
> > > > time*/
> > > > +ÂÂÂÂÂÂÂÂtimer_started = 1;
> > > > +ÂÂÂÂÂÂÂÂset_timer(&repl_timer,svc->cur_deadline);
> > > > +ÂÂÂÂ}
> > > > Â }
> > > > 
> > > This also seems to be gone in v3, which is good. In fact, it uses
> > > timer_started, which I already said I didn't like.
> > > 
> > > About the actual startup of the timer (no matter whether for
> > > first time
> > > or not). Here, you were doing it in _vcpu_insert() and not in
> > > _vcpu_wake(); in v3 you're doing it in _vcpu_wake() and not in
> > > _runq_insert()... Which one is the proper way?
> > > 
> > 
> > Correct me if I'm wrong, at the beginning of the boot process, all
> > vcpus are
> > put to sleep/not_runnable after insertions. Therefore, the timer
> > should
> > start when the first vcpu wakes up. I think the wake() in v3 should
> > be
> > correct.
> > 
> > 
> > > > @@ -792,44 +816,6 @@ __runq_pick(const struct scheduler *ops,
> > > > const
> > > > cpumask_t *mask)
> > > > Â }
> > > > 
> > > > Â /*
> > > > - * Update vcpu's budget and
> > > > - * sort runq by insert the modifed vcpu back to runq
> > > > - * lock is grabbed before calling this function
> > > > - */
> > > > -static void
> > > > -__repl_update(const struct scheduler *ops, s_time_t now)
> > > > -{
> > > > 
> > > Please, allow me to say that seeing this function going away,
> > > fills my
> > > heart with pure joy!! :-D
> > > 
> > > > @@ -889,7 +874,7 @@ rt_schedule(const struct scheduler *ops,
> > > > s_time_t
> > > > now, bool_t tasklet_work_sched
> > > > ÂÂÂÂÂÂÂÂÂÂ}
> > > > ÂÂÂÂÂÂ}
> > > > 
> > > > -ÂÂÂÂret.time = MIN(snext->budget, MAX_SCHEDULE); /* sched
> > > > quantum */
> > > > +ÂÂÂÂret.time = snext->budget; /* invoke the scheduler next
> > > > time */
> > > > ÂÂÂÂÂÂret.task = snext->vcpu;
> > > > 
> > > This is ok as it is done in v3 (i.e., snext->budget if !idle, -1
> > > if
> > > idle).
> > > 
> > > > @@ -1074,14 +1055,7 @@ rt_vcpu_wake(const struct scheduler
> > > > *ops,
> > > > struct vcpu *vc)
> > > > ÂÂÂÂÂÂ/* insert svc to runq/depletedq because svc is not in
> > > > queue now
> > > > */
> > > > ÂÂÂÂÂÂ__runq_insert(ops, svc);
> > > > 
> > > > -ÂÂÂÂ__repl_update(ops, now);
> > > > -
> > > > -ÂÂÂÂASSERT(!list_empty(&prv->sdom));
> > > > -ÂÂÂÂsdom = list_entry(prv->sdom.next, struct rt_dom,
> > > > sdom_elem);
> > > > -ÂÂÂÂonline = cpupool_scheduler_cpumask(sdom->dom->cpupool);
> > > > -ÂÂÂÂsnext = __runq_pick(ops, online); /* pick snext from ALL
> > > > valid
> > > > cpus */
> > > > -
> > > > -ÂÂÂÂrunq_tickle(ops, snext);
> > > > +ÂÂÂÂrunq_tickle(ops, svc);
> > > > 
> > > And this is another thing I especially like of this patch: it
> > > makes the
> > > wakeup path a lot simpler and a lot more similar to how it looks
> > > like
> > > in the other schedulers.
> > > 
> > > Good job with this. :-)
> > > 
> > > > @@ -1108,15 +1078,8 @@ rt_context_saved(const struct scheduler
> > > > *ops,
> > > > struct vcpu *vc)
> > > > ÂÂÂÂÂÂif ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc-
> > > > >flags) &&
> > > > ÂÂÂÂÂÂÂÂÂÂÂlikely(vcpu_runnable(vc)) )
> > > > ÂÂÂÂÂÂ{
> > > > +ÂÂÂÂÂÂÂÂ/* only insert the pre-empted vcpu back*/
> > > > ÂÂÂÂÂÂÂÂÂÂ__runq_insert(ops, svc);
> > > > -ÂÂÂÂÂÂÂÂ__repl_update(ops, NOW());
> > > > -
> > > > -ÂÂÂÂÂÂÂÂASSERT(!list_empty(&prv->sdom));
> > > > -ÂÂÂÂÂÂÂÂsdom = list_entry(prv->sdom.next, struct rt_dom,
> > > > sdom_elem);
> > > > -ÂÂÂÂÂÂÂÂonline = cpupool_scheduler_cpumask(sdom->dom-
> > > > >cpupool);
> > > > -ÂÂÂÂÂÂÂÂsnext = __runq_pick(ops, online); /* pick snext from
> > > > ALL
> > > > cpus */
> > > > -
> > > > -ÂÂÂÂÂÂÂÂrunq_tickle(ops, snext);
> > > > ÂÂÂÂÂÂ}
> > > 
> > > Mmm... I'll think about this more and let you know... But out of
> > > the
> > > top of my head, I think the tickling has to stay? You preempted a
> > > vcpu
> > > from the pcpu where it was running, maybe some other pcpu is
> > > either
> > > idle or running a vcpu with a later deadline, and should come and
> > > pick
> > > this one up?
> 
> If that's the case, why should we preempt this VCPU? We should use
> the
> top VCPu in the runq to preempt the lowest priority VCPU, right?
> 
Yeah, in theory, and as far as things goes by "just" looking at
runq_tickle() in sched_rt.c, you're right.

Tickling is (like everything in life :-P) not perfect, though. At least
it's not "instantaneously" that a tickled pcpu come and pick up work...
Something may happen that perturbates the scenario one depicted in his
head when thinking at what will happen after tickling (e.g., the pcpu
being tickled may be doing something which can't be interrupted for a
while).

The scheduler should be robust, and don't explode or don't exhibit
wrong behavior in case, and I think the current code is ok in that
sense.

My idea was that adding one more "tickling point" would make it more
robust, exactly in that regard, i.e., in tolerating anomalies due to
tickling resulting in the pcpu that picked up the work was a different
one from what we expected.

But this indeed introduces more overhead... So, I agree, let's not do
that and see if we encounter problems. We'll come back here if we do.
:-)

> > gEDF allows this but there is overhead and may not be worth it. I
> > have no
> > stats to support this but there are some papers on restricting what
> > tasks
> > can migrate. We can discuss more if we need extra logic here.
> 
> As to gEDF, the scheduling policy does not restrict what tasks can
> migrate, except for the VCPU's hard affinity set by users.ÂÂSo
> migrating is an option. but we should avoid the unnecessary
> preemption/migration.
> 
Agreed.

> > > > +ÂÂÂÂÂÂÂÂif( min_repl> svc->cur_deadline )
> > > > +ÂÂÂÂÂÂÂÂ{
> > > > +ÂÂÂÂÂÂÂÂÂÂÂÂmin_repl = svc->cur_deadline;
> > > > +ÂÂÂÂÂÂÂÂ}
> > > > +ÂÂÂÂÂÂÂÂ/* reinsert the vcpu if its deadline is updated */
> > > > +ÂÂÂÂÂÂÂÂ__q_remove(svc);
> > > > +ÂÂÂÂÂÂÂÂ__runq_insert(ops, svc);
> > > > 
> > > One more proof of what I was trying to say. Is it really this
> > > handler's
> > > job to --basically-- re-sort the runqueue? I don't think so.
> > > 
> > > What is the specific situation that you are trying to handle like
> > > this?
> > > 
> > Right, if we want to count deadline misses, it could be done when a
> > vcpu is
> > picked. However, when selecting the most imminent "inter-release
> > time" of
> > all runnable vcpu, the head of the runq could be missing its
> > deadline and
> > the cur-deadline could be in the past. How do we handle this
> > situation? We
> > still need to scan the runq right?
> 
> I think Dario's point is that:
> All VCPUs on runq should still have remaining budget, so they should
> not have come into the situation of replenishing their budget. So in
> the replenishment handler, runq should not be called. 
>
Exactly.

> I agree the runq
> should not be called to replenish the budget. But the top VCPU in the
> runq may be the next earliest one that should get its budget
> replenished. 
>
With "the top VCPU in the runq" you mean the one that is running on a
pCPU? No, I don't think you refer to that one since, as you said,
running vCPUs are not in the runqueues.

And then why it is only the first vCPU in the runqueue Âyou seem to
care about. I appreciate it has the shortest deadline, but I can't tell
if that's enough to assume we only need to care about it. I probably do
not recall with 100% accuracy the details of the DS algorithm that you
want to implement... In particular, whether or not a replenishment need
to be programmed when the task/vcpu becomes running, so do feel free to
summarize it here for me. :-)

In any case, whatever the algorithm says, what I'm proposing is
something completely different and general enough that would,
hopefully, make it easier to reason on things.

Can we have, together with the timer, a list of replenishment events?
I'm talking about an actual list of entries, where each entry contains
the following information:
Â- time at which the replenishment should happen
Â- amount of replenishment
Â- vcpu to be replenished

The list will be kept in order of replenishment time, and the timer
will always be programmed to fire at the most imminent replenishment
time instant (which will correspond to the first entry in the list).

When the timer fires, it picks up the first entry, takes it out of the
list, does the replenishment and takes care of the effects of the
replenishment itself (deadline update, preemption, runq re-insertion,
moving from depletedq to runq), depending on what the status of the
vcpu being replenished was.

After doing all this, the timer reprograms itself to fire at the
replenishment time of the next (which just became the new first) entry
in the list.

At least, this is the idea. Now, for the implementation:

Â1. instead of only checking the first entry, it's wise that the timer
  handler start to walk the list, and, considering the current time
  (what NOW() gives you) takes care of all entries whose
  replenishment time has passed. This is to deal with cases whereÂ
  the handler may fire a little bit later than expected, and moreÂ
  than just one replenishment event should have occurred already.Â
  Since the list is in order, it is ok to stop as soon as the first
  entry with a replenishment time which is actually in the future is
  found;

Â2. embedded lists give us the opportunity to place one data structureÂ
  in more than one list/queue. So, instead of actually dynamicallyÂ
  allocating and using a dedicated data structure for eachÂ
  replenishment event, I think we can:
  Â- just add another embedded list element in rt_vcpu (just likeÂ
   Âq_elem),
  Â- use that to queue the rt_vcpu-sin the replenishment events list
  Â- add the fields necessary to handle replenishment directly inÂ
   Ârt_vcpu (assuming we need any... If next replenishment time isÂ
   Âthe next absolute deadline and amount is the budget, everything
   Âwe need should already be there)

So.. What do you think?

> > This discussion was before I figured out things about idle_vcpu[]
> > and
> > tasklet. A vcpu could be preempted and put back to either runq or
> > depletedq
> > if a tasklet is scheduled. It could also end up in a depletedq in
> > other
> > situations. I guess Meng's point is this vcpu should be running
> > constantly
> > without being taken off if there is no tasklet, in an effort to
> > follow EDF.
> Hi Tianyang,
> 
> I think Dario made a good point. In order to avoid vcpu being taken
> off from the core, we can handle it in the rt_schedule(), the budget
> enforcement function. This could probably make the code cleaner.
> 
Exactly. Budget enforcement is perfectly fine being done in
rt_schedule().

> > > > 
> > > > +
> > > > +ÂÂÂÂ/* if timer was triggered but none of the vcpus
> > > > +ÂÂÂÂÂ* need to be refilled, set the timer to be the
> > > > +ÂÂÂÂÂ* default period + now
> > > > +ÂÂÂÂÂ*/
> > > > +ÂÂÂÂif(min_repl == LONG_MAX)
> > > > +ÂÂÂÂ{
> > > > +ÂÂÂÂÂÂÂÂset_timer(&repl_timer, now + RTDS_DEFAULT_PERIOD);
> > > > 
> > > I agree with Meng's point in this thread: this should not be
> > > necessary.
> > > If it is, it's most likely because of a bug or to something else.
> > > 
> > > Let's figure out what it is, and fix it properly. (I see that in
> > > v3
> > > this logic is gone, so hopefully you found and fixed the issue
> > > already.)
> > > 
> > Yeah. Like I said the timer is originally programmed to fire when
> > the first
> > vcpu is inserted but all vcpus are not runnable at the beginning of
> > boot
> > process. If the timer is triggered before any vcpu wakes up, there
> > is
> > nothing on queue at all. This should be fixed with wake() in V3.
> 
> Great! I'm wondering if we should look at the v3 version, which
> should
> have fixed many issues? We can decide if the runningq is needed in
> v3.
>
The fact that v3 added runningq is the reason why I'm commenting on
this version: I don't think it's necessary at all, and it was easier to
focus on other issues with that out of the way.

What I've seen in v3 seems ok to me. I can take another look, but I
guess the best thing to do would be to put together a v4 with the fixes
to the issue v2 had taken from v3, the runningq taken away, and the
replenished queue implemented as I described (if you agree with it).

> Dario, What do you think? Right now, I'm kind of lost how we should
> proceed next step? Should we modify based on this version or the
> latest posted version v3?
> 
As you wish, but I'd base the new version on this version, and
"backport" good stuff from v3 to here (again, as it doesn't have the
runningq in the way)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.