
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



Hi Dario and Dagaen,

2015-06-09 5:53 GMT-07:00 Dario Faggioli <dario.faggioli@xxxxxxxxxx>:
>
> Hello Dagaen,
>
> Thanks for doing this. The first question I have is whether you run any
> test/benchmark that you can show the result of here.

Thank you very much Dario for your quick review!

>
> For instance, a few weeks ago, Meng (while trying to do a completely
> different thing) sent here on the list some numbers about the frequency
> of the scheduler being called, and the overhead caused by that... Would
> it be hard to run the same experiments and collect the numbers, with and
> without your patch?

I think there are two ways to evaluate the performance:
1) Use xentrace to trace how often the scheduler is invoked, and the
overhead it causes, while running workloads inside guest domains;
2) Use cyclictest, as Dario mentioned before, to measure the real-time
performance observed by the end user inside the guest. Dagaen, I can
provide you with the commands to run it; it is actually quite simple.

>
> On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
> > To do this, we create a new list that holds, for each
> > vcpu, the time least into the future that it may need to be
> > rescheduled.
> >
> Ok. Actually, what I really expected to see in this patch was, either:
>  a) a new, scheduler wide, list of replenishment times _and_ a timer,
>     always (re)programmed to fire at the most imminent one; or
>  b) one timer for each vcpu, always (re)programmed to fire at the time
>     instant when the replenishment for that vcpu should happen.

Ah...

Here is what we are trying to do:
Reuse the existing timer that drives rt_schedule(), i.e., keep using
that timer to invoke the scheduling function (rt_schedule). That timer
then has two purposes:
(a) budget enforcement, as before;
(b) budget replenishment (and deadline update, since deadline = period
for each VCPU).

When the timer fires, rt_schedule() is called: it enforces the budget,
replenishes the budget of the VCPU(s) whose new period has arrived, and
picks the next VCPU.

In other words, we can think about the scheduler this way:
a scheduling decision happens at two kinds of time points:
(a) when a running VCPU exhausts its budget (budget enforcement, as before);
(b) when a VCPU's budget is replenished and its deadline updated.

Whenever either of these two events occurs, the scheduler may need to
pick the next VCPU to run, or preempt the currently running VCPU.
Therefore, the scheduling-decision logic is unavoidable when either of
these two events occurs, IMO.
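
To make this concrete, a single firing of the timer would go roughly
through the following steps (simplified sketch, not the literal patch
code; burn_budget(), __repl_update() and __runq_pick() are the existing
helpers in sched_rt.c, __timerq_next() is the new helper added by the
patch; idle handling, locking and the tasklet case are omitted):

static struct task_slice
rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
{
    const int cpu = smp_processor_id();
    struct rt_vcpu * const scurr = rt_vcpu(current);
    struct rt_vcpu *snext;
    struct task_slice ret;

    burn_budget(ops, scurr, now);       /* (a) budget enforcement             */
    __repl_update(ops, now);            /* (b) replenishments + new deadlines */

    snext = __runq_pick(ops, cpumask_of(cpu));  /* EDF pick of the next VCPU  */

    ret.task = snext->vcpu;
    ret.migrated = 0;
    ret.time = __timerq_next(ops, now); /* sleep until the next event,
                                           instead of a fixed 1 ms            */
    return ret;
}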

In terms of performance/overhead, I think option b), which you pointed
out, has the benefit of low overhead when updating the budget, because
we don't need to hold the lock just for that. However, a budget
replenishment happens exactly when the VCPU's deadline changes (i.e.,
when its next period arrives). That means the scheduler may need to
make a new scheduling decision, and in that situation it has to hold
the runq lock anyway. So I'm curious where the overhead saved by option
b), compared with the current design/implementation Dagaen is doing,
would actually come from.
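
For instance, I would expect a per-VCPU replenishment timer handler for
option b) to look roughly like the following (purely hypothetical
sketch: repl_timer_handler(), svc->repl_timer, svc->prv and svc->ops do
not exist, I am only trying to make the point about the lock; set_timer()
and the spinlock calls are the standard Xen primitives):

static void repl_timer_handler(void *data)
{
    struct rt_vcpu *svc = data;
    struct rt_private *prv = svc->prv;   /* hypothetical back-pointer */
    unsigned long flags;

    spin_lock_irqsave(&prv->lock, flags);

    /* the replenishment itself is cheap ... */
    svc->cur_budget   = svc->budget;     /* refill the budget         */
    svc->cur_deadline += svc->period;    /* push the deadline forward */

    /* ... but the new deadline changes the VCPU's priority, so a
     * scheduling decision may be needed, and that is the part which
     * still seems to require the global lock: */
    runq_tickle(svc->ops, svc);          /* 'ops' via another hypothetical
                                            back-pointer */

    spin_unlock_irqrestore(&prv->lock, flags);

    set_timer(&svc->repl_timer, svc->cur_deadline);  /* re-arm for next period */
}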

Of course, it is hard to tell the performance difference until we
really run it. IMHO, we should at least do some mental exercise and
have good reasons to believe the performance could improve in some
situations.

>
> And note that, when I say "timer", I mean an actual Xen timer, i.e.,
> those things that are started, stopped, and with a timer handling
> routine being called when they expire. For an example, you can have a
> look, in sched_credit.c, at 'ticker' or 'master_ticker'.
>
> Deciding between a) or b) isn't easy, without actually trying to
> implement them. I personally prefer b), and I think it would be a
> superior long term solution (see right below), but given the current
> architecture of the RTDS scheduler (e.g., the fact that it has a global
> runq, etc), I won't nack a), which would most likely be simpler.
>
> Performance and overhead wise, again, hard to tell in advance. b) would
> introduce more overhead (as there are more timers), but will likely
> reveal more scalable (which is not something of great importance for
> this scheduler) and may allow for better latency (which _is_ something
> of great importance for this scheduler :-) ), since there won't be the
> need to lock the whole scheduler during the replenishments (which is,
> OTOH, necessary with a)).

Right. This is true. But when we replenish the budget, we also change
the deadline, which affects the VCPU's priority and therefore the
scheduling decision.

>
> And that's it for the replenishments. For enforcing the budget, well, we
> have the ret.time value of the task_slice struct returned by
> rt_schedule, so that's event driven already, and we don't need to do
> much about it.
>
> Does all this make sense? Am I making myself clear enough?
> If not, feel free to ask.

Thank you very much for your detailed explanation. :-)
I am just a little confused about how much overhead option b) could
actually save.

>
> >  The scheduler chooses the lowest time off of this
> > list and waits until the specified time instead of running every
> > 1 ms as it did before.
> >
> Yeah, well, I see what you mean and how this is actually succeeding
> (at least AFAICT, by looking at the code, again, it would be nice to
> have some numbers) in improving the scheduler behavior.
>
> However, this transition toward event driven-ness has two goals:
>  * improve the scheduler behavior [check, at least to some extent]
>  * improve the code (readability, maintainability, etc.)
>    [not check at all :-(]

I see. We did consider readability and maintainability in the design,
but I think we neglected the standards that "define" the rules of
readability and maintainability. Is there some documentation we could
follow?

>
> As an example, the __repl_update() function: it's called 2 times inside
> rt_schedule() and a third from rt_context_saved(), which is basically
> like saying it's called 3 times from rt_schedule(), and this always made
> very little sense to me.

I see. This part should be improved. This is not good for sure.

> In fact, I think that scheduling decisions and
> replenishment events shouldn't be so tightly coupled. There are
> interdependencies, of course (a replenishment could call for a
> scheduling decision to be taken), but not like this.

I see. I think this is the root reason why we ended up with one design
and you had another one in mind. :-)

I see how you think these interdependencies should be handled (via
per-VCPU Xen timers, in option b), but I didn't quite get the
reasons/principles why you think the current design is not a good way
to handle such interdependencies. :-(

> That's why I say
> I'd like to see a timer handling routine dealing with replenishment, and
> let rt_schedule() do its job, which is:
>  - enforcing the budget of the running vcpu. If it expires,
>    _(re)program_ the replenishment timer
>  - choose the best vcpu to run next, if necessary
>
> With this patch, you're still using rt_schedule() to do both scheduling
> and replenishments, although you're now doing it at actual replenishment
> times, instead of checking for them every millisecond.
>
> Also, the bits and pieces that you need to add in order to deal with
> this new queue is, really, making things even more complex than they are
> now, which is undesirable.
>
> So, all this being said, let me know what you think about it (and that
> applies to everyone else as well, of course, in particular Meng).

I think I didn't quite get the underlying principles/reasons why the
current design is not good at handling the interdependencies between
budget replenishment and scheduling decisions.

I can understand the strengths of option b). I think I would understand
option b) better if I understood the underlying
principles/reasons/rationale for why the current design is not good.
(I'm not saying/arguing that the current design is good; I'm just
trying to understand its weaknesses so that we know how best to handle
them.)

>
> I won't comment on the code in too much details, as it will require some
> quite radical restructuring, but I'll skim through it and provide some
> hints anyway.

Sure.

>
> > Signed-off-by: Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx>
> > Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx>
> > ---
> >  xen/common/sched_rt.c |  319 ++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 222 insertions(+), 97 deletions(-)
> >
> First of all, it's a big patch. It's only changing one file and one
> logical component, and for that reason, TBH, it's not that hard to
> review. Still I think you can break it in at least two, and perhaps even
> more, if you try to implement what I described above.
>
> > diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> > index 7c39a9e..25f0458 100644
> > --- a/xen/common/sched_rt.c
> > +++ b/xen/common/sched_rt.c
>
> > @@ -156,6 +160,7 @@ struct rt_vcpu {
> >      s_time_t cur_budget;        /* current budget */
> >      s_time_t last_start;        /* last start time */
> >      s_time_t cur_deadline;      /* current deadline for EDF */
> > +    s_time_t next_sched_needed; /* next time to make scheduling decision */
> >
> As an example of why I said that things should become simpler, and are
> instead becoming more complex: with my solution, you don't need anything
> like this. In fact, budget enforcement is already being done ok in
> rt_schedule(), so no need to do anything more/different about it.
> Replenishments should be programmed to fire at cur_deadline, so again,
> no need to add this new field, and multiplex its actual value between
> budget and deadline (which, yes, counts as being something rather
> complex for me! :-D).
>
> > @@ -361,6 +387,12 @@ __q_remove(struct rt_vcpu *svc)
> >          list_del_init(&svc->q_elem);
> >  }
> >
> > +static inline void __t_remove(struct rt_vcpu *svc)
> > +{
> > +     if( !list_empty(&svc->t_elem) )
> > +             list_del_init(&svc->t_elem);
> > +}
> > +
> >
> You're using hard tabs for indenting here. As you see everywhere esle,
> Xen wants 4 spaces for this.
>
> >  /*
> >   * Insert svc with budget in RunQ according to EDF:
> >   * vcpus with smaller deadlines go first.
> > @@ -395,6 +427,72 @@ __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> >  }
> >
> >  /*
> > + * Insert svc into the timerq, maintaining ascending order by next_sched_needed.
> > + */
> > +static void __timerq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> > +{
> > +    struct rt_private *prv = rt_priv(ops);
> > +    struct list_head *timerq = rt_timerq(ops);
> > +    struct list_head *iter = timerq;
> > +
> > +    ASSERT( spin_is_locked(&prv->lock) );
> > +
> > +    ASSERT( list_empty(&svc->t_elem) );
> > +
> The blank line between the asserts, I'd kill it.
>
> > +     list_for_each(iter, timerq)
> > +     {
> >
> You're still using tabs, and mixing them with spaces, making things look
> even more cumbersome.
>
> > +/*
> > + * Return how far the lowest time on the timerq (that is after NOW) is in the
> > + * future.
> > + * Lock must be grabbed before calling this.
> > + */
> > +static s_time_t __timerq_next(const struct scheduler *ops, s_time_t now)
> > +{
> > +    struct list_head *timerq = rt_timerq(ops);
> > +    struct list_head *iter, *tmp;
> > +
> > +    list_for_each_safe(iter, tmp, timerq)
> > +    {
> > +        struct rt_vcpu * iter_svc = __t_elem(iter);
> > +
> > +        if ( iter_svc->next_sched_needed > now )
> > +            return (iter_svc->next_sched_needed - now);
> > +        else
> > +            __t_remove(iter_svc);
> > +    }
> > +
> > +    return MAX_SCHEDULE;
> > +}
> >
> If using a timer, you can just get rid of items while, in the timer
> routine, you handle the event associated to them.
>
> Also, why is MAX_SCHEDULE still there?

Is it OK to use a BUG_ON() here, if this path should never be reached?
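
Something like this is what I have in mind (just a sketch of the
question), at the end of __timerq_next():

    /* if the timerq is guaranteed to always hold at least one future
     * event when this function is called, the fall-through is a bug: */
    BUG();    /* instead of: return MAX_SCHEDULE; */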

>
> > +/*
> > + * Updates the next_sched_needed field for a vcpu, which is used for
> > + * determining the next needed schedule time for this vcpu. Then updates
> > + * timerq via reinsert.
> > + */
> > +static void update_sched_time(const struct scheduler *ops, s_time_t now,
> > +                              struct rt_vcpu *svc)
> > +{
> > +    /* update next needed schedule time */
> > +    if( test_bit(__RTDS_scheduled, &svc->flags) )
> > +        svc->next_sched_needed = now + svc->cur_budget;
> > +    else
> > +        svc->next_sched_needed = svc->cur_deadline;
> > +
> > +    /* Remove and reinsert to maintain sorted order in timerq */
> > +    __t_remove(svc);
> > +    __timerq_insert(ops, svc);
> > +
> > +    return;
> > +}
> >
> And here's the multiplexing, which --even worse-- (may) require
> rearranging the queue! We really don't need anything like this.
>
> >  /*
> > + * Pick a cpu where to run a vcpu,
> > + * possibly kicking out the vcpu running there
> > + * Called by wake() and context_saved()
> > + * We have a running candidate here, the kick logic is:
> > + * Among all the cpus that are within the cpu affinity
> > + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> > + * 2) if there are any idle vcpu, kick it.
> > + * 3) now all pcpus are busy;
> > + *    among all the running vcpus, pick lowest priority one
> > + *    if snext has higher priority, kick it.
> > + *
> > + * TODO:
> > + * 1) what if these two vcpus belongs to the same domain?
> > + *    replace a vcpu belonging to the same domain introduces more overhead
> > + *
> > + * lock is grabbed before calling this function
> > + */
> > +static void
> > +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> > +{
> > +    struct rt_private *prv = rt_priv(ops);
> > +    struct rt_vcpu *latest_deadline_vcpu = NULL; /* lowest priority */
> > +    struct rt_vcpu *iter_svc;
> > +    struct vcpu *iter_vc;
> > +    int cpu = 0, cpu_to_tickle = 0;
> > +    cpumask_t not_tickled;
> > +    cpumask_t *online;
> > +
> >
> [snip]
>
> And here you are moving a function up. In general, it is better to have
> separate patches that do the movings, with the fact that the are all
> about moving and that the code is not being changed, clearly stated in
> the commit message. This is because it is really really hard to figure
> out, e.g. from here, whether you also changed something in the function
> while moving it, making reviewing a lot harder and more prone to error.
>
> That being said, in this specific case, you're moving up runq_tickle(),
> the purpose of which is to trigger a call to rt_schedule() (after going
> through the softirq machinery)... in order to call it in rt_schedule().
> That's pretty tricky, and is not how tickling should work.
>
> Again, with an approach that really take advantage of timers, this won't
> be necessary.

The reason we need to call runq_tickle() here is to pick the "best"
CPU to run the next VCPU.
Consider the following situation:
the timer may fire on a CPU i that is running a medium-priority VCPU,
while another CPU j is running the lowest-priority VCPU. Under the
global EDF scheduling policy, the next VCPU should be placed onto CPU j
instead of CPU i. Because we do not control which CPU the timer fires
on when we arm it (in the current design), it may fire on CPU i, and
that's why we need runq_tickle() here.
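
To make the scenario concrete (the numbers are made up, just for
illustration): suppose the timer fires on CPU 1, which is running a
VCPU with deadline 20ms, while CPU 3 is running a VCPU with deadline
50ms, and the VCPU that has just been replenished now has deadline
30ms. Under global EDF, the replenished VCPU should preempt the VCPU on
CPU 3 (the one with the latest deadline), not the one on CPU 1 where
the timer happened to fire; runq_tickle() is what finds CPU 3 and
raises SCHEDULE_SOFTIRQ on it.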

But this is a (minor) issue, since we need to settle the design first. :-)

Thank you again for your helpful comments!

Best regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

