
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



Hi Dario,

First, I think I got most of the points you raised/explained! They are
very clear, and thanks for the detailed explanation of your idea!

I have some minor questions about your comments/advice.

>
>> 2015-06-09 5:53 GMT-07:00 Dario Faggioli <dario.faggioli@xxxxxxxxxx>:
>
>> > Thanks for doing this. The first question I have is whether you run any
>> > test/benchmark that you can show the result of here.
>>
>> Thank you very much Dario for your quick review!
>>
> It wasn't as quick as I wanted it to be. I'll improve on that. :-)
>
>> > For instance, a few weeks ago, Meng (while trying to do a completely
>> > different thing) sent here on the list some numbers about the frequency
>> > of the scheduler being called, and the overhead caused by that... Would
>> > it be hard to run the same experiments and collect the numbers, with and
>> > without your patch?
>>
>> I think there should be two ways to evaluate the performance:
>> 1) Using xentrace to trace the frequency of the scheduler being called
>> and the overhead, when we run workload inside guest domains;
>>
> Yes, that counts more as a way of testing whether what is done in here
> actually works, as we expect the scheduler to be invoked a lot less, and
> that would tell us whether or not it is the case.
>
> So, please, provide these numbers (at least the number/frequency of
> invocation, overhead measurements can come later) as soon as practical.
>
>> 2) Using cyclictest as Dario mentioned before to test the real-time
>> performance at end user. Dagaen, I can provide you the commands to run
>> it, which is actually quite simple to run.
>>
> It is indeed, and that's more similar to a proper performance evaluation
> of an aspect which is rather critical for this scheduler, and really
> important for people wanting to use it.
>
> So, yes, seeing some results would be great, independently from the
> specific work done in this patch.

Right! I have some ideas about this test, but I don't want to derail the
focus of this thread. :-)
I will raise this test again when we come to it.


>
>> > On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
>> > > To do this, we create a new list that holds, for each
>> > > vcpu, the time least into the future that it may need to be
>> > > rescheduled.
>> > >
>> > Ok. Actually, what I really expected to see in this patch was, either:
>> >  a) a new, scheduler wide, list of replenishment times _and_ a timer,
>> >     always (re)programmed to fire at the most imminent one; or
>> >  b) one timer for each vcpu, always (re)programmed to fire at the time
>> >     instant when the replenishment for that vcpu should happen.
>>
>> Ah...
>>
>> Here what we are trying to do is:
>> Reuse the existing timer that is used by rt_schedule().
>>
> Yeah, and what I can't figure out is why you decided to do so. The
> reason I don't like it is that things become a lot (more) complex to
> understand, maintain and modify.

Now I get why you think it is harder to maintain. Right, reusing the
timer just makes rt_schedule() more complex and makes the hot path
longer. If the other schedulers use this timer only for budget
enforcement, it is not a good idea to use it for other purposes in the
RTDS scheduler, for the sake of design consistency across schedulers.

>
>>  We use  that
>> timer as the timer to invoke the schedule function (rt_schedule). So
>> that timer has two purposes:
>> (a) budget enforcement as it was;
>> (b) budget replenish (and deadline enforcement since deadline = period
>> for each VCPU).
>>
> Exactly, the timer has two purposes. It actually has 3, as you're saying
> yourself ("budget replenish (*and* deadline enforcement"). It should
> have exactly one. :-)
>
>> When the timer fires, the rt_schedule is called: enforcement the
>> budget, replenish the budget of VCPU(s), and pick the (next) VCPU.
>>
> Right. OTOH, I think that all it should do is pick the next vcpu, if
> necessary. In fact, the purpose of all this that Dagaen is doing is to
> dramatically reduce the calls to rt_scheduler(), when that is not
> necessary. Well, replenishing the budget of all running vcpus around is
> not the core scheduling function's business, especially considering that
> we're holding a global spinlock.
>
>> In other words, we can think about the scheduler in another way:
>> The scheduling decision happens at the following two time points:
>> (a) budget enforcement as it was;
>> (b) budget replenish and deadline enforcement;
>>
> The scheduler is all that's in sched_rt.c, point here is how things
> should be organized.

Right. Got it.

>
>> Whenever any of the above two situations occurs, the scheduler may
>> need to pick the next VCPU to run or preempt a current running VCPU.
>> Therefore, the logic for scheduling decision is unavoidable when
>> either of these two situation occurs, IMO.
>>
> A budget replenishment for a vcpu with a big period, in a busy system,
> may well end up not requiring any immediate rescheduling of any of the
> lower period vcpus, may it not?

Right. VCPUs with larger periods should be replenished less frequently.

>
> So, with your approach, it is like this, when replenishment time for
> vcpu X arrives:
>
>   timer interrupt
>     TIMER_SOFTIRQ raised
>       process softirqs
>         SCHEDULE_SOFTIRQ raised
>           process softirqs
>             rt_schedule()
>               [spin_lock]
>               burn_budget(scurr)
>               __repl_update()  (goes through all runnable vcpus!!)
>               snext = __runq_pick(): snext == scurr

Ah, I neglected this situation. As you said below, if scurr has higher
priority than the top VCPU of the runq, we are going to tickle scurr,
which should not happen.

>               runq_tickle(snext)...WAIT, WHAT?!?! :-O
>
> I mean, and I'm noticing this now, if the replenishments done during a
> particular call to rt_schedule() are not enough to change the situation
> on that particular pcpu, and hence the task which was running (and that
> you are deliberately disturbing with _a_full_execution_ of the
> scheduler's core function!) should continue to do so, you're calling
> runq_tickle() right on it. So, AFAICT, you're tickling scurr, not a
> newly replenished vcpu!
>
> Am I missing or misreading something? Let's assume not, and see what
> happens in this case...

You are right! Although this issue (i.e., tickling scurr instead of the
next highest-priority VCPU) can be "fixed" (dirtily), it is better
avoided altogether with the design option a) you described.
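
(Just to illustrate what I mean by a "dirty" fix: in Dagaen's current
patch it would be a guard along these lines inside rt_schedule(). This
is only a rough sketch, not something I am proposing:

    /* Hypothetical guard: only tickle if the vcpu we picked is not the
     * one already running on this pcpu. */
    if ( snext != scurr )
        runq_tickle(ops, snext);

As you said, option a) avoids the problem entirely, because
rt_schedule() would no longer do the replenishments.)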

>
> Looking at runq_tickle(), if there are idle pcpus, you'll probably
> tickle one of them, which will likely pick up the newly replenished
> vcpu, like this (as a followup of the 'callchain' above):
>
>   process softirqs
>     SCHEDULE_SOFTIRQ
>       rt_schedule()
>         [spin_lock]
>         __repl_update()  (goes through all runnable vcpus!! AGAIN!!)
>         snext = runq_pick(): snext == vcpu X
>         [spin_unlock]
>
> If there are no idle pcpus, you probably won't tickle anyone, which is
> also fine.
>
> So, yay, it works!! Is this abuse of runq_tickle() there by design? :-D

Yes. It actually abuses rt_schedule(). Some of the operations (e.g.,
__repl_update()) are unnecessarily called twice, which is a bad thing
on the hot path. :-(

>
> Jokes apart, if the above analysis is accurate, I think this is a good
> enough example of what I meant when saying to Dagaen "this is making
> things too complex".
Yes. The flow chart you drew is very clear! Thanks!

@Dagaen, what do you think? Please comment on Dario's reply with your
opinion and raise any of your concerns.

>
>> In terms of performance/overhead, I think the option b) you pointed
>> out has the benefit of low overhead in updating the budget because we
>> don't need to hold the lock. However, the budget replenish occurs when
>> deadline of the VCPU is changed (which means the next period of the
>> VCPU arrives). Then it means the scheduler (may) need to make a new
>> scheduling decision, in which situation, the scheduler will hold the
>> lock for the runq. So I'm curious about the source of overhead the
>> option b) could save over the current implementation/design Dagaen is
>> doing.
>>
> Leave my b) option alone, for now. With a) done as I'm suggesting, for
> one, you'll be removing __repl_update() from rt_schedule(), which means
> no longer going through the list of all runnable vcpus with the global
> scheduler spin lock held, which really is something we should aim at for
> this scheduler, sooner rather than later.
>
> Here's how I envision things to go. Again, I'm talking about sticking
> with option a), so no per-vcpu timers, just 1 timer and a global queue,
> which now is a replenishment queue:
>
>   timer interrupt
>     TIMER_SOFTIRQ raised
>       process softirqs
>         replenishment_timer_handler()
>           [spin_lock]
>             <for_each_replenishment_event(repl_time < NOW()) {
>                replenish(vcpu)
>                runq_tickle(vcpu)
>              }>
>           [spin_unlock]
>
> Then, on the tickled pcpus (if any):
>
>   process softirqs
>     SCHEDULE_SOFTIRQ
>       rt_schedule()
>         [spin_lock]
>         snext = runq_pick(): snext == vcpu X
>         [spin_unlock]
>
> And get rid of __repl_update(), which makes my eyes bleed every time I
> open sched_rt.c. :-)
>
> Oh, and note that we probably can use a new spin lock for the
> replenishment queue, different from the existing global scheduler
> spinlock, lowering contention and increasing scalability even further!

Here is the main question I have about your advice.
If we introduce a new spin lock for the replenishment queue, what is
the relation between the new lock and the existing lock for the runq?
Because the deadline decides the priority of a VCPU, and thus the
ordering of VCPUs in the runq, "replenish(vcpu)" will have to operate
on the runq as well. As shown in the workflow in your reply:
     >                replenish(vcpu)
     >                runq_tickle(vcpu)
runq_tickle(vcpu) will pick the desired CPU, and on that CPU
rt_schedule() will pick snext via runq_pick(). Therefore, the
replenished VCPU needs to be re-sorted in the runq, so replenish(vcpu)
has to operate on the runq. (A rough sketch of what I mean is below.)

Another question in my mind is: do we really need a separate
replenishment queue to record the replenishment time? Because
deadline = period in the current implementation, the deadline of a
VCPU in the runq is already its replenishment time. (We would be
reusing the runq here, and I'm unsure whether that is good or not in
terms of maintenance.)
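
For example, if we did keep a separate replenishment queue, I imagine
each entry would only need to carry something like this (hypothetical
names again):

    /* Hypothetical entry of a scheduler-wide replenishment queue. */
    struct repl_event {
        struct list_head q_elem;   /* position in the replenishment queue */
        struct rt_vcpu *svc;       /* vcpu to replenish */
        s_time_t repl_time;        /* == svc->cur_deadline as long as
                                    * deadline == period */
    };

That is, as long as deadline == period, repl_time just duplicates
cur_deadline, which is why I wonder whether the runq (already sorted by
deadline) is enough.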

>
>> > >  The scheduler chooses the lowest time off of this
>> > > list and waits until the specified time instead of running every
>> > > 1 ms as it did before.
>> > >
>> > Yeah, well, I see what you mean and how this is actually succeeding
>> > (at least AFAICT, by looking at the code, again, it would be nice to
>> > have some numbers) in improving the scheduler behavior.
>> >
>> > However, this transition toward event driven-ness has two goals:
>> >  * improve the scheduler behavior [check, at least to some extent]
>> >  * improve the code (readability, maintainability, etc.)
>> >    [not check at all :-(]
>>
>> I see. We did consider the readability and maintainability factor in
>> the design, but I think we just neglected the standards that "define" the
>> rules of readability and maintainability. Is there some documentation
>> that we could follow?
>>
> There is not, and probably never will be... and that's fine, because
> that is not something you write down in a document.
>
> It's trial and error, following suits and making and looking at examples
> (like the analysis I made above).

I see. I like the analysis you made very much!  Thanks! :-)

>
>> I see how you think these interdependencies should be handled (via Xen
>> timers per VCPU in option b), but I didn't quite get the
>> reason/principles why you think the current design is not good to
>> handle such interdependencies. :-(
>>
> Right. I hope this is more clear now.

Yes. I think I get your idea now. The explanation is very clear and neat! :-)

Thank you very much and best regards,

Meng
-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
