Xen project Mailing List

Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.

To: Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx>

From: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Date: Wed, 17 Jun 2015 02:20:36 +0200

Cc: Wei Liu <wei.liu2@xxxxxxxxxx>, Sisu Xi <xisisu@xxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx, Meng Xu <mengxu@xxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Chong Li <lichong659@xxxxxxxxx>

Delivery-date: Wed, 17 Jun 2015 00:21:08 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, 2015-06-16 at 13:07 -0400, Dagaen Golomb wrote: > I'm not replying inline because this is a more general response that > is not limited > to any particular comment. > That's fine. Thanks for this actually... I love discussing these things, it makes me remind the time when I was doing these stuff myself, and makes me feel young! :-P > Separating the replenishment from the scheduler may be problematic. The nature > of the gEDF scheduler is that when a priority changes it should be instantly > reflected. > 'instantly'. Is there such a thing? :-O > If replenishments were done by themsevles, then the scheduler may > decide to wait for some period of time, and in this same time period a > replenishment > could change what should be executing. > Well, the scheduler then would better *not* decide to wait for any period of time. It better act like this: as soon as a replenishment happens, and a new deadline is assigned to a vCPU, put the vCPU in the runqueue, in the proper position; and if such a newly replenished vCPU should now preempt any other, currently running, vCPU, 'tickle' the pCPU in question and let it reschedule, and pick up the newly replenished vCPU. > Either the gEDF policy will be > violated or > the replenishment routine would have to notify the scheduler after any > replenishment, requiring some form of interaction and possibly more scheduler > invocations than the current structure. > Obviously not. In fact, it's your 'current structure' that already invokes the scheduler for any replenishment. Doing more than that, would be really hard (impossible, I should say). I fact, it's the s_timer that you're making fire for any replenishment, and replenishments happen inside rt_schedule() itself. This is _the_definition_ of invoking the scheduler! OTOH, by using a (some) replenishment timer(s), the scheduler *may* *potentially* be invoked for any replenishment, yes, but it may also very well not. In fact, if even after a replenishment, the new deadline of the replenished vCPU is is father than the deadlines of all the currently running vCPUs, there's no need to tickle any pCPU, no need to call the scheduler. BTW, from what you say, it seems to me that we are in agreement: we _do_not_ want to call the scheduler upon any replenishment. Do as I say and we'll achieve that. > The other option is for the scheduler to > check for replenishments, as it does now. > Yeah, the wrong option. :-D :-D > Without any interaction a violation of > the policy could ensue. The gEDF is not like the credit scheduler where there > is > an accounting period where, during this time, policy may be violated > temporarily. > Yep, I do have an idea of what EDF is. > Further, with two timer routines we now need to deal with any ordering > or preemption > between them (or logic to prevent / order such) which I could argue is far > more > complexity than having them done at once as it is now. > Mmm... I think I'm starting to lose you. Ordering between two timers? I don't see what you mean. Allow me, though, to walk through the behavior of your current approach. I've done this already in a previous email, but no big deal re-doing it. Oh, actually, the current approach is buggy, as you tickle the wrong vCPU, but let's forget about that (let's assume it's fixed). Let's say that time for a replenishment to vCPU w has come. on pCPU X: . timer interrupt . TIMER_SOFTIRQ raised . process softirqs . SCHEDULE_SOFTIRQ raised . process softirqs . rt_schedule() . [spin_lock] . burn_budget(scurr) . __repl_update() (goes through all runnable vcpus!!) . snext = __runq_pick() -> says vCPU w should run on pCPU Y . runq_tickle(snext) ----> tickle pCPU Y [*] . [spin_unlock] . on pCPU Y: . process softirqs . SCHEDULE_SOFTIRQ . rt_schedule() . [spin_lock] . burn_budget(scurr) . __repl_update() (goes through all runnable vcpus!! AGAIN!!) . snext = runq_pick() ----------> pick vCPU w . [spin_unlock] . context switch: vCPU w now runs on pCPU Y So, you forced pCPU X through a full execution of rt_schedule(), with the spinlock, the scan of runqueue and depleted queue, and everything, just for the sake of figuring out that pCPU Y should reschedule. Then you (as soon as practical) actually reschedule on pCPU Y and context switch in favour of vCPU w, enacting EDF. That's 2 scheduler invocations, and rt_schedule() is still ugly and complex to read, maintain and improve. [*] and don't forget that this needs fixing, because right now it's just incorrect, and (just speculating) making runq_pick() behave in the way you want, may not be super-simple. Oh, I was almost forgetting! Wonder what happens in the time interval between when __repl_update() is called (from inside rt_schedule()) on pCPU X, and the actual time instant of the context switch on pCPU Y? Yeah, right, what happens is that you're violating EDF. :-/ This can be accounted as a form of overhead introduced by the implementation, I guess. In this specific case, it's due to the fact that you can't, from pCPU X, just change *instantly* what's running on pCPU Y. No, you've got to send an IPI, then Y has to react and it has to schedule, etc. This is of course dependant on the architecture of the scheduler. Xen's scheduler works like this. FWIW, Linux's scheduler, wrt this, also does. So, see? Transient violation of EDF is just there, no matter the approach! Let's look at the timers way: on pCPU X: . timer interrupt . TIMER_SOFTIRQ raised . process softirqs . replenishment_timer_handler() . [spin_lock] . <for_each_depleted_vcpu(i, i.repl_time < NOW()) { . replenish(i) ---------> replenish vCPU w . runq_tickle(i) -------> tickle pCPU Y . }> . [spin_lock] . on pCPU Y: . process softirqs . SCHEDULE_SOFTIRQ . rt_schedule() . [spin_lock] . burn_budget(scurr) . snext = runq_pick() ----------> pick vCPU w . [spin_unlock] . context switch: vCPU x now runs on pCPU Y So this means only 1 invocation of the scheduler, and only on the pCPU that actually needs to reschedule. There's some overhead due to the timer, of course, but, notably, this allows for shorter critical sections, not to mention better looking and easier to maintain code. Are we still violating EDF? Of course we are, between the replenishment and context switch time, as above. That's unavoidable, I'm afraid. Summarizing, the two solutions are on par wrt temporary violation of the theoretical algorithm, but the timer based approach has loads of other benefits, so let's go with timers. :-) > If both are armed for the same time, which should execute? Scheduler > first before > a possibly applicable replenishment? Or replenishment always first? > Both with added > logic to enforce this and prevent preemption, of course. > I really don't see what you mean here. There won't be anything like that to take care of. Xen offers timers as an abstraction, and deals with them itself. You only need to take care of properly serializing rt_schedule() and the timer handling routine, for the code sections that require that. > Due to this, it is my belief that by the nature of the gEDF policy the > current solution is > better, mostly because it appears to actually be the least complex way that is > functionally correct. > Not really, no. I wouldn't be convinced of this, even if what you have right now were functionally correct. > The gEDF policy just isn't well suited for > decoupling events, as > the events are highly dependent on one another, and particularly dependent at > an > instantaneous timescale. > And here it come 'instantaneous timescale' again. No, sorry, but I don't buy this. In fact, why does it not apply to vCPUs' wakeups? I think it does, conceptually... So, should we move the wakeup logic inside rt_schedule() too? > It would be a hard pitch for a gEDF scheduler with a > similar "accounting period" where the gEDF policy could be violated. > There's no accounting period of any sort. It's called overhead! What we need to do is to try to keep it as small as possible. And I'm quite sure that an effective way to do so is to keep critical sections short, especially when protected by (potentially) highly contented spin locks. RTDS, currently, suffer from poor scalability (and, I suspect, poor latency as well, at least under pressure), and one of the reasons why the work you're doing is interesting is that it can alleviate this... if done with that in mind, of course. > That is > blasphemy in the real-time world. > Blasphemy, that's a nice one! :-D :-D Well, I've never been good at religions, I have to admit. So, yes, I can live with being called blasphemous, I guess. :-) > Its also worth noting that the stereotypical textbook event-driven > model is as we have > now: a single event loop that handles events. In this case the scheduler is > the > conceptually the main loop and this makes perfect sense: its deciding > what to run > (think of the running VCPUs as callback functions and the abstraction > falls into place > perfectly). The event loop needs some information (about replenishments) > before > deciding, and collecting this would be part of the decode and decision > phase for an > event signal. > Talking about 'stereotypes', I don't have any textbook describing an event-driven model at hand right now. However, you RT-Xen people like and use LitmusRT, don't you? Well, you're doing the right thing, because it's a great piece of software! If you're interested, have a look in there. Spoiler: you'll find a lot of timers. :-D :-D More seriously, it's of course rather different, as Linux is not Xen, but since you're looking for stereotypes: - Linux (or at lease LitmusRT patch 2014.2, i.e., what I'm looking at right now) has ticks. This means a timer fires every X milliseconds (on every CPU), and task_tick_litmus() is run as a consequence. Such function does not (at all!) invoke the scheduler directly, it much rather just checks whether doing so is necessary, in this case because of some task having exhausted its budget. If it happened, it calls litmus_reschedule_local() (see below). I know, this is budget enforcement, not replenishment, but I think it works as a good example of my point about using timers. - Actually, budget enforcement can be done, in LitmusRT, in two ways. One is tick based (described above), the other, called 'precise', is timer based. To see that in action, check how the enforcement timer is handled, e.g., in update_enforcement_timer(). Look at on_enforcement_timeout(), and find that it also _does_not_ schedule. It again just asks for the scheduler to be invoked as soon as practical, via (again) litmus_reschedule_local(). - Finally, a 'job release', in LitmusRT terminology, is probably what is most close to a budget replenishment for us here. In fact, when a new RT job is released, it's given full budget, and its scheduling parameters (deadline, in most cases) are updated accordingly, like it happens, for us, with replenishments. Check, therefore, how that happens in add_release() and __add_release(). You'll find a call to arm_release_timer() which calls, when firing, on_release_timer(). From there, go, e.g., to gsnedf_release_jobs(), and follow the call chain through check_for_preemptions() --> preempt() --> preempt_if_preemptable() --> litmus_reschedule(), which then calls either set_tsk_need_resched(current) or smp_send_reschedule(cpu). litmus_reschedule_local(), and similar functions, do something very similar to our tickling, i.e., they ask a specific CPU to go through the scheduler. When this happens via set_tsk_need_resched(current) it's like, in Xen, tickling the pCPU we're on. When it happens via smp_send_reschedule(cpu) it's like, in Xen, tickling the remote pCPU cpu. Now, although some years have passed, I've interacted quite a bit with LitmusRT folks. They're very smart, and usually very good at what they do, and I'd be surprised to find them guilty of that much real-time blasphemy. :-P You also may know that there is an EDF implementation in mainline Linux, and you may would like to go have a look there too. I'm not including any example from there because I'm one of the main authors of it, so it's pretty obvious that it uses the approach I think it's best (and, more important, it would be rather inelegant to cite myself! :-D). Anyway, if you fancy doing that, the core is in kernel/sched/deadline.c. > Not only is there a complexity issue, but as mentioned before this may be the > best performing option, making the most information available and the > best timing > decision possible. Adding a few extra cycles to a hot path, even with > a lock being > held, is worth it if the scheduler and context switching is done less. > Right, so let's use timers, they require less or equal scheduler invocations wrt to... Always invoking the scheduler!! (Not sure why you mention the number of context switches, that's independent on which of the two approaches you choose.) > For the above reasons I think you should reconsider the current > implementation, as it > appears it may be the least complex and easiest to reason about > solution. > Well, no, sorry. :-( I mean, I appreciate that what you have now (once fixed), may be effective in avoiding some unnecessary invocation of the scheduler, with respect to what we currently have in the repo, so I'm not saying it can't be seen as an improvement. It's a "nice hack", actually. However, if we're still aiming at making this scheduler a first class citizen within Xen, we're interested in less hacks, rather than in more --no matter how nice they are. So, with that in mind, it would be a step in the wrong direction, IMO. > Let me know if I'm missing some key insight into how the behavior > could be implemented > correctly and beautifully using the multiple timer approach. > well, I tried. I really did. > I simply > don't see how it can > be done without heavy interaction and information sharing between them > which really > defeats the purpose. > No, it doesn't. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.