[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v2] Modified RTDS scheduler to use an event-driven model instead of polling.

Hi Dario,

(I understand and agree most of your comments, but have one concern
about your comment. I will comment below.)

Hi Dagaen,
Please comment on my comment if you have any concern/idea.

2015-07-07 7:03 GMT-07:00 Dario Faggioli <dario.faggioli@xxxxxxxxxx>:
> On Mon, 2015-07-06 at 22:51 -0700, Meng Xu wrote:
> > >> > +static void replenishment_handler(void* data)
> > >> > +{
> > >> > +    const struct scheduler *ops = data;
> > >> > +    struct rt_private *prv = rt_priv(ops);
> > >> > +    struct list_head *depletedq = rt_depletedq(ops);
> > >> > +    struct list_head *iter;
> > >> > +    struct list_head *tmp;
> > >> > +    struct rt_vcpu *svc = NULL;
> > >> > +    s_time_t now = NOW();
> > >> > +    int new_position = 0;
> > >> > +    unsigned long flags;
> > >> > +
> > >> > +    spin_lock_irqsave(&prv->lock, flags);
> > >> > +
> > >> > +    // Replenish the vCPU that triggered this, and resort it into runq
> > >> > +    list_for_each_safe(iter, tmp, depletedq)
> > >> > +    {
> > >> > +        svc = __q_elem(iter);
> > >> > +        if ( now >= svc->cur_deadline )
> > >> > +        {
> > >> > +            rt_update_deadline(now, svc);
> > >> > +            __q_remove(svc); /* remove from depleted queue */
> > >> > +            new_position = __runq_insert(ops, svc); /* add to runq */
> > >> > +        }
> > >> > +        else break; // leave list_for_each_safe
> > >> > +    }
> > >> > +
> > >> > +    // If we become one of top [# CPUs] in the runq, tickle it
> > >> > +    // TODO: make this work when multiple tickles are required
> > >>
> > >> Why do you need multiple tickles?
> > >>
> > > Because the loop above may have been done more than one replenishment,
> > > which makes sense.
> >
> > Ah-ha, I got the reason.I should have read it twice to figure it out. :-)
> > >
> > > While we are here. I've said that I don't like the fact that we have one
> > > big spinlock for the whole scheduler. However, we do have it right now,
> > > so let's make good use of it.
> >
> > OK.
> >
> > >
> > > So (but please double check what I'm about to say, and provide your
> > > view), it should be safe to access the various svc->vc->processor from
> > > inside here, since we're holding the big log. If that's the case, then
> > > it should be fairly easy to figure out which are the pCPUs that needs to
> > > reschedule (playing with the index, etc), and then use their
> > > vc->processor to decide what pCPU to actually tickle, isn't this the
> > > case?
> >
> > Hmm, I don't quite get what you meant here although I read it for
> > three times. :-(
> >
> Yeah, sorry, my bad. :-/

No problem at all. :-) Thank you so much for your such detailed
explanation!  :-D

> > Did you mean we can decide which PCPUs to tickle for the several
> > "highest" priority vcpus after their budgets get replenished?
> > If so, doesn't that mean that we will "duplicate" (part of) the
> > runq_tickle code to find the pCPUs to preempt? Is it better than the
> > approach that just call runq_tickle each time whenever a high priority
> > "ready" VCPU is found?
> >
> I was just sketching a possible improvement, which as usual is something
> ttivially done, without actually writing the code. However, let me try
> again.

Thank you very much! This is "really" helpful and explains a lot!

> Let's look at runq_tickle(). It is, right now, called from:
>  (1) rt_vcpu_wake(), and it _does_belong_ in there (at least, the logic
>      it implements does), but it's called in a way that I really don't
>      like. In fact, there is a call to __repl_update(), followed by a
>      __runq_pick(), and I think both should go away;
>  (2) in rt_context_saved(), again, following a replenishment update and
>      a pick. Same as above, I'd love for the update+pick to be kick out
>      of here as well;
>  (3) with Dagaen patch, it's called from the replenishment handler, and
>      again I think the rescheduling request logic (i.e., what
>      runq_tickle() currently implements) is fine in there, although how
>      this is done needs to improve to support tickling multiple pCPUs.

Generally speaking, I agree that runq_tickle can go into these three
functions and the implementation will be a little different but more

> Let's look at these one by one.
> 1) rt_vcpu_wake():
>    You are waking up vCPU i, and you're doing it right until the call to
>    __runq_insert(), where you insert the vCPU on the runq. But then, why
>    do you call __repl_update()? Why, at every vCPU wakeup, you want to
>    scan the whole runqueue updating everyone's budget? I mean, I (not
>    completely, but, well...) understand why you've been doing this
>    _until_now_, but after Dagaen patch, we have the replenishment timer,
>    and replenishments will be done by the replenishment timer handler,
>    so you don't need to do this in here any longer!
>    So, basically, I would like Dagaen, as a part of this really cool
>    work he's doing, to remove completely the __repl_update() and
>    __runq_pick() logic from rt_vcpu_wake().  Then, it is ok to call
>    runq_tickle(), but you'll be calling it like this:
>      runq_tickle(ops, svc);
>   i.e., you'll go checking whether the vCPU that just woke up needs to
>   preempt any other one currently running on a pCPU, AND THAT'S IT!
>   We're dealing with a vCPU waking up, not with a wakeup+replenishment
>   +pick+reschedule... Let's keep functions focused on their own purpose,
>   as we were saying in the previous thread, ok? :-)
>   So, as far as this function is concerned, it is ok to call
>   runq_tickle(), and runq_tickle() is doing the right thing already, in
>   the way it's currently implemented, provided it's called _on_ the
>   waking vCPU, not on something else.

Agree! Thanks! :-)

> 2) rt_context_saved()
>    This looks a lot similar to (1). In fact, the replenishment update
>    and the pick should be removed from here completely. In fact, with a
>    similar reasoning to the above case, what is this function doing? It
>    is inserting a vCPU in the runq. It's actually rather similar to a
>    wake-up, the only difference being the wake-up is an actual insertion
>    (i.e., the vCPU --most times-- was not there before), while in this
>    case it is a re-insertion (i.e., the vCPU was there already, but we
>    needed to delay fiddling with it).
>    So, indeed, everything just said in (1) applies here too, and I'd
>    like this function to be restructured accordingly (kill
>    __repl_update(), kill __runq_pick(), kill snext, and call
>    runq_tickle(ops, svc)).
> Now, allow me to stop for a sec, and make some considerations. Dagaen,
> in his patch, is changing __runq_insert(), making it return the position
> in the runqueue where the vCPU has been inserted. This is actually a
> good thing, IMO.
> In different, and generally more scalable implementations of global EDF,
> e.g., where you have multiple runqueues, and hence multiple spinlocks,
> doing something like this is an absolute nightmare. More specifically,
> what is really really difficult (not to implement, but to implement in a
> way that does not defeat the scalability benefits of splitting
> runqueues) is to actually have a reliable and low overhead way of
> looking at other queues and/or pCPUs, to figure out what vCPUs are
> running there, and whether or not the one we are dealing with (e.g.,
> during a vCPU wakeup) should preempt any one of them. E.g., if you look
> at the Linux EDF implementation, which uses different queues, there is a
> whole data structure dedicate to that, implemented in its own source
> file!
> Here, everything is below the umbrella of the big, scheduler wide,
> spinlock... So, really, it's rather easy to keep track of that, and that
> is actually what runq_tickle() is doing (well, it does not keep track of
> it, it goes figuring that out), and what Dagaen is doing by returning
> the index is exactly another way to do that.
> That's actually what I was (badly) trying to hint at in my previous
> mail. Certainly, we don't want to duplicate code from runq_tickle() by
> cutting-&-pasting it around... But maybe we can change things in a way
> that they're even simpler than they are now (which is really what I
> hoped, when thinking to the outcomes of this work, and at using this
> design for it! :-D).

Agree up to this point.

> So, it looks to me that, as far as (1) and (2) are concerned, since we
> are "just" inserting a vCPU in the runq, if we have M pCPUs, and we know
> whether we inserted it within the first M spots, we already have what we
> want, or am I missing something? And if __runq_insert() now (with Dagaen
> patch) tells us this, well, we can simplify the tickling logic, can't
> we?

I think you might assume that the first M VCPUs  in the runq are the
current running VCPUs on the M pCPUs. Am I correct? (From what you
described in the following example, I think I'm correct. ;-) )

> With an example:
> We are waking up (or re-inserting, in rt_context_saved()) vCPU j. We
> have 6 pCPUs. __runq_insert() tells us that it put vCPU j at the 3rd
> place in the runq. This means vCPU j should be set to run as soon as
> possible. So, if vCPU j is 3rd in runq, either
>  (a) there are only 3 runnable vCPUs (i.e., if we are waking up j, there
>      were 2 of them, and j is the third; if we are in context_saved,
>      there already where 3, and j just got it's deadline postponed, or
>      someone else got its one replenished);
>  (b) there are more than 3 runnable vCPUs, i.e., there is at least a 4th
>      vCPU --say vCPU k-- in the runq, which was the 3rd before vCPU j
>      were woken (or re-inserted), but now became the 4th, because
>      deadline(j)<deadline(k).
> In case (a), there are for sure idle pCPUs, and we should tickle one of
> them.

I tell that you make the above assumption from here.

However, in the current implementation, runq does not hold the current
running VCPUs on the pCPUs. We remove the vcpu from runq in
rt_schedule() function. What you described above make perfect sense
"if" we decide to make runq hold the current running VCPUs.

Actually, after thinking about the example you described, I think we
can hold the current running VCPUs *and* the current idle pCPUs in the
scheduler-wide structure; In other words, we can have another runningq
(not runq) and a idle_pcpu list in the rt_private; Now all VCPUs are
stored in three queues: runningq, runq, and depletedq, in increasing
priority order.

When we make the tickle decision, we only need to scan the idle_pcpu
and then runningq to figure out which pCPU to tickle. All of other
design you describe still hold here, except that the position where a
VCPU is inserted into runq cannot directly give us which pCPU to
tickle. What do you think?

> In case (b) there may be idle pCPUs (and, if that's the case, we
> should tickle one of them, of course) or not. If not, we need to go
> figure out which pCPU to tickle, which is exactly what runq_tickle()
> does, but we at least know for sure that we want to tickle the pCPU
> where vCPU k runs, or others where vCPUs with deadline greater than vCPU
> k run.
> Does this make sense?

Yes, if we decide to hold the currently running VCPUs in
scheduler-wide structure: it can be runq or runningq.

> If yes, I think this means we can (and hence
> should) restructure runq_tickle() in order to make it behave as above,
> as I think it would be more quick and effective. It's probably necessary
> to turn the for_each_cpu() loop in there, in a runq scan, but one that
> only starts from a specific vCPU (the one that is ->next wrt the newly
> inserted vCPU, k, in the example above), and stopping at the M-eth vCPU
> in the queue. It's probably necessary to have not only the index, but
> also an rt_vcpu pointer, from __runq_insert() (or maybe not, if svc is
> already available, and we chan check and traverse its q_elem->next
> pointer), and to pass either (or both) to runq_tickle()... But this is
> better determined while implementing things.
> Honestly, I originally thought that runq_tickle() could be simplified
> even more, by looking at the svc->vc->processor field of the vCPU that
> our vCPU is inserted before. I see now, however, that this can't be
> done, as we, not always tickle exactly the pCPU of the vCPU that is now
> next to us (i.e., the pCPU where vCPU k runs, in the example), but we
> may need to tickle the one that is running the latest deadline vCPU,
> which may be another one.

Right. any pCPU might be tickled. It depends on which vCPU is running
on it and what the priority the vCPU has.

> Still, I think I gave enough material for an actual optimization. What
> do you think?

Yes. It is very clear.
The only thing is how we are going to "realize" your assumption. :-)
I'm not so sure if we do need the runningq  to hold the current
running VCPUs since we can still use "for_each_cpu()" to find all
pCPUs used in the scheduler and get the current running VCPU on it.
Adding a new field to hold all such running VCPUs might help speed up
the searching, but need more space. So there is a balance between
space and time.

What is your opinion?  :-)

> Oh, just out of the top of my head, I think that to implement the above
> more easily and effectively, it would make sense to track the idle and
> the already tickled pCPUs at a scheduler-wide level. If that reveals to
> be true, have a look at how Credit2 is doing it, it's not that
> difficult.


> Let's go back to cases above, and look at (3). Well, right now, I don't
> think I see much alternatives to putting the tickling in the loop, and
> (potentially) issuing one call to runq_tickle() for each vCPU that an
> execution instance of the replenishment handler actually replenishes.
> That may not look cheap, but:
>  - it won't be so common that we'll replenish tens of vCPUs in a single
>    instance. Ideally, each execution of the handler will replenish (and
>    hence insert) only one vCPU (and hence tickle only one pCPU), with
>    the loop being there to cover for non-ideal scenarios due to
>    overhead, etc;
>  - with the optimization to the tickling logic suggested above, it
>    should be a bit cheaper to issue multiple tickling calls;
> So, I think I've said everything I wanted to say, and I hope this is a
> bit more clear now.

Thank you again! It is very clear and I'm clear which part is unclear now. :-D

> Dagaen, Meng, any question?
> I really think that, if we manage to implement all this, code quality
> and performance would improve a lot. Oh, considering all the various and
> (logically) different changes that I've hinted at, the patch needs to
> become a patch series! :-D

Sure! Dagaen, what do you think?

> Thanks and Regards,

Thank you for your guidance and best regards,


Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.