
Re: [Xen-devel] [PATCH v4] xen: sched: convert RTDS from time to event driven model



On Tue, 2016-02-02 at 13:09 -0500, Tianyang Chen wrote:
> On 2/2/2016 10:08 AM, Dario Faggioli wrote:
> > On Sun, 2016-01-31 at 23:32 -0500, Tianyang Chen wrote:
> > >
> > > +    struct rt_private *prv = rt_priv(ops);
> > > +    struct list_head *replq = rt_replq(ops);
> > > +    struct list_head *iter;
> > > +
> > > +    ASSERT( spin_is_locked(&prv->lock) );
> > > +
> > Is this really useful? Considering where this is called, I don't
> > think
> > it is.
> > 
> 
> This basically comes from __q_insert(). 
>
I see. Well it's still not necessary, IMO, so don't add it. At some
point, we may want to remove it from __runq_insert() too.

The bottom line is: prv->lock is, currently, the runqueue lock (it's
the lock for everything! :-D). It is pretty obvious that you need the
runq lock to manipulate the runqueue.

It is not the runqueue that you are manipulating here, it is the
replenishment queue, but as a matter of fact, prv->lock serializes that
too. So, if anything, rather than an assert like this one, it would be
more useful to add a comment somewhere, noting (reinforcing, I should
probably say) the point that this new replenishment queue is also
serialized by prv->lock.

> I have been searching for the
> definitions of vcpu_schedule_lock_irq() and pcpu_schedule_lock_irq().
> Do
> you know where they are defined? I did some name search in the
> entire
> xen directory but still nothing.
> 
They are auto-generated. Look in xen/include/xen/sched-if.h.

> > > +    ASSERT( !__vcpu_on_p(svc) );
> > > +
> > > +    list_for_each(iter, replq)
> > > +    {
> > > +        struct rt_vcpu * iter_svc = __p_elem(iter);
> > > +        if ( svc->cur_deadline <= iter_svc->cur_deadline )
> > > +                break;
> > > +    }
> > > +
> > > +    list_add_tail(&svc->p_elem, iter);
> > > +}
> > This looks ok. The list_for_each()+list_add_tail() part would be
> > basically identical to what we have inside __runq_insert(),
> > though,
> > wouldn't it?
> > 
> > It may make sense to define an _ordered_queue_insert() (or
> > _deadline_queue_insert(), since it's always the deadline that is
> > compared) utility function that we can then call in both cases.
> > 
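To make the idea concrete, here is a rough, self-contained sketch of such a common helper. The name deadline_queue_insert() follows the suggestion above, but the exact signature is an assumption, and a minimal list implementation stands in for Xen's xen/list.h:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for Xen's struct list_head (xen/list.h). */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

/* Insert n immediately before pos (list_add_tail() semantics). */
static void list_add_tail(struct list_head *n, struct list_head *pos)
{
    n->prev = pos->prev;
    n->next = pos;
    pos->prev->next = n;
    pos->prev = n;
}

/* Toy rt_vcpu: only the fields this sketch needs. */
struct rt_vcpu {
    long cur_deadline;
    struct list_head q_elem;   /* hook into runq or replq */
};

#define q_entry(ptr) \
    ((struct rt_vcpu *)((char *)(ptr) - offsetof(struct rt_vcpu, q_elem)))

/*
 * Walk the queue until we find an element with a later deadline, then
 * insert before it (i.e. keep the queue sorted by cur_deadline). Both
 * __runq_insert() and __replq_insert() could call this.
 */
static void deadline_queue_insert(struct list_head *queue, struct rt_vcpu *svc)
{
    struct list_head *iter;

    for ( iter = queue->next; iter != queue; iter = iter->next )
        if ( svc->cur_deadline <= q_entry(iter)->cur_deadline )
            break;

    list_add_tail(&svc->q_elem, iter);
}
```

In the actual patch, of course, this would use the real list_head primitives and the q_elem/p_elem hooks of struct rt_vcpu.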
> > > @@ -449,6 +507,7 @@ rt_init(struct scheduler *ops)
> > >      INIT_LIST_HEAD(&prv->sdom);
> > >      INIT_LIST_HEAD(&prv->runq);
> > >      INIT_LIST_HEAD(&prv->depletedq);
> > > +    INIT_LIST_HEAD(&prv->replq);
> > > 
> > Is there a reason why you don't do at least the allocation of the
> > timer
> > here? Because, to me, this looks the best place for that.
> > 
> > >      cpumask_clear(&prv->tickled);
> > > 
> > > @@ -473,6 +532,9 @@ rt_deinit(const struct scheduler *ops)
> > >          xfree(_cpumask_scratch);
> > >          _cpumask_scratch = NULL;
> > >      }
> > > +
> > > +    kill_timer(prv->repl_timer);
> > > +
> > >      xfree(prv);
> > > 
> > And here, you're freeing prv without having freed prv->repl_timer,
> > which is
> > therefore leaked.
> > 
> > > @@ -632,6 +699,17 @@ rt_vcpu_insert(const struct scheduler *ops,
> > > struct vcpu *vc)
> > >      vcpu_schedule_unlock_irq(lock, vc);
> > > 
> > >      SCHED_STAT_CRANK(vcpu_insert);
> > > +
> > > +    if( prv->repl_timer == NULL )
> > > +    {
> > > +        /* vc->processor has been set in schedule.c */
> > > +        cpu = vc->processor;
> > > +
> > > +        repl_timer = xzalloc(struct timer);
> > > +
> > > +        prv->repl_timer = repl_timer;
> > > +        init_timer(repl_timer, repl_handler, (void *)ops, cpu);
> > > +    }
> > >  }
> > > 
> > This belongs in rt_init(), IMO.
> > 
> 
> When a new pool is created, the corresponding scheduler is also
> initialized. But there is no pcpu assigned to that pool. 
>
Sure, and in fact, above, when commenting on rt_init() I said "Is there
a reason why you don't do at least the allocation of the timer here?"

But you're right, here I put it like all should belong to rt_init(), I
should have been more clear.

> If init_timer()
> happens in rt_init, how does it know which cpu to pick?
> When move_domain() is called, there must be some pcpus in the
> destination pool and then vcpu_insert happens.
> 
Allocating memory for the rt_priv copy of this particular scheduler
instance definitely belongs in rt_init().

Then, in this case, we could do the allocation in rt_init() and do the
initialization later, perhaps here, or the first time that the timer
needs to be started, or when the first pCPU is assigned to this
scheduler (by being assigned to the cpupool this scheduler services).

I honestly prefer the latter. That is to say, in rt_alloc_pdata(), you
check whether prv->repl_timer is NULL, and you allocate and initialize
it for the pCPU being brought up. I also would put an explicit
'prv->repl_timer = NULL' in rt_init(), with a comment that proper
allocation and initialization will happen later, and the reason why
that is the case.

What do you think?
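Just to illustrate the split I have in mind, here is a compilable toy sketch of the pattern. The stub types stand in for Xen's struct timer, xzalloc() and init_timer(); function shapes here are simplified assumptions, not the actual RTDS interfaces:

```c
#include <assert.h>
#include <stdlib.h>

/* Stubs standing in for Xen types/APIs, just to show the pattern. */
struct timer { int cpu; };
struct rt_private { struct timer *repl_timer; };

static struct timer *xzalloc_timer(void)
{
    return calloc(1, sizeof(struct timer));   /* xzalloc(struct timer) */
}

/* rt_init(): allocate the scheduler private data, but NOT the timer... */
static struct rt_private *rt_init(void)
{
    struct rt_private *prv = calloc(1, sizeof(*prv));

    /*
     * ...because no pCPU belongs to this scheduler yet, so we don't
     * know where the timer should run. Proper allocation and
     * initialization happen when the first pCPU is brought up, in
     * rt_alloc_pdata().
     */
    prv->repl_timer = NULL;
    return prv;
}

/* rt_alloc_pdata(): lazily set up the timer on the first pCPU. */
static void rt_alloc_pdata(struct rt_private *prv, int cpu)
{
    if ( prv->repl_timer == NULL )
    {
        prv->repl_timer = xzalloc_timer();
        prv->repl_timer->cpu = cpu;   /* init_timer(..., cpu) in Xen */
    }
}
```

The timer thus gets bound to the first pCPU assigned to the cpupool this scheduler services, and later pCPUs leave it alone.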

> > > @@ -841,7 +881,6 @@ rt_schedule(const struct scheduler *ops,
> > > s_time_t
> > > now, bool_t tasklet_work_sched
> > >      /* burn_budget would return for IDLE VCPU */
> > >      burn_budget(ops, scurr, now);
> > > 
> > > -    __repl_update(ops, now);
> > > 
> > __repl_update() going away is of course fine. But we need to enact
> > the
> > budget enforcement logic somewhere in here, don't we?
> > 
> 
> Sorry about that, I meant to add it in this version.
> 
More on this later. But then I'm curious: with this patch applied, if
you set b=400/p=1000 for a domain, was the 40% utilization being
enforced properly?

> > > @@ -1058,25 +1103,39 @@ rt_vcpu_wake(const struct scheduler *ops,
> > > struct vcpu *vc)
> > >      if ( unlikely(svc->flags & RTDS_scheduled) )
> > >      {
> > >          set_bit(__RTDS_delayed_runq_add, &svc->flags);
> > > -        return;
> > > +        goto out;
> > >      }
> > > 
> > In this case, I'm less sure. I think this should just return as
> > well,
> > but it may depend on how other stuff are done (like budget
> > enforcement), so it's hard to think at it right now, but consider
> > this
> > when working on next version.
> > 
> 
> OK. For the second case, I guess there is nothing preventing a vcpu
> that
> sleeps for a long time on runq to be picked before other vcpus.
>
By "the second case" you mean this one about __RTDS_delayed_runq_add?
If yes, I'm not following what you mean with "a vcpu sleeps for a long
time on runq etc."

This is what we are talking about here. A vCPU is selected to be
preempted, during rt_schedule(), but it is still runnable, so we need
to put it back to the runqueue. You can't just do that (put it back in
the runqueue), because, in RTDS, like in Credit2, the runqueue is
shared between multiple pCPUs (see commit be9a0806e197f4fd3fa6).

So what we do is raise the *_delayed_runq_add flag, continue, and
leave the vcpu alone. This happens, for instance, when there is a race
between scheduling out a vcpu that is going to sleep, and the vcpu in
question being woken back up already! It's not frequent, but we need to
take care of that. At some point (sooner rather than later, most
likely) the context switch will actually happen, and the vcpu that is
waking up is put in the runqueue.

> And also, repl_handler removes vcpus that are not runnable and if
> they
> are not added back here, where should we start tracking them? This
> goes
> back to if it's necessary to remove a un-runnable vcpu.
> 
I also don't get what you mean by "tracking" (here and in other
places).

Anyway, what I meant when I said that what to do in the "delayed add"
case depends on other things is, for instance, this: the vcpu is waking
up right now, but it is going to be inserted in the runqueue later.
When (as in "where in the code", "in what function") do you want to
program the timer: here in rt_vcpu_wake(), or in __replq_insert(), as
said somewhere else, or in rt_context_saved(), where the vcpu is
actually added to the runq? This kind of thing...


All this being said, taking a bit of a step back, and considering, not
only the replenishment event and/or timer (re)programming issue, but
also the deadline that needs updating, I think you actually need to
find a way to check for both, in this patch.

In fact, if we aim at simplifying rt_context_saved() too --and we
certainly are-- in this case where a vcpu is actually being woken up,
but is not being put back in the runqueue until context switch, when is
it that we check whether its deadline has passed, and hence refresh its
scheduling parameters?


So, big recap, I think a possible variant of rt_vcpu_wake() would look
as follows:

 - if vcpu is running or in a q     ==> /* nothing. */

 - else if vcpu is delayed_runq_add ==> if (now >= deadline?) {
                                            update_deadline();
                                            if ( __vcpu_on_p(svc) )
                                                __replq_remove();
                                        }
                                        __replq_insert();

 - else                             ==> if (now >= deadline?) {
                                            update_deadline();
                                            if ( __vcpu_on_p(svc) )
                                                __replq_remove();
                                        }
                                        __runq_insert();
                                        __replq_insert();
                                        runq_tickle();
This would mean something like this:

static void
rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
{
    struct rt_vcpu * const svc = rt_vcpu(vc);
    s_time_t now = NOW();
    struct rt_private *prv = rt_priv(ops);

    BUG_ON( is_idle_vcpu(vc) );

    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
    {
        SCHED_STAT_CRANK(vcpu_wake_running);
        return;
    }

    /* on RunQ/DepletedQ, just update info is ok */
    if ( unlikely(__vcpu_on_q(svc)) )
    {
        SCHED_STAT_CRANK(vcpu_wake_onrunq);
        return;
    }

    if ( likely(vcpu_runnable(vc)) )
        SCHED_STAT_CRANK(vcpu_wake_runnable);
    else
        SCHED_STAT_CRANK(vcpu_wake_not_runnable);

    if ( now >= svc->cur_deadline )
    {
        rt_update_deadline(now, svc);
        __replq_remove(svc);
    }

    __replq_insert(svc);

    /*
     * If context hasn't been saved for this vcpu yet, we can't put it on
     * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
     * put on the Runqueue/DepletedQ after the context has been saved.
     */
    if ( unlikely(svc->flags & RTDS_scheduled) )
    {
        set_bit(__RTDS_delayed_runq_add, &svc->flags);
        return;
    }

    /* insert svc to runq/depletedq because svc is not in queue now */
    __runq_insert(ops, svc);
    runq_tickle(ops, svc);
}

And, at this point, I'm starting thinking again that we want to call
runq_tickle() in rt_context_saved(), at least in the case where
__RTDS_delayed_runq_add was set, as, if we don't, we miss tickling
for the vcpu being inserted because of that.
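In other words, the tail of rt_context_saved() could look something along these lines. This is a toy, compilable sketch only: the flag name follows the patch, but the stub helpers and the tickle-on-delayed-add part are just the suggestion above, not existing code:

```c
#include <assert.h>

#define RTDS_delayed_runq_add  (1 << 0)

struct rt_vcpu { unsigned flags; int on_q; };

static int tickles;  /* counts runq_tickle() calls, for this sketch */

static void runq_insert(struct rt_vcpu *svc) { svc->on_q = 1; }
static void runq_tickle(struct rt_vcpu *svc) { (void)svc; tickles++; }
static int vcpu_runnable(struct rt_vcpu *svc) { (void)svc; return 1; }

/*
 * Sketch of the tail of rt_context_saved(): if the wake-up was
 * deferred (delayed_runq_add), do the queue insertion now, and also
 * tickle, since rt_vcpu_wake() skipped that.
 */
static void rt_context_saved(struct rt_vcpu *svc)
{
    if ( (svc->flags & RTDS_delayed_runq_add) && vcpu_runnable(svc) )
    {
        svc->flags &= ~RTDS_delayed_runq_add;
        runq_insert(svc);
        runq_tickle(svc);   /* don't miss the tickle for this vcpu */
    }
}
```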

Of course, all the above, taken with a little bit of salt... i.e., I do
not expect to have got all the details right, especially in the code,
but it should at least be effective in showing the idea.

Thoughts?


> > > +            /* when the replenishment happens
> > > +             * svc is either on a pcpu or on
> > > +             * runq/depletedq
> > > +             */
> > > +            if( __vcpu_on_q(svc) )
> > > +            {
> > > +                /* put back to runq */
> > > +                __q_remove(svc);
> > > +                __runq_insert(ops, svc);
> > > +                runq_tickle(ops, svc);
> > > +            }
> > > +
> > Aha! So, it looks the budget enforcement logic ended up here?
> > Mmm...
> > no, I think that doing it in rt_schedule(), as we agreed, would
> > make
> > things simpler and better looking, both here and there.
> > 
> > This is the replenishment timer handler, and it really should do
> > one
> > thing: handle replenishment events. That's it! ;-P
> > 
> 
> Oh I might be misunderstanding what should be put in rt_schedule().
> Was
> it specifically for handling a vcpu that shouldn't be taken off a
> core?
> Why are you referring this piece of code as budget enforcement?
> 
> Isn't it necessary to modify the runq here after budget
> replenishment
> has happened? If runq goes out of order here, will rt_schedule() be
> sorting it?
> 
Oops, yes, you are right, this is not budget enforcement. So, budget
enforcement looks still to be missing from this patch (and that is why
I asked whether it is actually working).

And this chunk of code (and perhaps this whole function) is probably
ok, it's just a bit hard to understand what it does...


So, do you mind if we take a step back again, and analyze the
situation? When the timer fires, and this handler is executed and a
replenishment event taken off the replenishment queue, the following
situations are possible:

 1) the replenishment is for a running vcpu
 2) the replenishment is for a runnable vcpu in the runq
 3) the replenishment is for a runnable vcpu in the depletedq
 4) the replenishment is for a sleeping vcpu

So, 4) is never going to happen, right? As we're stopping the timer
when a vcpu blocks. If we go for this, let's put an ASSERT() and a
comment as a guard for that, at least if we do the optimization in
this very patch. If we leave it for another patch, we need to handle
this case, I guess.

In all 1), 2) and 3) cases, we need to remove the replenishment event
from the replenishment queue, so just (in the code) do it upfront.

What other actions need to happen in each case, 1), 2) and 3)? Can you
summarize them in here, so we can decide how to handle each situation?

I'd say that, in both case 2) and case 3), the vcpu needs to be taken
out of the queue where it is, and (re-)inserted in the runqueue, and
then we need to tickle... Is that correct?

I'd also say that, in case 1), we also need to tickle, because the
deadline may now have become bigger than that of some other vcpu that
is in the runqueue?

Also, under what conditions we need, as soon as replenishment for vcpu
X happened, to queue another replenishment for the same vcpu, at its
next deadline? Always, I would say (unless the vcpu is sleeping,
because of the optimization you introduced)?
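To make the case analysis concrete, here is a compilable toy simulation of one replenishment event, following the 1)/2)/3) breakdown above, with an assert guarding case 4). All the types and helpers are stand-ins, and the exact replenishment arithmetic is the usual deferrable-server style assumption, not the literal patch code:

```c
#include <assert.h>

enum vcpu_state { VCPU_RUNNING, VCPU_ON_RUNQ, VCPU_ON_DEPLETEDQ, VCPU_SLEEPING };

struct rt_vcpu {
    enum vcpu_state state;
    long cur_deadline, period, budget_left, budget;
};

static int tickles;  /* counts runq_tickle() calls, for this sketch */

static void runq_tickle(struct rt_vcpu *svc) { (void)svc; tickles++; }

/*
 * Handle one replenishment event, per the case analysis:
 * 1) running, 2) on runq, 3) on depletedq; case 4) (sleeping) is
 * assumed impossible because the timer was stopped at block time.
 */
static void do_replenishment(struct rt_vcpu *svc)
{
    assert(svc->state != VCPU_SLEEPING);  /* the ASSERT() guard for 4) */

    /* The replenishment itself: new budget, deadline pushed forward. */
    svc->budget_left = svc->budget;
    svc->cur_deadline += svc->period;

    /* Cases 2) and 3): re-insert in the runq, sorted by new deadline. */
    if ( svc->state == VCPU_ON_RUNQ || svc->state == VCPU_ON_DEPLETEDQ )
        svc->state = VCPU_ON_RUNQ;

    /*
     * In all of 1), 2), 3) we tickle: even for a running vcpu, its
     * deadline just grew, so a queued vcpu may now deserve its pCPU.
     */
    runq_tickle(svc);
}
```

After each event, a new replenishment for the same vcpu at its next deadline would be queued, as discussed.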

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

