
Re: [Xen-devel] [PATCH v4] xen: sched: convert RTDS from time to event driven model

On 2/3/2016 7:39 AM, Dario Faggioli wrote:
On Tue, 2016-02-02 at 13:09 -0500, Tianyang Chen wrote:
On 2/2/2016 10:08 AM, Dario Faggioli wrote:
On Sun, 2016-01-31 at 23:32 -0500, Tianyang Chen wrote:

+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *replq = rt_replq(ops);
+    struct list_head *iter;
+    ASSERT( spin_is_locked(&prv->lock) );
Is this really useful? Considering where this is called, I don't think
it is.

This basically comes from __q_insert().

I see. Well it's still not necessary, IMO, so don't add it. At some
point, we may want to remove it from __runq_insert() too.

The bottom line is: prv->lock is, currently, the runqueue lock (it's
the lock for everything! :-D). It is pretty obvious that you need the
runq lock to manipulate the runqueue.

It is not the runqueue that you are manipulating here, it is the
replenishment queue, but as a matter of fact, prv->lock serializes that
too. So, if anything, it would be more useful to add a comment
somewhere, noting (reinforcing, I should probably say) that this new
replenishment queue is also serialized by prv->lock, than to add an
assert like this one.


I have been searching for the definitions of vcpu_schedule_lock_irq()
and pcpu_schedule_lock_irq(). Do you know where they are defined? I did
a name search in the xen directory but still found nothing.

They are auto-generated. Look in xen/include/xen/sched-if.h.

+    ASSERT( !__vcpu_on_p(svc) );
+    list_for_each(iter, replq)
+    {
+        struct rt_vcpu * iter_svc = __p_elem(iter);
+        if ( svc->cur_deadline <= iter_svc->cur_deadline )
+                break;
+    }
+    list_add_tail(&svc->p_elem, iter);
This looks ok. The list_for_each()+list_add_tail() part would be
basically identical to what we have inside __runq_insert(),
wouldn't it?

It may make sense to define an _ordered_queue_insert() (or
_deadline_queue_insert(), since it's always the deadline that is
compared) utility function that we can then call in both cases.

@@ -449,6 +507,7 @@ rt_init(struct scheduler *ops)
+    INIT_LIST_HEAD(&prv->replq);

Is there a reason why you don't do at least the allocation of the timer
here? Because, to me, this looks like the best place for that.


@@ -473,6 +532,9 @@ rt_deinit(const struct scheduler *ops)
           _cpumask_scratch = NULL;
+    kill_timer(prv->repl_timer);

And here, you're freeing prv without having freed prv->repl_timer,
which is therefore leaked.

@@ -632,6 +699,17 @@ rt_vcpu_insert(const struct scheduler *ops,
struct vcpu *vc)
       vcpu_schedule_unlock_irq(lock, vc);

+    if( prv->repl_timer == NULL )
+    {
+        /* vc->processor has been set in schedule.c */
+        cpu = vc->processor;
+        repl_timer = xzalloc(struct timer);
+        prv->repl_timer = repl_timer;
+        init_timer(repl_timer, repl_handler, (void *)ops, cpu);
+    }

This belong to rt_init, IMO.

When a new pool is created, the corresponding scheduler is also
initialized. But there is no pcpu assigned to that pool.

Sure, and in fact, above, when commenting on rt_init() I said "Is there
a reason why you don't do at least the allocation of the timer here?"

But you're right, here I put it like all should belong to rt_init(), I
should have been more clear.

If init_timer()
happens in rt_init, how does it know which cpu to pick?
When move_domain() is called, there must be some pcpus in the
destination pool and then vcpu_insert happens.

Allocating memory for the rt_priv copy of this particular scheduler
instance definitely belongs in rt_init().

Then, in this case, we could do the allocation in rt_init() and do the
initialization later, perhaps here, or the first time that the timer
needs to be started, or when the first pCPU is assigned to this
scheduler (by being assigned to the cpupool this scheduler services).

I honestly prefer the latter. That is to say, in rt_alloc_pdata(), you
check whether prv->repl_timer is NULL, and you allocate and initialize
it for the pCPU being brought up. I also would put an explicit
'prv->repl_timer = NULL' in rt_init(), with a comment that proper
allocation and initialization will happen later, and the reason why
that is the case.

What do you think?

Yeah, it makes sense to put a comment in rt_init() to remind readers that the timer will be taken care of later, in alloc_pdata().
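As a rough, standalone illustration of the agreed split (rt_init() only records that the timer is not yet set up; the first alloc_pdata call does the actual work, because only then is a pCPU known), something like the following; the fake_* types and names are invented stand-ins, not Xen's API:

```c
#include <stdlib.h>
#include <assert.h>

/* Stand-ins for Xen's struct timer and struct rt_private. */
struct fake_timer { int cpu; };

struct fake_rt_private {
    struct fake_timer *repl_timer;
};

/* rt_init()-like step: no pCPU belongs to this scheduler yet, so only
 * record that the timer still needs allocation/initialisation. */
static void fake_rt_init(struct fake_rt_private *prv)
{
    /* Deferred: done when the first pCPU is assigned (alloc_pdata). */
    prv->repl_timer = NULL;
}

/* rt_alloc_pdata()-like step: the first pCPU being brought up gives
 * us a valid cpu to program the timer on. */
static int fake_rt_alloc_pdata(struct fake_rt_private *prv, int cpu)
{
    if ( prv->repl_timer == NULL )
    {
        prv->repl_timer = calloc(1, sizeof(*prv->repl_timer));
        if ( prv->repl_timer == NULL )
            return -1;  /* would be an -ENOMEM-style error in Xen */
        /* In Xen this is where init_timer() would run, with this cpu. */
        prv->repl_timer->cpu = cpu;
    }
    return 0;
}
```

Subsequent pCPUs being brought up find the timer already allocated and leave it alone.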

@@ -841,7 +881,6 @@ rt_schedule(const struct scheduler *ops,
now, bool_t tasklet_work_sched
       /* burn_budget would return for IDLE VCPU */
       burn_budget(ops, scurr, now);

-    __repl_update(ops, now);

__repl_update() going away is of course fine. But we need to enact
budget enforcement logic somewhere in here, don't we?

Sorry about that, I meant to add it in this version.

More on this later. But then I'm curious: with this patch applied, if
you set b=400/p=1000 for a domain, was the 40% utilization being
enforced properly?

@@ -1058,25 +1103,39 @@ rt_vcpu_wake(const struct scheduler *ops,
struct vcpu *vc)
       if ( unlikely(svc->flags & RTDS_scheduled) )
           set_bit(__RTDS_delayed_runq_add, &svc->flags);
-        return;
+        goto out;

In this case, I'm less sure. I think this should just return, as it did
before, but it may depend on how other things are done (like budget
enforcement), so it's hard to reason about it right now; consider it
when working on the next version.

OK. For the second case, I guess there is nothing preventing a vcpu
that has slept for a long time on the runq from being picked before other vcpus.

By "the second case" you mean this one about __RTDS_delayed_runq_add?
If yes, I'm not following what you mean with "a vcpu sleeps for a long
time on runq etc."

This is what we are talking about here. A vCPU is selected to be
preempted, during rt_schedule(), but it is still runnable, so we need
to put it back to the runqueue. You can't just do that (put it back in
the runqueue), because, in RTDS, like in Credit2, the runqueue is
shared between multiple pCPUs (see commit be9a0806e197f4fd3fa6).

So what do we do, we raise the *_delayed_runq_add flag and continue and
leave the vcpu alone. This happens, for instance, when there is a race
between scheduling out a vcpu that is going to sleep, and the vcpu in
question being woken back up already! It's not frequent, but we need to
take care of that. At some point (sooner rather than later, most likely)
the context switch will actually happen, and the vcpu that is waking up
is put in the runqueue.
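A toy, self-contained model of that race handling might look as follows. The field names merely mimic RTDS_scheduled and __RTDS_delayed_runq_add; the real code of course uses atomic bit operations under the proper scheduler locks:

```c
#include <assert.h>

/* Toy model of the wake-vs-context-switch race; all names invented. */
struct toy_vcpu {
    int being_scheduled_out;   /* models RTDS_scheduled          */
    int delayed_runq_add;      /* models __RTDS_delayed_runq_add */
    int runnable;
    int on_runq;
};

/* wake() path: if the vcpu's context is still being saved, do not
 * touch the shared runqueue; raise the flag instead and let the
 * context-switch tail do the insertion.  Returns 1 if the vcpu was
 * queued immediately. */
static int toy_wake(struct toy_vcpu *v)
{
    v->runnable = 1;
    if ( v->being_scheduled_out )
    {
        v->delayed_runq_add = 1;
        return 0;
    }
    v->on_runq = 1;
    return 1;
}

/* context_saved() path: the context switch completed, so the deferred
 * insertion can now happen, if it was requested and the vcpu is
 * still runnable. */
static void toy_context_saved(struct toy_vcpu *v)
{
    v->being_scheduled_out = 0;
    if ( v->delayed_runq_add && v->runnable )
        v->on_runq = 1;
    v->delayed_runq_add = 0;
}
```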

Yeah, I agree with this situation. But I meant the case where a vcpu wakes up while it's still on a queue. See below.

And also, repl_handler removes vcpus that are not runnable, and if they
are not added back here, where should we start tracking them again? This
goes back to whether it's necessary to remove a non-runnable vcpu at all.

I also don't get what you mean by "tracking" (here and in other places).
So "tracking" basically means adding a vcpu to the replenishment queue. "Stop tracking" means removing a vcpu from the replenishment queue.

Anyway, what I meant when I said that what to do in the "delayed add"
case depends on other things is, for instance, this: the vcpu is waking
up right now, but it is going to be inserted in the runqueue later.
When (as per "where in the code", "in what function") do you want to
program the timer, here in rt_vcpu_wake(), or changing
__replq_insert(), as said somewhere else, or in rt_context_saved(),
where the vcpu is actually added to the runq? These kinds of things...

All this being said, taking a bit of a step back, and considering, not
only the replenishment event and/or timer (re)programming issue, but
also the deadline that needs updating, I think you actually need to
find a way to check for both, in this patch.

In fact, if we aim at simplifying rt_context_saved() too --and we
certainly are-- in this case that a vcpu is actually being woken up,
but it's not being put back in the runqueue until the context switch,
when is it that we check whether its deadline has passed, so that its
scheduling parameters get refreshed?

So, big recap, I think a possible variant of rt_vcpu_wake() would look
as follows:

  - if vcpu is running or in a q     ==> /* nothing. */

  - esle if vcpu is delayed_runq_add ==> if (now >= deadline?) {
                                           if ( __vcpu_on_p(svc) )

  - else                             ==> if (now >= deadline?) {
                                           if ( __vcpu_on_p(svc) )



This would mean something like this:

static void
rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     struct rt_vcpu * const svc = rt_vcpu(vc);
     s_time_t now = NOW();
     struct rt_private *prv = rt_priv(ops);

     BUG_ON( is_idle_vcpu(vc) );

     if ( unlikely(curr_on_cpu(vc->processor) == vc) )

For this situation, a vcpu is waking up on a pcpu. It could have been taken off the replq inside the timer handler at the end. So, if we decide not to replenish any sleeping/non-runnable vcpus any further, we should probably add it back to the replq here, right?

     /* on RunQ/DepletedQ, just update info is ok */
     if ( unlikely(__vcpu_on_q(svc)) )

For this situation, a vcpu could be going through the following phases:

on runq --> sleep --> (for a long time) --> wake up

When it wakes up, it could sit at the front of the runq because cur_deadline hasn't been updated. It will inherently have high priority (because of EDF) and be picked over other vcpus, right? Do we want to update its deadline and budget here and re-insert it accordingly?

     if ( likely(vcpu_runnable(vc)) )

     if ( now >= svc->cur_deadline)
         rt_update_deadline(now, svc);


For here, consider this scenario:

on pcpu (originally on replq) --> sleep --> context_saved() but still asleep --> wake()

Timer handler isn't triggered before it awakes. Then this vcpu will be added to replq again while it's already there. I'm not 100% positive if this situation is possible though. But judging from rt_context_saved(), it checks if a vcpu is runnable before adding back to the runq.

     /* If context hasn't been saved for this vcpu yet, we can't put it on
      * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
      * put on the Runqueue/DepletedQ after the context has been saved.
      */
     if ( unlikely(svc->flags & RTDS_scheduled) )
         set_bit(__RTDS_delayed_runq_add, &svc->flags);

I agree with this. Since replenishment happens above already, it can be inserted back to runq correctly in rt_context_saved().

     /* insert svc to runq/depletedq because svc is not in queue now */
     __runq_insert(ops, svc);
     runq_tickle(ops, snext);


And, at this point, I'm starting thinking again that we want to call
runq_tickle() in rt_context_saved(), at least in case
__RTDS_delayed_runq_add was set as, if we don't, we're missing tickling
for the vcpu being inserted because of that.

Of course, all the above, taken with a little bit of salt... i.e., I do
not expect to have got all the details right, especially in the code,
but it should at least be effective in showing the idea.


So, just to wrap up my thoughts on budget replenishment inside wake(): a vcpu can be in different states (on a pcpu, on a queue, etc.) when it wakes up.

1) Wakes up on a pcpu:
Do nothing, as it may have some remaining budget, and we just ignore its cur_deadline when it wakes up. I can't find a corresponding pattern in DS, but this can be treated like a server being pushed back because of the nap.

2) Wakes up on a queue:
Update its deadline (refill the budget) and re-insert it into the runq, for the reason stated above.

3) Wakes up in the middle of a context switch:
Update its deadline (refill the budget) so it can be re-inserted into the runq in rt_context_saved().

4) Wakes up nowhere (sleep on pcpu --> context_saved() --> wake()):
Replenish it and re-insert it back into the runq.

5) Other? I can't think of any for now.
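If it helps pin the proposal down, cases 1) to 4) above can be restated as a small decision function; everything here (enum names included) is hypothetical, not code from the patch:

```c
#include <assert.h>

/* Hypothetical restatement of wake-up cases 1)-4). */
enum wake_state {
    WOKE_ON_PCPU,            /* 1) currently running on a pcpu     */
    WOKE_ON_QUEUE,           /* 2) sitting on the runq/depletedq   */
    WOKE_MID_CTX_SWITCH,     /* 3) context still being saved       */
    WOKE_OFF_ANY_QUEUE       /* 4) slept on pcpu, context saved    */
};

enum wake_action {
    DO_NOTHING,              /* keep remaining budget               */
    REPLENISH_AND_REQUEUE,   /* refresh, re-insert into the runq    */
    REPLENISH_DELAYED_ADD,   /* refresh; rt_context_saved() inserts */
    REPLENISH_AND_INSERT     /* refresh and insert into the runq    */
};

static enum wake_action wake_decision(enum wake_state s)
{
    switch ( s )
    {
    case WOKE_ON_PCPU:        return DO_NOTHING;
    case WOKE_ON_QUEUE:       return REPLENISH_AND_REQUEUE;
    case WOKE_MID_CTX_SWITCH: return REPLENISH_DELAYED_ADD;
    case WOKE_OFF_ANY_QUEUE:  return REPLENISH_AND_INSERT;
    }
    return DO_NOTHING;
}
```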

As for replenishment queue manipulation, we should always try and add it back to replq because we don't know if the timer handler has taken it off or not.

The timer handler and wake() are coupled, in the sense that they both manipulate the replq.

+            /* when the replenishment happens
+             * svc is either on a pcpu or on
+             * runq/depletedq
+             */
+            if( __vcpu_on_q(svc) )
+            {
+                /* put back to runq */
+                __q_remove(svc);
+                __runq_insert(ops, svc);
+                runq_tickle(ops, svc);
+            }
Aha! So, it looks like the budget enforcement logic ended up here? If
so, no: I think that doing it in rt_schedule(), as we agreed, would
make things simpler and better looking, both here and there.

This is the replenishment timer handler, and it really should do one
thing: handle replenishment events. That's it! ;-P

Oh, I might be misunderstanding what should be put in rt_schedule(). Is
it specifically for handling a vcpu that shouldn't be taken off a pcpu?
Why are you referring to this piece of code as budget enforcement?

Isn't it necessary to modify the runq here after a replenishment
has happened? If the runq goes out of order here, will rt_schedule() be
sorting it?

Ops, yes, you are right, this is not budget enforcement. So, budget
enforcement looks still to be missing from this patch (and that is why
I asked whether it is actually working).

And this chunk of code (and perhaps this whole function) is probably
ok, it's just a bit hard to understand what it does...

So, burn_budget() is still there, right? I'm not sure what you are referring to as budget enforcement. Correct me if I'm wrong: budget enforcement means that each vcpu uses (no more than) its assigned/remaining budget. Is there any specific situation I'm missing?

So, do you mind if we take a step back again, and analyze the
situation. When the timer fires, and this handler is executed and a
replenishment event taken off the replenishment queue, the following
situations are possible:

  1) the replenishment is for a running vcpu
  2) the replenishment is for a runnable vcpu in the runq
  3) the replenishment is for a runnable vcpu in the depletedq
  4) the replenishment is for a sleeping vcpu

So, 4) is never going to happen, right? As we're stopping the timer
when a vcpu blocks. If we go for this, let's put an ASSERT() and a
comment as a guard for that, in case we do the optimization in this very
patch. If we leave it for another patch, we need to handle this case, I
think.
In all 1), 2) and 3) cases, we need to remove the replenishment event
from the replenishment queue, so just (in the code) do it upfront.

I don't think we need to take them off the replenishment queue? See the last couple of paragraphs.

What other actions need to happen in each case, 1), 2) and 3)? Can you
summarize them here, so we can decide how to handle each situation?

Here we go.
1) the replenishment is for a running vcpu
Replenish the vcpu. Keep it on the replenishment queue. Tickle it as it may have a later deadline.

2) the replenishment is for a runnable vcpu in the runq
Replenish the vcpu. Keep it on the replenishment queue. (re-)insert it into runq. Tickle it.

3) the replenishment is for a runnable vcpu in the depletedq
Same as 2)

4) the replenishment is for a sleeping vcpu
Take it off the replenishment queue and don't do the replenishment. As a matter of fact, this patch does the replenishment first and then takes it off.
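The per-event decision for the four cases could be modeled, very roughly, like this; struct repl_vcpu and toy_replenish() are invented for illustration, and locking, tickling and the actual list manipulation are deliberately left out:

```c
#include <assert.h>

/* Invented toy of the per-event decision; the real handler works on
 * struct rt_vcpu and the prv->replq list, under prv->lock. */
struct repl_vcpu {
    long cur_deadline, period;
    long budget, max_budget;
    int runnable;      /* cases 1)-3) vs. case 4) */
    int on_replq;
};

/* Process one due replenishment event at time 'now'.  Returns 1 if
 * the vcpu stays on the replenishment queue (its next event is then
 * implicitly at the new cur_deadline), 0 if it is dropped (case 4). */
static int toy_replenish(struct repl_vcpu *v, long now)
{
    if ( !v->runnable )
    {
        /* Case 4): sleeping vcpu, stop tracking it. */
        v->on_replq = 0;
        return 0;
    }

    /* Cases 1)-3): push the deadline past 'now' and refill the
     * budget; tickling and re-inserting into the runq happen outside
     * this sketch. */
    while ( v->cur_deadline <= now )
        v->cur_deadline += v->period;
    v->budget = v->max_budget;
    return 1;
}
```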

I'd say that, in both case 2) and case 3), the vcpu needs to be taken
out of the queue where they are, and (re-)inserted in the runqueue, and
then we need to tickle... Is that correct?


I'd also say that, in case 1), we also need to tickle, because the
deadline may now have become bigger than that of some other vcpu in
the runqueue?

correct. I missed this case^.

Also, under what conditions we need, as soon as replenishment for vcpu
X happened, to queue another replenishment for the same vcpu, at its
next deadline? Always, I would say (unless the vcpu is sleeping,
because of the optimization you introduced)?

So this is kinda confusing. If, in cases 1), 2) and 3), vcpus are never taken off the replenishment queue, the series of replenishment events is automatically kept up to date after each replenishment. There is no need, or at least one less reason, to keep track of when to queue the next replenishment for each vcpu.

The side effect of this is that all vcpus could go to sleep right after the timer was programmed at the end of the handler. If they don't wake up before the timer fires next, the handler is called to replenish nobody. In that case, the timer will not be programmed again in the handler; wake() kicks in and programs it instead. The downside is that we always need to check whether an awaking vcpu is on the replq or not; if not, add it back (hence the notion of starting to track its replenishment events again). When adding it back, check whether the timer needs to be reprogrammed.

I really like the idea of the replenishment queue, compared to the running queue, in that the only places where we need to manipulate it are wake() and repl_handler().

