
Re: [Xen-devel] [PATCH V2 1/1] Improved RTDS scheduler



I have removed some of the Ccs, as we discussed previously, so they won't be bothered.

On 1/25/2016 4:00 AM, Dario Faggioli wrote:
On Thu, 2015-12-31 at 05:20 -0500, Tianyang Chen wrote:

@@ -147,6 +148,16 @@ static unsigned int nr_rt_ops;
   * Global lock is referenced by schedule_data.schedule_lock from all
   * physical cpus. It can be grabbed via vcpu_schedule_lock_irq()
   */
+
+/* dedicated timer for replenishment */
+static struct timer repl_timer;
+
So, there's always only one timer... Even if we have multiple cpupool
with RTDS as their scheduler, they share the replenishment timer? I
think it makes more sense to make this per-scheduler.

Yeah, I totally ignored the cpupool case. It looks like when a cpupool is created, it copies the scheduler struct and calls rt_init(), where a private field is initialized. So I assume the timer should be put inside the scheduler's private struct? Now that I think about it, the timer is hard-coded to run on cpu0. If there are lots of cpupools but the replenishment can only be done on the same pcpu, would that be a problem? Should we keep track of all scheduler instances (nr_rt_ops counts how many) and just put the timers on different pcpus?
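
A minimal sketch of the per-scheduler timer idea, assuming the timer lives in the private struct and is bound to a pcpu chosen at init time. Every name here (mock_timer, mock_rt_private, mock_init_timer, mock_rt_init, mock_repl_handler) is a hypothetical stand-in and not Xen's real types or API; which pcpu to pass in is left open, as in the question above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t s_time_t;

/* Stand-in for a hypervisor timer; only the fields this sketch needs. */
struct mock_timer {
    void (*fn)(void *);   /* handler to run on expiry */
    void *data;           /* argument passed to the handler */
    unsigned int cpu;     /* pcpu the timer is bound to */
    s_time_t expires;     /* absolute expiry time, valid when active */
    int active;           /* nonzero once the timer has been armed */
};

/* Hypothetical private data: one replenishment timer per scheduler instance. */
struct mock_rt_private {
    struct mock_timer repl_timer;
};

static void mock_init_timer(struct mock_timer *t, void (*fn)(void *),
                            void *data, unsigned int cpu)
{
    t->fn = fn;
    t->data = data;
    t->cpu = cpu;
    t->expires = 0;
    t->active = 0;        /* initialised, but not armed yet */
}

/* Placeholder handler; the real one would do the replenishments. */
static void mock_repl_handler(void *data)
{
    (void)data;
}

/* Meant to run once per scheduler instance, i.e. once per cpupool. */
static void mock_rt_init(struct mock_rt_private *prv, unsigned int cpu)
{
    assert(prv != NULL);
    mock_init_timer(&prv->repl_timer, mock_repl_handler, prv, cpu);
}
```

With this layout, two pools each get their own timer and handler argument, so a handler only ever touches the queues of the instance that owns it.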

+/* controls when to first start the timer*/
+static int timer_started;
+
I don't like this, and I don't think we need it. In fact, you removed
it yourself from v3, AFAICT.

@@ -635,6 +652,13 @@ rt_vcpu_insert(const struct scheduler *ops,
struct vcpu *vc)

      /* add rt_vcpu svc to scheduler-specific vcpu list of the dom */
      list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);
+
+    if(!timer_started)
+    {
+        /* the first vcpu starts the timer for the first time*/
+        timer_started = 1;
+        set_timer(&repl_timer,svc->cur_deadline);
+    }
  }

This also seems to be gone in v3, which is good. In fact, it uses
timer_started, which I already said I didn't like.

About the actual startup of the timer (no matter whether for first time
or not). Here, you were doing it in _vcpu_insert() and not in
_vcpu_wake(); in v3 you're doing it in _vcpu_wake() and not in
_runq_insert()... Which one is the proper way?


Correct me if I'm wrong, but at the beginning of the boot process, all vcpus are put to sleep (not runnable) after insertion. Therefore, the timer should start when the first vcpu wakes up. I think doing it in wake(), as in v3, is correct.

@@ -792,44 +816,6 @@ __runq_pick(const struct scheduler *ops, const
cpumask_t *mask)
  }

  /*
- * Update vcpu's budget and
- * sort runq by insert the modifed vcpu back to runq
- * lock is grabbed before calling this function
- */
-static void
-__repl_update(const struct scheduler *ops, s_time_t now)
-{

Please, allow me to say that seeing this function going away, fills my
heart with pure joy!! :-D

@@ -889,7 +874,7 @@ rt_schedule(const struct scheduler *ops, s_time_t
now, bool_t tasklet_work_sched
          }
      }

-    ret.time = MIN(snext->budget, MAX_SCHEDULE); /* sched quantum */
+    ret.time = snext->budget; /* invoke the scheduler next time */
      ret.task = snext->vcpu;

This is ok as it is done in v3 (i.e., snext->budget if !idle, -1 if
idle).

@@ -1074,14 +1055,7 @@ rt_vcpu_wake(const struct scheduler *ops,
struct vcpu *vc)
      /* insert svc to runq/depletedq because svc is not in queue now
*/
      __runq_insert(ops, svc);

-    __repl_update(ops, now);
-
-    ASSERT(!list_empty(&prv->sdom));
-    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
-    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
-    snext = __runq_pick(ops, online); /* pick snext from ALL valid
cpus */
-
-    runq_tickle(ops, snext);
+    runq_tickle(ops, svc);

And this is another thing I especially like of this patch: it makes the
wakeup path a lot simpler and a lot more similar to how it looks
in the other schedulers.

Good job with this. :-)

@@ -1108,15 +1078,8 @@ rt_context_saved(const struct scheduler *ops,
struct vcpu *vc)
      if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) &&
           likely(vcpu_runnable(vc)) )
      {
+        /* only insert the pre-empted vcpu back*/
          __runq_insert(ops, svc);
-        __repl_update(ops, NOW());
-
-        ASSERT(!list_empty(&prv->sdom));
-        sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
-        online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
-        snext = __runq_pick(ops, online); /* pick snext from ALL
cpus */
-
-        runq_tickle(ops, snext);
      }
Mmm... I'll think about this more and let you know... But off the
top of my head, I think the tickling has to stay? You preempted a vcpu
from the pcpu where it was running, maybe some other pcpu is either
idle or running a vcpu with a later deadline, and should come and pick
this one up?

gEDF allows this, but there is overhead and it may not be worth it. I have no stats to support this, but there are some papers on restricting which tasks can migrate. We can discuss more if we need extra logic here.
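
A minimal sketch of the tickle decision discussed above, assuming the scheduler can see, for each pcpu, whether it is idle and the deadline of the vcpu it is running. The names pcpu_state, pick_pcpu_to_tickle and NO_PCPU are hypothetical and do not come from the patch; the sketch ignores affinity and migration cost, which is exactly the overhead in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t s_time_t;

#define NO_PCPU (-1)

/* Hypothetical per-pcpu snapshot used only by this sketch. */
struct pcpu_state {
    int idle;               /* nonzero if the pcpu runs the idle vcpu */
    s_time_t cur_deadline;  /* deadline of the running vcpu, if not idle */
};

/*
 * Pick a pcpu that should come and take a vcpu with deadline new_deadline:
 * the first idle pcpu if there is one, otherwise the pcpu running the vcpu
 * with the latest deadline, provided that deadline is later than
 * new_deadline. Returns NO_PCPU when nobody should be disturbed.
 */
static int pick_pcpu_to_tickle(const struct pcpu_state *cpus, size_t n,
                               s_time_t new_deadline)
{
    int victim = NO_PCPU;
    s_time_t latest = new_deadline;
    size_t i;

    assert(cpus != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (cpus[i].idle)
            return (int)i;              /* an idle pcpu always wins */
        if (cpus[i].cur_deadline > latest) {
            latest = cpus[i].cur_deadline;
            victim = (int)i;            /* lowest priority seen so far */
        }
    }
    return victim;
}
```

Restricting migration, as the papers mentioned above suggest, would amount to narrowing the set of pcpus passed to such a function.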

@@ -1167,6 +1130,74 @@ rt_dom_cntl(
      return rc;
  }

+static void repl_handler(void *data){
+    unsigned long flags;
+    s_time_t now = NOW();
+    s_time_t min_repl = LONG_MAX; /* max time used in comparison*/
+    struct scheduler *ops = data;
+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *runq = rt_runq(ops);
+    struct list_head *depletedq = rt_depletedq(ops);
+    struct list_head *iter;
+    struct list_head *tmp;
+    struct rt_vcpu *svc = NULL;
+
+    spin_lock_irqsave(&prv->lock,flags);
+
+    stop_timer(&repl_timer);
+
+    list_for_each_safe(iter, tmp, runq)
+    {

So, I'm a bit lost here. Why does the _replenishment_ timer's handler
scan the runqueue (and, in v3, the running-queue as well)?

I'd expect the replenishment timer to do, ehm, replenishments... And I
wouldn't expect a vcpu that is either in the runqueue or running to
need a replenishment (and, in case it would, I don't think we should
take care of that here).

+        svc = __q_elem(iter);
+        if ( now < svc->cur_deadline )
+            break;
+        rt_update_deadline(now, svc);
+        /* scan the runq to find the min release time
+         * this happens when vcpus on runq miss deadline
+         */

This is exactly my point. It looks to me that you're trying to catch
ready or running vcpus missing their deadline in here, in the
replenishment timer. I don't think this is appropriate... It makes the
logic of the timer handler a lot more complicated than it should be.

Oh, and one thing: the use of the term "release time" is IMO a bit
misleading. Release of what? Typically, the release time of an RT task
(or job) is when the task (or job) is declared ready to run... But I
don't think it's used like this in here.

I propose to just get rid of it.

The "release time" here means the next time a deferrable server is released and ready to serve. This happens every period. Maybe the term "inter-release time" is more appropriate?
+        if( min_repl> svc->cur_deadline )
+        {
+            min_repl = svc->cur_deadline;
+        }
+        /* reinsert the vcpu if its deadline is updated */
+        __q_remove(svc);
+        __runq_insert(ops, svc);

One more proof of what I was trying to say. Is it really this handler's
job to --basically-- re-sort the runqueue? I don't think so.

What is the specific situation that you are trying to handle like this?

Right, if we want to count deadline misses, it could be done when a vcpu is picked. However, when selecting the most imminent "inter-release time" among all runnable vcpus, the vcpu at the head of the runq could have missed its deadline, so its cur_deadline could be in the past. How do we handle this situation? We still need to scan the runq, right?
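
One possible way to keep a deadline that is already in the past from being used as the next timer expiry is to bring it forward by whole periods before it is looked at. The sketch below assumes that approach; sketch_vcpu and sketch_update_deadline are hypothetical names, and this is not claimed to be what rt_update_deadline() in the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t s_time_t;

/* Hypothetical, reduced view of a real-time vcpu. */
struct sketch_vcpu {
    s_time_t period;        /* replenishment period */
    s_time_t budget;        /* full budget granted each period */
    s_time_t cur_budget;    /* budget left in the current period */
    s_time_t cur_deadline;  /* end of the current period */
};

/*
 * If the deadline has already passed at time now, advance it by as many
 * whole periods as needed to land strictly after now, and refill the
 * budget. A vcpu whose deadline is still in the future is left untouched.
 */
static void sketch_update_deadline(s_time_t now, struct sketch_vcpu *v)
{
    assert(v != NULL && v->period > 0);

    if (now >= v->cur_deadline) {
        s_time_t periods = (now - v->cur_deadline) / v->period + 1;

        v->cur_deadline += periods * v->period;
        v->cur_budget = v->budget;
    }
}
```

For example, with a period of 10 and a deadline of 30, a call at time 57 moves the deadline to 60. Where such a call belongs (when the vcpu is picked, or in the timer handler) is the open question in this thread.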

In fact, this is also why I'm not convinced of the fact that we need
the additional queue for running vcpus. Later in the thread, Meng
says:

  "the current running VCPUs on cores also need to replenish their
   budgets at the beginning of their next periods."

And he makes the following example:

  "[a backlogged] VCPU has its period equal to its budget. Suppose this
   VCPU is the only VCPU on a 4 core machine. This VCPU should keep
   running on one core and never be put back to runq. In the current
   code, this VCPU won't have its budget replenished."

But I don't think I understand. When a vcpu runs out of budget, either:
  a. it needs an immediate replenishment
  b. it needs to go to depletedq, and a replenishment event for it
     programmed (which may or may not require re-arming the
     replenishment timer)

Meng's example falls in a., AFAICT, and we can just deal with that when
we handle the 'budget exhausted' event (in rt_schedule(), in this case,
I think).

The case you refer to in the comment above ("when vcpus on runq miss
deadline") can either fall in a. or in b., but in both cases it seems
to me that you can handle it when it happens, rather than inside this
timer handling routine.
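
The two cases a. and b. above can be sketched as a single "budget exhausted" routine. This is only an illustration under stated assumptions: ex_vcpu, repl_timer_state and on_budget_exhausted are hypothetical names, the timer is modelled as a plain struct, and case a. is taken to mean that the vcpu's next period has already started when the budget runs out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t s_time_t;

enum exhausted_action {
    REPLENISH_NOW,      /* case a: refill and keep the vcpu runnable */
    MOVE_TO_DEPLETEDQ   /* case b: park it until its replenishment */
};

/* Hypothetical, reduced view of a real-time vcpu. */
struct ex_vcpu {
    s_time_t period;
    s_time_t budget;
    s_time_t cur_budget;
    s_time_t cur_deadline;
};

/* Hypothetical model of the replenishment timer's state. */
struct repl_timer_state {
    int active;         /* nonzero if the timer is armed */
    s_time_t expires;   /* programmed expiry, valid when active */
};

static enum exhausted_action
on_budget_exhausted(struct ex_vcpu *v, struct repl_timer_state *t,
                    s_time_t now)
{
    assert(v != NULL && t != NULL && v->period > 0);

    if (now >= v->cur_deadline) {
        /* Case a: the next period has begun, so replenish immediately. */
        s_time_t periods = (now - v->cur_deadline) / v->period + 1;

        v->cur_deadline += periods * v->period;
        v->cur_budget = v->budget;
        return REPLENISH_NOW;
    }

    /* Case b: the replenishment is due at the current deadline. */
    v->cur_budget = 0;
    if (!t->active || v->cur_deadline < t->expires) {
        /* Re-arm only when this event is earlier than the programmed one. */
        t->active = 1;
        t->expires = v->cur_deadline;
    }
    return MOVE_TO_DEPLETEDQ;
}
```

In Meng's example (budget equal to period), the budget runs out exactly at the deadline, so the routine takes the first branch and the timer is never touched.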

This discussion took place before I figured out how idle_vcpu[] and tasklets work. A vcpu can be preempted and put back on either the runq or the depletedq if a tasklet is scheduled. It can also end up on the depletedq in other situations. I guess Meng's point is that, to follow EDF, this vcpu should run constantly without being taken off the pcpu when there is no tasklet.
+
+    /* if timer was triggered but none of the vcpus
+     * need to be refilled, set the timer to be the
+     * default period + now
+     */
+    if(min_repl == LONG_MAX)
+    {
+        set_timer(&repl_timer, now + RTDS_DEFAULT_PERIOD);

I agree with Meng's point in this thread: this should not be necessary.
If it is, it's most likely because of a bug or something else.

Let's figure out what it is, and fix it properly. (I see that in v3
this logic is gone, so hopefully you found and fixed the issue
already.)

Yeah. Like I said, the timer was originally programmed to fire when the first vcpu is inserted, but no vcpu is runnable at the beginning of the boot process. If the timer fires before any vcpu wakes up, there is nothing on the queues at all. This should be fixed in v3 by starting the timer in wake().

Thanks,
Tianyang Chen


