
Re: [Xen-devel] [PATCH v3 1/4] xen: add real time scheduler rtds



On Thu, 2014-09-18 at 17:08 +0100, George Dunlap wrote:
> On 09/14/2014 10:37 PM, Meng Xu wrote:

> > +/*
> > + * update deadline and budget when now >= cur_deadline
> > + * it needs to be updated to the deadline of the current period
> > + */
> > +static void
> > +rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
> > +{
> > +    ASSERT(now >= svc->cur_deadline);
> > +    ASSERT(svc->period != 0);
> > +
> > +    do
> > +        svc->cur_deadline += svc->period;
> > +    while ( svc->cur_deadline <= now );
> 
> Is there any possibility that this loop could run on for an unusually 
> long time?  Suppose a vcpu ran, then was put to sleep for a few months, 
> then woke up again?  Or suppose that there was some other condition that 
> could be triggered that would make this take a large number of 
> iterations -- it might end up with a DoS at some point down the line.
> 
> Would it make sense to add some kind of a failsafe here to fall back to 
> the slow method (using division and multiplication) if the difference is 
> too large?  We should be able to make the check really simple, like this:
> 
> if ( svc->cur_deadline + (svc->period << UPDATE_LIMIT_SHIFT) > now )
>   [while loop];
> else
>   [multiplication and division];
> 
> Where UPDATE_LIMIT_SHIFT could be something that would limit the loop to 
> a reasonable number of iterations; say, 10 (which would give you 1024 
> iterations).
> 
I guess this makes sense. I'm not sure whether I'd put the failsafe here
or somewhere else (e.g., directly in the vcpu wakeup code), nor am I sure
whether this is a real DoS risk; but I can't exclude it, not off the top
of my head at least, so I'd say let's go for it.

> Another option would be to scrap the multiplication and addition 
> altogether, and just set svc->cur_deadline = now + svc->period.  If so 
> much time has passed, it's unlikely that being aligned will be critical.
> 
Indeed.

> Another advantage of this is that we could use this update function both 
> here and below when creating the vcpu (and setting the initial deadline).
> 
I'm fine with the failsafe and, wrt what to do, I'm fine with both the
"reset" (i.e., using now+period) and the occasional division.

> This isn't critical, so if we can't come to a consensus quickly it's 
> probably fine to go in as it is; but if nobody has any objections, I 
> think we should put in some kind of a check like this.
> 
No objections. :-)
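
Just to make the idea concrete, here's a rough sketch of what such a
failsafe could look like (UPDATE_LIMIT_SHIFT, its value and the division
fallback come from your suggestion above; none of this is in the series
yet, so take it as an illustration only):

    #define UPDATE_LIMIT_SHIFT 10

    static void
    rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
    {
        ASSERT(now >= svc->cur_deadline);
        ASSERT(svc->period != 0);

        if ( svc->cur_deadline + (svc->period << UPDATE_LIMIT_SHIFT) > now )
        {
            /* Common case: bounded to at most 2^UPDATE_LIMIT_SHIFT rounds. */
            do
                svc->cur_deadline += svc->period;
            while ( svc->cur_deadline <= now );
        }
        else
        {
            /* Rare case (vcpu asleep for ages): one division does the job. */
            s_time_t missed = ( now - svc->cur_deadline ) / svc->period + 1;

            svc->cur_deadline += missed * svc->period;
        }
    }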

> > +/*
> > + * schedule function for rt scheduler.
> > + * The lock is already grabbed in schedule.c, no need to lock here
> > + */
> > +static struct task_slice
> > +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t 
> > tasklet_work_scheduled)
> > +{
> > +    const int cpu = smp_processor_id();
> > +    struct rt_private *prv = rt_priv(ops);
> > +    struct rt_vcpu *const scurr = rt_vcpu(current);
> > +    struct rt_vcpu *snext = NULL;
> > +    struct task_slice ret = { .migrated = 0 };
> > +
> > +    /* clear ticked bit now that we've been scheduled */
> > +    cpumask_clear_cpu(cpu, &prv->tickled);
> > +
> > +    /* burn_budget would return for IDLE VCPU */
> > +    burn_budget(ops, scurr, now);
> > +
> > +    __repl_update(ops, now);
> > +
> > +    if ( tasklet_work_scheduled )
> > +    {
> > +        snext = rt_vcpu(idle_vcpu[cpu]);
> > +    }
> > +    else
> > +    {
> > +        cpumask_t cur_cpu;
> > +        cpumask_clear(&cur_cpu);
> > +        cpumask_set_cpu(cpu, &cur_cpu);
> > +        snext = __runq_pick(ops, &cur_cpu);
> > +        if ( snext == NULL )
> > +            snext = rt_vcpu(idle_vcpu[cpu]);
> > +
> > +        /* if scurr has higher priority and budget, still pick scurr */
> > +        if ( !is_idle_vcpu(current) &&
> > +             vcpu_runnable(current) &&
> > +             scurr->cur_budget > 0 &&
> > +             ( is_idle_vcpu(snext->vcpu) ||
> > +               scurr->cur_deadline <= snext->cur_deadline ) )
> > +            snext = scurr;
> > +    }
> > +
> > +    if ( snext != scurr &&
> > +         !is_idle_vcpu(current) &&
> > +         vcpu_runnable(current) )
> > +        set_bit(__RTDS_delayed_runq_add, &scurr->flags);
> > +
> > +    snext->last_start = now;
> > +    if ( !is_idle_vcpu(snext->vcpu) )
> > +    {
> > +        if ( snext != scurr )
> > +        {
> > +            __q_remove(snext);
> > +            set_bit(__RTDS_scheduled, &snext->flags);
> > +        }
> > +        if ( snext->vcpu->processor != cpu )
> > +        {
> > +            snext->vcpu->processor = cpu;
> > +            ret.migrated = 1;
> > +        }
> > +    }
> > +
> > +    ret.time = MILLISECS(1); /* sched quantum */
> > +    ret.task = snext->vcpu;
> 
> Isn't having a fixed 1ms quantum going to break with 
> microsecond-granularity period and budget?  If someone sets their period 
> sub-millisecond, won't it be allowed to run for a full ms (even though 
> its budget is smaller), and then, because the period has passed, be 
> given a new budget again?  This will allow vcpus with sub-millisecond 
> periods to starve out those with more, won't it? I'm not sure how two 
> sub-millisecond-period vms will interact, but I doubt it would be what 
> anybody's expecting.
> 
Meng and I have discussed quite a bit the need for this scheduler to
become a lot more, let's say, time driven, which is a pretty key feature
for a real-time scheduler! :-P

This is, IMO, an example of that. Another one is the fact that
replenishments and other operations could be done by setting timers,
rather than via "polling", which would avoid having to scan the full
RunQ in hot paths like this function and other places (I said this on
xen-devel too, while reviewing one of the RFCs).

Actually, what you're saying about the budget points at another gap in
the current implementation, i.e., payback for budget overruns (also
mentioned in the same email, I think). That basically means, in the case
you're describing, recognizing that the vcpu in question overran, by
allowing its budget to go negative, and making it pay a price for the
budget to become positive again, in terms of how far its deadline will
be set (this can be implemented quite easily, by slightly changing the
while loop we were discussing above).
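
To illustrate (this is only a sketch of one possible payback scheme, not
something that is in the series; the function name is made up, the fields
are the ones from the patch):

    /*
     * Replenish and, at the same time, make the vcpu pay for any overrun:
     * a negative cur_budget (the overrun) is carried over into the new
     * period(s), and the deadline keeps being postponed until the budget
     * is positive again.
     */
    static void
    rt_update_deadline_payback(s_time_t now, struct rt_vcpu *svc)
    {
        while ( svc->cur_deadline <= now || svc->cur_budget <= 0 )
        {
            svc->cur_deadline += svc->period;

            if ( svc->cur_budget < 0 )
                svc->cur_budget += svc->budget;  /* pay the overrun back */
            else
                svc->cur_budget = svc->budget;   /* plain replenishment */
        }
    }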

My view is that these things can come as future development, especially
considering we're thinking of checking this in as an experimental
feature. However...

> What about #define MAX_SCHEDULE MILLISECS(1), then set ret.time = 
> MIN(snext->budget, MAX_SCHEDULE)
> 
... yes, something like this should at least help, without requiring too
much code restructuring, so, yes, I'd go for it.
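
I.e., something like this at the end of rt_schedule() (a sketch only:
MAX_SCHEDULE is the name from your suggestion, and I'm assuming that
cur_budget, i.e., the remaining budget, is what we want to cap against):

    #define MAX_SCHEDULE MILLISECS(1)

    /* ... rest of rt_schedule() unchanged ... */

    /* Never hand out more than the remaining budget, nor more than the
     * maximum quantum (the idle vcpu would of course need special casing). */
    ret.time = snext->cur_budget < MAX_SCHEDULE ? snext->cur_budget
                                                : MAX_SCHEDULE;
    ret.task = snext->vcpu;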

> It might also make sense to do some kind of experiments to determine a 
> minimum reliable period / budget and not allow people to set values 
> below that.
> 
Yeah, experiments are always a good thing... :-)

> > +/*
> > + * Pick a cpu where to run a vcpu, possibly kicking out the vcpu running 
> > there
> > + * Called by wake() and context_saved()
> > + * We have a running candidate here, the kick logic is:
> > + * Among all the cpus that are within the cpu affinity
> > + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> > + * 2) if there are any idle vcpu, kick it.
> > + * 3) now all pcpus are busy, among all the running vcpus, pick lowest 
> > priority one
> > + *    if snext has higher priority, kick it.
> > + *
> > + * TODO:
> > + * 1) what if these two vcpus belongs to the same domain?
> > + *    replace a vcpu belonging to the same domain introduces more overhead
> > + *
> > + * lock is grabbed before calling this function
> > + */
> > +static void
> > +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> > +{
> > +    struct rt_private *prv = rt_priv(ops);
> > +    struct rt_vcpu *latest_deadline_vcpu = NULL;    /* lowest priority 
> > scheduled */
> > +    struct rt_vcpu *iter_svc;
> > +    struct vcpu *iter_vc;
> > +    int cpu = 0, cpu_to_tickle = 0;
> > +    cpumask_t not_tickled;
> > +    cpumask_t *online;
> > +
> > +    if ( new == NULL || is_idle_vcpu(new->vcpu) )
> > +        return;
> > +
> > +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
> > +    cpumask_and(&not_tickled, online, new->vcpu->cpu_hard_affinity);
> > +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
> > +
> > +    /* 1) if new's previous cpu is idle, kick it for cache benefit */
> > +    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) )
> > +    {
> > +        cpu_to_tickle = new->vcpu->processor;
> > +        goto out;
> > +    }
> > +
> > +    /* 2) if there are any idle pcpu, kick it */
> > +    /* The same loop also find the one with lowest priority */
> > +    for_each_cpu(cpu, &not_tickled)
> > +    {
> > +        iter_vc = curr_on_cpu(cpu);
> > +        if ( is_idle_vcpu(iter_vc) )
> > +        {
> > +            cpu_to_tickle = cpu;
> > +            goto out;
> > +        }
> > +        iter_svc = rt_vcpu(iter_vc);
> > +        if ( latest_deadline_vcpu == NULL ||
> > +             iter_svc->cur_deadline > latest_deadline_vcpu->cur_deadline )
> > +            latest_deadline_vcpu = iter_svc;
> > +    }
> > +
> > +    /* 3) candidate has higher priority, kick out the lowest priority vcpu 
> > */
> > +    if ( latest_deadline_vcpu != NULL && new->cur_deadline < 
> > latest_deadline_vcpu->cur_deadline )
> > +    {
> > +        cpu_to_tickle = latest_deadline_vcpu->vcpu->processor;
> > +        goto out;
> > +    }
> > +
> > +out:
> > +    /* TRACE */
> > +    {
> > +        struct {
> > +            unsigned cpu:8, pad:24;
> > +        } d;
> > +        d.cpu = cpu_to_tickle;
> > +        d.pad = 0;
> > +        trace_var(TRC_RTDS_TICKLE, 0,
> > +                  sizeof(d),
> > +                  (unsigned char *)&d);
> > +    }
> > +
> > +    cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
> > +    cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);
> 
> Won't the "fall-through" here mean that if it doesn't find a suitable 
> cpu to place next, that it will *always* pointlessly tickle cpu 0 (the 
> initial value of cpu_to_tickle)?
> 
Yep, I think you're right (and, yes, I missed this in my own
review. :-))

> It seems like you should either put a return before the out: label 
> (possibly with another trace to say the tickle didn't do anything), or 
> set cpu_to_tickle to -1 and then only set the tickled bit and raise the 
> softirq if cpu_to_tickle >= 0.
> 
+1 (both ways look fine).
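
FWIW, the first option would look something like this (a sketch;
everything here is already in the patch, just rearranged, possibly with
an extra trace point where the early return happens):

    /* Nobody to tickle: just return (maybe trace this too). */
    return;

out:
    /* TRACE */
    {
        struct {
            unsigned cpu:8, pad:24;
        } d;

        d.cpu = cpu_to_tickle;
        d.pad = 0;
        trace_var(TRC_RTDS_TICKLE, 0, sizeof(d), (unsigned char *)&d);
    }

    cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
    cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);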

> > +/*
> > + * Should always wake up runnable vcpu, put it back to RunQ.
> > + * Check priority to raise interrupt
> > + * The lock is already grabbed in schedule.c, no need to lock here
> > + * TODO: what if these two vcpus belongs to the same domain?
> > + */
> > +static void
> > +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> > +{
> > +    struct rt_vcpu * const svc = rt_vcpu(vc);
> > +    s_time_t now = NOW();
> > +    struct rt_private *prv = rt_priv(ops);
> > +    struct rt_vcpu *snext = NULL;        /* highest priority on RunQ */
> > +    struct rt_dom *sdom = NULL;
> > +    cpumask_t *online;
> > +
> > +    BUG_ON( is_idle_vcpu(vc) );
> > +
> > +    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
> > +        return;
> > +
> > +    /* on RunQ/DepletedQ, just update info is ok */
> > +    if ( unlikely(__vcpu_on_q(svc)) )
> > +        return;
> > +
> > +    /* If context hasn't been saved for this vcpu yet, we can't put it on
> > +     * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
> > +     * put on the Runqueue/DepletedQ after the context has been saved.
> > +     */
> > +    if ( unlikely(test_bit(__RTDS_scheduled, &svc->flags)) )
> > +    {
> > +        set_bit(__RTDS_delayed_runq_add, &svc->flags);
> > +        return;
> > +    }
> > +
> > +    if ( now >= svc->cur_deadline)
> > +        rt_update_deadline(now, svc);
> > +
> > +    /* insert svc to runq/depletedq because svc is not in queue now */
> > +    if ( svc->cur_budget > 0 )
> > +        __runq_insert(ops, svc);
> > +    else
> > +        list_add(&svc->q_elem, &prv->depletedq);
> 
> This check is already done inside of __runq_insert() -- you don't need 
> to duplicate it here.
> 
> > +
> > +    __repl_update(ops, now);
> > +
> > +    ASSERT(!list_empty(&prv->sdom));
> > +    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
> > +    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
> > +    snext = __runq_pick(ops, online);    /* pick snext from ALL valid cpus 
> > */
> > +
> > +    runq_tickle(ops, snext);
> 
> Is there a need to check the status of the "front of the queue" on every 
> wake?  In theory the only vcpu which we want to check is the one we just 
> woke up.  Would it make sense instead to do something like if(snext == 
> svc) runq_tickle()?
> 
Well, since __repl_update() is called, I think there is: things may have
changed in the RunQ, and not only (at least potentially) for the vcpu
that is waking up.

Or perhaps you were suggesting not to call __repl_update()?

What Meng is doing here, I think, is basically using the wakeup of a
vcpu as a potential "scheduling point", i.e., he's taking the chance to
update the vcpus' parameters, look at the RunQ, and see whether anything
has changed and a re-schedule is necessary.

This implies some overhead, but it also increases the precision... It's,
IIUIC, another way of introducing a varying scheduling quantum, similar
to the one you proposed above, when commenting on do_schedule().

I really don't like the fact that we scan the RunQ so often. By making
(clever) use of timers and other tricks, I'm sure we can improve, but
probably not as long as the code is structured this way.

As said before, I'm fine with such improvements to come later, as
further development, after 4.5.

> I was about to say that since our scheduling quantum is so short (1ms), 
> it almost seems like we could get away without tickling entirely; or, 
> that we could just check to see if it's got a high enough priority to 
> run on the cpu it's currently running on, and just run it there, without 
> checking everywhere else.
> 
Maybe we can use something different from tickling (in future!), but I
don't think we can do without something that achieves the same effect,
not if we want to stick to implementing EDF properly, as I think we
should.

Or, if you want: since we're dealing with (potentially) very small
numbers already, a scheduling quantum that would allow us to avoid
actively pushing a vcpu to a specific processor would be too small to be
practical... which, yes, is pretty much your point. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



 

