
Re: [Xen-devel] [PATCH v3 1/4] xen: add real time scheduler rtds



2014-09-18 14:08 GMT-04:00 Dario Faggioli <dario.faggioli@xxxxxxxxxx>:
> On gio, 2014-09-18 at 17:08 +0100, George Dunlap wrote:
>> On 09/14/2014 10:37 PM, Meng Xu wrote:
>
>> > +/*
>> > + * update deadline and budget when now >= cur_deadline
>> > + * it needs to be updated to the deadline of the current period
>> > + */
>> > +static void
>> > +rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
>> > +{
>> > +    ASSERT(now >= svc->cur_deadline);
>> > +    ASSERT(svc->period != 0);
>> > +
>> > +    do
>> > +        svc->cur_deadline += svc->period;
>> > +    while ( svc->cur_deadline <= now );
>>
>> Is there any possibility that this loop could run on for an unusually
>> long time?  Suppose a vcpu ran, then was put to sleep for a few months,
>> then woke up again?  Or suppose that there was some other condition that
>> could be triggered that would make this take a large number of
>> iterations -- it might end up with a DoS at some point down the line.
>>
>> Would it make sense to add some kind of a failsafe here to fall back to
>> the slow method (using division and multiplication) if the difference is
>> too large?  We should be able to make the check really simple, like this:
>>
>> if ( svc->cur_deadline + (svc->period << UPDATE_LIMIT_SHIFT) > now )
>>   [while loop];
>> else
>>   [multiplication and division];
>>
>> Where UPDATE_LIMIT_SHIFT could be something that would limit the loop to
>> a reasonable number of iterations; say, 10 (which would give you 1024
>> iterations).
>>
> I guess this makes sense. I'm not sure whether I'd put the failsafe here
> or somewhere else (e.g., directly in the vcpu wakeup code), nor am I sure
> whether this is a real DoS risk, but I can't exclude it, not off the
> top of my head at least, so, I'd say, let's go for it.

I like this and will change the code as suggested above. Of course, I will also
change the code where we initially set the parameters of a vcpu :-)
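
For concreteness, here is roughly what I have in mind (a minimal sketch only,
assuming UPDATE_LIMIT_SHIFT == 10 and a division-based slow path; not the
final patch):

#define UPDATE_LIMIT_SHIFT 10

static void
rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
{
    ASSERT(now >= svc->cur_deadline);
    ASSERT(svc->period != 0);

    if ( svc->cur_deadline + (svc->period << UPDATE_LIMIT_SHIFT) > now )
    {
        /* Fast path: bounded to ~2^UPDATE_LIMIT_SHIFT iterations. */
        do
            svc->cur_deadline += svc->period;
        while ( svc->cur_deadline <= now );
    }
    else
    {
        /* Slow path: the vcpu slept for a long time; one division jumps
         * the deadline past 'now' while keeping it period-aligned. */
        svc->cur_deadline +=
            ((now - svc->cur_deadline) / svc->period + 1) * svc->period;
    }
}

The same helper could then also be used where the initial deadline is set.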

>
>> Another option would be to scrap the multiplication and addition
>> altogether, and just set svc->cur_deadline = now + svc->period.  If so
>> much time has passed, it's unlikely that being aligned will be critical.
>>
> Indeed.

Right. I'll go with the UPDATE_LIMIT_SHIFT approach, as I said above.
Either way is fine with me. :-)

>
>> Another advantage of this is that we could use this update function both
>> here and below when creating the vcpu (and setting the initial deadline).
>>
> I'm fine with the failsafe and, wrt what to do, I'm fine with both the
> "reset" (i.e., using now+period) and the occasional division.
>
>> This isn't critical, so if we can't come to a consensus quickly it's
>> probably fine to go in as it is; but if nobody has any objections, I
>> think we should put in some kind of a check like this.
>>
> No objections. :-)

No objections.

>> > +/*
>> > + * schedule function for rt scheduler.
>> > + * The lock is already grabbed in schedule.c, no need to lock here
>> > + */
>> > +static struct task_slice
>> > +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t 
>> > tasklet_work_scheduled)
>> > +{
>> > +    const int cpu = smp_processor_id();
>> > +    struct rt_private *prv = rt_priv(ops);
>> > +    struct rt_vcpu *const scurr = rt_vcpu(current);
>> > +    struct rt_vcpu *snext = NULL;
>> > +    struct task_slice ret = { .migrated = 0 };
>> > +
>> > +    /* clear ticked bit now that we've been scheduled */
>> > +    cpumask_clear_cpu(cpu, &prv->tickled);
>> > +
>> > +    /* burn_budget would return for IDLE VCPU */
>> > +    burn_budget(ops, scurr, now);
>> > +
>> > +    __repl_update(ops, now);
>> > +
>> > +    if ( tasklet_work_scheduled )
>> > +    {
>> > +        snext = rt_vcpu(idle_vcpu[cpu]);
>> > +    }
>> > +    else
>> > +    {
>> > +        cpumask_t cur_cpu;
>> > +        cpumask_clear(&cur_cpu);
>> > +        cpumask_set_cpu(cpu, &cur_cpu);
>> > +        snext = __runq_pick(ops, &cur_cpu);
>> > +        if ( snext == NULL )
>> > +            snext = rt_vcpu(idle_vcpu[cpu]);
>> > +
>> > +        /* if scurr has higher priority and budget, still pick scurr */
>> > +        if ( !is_idle_vcpu(current) &&
>> > +             vcpu_runnable(current) &&
>> > +             scurr->cur_budget > 0 &&
>> > +             ( is_idle_vcpu(snext->vcpu) ||
>> > +               scurr->cur_deadline <= snext->cur_deadline ) )
>> > +            snext = scurr;
>> > +    }
>> > +
>> > +    if ( snext != scurr &&
>> > +         !is_idle_vcpu(current) &&
>> > +         vcpu_runnable(current) )
>> > +        set_bit(__RTDS_delayed_runq_add, &scurr->flags);
>> > +
>> > +    snext->last_start = now;
>> > +    if ( !is_idle_vcpu(snext->vcpu) )
>> > +    {
>> > +        if ( snext != scurr )
>> > +        {
>> > +            __q_remove(snext);
>> > +            set_bit(__RTDS_scheduled, &snext->flags);
>> > +        }
>> > +        if ( snext->vcpu->processor != cpu )
>> > +        {
>> > +            snext->vcpu->processor = cpu;
>> > +            ret.migrated = 1;
>> > +        }
>> > +    }
>> > +
>> > +    ret.time = MILLISECS(1); /* sched quantum */
>> > +    ret.task = snext->vcpu;
>>
>> Isn't having a fixed 1ms quantum going to break with
>> microsecond-granularity period and budget?  If someone sets their period
>> sub-millisecond, won't it be allowed to run for a full ms (even though
>> its budget is smaller), and then, because the period has passed, be
>> given a new budget again?  This will allow vcpus with sub-millisecond
>> periods to starve out those with more, won't it? I'm not sure how two
>> sub-millisecond-period vms will interact, but I doubt it would be what
>> anybody's expecting.
>>
> Meng and I discussed quite a bit the need for this scheduler to
> become a lot more, let's say, time driven, which is a pretty key feature
> for a real-time scheduler! :-P

Yes! Dario and I discussed this back and forth before. This can be
solved by adding timers that trigger the replenishment of each vcpu's
budget and the update of its deadline; when we return the time, we can
then just return the remaining budget of the vcpu. (We cannot simply
return the remaining budget of a vcpu without adding timers to update
each vcpu's deadline; that would keep a vcpu from starting its new
period on time.)
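
Purely to illustrate the direction (a sketch under my own assumptions,
not part of this patch series; the repl_timer field and
repl_timer_handler() are hypothetical names), a per-vcpu replenishment
timer could look roughly like this:

static void repl_timer_handler(void *data)
{
    struct rt_vcpu *svc = data;
    s_time_t now = NOW();

    /* Refresh the deadline (the budget would be replenished here too)
     * if a new period has started. */
    if ( now >= svc->cur_deadline )
        rt_update_deadline(now, svc);

    /* Re-arm for the start of the next period and let the scheduler on
     * that pcpu know that priorities may have changed. */
    set_timer(&svc->repl_timer, svc->cur_deadline);
    cpu_raise_softirq(svc->vcpu->processor, SCHEDULE_SOFTIRQ);
}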

>
> This is, IMO, an example of that. Another one is the fact that
> replenishments and other operations can be done by setting timers,
> rather than via "polling", avoiding having to scan the full RunQ in hot
> paths like this function, and other places (I said this on xen-devel
> too, while reviewing one of the RFCs).
>
> Actually, what you're saying about the budget points at another shortcoming
> of the current implementation, i.e., payback for budget overrun (also
> mentioned in the same email, I think). That basically means, in the case
> you're describing, recognizing that the vcpu in question overran, by
> allowing its budget to go negative, and making it pay a price for the
> budget to become positive again, in terms of how far its deadline will
> be set (this can be implemented quite easily, by slightly changing the
> while loop we were discussing above).

Hmm, I think there are other ways to pay for the overrun budget.
For example, we can deduct the overrun of the current period from the
budget of the next period. Once we add timers and avoid the fixed
MILLISECS(1) scheduling quantum, the overrun should not be too large,
IMO. But anyway, this is something to solve in the future, because it
needs to incorporate the extra timers we will add.
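
As a rough sketch of what I mean (my own assumption, not code from this
patch, and it presumes the budget is allowed to go negative on overrun):

    s_time_t overrun = -svc->cur_budget;      /* amount we ran past zero   */
    rt_update_deadline(now, svc);             /* jump to the next period   */
    svc->cur_budget = svc->budget - overrun;  /* refill, minus the payback */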

>
> My view is that these things can come as future development, especially
> considering we're thinking of checking this in as an experimental
> feature. However...
>
>> What about #define MAX_SCHEDULE MILLISECS(1), then set ret.time =
>> MIN(snext->cur_budget, MAX_SCHEDULE)
>>
> ... yes, something like this should at least help, without requiring too
> much code restructuring, so, yes, I'd go for it.

Roger.
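
Concretely, I plan to do something like the following at the end of
rt_schedule() (a sketch only; MAX_SCHEDULE is the name George suggested,
and the exact placement is my assumption):

#define MAX_SCHEDULE MILLISECS(1)  /* upper bound on a single time slice */
...
    /* Run for at most one quantum, but never hand out more time than
     * the remaining budget of the picked (non-idle) vcpu. */
    if ( !is_idle_vcpu(snext->vcpu) && snext->cur_budget < MAX_SCHEDULE )
        ret.time = snext->cur_budget;
    else
        ret.time = MAX_SCHEDULE;
    ret.task = snext->vcpu;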

>
>> It might also make sense to do some kind of experiments to determine a
>> minimum reliable period / budget and not allow people to set values
>> below that.
>>
> Yeah, experiments are always a good thing... :-)
>
>> > +/*
>> > + * Pick a cpu where to run a vcpu, possibly kicking out the vcpu running 
>> > there
>> > + * Called by wake() and context_saved()
>> > + * We have a running candidate here, the kick logic is:
>> > + * Among all the cpus that are within the cpu affinity
>> > + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
>> > + * 2) if there are any idle vcpu, kick it.
>> > + * 3) now all pcpus are busy, among all the running vcpus, pick lowest 
>> > priority one
>> > + *    if snext has higher priority, kick it.
>> > + *
>> > + * TODO:
>> > + * 1) what if these two vcpus belong to the same domain?
>> > + *    replacing a vcpu belonging to the same domain introduces more overhead
>> > + *
>> > + * lock is grabbed before calling this function
>> > + */
>> > +static void
>> > +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
>> > +{
>> > +    struct rt_private *prv = rt_priv(ops);
>> > +    struct rt_vcpu *latest_deadline_vcpu = NULL;    /* lowest priority 
>> > scheduled */
>> > +    struct rt_vcpu *iter_svc;
>> > +    struct vcpu *iter_vc;
>> > +    int cpu = 0, cpu_to_tickle = 0;
>> > +    cpumask_t not_tickled;
>> > +    cpumask_t *online;
>> > +
>> > +    if ( new == NULL || is_idle_vcpu(new->vcpu) )
>> > +        return;
>> > +
>> > +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
>> > +    cpumask_and(&not_tickled, online, new->vcpu->cpu_hard_affinity);
>> > +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
>> > +
>> > +    /* 1) if new's previous cpu is idle, kick it for cache benefit */
>> > +    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) )
>> > +    {
>> > +        cpu_to_tickle = new->vcpu->processor;
>> > +        goto out;
>> > +    }
>> > +
>> > +    /* 2) if there are any idle pcpu, kick it */
>> > +    /* The same loop also find the one with lowest priority */
>> > +    for_each_cpu(cpu, &not_tickled)
>> > +    {
>> > +        iter_vc = curr_on_cpu(cpu);
>> > +        if ( is_idle_vcpu(iter_vc) )
>> > +        {
>> > +            cpu_to_tickle = cpu;
>> > +            goto out;
>> > +        }
>> > +        iter_svc = rt_vcpu(iter_vc);
>> > +        if ( latest_deadline_vcpu == NULL ||
>> > +             iter_svc->cur_deadline > latest_deadline_vcpu->cur_deadline )
>> > +            latest_deadline_vcpu = iter_svc;
>> > +    }
>> > +
>> > +    /* 3) candidate has higher priority, kick out the lowest priority 
>> > vcpu */
>> > +    if ( latest_deadline_vcpu != NULL && new->cur_deadline < 
>> > latest_deadline_vcpu->cur_deadline )
>> > +    {
>> > +        cpu_to_tickle = latest_deadline_vcpu->vcpu->processor;
>> > +        goto out;
>> > +    }
>> > +
>> > +out:
>> > +    /* TRACE */
>> > +    {
>> > +        struct {
>> > +            unsigned cpu:8, pad:24;
>> > +        } d;
>> > +        d.cpu = cpu_to_tickle;
>> > +        d.pad = 0;
>> > +        trace_var(TRC_RTDS_TICKLE, 0,
>> > +                  sizeof(d),
>> > +                  (unsigned char *)&d);
>> > +    }
>> > +
>> > +    cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
>> > +    cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);
>>
>> Won't the "fall-through" here mean that if it doesn't find a suitable
>> cpu to place next, that it will *always* pointlessly tickle cpu 0 (the
>> initial value of cpu_to_tickle)?
>>
> Yep, I think you're right (and, yes, I missed this in my own
> review. :-))
>
>> It seems like you should either put a return before the out: label
>> (possibly with another trace to say the tickle didn't do anything), or
>> set cpu_to_tickle to -1 and then only set the tickled bit and raise the
>> softirq if cpu_to_tickle >= 0.
>>
> +1 (both ways look fine).

I will go with the first option: putting a return before the out label.
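
Roughly like this (a sketch of the tail of runq_tickle(), not the final
code):

    /* 3) candidate has higher priority, kick out the lowest priority vcpu */
    if ( latest_deadline_vcpu != NULL &&
         new->cur_deadline < latest_deadline_vcpu->cur_deadline )
    {
        cpu_to_tickle = latest_deadline_vcpu->vcpu->processor;
        goto out;
    }

    /* Nothing to tickle: every not-yet-tickled pcpu is running a vcpu
     * with an earlier (or equal) deadline than 'new'. */
    return;

out:
    /* TRACE block as in the patch above */
    cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
    cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);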

>
>> > +/*
>> > + * Should always wake up runnable vcpu, put it back to RunQ.
>> > + * Check priority to raise interrupt
>> > + * The lock is already grabbed in schedule.c, no need to lock here
>> > + * TODO: what if these two vcpus belong to the same domain?
>> > + */
>> > +static void
>> > +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
>> > +{
>> > +    struct rt_vcpu * const svc = rt_vcpu(vc);
>> > +    s_time_t now = NOW();
>> > +    struct rt_private *prv = rt_priv(ops);
>> > +    struct rt_vcpu *snext = NULL;        /* highest priority on RunQ */
>> > +    struct rt_dom *sdom = NULL;
>> > +    cpumask_t *online;
>> > +
>> > +    BUG_ON( is_idle_vcpu(vc) );
>> > +
>> > +    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
>> > +        return;
>> > +
>> > +    /* on RunQ/DepletedQ, just update info is ok */
>> > +    if ( unlikely(__vcpu_on_q(svc)) )
>> > +        return;
>> > +
>> > +    /* If context hasn't been saved for this vcpu yet, we can't put it on
>> > +     * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
>> > +     * put on the Runqueue/DepletedQ after the context has been saved.
>> > +     */
>> > +    if ( unlikely(test_bit(__RTDS_scheduled, &svc->flags)) )
>> > +    {
>> > +        set_bit(__RTDS_delayed_runq_add, &svc->flags);
>> > +        return;
>> > +    }
>> > +
>> > +    if ( now >= svc->cur_deadline)
>> > +        rt_update_deadline(now, svc);
>> > +
>> > +    /* insert svc to runq/depletedq because svc is not in queue now */
>> > +    if ( svc->cur_budget > 0 )
>> > +        __runq_insert(ops, svc);
>> > +    else
>> > +        list_add(&svc->q_elem, &prv->depletedq);
>>
>> This check is already done inside of __runq_insert() -- you don't need
>> to duplicate it here.
>>
>> > +
>> > +    __repl_update(ops, now);
>> > +
>> > +    ASSERT(!list_empty(&prv->sdom));
>> > +    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
>> > +    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
>> > +    snext = __runq_pick(ops, online);    /* pick snext from ALL valid 
>> > cpus */
>> > +
>> > +    runq_tickle(ops, snext);
>>
>> Is there a need to check the status of the "front of the queue" on every
>> wake?  In theory the only vcpu which we want to check is the one we just
>> woke up.  Would it make sense instead to do something like if(snext ==
>> svc) runq_tickle()?
>>
> Well, since __repl_update() is called, I think there is, as things may
> have changed in the RunQ, not only (at least potentially) for the task
> that is waking up.
>
> Or perhaps you were suggesting not to call __repl_update()?
>
> What Meng is doing here, I think, is basically using the wakeup of a
> vcpu as a potential "scheduling point", i.e., he's taking the chance
> to update the vcpus' parameters, look at the RunQ, and see if anything
> changed and a re-schedule is necessary.
>
> This implies some overhead, but it also increases the precision... It's,
> IIUIC, another way of introducing a varying scheduling quantum, similar
> to the one you proposed above, when commenting on do_schedule().
>
> I really don't like the fact that we scan the RunQ so often. By making
> (clever) use of timers and other tricks, I'm sure we can improve, but
> probably not until the code is structured this way.
>
> As said before, I'm fine with such improvements to come later, as
> further development, after 4.5.
>
>> I was about to say that since our scheduling quantum is so short (1ms),
>> it almost seems like we could get away without tickling entirely; or,
>> that we could just check to see if it's got a high enough priority to
>> run on the cpu it's currently running on, and just run it there, without
>> checking everywhere else.
>>
> Maybe we can use something different than tickling (in future!), but I
> don't think we can do without something that achieves the same, not if
> we want to stick to implementing EDF properly, as I think we should.
>
> Or, if you want, since we're dealing with (potentially) very small
> numbers already, a scheduling quantum that would allow us to avoid
> actively pushing a vcpu to a specific processor would be too small
> to be practical... which, yes, is pretty much your point. :-)
>

I totally agree with Dario's explanation above about runq_tickle().
Thank you very much, Dario! :-P
A vcpu wakeup is an event that triggers the __repl_update() function,
because we have one more vcpu in the scheduler, and this newly added
vcpu could potentially affect the currently scheduled vcpus. As time
elapses, the vcpus on the RunQ/DepletedQ may also be due to have their
deadline and budget updated (if now > their deadline); so we need to
call __repl_update() to refresh all vcpus' information and pick the
current highest-priority one.

If we removed runq_tickle(), we might fail to follow EDF for up to
1ms, since 1ms is the scheduling quantum; at the next scheduling point,
the same logic is run in the schedule() function anyway.

Thank you very much!

Meng


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

