
Re: [Xen-devel] [PATCH v1 1/4] xen: add real time scheduler rt



Hi Dario,

I have modified the code based on your comments in the previous email.
Since there are a lot of points to address, if I don't reply to a
point it means I agree with your comment and have changed it. :-)

>
>> +/*
>> + * Useful inline functions
>> + */
>> +static inline struct rt_private *RT_PRIV(const struct scheduler *_ops)
>> +{
>> +    return _ops->sched_data;
>> +}
>> +
>> +static inline struct rt_vcpu *RT_VCPU(const struct vcpu *_vcpu)
>> +{
>> +    return _vcpu->sched_priv;
>> +}
>> +
>> +static inline struct rt_dom *RT_DOM(const struct domain *_dom)
>> +{
>> +    return _dom->sched_priv;
>> +}
>> +
>> +static inline struct list_head *RUNQ(const struct scheduler *_ops)
>> +{
>> +    return &RT_PRIV(_ops)->runq;
>> +}
>> +
> I see what's happening, and I remember the suggestion of using static
> inline-s, with which I agree. At that point, however, I'd have the
> function names lower case-ed.
>
> It's probably not a big deal, and I don't think we have any official
> saying about this, but it looks more consistent to me.

Because the other schedulers use upper-case names (they define these
helpers as macros rather than static inlines), I think I will keep the
names upper-case to stay consistent with them. I could send another
patch set that converts the other schedulers' macro definitions of
these helpers to static inlines and then lower-cases the names
everywhere. What do you think?
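
For the record, that follow-up rename would be mechanical; a sketch of
what the lower-cased helpers could look like (the function names here
are just my suggestion, nothing that exists in the tree yet):

    static inline struct rt_private *rt_priv(const struct scheduler *ops)
    {
        return ops->sched_data;
    }

    static inline struct rt_vcpu *rt_vcpu(const struct vcpu *vcpu)
    {
        return vcpu->sched_priv;
    }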

>> +/*
>> + * Debug related code, dump vcpu/cpu information
>> + */
>> +static void
>> +rt_dump_vcpu(const struct rt_vcpu *svc)
>> +{
>> +    char cpustr[1024];
>> +
>> +    ASSERT(svc != NULL);
>> +    /* flag vcpu */
>> +    if( svc->sdom == NULL )
>> +        return;
>> +
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
>> +    printk("[%5d.%-2u] cpu %u, (%"PRId64", %"PRId64"), cur_b=%"PRId64" cur_d=%"PRId64" last_start=%"PRId64" onR=%d runnable=%d cpu_hard_affinity=%s ",
>>
> long line.
>
> Also, we have PRI_stime (look in xen/include/xen/time.h).
>
>> +            svc->vcpu->domain->domain_id,
>> +            svc->vcpu->vcpu_id,
>> +            svc->vcpu->processor,
>> +            svc->period,
>> +            svc->budget,
>> +            svc->cur_budget,
>> +            svc->cur_deadline,
>> +            svc->last_start,
>> +            __vcpu_on_runq(svc),
>> +            vcpu_runnable(svc->vcpu),
>> +            cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool));
> here too.
>
>> +    printk("cpupool=%s\n", cpustr);
>> +
>> +    /* TRACE */
>> +    {
>> +        struct {
>> +            unsigned dom:16,vcpu:16;
>> +            unsigned processor;
>> +            unsigned cur_budget_lo, cur_budget_hi, cur_deadline_lo, cur_deadline_hi;
>> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
>> +        } d;
>> +        d.dom = svc->vcpu->domain->domain_id;
>> +        d.vcpu = svc->vcpu->vcpu_id;
>> +        d.processor = svc->vcpu->processor;
>> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
>> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
>> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
>> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
>> +        trace_var(TRC_RT_VCPU_DUMP, 1,
>> +                  sizeof(d),
>> +                  (unsigned char *)&d);
>> +    }
> Not a too big deal, but is it really useful to trace vcpu dumps? How?

Actually, this function is called from several other functions in this
file. Whenever we want to look into the details of a vcpu (for
example, when we insert it into the RunQ), we call this function and
it traces the vcpu's information. I extracted it into a helper to
avoid writing the same trace code again and again in functions such as
rt_alloc_vdata, rt_free_vdata and burn_budgets.

In addition, we use xl sched-rt to dump the RunQ information and check
that the RunQ is properly sorted, for debugging purposes.

When it is upstreamed, I will change this function to a static inline,
remove the printk and keep only the trace.
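
A minimal sketch of the trace-only variant I have in mind, reusing the
TRC_RT_VCPU_DUMP record layout quoted above (the helper name
rt_trace_vcpu is just a placeholder, not final):

    static inline void
    rt_trace_vcpu(const struct rt_vcpu *svc)
    {
        struct {
            unsigned dom:16, vcpu:16;
            unsigned processor;
            unsigned cur_budget_lo, cur_budget_hi;
            unsigned cur_deadline_lo, cur_deadline_hi;
            unsigned is_vcpu_on_runq:16, is_vcpu_runnable:16;
        } d;

        /* Skip the flag/idle vcpu, which has no domain behind it. */
        if ( svc->sdom == NULL )
            return;

        d.dom = svc->vcpu->domain->domain_id;
        d.vcpu = svc->vcpu->vcpu_id;
        d.processor = svc->vcpu->processor;
        d.cur_budget_lo = (unsigned)svc->cur_budget;
        d.cur_budget_hi = (unsigned)(svc->cur_budget >> 32);
        d.cur_deadline_lo = (unsigned)svc->cur_deadline;
        d.cur_deadline_hi = (unsigned)(svc->cur_deadline >> 32);
        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);

        trace_var(TRC_RT_VCPU_DUMP, 1, sizeof(d), (unsigned char *)&d);
    }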

>> +        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
>>
> Mmm... flag_vcpu, eh? I missed it above where it's declared, but I
> really don't like the name. 'depleted_vcpus'? Or something else (sorry,
> not very creative in this very moment :-))... but flag_vcpu, I found it
> confusing.

It's declared in struct rt_private and initialized in rt_init().

The reason why I need this vcpu is: the RunQ has two parts. The first
part holds the vcpus that still have budget and is sorted by vcpu
priority; the second part holds the vcpus whose budget is depleted and
is unsorted (we never schedule a depleted vcpu, so there is no need to
sort them). The flag_vcpu is a dummy vcpu that separates the two
parts, so that we don't have to scan the whole RunQ when we insert a
depleted vcpu. (If you remember, you suggested splitting the RunQ into
two queues: one for vcpus with budget and one for vcpus without
budget. This is my implementation of that suggestion; it virtually
splits the RunQ. If I used a separate depleted queue, I would need
extra functions such as depleted_queue_insert, which adds much more
code than the current implementation. :-P)
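
To make the "virtual split" concrete, the insertion logic looks
roughly like this (a sketch based on the helpers quoted in this patch;
the exact function in the next version may differ in detail):

    static void
    __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
    {
        struct rt_private *prv = RT_PRIV(ops);
        struct list_head *iter;

        if ( svc->cur_budget == 0 )
        {
            /* Depleted vcpus go right after the flag_vcpu, unsorted. */
            list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
            return;
        }

        /* Vcpus with budget stay EDF-sorted in front of the flag_vcpu. */
        list_for_each ( iter, RUNQ(ops) )
        {
            struct rt_vcpu *iter_svc = __runq_elem(iter);

            if ( iter_svc->sdom == NULL ||    /* reached the divider */
                 svc->cur_deadline <= iter_svc->cur_deadline )
                break;
        }
        list_add_tail(&svc->runq_elem, iter);
    }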

>
>> +/*
>> + * point per_cpu spinlock to the global system lock; all cpu have same global system lock
> long line.
>
>> + */
>> +static void *
>> +rt_alloc_pdata(const struct scheduler *ops, int cpu)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +
>> +    cpumask_set_cpu(cpu, &prv->cpus);
>> +
>> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
>> +
>> +    printtime();
>> +    printk("%s total cpus: %d", __func__, cpumask_weight(&prv->cpus));
>> +    /* same as credit2, not a bogus pointer */
>> +    return (void *)1;
>>
> Can you put in the comment the reason why it is ok to do this, without
> referencing credit2?

This is explained in the comment in the next patch. In schedule.c, the
return value is only used to check whether this function allocated the
pdata correctly, by testing whether it is non-NULL. (Well, this is
inconsistent with other code, which uses 0 to indicate success.) The
function has to return a void *, so we cast the 1 to void *. That's
the reason. :-P
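
Concretely, I am thinking of something like this (same code as in the
patch, with the debug printk dropped for brevity; only the comment
text is new and its wording is of course open to change):

    static void *
    rt_alloc_pdata(const struct scheduler *ops, int cpu)
    {
        struct rt_private *prv = RT_PRIV(ops);

        cpumask_set_cpu(cpu, &prv->cpus);

        /* All pcpus share one global lock in this scheduler. */
        per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;

        /*
         * The caller in schedule.c only checks the return value against
         * NULL to detect allocation failure; the pointer is never
         * dereferenced. Since this scheduler has no per-pcpu data to
         * allocate, returning any non-NULL token is fine.
         */
        return (void *)1;
    }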

>> +static void *
>> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
>> +{
>> +    struct rt_vcpu *svc;
>> +    s_time_t now = NOW();
>> +    long count;
>> +
>> +    /* Allocate per-VCPU info */
>> +    svc = xzalloc(struct rt_vcpu);
>> +    if ( svc == NULL )
>> +    {
>> +        printk("%s, xzalloc failed\n", __func__);
>> +        return NULL;
>> +    }
>> +
>> +    INIT_LIST_HEAD(&svc->runq_elem);
>> +    INIT_LIST_HEAD(&svc->sdom_elem);
>> +    svc->flags = 0U;
>> +    svc->sdom = dd;
>> +    svc->vcpu = vc;
>> +    svc->last_start = 0;            /* init last_start is 0 */
>> +
> As it is, the comment is not very useful... we can see you're
> initializing it to zero. If there's something about the why you do it
> that you think it's worth mentioning, go ahead, otherwise, kill it.
>
>> +    svc->period = RT_DEFAULT_PERIOD;
>> +    if ( !is_idle_vcpu(vc) )
>> +        svc->budget = RT_DEFAULT_BUDGET;
>> +
>> +    count = (now/MICROSECS(svc->period)) + 1;
>> +    /* sync all VCPU's start time to 0 */
>> +    svc->cur_deadline += count * MICROSECS(svc->period);
>> +
> why "+="? What value do you expect cur_deadline to hold at this time?

It should be "=". This does not cause a bug, because it is in
rt_alloc_vdata() and svc->cur_deadline is still 0 there, but "+=" is
not logically correct. I have changed it. Thank you!

>
> Also, unless I'm missing something, or doing the math wrong, what you
> want to do here is place the deadline one period ahead of NOW().

Ah, no. I think you are thinking of the CBS server mechanism. (Now I
start to use some notation from the real-time literature.) For a
deferrable server, if the release offset of an implicit-deadline
deferrable vcpu is O_i, its deadlines are O_i + k * p_i, where k is a
natural number. So we want to set the deadline to the end of the
period in which NOW() falls.
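
A quick worked example with made-up numbers, matching the calculation
in the quoted code:

    /*
     * Suppose period p_i = 10ms, release offset O_i = 0, and the vcpu
     * is created at NOW() = 25ms:
     *
     *   count        = (25ms / 10ms) + 1 = 3
     *   cur_deadline = count * 10ms      = 30ms
     *
     * i.e. the deadline is the end of the period [20ms, 30ms) that
     * contains NOW(), not NOW() + p_i = 35ms.
     */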

>
> In my head, this is just:
>
>     svc->cur_deadline = NOW() + svc->period;
>
> What is it that fiddling with count is buying you that I don't see?

So that calculation would not be correct for a deferrable server. :-P

>
>> +    /* Debug only: dump new vcpu's info */
>> +    printtime();
>> +    rt_dump_vcpu(svc);
>> +
> As said before, this has to go, quite possibly by means of replacing it
> by some tracing.

When there are no more comments on this patch set, I will make
rt_dump_vcpu a static inline function that only contains the tracing.
Is that OK? For now I just want to keep some dumps around to help me
with quick debugging. :-)

>> +/*
>> + * TODO: same as rt_vcpu_insert()
>> + */
>> +static void
>> +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
>> +{
>> +    struct rt_vcpu * const svc = RT_VCPU(vc);
>> +    struct rt_dom * const sdom = svc->sdom;
>> +
>> +    printtime();
>> +    rt_dump_vcpu(svc);
>> +
>> +    BUG_ON( sdom == NULL );
>> +    BUG_ON( __vcpu_on_runq(svc) );
>> +
>> +    if ( !is_idle_vcpu(vc) )
>> +        list_del_init(&svc->sdom_elem);
>> +}
>> +
> I remember commenting about these two functions already, during RFC
> phase. It does not look like they've changed much, in response to that
> commenting.
>
> Can you please re-look at
> http://lists.xen.org/archives/html/xen-devel/2014-07/msg01624.html , and
> either reply/explain, or cope with that? In summary, what I was asking
> was whether adding/removing the vcpu to the list of vcpus of a domain
> could be done somewhere or somehow else, as doing it like this means,
> for some code-path, removing and then re-adding it right away.
>
> Looking more carefully, I do see that credit2 does the same thing but,
> for one, that doesn't make it perfect (I still don't like it :-P), and
> for two, credit2 does a bunch of other stuff there, that you don't.
>
> Looking even more carefully, this seems to be related to the TODO you
> put (which may be a new addition wrt RFCv2, I can't remember). So, yes,
> I seriously think you should take care of the fact that, when this
> function is called for moving a domain from a cpupool to another, you
> need to move the vcpus from the old cpupool's runqueue to the new one's
> runqueue.
>
> Have you tested doing such operation (moving domains between cpupools)?
> Because, with the code looking like this, it seems a bit unlikely...

Hmm, I actually did test this operation. I migrated a domU from Pool-0
to a newly created cpupool ("test") that uses the credit scheduler.
That's also the scenario I showed in the cover letter of this patch
set. :-) But I didn't try to migrate the domU back to Pool-0 and check
whether its vcpus were inserted back into the RunQ.

Now I have added the code that removes the vcpu from the RunQ, and
tested that it is removed from the RunQ when I move the domU to the
new cpupool.

As for removing the vcpu from the domain's vcpu list: I'm not touching
the vcpu list of struct domain{}, which is used by other components in
Xen. We have a scheduler-private structure rt_vcpu, whose runq_elem
field links the vcpu onto the scheduler's RunQ, and a
scheduler-private structure rt_dom (defined in sched_rt.c) that holds
the scheduler-specific data for the domain, including a list of the
domain's scheduler-specific vcpus. When we move a domU from one
cpupool to another (say, one using the rt scheduler), the domU's
scheduler-specific vcpus have to move to the destination cpupool's
RunQ. So we remove them from the source cpupool's RunQ in
rt_vcpu_remove() and insert them into the destination cpupool's RunQ
in rt_vcpu_insert().

Does this make sense to you?
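
Roughly, the updated remove hook in my tree now looks like this (a
sketch only; __runq_remove() is the assumed list-removal counterpart
of __runq_insert() in this patch):

    static void
    rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
    {
        struct rt_vcpu * const svc = RT_VCPU(vc);

        BUG_ON( svc->sdom == NULL );

        /* New: take the vcpu off the source cpupool's RunQ, if queued. */
        if ( __vcpu_on_runq(svc) )
            __runq_remove(svc);

        if ( !is_idle_vcpu(vc) )
            list_del_init(&svc->sdom_elem);
    }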

>
> Finally, the insert_vcpu hook is called from sched_init_vcpu() too,
> still in schedule.c. And in fact, all the other scheduler use this
> function to put a vcpu on its runqueue for the first time. Once you do
> this, you a fine behaviour, cpupool-wise, almost for free. You
> apparently don't. Why is that? If you'd go for a similar approach, you
> will get the same benefit. Where is it that you insert a vcpu in the
> RunQ for the first time? Again, why there and not here?

I have added a runq_insert() call in this function, so the vcpu is now
added to the RunQ here.
In the old patch, the vcpu was only inserted in the wake-up path. :-(
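
Concretely, the insert hook now does roughly the following (again a
sketch, with __runq_insert() being the queue-insert helper from this
patch):

    static void
    rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
    {
        struct rt_vcpu *svc = RT_VCPU(vc);

        /* New: queue the vcpu here, not only on its first wake-up. */
        if ( !is_idle_vcpu(vc) && vcpu_runnable(vc) && !__vcpu_on_runq(svc) )
            __runq_insert(ops, svc);

        /* ... plus the existing per-domain bookkeeping. */
    }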

>> +     * update deadline info: When deadline is in the past,
>> +     * it need to be updated to the deadline of the current period,
>> +     * and replenish the budget
>> +     */
>> +    delta = now - svc->cur_deadline;
>> +    if ( delta >= 0 )
>> +    {
>> +        count = ( delta/MICROSECS(svc->period) ) + 1;
>> +        svc->cur_deadline += count * MICROSECS(svc->period);
>> +        svc->cur_budget = svc->budget * 1000;
>> +
> Ditto, about time units and conversions.
>
> I also am sure I said before that I prefer an approach like:
>
>     while ( svc->cur_deadline < now )
>         svc->cur_deadline += svc->period;

I explained this in an earlier comment. :-) NOW() can be several
periods past the current deadline, and this is a deferrable server. :-)
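
For reference, this is the shape of the catch-up I am using. It is
equivalent to your while loop but O(1) when many periods have been
missed (a sketch, under the assumption that period, budget and
cur_deadline are already kept in the same time unit, which also
addresses your point about conversions):

    /*
     * Catch the deadline up to the end of the period containing 'now'
     * and replenish the budget (deferrable server). Equivalent to
     *
     *     while ( svc->cur_deadline <= now )
     *         svc->cur_deadline += svc->period;
     *
     * but constant-time regardless of how many periods have elapsed.
     */
    if ( now >= svc->cur_deadline )
    {
        s_time_t missed = now - svc->cur_deadline;

        svc->cur_deadline += ( missed / svc->period + 1 ) * svc->period;
        svc->cur_budget = svc->budget;
    }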

>> +    delta = now - svc->last_start;
>> +    /*
>> +     * delta < 0 only happens in nested virtualization;
>> +     * TODO: how should we handle delta < 0 in a better way? */
>>
> Ah, nice to hear you tested this in nested virt? Whay config is that,
> Xen on top of VMWare (I'm assuming this from what I've seen on the
> rt-xen mailing list).

I run Xen in VirtualBox on a MacBook Air; VirtualBox needs Intel VT-x
enabled in the VM's configuration. I actually developed the code in
VirtualBox, which reboots much faster. After it worked well in
VirtualBox, I tested it on the bare machine.

>
>> +    if ( delta < 0 )
>> +    {
>> +        printk("%s, ATTENTION: now is behind last_start! delta = %ld for ",
>> +                __func__, delta);
>> +        rt_dump_vcpu(svc);
>> +        svc->last_start = now;  /* update last_start */
>> +        svc->cur_budget = 0;   /* FIXME: should we recover like this? */
>> +        return;
>> +    }
>> +
> Bha, this just should not happen, and if it does, it's either a bug or
> an hardware problem, isn't it? I don't think you should do much more
> than this. I'd remove the rt_dump_vcpu() but, in this case, I'm fine
> with the printk(), as it's something that really should not happen, and
> we want to inform sysadmin that something has gone very bad. :-)

When I run Xen in VirtualBox and configure 4 (virtual) cores for the
VM, those 4 cores are backed by four threads on my host machine (i.e.,
the MacBook Air). My guess is that the virtual cores' clocks are not
well (or timely) synchronized, so when a vcpu migrates to another
virtual core, NOW() on that core can be behind the vcpu's last_start
time.

This never happens when I run Xen on the bare machine; it only happens
when I run Xen inside VirtualBox on the MacBook Air.


>> +    if ( svc->cur_budget == 0 )
>> +        return;
>> +
>> +    svc->cur_budget -= delta;
>> +    if ( svc->cur_budget < 0 )
>> +        svc->cur_budget = 0;
>> +
> I see you're not dealing with overruns (where with dealing I mean try to
> compensate). That's fine for now, let's leave this as a future
> development.
>

Sure.

>> +/*
>> + * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
>> + * lock is grabbed before calling this function
>> + */
>> +static struct rt_vcpu *
>> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
>> +{
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    struct rt_vcpu *svc = NULL;
>> +    struct rt_vcpu *iter_svc = NULL;
>> +    cpumask_t cpu_common;
>> +    cpumask_t *online;
>> +    struct rt_private * prv = RT_PRIV(ops);
>> +
>> +    list_for_each(iter, runq)
>> +    {
>> +        iter_svc = __runq_elem(iter);
>> +
>> +        /* flag vcpu */
>> +        if(iter_svc->sdom == NULL)
>> +            break;
>> +
>> +        /* mask is intersection of cpu_hard_affinity and cpupool and priv->cpus */
>> +        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
>> +        cpumask_and(&cpu_common, online, &prv->cpus);
>> +        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
>> +        cpumask_and(&cpu_common, &mask, &cpu_common);
>> +        if ( cpumask_empty(&cpu_common) )
>> +            continue;
>> +
> I remember asking this in
> http://lists.xen.org/archives/html/xen-devel/2014-07/msg01624.html:
>
> "What's in priv->cpus, BTW? How is it different form the cpupool online
> mask (returned in 'online' by cpupool_scheduler_cpumask() )?"
>
> while I don't remember having received an answer, and I see it's still.
> here in the code. Am I missing something? If not, can you explain?

prv->cpus is the set of online cpus for this scheduler. It should have
the same value as cpupool_scheduler_cpumask().

Should I remove it? (I plan to release the next version this weekend;
I will remove it once you confirm. :-))
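
If prv->cpus is indeed redundant, the mask computation in __runq_pick
would reduce to something like this (a sketch, assuming prv->cpus
really always equals the cpupool's online mask):

    /* Intersect the candidate mask with the cpupool's online cpus and
     * the vcpu's hard affinity; prv->cpus dropped as redundant. */
    online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
    cpumask_and(&cpu_common, online, iter_svc->vcpu->cpu_hard_affinity);
    cpumask_and(&cpu_common, &cpu_common, &mask);
    if ( cpumask_empty(&cpu_common) )
        continue;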


Thank you very much for your time and review!
I hope I have addressed all of your comments, except for the questions
I asked in this email. (I think I have addressed all of them now. :-P)

Thanks,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
