
Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap



On 08/06/17 13:08, Dario Faggioli wrote:
> This commit implements the Xen part of the cap mechanism for
> Credit2.
> 
> A cap is how much, in terms of % of physical CPU time, a domain
> can execute at most.
> 
> For instance, a domain that must not use more than 1/4 of one
> physical CPU, must have a cap of 25%; one that must not use more
> than 1+1/2 of physical CPU time, must be given a cap of 150%.
> 
> Caps are per domain, so it is all a domain's vCPUs, cumulatively,
> that will be forced to execute no more than the decided amount.
> 
> This is implemented by giving each domain a 'budget', and using
> a (per-domain again) periodic timer. Values of budget and 'period'
> are chosen so that budget/period is equal to the cap itself.
> 
> Budget is burned by the domain's vCPUs, in a similar way to how
> credits are.
> 
> When a domain runs out of budget, its vCPUs can't run any longer.
> They can run again when the budget is replenished by the timer, which
> happens once every period.
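
To make the arithmetic concrete (illustrative numbers and an invented helper
name; the real patch presumably derives this from CSCHED2_BDGT_REPL_PERIOD):

    /* Sketch only: map a cap (in %) to a per-period budget. */
    #define EXAMPLE_PERIOD MILLISECS(10)      /* made-up 10ms period */

    static s_time_t cap_to_budget(unsigned int cap_pct)
    {
        /* e.g. cap_pct == 150  =>  15ms of budget every 10ms period */
        return EXAMPLE_PERIOD * cap_pct / 100;
    }

So a 25% cap would mean 2.5ms per 10ms period for the whole domain, and a
150% cap 15ms per period, spread across however many vCPUs the domain has.
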
> 
> Blocking the vCPUs because of lack of budget happens by
> means of a new (_VPF_parked) pause flag, so that, e.g.,
> vcpu_runnable() still works. This is similar to what is
> done in sched_rtds.c, as opposed to what happens in
> sched_credit.c, where vcpu_pause() and vcpu_unpause() are used
> (which means, among other things, more overhead).
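
(For anyone following along: vcpu_runnable() is, roughly -- quoting from
memory, the exact form in sched.h may differ --

    static inline int vcpu_runnable(struct vcpu *v)
    {
        return !(v->pause_flags |
                 atomic_read(&v->pause_count) |
                 atomic_read(&v->domain->pause_count));
    }

so setting a bit in pause_flags is enough to keep the scheduler from picking
the vCPU, without the heavier pause/unpause machinery.)
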
> 
> Note that xenalyze and tools/xentrace/format are also modified,
> to keep them up to date with the one modified event.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Looks really good overall, Dario!  Just a few relatively minor comments.

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 126417c..ba4bf4b 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -92,6 +92,82 @@
>   */
>  
>  /*
> + * Utilization cap:
> + *
> + * Setting a pCPU utilization cap for a domain means the following:
> + *
> + * - a domain can have a cap, expressed in terms of % of physical CPU time.
> + *   A domain that must not use more than 1/4 of _one_ physical CPU, will
> + *   be given a cap of 25%; a domain that must not use more than 1+1/2 of
> + *   physical CPU time, will be given a cap of 150%;
> + *
> + * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and
> + *   a 40% cap, that one vCPU will use 40% of one pCPU. If a domain has 4
> + *   vCPUs, and a 200% cap, all its 4 vCPUs are allowed to run for (the
> + *   equivalent of) 100% time on 2 pCPUs. How much each of the various 4
> + *   vCPUs will get, is unspecified (will depend on various aspects:
> + *   workload, system load, etc.).
> + *
> + * For implementing this, we use the following approach:
> + *
> + * - each domain is given a 'budget', and each domain has a timer, which
> + *   replenishes the domain's budget periodically. The budget is the amount
> + *   of time the vCPUs of the domain can use every 'period';
> + *
> + * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains
> + *   (but each domain has its own timer; so they are all periodic with the
> + *   same period, but replenishment of the budgets of the various domains,
> + *   at period boundaries, is not synchronized);
> + *
> + * - when vCPUs run, they consume budget. When they don't run, they don't
> + *   consume budget. If there is no budget left for the domain, no vCPU of
> + *   that domain can run. If a vCPU tries to run and finds that there is no
> + *   budget, it blocks.
> + *   Budget never expires, so at whatever time a vCPU wants to run, it can
> + *   check the domain's budget, and if there is some, it can use it.

I'm not sure what this paragraph is trying to say. Saying budget "never
expires" makes it sound like you continue to accumulate it, such that if
you don't run at all for several periods, you could "save it up" and run
at 100% for one full period.

But that's contradicted by...

> + * - budget is replenished to the top of the capacity for the domain once
> + *   per period. Even if there was some leftover budget from previous period,
> + *   though, the budget after a replenishment will always be at most equal
> + *   to the total capacity of the domain ('tot_budget');

...this paragraph.
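
If the second reading is the intended one, maybe rephrase the first paragraph;
as described, each replenishment effectively amounts to (sketch, not quoting
the patch):

    /* Top up once per period, but never above the domain's capacity. */
    sdom->budget += sdom->tot_budget;
    if ( sdom->budget > sdom->tot_budget )
        sdom->budget = sdom->tot_budget;
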

> + * - when a budget replenishment occurs, if there are vCPUs that had been
> + *   blocked because of lack of budget, they'll be unblocked, and they will
> + *   (potentially) be able to run again.
> + *
> + * Finally, some even more implementation-related details:
> + *
> + * - budget is stored in a domain-wide pool. vCPUs of the domain that want
> + *   to run go to that pool, and grab some. When they do so, the amount
> + *   they grabbed is _immediately_ removed from the pool. This happens in
> + *   vcpu_try_to_get_budget();

This sounds like a good solution to the "greedy vcpu" problem. :-)

> + * - when vCPUs stop running, if they've not consumed all the budget they
> + *   took, the leftover is put back in the pool. This happens in
> + *   vcpu_give_budget_back();
> + *
> + * - the above means that a vCPU can find out that there is no budget and
> + *   block, not only if the cap has actually been reached (for this period),
> + *   but also if some other vCPUs, in order to run, have grabbed a certain
> + *   quota of budget, no matter whether they've already used it all or not.
> + *   A vCPU blocking because (any form of) lack of budget is said to be
> + *   "parked", and such blocking happens in park_vcpu();
> + *
> + * - when a vCPU stops running, and puts back some budget in the domain pool,
> + *   we need to check whether there is someone which has been parked and that
> + *   can be unparked. This happens in unpark_parked_vcpus(), called from
> + *   csched2_context_saved();
> + *
> + * - of course, unparking happens also as a consequene of the domain's budget

*consequence

> @@ -293,6 +382,14 @@ static int __read_mostly opt_underload_balance_tolerance = 0;
>  integer_param("credit2_balance_under", opt_underload_balance_tolerance);
>  static int __read_mostly opt_overload_balance_tolerance = -3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
> +/*
> + * Domains subject to a cap, receive a replenishment of their runtime budget

Unnecessary comma.


> @@ -1438,6 +1570,217 @@ void burn_credits(struct csched2_runqueue_data *rqd,
>      }
>  }
>  
> +/*
> + * Budget-related code.
> + */
> +
> +static void park_vcpu(struct csched2_vcpu *svc)
> +{
> +    struct vcpu *v = svc->vcpu;
> +
> +    ASSERT(spin_is_locked(&svc->sdom->budget_lock));
> +
> +    /*
> +     * It was impossible to find budget for this vCPU, so it has to be
> +     * "parked". This implies it is not runnable, so we mark it as such in
> +     * its pause_flags. If the vCPU is currently scheduled (which means we
> +     * are here after being called from within csched2_schedule()), flagging
> +     * is enough, as we'll choose someone else, and then context_saved()
> +     * will take care of updating the load properly.
> +     *
> +     * If, OTOH, the vCPU is sitting in the runqueue (which means we are here
> +     * after being called from within runq_candidate()), we must go all the
> +     * way down to taking it out of there, and updating the load accordingly.
> +     *
> +     * In both cases, we also add it to the list of parked vCPUs of the
> +     * domain.
> +     */
> +    __set_bit(_VPF_parked, &v->pause_flags);
> +    if ( vcpu_on_runq(svc) )
> +    {
> +        runq_remove(svc);
> +        update_load(svc->sdom->dom->cpupool->sched, svc->rqd, svc, -1, NOW());
> +    }
> +    list_add(&svc->parked_elem, &svc->sdom->parked_vcpus);
> +}
> +
> +static bool vcpu_try_to_get_budget(struct csched2_vcpu *svc)

This name is OK, but it's a bit long.  What about "vcpu_grab_budget()"?

> +{
> +    struct csched2_dom *sdom = svc->sdom;
> +    unsigned int cpu = svc->vcpu->processor;
> +
> +    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
> +
> +    if ( svc->budget > 0 )
> +        return true;
> +
> +    /* budget_lock nests inside runqueue lock. */
> +    spin_lock(&sdom->budget_lock);
> +
> +    /*
> +     * Here, svc->budget is <= 0 (as, if it was > 0, we'd have taken the if
> +     * above!). That basically means the vCPU has overrun a bit --because of
> +     * various reasons-- and we want to take that into account. With the +=,
> +     * we are actually subtracting the amount of budget the vCPU has
> +     * overconsumed, from the total domain budget.
> +     */
> +    sdom->budget += svc->budget;
> +
> +    if ( sdom->budget > 0 )
> +    {
> +        svc->budget = sdom->budget;
> +        sdom->budget = 0;

Just a minor comment here -- I didn't see anything in this description,
the cover letter, or this patch indicating that you were planning on
changing this in a follow-up patch.  A heads-up not to worry about
apparently allowing only one vcpu to run at a time might have been nice. :-)
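
Just to spell out the kind of thing I'd have expected such a follow-up to do
(entirely illustrative; the field and the split are invented):

    /* Grab only a slice of the pool, so several vCPUs can hold budget
     * at the same time, rather than one vCPU draining the whole pool. */
    s_time_t quota = sdom->tot_budget / sdom->nr_vcpus;  /* invented field */

    svc->budget = sdom->budget < quota ? sdom->budget : quota;
    sdom->budget -= svc->budget;
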

> +    }
> +    else
> +    {
> +        svc->budget = 0;
> +        park_vcpu(svc);
> +    }
> +
> +    spin_unlock(&sdom->budget_lock);
> +
> +    return svc->budget > 0;
> +}
> +
> +static void
> +vcpu_give_budget_back(struct csched2_vcpu *svc, struct list_head *parked)

"vcpu_return_budget"?

> +{
> +    struct csched2_dom *sdom = svc->sdom;
> +    unsigned int cpu = svc->vcpu->processor;
> +
> +    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
> +    ASSERT(list_empty(parked));
> +
> +    /* budget_lock nests inside runqueue lock. */
> +    spin_lock(&sdom->budget_lock);
> +
> +    /*
> +     * The vCPU is stopping running (e.g., because it's blocking, or it has
> +     * been preempted). If it hasn't consumed all the budget it got when
> +     * starting to run, put that remaining amount back in the domain's budget
> +     * pool.
> +     */
> +    sdom->budget += svc->budget;
> +    svc->budget = 0;
> +
> +    /*
> +     * Making budget available again to the domain means that parked vCPUs
> +     * may be unparked and run. They are, if any, in the domain's parked_vcpus
> +     * list, so we want to go through that and unpark them (so they can try
> +     * to get some budget).
> +     *
> +     * Touching the list requires the budget_lock, which we hold. Let's
> +     * therefore put everyone in that list in another, temporary list, which
> +     * then the caller will traverse, unparking the vCPUs it finds there.
> +     *
> +     * In fact, we can't do the actual unparking here, because that requires
> +     * taking the runqueue lock of the vCPUs being unparked, and we can't
> +     * take any runqueue locks while we hold a budget_lock.
> +     */
> +    if ( sdom->budget > 0 )
> +        list_splice_init(&sdom->parked_vcpus, parked);
> +
> +    spin_unlock(&sdom->budget_lock);
> +}
> +
> +static void
> +unpark_parked_vcpus(const struct scheduler *ops, struct list_head *vcpus)
> +{
> +    struct csched2_vcpu *svc, *tmp;
> +    spinlock_t *lock;
> +
> +    list_for_each_entry_safe(svc, tmp, vcpus, parked_elem)
> +    {
> +        unsigned long flags;
> +        s_time_t now;
> +
> +        lock = vcpu_schedule_lock_irqsave(svc->vcpu, &flags);
> +
> +        __clear_bit(_VPF_parked, &svc->vcpu->pause_flags);
> +        if ( unlikely(svc->flags & CSFLAG_scheduled) )
> +        {
> +            /*
> +             * We end here if a budget replenishment arrived between
> +             * csched2_schedule() (and, in particular, after a call to
> +             * vcpu_try_to_get_budget() that returned false), and
> +             * context_saved(). By setting __CSFLAG_delayed_runq_add,
> +             * we tell context_saved() to put the vCPU back in the
> +             * runqueue, from where it will compete with the others
> +             * for the newly replenished budget.
> +             */
> +            ASSERT( svc->rqd != NULL );
> +            ASSERT( c2rqd(ops, svc->vcpu->processor) == svc->rqd );
> +            __set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
> +        }
> +        else if ( vcpu_runnable(svc->vcpu) )
> +        {
> +            /*
> +             * The vCPU should go back to the runqueue, and compete for
> +             * the newly replenished budget, but only if it is actually
> +             * runnable (and was therefore offline only because of the
> +             * lack of budget).
> +             */
> +            now = NOW();
> +            update_load(ops, svc->rqd, svc, 1, now);
> +            runq_insert(ops, svc);
> +            runq_tickle(ops, svc, now);
> +        }
> +        list_del_init(&svc->parked_elem);
> +
> +        vcpu_schedule_unlock_irqrestore(lock, flags, svc->vcpu);
> +    }
> +}
> +
> +static void repl_sdom_budget(void* data)

Why not just "replenish_budget"?

> +{
> +    struct csched2_dom *sdom = data;
> +    unsigned long flags;
> +    s_time_t now;
> +    LIST_HEAD(parked);
> +
> +    spin_lock_irqsave(&sdom->budget_lock, flags);
> +
> +    /*
> +     * It is possible that the domain overran, and that the budget hence went
> +     * below 0 (reasons may be system overbooking, issues in, or too coarse,
> +     * runtime accounting, etc.). In particular, if we overrun by more than
> +     * tot_budget, then budget+tot_budget would still be < 0, which in turn
> +     * means that, despite replenishment, there's still no budget for
> +     * unparking and running vCPUs.
> +     *
> +     * It is also possible that we are handling the replenishment much later
> +     * than expected (reasons may again be overbooking, or issues with timers).
> +     * If we are more than CSCHED2_BDGT_REPL_PERIOD late, this means we have
> +     * basically skipped (at least) one replenishment.
> +     *
> +     * We deal with both the issues here, by, basically, doing more than just
> +     * one replenishment. Note, however, that every time we add tot_budget
> +     * to the budget, we also move next_repl away by CSCHED2_BDGT_REPL_PERIOD.
> +     * This guarantees we always respect the cap.
> +     */
> +    now = NOW();
> +    do
> +    {
> +        sdom->next_repl += CSCHED2_BDGT_REPL_PERIOD;
> +        sdom->budget += sdom->tot_budget;
> +    }
> +    while ( sdom->next_repl <= now || sdom->budget <= 0 );

The first clause ("oops, accidentally missed a replenishment period") I
agree with; but I'm going back and forth a bit on the second one.  It
means essentially that the scheduler made a mistake and allowed the VM
to run for one full budget *more* than its allocated time (perhaps
accumulated over several periods).

On the one hand, it's the scheduler that made a mistake, so we shouldn't
"punish" a domain by giving it a full period with no budget.

On the other hand, if there's a reliable way to trick the scheduler into
allowing a guest to over-run its budget, this allows a rogue guest to
essentially work around the cap.

Know what I mean?
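
In code terms, the stricter alternative (making the domain pay the over-run
back, rather than forgiving it) would just mean dropping the second clause --
something like:

    do
    {
        sdom->next_repl += CSCHED2_BDGT_REPL_PERIOD;
        sdom->budget += sdom->tot_budget;
    }
    while ( sdom->next_repl <= now );
    /* budget may still be <= 0 here; the domain then stays parked until
     * it has paid back the time it over-ran by. */

Anyway, just something to think about.
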

> @@ -2423,14 +2785,22 @@ csched2_runtime(const struct scheduler *ops, int cpu,
>       * credit values of MIN,MAX per vcpu, since each vcpu burns credit
>       * at a different rate.
>       */
> -    if (rt_credit > 0)
> +    if ( rt_credit > 0 )
>          time = c2t(rqd, rt_credit, snext);
>      else
>          time = 0;
>  
> -    /* 3) But never run longer than MAX_TIMER or less than MIN_TIMER or
> -     * the rate_limit time. */
> -    if ( time < min_time)
> +    /*
> +     * 3) But, if capped, never run more than our budget.
> +     */
> +    if ( unlikely(has_cap(snext)) )
> +        time = snext->budget < time ? snext->budget : time;
> +
> +    /*
> +     * 4) But never run longer than MAX_TIMER or less than MIN_TIMER or

Two "buts" in a row doesn't work.  I'd change this to "and", which sort
of follows on from the last "but".

Thanks,
 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

