
Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap



On Thu, 2017-06-22 at 17:55 +0100, George Dunlap wrote:
> On 08/06/17 13:08, Dario Faggioli wrote:
> > This commit implements the Xen part of the cap mechanism for
> > Credit2.
> > 
> > A cap is how much, in terms of % of physical CPU time, a domain
> > can execute at most.
> > 
> > For instance, a domain that must not use more than 1/4 of one
> > physical CPU, must have a cap of 25%; one that must not use more
> > than 1+1/2 of physical CPU time, must be given a cap of 150%.
> > 
> > Caps are per domain, so it is all a domain's vCPUs, cumulatively,
> > that will be forced to execute no more than the decided amount.
> > 
> > This is implemented by giving each domain a 'budget', and using
> > a (per-domain again) periodic timer. Values of budget and 'period'
> > are chosen so that budget/period is equal to the cap itself.
> > 
> > Budget is burned by the domain's vCPUs, in a similar way to how
> > credits are.
> > 
> > When a domain runs out of budget, its vCPUs can't run any longer.
> > They can run again when the budget is replenished by the timer,
> > which happens once every period.
> > 
> > Blocking the vCPUs because of lack of budget happens by
> > means of a new (_VPF_parked) pause flag, so that, e.g.,
> > vcpu_runnable() still works. This is similar to what is
> > done in sched_rtds.c, as opposed to what happens in
> > sched_credit.c, where vcpu_pause() and vcpu_unpause() are used
> > (which means, among other things, more overhead).
> > 
> > Note that xenalyze and tools/xentrace/format are also modified,
> > to keep them updated with one modified event.
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> 
> Looks really good overall, Dario!  Just a few relatively minor
> comments.
> 
Cool! Thanks for having a look so quickly. :-)

> > --- a/xen/common/sched_credit2.c
> > +++ b/xen/common/sched_credit2.c
> > @@ -92,6 +92,82 @@
> > + *   Budget never expires, so at whatever time a vCPU wants to
> > run, it can
> > + *   check the domain's budget, and if there is some, it can use
> > it.
> 
> I'm not sure what this paragraph is trying to say. Saying budget
> "never
> expires" makes it sound like you continue to accumulate it, such that
> if
> you don't run at all for several periods, you could "save it up" and
> run
> at 100% for one full period.
> 
Yes, I see what you mean, and I agree that it's unclear. The point is
that there exist algorithms where the budget from a replenishment has
an expiry time before the next replenishment, i.e., it has to be used
right away (or, usually, shortly after), or it will be thrown away,
and we'll have to wait for the next replenishment.

But again, yes, saying "never expires", here, and in the way I did it,
and without any reference and comparison to those other algorithms,
makes people think what you just said.

I'll rephrase (or cut the paragraph entirely, it's probably not super
informative/useful anyway). :-)

> > + * - when a budget replenishment occurs, if there are vCPUs that
> > had been
> > + *   blocked because of lack of budget, they'll be unblocked, and
> > they will
> > + *   (potentially) be able to run again.
> > + *
> > + * Finally, some even more implementation related detail:
> > + *
> > + * - budget is stored in a domain-wide pool. vCPUs of the domain
> > that want
> > +   to run go to such pool, and grab some. When they do so, the
> > amount
> > + *   they grabbed is _immediately_ removed from the pool. This
> > happens in
> > + *   vcpu_try_to_get_budget();
> 
> This sounds like a good solution to the "greedy vcpu" problem. :-)
> 
I like it too. It works because it's coupled with the fact that the
runqueue is ordered by credits. E.g., doing the same in Credit1 would
probably still not work, because of the round-robin queues (within the
same priority), and because of the boost on wakeup.
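Roughly, the pool idea works like this (a simplified sketch, not the
actual patch code; struct and function names are illustrative, and
locking is omitted):

```c
#include <assert.h>

typedef long s_time_t;

/* Toy models of the per-domain and per-vCPU scheduling data. */
struct toy_dom {
    s_time_t budget;        /* budget left in the domain-wide pool */
};

struct toy_vcpu {
    struct toy_dom *sdom;
    s_time_t budget;        /* budget this vCPU has grabbed for itself */
};

/* Returns 1 if the vCPU has (or just got) budget to run with. */
static int toy_grab_budget(struct toy_vcpu *svc)
{
    if (svc->budget > 0)
        return 1;

    /*
     * svc->budget <= 0 here: the vCPU may have overrun a bit, so the
     * (negative) leftover is charged back to the pool first.
     */
    svc->sdom->budget += svc->budget;

    if (svc->sdom->budget > 0) {
        /* Grab what is left; it leaves the pool immediately. */
        svc->budget = svc->sdom->budget;
        svc->sdom->budget = 0;
        return 1;
    }

    svc->budget = 0;
    return 0;               /* no budget: the vCPU gets parked */
}
```

Since whatever a vCPU grabs leaves the pool right away, two vCPUs can
never be running on the same budget at the same time.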

> > @@ -1438,6 +1570,217 @@ void burn_credits(struct
> > +static bool vcpu_try_to_get_budget(struct csched2_vcpu *svc)
> 
> This name is OK, but it's a bit long.  What about
> "vcpu_grab_budget()"?
> 
Ok (and the other renaming suggestions too).

> > +{
> > +    struct csched2_dom *sdom = svc->sdom;
> > +    unsigned int cpu = svc->vcpu->processor;
> > +
> > +    ASSERT(spin_is_locked(per_cpu(schedule_data,
> > cpu).schedule_lock));
> > +
> > +    if ( svc->budget > 0 )
> > +        return true;
> > +
> > +    /* budget_lock nests inside runqueue lock. */
> > +    spin_lock(&sdom->budget_lock);
> > +
> > +    /*
> > +     * Here, svc->budget is <= 0 (as, if it was > 0, we'd have
> > taken the if
> > +     * above!). That basically means the vCPU has overrun a bit --
> > because of
> > +     * various reasons-- and we want to take that into account.
> > With the +=,
> > +     * we are actually subtracting the amount of budget the vCPU
> > has
> > +     * overconsumed, from the total domain budget.
> > +     */
> > +    sdom->budget += svc->budget;
> > +
> > +    if ( sdom->budget > 0 )
> > +    {
> > +        svc->budget = sdom->budget;
> > +        sdom->budget = 0;
> 
> Just a minor comment here -- I didn't see anything in this
> description,
> the cover letter, or this patch indicating that you were planning on
> changing this in a follow-up patch.  A heads-up not to worry about
> apparently allowing only one vcpu to run at a time might have been
> nice. :-)
> 
Right. I'll explain that in the cover, and in a comment here (which
will then go away).

> > +{
> > +    struct csched2_dom *sdom = data;
> > +    unsigned long flags;
> > +    s_time_t now;
> > +    LIST_HEAD(parked);
> > +
> > +    spin_lock_irqsave(&sdom->budget_lock, flags);
> > +
> > +    /*
> > +     * It is possible that the domain overrun, and that the budget
> > hence went
> > +     * below 0 (reasons may be system overbooking, issues in or
> > too coarse
> > +     * runtime accounting, etc.). In particular, if we overrun by
> > more than
> > +     * tot_budget, then budget+tot_budget would still be < 0,
> > which in turn
> > +     * means that, despite replenishment, there's still no budget
> > for unparking
> > +     * and running vCPUs.
> > +     *
> > +     * It is also possible that we are handling the replenishment
> > much later
> > +     * than expected (reasons may again be overbooking, or issues
> > with timers).
> > +     * If we are more than CSCHED2_BDGT_REPL_PERIOD late, this
> > means we have
> > +     * basically skipped (at least) one replenishment.
> > +     *
> > +     * We deal with both the issues here, by, basically, doing
> > more than just
> > +     * one replenishment. Note, however, that every time we add
> > tot_budget
> > +     * to the budget, we also move next_repl away by
> > CSCHED2_BDGT_REPL_PERIOD.
> > +     * This guarantees we always respect the cap.
> > +     */
> > +    now = NOW();
> > +    do
> > +    {
> > +        sdom->next_repl += CSCHED2_BDGT_REPL_PERIOD;
> > +        sdom->budget += sdom->tot_budget;
> > +    }
> > +    while ( sdom->next_repl <= now || sdom->budget <= 0 );
> 
> The first clause ("oops, accidentally missed a replenishment period")
> I
> agree with; 
>
Ok.

> but I'm going back and forth a bit on the second one.  It
> means essentially that the scheduler made a mistake and allowed the
> VM
> to run for one full budget *more* than its allocated time (perhaps
> accumulated over several periods).
> 
No, the budget does not accumulate. Or at least, it does, but only up
to the original tot_budget.

So, basically, the reason why budget may still be <0, after a
replenishment of tot_budget, is that something went wrong, and we let
the vcpu overrun for more than tot_budget.

It really should never happen (I may actually add a WARN()), unless the
accounting is very coarse, or the budget is really small (i.e., the
budget is small compared to the resolution we can achieve for the
accounting).
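To make the above concrete, here is a sketch of what the replenishment
logic amounts to (simplified from the patch; names and the period
constant are illustrative), including the clamping of the accumulated
budget to tot_budget:

```c
#include <assert.h>

typedef long s_time_t;

#define TOY_REPL_PERIOD 100      /* illustrative period length */

struct toy_dom {
    s_time_t budget;             /* may be < 0 after an overrun */
    s_time_t tot_budget;         /* budget granted per period */
    s_time_t next_repl;          /* when the next replenishment is due */
};

/*
 * Perform as many replenishments as needed so that, at time 'now',
 * the domain has a positive budget and a replenishment time in the
 * future. Every tot_budget added moves next_repl one period away,
 * which is what keeps the cap respected overall.
 */
static void toy_replenish(struct toy_dom *sdom, s_time_t now)
{
    do {
        sdom->next_repl += TOY_REPL_PERIOD;
        sdom->budget += sdom->tot_budget;
    } while (sdom->next_repl <= now || sdom->budget <= 0);

    /* Budget accumulates, but only up to the original tot_budget. */
    if (sdom->budget > sdom->tot_budget)
        sdom->budget = sdom->tot_budget;
}
```

So a domain that overran by more than tot_budget pays it back over more
than one period, while a domain whose replenishments were handled late
still cannot pile up more than one period's worth of budget.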

> On the one hand, it's the scheduler that made a mistake, so we
> shouldn't
> "punish" a domain by giving it a full period with no budget.
> 
Yes, I think I understand what you mean. However, I would not
necessarily call this "punishing" the domain. We're just making sure
that cap is enforced, even during (hopefully sporadic and only
transient) tricky circumstances where the scheduler got something
wrong, and for those domains that have (perhaps not maliciously, but
still) already taken advantage of such a mistake.

In fact, assume you have a domain that wants to execute W amount of
work every T time, but has a cap that results in it having a budget of
C<<W every T. Under normal circumstances, it executes for C between 0
and T, for C between T and 2T, for C between 2T and 3T, etc., until it
reaches W. So, after 3T, it will have executed for 3C.

In the presence of an accounting/enforcement error, it may happen that it
executes for C between 0 and T, for 2C between T and 2T, for 0 between
2T and 3T, etc. So, after 3T, it will also have executed for 3C, as
above.

What I'm trying to say is that, yes, we're making it skip a period, but
that's only because we gave it more before, and it actually used the
additional capacity we granted by mistake. So, it's not really a
punishment, it's... well... recognising that we gave it too big of a
reward. And if it has spent that already, we don't blame it, we just
skip the next one, to make things even.
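The arithmetic of the example can be checked with a quick simulation
(a toy model, illustrative only: budget of C per period, no
accumulation past C, negative budget carried over):

```c
#include <assert.h>

/*
 * Simulate n periods of a domain with per-period budget C.
 * ran[i] is how much the domain actually executed in period i
 * (possibly more than C, if enforcement went wrong); a period that
 * starts with no budget executes 0 regardless of ran[i].
 * Returns the total execution time over the n periods.
 */
static long toy_simulate(long C, const long *ran, int n)
{
    long budget = 0, total = 0;
    int i;

    for (i = 0; i < n; i++) {
        budget += C;              /* per-period replenishment */
        if (budget > C)
            budget = C;           /* no accumulation past C */
        if (budget <= 0)
            continue;             /* parked: paying back an overrun */
        total += ran[i];
        budget -= ran[i];         /* may go negative on an overrun */
    }
    return total;
}
```

Either way the domain gets 3C out of 3T; the overrunning one just got
it front-loaded.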

> On the other hand, if there's a reliable way to trick the scheduler
> into
> allowing a guest to over-run its budget, this allows a rogue guest to
> essentially work around the cap.
> 
Exactly. There really should not be a way of triggering this behavior
but, if someone finds it, this is a safety/robustness measure for
always having cap enforced.

And remember that this only applies if the vcpu has actually overrun.
As in, if the scheduler gets confused and comes to do the budget
enforcement late, during the T-2T period, but the vcpu has only
executed for C, then it (the vcpu) does not have to pay any price for
the scheduler's mistake (that, indeed, would be unfair).

> > @@ -2423,14 +2785,22 @@ csched2_runtime(const struct scheduler
> > *ops, int cpu,
> >       * credit values of MIN,MAX per vcpu, since each vcpu burns
> > credit
> >       * at a different rate.
> >       */
> > -    if (rt_credit > 0)
> > +    if ( rt_credit > 0 )
> >          time = c2t(rqd, rt_credit, snext);
> >      else
> >          time = 0;
> >  
> > -    /* 3) But never run longer than MAX_TIMER or less than
> > MIN_TIMER or
> > -     * the rate_limit time. */
> > -    if ( time < min_time)
> > +    /*
> > +     * 3) But, if capped, never run more than our budget.
> > +     */
> > +    if ( unlikely(has_cap(snext)) )
> > +        time = snext->budget < time ? snext->budget : time;
> > +
> > +    /*
> > +     * 4) But never run longer than MAX_TIMER or less than
> > MIN_TIMER or
> 
> Two "buts" in a row doesn't work.  I'd change this to "and", which
> sort
> of follows on from the last "but".
> 
Ok, will do.
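For reference, the resulting order of the checks is something like the
following sketch (constants and names are illustrative, not the actual
function):

```c
#include <assert.h>

typedef long s_time_t;

#define TOY_MIN_TIMER 10        /* illustrative */
#define TOY_MAX_TIMER 80        /* illustrative */

/*
 * 2) turn leftover credit into runtime; 3) if capped, never run more
 * than the remaining budget; 4) and clamp between MIN and MAX timer.
 */
static s_time_t toy_runtime(s_time_t credit_time, int capped,
                            s_time_t budget)
{
    s_time_t time = credit_time > 0 ? credit_time : 0;

    if (capped && budget < time)
        time = budget;

    if (time < TOY_MIN_TIMER)
        time = TOY_MIN_TIMER;
    if (time > TOY_MAX_TIMER)
        time = TOY_MAX_TIMER;

    return time;
}
```

Note that, with this ordering, the MIN_TIMER clamp wins over the
budget, so a capped vCPU with a tiny residual budget still runs for at
least MIN_TIMER; the small resulting overrun is exactly what the
`sdom->budget += svc->budget` accounting then absorbs.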

Thanks again and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

