Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap
On 06/08/2017 01:08 PM, Dario Faggioli wrote:
> This commit implements the Xen part of the cap mechanism for
> Credit2.
>
> A cap is how much, in terms of % of physical CPU time, a domain
> can execute at most.
>
> For instance, a domain that must not use more than 1/4 of one
> physical CPU must have a cap of 25%; one that must not use more
> than 1+1/2 of physical CPU time must be given a cap of 150%.
>
> Caps are per domain, so it is all of a domain's vCPUs, cumulatively,
> that will be forced to execute no more than the decided amount.
>
> This is implemented by giving each domain a 'budget', and using
> a (per-domain again) periodic timer. Values of budget and 'period'
> are chosen so that budget/period is equal to the cap itself.
>
> Budget is burned by the domain's vCPUs in a similar way to how
> credits are.
>
> When a domain runs out of budget, its vCPUs can't run any longer.
> They can run again when the budget is replenished by the timer,
> an event which happens once every period.
>
> Blocking the vCPUs because of lack of budget happens by
> means of a new (_VPF_parked) pause flag, so that, e.g.,
> vcpu_runnable() still works. This is similar to what is
> done in sched_rtds.c, as opposed to what happens in
> sched_credit.c, where vcpu_pause() and vcpu_unpause() are
> used (which means, among other things, more overhead).
>
> Note that xenalyze and tools/xentrace/formats are also modified,
> to keep them in sync with the one modified trace event.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> ---
> Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> Cc: Anshul Makkar <anshul.makkar@xxxxxxxxxx>
> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Cc: Jan Beulich <jbeulich@xxxxxxxx>
> Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> Cc: Wei Liu <wei.liu2@xxxxxxxxxx>
> ---
>  tools/xentrace/formats     |    2
>  tools/xentrace/xenalyze.c  |   10 +
>  xen/common/sched_credit2.c |  470 +++++++++++++++++++++++++++++++++++++++++---
>  xen/include/xen/sched.h    |    3
>  4 files changed, 445 insertions(+), 40 deletions(-)
>
> diff --git a/tools/xentrace/formats b/tools/xentrace/formats
> index 8b31780..142b0cf 100644
> --- a/tools/xentrace/formats
> +++ b/tools/xentrace/formats
> @@ -51,7 +51,7 @@
>
>  0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
>  0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos       [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
> -0x00022203  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit burn    [ dom:vcpu = 0x%(1)08x, credit = %(2)d, delta = %(3)d ]
> +0x00022203  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit burn    [ dom:vcpu = 0x%(1)08x, credit = %(2)d, budget = %(3)d, delta = %(4)d ]
>  0x00022204  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_add
>  0x00022205  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle_check   [ dom:vcpu = 0x%(1)08x, credit = %(2)d ]
>  0x00022206  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle         [ cpu = %(1)d ]
> diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
> index fa608ad..c16c02d 100644
> --- a/tools/xentrace/xenalyze.c
> +++ b/tools/xentrace/xenalyze.c
> @@ -7680,12 +7680,14 @@ void sched_process(struct pcpu_info *p)
>              if(opt.dump_all) {
>                  struct {
>                      unsigned int vcpuid:16, domid:16;
> -                    int credit, delta;
> +                    int credit, budget, delta;
>                  } *r = (typeof(r))ri->d;
>
> -                printf(" %s csched2:burn_credits d%uv%u, credit = %d, delta = %d\n",
> -                       ri->dump_header, r->domid, r->vcpuid,
> -                       r->credit, r->delta);
> +                printf(" %s csched2:burn_credits d%uv%u, credit = %d, ",
> +                       ri->dump_header, r->domid, r->vcpuid, r->credit);
> +                if ( r->budget != INT_MIN )
> +                    printf("budget = %d, ", r->budget);
> +                printf("delta = %d\n", r->delta);
>              }
>              break;
>          case TRC_SCHED_CLASS_EVT(CSCHED2, 5):  /* TICKLE_CHECK */
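The extended trace record that xenalyze reads above (credit, budget, delta) implies a matching producer on the Xen side. The sketch below shows how such a record might be emitted; the helper name trace_credit_burn() and the has_cap() predicate are illustrative assumptions, not necessarily the patch's actual code (only the INT_MIN "no budget info" sentinel is implied by the xenalyze hunk above).

/* Illustrative sketch: emitting the extended burn-credits record. */
static inline void trace_credit_burn(struct csched2_vcpu *svc, int delta)
{
    if ( unlikely(tb_init_done) )
    {
        struct {
            unsigned vcpuid:16, domid:16;
            int credit, budget, delta;
        } d;

        d.domid = svc->vcpu->domain->domain_id;
        d.vcpuid = svc->vcpu->vcpu_id;
        d.credit = svc->credit;
        /* INT_MIN tells xenalyze "no cap, so no budget to report". */
        d.budget = has_cap(svc) ? (int)svc->budget : INT_MIN;
        d.delta = delta;
        __trace_var(TRC_CSCHED2_CREDIT_BURN, 1, sizeof(d),
                    (unsigned char *)&d);
    }
}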
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 126417c..ba4bf4b 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -92,6 +92,82 @@
>   */
>
>  /*
> + * Utilization cap:
> + *
> + * Setting a pCPU utilization cap for a domain means the following:
> + *
> + * - a domain can have a cap, expressed in terms of % of physical CPU time.
> + *   A domain that must not use more than 1/4 of _one_ physical CPU will
> + *   be given a cap of 25%; a domain that must not use more than 1+1/2 of
> + *   physical CPU time will be given a cap of 150%;
> + *
> + * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and
> + *   a 40% cap, that one vCPU will use 40% of one pCPU. If a domain has 4
> + *   vCPUs, and a 200% cap, all its 4 vCPUs are allowed to run for (the
> + *   equivalent of) 100% time on 2 pCPUs. How much each of the 4 vCPUs
> + *   will get is unspecified (it will depend on various aspects: workload,
> + *   system load, etc.).
> + *
> + * For implementing this, we use the following approach:
> + *
> + * - each domain is given a 'budget', and each domain has a timer, which
> + *   replenishes the domain's budget periodically. The budget is the amount
> + *   of time the vCPUs of the domain can use every 'period';
> + *
> + * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains
> + *   (but each domain has its own timer; so they are all periodic with the
> + *   same period, but replenishment of the budgets of the various domains,
> + *   at period boundaries, is not synchronous);
> + *
> + * - when vCPUs run, they consume budget. When they don't run, they don't
> + *   consume budget. If there is no budget left for the domain, no vCPU of
> + *   that domain can run. If a vCPU tries to run and finds that there is no
> + *   budget, it blocks.
> + *   Budget never expires, so at whatever time a vCPU wants to run, it can
> + *   check the domain's budget, and if there is some, it can use it.
> + *
> + * - budget is replenished to the top of the capacity for the domain once
> + *   per period. Even if there was some leftover budget from the previous
> + *   period, though, the budget after a replenishment will always be at
> + *   most equal to the total capacity of the domain ('tot_budget');
> + *
> + * - when a budget replenishment occurs, if there are vCPUs that had been
> + *   blocked because of lack of budget, they'll be unblocked, and they will
> + *   (potentially) be able to run again.
> + *
> + * Finally, some even more implementation-related details:
> + *
> + * - budget is stored in a domain-wide pool. vCPUs of the domain that want
> + *   to run go to such pool, and grab some. When they do so, the amount
> + *   they grabbed is _immediately_ removed from the pool. This happens in
> + *   vcpu_try_to_get_budget();
> + *
> + * - when vCPUs stop running, if they've not consumed all the budget they
> + *   took, the leftover is put back in the pool. This happens in
> + *   vcpu_give_budget_back();
> + *
> + * - the above means that a vCPU can find out that there is no budget and
> + *   block, not only if the cap has actually been reached (for this period),
> + *   but also if some other vCPUs, in order to run, have grabbed a certain
> + *   quota of budget, no matter whether they've already used it all or not.
> + *   A vCPU blocking because of (any form of) lack of budget is said to be
> + *   "parked", and such blocking happens in park_vcpu();
> + *
> + * - when a vCPU stops running, and puts back some budget in the domain
> + *   pool, we need to check whether there is someone that has been parked
> + *   and that can be unparked. This happens in unpark_parked_vcpus(),
> + *   called from csched2_context_saved();
> + *
> + * - of course, unparking happens also as a consequence of the domain's
> + *   budget being replenished by the periodic timer. This also occurs by
> + *   means of calling csched2_context_saved() (but from repl_sdom_budget());
> + *
> + * - parked vCPUs of a domain are kept in a (per-domain) list, called
> + *   'parked_vcpus'. Manipulation of the list and of the domain-wide budget
> + *   pool must occur only when holding the 'budget_lock'.
> + */
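The bodies of vcpu_try_to_get_budget() and vcpu_give_budget_back() are not part of this excerpt, so here is a rough sketch of the grab/give-back pair the comment describes. The per-grab quota (CSCHED2_MIN_TIMER here) and the exact return convention are assumptions; only the pool-and-lock behaviour follows from the text above.

/* Illustrative sketch of the budget pool mechanics described above. */
static bool vcpu_try_to_get_budget(struct csched2_vcpu *svc)
{
    struct csched2_dom *sdom = svc->sdom;

    spin_lock(&sdom->budget_lock);

    if ( sdom->budget > 0 )
    {
        /* Grab a quota: it leaves the pool _immediately_. */
        svc->budget = min(sdom->budget, (s_time_t)CSCHED2_MIN_TIMER);
        sdom->budget -= svc->budget;
    }
    else
        svc->budget = 0;   /* Nothing left: the caller will park us. */

    spin_unlock(&sdom->budget_lock);

    return svc->budget > 0;
}

static void vcpu_give_budget_back(struct csched2_vcpu *svc)
{
    struct csched2_dom *sdom = svc->sdom;

    spin_lock(&sdom->budget_lock);

    /* Whatever we did not consume goes back in the pool. */
    sdom->budget += svc->budget;
    svc->budget = 0;

    spin_unlock(&sdom->budget_lock);
}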
> +
> +/*
>   * Locking:
>   *
>   * - runqueue lock
> @@ -112,18 +188,29 @@
>   *    runqueue each cpu is;
>   *  + serializes the operation of changing the weights of domains;
>   *
> + * - Budget lock
> + *  + it is per-domain;
> + *  + protects, in domains that have a utilization cap:
> + *   * manipulation of the total budget of the domain (as it is shared
> + *     among all vCPUs of the domain),
> + *   * manipulation of the list of vCPUs that are blocked waiting for
> + *     some budget to be available.
> + *
>   * - Type:
>   *  + runqueue locks are 'regular' spinlocks;
>   *  + the private scheduler lock can be an rwlock. In fact, data
>   *    it protects is modified only during initialization, cpupool
>   *    manipulation and when changing weights, and read in all
> - *    other cases (e.g., during load balancing).
> + *    other cases (e.g., during load balancing);
> + *  + budget locks are 'regular' spinlocks.
>   *
>   * Ordering:
>   *  + trylock must be used when wanting to take a runqueue lock,
>   *    if we already hold another one;
>   *  + if taking both a runqueue lock and the private scheduler
> - *    lock, the latter must always be taken first.
> + *    lock, the latter must always be taken first;
> + *  + if taking both a runqueue lock and a budget lock, the former
> + *    must always be taken first.
>   */
>
>  /*
> @@ -164,6 +251,8 @@
>  #define CSCHED2_CREDIT_RESET         0
>  /* Max timer: Maximum time a guest can be run for. */
>  #define CSCHED2_MAX_TIMER            CSCHED2_CREDIT_INIT
> +/* Period of the cap replenishment timer. */
> +#define CSCHED2_BDGT_REPL_PERIOD     ((opt_cap_period)*MILLISECS(1))
>
>  /*
>   * Flags
> @@ -293,6 +382,14 @@ static int __read_mostly opt_underload_balance_tolerance = 0;
>  integer_param("credit2_balance_under", opt_underload_balance_tolerance);
>  static int __read_mostly opt_overload_balance_tolerance = -3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
> +/*
> + * Domains subject to a cap receive a replenishment of their runtime budget
> + * once every opt_cap_period interval. Default is 10 ms. The amount of
> + * budget they receive depends on their cap. For instance, a domain with a
> + * 50% cap will receive 50% of 10 ms, i.e. 5 ms.
> + */
> +static unsigned int __read_mostly opt_cap_period = 10;    /* ms */
> +integer_param("credit2_cap_period_ms", opt_cap_period);
>
>  /*
>   * Runqueue organization.
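Given CSCHED2_BDGT_REPL_PERIOD and the behaviour documented earlier (replenish to capacity, never above tot_budget, then unpark waiters), the timer handler plausibly looks like the sketch below. repl_sdom_budget() is named in the comment block, but this body is a reconstruction under those assumptions, not the patch's code.

/* Illustrative sketch of the per-domain replenishment timer handler. */
static void repl_sdom_budget(void *data)
{
    struct csched2_dom *sdom = data;
    unsigned long flags;

    spin_lock_irqsave(&sdom->budget_lock, flags);

    /*
     * Top up to full capacity. Leftover budget from the previous
     * period never lets the total exceed tot_budget.
     */
    sdom->budget = sdom->tot_budget;

    spin_unlock_irqrestore(&sdom->budget_lock, flags);

    /*
     * Unpark vCPUs that were blocked waiting for budget (assumed to
     * take the budget_lock itself, as the parked list is protected
     * by it).
     */
    unpark_parked_vcpus(sdom);

    /* Re-arm the timer for the next period boundary. */
    sdom->next_repl += CSCHED2_BDGT_REPL_PERIOD;
    set_timer(&sdom->repl_timer, sdom->next_repl);
}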
> @@ -408,6 +505,10 @@ struct csched2_vcpu {
>      unsigned int residual;
>
>      int credit;
> +
> +    s_time_t budget;
> +    struct list_head parked_elem;      /* On the parked_vcpus list */
> +
>      s_time_t start_time; /* When we were scheduled (used for credit) */
>      unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
>      int tickled_cpu;     /* cpu tickled for picking us up (-1 if none) */
> @@ -425,7 +526,15 @@ struct csched2_vcpu {
>  struct csched2_dom {
>      struct list_head sdom_elem;
>      struct domain *dom;
> +
> +    spinlock_t budget_lock;
> +    struct timer repl_timer;
> +    s_time_t next_repl;
> +    s_time_t budget, tot_budget;
> +    struct list_head parked_vcpus;
> +
>      uint16_t weight;
> +    uint16_t cap;
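With the new csched2_dom fields in place, setting a cap presumably also needs one-time setup of the lock, the parked list, the timer, and the budget amounts. A sketch follows, assuming a hypothetical helper name (sdom_cap_init) and that tot_budget is simply cap% of the period, as the opt_cap_period comment earlier suggests:

/* Illustrative sketch (hypothetical helper): arming a domain's cap. */
static void sdom_cap_init(struct csched2_dom *sdom, uint16_t cap)
{
    spin_lock_init(&sdom->budget_lock);
    INIT_LIST_HEAD(&sdom->parked_vcpus);

    sdom->cap = cap;
    /* E.g., a 150% cap with the default 10ms period gives 15ms. */
    sdom->tot_budget = (CSCHED2_BDGT_REPL_PERIOD * cap) / 100;
    sdom->budget = sdom->tot_budget;

    /* CPU choice for the timer is illustrative only. */
    init_timer(&sdom->repl_timer, repl_sdom_budget, sdom,
               smp_processor_id());
    sdom->next_repl = NOW() + CSCHED2_BDGT_REPL_PERIOD;
    set_timer(&sdom->repl_timer, sdom->next_repl);
}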
Hmm, this needs to be rebased on the structure layout patches I checked in last week. :-)

 -George