Re: Sketch of an idea for handling the "mixed workload" problem



On Sun, Oct 1, 2023 at 12:28 AM Demi Marie Obenour
<demi@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > The basic credit2 algorithm goes something like this:
> >
> > 1. All vcpus start with the same number of credits; about 10ms worth
> > if everyone has the same weight
>
> > 2. vcpus burn credits as they consume cpu, based on the relative
> > weights: higher weights burn slower, lower weights burn faster
> >
> > 3. At any given point in time, the runnable vcpu with the highest
> > credit is allowed to run
> >
> > 4. When the credit of the "next runnable vcpu" on a runqueue goes
> > negative, credit is reset: everyone gets another 10ms, and can carry
> > over at most 2ms of credit across the reset.
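
To make those four steps concrete, here's a toy model in compilable C.
To be clear, this is invented illustration code, not the actual
implementation; the real thing in xen/common/sched/credit2.c deals
with runqueues, per-cpu state, and timers rather than a flat array
scan.

#include <stdbool.h>
#include <stdint.h>

typedef int64_t s_time_t;               /* time in ns, as in Xen */

#define CREDIT_INIT  (10 * 1000000LL)   /* ~10ms of credit at reset */
#define CREDIT_CARRY ( 2 * 1000000LL)   /* carry at most ~2ms over a reset */

struct toy_vcpu {
    s_time_t credit;
    unsigned int weight;                /* higher weight burns slower */
    bool runnable;
};

/* 2) Burn credit for cpu time consumed, scaled by relative weight. */
static void burn_credit(struct toy_vcpu *v, s_time_t ran,
                        unsigned int max_weight)
{
    v->credit -= ran * max_weight / v->weight;
}

/* 3) The runnable vcpu with the highest credit runs next. */
static struct toy_vcpu *pick_next(struct toy_vcpu *vs, int n)
{
    struct toy_vcpu *best = NULL;

    for ( int i = 0; i < n; i++ )
        if ( vs[i].runnable && (!best || vs[i].credit > best->credit) )
            best = &vs[i];

    return best;
}

/* 4) Reset: fresh credit for everyone, carrying over at most ~2ms
 *    (clamping negative credit to zero is a simplification here). */
static void reset_credit(struct toy_vcpu *vs, int n)
{
    for ( int i = 0; i < n; i++ )
    {
        s_time_t carry = vs[i].credit;

        if ( carry < 0 )
            carry = 0;
        if ( carry > CREDIT_CARRY )
            carry = CREDIT_CARRY;
        vs[i].credit = CREDIT_INIT + carry;
    }
}
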
>
> One relevant aspect of Qubes OS is that it is very very heavily
> oversubscribed: having more VMs running than physical CPUs is (at least
> in my usage) not uncommon, and each of those VMs will typically have at
> least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
> a vCPU not being allowed to execute for 200ms or more.  For audio or
> video workloads, this is a disaster.
>
> 10ms is a LOT for desktop workloads or for anyone who cares about
> latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
> heavily contended system frame drops are guaranteed.

You'd probably benefit from understanding better how the various
algorithms actually work.  I'm sorry I don't have any really good
"virtualization scheduling for dummies" resources; the best I have is
a few talks I gave on the subject; e.g.:

https://www.youtube.com/watch?v=C3jjvkr6fgQ

For one, when I say "oversubscribed", I don't mean "vcpus / pcpus"; I
mean "requested vcpu execution time / available pcpu time".  If you
have 18 vcpus on
a single pcpu, and all of them *on an empty system* would have run at
5%, you're totally fine.  If you have 18 vcpus on a single pcpu, and
all of them on an empty system would have averaged 100%, there's only
so much the scheduler can do to avoid problems.
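
If it helps to see that spelled out (pure illustration, not Xen code):

#include <stdio.h>

int main(void)
{
    int vcpus = 18, pcpus = 1;
    double demand = 0.05;   /* each vcpu would use 5% of a cpu alone */

    /* 18 * 5% = 0.90 of one pcpu: fine. */
    printf("demand ratio: %.2f\n", vcpus * demand / pcpus);

    demand = 1.0;           /* each vcpu would use 100% of a cpu alone */

    /* 18 * 100% = 18.00 of one pcpu: hopelessly overloaded. */
    printf("demand ratio: %.2f\n", vcpus * demand / pcpus);

    return 0;
}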

Secondly, while on credit1 a vcpu is allowed to run for 10ms without
stopping (and then must wait for 18x that time to get the same credit
back, if there are 18 other vcpus running on that same pcpu), this is
not the case for credit2.  The exact calculation can be found in
xen/common/sched/credit2.c:csched2_runtime(), but here's the general
algorithm from the comment:

/* General algorithm:
 * 1) Run until snext's credit will be 0.
 * 2) But if someone is waiting, run until snext's credit is equal
 *    to his.
 * 3) But, if we are capped, never run more than our budget.
 * 4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
 *    the ratelimit time.
 */

Default MIN_TIMER is 500us, and is configurable via sysctl; default
MAX_TIMER is... hmm, I'm pretty sure this started out as 2ms, but now
it seems to be 10ms.  Looks like this was changed in da92ec5bd1 ("xen:
credit2: "relax" CSCHED2_MAX_TIMER") in 2016.  (MAX_TIMER isn't
configurable, but arguably it should be; and making it configurable
should just be a matter of duplicating the logic around MIN_TIMER.)
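
Putting the comment and those defaults together, here's a condensed,
compilable sketch of the calculation.  The parameter names are
simplified inventions (the real csched2_runtime() takes scheduler
structures), but the constants match what I described:

#include <stdbool.h>
#include <stdint.h>

typedef int64_t s_time_t;                      /* time in ns, as in Xen */

#define MICROSECS(us) ((s_time_t)(us) * 1000)
#define MILLISECS(ms) ((s_time_t)(ms) * 1000000)
#define CSCHED2_MIN_TIMER MICROSECS(500)       /* the 500us default above */
#define CSCHED2_MAX_TIMER MILLISECS(10)        /* 10ms since da92ec5bd1 */

static s_time_t runtime_sketch(s_time_t credit, s_time_t waiter_credit,
                               bool someone_waiting, bool capped,
                               s_time_t budget, s_time_t ratelimit)
{
    /* 1) Run until our credit would hit 0. */
    s_time_t time = credit;

    /* 2) But if someone is waiting, run only until our credit equals his. */
    if ( someone_waiting )
        time = credit - waiter_credit;

    /* 3) If we are capped, never run more than our remaining budget. */
    if ( capped && time > budget )
        time = budget;

    /* 4) Never shorter than MIN_TIMER or the ratelimit, never longer
     *    than MAX_TIMER. */
    if ( time < CSCHED2_MIN_TIMER )
        time = CSCHED2_MIN_TIMER;
    if ( time < ratelimit )
        time = ratelimit;
    if ( time > CSCHED2_MAX_TIMER )
        time = CSCHED2_MAX_TIMER;

    return time;
}
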

That's not yet the last word though: if a VM that was asleep wakes
up, and it has more credit than the running vcpu, then it will
generally preempt that vcpu.

All that to say, it should be very rare for a vcpu to run for a
full 10ms under credit2.
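
In toy form (invented names; the real decision is made in credit2.c's
runq_tickle() and considers rather more than a single comparison):

#include <stdbool.h>
#include <stdint.h>

typedef int64_t s_time_t;

struct sketch_vcpu {
    s_time_t credit;
};

static bool wakeup_preempts(const struct sketch_vcpu *woken,
                            const struct sketch_vcpu *running)
{
    /* A newly-woken vcpu kicks the running one if it has more credit. */
    return woken->credit > running->credit;
}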

> > Other ways we could consider putting a vcpu into a boosted state (some
> > discussed on Matrix or emails linked from Matrix):
> > * Xen is about to preempt, but finds that the vcpu has interrupts
> > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > one)
>
> This is also a good heuristic for "vCPU owns a spinlock", which is
> definitely a bad time to preempt.

Not all spinlocks disable IRQs, but certainly some do.
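
If we did implement that heuristic, I'd expect the shape of it to be
something like the following.  To be clear, this is entirely
hypothetical; nothing like it exists in the tree, and all the names
are made up:

#include <stdbool.h>
#include <stdint.h>

typedef int64_t s_time_t;

#define GRACE_PERIOD ((s_time_t)100 * 1000) /* e.g. 100us; made up */

struct preempt_state {
    bool granted_grace;      /* at most one grace period per timeslice */
};

/* If the guest has interrupts disabled just as we're about to preempt
 * it, it may be inside an IRQs-off critical section (say, holding a
 * spinlock taken with interrupts disabled), so give it one short grace
 * period to finish instead of preempting immediately. */
static s_time_t preempt_delay(struct preempt_state *st,
                              bool guest_irqs_enabled)
{
    if ( !guest_irqs_enabled && !st->granted_grace )
    {
        st->granted_grace = true;
        return GRACE_PERIOD;
    }

    return 0;                /* preempt now */
}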

> > Getting the defaults right might take some thinking.  If you set the
> > default "boost credit ratio" to 25% and the "default boost interval"
> > to 500ms, then you'd basically have five "boosts" per scheduling
> > window.  The window depends on how active other vcpus are, but if it's
> > longer than 20ms your system is too overloaded.
>
> An interval of 500ms seems rather long to me.  Did you mean 500μs?

Yes, I did mean 500us, sorry.  (For the arithmetic: 25% of a 10ms
credit window is 2.5ms of boost credit, and at 500us per boost that's
the five boosts per window I mentioned.)

I'll respond to the other suggestions later.

> > Demi, what kinds of interrupt counts are you getting for your VM?
>
> I didn't measure it, but I can check the next time I am on a video call
> or doing audio recording.

Running xentrace would be really interesting too; traces are another
good way to nerd-snipe me. :-)

 -George
