Re: Sketch of an idea for handling the "mixed workload" problem
On Mon, Oct 02, 2023 at 12:20:31PM +0100, George Dunlap wrote:
> On Sun, Oct 1, 2023 at 12:28 AM Demi Marie Obenour
> <demi@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > >    if everyone has the same weight
> > >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > >    weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > >    credit is allowed to run
> > >
> > > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > >    reset: everyone gets another 10ms, and can carry over at most 2ms
> > >    of credit over the reset.
> >
> > One relevant aspect of Qubes OS is that it is very, very heavily
> > oversubscribed: having more VMs running than physical CPUs is (at least
> > in my usage) not uncommon, and each of those VMs will typically have at
> > least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
> > a vCPU not being allowed to execute for 200ms or more.  For audio or
> > video workloads, this is a disaster.
> >
> > 10ms is a LOT for desktop workloads or for anyone who cares about
> > latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
> > heavily contended system, frame drops are guaranteed.
>
> You'd probably benefit from understanding better how the various
> algorithms actually work.  I'm sorry I don't have any really good
> "virtualization scheduling for dummies" resources; the best I have is
> a few talks I gave on the subject, e.g.:
>
> https://www.youtube.com/watch?v=C3jjvkr6fgQ
>
> For one, when I say "oversubscribed", I don't mean "vcpus / pcpus"; I
> mean "requested vcpu execution time / vcpus".
> If you have 18 vcpus on
> a single pcpu, and all of them *on an empty system* would have run at
> 5%, you're totally fine.  If you have 18 vcpus on a single pcpu, and
> all of them on an empty system would have averaged 100%, there's only
> so much the scheduler can do to avoid problems.

If each vCPU would have spent 4% of its time doing realtime tasks, it
should be possible to give all of the realtime tasks all the time they
need, while the remaining 100 - 4 * 18 = 28% of the time is available
to non-realtime tasks.  That’s not awesome, but it might be enough to
prevent audio from glitching.

> Secondly, while on credit1 a vcpu is allowed to run for 10ms without
> stopping (and then must wait for 18x that time to get the same credit
> back, if there are 18 other vcpus running on that same pcpu), this is
> not the case for credit2.  The exact calculation can be found in
> xen/common/sched/credit2.c:csched2_runtime(), but here's the general
> algorithm from the comment:
>
> /* General algorithm:
>  *  1) Run until snext's credit will be 0.
>  *  2) But if someone is waiting, run until snext's credit is equal
>  *     to his.
>  *  3) But, if we are capped, never run more than our budget.
>  *  4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
>  *     the ratelimit time.
>  */
>
> Default MIN_TIMER is 500us, and is configurable via sysctl; default
> MAX_TIMER is... hmm, I'm pretty sure this started out as 2ms, but now
> it seems to be 10ms.  Looks like this was changed in da92ec5bd1 ("xen:
> credit2: "relax" CSCHED2_MAX_TIMER") in 2016.  (MAX_TIMER isn't
> configurable, but arguably it should be; and making it configurable
> should just be a matter of duplicating the logic around MIN_TIMER.)

Maybe MAX_TIMER should be lowered to e.g. 1ms?

> That's not yet the last word though: if a VM that was asleep wakes
> up, and it has more credit than the running vcpu, then it will
> generally preempt that vcpu.
>
> All that to say that it should be very rare for a vcpu to run for a
> full 10ms under credit2.

That’s good.

> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu's interrupts are
> > >   blocked (this sort of overlaps with the "when we deliver an
> > >   interrupt" one)
> >
> > This is also a good heuristic for "vCPU owns a spinlock", which is
> > definitely a bad time to preempt.
>
> Not all spinlocks disable IRQs, but certainly some do.
>
> > > Getting the defaults right might take some thinking.  If you set the
> > > default "boost credit ratio" to 25% and the "default boost interval"
> > > to 500ms, then you'd basically have five "boosts" per scheduling
> > > window.  The window depends on how active other vcpus are, but if it's
> > > longer than 20ms your system is too overloaded.
> >
> > An interval of 500ms seems rather long to me.  Did you mean 500μs?
>
> Yes, I did mean 500us, sorry.
>
> I'll respond to the other suggestions later.
>
> > > Demi, what kinds of interrupt counts are you getting for your VM?
> >
> > I didn't measure it, but I can check the next time I am on a video call
> > or doing audio recording.
>
> Running xentrace would be really interesting too; those are another
> good way to nerd-snipe me. :-)
>
>  -George

That would certainly be a good idea!
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab