Re: Sketch of an idea for handling the "mixed workload" problem



On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> The basic credit2 algorithm goes something like this:
> 
> 1. All vcpus start with the same number of credits; about 10ms worth
> if everyone has the same weight
> 
> 2. vcpus burn credits as they consume cpu, based on the relative
> weights: higher weights burn slower, lower weights burn faster
> 
> 3. At any given point in time, the runnable vcpu with the highest
> credit is allowed to run
> 
> 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> reset: everyone gets another 10ms, and can carry over at most 2ms of
> credit over the reset.

One relevant aspect of Qubes OS is that it is very very heavily
oversubscribed: having more VMs running than physical CPUs is (at least
in my usage) not uncommon, and each of those VMs will typically have at
least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
a vCPU not being allowed to execute for 200ms or more.  For audio or
video workloads, this is a disaster.

10ms is a LOT for desktop workloads or for anyone who cares about
latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
heavily contended system, frame drops are guaranteed.
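
(Back-of-envelope, under assumed numbers: with 36 runnable vCPUs on 8
pCPUs and a 10ms slice, a vCPU at the tail of a perfectly balanced
runqueue already waits (36/8 - 1) * 10ms = 35ms, and a 200ms stall only
needs ~20 vCPUs burning full slices ahead of it on one runqueue, which
is easy to hit when runqueues are unbalanced or credit has piled up.)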

> Generally speaking, vcpus that use less than their quota and have lots
> of interrupts are scheduled immediately, since when they wake up they
> always have more credit than the vcpus who are burning through their
> slices.
> 
> But what about a situation as described recently on Matrix, where a VM
> uses a non-negligible amount of cpu doing un-accelerated encryption
> and decryption, which can be delayed by a few MS, as well as handling
> audio events?  How can we make sure that:
> 
> 1. We can run whenever interrupts happen
> 2. We get no more than our fair share of the cpu?
> 
> The counter-intuitive key here is that in order to achieve the above,
> you need to *deschedule or preempt early*, so that when the interrupt
> comes, you have spare credit to run the interrupt handler.  How do we
> manage that?
> 
> The idea I'm working out comes from a phrase I used in the Matrix
> discussion, about a vcpu that "foolishly burned all its credits".
> Naturally the thing you want to do to have credits available is to
> save them up.
> 
> So the idea would be this.  Each vcpu would have a "boost credit
> ratio" and a "default boost interval"; there would be sensible
> defaults based on typical workloads, but these could be tweaked for
> individual VMs.
> 
> When credit is assigned, all VMs would get the same amount of credit,
> but divided into two "buckets", according to the boost credit ratio.
> 
> Under certain conditions, a vcpu would be considered "boosted"; this
> state would last either until the default boost interval, or until
> some other event (such as a de-boost yield).
> 
> The queue would be sorted thus:
> 
> * Boosted vcpus, by boost credit available
> * Non-boosted vcpus, by non-boost credit available
> 
> Getting more boost credit means having lower priority when not
> boosted; and burning through your boost credit means not being
> scheduled when you need to be.
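
(A sketch of the two-bucket accounting and queue sort described above;
the names are mine, not from any actual patch.)

/* Hypothetical per-vcpu state for the boost proposal above. */
struct vcpu_boost {
    bool    boosted;             /* currently in the boost state? */
    int64_t boost_credit_us;     /* bucket burned while boosted   */
    int64_t normal_credit_us;    /* bucket burned otherwise       */
};

/* Credit assignment: everyone gets the same total, split into two
 * buckets according to the per-VM boost credit ratio. */
static void assign_credit(struct vcpu_boost *v, int64_t total_us,
                          unsigned int boost_ratio_pct)
{
    v->boost_credit_us  = total_us * boost_ratio_pct / 100;
    v->normal_credit_us = total_us - v->boost_credit_us;
}

/* Runqueue comparator: boosted vcpus first, sorted by boost credit;
 * then non-boosted vcpus, sorted by normal credit.  Returns <0 if a
 * should run before b. */
static int runq_cmp(const struct vcpu_boost *a,
                    const struct vcpu_boost *b)
{
    int64_t ca, cb;

    if ( a->boosted != b->boosted )
        return a->boosted ? -1 : 1;

    ca = a->boosted ? a->boost_credit_us : a->normal_credit_us;
    cb = b->boosted ? b->boost_credit_us : b->normal_credit_us;

    return (cb > ca) - (cb < ca);   /* higher credit runs first */
}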
> 
> Other ways we could consider putting a vcpu into a boosted state (some
> discussed on Matrix or emails linked from Matrix):
> * Xen is about to preempt, but finds that the vcpu interrupts are
> blocked (this sort of overlaps with the "when we deliver an interrupt"
> one)

This is also a good heuristic for "vCPU owns a spinlock", which is
definitely a bad time to preempt.

> * Xen is about to preempt, but finds that the (currently out-of-tree)
> "dont_desched" bit has been set in the shared memory area
> 
> Other ways to consider de-boosting:
> * There's a way to trigger a VMEXIT when interrupts have been
> re-enabled; setting this up when the VM is in the boost state

This is a good idea.
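
(On VMX the natural fit is the "interrupt-window exiting" execution
control, which forces a VM exit as soon as the guest can accept
interrupts again.  Very roughly, with vmread/vmwrite as stand-ins for
the real VMCS accessors:)

/* Stand-ins for the real VMCS accessors. */
extern unsigned long vmread(unsigned long field);
extern void vmwrite(unsigned long field, unsigned long value);

/* Sketch: request a VM exit when the guest re-enables interrupts;
 * the exit handler would then clear the boost.  The two constants
 * are Xen's names from asm/hvm/vmx/vmcs.h. */
static void arm_deboost_on_irq_enable(void)
{
    unsigned long ctls = vmread(CPU_BASED_VM_EXEC_CONTROL);

    /* "Interrupt-window exiting": exit once RFLAGS.IF = 1 and no
     * interrupt shadow blocks delivery. */
    vmwrite(CPU_BASED_VM_EXEC_CONTROL,
            ctls | CPU_BASED_VIRTUAL_INTR_PENDING);
}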

> Getting the defaults right might take some thinking.  If you set the
> default "boost credit ratio" to 25% and the "default boost interval"
> to 500ms, then you'd basically have five "boosts" per scheduling
> window.  The window depends on how active other vcpus are, but if it's
> longer than 20ms your system is too overloaded.

An interval of 500ms seems rather long to me.  Did you mean 500μs?
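(With a 25% boost ratio and a 10ms slice, the boost bucket holds 2.5ms
of credit, and 2.5ms / 500μs = 5, which matches "five boosts per
scheduling window" above; 2.5ms / 500ms would not even cover one boost.)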

> Thoughts?

My first thought when I hit this problem was that Xen's scheduling quantum
was too long.  This is consistent with the observation that dom0 (which
was not very busy IIRC) fell behind in its delivery of audio samples.
Presumably it had plenty of credit, but simply did not get scheduled in
time, perhaps because Xen did not preempt soon enough.  It’s also worth
noting that Qubes makes heavy use of vchans, and I expect the latency of
these to be directly proportional to the time between preemption
interrupts.

Audio needs very little throughput but is extremely sensitive to
latency.  Therefore, the top priority is making sure that every runnable
vCPU gets a chance to execute periodically.  One way to achieve this
would be to make both the credits (the initial credit and the maximum
credit carried over) and the interval between preemptions inversely
proportional to the number of runnable vCPUs, so that the time needed to
cycle through all runnable vCPUs stays roughly constant.  Specifically,
they would be proportional to Lmax/runnable_vCPUs, where Lmax is the
latency target (1ms or so).  This also ensures that even Xen-unaware VMs
(such as a Windows guest running Microsoft Teams or Skype) get to run
periodically.  There would need to be a limit to prevent Xen from
hogging more than e.g. 10% of CPU time just doing preemption, but if
this is hit, Xen should log something and possibly notify dom0 so that a
warning can be displayed to the user.  Additionally, a certain amount of
CPU time (such as 10%) should be reserved for dom0, so that the system
remains responsive.
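
(Roughly what I have in mind, with made-up names, assuming a context
switch costs on the order of 10μs so a 100μs floor keeps preemption
overhead near 10%:)

#define LMAX_US        1000   /* latency target: ~1ms          */
#define SLICE_MIN_US    100   /* floor on the slice; see above */

/* Hypothetical: choose a per-vCPU slice so that cycling through all
 * runnable vCPUs takes about LMAX_US in total. */
static int64_t pick_slice_us(unsigned int nr_runnable)
{
    int64_t slice = LMAX_US / (nr_runnable ? nr_runnable : 1);

    if ( slice < SLICE_MIN_US )
    {
        slice = SLICE_MIN_US;
        /* The latency target is unreachable; this is where Xen
         * should log and possibly notify dom0 so a warning can be
         * shown to the user. */
    }
    return slice;
}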

Qubes OS could also help here.  If a VM is allowed to record audio, it
(and the VMs providing network to it, transitively) should get a boost
in priority, so that if the system is overloaded, other guests are more
likely to be delayed in their execution.
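
(A toy sketch of that transitive boost; "struct qubes_vm" and its
"netvm" field are stand-ins, not real Qubes structures:)

struct qubes_vm {
    struct qubes_vm *netvm;   /* VM providing network, or NULL */
    bool audio_boost;
};

/* Boost a recording VM and, transitively, every VM that provides
 * network service to it. */
static void propagate_audio_boost(struct qubes_vm *vm)
{
    for ( ; vm; vm = vm->netvm )
        vm->audio_boost = true;
}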

> Demi, what kinds of interrupt counts are you getting for your VM?

I didn't measure it, but I can check the next time I am on a video call
or doing audio recording.

>  -George

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
