
Re: Sketch of an idea for handling the "mixed workload" problem



On Mon, Jan 22, 2024 at 11:54:14AM +0000, George Dunlap wrote:
> On Mon, Jan 22, 2024 at 12:31 AM Demi Marie Obenour
> <demi@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > > if everyone has the same weight
> > >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > > weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > > credit is allowed to run
> > >
> > > 4. When the credit of the "next runnable vcpu" on a runqueue is
> > > negative, credit is reset: everyone gets another 10ms, and can carry
> > > over at most 2ms of credit across the reset.
> > >
> > > Generally speaking, vcpus that use less than their quota and have lots
> > > of interrupts are scheduled immediately, since when they wake up they
> > > always have more credit than the vcpus who are burning through their
> > > slices.
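
[A toy model of the credit mechanics described above, purely illustrative;
the constants, units, and names are invented, not Xen's actual code:]

```python
# Toy model of the credit2 mechanics sketched above (illustrative only).

RESET_CREDIT = 10_000   # ~10ms of credit, in microseconds (assumed unit)
CARRY_CAP = 2_000       # at most ~2ms carried over a reset

class VCpu:
    def __init__(self, name, weight=256):
        self.name = name
        self.weight = weight
        self.credit = RESET_CREDIT

    def burn(self, cpu_time, base_weight=256):
        # Higher weights burn slower, lower weights burn faster.
        self.credit -= cpu_time * base_weight // self.weight

def pick_next(runnable):
    # The runnable vcpu with the highest credit is allowed to run.
    return max(runnable, key=lambda v: v.credit)

def maybe_reset(runnable):
    # When the credit of the next runnable vcpu is negative, everyone
    # gets a fresh 10ms, carrying over at most 2ms of positive credit.
    if pick_next(runnable).credit < 0:
        for v in runnable:
            v.credit = RESET_CREDIT + min(max(v.credit, 0), CARRY_CAP)
```
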
> > >
> > > But what about a situation as described recently on Matrix, where a VM
> > > uses a non-negligible amount of cpu doing un-accelerated encryption
> > > and decryption, which can be delayed by a few milliseconds, as well as
> > > handling audio events?  How can we make sure that:
> > >
> > > 1. We can run whenever interrupts happen
> > > 2. We get no more than our fair share of the cpu?
> > >
> > > The counter-intuitive key here is that in order to achieve the above,
> > > you need to *deschedule or preempt early*, so that when the interrupt
> > > comes, you have spare credit to run the interrupt handler.  How do we
> > > manage that?
> > >
> > > The idea I'm working out comes from a phrase I used in the Matrix
> > > discussion, about a vcpu that "foolishly burned all its credits".
> > > Naturally the thing you want to do to have credits available is to
> > > save them up.
> > >
> > > So the idea would be this.  Each vcpu would have a "boost credit
> > > ratio" and a "default boost interval"; there would be sensible
> > > defaults based on typical workloads, but these could be tweaked for
> > > individual VMs.
> > >
> > > When credit is assigned, all VMs would get the same amount of credit,
> > > but divided into two "buckets", according to the boost credit ratio.
> > >
> > > Under certain conditions, a vcpu would be considered "boosted"; this
> > > state would last either until the default boost interval, or until
> > > some other event (such as a de-boost yield).
> > >
> > > The queue would be sorted thus:
> > >
> > > * Boosted vcpus, by boost credit available
> > > * Non-boosted vcpus, by non-boost credit available
> > >
> > > Getting more boost credit means having lower priority when not
> > > boosted; and burning through your boost credit means not being
> > > scheduled when you need to be.
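
[The two-bucket runqueue ordering described above could be expressed as a
sort key along these lines; a rough sketch with hypothetical field names:]

```python
# Illustrative sort key for the proposed runqueue ordering: boosted
# vcpus first (ordered by boost credit), then non-boosted vcpus
# (ordered by non-boost credit).  Field names are made up.

def runq_key(vcpu):
    # Python sorts ascending, so negate credits to get "highest first".
    if vcpu["boosted"]:
        return (0, -vcpu["boost_credit"])
    return (1, -vcpu["credit"])

runq = [
    {"name": "a", "boosted": False, "credit": 9000, "boost_credit": 1000},
    {"name": "b", "boosted": True,  "credit": 2000, "boost_credit": 500},
    {"name": "c", "boosted": True,  "credit": 8000, "boost_credit": 1500},
]
runq.sort(key=runq_key)
# Boosted vcpus (c, then b, by boost credit) sort ahead of non-boosted a.
```
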
> > >
> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu interrupts are
> > > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > > one)
> > > * Xen is about to preempt, but finds that the (currently out-of-tree)
> > > "dont_desched" bit has been set in the shared memory area
> >
> > I think both of these would be good.  Another one would be when Xen is
> > about to deliver an interrupt to a guest, provided that there is no
> > storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
> > spike through what I presume is an interrupt storm, and I suspect that
> > others have observed similar behavior with USB external drives.
> 
> How would you determine that a given interrupt was part of a "storm",
> and what would you do differently as a result of determining that?

I’m not sure.  One heuristic might be that if a device assigned to a VM
is interrupting Xen too many times while Xen is running other VMs,
interrupts from that device are blocked as needed to ensure other VMs
get to execute.  Theoretically, an interrupt from a USB storage device
should be safe to block until Xen is no longer running boosted
workloads, but an interrupt from a USB microphone or speaker is not.
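
One way to express that heuristic would be a per-device token bucket; a
rough sketch, with entirely made-up thresholds and names:

```python
# Hypothetical token-bucket rate limiter for one device's interrupts:
# if the device fires too often while other VMs are running, further
# interrupts are held back until the bucket refills.

class IrqRateLimiter:
    def __init__(self, max_tokens=100, refill_per_sec=50):
        self.tokens = float(max_tokens)
        self.max_tokens = max_tokens
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now):
        # Refill tokens for elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.max_tokens,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True     # deliver the interrupt now
        return False        # block/coalesce: looks like a storm
```
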

> > > Other ways to consider de-boosting:
> > > * There's a way to trigger a VMEXIT when interrupts have been
> > > re-enabled; setting this up when the VM is in the boost state
> >
> > That’s a good idea, but should be conditional on “dont_desched” _not_
> > being set.  This handles the case where the guest is running a realtime
> > thread.
> 
> In which case we need some way for the "enlightened" guest to know how
> to de-boost itself; a yield might do.

That would be sufficient.

> > Generally, I’d like to see something like this:
> >
> > - A vCPU with sufficient boost credit is boosted by Xen under the
> >   following conditions:
> >
> >   1. Xen interrupts the guest.
> 
> I take it you mean, "delivers an interrupt to the guest"?

Yes.

> >   2. Xen is about to preempt, but detects that “dont_desched” is set.
> >   3. Xen is about to preempt, but detects that interrupts are disabled.
> >
> > - A vCPU is deboosted if:
> >
> >   1. It runs out of boost credit, even if “dont_desched” is set.
> >   2. An interrupt handler returns, but only if “dont_desched” is not set.
> >   3. Interrupts are re-enabled, but only if “dont_desched” is not set.
> >
> >   The first case is an abnormal condition and typically means that
> >   either the system is overloaded or a vCPU is running boosted for too
> >   long.  To help debug this situation, Xen will log a warning and
> >   increment both a system-wide and a per-domain counter.  dom0 can
> >   retrieve counters for any domain, and a domain can read its own
> >   counter.
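
[The boost/deboost rules above amount to a small state machine; a sketch
with hypothetical event names, not actual Xen code:]

```python
# Sketch of the boost/deboost rules listed above.  Returns the new
# boosted state and whether the abnormal "credit exhausted" case
# should log a warning and bump the counters.

def on_event(event, boosted, boost_credit, dont_desched):
    warn = False
    if event in ("deliver_interrupt", "preempt_dont_desched",
                 "preempt_irqs_disabled"):
        if boost_credit > 0:
            boosted = True
    elif event == "boost_credit_exhausted":
        # Abnormal: deboost even if dont_desched is set, and warn.
        boosted, warn = False, True
    elif event in ("irq_handler_return", "irqs_reenabled"):
        if not dont_desched:
            boosted = False
    return boosted, warn
```
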
> >
> > - When to set “dont_desched” is entirely up to the guest kernel, but
> >   there are some general rules guests should follow:
> >
> >   - Only set “dont_desched” if there is a good reason, and unset it as
> >     soon as possible.  Xen gives vCPUs with “dont_desched” set priority
> >     over all other vCPUs on the system, but the amount of time a vCPU is
> >     allowed to run with an elevated priority is limited.  Xen will log a
> >     warning if a guest tries to run with elevated priority for too long.
> >
> >   - Xen boosts vCPUs before delivering an interrupt, but there should be
> >     a way for a vCPU to deboost itself even before returning from the
> >     interrupt handler.
> >
> >   - Guests should always set “dont_desched” when running hard-realtime
> >     threads (used for e.g. audio processing), even when the thread is in
> >     userspace.  This ensures that Xen gives the underlying vCPU priority
> >     over other vCPUs.
> >
> >   - Guests should always set “dont_desched” when holding a spin lock,
> >     but it is even better to use paravirtualized spin locks (which make
> >     a hypercall into Xen and therefore allow other vCPUs to run).
> >
> >   - Xen does not implement priority inheritance, so guests need to do
> >     that.
> >
> > - Max boost credits can be set by dom0 via a hypercall.
> >
> > The advantage of this approach is that it keeps almost all policy out of
> > Xen.  The only exception is the boosting when an interrupt is received,
> > but a well-behaved guest will deboost itself very quickly (by enabling
> > interrupts) if the boost was not actually needed, so this should have
> > very limited impact.  I think this should be enough for realtime audio,
> > and it is somewhat related to (but hopefully simpler than) the KVM RFC
> > from Google [1].
> >
> > Any thoughts on this?
> 
> Overall sounds good.  I think a good approach would be to start by
> implementing it without the "dont_desched" flag, and then add that on
> top later.  It sounds like you have a clear vision for what you want,
> so it shouldn't be too hard to write such that adding the
> "dont_desched" doesn't require a lot of pointless refactoring.
> 
> The other issue I have with this (and essentially where I got stuck
> developing credit2 in the first place) is testing: how do you ensure
> that it has the properties that you expect?  How do you develop a
> "regression test" to make sure that server-based workloads don't have
> issues in this sort of case?

I don’t have any server workloads myself.  Would it be reasonable to ask
those who do have such workloads to develop such a test?  They would be
in a much better position to check for regressions on these workloads,
and have server hardware that they can use to benchmark such workloads.
I just have my laptop and a test laptop, both running Qubes OS.

It’s also possible that some of these changes will improve latency at
the expense of throughput.  In that case, I could add a Xen command-line
option (or even a runtime toggle) that controls whether Xen honors the
boost state.  I do expect that the rest of the logic should have very
little overhead in this case.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
