
Re: Sketch of an idea for handling the "mixed workload" problem



On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> The basic credit2 algorithm goes something like this:
> 
> 1. All vcpus start with the same number of credits; about 10ms worth
> if everyone has the same weight
> 
> 2. vcpus burn credits as they consume cpu, based on the relative
> weights: higher weights burn slower, lower weights burn faster
> 
> 3. At any given point in time, the runnable vcpu with the highest
> credit is allowed to run
> 
> 4. When the credit of the "next runnable vcpu" on a runqueue goes
> negative, credit is reset: everyone gets another 10ms, and can carry
> at most 2ms of credit over the reset.
> 
> Generally speaking, vcpus that use less than their quota and have lots
> of interrupts are scheduled immediately, since when they wake up they
> always have more credit than the vcpus who are burning through their
> slices.
> 
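
To make sure I follow the baseline, the ordering is roughly this,
right?  (Rough C sketch with invented names and microsecond units,
not the actual credit2 code.)

    #include <stdbool.h>
    #include <stdint.h>

    #define CREDIT_INIT   10000   /* ~10ms of credit */
    #define CREDIT_CARRY   2000   /* at most ~2ms survives a reset */

    struct csched_vcpu {
        int64_t  credit;   /* remaining credit */
        unsigned weight;   /* relative weight */
    };

    /* Step 2: burn credit in proportion to time run, scaled so that
     * higher weights burn more slowly. */
    static void burn_credit(struct csched_vcpu *v, int64_t ran_us,
                            unsigned weight_max)
    {
        v->credit -= ran_us * weight_max / v->weight;
    }

    /* Step 3: the runnable vcpu with the most credit runs first. */
    static bool runs_before(const struct csched_vcpu *a,
                            const struct csched_vcpu *b)
    {
        return a->credit > b->credit;
    }

    /* Step 4: on reset, everyone gets another CREDIT_INIT and keeps
     * at most CREDIT_CARRY of whatever was left. */
    static void reset_credit(struct csched_vcpu *v)
    {
        int64_t carry = v->credit;

        if (carry < 0)
            carry = 0;
        if (carry > CREDIT_CARRY)
            carry = CREDIT_CARRY;
        v->credit = CREDIT_INIT + carry;
    }
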
> But what about a situation as described recently on Matrix, where a VM
> uses a non-negligible amount of cpu doing un-accelerated encryption
> and decryption, which can be delayed by a few ms, as well as handling
> audio events?  How can we make sure that:
> 
> 1. We can run whenever interrupts happen
> 2. We get no more than our fair share of the cpu?
> 
> The counter-intuitive key here is that in order to achieve the above,
> you need to *deschedule or preempt early*, so that when the interrupt
> comes, you have spare credit to run the interrupt handler.  How do we
> manage that?
> 
> The idea I'm working out comes from a phrase I used in the Matrix
> discussion, about a vcpu that "foolishly burned all its credits".
> Naturally the thing you want to do to have credits available is to
> save them up.
> 
> So the idea would be this.  Each vcpu would have a "boost credit
> ratio" and a "default boost interval"; there would be sensible
> defaults based on typical workloads, but these could be tweaked for
> individual VMs.
> 
> When credit is assigned, all VMs would get the same amount of credit,
> but divided into two "buckets", according to the boost credit ratio.
> 
> Under certain conditions, a vcpu would be considered "boosted"; this
> state would last either until the default boost interval expires, or
> until some other event (such as a de-boost yield) occurs.
> 
> The queue would be sorted thus:
> 
> * Boosted vcpus, by boost credit available
> * Non-boosted vcpus, by non-boost credit available
> 
> Getting more boost credit means having lower priority when not
> boosted; and burning through your boost credit means not being
> scheduled when you need to be.
> 
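
If I read the proposal correctly, the credit split and the runqueue
comparison would become something like this (again invented names,
just to check my understanding):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-vcpu state for the two-bucket scheme. */
    struct boost_vcpu {
        bool    boosted;        /* currently in the boost state */
        int64_t boost_credit;   /* burned while boosted */
        int64_t normal_credit;  /* burned while not boosted */
    };

    /* Credit assignment: same total for everyone, split according to
     * the per-vcpu boost credit ratio (here a percentage). */
    static void assign_credit(struct boost_vcpu *v, int64_t total,
                              unsigned boost_ratio_pct)
    {
        v->boost_credit  = total * boost_ratio_pct / 100;
        v->normal_credit = total - v->boost_credit;
    }

    /* Runqueue order: boosted vcpus first, by boost credit; then
     * non-boosted vcpus, by non-boost credit. */
    static bool runs_before(const struct boost_vcpu *a,
                            const struct boost_vcpu *b)
    {
        if (a->boosted != b->boosted)
            return a->boosted;
        return a->boosted ? a->boost_credit > b->boost_credit
                          : a->normal_credit > b->normal_credit;
    }
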
> Other ways we could consider putting a vcpu into a boosted state (some
> discussed on Matrix or emails linked from Matrix):
> * Xen is about to preempt, but finds that the vcpu interrupts are
> blocked (this sort of overlaps with the "when we deliver an interrupt"
> one)
> * Xen is about to preempt, but finds that the (currently out-of-tree)
> "dont_desched" bit has been set in the shared memory area

I think both of these would be good.  Another one would be when Xen is
about to deliver an interrupt to a guest, provided that there is no
storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
spike through what I presume is an interrupt storm, and I suspect that
others have observed similar behavior with USB external drives.
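
One way to make "provided that there is no storm" concrete, assuming
Xen can cheaply count how often it has boosted a vCPU for interrupt
delivery, would be a simple per-vCPU rate limit.  A rough sketch, all
names and numbers invented:

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented numbers: allow at most IRQ_BOOST_MAX interrupt-driven
     * boosts per IRQ_BOOST_WINDOW_NS, so a storm cannot keep a vCPU
     * permanently boosted. */
    #define IRQ_BOOST_WINDOW_NS  10000000ULL  /* 10ms */
    #define IRQ_BOOST_MAX        32

    struct irq_boost_limit {
        uint64_t window_start;  /* start of the current window */
        unsigned count;         /* interrupt boosts granted in it */
    };

    static bool irq_boost_allowed(struct irq_boost_limit *l,
                                  uint64_t now)
    {
        if (now - l->window_start > IRQ_BOOST_WINDOW_NS) {
            l->window_start = now;
            l->count = 0;
        }
        if (l->count >= IRQ_BOOST_MAX)
            return false;  /* looks like a storm: deliver the
                              interrupt, but don't boost */
        l->count++;
        return true;
    }

The window and threshold would obviously need tuning; the point is
just that a storm stops earning boosts while normal interrupt rates
still do.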

> Other ways to consider de-boosting:
> * There's a way to trigger a VMEXIT when interrupts have been
> re-enabled; setting this up when the VM is in the boost state

That’s a good idea, but should be conditional on “dont_desched” _not_
being set.  This handles the case where the guest is running a realtime
thread.

Generally, I’d like to see something like this (a rough C sketch of
the boost/deboost decisions follows the list):

- A vCPU with sufficient boost credit is boosted by Xen under the
  following conditions:

  1. Xen delivers an interrupt to the guest.
  2. Xen is about to preempt, but detects that “dont_desched” is set.
  3. Xen is about to preempt, but detects that interrupts are disabled.

- A vCPU is deboosted if:

  1. It runs out of boost credit, even if “dont_desched” is set.
  2. An interrupt handler returns, but only if “dont_desched” is not set.
  3. Interrupts are re-enabled, but only if “dont_desched” is not set.

  The first case is an abnormal condition and typically means that
  either the system is overloaded or a vCPU is running boosted for too
  long.  To help debug this situation, Xen will log a warning and
  increment both a system-wide and a per-domain counter.  dom0 can
  retrieve counters for any domain, and a domain can read its own
  counter.

- When to set “dont_desched” is entirely up to the guest kernel, but
  there are some general rules guests should follow:

  - Only set “dont_desched” if there is a good reason, and unset it as
    soon as possible.  Xen gives vCPUs with “dont_desched” set priority
    over all other vCPUs on the system, but the amount of time a vCPU is
    allowed to run with an elevated priority is limited.  Xen will log a
    warning if a guest tries to run with elevated priority for too long.
    
  - Xen boosts vCPUs before delivering an interrupt, but there should be
    a way for a vCPU to deboost itself even before returning from the
    interrupt handler.

  - Guests should always set “dont_desched” when running hard-realtime
    threads (used for e.g. audio processing), even when the thread is in
    userspace.  This ensures that Xen gives the underlying vCPU priority
    over vCPUs that are not running realtime work.

  - Guests should always set “dont_desched” when holding a spin lock,
    but it is even better to use paravirtualized spin locks (which make
    a hypercall into Xen and therefore allow other vCPUs to run).

  - Xen does not implement priority inheritance, so guests need to
    implement it themselves.

- Max boost credits can be set by dom0 via a hypercall.
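
To make the boost/deboost rules above concrete, here is the kind of
decision logic I have in mind on the Xen side.  Everything is a
sketch with invented names, not a worked-out design:

    #include <stdbool.h>
    #include <stdint.h>

    struct boost_state {
        bool     boosted;
        bool     dont_desched;     /* guest-set bit in shared memory */
        bool     irqs_disabled;    /* guest has interrupts masked */
        int64_t  boost_credit;
        uint64_t forced_deboosts;  /* bumped when credit runs out
                                      while boosted; mirrored in a
                                      system-wide counter */
    };

    /* Boost: on interrupt delivery, or when Xen would preempt but the
     * guest has dont_desched set or interrupts disabled, and only if
     * there is boost credit left. */
    static void maybe_boost(struct boost_state *v, bool delivering_irq,
                            bool about_to_preempt)
    {
        if (v->boost_credit <= 0)
            return;
        if (delivering_irq ||
            (about_to_preempt && (v->dont_desched || v->irqs_disabled)))
            v->boosted = true;
    }

    /* Deboost: always on credit exhaustion (abnormal, so count it);
     * on interrupt-handler return or interrupt re-enable only if
     * dont_desched is clear. */
    static void maybe_deboost(struct boost_state *v, bool irq_returned,
                              bool irqs_reenabled)
    {
        if (!v->boosted)
            return;
        if (v->boost_credit <= 0) {
            v->boosted = false;
            v->forced_deboosts++;  /* plus a warning in the Xen log */
            return;
        }
        if (!v->dont_desched && (irq_returned || irqs_reenabled))
            v->boosted = false;
    }

How Xen actually notices the interrupt-handler return or the
re-enable (e.g. the VMEXIT-on-interrupt-re-enable trick mentioned
above) is an implementation detail I am glossing over here.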

The advantage of this approach is that it keeps almost all policy out of
Xen.  The only exception is the boosting when an interrupt is received,
but a well-behaved guest will deboost itself very quickly (by enabling
interrupts) if the boost was not actually needed, so this should have
very limited impact.  I think this should be enough for realtime audio,
and it is somewhat related to (but hopefully simpler than) the KVM RFC
from Google [1].
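
To illustrate what I mean by a well-behaved guest, here is a
completely hypothetical guest-side sketch; the real shared-memory
interface is out of tree and may look nothing like this.  The only
point is that the flag is held for as short a time as possible:

    #include <stdbool.h>

    /* Entirely hypothetical shared-memory layout. */
    struct shared_sched_flags {
        volatile bool dont_desched;
    };

    extern struct shared_sched_flags *sched_flags;  /* mapped from Xen */

    /* Hold the flag only around work that really cannot tolerate
     * being descheduled, e.g. a hard-realtime audio callback or a
     * short spinlock critical section. */
    static void run_realtime_section(void (*fn)(void *), void *arg)
    {
        sched_flags->dont_desched = true;
        /* (compiler/memory barriers omitted for brevity) */
        fn(arg);
        sched_flags->dont_desched = false;
    }

A guest could also deboost itself explicitly afterwards (e.g. via a
yield-style hypercall) instead of waiting for Xen to notice, along the
lines of the self-deboost mentioned above.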

Any thoughts on this?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[1]: 
https://lore.kernel.org/kvm/20231214024727.3503870-1-vineeth@xxxxxxxxxxxxxxx/
