
Re: [Xen-devel] Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU





2015-03-24 7:54 GMT-04:00 George Dunlap <George.Dunlap@xxxxxxxxxxxxx>:
On Tue, Mar 24, 2015 at 3:50 AM, Meng Xu <xumengpanda@xxxxxxxxx> wrote:
> Hi Dario and George,
>
> I'm exploring the design choice of eliminating the Xen scheduler overhead on
> the dedicated CPU. A dedicated CPU is a PCPU that has a full-capacity VCPU
> pinned onto it, and no other VCPUs will run on that PCPU.

Hey Meng! This sounds awesome, thanks for looking into it.

:-) I think it is a useful feature for extremely low-latency applications.


> [Problems]
> The issue I'm encountering is as follows:
> After I implemented the dedicated cpu feature, I compared the latency of a
> cpu-intensive task in domU on a dedicated CPU (denoted as R_dedcpu) and the
> latency on a non-dedicated CPU (denoted as R_nodedcpu). The expected result
> should be R_dedcpu < R_nodedcpu since we avoid the scheduler overhead.
> However, the actual result is R_dedcpu > R_nodedcpu, and R_dedcpu -
> R_nodedcpu ~= 1000 cycles.
>
> After adding some trace to every function that may raise the
> SCHEDULE_SOFTIRQ, I found:
> When a cpu is not marked as a dedicated cpu and the scheduler on it is not
> disabled, vcpu_block() is triggered 2896 times during 58,280,322,928ns
> (i.e., once every 20,124,421ns on average) on the dedicated cpu.
> However,
> when I disable the scheduler on a dedicated cpu, the function
> vcpu_block(void) @schedule.c is triggered very frequently; vcpu_block(void)
> is triggered 644,824 times during 8,918,636,761ns (i.e., once every
> 13,831ns on average) on the dedicated cpu.
>
> To sum up the problem I'm facing: vcpu_block(void) is triggered much
> faster and more frequently when the scheduler is disabled on a cpu than
> when the scheduler is enabled.
>
> [My question]
> I'm very confused about why vcpu_block(void) is triggered so
> frequently when the scheduler is disabled. vcpu_block(void) is called
> by the SCHEDOP_block hypercall, but why would this hypercall be
> triggered so frequently?
>
> It would be great if you know the answer directly. (This is just a hope
> and I cannot really expect it. :-) )
> But I would really appreciate it if you could give me some directions on
> how I should figure it out. I grepped for vcpu_block(void) and
> SCHEDOP_block in the xen code base, but didn't find many calls to them.
>
> What confuses me most is that the dedicated VCPU should be blocked less
> frequently, not more frequently, when the scheduler is disabled on the
> dedicated CPU, because the dedicated VCPU is now always running on the
> CPU without the hypervisor scheduler's interference.

So if I had to guess, I would guess that you're not actually blocking
when the guest tries to block. Normally if the guest blocks, it
blocks in a loop like this:

do {
    enable_irqs();
    hlt();
    disable_irqs();
} while (!interrupt_pending());

For a PV guest, the hlt() would be replaced with a PV block() hypercall.
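
(As a minimal sketch of that PV replacement, assuming the usual Linux pvops
hypercall wrapper: HYPERVISOR_sched_op() and SCHEDOP_block are the real
interfaces, while the function around them is only illustrative.)

/* Sketch of a PV guest's "halt": block this vcpu in Xen until an event
 * (virtual interrupt) is pending.  SCHEDOP_block atomically re-enables
 * event delivery before blocking, so it plays the role of the
 * enable_irqs(); hlt(); pair in the loop above. */
static void pv_block(void)
{
    HYPERVISOR_sched_op(SCHEDOP_block, NULL);
}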

Normally, when a guest calls block(), then it's taken off the
runqueue; and if there's nothing on the runqueue, then the scheduler
will run the idle domain; it's the idle domain that actually does the
blocking.

If you've hardwired it always to return the vcpu in question rather
than the idle domain, then it will never block -- it will busy-wait,
calling block millions of times.

The simplest way to get your prototype working, in that case, would be
to return the idle vcpu for that pcpu if the guest is blocked.
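
(Something like the following, as a minimal sketch of that fast path:
vcpu_runnable() and idle_vcpu[] are the existing Xen helpers, while
pick_dedicated_vcpu() and the "dedicated" pointer are purely illustrative.)

/* Sketch: on a dedicated pcpu, keep running the pinned vcpu while it is
 * runnable; once it blocks, hand back the per-cpu idle vcpu so the pcpu
 * really halts instead of busy-waiting on SCHEDOP_block. */
static struct vcpu *pick_dedicated_vcpu(unsigned int cpu, struct vcpu *dedicated)
{
    if ( dedicated != NULL && vcpu_runnable(dedicated) )
        return dedicated;

    return idle_vcpu[cpu];
}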

Exactly! Thank you so much for pointing this out! I did hardwire it to always return the vcpu that is supposed to be blocked. Now I totally understand what happened. :-)

But this leads to another issue with my design:
If I return the idle vcpu when the dedicated VCPU is blocked, it will do a context_switch(prev, next); when the dedicated VCPU is unblocked, another context_switch() is triggered.
It means that we cannot eliminate the context switch overhead for the dedicated CPU.
The ideal performance for the dedicated VCPU on the dedicated CPU should be super-close to that of a bare-metal CPU. Here we still have the context switch overhead, which is about 1500-2000 cycles.

Can we avoid the context switch overhead?


But a brief comment on your design:

Looking at your design at the moment, you will get rid of the overhead
of the scheduler-related interrupts, and any pluggable-cpu accounting
that needs to happen (e.g., calculating credits burned, &c). And
that's certainly not nothing.

Yes. The schedule() function is avoided.
Right now, I only apply the dedicated cpu feature to the RTDS scheduler. So when a dedicated VCPU is pinned and running on the dedicated CPU, it should be a full-capacity vcpu and we don't need to count the budget burned.

However, because the credit2 scheduler accounts for credit at the domain level, the code that counts the credit burned should not be bypassed there.

Actually, the trace code in schedule() will also be bypassed on the dedicated CPU. I'm not sure if we need the trace code working on the dedicated CPU or not. Since we are aiming to provide a dedicated VCPU that has close-to-bare-metal CPU performance, the tracing mechanism in schedule() is unnecessary, IMHO.

But it's not really accurate to say
that you're avoiding the scheduler entirely. At the moment, as far as
I can tell, you're still going through all the normal schedule.c
machinery between wake-up and actually running the vm; and the normal
machinery for interrupt delivery.

Yes. :-(
Ideally, I want to isolate all such interference from the dedicated CPU so that the dedicated VCPU on it has performance close to a bare-metal cpu. However, I'm concerned about how complex it will be and how it will affect the existing functionality that relies on interrupts.

I'm wondering -- are people really going to want to just pin a single
vcpu from a domain like this? Or are they going to want to pin all
vcpus from a given domain?

For the first to be useful, the guest OS would need to understand
somehow that this cpu has better properties than the other vcpus on
its system. Which I suppose could be handled manually (e.g., by the
guest admin pinning processes to that cpu or something).

Right. The guest OS will be running on heterogeneous cpus. In my mind, not all processes in the guest ask for extremely low latency, so the guest OS can pin the latency-critical processes onto the dedicated VCPU (which is mapped to the dedicated CPU) and pin the other processes to the non-dedicated VCPUs. This could be more flexible for the guest OS and accommodate more domains on the same number of cpus. But (of course) it introduces more complexity into the hypervisor and into management in the guest OS.

The reason I'm asking is because another option that would avoid the
need for special per-cpu flags would be to make a "sched_place" scheduler
(sched_partition?), which would essentially do what you've done here
-- when you add a vcpu to the scheduler, it simply chooses one of its
free cpus and dedicates it to that vcpu. If no such cpus are
available, it returns an error. In that case, you could use the
normal cpupool machinery to assign cpus to that scheduler, without
needing to introduce these extra flags or to make each of the
pluggable schedulers deal with the complexity of implementing
the "dedicated" scheduling.

This is also a good idea, if we don't aim to avoid the context switch overhead and avoid calling the schedule() function. The biggest strength of this approach is that it has as little impact as possible on the existing functions.

Actually, I can extend the RTDS scheduler to include this feature. This is more like a fast path in the scheduler on the dedicated CPU: Instead of scanning the runq and deciding which vcpu should run next, we just always pick the dedicated VCPU if the vcpu is not blocked. (If the dedicated VCPU is blocked, we pick the idle VCPU.)

However, this only reduces (instead of removes) the schedule() overhead, and cannot avoid the context switch overhead either.


The only downside is that at the moment you can't have a domain cross
cpupools; so either all vcpus of a domain would have to be dedicated,
or none.

Yes. I think this is a secondary concern. I'm more concerned about how much overhead we can remove by using the dedicated CPU. Ideally, the more overhead we remove, the better the performance we get.

Do you have any suggestions/insights on the performance goal of the dedicated CPU feature? I think it will affect how far we should go in removing the overheads.

Thank you very much!

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

