
[Xen-devel] Some interesting results from schedbench



I'm starting to get some interesting results from schedbench.

Unfortunately, to see how interesting they really are will take a fair
amount of getting up to speed. :-(

But the summary is this:

- credit2 is *much* more consistent than credit1.  At the level of
one-second samples, credit2's samples look nearly identical, whereas
credit1's samples have quite a bit more variance

- credit2 gives much better "maximum latency" than credit1

- That said, in most cases credit1 doesn't do too badly when averaged
over a whole 10-second run

- credit2 tends to be much more "fair" at the cpu time level; usually
this leads to fairer throughputs as well, but sometimes it leads to
apparently contradictory results, where the same VM gets lower
throughput with higher overall cpu utilization

Attached is the output of a recent run.  A brief introduction is in the
README here:

https://github.com/gwd/schedbench

My system at the moment has a cpupool with 2 cores each of which has 2
hyperthreads (4 logical cpus).

In this test (04) workloads A and B are fairly similar; they burn and
sleep on the order of hundreds of milliseconds.

OK, so here are some interesting highlights:

* The throughput of workload A is higher in run '1a+1b' (one of each
workload) than in 'baseline-a'

For both schedulers, workload A averages around 146 Mops/sec when
running in an empty cpupool; and workload B averages around 250 Mops/sec
when running in an empty cpupool.

But put them both in the same cpupool, and workload A now averages 174
Mops/sec in both schedulers.

* In "RUN 1a+1b", the throughput of workload B is higher in credit2 than
in credit1 (255Mops/sec compared to 250Mops/sec)

Looking at the individual samples, it appears that credit1 has a very
consistent throughput of around 250Mops/sec.  But the throughput for
individual samples in credit2 is bimodal: three samples (seconds 1, 2,
and 4) have throughputs of around 250Mops/sec, and six (seconds 3, 5-9)
have throughputs around 258Mops/sec.  The throughputs for workload A are
similarly bimodal in the same seconds, though not as dramatically (175
vs 174 Mops/sec).

The latter in particular makes me wonder whether we're running into some
kind of power savings / cstate effect.

* "RUN 2a+2b" is probably one of the most interesting and
counterintuitive cases

In this one, it looks like credit1 is more "fair" than credit2:
Aggregate worker A throughput is 292Mops/sec for credit1, only
241Mops/sec for credit2; worker B throughput is only 353Mops/sec for
credit1, compared to 392Mops/sec for credit2.

But also interestingly enough, under credit1, worker A is getting *less*
cputime: each copy of worker A averages only 58% of the cpu, whereas
under credit2 it averages 64%.  So under credit1, worker A is doing more
with less.

But the reverse is true for worker B: credit2 gives worker B only 85%
of the cpu (compared to 92% under credit1); but worker B's throughput
is 178Mops/sec under credit2, compared to 159Mops/sec under credit1.
So under credit2, worker *B* is doing more with less.

Before explaining this a bit more, it's worth digging a bit deeper into
the individual samples.  Here we start to see credit2's consistency --
every single sample of worker A gets about 65% of the CPU, and generates
121Mops of throughput.  Under credit1, worker A's cpu time ranges from
45% to 65%.  And *most* of the time its throughput is also around
121Mops.  But there are individual intervals where it goes up to 150Mops
or 175Mops.

What I think we're seeing under credit1 is an artifact of
hyperthreading.  Remember that there are 2 cores, each with two threads.
The steady state for this use case will be one worker per logical cpu.
But none of the workers max out the cpu: so some of the time, each
worker will be running on a thread whose sibling is idle, in which case
it will get a throughput boost.

But since worker A and worker B have different cpu utilizations, the
amount of "boosted" time will depend on the particular placement.  If
one core has two worker A threads and one has two worker B threads, then
the worker A threads will have much more "boost" time than the worker
"B" threads; whereas if each core has one A and one B, then the amount
of "boost" time will be equal.
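A crude back-of-envelope model of this effect (a sketch only: the
utilization figures are rough readings from the runs above, and busy
periods are assumed uncorrelated, which they certainly aren't entirely):

```python
# Rough model, not schedbench code: fraction of time each worker runs
# "boosted" (i.e. with its hyperthread sibling idle) under the two
# placements, assuming busy periods are independent.

u_a = 0.65  # approximate cpu utilization of a worker A (rough figure)
u_b = 0.90  # approximate cpu utilization of a worker B (rough figure)

def boosted_fraction(my_util, sibling_util):
    """Fraction of wall-clock time I am running while my sibling
    thread is idle, assuming our busy periods are uncorrelated."""
    return my_util * (1.0 - sibling_util)

# Placement {A,B} {A,B}: each A shares a core with a B.
a_mixed = boosted_fraction(u_a, u_b)
b_mixed = boosted_fraction(u_b, u_a)

# Placement {A,A} {B,B}: like shares a core with like.
a_paired = boosted_fraction(u_a, u_a)
b_paired = boosted_fraction(u_b, u_b)

print(f"A boosted: {a_mixed:.3f} mixed vs {a_paired:.3f} paired")
print(f"B boosted: {b_mixed:.3f} mixed vs {b_paired:.3f} paired")
```

Even this crude model shows the asymmetry: pairing the two A workers
together more than triples A's boosted time while cutting B's, so which
placement credit1 happens to wander into visibly moves the throughput
numbers.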

This is, I think, what we're seeing here.  If you look, under credit1,
the two copies of worker A have very similar spikes in throughput: at
seconds 4, 8, and 9, both get 175Mops; at seconds 5 and 7, both get
150Mops.  And those "spikes" correspond to "dips" in throughput for
both copies of worker B: 150Mops at seconds 4, 8, and 9, and 170Mops
at seconds 5 and 7.

This is due to the difference between credit1 and credit2's load
balancing.  Credit1 randomly shifts stuff around based on what it sees
at this instant.  Which means that much of the time, it has {A,B} {A,B},
but it frequently ends up with {A,A} {B,B}.

Credit2 measures the load average for runqueues over the long haul and
tries to make the *runqueue* averages the same; and since our default
now is one runqueue per core, that means it will almost immediately go
to {A,B} {A,B} and never change it.
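As a toy illustration (this is not Xen's actual algorithm, and the load
figures are just the rough utilizations from this test), balancing on
per-runqueue load averages lands on the {A,B} {A,B} split straight away:

```python
# Toy sketch, not Xen code: greedy placement that equalizes long-run
# load averages across runqueues, in the spirit of credit2's approach.

def balance(workers, nr_rqs):
    """workers: list of (name, load-average) pairs.
    Places each worker, heaviest first, on the runqueue whose
    accumulated load average is currently lowest."""
    rqs = [[] for _ in range(nr_rqs)]
    totals = [0.0] * nr_rqs
    for name, load in sorted(workers, key=lambda w: -w[1]):
        i = totals.index(min(totals))
        rqs[i].append(name)
        totals[i] += load
    return rqs, totals

# Two runqueues = two cores; loads are rough per-worker utilizations.
rqs, totals = balance([("A1", 0.65), ("A2", 0.65),
                       ("B1", 0.90), ("B2", 0.90)], 2)
print(rqs)  # each runqueue ends up with one A and one B
```

Credit1's instantaneous, somewhat random migrations have no such
long-run target, which is why it keeps wandering between the two
placements.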

The aggregate throughput for the system seems to be slightly higher
under credit1 (645Mops credit1 vs 639Mops credit2).

It's actually somewhat arguable what the optimal thing to do here is --
one could argue that "fairness" in the case of hyperthreads should mean
that if you leave space for someone else to run at 'boost', you should
in turn be given space to run at 'boost' yourself.

But that's probably an optimization for another day: on the whole I
think credit2's rational approach to balancing load is much better.

* As the overcommitment goes up, credit2's consistency becomes more evident

Once things become properly overcommitted, the aggregate fairness of
credit1 and credit2 seem to be similar: the sum of throughputs converge
around 300Mops.  But the individual throughputs for all workers of
either type for credit2 all converge in a very tight range; whereas with
credit1 there is quite a bit of variance.  And even for the 10-second
averages, the standard deviation for credit2 is an order of magnitude
lower than credit1's.
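To be concrete about the kind of comparison I mean (the numbers below
are made up for illustration, not taken from the attached runs):

```python
# Illustrative only: hypothetical per-worker 10-second throughput
# averages, in Mops/sec, for the same workload under each scheduler.
import statistics

credit1 = [28.1, 35.4, 31.0, 42.7, 25.9, 38.2, 30.5, 34.8]
credit2 = [31.2, 31.5, 31.1, 31.4, 31.3, 31.2, 31.6, 31.0]

for name, xs in (("credit1", credit1), ("credit2", credit2)):
    # Spread across workers: credit2's is far tighter.
    print(name, f"mean={statistics.mean(xs):.1f}",
          f"stdev={statistics.stdev(xs):.2f}")
```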

* The maximum latency of credit2 is much smaller and more predictable

One of the values reported in the raw scores that I haven't done
anything with yet is the "maximum delta" -- the maximum time a VM woke
up *after* it asked to be woken up.  This is the last number in the raw
scores, and is given in nanoseconds.
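For concreteness, the computation is just the worst oversleep across
all wakeups (the sample data and field layout below are made up for
illustration; see the README for the actual raw-score format):

```python
# Illustrative only: computing a "maximum delta" from per-wakeup
# timestamps, given as (requested_ns, actual_ns) pairs.

def max_wake_delta(wakeups):
    """Worst-case oversleep, in nanoseconds: the largest gap between
    when a VM asked to be woken and when it actually woke."""
    return max(actual - requested for requested, actual in wakeups)

# Hypothetical samples: each wakeup arrives somewhat after it was asked for.
samples = [(1_000_000, 1_450_000),   # 0.45ms late
           (5_000_000, 7_200_000),   # 2.2ms late
           (9_000_000, 9_300_000)]   # 0.3ms late
print(max_wake_delta(samples) / 1e6, "ms")  # prints 2.2 ms
```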

For "RUN 8a+8b", credit2 has highly consistent numbers -- 2.2ms for
workload A, 4.5ms for workload B.  "RUN 16a+16b" is similar -- 4.2ms for
workload A, 9ms for workload B.

Credit1 has *much* higher numbers and higher variance.  For 8a+8b,
worker A is getting from 62ms to 120ms, and worker B is getting 90ms to
150ms.  For 16a+16b, worker A is getting 150-270ms, while worker B is
getting as high as 300ms.

---

That's a lot of information.  This is only one mix of workloads, on only
one box, in a very tight corner case.  A lot more testing needs to be
done before we can have a clear idea how the algorithms work at other
levels.  But it does give us some interesting insights into the
micro-level difference between credit1 and credit2.

 -George

Attachment: 04.credit.out
Description: Text document

Attachment: 04.credit2.out
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

