
[Xen-devel] Some interesting results from schedbench



I'm starting to get some interesting results from schedbench.

Unfortunately, to see how interesting they really are will take a fair
amount of getting up to speed. :-(

But the summary is this:

- credit2 is *much* more consistent than credit1.  At the level of
one-second samples, credit2's samples look nearly identical, whereas
credit1's samples have quite a bit more variance

- credit2 gives much better "maximum latency" than credit1

- That said, in most cases credit1 doesn't do too badly when averaged
over a whole 10-second run

- credit2 tends to be much more "fair" at the cpu time level; usually
this leads to fairer throughputs as well, but sometimes it leads to
apparently contradictory results, where the same VM gets lower
throughput with higher overall cpu utilization

Attached is the output of a recent run.  A brief introduction is in the
README here:

https://github.com/gwd/schedbench

My system at the moment has a cpupool with 2 cores each of which has 2
hyperthreads (4 logical cpus).

In this test (04) workloads A and B are fairly similar; they burn and
sleep on the order of hundreds of milliseconds.

OK, so here are some interesting highlights:

* The throughput of workload A is higher in run '1a+1b' (one of each
workload) than in 'baseline-a'

For both schedulers, workload A averages around 146 Mops/sec when
running in an empty cpupool; and workload B averages around 250 Mops/sec
when running in an empty cpupool.

But put them both in the same cpupool, and workload A now averages 174
Mops/sec in both schedulers.

* In "RUN 1a+1b", the throughput of workload B is higher in credit2 than
in credit1 (255Mops/sec compared to 250Mops/sec)

Looking at the individual samples, it appears that credit1 has a very
consistent throughput of around 250Mops/sec.  But the throughput for
individual samples in credit2 is bimodal: three samples (seconds 1, 2,
and 4) have throughputs of around 250Mops/sec, and six (seconds 3, 5-9)
have throughputs around 258Mops/sec.  The throughputs for workload A are
similarly bimodal in the same seconds, though not as dramatically (175
vs 174 Mops/sec).

The latter in particular makes me wonder whether we're running into some
kind of power savings / cstate effect.

* "RUN 2a+2b" is probably one of the most interesting and
counterintuitive cases

In this one, it looks like credit1 is more "fair" than credit2:
Aggregate worker A throughput is 292Mops/sec for credit1, only
241Mops/sec for credit2; worker B throughput is only 353Mops/sec for
credit1, compared to 392Mops/sec for credit2.

But also interestingly enough, under credit1, worker A is getting *less*
cputime: each copy of worker A averages only 58% of the cpu, whereas
under credit2 it averages 64%.  So under credit1, worker A is doing more
with less.

But the reverse is true for worker B: credit2 gives worker B only 85%
of the cpu (compared to 92% under credit1); but worker B's throughput
is 178Mops/sec under credit2, compared to 159Mops/sec under credit1.
So under credit2, worker *B* is doing more with less.

Before explaining this a bit more, it's worth digging a bit deeper into
the individual samples.  Here we start to see credit2's consistency --
every single sample of worker A gets about 65% of the CPU, and generates
121Mops of throughput.  Under credit1, worker A's cpu time ranges from
45% to 65%.  And *most* of the time its throughput is also around
121Mops.  But there are individual intervals where it goes up to 150Mops
or 175Mops.

What I think we're seeing under credit1 is an artifact of
hyperthreading.  Remember that there are 2 cores, each with two threads.
The steady state for this use case will be one worker per logical cpu.
But none of the workers max out the cpu: so some of the time, each
worker will be running on a thread whose sibling is idle, in which case
it will get a throughput boost.

But since worker A and worker B have different cpu utilizations, the
amount of "boosted" time will depend on the particular placement.  If
one core has two worker A threads and one has two worker B threads, then
the worker A threads will have much more "boost" time than the worker
"B" threads; whereas if each core has one A and one B, then the amount
of "boost" time will be equal.
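A crude back-of-envelope model of this effect (a sketch only: the
utilization figures are rough readings from the runs above, and busy
periods are assumed uncorrelated, which they certainly aren't entirely):

```python
# Rough model, not schedbench code: fraction of time each worker runs
# "boosted" (i.e. with its hyperthread sibling idle) under the two
# placements, assuming busy periods are independent.

u_a = 0.65  # approximate cpu utilization of a worker A (rough figure)
u_b = 0.90  # approximate cpu utilization of a worker B (rough figure)

def boosted_fraction(my_util, sibling_util):
    """Fraction of wall-clock time I am running while my sibling
    thread is idle, assuming our busy periods are uncorrelated."""
    return my_util * (1.0 - sibling_util)

# Placement {A,B} {A,B}: each A shares a core with a B.
a_mixed = boosted_fraction(u_a, u_b)
b_mixed = boosted_fraction(u_b, u_a)

# Placement {A,A} {B,B}: like shares a core with like.
a_paired = boosted_fraction(u_a, u_a)
b_paired = boosted_fraction(u_b, u_b)

print(f"A boosted: {a_mixed:.3f} mixed vs {a_paired:.3f} paired")
print(f"B boosted: {b_mixed:.3f} mixed vs {b_paired:.3f} paired")
```

Even this crude model shows the asymmetry: pairing the two A workers
together more than triples A's boosted time while cutting B's, so which
placement credit1 happens to wander into visibly moves the throughput
numbers.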

This is, I think, what we're seeing here.  If you look, under credit1,
the two copies of worker A have very similar spikes in throughput: at
seconds 4, 8, and 9, both get 175Mops; at seconds 5 and 7, both get
150Mops.  And those "spikes" correspond to "dips" in throughput for
both copies of worker B: 150Mops at seconds 4, 8, and 9, and 170Mops
at seconds 5 and 7.

This is due to the difference between credit1 and credit2's load
balancing.  Credit1 randomly shifts stuff around based on what it sees
at this instant.  Which means that much of the time, it has {A,B} {A,B},
but it frequently ends up with {A,A} {B,B}.

Credit2 measures the load average for runqueues over the long haul and
tries to make the *runqueue* averages the same; and since our default
now is one runqueue per core, that means it will almost immediately go
to {A,B} {A,B} and never change it.
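As a toy illustration (this is not Xen's actual algorithm, and the load
figures are just the rough utilizations from this test), balancing on
per-runqueue load averages lands on the {A,B} {A,B} split straight away:

```python
# Toy sketch, not Xen code: greedy placement that equalizes long-run
# load averages across runqueues, in the spirit of credit2's approach.

def balance(workers, nr_rqs):
    """workers: list of (name, load-average) pairs.
    Places each worker, heaviest first, on the runqueue whose
    accumulated load average is currently lowest."""
    rqs = [[] for _ in range(nr_rqs)]
    totals = [0.0] * nr_rqs
    for name, load in sorted(workers, key=lambda w: -w[1]):
        i = totals.index(min(totals))
        rqs[i].append(name)
        totals[i] += load
    return rqs, totals

# Two runqueues = two cores; loads are rough per-worker utilizations.
rqs, totals = balance([("A1", 0.65), ("A2", 0.65),
                       ("B1", 0.90), ("B2", 0.90)], 2)
print(rqs)  # each runqueue ends up with one A and one B
```

Credit1's instantaneous, somewhat random migrations have no such
long-run target, which is why it keeps wandering between the two
placements.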

The aggregate throughput for the system seems to be slightly higher
under credit1 (645Mops credit1 vs 639Mops credit2).

It's actually somewhat arguable what the optimal thing to do here is --
one could argue that "fairness" in the case of hyperthreads should mean
that if you leave space for someone else to run at 'boost', you should
in turn be given space to run at 'boost' yourself.

But that's probably an optimization for another day: on the whole I
think credit2's rational approach to balancing load is much better.

* As the overcommitment goes up, credit2's consistency becomes more evident

Once things become properly overcommitted, the aggregate fairness of
credit1 and credit2 seem to be similar: the sum of throughputs converge
around 300Mops.  But the individual throughputs for all workers of
either type for credit2 all converge in a very tight range; whereas with
credit1 there is quite a bit of variance.  And even for the 10-second
averages, the standard deviation for credit2 is an order of magnitude
lower than credit1's.
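To be concrete about the kind of comparison I mean (the numbers below
are made up for illustration, not taken from the attached runs):

```python
# Illustrative only: hypothetical per-worker 10-second throughput
# averages, in Mops/sec, for the same workload under each scheduler.
import statistics

credit1 = [28.1, 35.4, 31.0, 42.7, 25.9, 38.2, 30.5, 34.8]
credit2 = [31.2, 31.5, 31.1, 31.4, 31.3, 31.2, 31.6, 31.0]

for name, xs in (("credit1", credit1), ("credit2", credit2)):
    # Spread across workers: credit2's is far tighter.
    print(name, f"mean={statistics.mean(xs):.1f}",
          f"stdev={statistics.stdev(xs):.2f}")
```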

* The maximum latency of credit2 is much smaller and more predictable

One of the values reported in the raw scores that I haven't done
anything with yet is the "maximum delta" -- the maximum time a VM woke
up *after* it asked to be woken up.  This is the last number in the raw
scores, and is given in nanoseconds.
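For concreteness, the computation is just the worst oversleep across
all wakeups (the sample data and field layout below are made up for
illustration; see the README for the actual raw-score format):

```python
# Illustrative only: computing a "maximum delta" from per-wakeup
# timestamps, given as (requested_ns, actual_ns) pairs.

def max_wake_delta(wakeups):
    """Worst-case oversleep, in nanoseconds: the largest gap between
    when a VM asked to be woken and when it actually woke."""
    return max(actual - requested for requested, actual in wakeups)

# Hypothetical samples: each wakeup arrives somewhat after it was asked for.
samples = [(1_000_000, 1_450_000),   # 0.45ms late
           (5_000_000, 7_200_000),   # 2.2ms late
           (9_000_000, 9_300_000)]   # 0.3ms late
print(max_wake_delta(samples) / 1e6, "ms")  # prints 2.2 ms
```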

For "RUN 8a+8b", credit2 has highly consistent numbers -- 2.2ms for
workload A, 4.5ms for workload B.  "RUN 16a+16b" is similar -- 4.2ms for
workload A, 9ms for workload B.

Credit1 has *much* higher numbers and higher variance.  For 8a+8b,
worker A is getting from 62ms to 120ms, and worker B is getting 90ms to
150ms.  For 16a+16b, worker A is getting 150-270ms, while worker B is
getting as high as 300ms.

---

That's a lot of information.  This is only one mix of workloads, on only
one box, in a very tight corner case.  A lot more testing needs to be
done before we can have a clear idea how the algorithms work at other
levels.  But it does give us some interesting insights into the
micro-level difference between credit1 and credit2.

 -George

Attachment: 04.credit.out
Description: Text document

Attachment: 04.credit2.out
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

