
[Xen-devel] Fwd: xen: credit2: credit2 can’t reach the throughput as expected



Hi, Dario,
Here are the xentrace results for credit2 with ratelimiting at 1ms and at 30ms, observing for 1 second in each case.

Roughly, we can see the frequency of context switches.
The number of context switches decreases significantly when ratelimiting is changed from 1ms to 30ms:
linux-EBkjWt:/home # cat credit2_r_1000.log | grep __enter_scheduler | wc -l
2407
linux-EBkjWt:/home # cat credit2_r_30000.log | grep __enter_scheduler | wc -l
714

Since we also replenish credit for sleeper vcpus whenever reset_credit is triggered, to guarantee fairness (and the sched_latency of the sleeper vcpus),
this does not look suitable for some workloads, such as the case in this issue.
Would it be possible to penalize the sleepers, or to replenish credit under a different policy, so as to avoid too much preemption?
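
To make the idea concrete, here is a toy model of the two replenishment policies I have in mind. It is plain C, not the actual Credit2 code; the credit numbers and the 1/2 penalty factor are made up:

#include <stdio.h>

#define CREDIT_INIT 10000000L   /* made-up "full" credit budget */
#define NR_VCPUS    2

struct toy_vcpu {
    const char *name;
    long credit;
    int was_sleeping;           /* slept since the last reset */
};

/* Current behaviour, as I understand it: on reset every vcpu,
 * including sleepers, is topped back up, so a woken sleeper can
 * immediately preempt a runner whose credit is lower. */
static void reset_replenish_all(struct toy_vcpu *v, int n)
{
    int i;

    for (i = 0; i < n; i++)
        v[i].credit = CREDIT_INIT;
}

/* The alternative I am asking about: give sleepers only part of
 * the budget, so waking up causes less preemption. */
static void reset_penalize_sleepers(struct toy_vcpu *v, int n)
{
    int i;

    for (i = 0; i < n; i++)
        v[i].credit = v[i].was_sleeping ? CREDIT_INIT / 2 : CREDIT_INIT;
}

int main(void)
{
    struct toy_vcpu v[NR_VCPUS] = {
        { "vcpu1", 1000000L, 0 },   /* kept running, burned its credit */
        { "vcpu2", 4000000L, 1 },   /* slept for part of the period */
    };

    reset_replenish_all(v, NR_VCPUS);
    printf("replenish all:     %s=%ld %s=%ld\n",
           v[0].name, v[0].credit, v[1].name, v[1].credit);

    v[0].credit = 1000000L;     /* rewind for the second policy */
    v[1].credit = 4000000L;
    reset_penalize_sleepers(v, NR_VCPUS);
    printf("penalize sleepers: %s=%ld %s=%ld\n",
           v[0].name, v[0].credit, v[1].name, v[1].credit);

    return 0;
}

Under the second policy the woken sleeper no longer starts with a big credit advantage over the vcpu that kept running, which is the effect I am after.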

In theory we sacrifice throughput for sched_latency. However, what is interesting is that, as I said before, if I don't replenish credit for the sleepers,
or if I enlarge the ratelimiting, the sched_latency may not get worse:
the vcpus run staggered, spread across the pCPUs, which are idle most of the time due to the stable running pattern in my demo.

Best regards.

------
> Forwarding to xen-devel, as it was dropped.
> ---
> Hi, Dario,
>
> > On Thu, 2019-02-14 at 07:10 +0000, zheng chuan wrote:
> > > Hi, Dario,
> > >
> > Hi,
> >
> > > I have put the test demo in the attachment; please run it as follows:
> > > 1. Compile it:
> > >    gcc upress.c -o upress
> > > 2. Calculate the loops in dom0 first: ./upress -l 100
> > >    For example, the output is "cpu khz : 2200000 calculate loops: 4472",
> > >    so we get 4472.
> > > 3. Put 20% pressure on each vcpu in the guest: ./upress -l 20 -z 4472 &
> > >    It is better to bind each pressure task to a vcpu with taskset.
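> > >
> > > For reference, the rough idea of upress is to burn the CPU for -l percent of
> > > every period and sleep for the rest. This is not the actual code in the
> > > attachment, just a simplified sketch, and the 100ms period is an assumption:
> > >
> > > #include <stdio.h>
> > > #include <time.h>
> > > #include <unistd.h>
> > >
> > > /* busy-wait for roughly 'us' microseconds */
> > > static void burn_us(long us)
> > > {
> > >     struct timespec start, now;
> > >
> > >     clock_gettime(CLOCK_MONOTONIC, &start);
> > >     do {
> > >         clock_gettime(CLOCK_MONOTONIC, &now);
> > >     } while ((now.tv_sec - start.tv_sec) * 1000000L +
> > >              (now.tv_nsec - start.tv_nsec) / 1000L < us);
> > > }
> > >
> > > int main(void)
> > > {
> > >     const long period_us = 100000;  /* assumed 100ms period */
> > >     const int load = 20;            /* -l 20 => 20% busy */
> > >
> > >     for (;;) {
> > >         burn_us(period_us * load / 100);         /* ~20ms busy */
> > >         usleep(period_us * (100 - load) / 100);  /* ~80ms idle */
> > >     }
> > >     return 0;
> > > }
> > >
> > > Each task then shows a stable busy/idle pattern at the requested load.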
> > >
> > Ok, thanks for the code and the instructions, I will give it a try.
> >
>
> If you have questions about the test code, please let me know:)
>
> > > Sorry for the mess picture, you can see the figure below.
> > >
> > Yeah, that's clearer. However, it is preferred to avoid HTML emails. In these
> > cases, you could put the table in some online accessible document, and post
> > the link. :-)
> >
> > Also, let me ask this again, is this coming from actual tracing (like with
> > `xentrace` etc)?
> >
> > > The green parts mean the vcpu is running while the red parts mean it is idle.
> > > In Fig.1, vcpu1 and vcpu2 run staggered: vcpu1 runs 20ms
> > > and then vcpu2 runs 20ms while vcpu1 is sleeping.
> > >
> > How do you know it's sleeping and not, for instance, that it has been
> > preempted and hence is waiting to run?
> >
>
> It is a schematic diagram, in theory.
> I'm sorry, xentrace has a problem on my machine; I'll post the trace as soon as
> I fix it.
>
> > My point being that, when you set up a workload like this, and only look at the
> > throughput you achieve, it is expected that schedulers with longer timeslices
> > do better.
> >
> > It would be interesting to look at both throughput and latency, though.
> > In fact, (assuming the analysis is correct) in the Credit1 case, if two vcpus
> > wake up at about the same time, the one that wins the pcpu runs for a full
> > timeslice, or until it blocks, i.e., in this case, for 20ms.
> > This means the other vcpu will have to wait that long before being able to do
> > anything.
> >
> > > In Fig.2, vcpu1 and vcpu2 run at the same time: vcpu1 and
> > > vcpu2 compete for the pCPU, and then go to sleep at the same time.
> > > Obviously, the smaller the time-slice is, the worse the competition gets.
> > >
> > But the better the latency. :-D
> >
> > What I mean is that achieving the best throughput and the best latency at the same
> > time is often impossible, and the job of a scheduler is to come up with a
> > trade-off, as well as with tunables for letting people that care more about
> > either one or the other steer it in that direction.
> >
> > Achieving better latency than Credit1 has been a goal of Credit2, since design
> > time. However, it's possible that we ended up sacrificing throughput too much,
> > or that we lack tunables to let users decide what they want.
> >
> > Of course, this is all assuming that the analysis of the problem that you're
> > providing is correct, which I'll be looking into confirming. :-)
> >
>
> I agree that it is difficult to balance between throughput and
> sched_latency :( In my workload, if we enlarge the difference in credit between
> vcpus, the vcpus run staggered over the long term; I suspect
> the sched_latency could also stay low if we spread the running vcpus across the pCPUs,
> since the pCPUs are not used up to 100%.
>
> I used to measure sched_latency in CFS with perf, with these scheduler
> parameters:
> linux-GMwmmh:~ # cat /proc/sys/kernel/sched_min_granularity_ns
> 3000000
> linux-GMwmmh:~ # cat /proc/sys/kernel/sched_latency_ns
> 24000000
> linux-GMwmmh:~ #
> The maximum sched_latency of a vcpu is around 21ms.
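> (That seems consistent with the tunables above: as long as the period is not
> stretched, i.e. at most sched_latency_ns / sched_min_granularity_ns = 8 runnable
> tasks per cpu, the worst-case wait is about 24ms - 3ms = 21ms.)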
>
> But I don't know how to compare this sched_latency with Credit2, since the
> analysis tools and the schedulers are totally different :(.
>
> > > As you mentioned, Credit2 does not have a real timeslice; the
> > > vcpu can be preempted depending on the difference of credit (+
> > > sched_ratelimit_us), dynamically.
> > >
> > Actually, it's:
> >
> > difference_of_credit + min(CSCHED2_MIN_TIMER, sched_ratelimit_us)
> >
>
> Thank you for the correction.
>
> > > > Perhaps, one thing that can be done to try to confirm this analysis, would be to
> > > > make the scheduling less frequent in Credit2 and, on the other hand, to make
> > > > it more frequent in Credit1.
> >
> > > Here are the further test results:
> > > i. It is interesting that it still works well if I set Credit1 to a 1ms
> > > timeslice with xl sched-credit -s -t 1:
> > > linux-sodv:~ # xl sched-credit
> > > Cpupool Pool-0: tslice=1ms ratelimit=1000us migration-delay=0us
> > > Name        ID  Weight  Cap
> > > Domain-0     0     256    0
> > > Xenstore     1     256    0
> > > guest_1      2     256    0
> > > guest_2      3     256    0
> > >
> > Hah, yes, it is interesting indeed! It shows us one more time how
> > unpredictable Credit1's behavior is, because of all the hacks it
> > accumulated over time (some of which are my doing, I know... :-P).
> >
> > > ii. It works well if sched_ratelimit_us is set to 30ms or above:
> > > linux-sodv:~ # xl sched-credit2 -s -p Pool-0
> > > Cpupool Pool-0: ratelimit=30000us
> > >
> > Ok, good to know, thanks for doing the experiment.
> >
> > If you have time, can you try other values? I mean, while still on
> > Credit2, try to set ratelimiting to, like, 20, 15, 10, 5, and report
> > what happens?
> >
>
> It still has problems if I set ratelimiting below 30ms.
>
> 1) 20ms: not stable, sometimes it can get up to 80% and 160%
> xentop - 20:08:14   Xen 4.11.0_02-1
> 4 domains: 1 running, 3 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
> Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
> NAME      STATE  CPU(sec) CPU(%)  MEM(k)   MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
> Domain-0  -----r      112    3.0  64050452  95.5  no limit  n/a          32    0        0        0    0      0      0      0         0         0    0
> guest_1   --b---      110   67.1   1048832   1.6   1049600  1.6           4    1      636        4    1      0   4072   2195    191451     10983    0
> guest_2   --b---      186  134.8   1048832   1.6   1049600  1.6           8    1      630        4    1      0   4203   1166    191619     10921    0
>
> 2) 15ms:
> xentop - 20:10:07   Xen 4.11.0_02-1
> 4 domains: 2 running, 2 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
> Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
> NAME      STATE  CPU(sec) CPU(%)  MEM(k)   MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
> Domain-0  -----r      116    2.7  64050452  95.5  no limit  n/a          32    0        0        0    0      0      0      0         0         0    0
> guest_1   --b---      193   73.9   1048832   1.6   1049600  1.6           4    1      927        5    1      0   4072   2198    191451     10992    0
> guest_2   -----r      350  146.6   1048832   1.6   1049600  1.6           8    1      921        6    1      0   4203   1169    191619     10930    0
>
> 3) 10ms:
> xentop - 20:07:35   Xen 4.11.0_02-1
> 4 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
> Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
> NAME      STATE  CPU(sec) CPU(%)  MEM(k)   MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
> Domain-0  -----r      111    3.1  64050452  95.5  no limit  n/a          32    0        0        0    0      0      0      0         0         0    0
> guest_1   -----r       81   67.1   1048832   1.6   1049600  1.6           4    1      588        3    1      0   4072   2193    191451     10980    0
> guest_2   ------      130  125.5   1048832   1.6   1049600  1.6           8    1      583        4    1      0   4203   1164    191619     10918    0
>
> 4) 5ms:
> xentop - 20:07:12   Xen 4.11.0_02-1
> 4 domains: 3 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
> Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
> NAME      STATE  CPU(sec) CPU(%)  MEM(k)   MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
> Domain-0  -----r      110    2.8  64050452  95.5  no limit  n/a          32    0        0        0    0      0      0      0         0         0    0
> guest_1   -----r       66   64.5   1048832   1.6   1049600  1.6           4    1      386        3    1      0   4072   2187    191451     10835    0
> guest_2   -----r      101  124.3   1048832   1.6   1049600  1.6           8    1      381        3    1      0   4203   1161    191619     10822    0
>
> > > However, sched_ratelimit_us is not so elegant and flexible, in that
> > > it rigidly guarantees one specific time-slice.
> > >
> > Well, I personally never loved it, but it is not completely unrelated
> > to what we're seeing and discussing, TBH. It indeed was introduced to
> > improve the throughput, in workloads where there were too many wakeups
> > (which, in Credit1, also resulted in invoking the scheduler and often
> > in context switching, due to boosting).
> >
> > > It may very likely cause degradation of other scheduler criteria,
> > > like sched_latency.
> > > As far as I know, CFS can adjust the time-slice according to the
> > > number of tasks in the runqueue (in __sched_period()).
> > > Could it be possible for Credit2 to have a similar ability to
> > > adjust the time-slice automatically?
> > >
> > Well, let's see. Credit2 and CFS are very similar, in principle, but
> > the code is actually quite different. But yeah, we may be able to come
> > up with something more clever than just plain ratelimiting, for
> > adjusting what CFS calls "the granularity".
> >
>
> Yes, but I think it is a little difficult to do it the same way as CFS, because
> there is one runqueue per socket, so we no longer have a per-cpu runqueue.
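>
> For reference, my rough understanding of CFS's __sched_period() logic is the
> sketch below. It is rewritten from memory as standalone C with the sysctl
> values from my machine; it is not code from either tree, and the "slice"
> column assumes equal task weights:
>
> #include <stdio.h>
>
> /* values from my machine, in nanoseconds */
> static const unsigned long long sched_latency_ns = 24000000ULL;        /* 24ms */
> static const unsigned long long sched_min_granularity_ns = 3000000ULL; /*  3ms */
>
> /* CFS stretches the scheduling period once there are more runnable
>  * tasks than fit into sched_latency at min_granularity each. */
> static unsigned long long sched_period(unsigned long nr_running)
> {
>     unsigned long nr_latency = sched_latency_ns / sched_min_granularity_ns; /* 8 */
>
>     if (nr_running > nr_latency)
>         return nr_running * sched_min_granularity_ns;
>     return sched_latency_ns;
> }
>
> int main(void)
> {
>     unsigned long n;
>
>     /* with equal weights, each task's slice is roughly period / nr_running */
>     for (n = 1; n <= 12; n++)
>         printf("nr_running=%2lu period=%2llums slice=%2llums\n",
>                n, sched_period(n) / 1000000ULL,
>                sched_period(n) / n / 1000000ULL);
>     return 0;
> }
>
> Maybe Credit2 could do something similar per runqueue, scaling the ratelimit
> (or whatever replaces it) with the number of runnable vcpus in that runqueue.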
>
> Best Regards.
>

Attachment: credit2_r_1000.raw
Description: credit2_r_1000.raw

Attachment: credit2_r_30000.raw
Description: credit2_r_30000.raw

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

