Xen project Mailing List

Re: [Xen-devel] schedulers and topology exposing questions

On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote: > Hello all! > Hey, here I am again, > Konrad came up with a workaround that was setting the flag for domain > scheduler in linux > As the guest is not aware of SMT-related topology, it has a flat > topology initialized. > Kernel has domain scheduler flags for scheduling domain CPU set to > 4143 for 2.6.39. > Konrad discovered that changing the flag for CPU sched domain to 4655 > So, as you've seen, I also have been up to doing quite a few of benchmarking doing soemthing similar (I used more recent kernels, and decided to test 4131 as flags. In your casse, according to this: Âhttp://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807 4655 means: Â SD_LOAD_BALANCE Â Â Â Â| Â SD_BALANCE_EXEC Â Â Â Â| Â SD_BALANCE_WAKE Â Â Â Â| Â SD_PREFER_LOCAL Â Â Â Â| [*] Â SD_SHARE_PKG_RESOURCES | Â SD_SERIALIZE and another bit (0x4000) that I don't immediately see what it is. Things have changed a bit since then, it appears. However, I'm quite sure I've tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and results were really pretty bad (as you also seem to say later). > works as a workaround and makes Linux think that the topology has SMT > threads. > Well, yes and no. :-). I don't want to make this all a terminology bunfight, something that also matters here is how many scheduling domains you have. To check that (although in recent kernels) you check here: Âls /proc/sys/kernel/sched_domain/cpu2/ (any cpu is ok) and see how many domain[0-9] you have. On baremetal, on an HT cpu, I've got this: $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/nameÂ SMT MC So, two domains, one of which is the SMT one. If you check their flags, they're different: $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags 4783 559 So, yes, you are right in saying that 4655 is related to SMT. In fact, it is what (among other things) tells the load balancer that *all* the cpus (well, all the scheduling groups, actually) in this domain are SMT siblings... Which is a legitimate thing to do, but it's not what happens on SMT baremetal. At least is consistent, IMO. I.e., it still creates a pretty flat topology, like there was a big core, of which _all_ the vcpus are part of, as SMT siblings. The other option (the one I'm leaning toward) was too get rid of that one flag. I've only done preliminary experiments with it on and off, and the ones with it off were better looking, so I did keep it off for the big run... but we can test with it again. > This workaround makes the test to complete almost in same time as on > baremetal (or insignificantly worse). > > This workaround is not suitable for kernels of higher versions as we > discovered. > There may be more than one reason for this (as said, a lot changed!) but it matches what I've found when SD_SERIALIZE was kept on for the scheduling domain where all the vcpus are. > The hackish way of making domU linux think that it has SMT threads > (along with matching cpuid) > made us thinks that the problem comes from the fact that cpu topology > is not exposed to > guest and Linux scheduler cannot make intelligent decision on > scheduling. > As said, I think it's the other way around: we expose too much of it (and this is more of an issue for PV rather than for HVM). Basically, either you do the pinning you're doing or, whatever you expose, will be *wrong*... and the only way to expose not wrong data is to actually don't expose anything! :-) > The test described above was labeled as IO-bound test. > > We have run io-bound test with and without smt-patches. The > improvement comparing > to base case (no smt patches, flat topology) shows 22-23% gain. > I'd be curious to see the content of the /proc/sys/kernel/sched_domain directory and subdirectories with Joao's patches applied. > While we have seen improvement with io-bound tests, the same did not > happen with cpu-bound workload. > As cpu-bound test we use kernel module which runs requested number of > kernel threads > and each thread compresses and decompresses some data. > That is somewhat what I would have expected, although up to what extent, it's hard to tell in advance. It also matches my findings, both for the results I've already shared on list, and for others that I'll be sharing in a bit. > Here is the setup for tests: > Intel Xeon E5 2600 > 8 cores, 25MB Cashe, 2 sockets, 2 threads per core. > Xen 4.4.3, default timeslice and ratelimit > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+. > Dom0: kernel 4.1.0, 2 vcpus, not pinned. > DomU has 8 vcpus (except some cases). > > > For io-bound tests results were better with smt patches applied for > every kernel. > > For cpu-bound test the results were different depending on wether > vcpus were > pinned or not, how many vcpus were assigned to the guest. > Right. In general, this also makes sense... Can we see the actual numbers? I mean the results of the tests with improvements/regressions highlighted, in addition to the traces that you already shared? > Please take a look at the graph captured by xentrace -e 0x0002f000 > On the graphs X is time in seconds since xentrace start, Y is the > pcpu number, > the graph itself represent the event when scheduler places vcpu to > pcpu. > > The graphs #1 & #2: > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, > one client/server > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, > 8 kernel theads > config: domu, 8vcpus not pinned, smt patches not applied, 2.3.69 > kernel. > Ok, so this is the "baseline", the result of just running your tests in a pretty standard Xen and Dom0 and DomU status and configurations, right? > As can be seen here scheduler places the vcpus correctly on empty > cores. > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this? > Take a look at > trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png > where I split data per vcpus. > Well, why not, I would say? I mean, where a vcpu starts to run at an arbitrary point in time, especially if the system is otherwise idle before, it can be considered random (it's not, it depends on both the vcpu's and system's previous history, but in a non-linear way, and that is not in the graph anyway). In any case, since there are idle cores, the fact that vcpus do not move much, even if they're not pinned, I consider it a good thing, don't you? If vcpuX wakes up on processor Y, where it has always run before, and it find out it can still run there, migrating somewhere else would be pure overhead. The only potential worry of mine about trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png, is that vcpus 4 and 7 (or 4 and 2, colors are too similar to be sure) run for some time (the burst around t=17), on pcpus 5 and 6. Are these two pcpus SMT siblings? Doing the math myself on pCPUs IDs, I don't think they are, so all would be fine. If they are, that should not happen. However, you're using 4.4, so even if you had an issue there, we don't know if it's still in staging. In any case and just to be sure, can you produce the output of `xl vcpu-list', while this case is running? > Now to cpu-bound tests. > When smt patches applied and vcpus pinned correctly to match the > topology and > guest become aware of the topology, cpu-bound tests did not show > improvement with kernel 2.6.39. > With upstream kernel we see some improvements. The tes was repeated 5 > times back to back. > Again, 'some' being? > The number of vcpus was increased to 16 to match the test case where > linux was not > aware of the topology and assumed all cpus as cores. > Â > On some iterations one can see that vcpus are being scheduled as > expected. > For some runs the vcpus are placed on came core (core/thread) (see > trace_cpu_16vcpus_8threads_5runs.out.plot.err.png). > It doubles the time it takes for test to complete (first three runs > show close to baremetal execution time). > No, sorry, I don't think I fully understood this part. So: Â1. can you point me at where (time equal to ?) what you are sayingÂ Â Â happens? Â2. more important, you are saying that the vcpus are pinned. If youÂ Â Â pin the vcpus they just should not move. Period. If they move,Â Â Â if's a bug, no matter where they go and what the other SMT siblingÂ Â Â of the pcpu where they go is busy or idle! :-O Â Â So, are you saying that you pinned the vcpus of the guest and you Â Â see them moving and/or not being _always_ scheduled where youÂ Â Â pinned them? Can we see `xl vcpu-list' again, to see how they'reÂ Â Â actually pinned. > END: cycles: 31209326708 (29 seconds) > END: cycles: 30928835308 (28 seconds) > END: cycles: 31191626508 (29 seconds) > END: cycles: 50117313540 (46 seconds) > END: cycles: 49944848614 (46 seconds) > > Since the vcpus are pinned, then my guess is that Linux scheduler > makes wrong decisions? > Ok, so now it seems to me that you agree that the vcpus don't have much alternatives. If yes (which would be of great relief for me :-) ), it could indeed be that indeed the Linux scheduler is working suboptimally. Perhaps it's worth trying running the benchmark inside the guest with the Linux's threads pinned to the vcpus. That should give you perfectly consistent results over all the 5 runs. One more thing. You say you the guest has 16 vcpus, and that there are 8 threads running inside it. However, I seem to be able to identify in the graphs at least a few vertical lines where more than 8 vcpus are running on some pcpu. So, if Linux is working well, and it really only has to place 8 vcpus, it would put them on different cores. However, if at some point in time, there is more than that it has to place, it will have to necessarily 'invade' an already busy core. Am I right in seeing those lines, or are my eyes deceiving me? (I think a per-vcpu breakup of the graph above, like you did for dom0, would help figure this out). > So I ran the test with smt patches enabled, but not pinned vcpus. > AFAICT, This does not make much sense. So, if I understood correctly what you mean, by doing as you say, you're telling Linux that, for instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to run vcpu0 and vcpu1 at the same time wherever it likes... same core, different core on same socket, different socket, etc. This, I would say, bring us back to the pseudo-random situation we have by default already, without any patching and any pinning, or just in a different variant of it.Â > result is also shows the same as above (see > trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png) > : > Also see the per-cpu graph > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_per > vcpu.png). > Ok. I'll look at this graph better with the aim of showing an example of my theory above (as soon as my brain, which is not in it's best shape today) will manage to deal with all the colors (I'm not complaining, BTW, there's not another way in which you can show things, it's just me! :-D). > END: cycles: 49740185572 (46 seconds) > END: cycles: 45862289546 (42 seconds) > END: cycles: 30976368378 (28 seconds) > END: cycles: 30886882143 (28 seconds) > END: cycles: 30806304256 (28 seconds) > > I cut the timeslice where its seen that vcpu0 and vcpu2 run on same > core while other cores are idle: > > 35v2 9.881103815 > 7ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > 35v0 9.881104013 6 > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > 35v2 9.892746452 > 7ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > 35v0 9.892746546 6ÂÂÂ-> vcpu0 gets scheduled right after vcpu2 on > same core > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > 35v0 9.904388175 > 6ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > 35v2 9.904388205 7 -> same here > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > 35v2 9.916029791 > 7ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > 35v0 9.916029992 > 6ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ > Yes, this, in theory, should not happen. However, our (but Linux's, or any other OS's one --perhaps in its own way--) can't always be _instantly_ perfect! In this case, for instance, the SMT load balancing logic, in Credit1, is triggered: Â- from outside of sched_credit.c, by vcpu_migrate(), which is calledÂ Â Âupon in response to a bunch of events, but _not_ at every vcpuÂ Â Âwakeup Â- from inside sched_credit.c,Âcsched_vcpu_acct(), if the vcpu was itÂ Â Âhas been active for a while This means, it is not triggered upon each and every vcpu wakeup (it might, but not for the vcpu that is waking up). So, seeing a samples of a vcpu not being scheduled according to optimal SMT load balancing, especially right after it woke up, it is expectable. Then, after a while, the logic should indeed trigger (via csched_vcpu_acct()) and move away the vcpu to an idle core. To tell how long the SMT perfect balancing violation happens, and whether or not it happens as a consequence of tasks wakeups, we need more records from the trace file, coming from around the point where the violation happens. Does this make sense to you? Regards, and thanks for sharing all this! :-) Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.