
Re: [Xen-devel] schedulers and topology exposing questions



On Wed, Jan 27, 2016 at 02:01:35PM +0000, Dario Faggioli wrote:
> On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote:
> > Hello all!
> > 
> Hey, here I am again,
> 
> > Konrad came up with a workaround that sets the flag for the domain
> > scheduler in Linux.
> > As the guest is not aware of SMT-related topology, it has a flat
> > topology initialized.
> > The kernel has the scheduling domain flags for the CPU domain set to
> > 4143 for 2.6.39.
> > Konrad discovered that changing the flag for the CPU sched domain to 4655
> >
> So, as you've seen, I have also been doing quite a bit of benchmarking
> along similar lines (I used more recent kernels, and decided to test
> 4131 as the flags value).
> 
> In your case, according to this:
> http://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807
> 
> 4655 means:
>  SD_LOAD_BALANCE        |
>  SD_BALANCE_EXEC        |
>  SD_BALANCE_WAKE        |
>  SD_PREFER_LOCAL        | [*]
>  SD_SHARE_PKG_RESOURCES |
>  SD_SERIALIZE
> 
> and another bit (0x4000) that I can't immediately identify.
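> 
> (In case it helps with reproducing the decoding, below is a rough shell
> helper; the bit values are the ones I believe were in sched.h around
> 2.6.39, so double-check them against the link above before trusting the
> output.)
> 
> decode_sd_flags() {
>     # SD_* bit values as (I think) defined around 2.6.39 -- verify!
>     local val=$1 nv name bit
>     local names="SD_LOAD_BALANCE:1 SD_BALANCE_NEWIDLE:2 SD_BALANCE_EXEC:4
>                  SD_BALANCE_FORK:8 SD_BALANCE_WAKE:16 SD_WAKE_AFFINE:32
>                  SD_PREFER_LOCAL:64 SD_SHARE_CPUPOWER:128
>                  SD_POWERSAVINGS_BALANCE:256 SD_SHARE_PKG_RESOURCES:512
>                  SD_SERIALIZE:1024 SD_ASYM_PACKING:2048
>                  SD_PREFER_SIBLING:4096"
>     for nv in $names; do
>         name=${nv%%:*}; bit=${nv##*:}
>         # print the name of every flag whose bit is set in the value
>         [ $(( val & bit )) -ne 0 ] && echo "$name"
>     done
> }
> # e.g.: decode_sd_flags $(cat /proc/sys/kernel/sched_domain/cpu2/domain0/flags)
> # or:   decode_sd_flags $((0x4655))   # if the value is meant as hex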
> 
> Things have changed a bit since then, it appears. However, I'm quite sure 
> I've tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and results were 
> really pretty bad (as you also seem to say later).
> 
> > works as a workaround and makes Linux think that the topology has SMT
> > threads.
> >
> Well, yes and no. :-) I don't want to turn this all into a terminology
> bunfight, but something that also matters here is how many scheduling
> domains you have.
> 
> To check that (at least on recent kernels), look here:
> 
>  ls /proc/sys/kernel/sched_domain/cpu2/ (any cpu is ok)
> 
> and see how many domain[0-9] entries you have.
> 
> On baremetal, on an HT cpu, I've got this:
> 
> $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
> SMT
> MC
> 
> So, two domains, one of which is the SMT one. If you check their flags,
> they're different:
> 
> $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags
> 4783
> 559
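> 
> For convenience, something like this dumps name and flags side by side
> (any cpu number should do):
> 
> for d in /proc/sys/kernel/sched_domain/cpu2/domain*; do
>     printf '%s: name=%s flags=%s\n' "${d##*/}" "$(cat "$d"/name)" "$(cat "$d"/flags)"
> done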
> 
> So, yes, you are right in saying that 4655 is related to SMT. In fact,
> it is what (among other things) tells the load balancer that *all* the
> cpus (well, all the scheduling groups, actually) in this domain are SMT
> siblings... Which is a legitimate thing to do, but it's not what
> happens on SMT baremetal.
> 
> At least it is consistent, IMO. I.e., it still creates a pretty flat
> topology, as if there were one big core of which _all_ the vcpus are
> part, as SMT siblings.
> 
> The other option (the one I'm leaning toward) was to get rid of that
> one flag. I've only done preliminary experiments with it on and off,
> and the ones with it off looked better, so I kept it off for
> the big run... but we can test with it again.
> 
> > This workaround makes the test complete in almost the same time as on
> > baremetal (or insignificantly worse).
> > 
> > This workaround is not suitable for newer kernel versions, as we
> > discovered.
> > 
> There may be more than one reason for this (as said, a lot changed!)
> but it matches what I've found when SD_SERIALIZE was kept on for the
> scheduling domain where all the vcpus are.
> 
> > The hackish way of making domU Linux think that it has SMT threads
> > (along with a matching cpuid)
> > made us think that the problem comes from the fact that the cpu
> > topology is not exposed to the
> > guest, so the Linux scheduler cannot make intelligent scheduling
> > decisions.
> > 
> As said, I think it's the other way around: we expose too much of it
> (and this is more of an issue for PV than for HVM). Basically,
> either you do the pinning you're doing, or whatever you expose will be
> *wrong*... and the only way not to expose wrong data is to not
> expose anything at all! :-)
> 
> > The test described above was labeled as IO-bound test.
> > 
> > We have run the io-bound test with and without the smt patches. The
> > improvement compared
> > to the base case (no smt patches, flat topology) shows a 22-23% gain.
> > 
> I'd be curious to see the content of the /proc/sys/kernel/sched_domain
> directory and subdirectories with Joao's patches applied.
> 
> > While we have seen improvement with the io-bound tests, the same did
> > not happen with the cpu-bound workload.
> > As the cpu-bound test we use a kernel module which runs a requested
> > number of kernel threads,
> > and each thread compresses and decompresses some data.
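> > 
> > (Not the module itself, but roughly this kind of load; a userspace
> > analogue with gzip, just to give an idea of what each thread does --
> > the input file and iteration count here are made up:)
> > 
> > # spawn N workers, each compressing/decompressing a buffer in a loop
> > N=8
> > for i in $(seq "$N"); do
> >     ( for r in $(seq 100); do
> >           gzip -c /tmp/testdata | gzip -dc > /dev/null
> >       done ) &
> > done
> > wait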
> > 
> That is somewhat what I would have expected, although to what extent
> is hard to tell in advance.
> 
> It also matches my findings, both for the results I've already shared
> on list, and for others that I'll be sharing in a bit.
> 
> > Here is the setup for tests:
> > Intel Xeon E5 2600
> > 8 cores, 25MB cache, 2 sockets, 2 threads per core.
> > Xen 4.4.3, default timeslice and ratelimit
> > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+.
> > Dom0: kernel 4.1.0, 2 vcpus, not pinned.
> > DomU has 8 vcpus (except some cases).
> > 
> > 
> > For the io-bound tests, results were better with the smt patches
> > applied, for every kernel.
> > 
> > For the cpu-bound test the results were different depending on whether
> > the vcpus were
> > pinned or not, and how many vcpus were assigned to the guest.
> > 
> Right. In general, this also makes sense... Can we see the actual
> numbers? I mean the results of the tests with improvements/regressions
> highlighted, in addition to the traces that you already shared?
> 
> > Please take a look at the graph captured by xentrace -e 0x0002f000
> > On the graphs, X is time in seconds since xentrace start, Y is the
> > pcpu number,
> > and the graph itself represents the events where the scheduler places
> > a vcpu on a pcpu.
> > 
> > The graphs #1 & #2:
> > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test,
> > one client/server
> > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test,
> > 8 kernel theads
> > config: domU, 8 vcpus not pinned, smt patches not applied, 2.6.39
> > kernel.
> > 
> Ok, so this is the "baseline": the result of just running your tests
> with a pretty standard Xen, Dom0 and DomU configuration,
> right?
> 
> > As can be seen here, the scheduler places the vcpus correctly on empty
> > cores.
> > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this?
> > Take a look at
> > trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png
> > where I split the data per vcpu.
> > 
> Well, why not, I would say? I mean, where a vcpu starts to run at an
> arbitrary point in time, especially if the system was otherwise idle
> before, can be considered random (it's not, it depends on both the
> vcpu's and the system's previous history, but in a non-linear way, and
> that is not in the graph anyway).
> 
> In any case, since there are idle cores, the fact that the vcpus do not
> move much, even if they're not pinned, is a good thing, don't you
> think? If vcpuX wakes up on processor Y, where it has always run
> before, and it finds out it can still run there, migrating somewhere
> else would be pure overhead.
> 
> My only potential worry about
> trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png is
> that vcpus 4 and 7 (or 4 and 2, the colors are too similar to be sure)
> run for some time (the burst around t=17) on pcpus 5 and 6. Are these
> two pcpus SMT siblings? Doing the math myself on the pCPU IDs, I don't
> think they are, so all would be fine. If they are, that should not happen.
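> 
> (To take the guesswork out of the pCPU ID math, something like the
> following, run from dom0, should show the core/thread mapping as Xen
> sees it; if memory serves, the cpu_topology section of `xl info -n'
> lists core and socket per pcpu:)
> 
> $ xl info -n                # check the cpu_topology section
> $ xenpm get-cpu-topology    # alternative view, if xenpm is available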
> 
> However, you're using 4.4, so even if you had an issue there, we don't
> know if it's still in staging.
> 
> In any case and just to be sure, can you produce the output of `xl
> vcpu-list', while this case is running?
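> 
> (Something as simple as this, run from dom0 while the benchmark is
> going, would do; "domU" is just a placeholder for the actual guest
> name:)
> 
> while sleep 1; do date; xl vcpu-list domU; done | tee vcpu-list.log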
> 
> > Now to cpu-bound tests.
> > When the smt patches are applied, the vcpus are pinned correctly to
> > match the topology, and the
> > guest becomes aware of the topology, the cpu-bound tests did not show
> > improvement with kernel 2.6.39.
> > With an upstream kernel we see some improvements. The test was
> > repeated 5 times back to back.
> >
> Again, 'some' being?
> 
> > The number of vcpus was increased to 16 to match the test case where
> > linux was not
> > aware of the topology and assumed all cpus were cores.
> > 
> > On some iterations one can see that the vcpus are being scheduled as
> > expected.
> > For some runs the vcpus are placed on the same core (core/thread) (see
> > trace_cpu_16vcpus_8threads_5runs.out.plot.err.png).
> > That doubles the time it takes for the test to complete (the first
> > three runs show close to baremetal execution time).
> > 
> No, sorry, I don't think I fully understood this part. So:
>  1. can you point me at where (time equal to ?) what you are saying
>     happens?
>  2. more important, you are saying that the vcpus are pinned. If you
>     pin the vcpus they just should not move. Period. If they move,
>     it's a bug, no matter where they go and whether the other SMT
>     sibling of the pcpu where they go is busy or idle! :-O
> 
>     So, are you saying that you pinned the vcpus of the guest and you
>     see them moving and/or not _always_ being scheduled where you
>     pinned them? Can we see `xl vcpu-list' again, to see how they're
>     actually pinned?
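> 
> (For reference, the kind of pinning I would expect here is a plain
> 1:1 vcpu-to-pcpu mapping, something like the sketch below; "domU" and
> the 0-15 range are assumptions based on your 16-vcpu setup:)
> 
> for v in $(seq 0 15); do
>     xl vcpu-pin domU "$v" "$v"
> done
> xl vcpu-list domU    # verify the affinities actually stuck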
> 
> > END: cycles: 31209326708 (29 seconds)
> > END: cycles: 30928835308 (28 seconds)
> > END: cycles: 31191626508 (29 seconds)
> > END: cycles: 50117313540 (46 seconds)
> > END: cycles: 49944848614 (46 seconds)
> > 
> > Since the vcpus are pinned, my guess is that the Linux scheduler
> > makes wrong decisions?
> >
> Ok, so now it seems to me that you agree that the vcpus don't have many
> alternatives.
> 
> If yes (which would be a great relief for me :-) ), it could indeed be
> that the Linux scheduler is working suboptimally.
> 
> Perhaps it's worth trying to run the benchmark inside the guest with
> Linux's threads pinned to the vcpus. That should give you perfectly
> consistent results over all 5 runs.
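> 
> (From inside the guest, something along these lines should do it; the
> thread name "compress_thread" is purely hypothetical -- substitute
> whatever your module actually calls its kthreads:)
> 
> i=0
> for pid in $(pgrep -f compress_thread); do   # hypothetical kthread name
>     taskset -p -c "$i" "$pid"    # pin each worker to its own vcpu
>     i=$((i + 1))
> done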
> 
> One more thing. You say the guest has 16 vcpus, and that there are
> 8 threads running inside it. However, I seem to be able to identify in
> the graphs at least a few vertical lines where more than 8 vcpus are
> running on some pcpu. So, if Linux is working well, and it really only
> has to place 8 vcpus, it would put them on different cores. However, if
> at some point in time it has more than that to place, it will
> necessarily have to 'invade' an already busy core. Am I right in seeing
> those lines, or are my eyes deceiving me? (I think a per-vcpu breakup
> of the graph above, like you did for dom0, would help figure this out.)
> 
> > So I ran the test with smt patches enabled, but not pinned vcpus.
> > 
> AFAICT, this does not make much sense. So, if I understood correctly
> what you mean, by doing as you say, you're telling Linux that, for
> instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to run
> vcpu0 and vcpu1 at the same time wherever it likes... same core,
> different core on same socket, different socket, etc.

Correct. I did run this to see what happens in this pseudo-random case.
> 
> This, I would say, brings us back to the pseudo-random situation we
> have by default already, without any patching and any pinning, or just
> to a different variant of it.
> 
> > The result also shows the same as above (see
> > trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png).
> > Also see the per-cpu graph
> > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).
> > 
> Ok. I'll look at this graph more closely, with the aim of showing an
> example of my theory above, as soon as my brain (which is not in its
> best shape today) manages to deal with all the colors (I'm not
> complaining, BTW, there's probably no other way you could have shown
> this, it's just me! :-D).

At the same time, if you think I can improve the data representation,
that would be awesome!

> 
> > END: cycles: 49740185572 (46 seconds)
> > END: cycles: 45862289546 (42 seconds)
> > END: cycles: 30976368378 (28 seconds)
> > END: cycles: 30886882143 (28 seconds)
> > END: cycles: 30806304256 (28 seconds)
> > 
> > I cut out the timeslice where it can be seen that vcpu0 and vcpu2 run
> > on the same core while other cores are idle:
> > 
> > 35v2 9.881103815 7
> > 35v0 9.881104013 6
> > 35v2 9.892746452 7
> > 35v0 9.892746546 6   -> vcpu0 gets scheduled right after vcpu2 on same core
> > 35v0 9.904388175 6
> > 35v2 9.904388205 7   -> same here
> > 35v2 9.916029791 7
> > 35v0 9.916029992 6
> > 
> Yes, in theory this should not happen. However, our load balancing
> (like Linux's, or any other OS's, perhaps each in its own way) can't
> always be _instantly_ perfect! In this case, for instance, the SMT load
> balancing logic in Credit1 is triggered:
>  - from outside of sched_credit.c, by vcpu_migrate(), which is called
>    in response to a bunch of events, but _not_ at every vcpu wakeup;
>  - from inside sched_credit.c, by csched_vcpu_acct(), if the vcpu has
>    been active for a while.
> 
> This means it is not triggered upon each and every vcpu wakeup (it
> might be, but not for the vcpu that is waking up). So, seeing samples of
> a vcpu not being scheduled according to optimal SMT load balancing,
> especially right after it woke up, is to be expected. Then, after a
> while, the logic should indeed trigger (via csched_vcpu_acct()) and
> move the vcpu away to an idle core.
> 
> To tell how long the violation of perfect SMT balancing lasts, and
> whether or not it happens as a consequence of task wakeups, we need
> more records from the trace file, from around the point where
> the violation happens.
> 
> Does this make sense to you?

Dario, thanks for the explanations.

I am going to verify some numbers, and I am also collecting more trace
data.
I will send it shortly; sorry for the delay.


Elena
> 
> Regards, and thanks for sharing all this! :-)
> Dario
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel