Re: [Xen-devel] schedulers and topology exposing questions



On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote:
> Hello all!
> 
Hey, here I am again,

> Konrad came up with a workaround that was setting the flag for domain
> scheduler in linux
> As the guest is not aware of SMT-related topology, it has a flat
> topology initialized.
> Kernel has domain scheduler flags for scheduling domain CPU set to
> 4143 for 2.6.39.
> Konrad discovered that changing the flag for CPU sched domain to 4655
>
So, as you've seen, I have also been doing quite a bit of benchmarking
along similar lines (I used more recent kernels, and decided to test
4131 as flags).

In your case, according to this:
 http://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807

4655 means:
 SD_LOAD_BALANCE        |
 SD_BALANCE_EXEC        |
 SD_BALANCE_WAKE        |
 SD_PREFER_LOCAL        | [*]
 SD_SHARE_PKG_RESOURCES |
 SD_SERIALIZE

and another bit (0x4000), whose meaning I don't immediately see.
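
As a side note on the arithmetic: which bits are set depends on whether the number is read as decimal or as hexadecimal. Below is a minimal Python sketch that lists the set bits under both readings. The helper name `set_bits` is made up for illustration, and mapping the masks to SD_* names is deliberately left out, since that depends on the sched.h of the kernel version in use.

```python
def set_bits(value):
    """Return the single-bit masks that are set in value, lowest first."""
    return [1 << i for i in range(value.bit_length()) if value & (1 << i)]


# Same digits, two possible readings of the flags value.
for label, value in (("decimal", int("4655", 10)), ("hex", int("4655", 16))):
    masks = ", ".join(f"0x{m:04x}" for m in set_bits(value))
    print(f"4655 read as {label}: 0x{value:04x} -> {masks}")
```

Under the hexadecimal reading the highest bit is 0x4000, which is the unnamed bit mentioned above; under the decimal reading the value is 0x122f and that bit is not set. Which reading applies depends on how the kernel prints the flags file, so it is worth checking before drawing conclusions from the bit names.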

Things have changed a bit since then, it appears. However, I'm quite sure I've 
tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and results were really 
pretty bad (as you also seem to say later).

> works as a workaround and makes Linux think that the topology has SMT
> threads.
>
Well, yes and no. :-) I don't want to turn this into a terminology
bunfight, but something that also matters here is how many scheduling
domains you have.

To check that (at least in recent kernels), look here:

 ls /proc/sys/kernel/sched_domain/cpu2/ (any cpu is ok)

and see how many domain[0-9] you have.

On baremetal, on an HT cpu, I've got this:

$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
SMT
MC

So, two domains, one of which is the SMT one. If you check their flags,
they're different:

$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags
4783
559
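
The two `cat` commands above can be combined into one small script that prints each domain's name next to its flags. This is only a sketch: the function name `read_sched_domains` is hypothetical, and it assumes the kernel exposes the `/proc/sys/kernel/sched_domain` tree used above (other kernels may expose this information elsewhere, or not at all).

```python
from pathlib import Path


def read_sched_domains(cpu=2, root="/proc/sys/kernel/sched_domain"):
    """Return [(domain_dir, name, flags_text)] for one CPU, lowest domain first."""
    cpu_dir = Path(root) / f"cpu{cpu}"
    # Sort numerically so that domain10 would come after domain2.
    domains = sorted(cpu_dir.glob("domain[0-9]*"),
                     key=lambda p: int(p.name[len("domain"):]))
    result = []
    for dom in domains:
        name = (dom / "name").read_text().strip()
        flags = (dom / "flags").read_text().strip()
        result.append((dom.name, name, flags))
    return result


if __name__ == "__main__":
    # Prints nothing if the directory does not exist on this kernel.
    for dom, name, flags in read_sched_domains():
        print(dom, name, flags)
```

Running it inside the guest and on baremetal gives a side-by-side view of how many domains each one has and which flags they carry.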

So, yes, you are right in saying that 4655 is related to SMT. In fact,
it is what (among other things) tells the load balancer that *all* the
cpus (well, all the scheduling groups, actually) in this domain are SMT
siblings... Which is a legitimate thing to do, but it's not what
happens on SMT baremetal.

At least it is consistent, IMO. I.e., it still creates a pretty flat
topology, as if there were one big core of which _all_ the vcpus are
part, as SMT siblings.

The other option (the one I'm leaning toward) was to get rid of that
one flag. I've only done preliminary experiments with it on and off,
and the ones with it off looked better, so I kept it off for the big
run... but we can test with it again.

> This workaround makes the test to complete almost in same time as on
> baremetal (or insignificantly worse).
> 
> This workaround is not suitable for kernels of higher versions as we
> discovered.
> 
There may be more than one reason for this (as said, a lot changed!)
but it matches what I've found when SD_SERIALIZE was kept on for the
scheduling domain where all the vcpus are.

> The hackish way of making domU linux think that it has SMT threads
> (along with matching cpuid)
> made us think that the problem comes from the fact that cpu topology
> is not exposed to
> guest and Linux scheduler cannot make intelligent decision on
> scheduling.
> 
As said, I think it's the other way around: we expose too much of it
(and this is more of an issue for PV than for HVM). Basically, either
you do the pinning you're doing or whatever you expose will be
*wrong*... and the only way to avoid exposing wrong data is to not
expose anything at all! :-)

> The test described above was labeled as IO-bound test.
> 
> We have run io-bound test with and without smt-patches. The
> improvement comparing
> to base case (no smt patches, flat topology) shows 22-23% gain.
> 
I'd be curious to see the content of the /proc/sys/kernel/sched_domain
directory and subdirectories with Joao's patches applied.

> While we have seen improvement with io-bound tests, the same did not
> happen with cpu-bound workload.
> As cpu-bound test we use kernel module which runs requested number of
> kernel threads
> and each thread compresses and decompresses some data.
> 
That is more or less what I would have expected, although to what
extent is hard to tell in advance.

It also matches my findings, both for the results I've already shared
on list, and for others that I'll be sharing in a bit.

> Here is the setup for tests:
> Intel Xeon E5 2600
> 8 cores, 25MB Cache, 2 sockets, 2 threads per core.
> Xen 4.4.3, default timeslice and ratelimit
> Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+.
> Dom0: kernel 4.1.0, 2 vcpus, not pinned.
> DomU has 8 vcpus (except some cases).
> 
> 
> For io-bound tests results were better with smt patches applied for
> every kernel.
> 
> For cpu-bound test the results were different depending on whether
> vcpus were
> pinned or not, and on how many vcpus were assigned to the guest.
> 
Right. In general, this also makes sense... Can we see the actual
numbers? I mean the results of the tests with improvements/regressions
highlighted, in addition to the traces that you already shared?

> Please take a look at the graph captured by xentrace -e 0x0002f000
> On the graphs X is time in seconds since xentrace start, Y is the
> pcpu number,
> the graph itself represent the event when scheduler places vcpu to
> pcpu.
> 
> The graphs #1 & #2:
> trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test,
> one client/server
> trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test,
> 8 kernel threads
> config: domu, 8vcpus not pinned, smt patches not applied, 2.6.39
> kernel.
> 
Ok, so this is the "baseline", the result of just running your tests in
a pretty standard Xen and Dom0 and DomU status and configurations,
right?

> As can be seen here scheduler places the vcpus correctly on empty
> cores.
> As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this?
> Take a look at
> trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png
> where I split data per vcpus.
> 
Well, why not, I would say? I mean, where a vcpu starts to run at an
arbitrary point in time, especially if the system is otherwise idle
before, it can be considered random (it's not, it depends on both the
vcpu's and system's previous history, but in a non-linear way, and that
is not in the graph anyway).

In any case, since there are idle cores, I consider the fact that
vcpus do not move much, even if they're not pinned, a good thing,
don't you? If vcpuX wakes up on processor Y, where it has always run
before, and it finds out it can still run there, migrating somewhere
else would be pure overhead.

The only potential worry of mine about
trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png, is
that vcpus 4 and 7 (or 4 and 2, colors are too similar to be sure) run
for some time (the burst around t=17), on pcpus 5 and 6. Are these two
pcpus SMT siblings? Doing the math myself on pCPUs IDs, I don't think
they are, so all would be fine. If they are, that should not happen.
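
The "math on pCPU IDs" can be spelled out as follows. This is only a sketch under the assumption that the host enumerates the two hyperthreads of a core with adjacent IDs (2k and 2k+1); that enumeration is an assumption made here for illustration, not something the traces confirm, and it should be checked against the real topology of the box. The function names are hypothetical.

```python
THREADS_PER_CORE = 2  # from the quoted setup: 2 threads per core


def core_of(pcpu, threads_per_core=THREADS_PER_CORE):
    """Core index of a pcpu, ASSUMING SMT siblings have adjacent IDs."""
    return pcpu // threads_per_core


def are_smt_siblings(a, b, threads_per_core=THREADS_PER_CORE):
    """True if two distinct pcpus fall on the same core under that assumption."""
    return a != b and core_of(a, threads_per_core) == core_of(b, threads_per_core)


print(are_smt_siblings(5, 6))  # False under this assumption
print(are_smt_siblings(6, 7))  # True under this assumption
```

If the host enumerates siblings differently (for example, all first threads before all second threads), `core_of()` has to change accordingly and the answer for pcpus 5 and 6 may change with it.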

However, you're using 4.4, so even if you had an issue there, we don't
know if it's still in staging.

In any case and just to be sure, can you produce the output of `xl
vcpu-list', while this case is running?

> Now to cpu-bound tests.
> When smt patches applied and vcpus pinned correctly to match the
> topology and
> guest becomes aware of the topology, cpu-bound tests did not show
> improvement with kernel 2.6.39.
> With upstream kernel we see some improvements. The test was repeated 5
> times back to back.
>
Again, 'some' being?

> The number of vcpus was increased to 16 to match the test case where
> linux was not
> aware of the topology and assumed all cpus as cores.
> 
> On some iterations one can see that vcpus are being scheduled as
> expected.
> For some runs the vcpus are placed on same core (core/thread) (see
> trace_cpu_16vcpus_8threads_5runs.out.plot.err.png).
> It doubles the time it takes for test to complete (first three runs
> show close to baremetal execution time).
> 
No, sorry, I don't think I fully understood this part. So:
 1. can you point me at where (time equal to ?) what you are saying
    happens?
 2. more important, you are saying that the vcpus are pinned. If you
    pin the vcpus, they just should not move. Period. If they move,
    it's a bug, no matter where they go and whether the other SMT
    sibling of the pcpu where they go is busy or idle! :-O

    So, are you saying that you pinned the vcpus of the guest and you
    see them moving and/or not being _always_ scheduled where you
    pinned them? Can we see `xl vcpu-list' again, to see how they're
    actually pinned?

> END: cycles: 31209326708 (29 seconds)
> END: cycles: 30928835308 (28 seconds)
> END: cycles: 31191626508 (29 seconds)
> END: cycles: 50117313540 (46 seconds)
> END: cycles: 49944848614 (46 seconds)
> 
> Since the vcpus are pinned, then my guess is that Linux scheduler
> makes wrong decisions?
>
Ok, so now it seems to me that you agree that the vcpus don't have
many alternatives.

If yes (which would be a great relief for me :-) ), it could indeed be
that the Linux scheduler is working suboptimally.
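
For reference, the size of the gap between the fast and slow runs can be computed directly from the five cycle counts quoted above. A minimal sketch (the function name `slowdown_vs_best` is made up for illustration):

```python
def slowdown_vs_best(cycles):
    """Return each run's cycle count as a multiple of the fastest run."""
    best = min(cycles)
    return [c / best for c in cycles]


# Cycle counts of the five back-to-back runs, as quoted above.
runs = [31209326708, 30928835308, 31191626508, 50117313540, 49944848614]

for i, ratio in enumerate(slowdown_vs_best(runs), start=1):
    print(f"run {i}: {ratio:.2f}x the fastest run")
```

The first three runs stay within about 1% of each other, while the last two take roughly 1.6 times as long.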

Perhaps it's worth trying to run the benchmark inside the guest with
the Linux threads pinned to the vcpus. That should give you perfectly
consistent results over all the 5 runs.
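
To illustrate what pinning threads inside the guest means, here is a user-space sketch in Python. It is an analogy only: the benchmark discussed here runs kernel threads from a module, which would need in-kernel binding instead, and all names below (`plan_pinning`, `pinned_worker`, `run_pinned`) are hypothetical. It assumes a Linux guest, where `os.sched_setaffinity(0, ...)` applies to the calling thread.

```python
import os
import threading


def plan_pinning(n_workers, cpus):
    """Assign worker i to one CPU, round-robin over the sorted CPU list."""
    cpus = sorted(cpus)
    return {i: cpus[i % len(cpus)] for i in range(n_workers)}


def pinned_worker(cpu, work):
    # On Linux, pid 0 means "the calling thread" for sched_setaffinity().
    os.sched_setaffinity(0, {cpu})
    work()


def run_pinned(n_workers, work):
    """Start n_workers threads, each restricted to a single (v)cpu."""
    plan = plan_pinning(n_workers, os.sched_getaffinity(0))
    threads = [threading.Thread(target=pinned_worker, args=(cpu, work))
               for cpu in plan.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return plan
```

With 8 workers on a 16-vcpu guest, `plan_pinning(8, range(16))` puts them on vcpus 0 to 7. Whether those vcpus land on distinct cores then depends entirely on how the vcpus themselves are pinned on the Xen side.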

One more thing. You say the guest has 16 vcpus, and that there are
8 threads running inside it. However, I seem to be able to identify in
the graphs at least a few vertical lines where more than 8 vcpus are
running on some pcpu. So, if Linux is working well, and it really only
has to place 8 vcpus, it would put them on different cores. However, if
at some point in time there are more than that to place, it will
necessarily have to 'invade' an already busy core. Am I right in seeing
those lines, or are my eyes deceiving me? (I think a per-vcpu breakdown
of the graph above, like you did for dom0, would help figure this out).

> So I ran the test with smt patches enabled, but not pinned vcpus.
> 
AFAICT, this does not make much sense. So, if I understood correctly
what you mean, by doing as you say, you're telling Linux that, for
instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to run
vcpu0 and vcpu1 at the same time wherever it likes... same core,
different core on same socket, different socket, etc.

This, I would say, brings us back to the pseudo-random situation we have
by default already, without any patching and any pinning, or just to a
different variant of it.

> The result also shows the same as above (see
> trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png):
> Also see the per-cpu graph
> (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).
> 
Ok. I'll look at this graph more closely, with the aim of showing an
example of my theory above, as soon as my brain (which is not in its
best shape today) manages to deal with all the colors (I'm not
complaining, BTW, there's not another way in which you can show things,
it's just me! :-D).

> END: cycles: 49740185572 (46 seconds)
> END: cycles: 45862289546 (42 seconds)
> END: cycles: 30976368378 (28 seconds)
> END: cycles: 30886882143 (28 seconds)
> END: cycles: 30806304256 (28 seconds)
> 
> I cut the timeslice where it's seen that vcpu0 and vcpu2 run on same
> core while other cores are idle:
> 
> 35v2 9.881103815 7
> 35v0 9.881104013 6
> 
> 35v2 9.892746452 7
> 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core
> 
> 35v0 9.904388175 6
> 35v2 9.904388205 7 -> same here
> 
> 35v2 9.916029791 7
> 35v0 9.916029992 6
> 
Yes, this, in theory, should not happen. However, our scheduler (like
Linux's, or any other OS's, perhaps in its own way) can't always be
_instantly_ perfect! In this case, for instance, the SMT load balancing
logic, in Credit1, is triggered:
 - from outside of sched_credit.c, by vcpu_migrate(), which is called
   in response to a bunch of events, but _not_ at every vcpu wakeup
 - from inside sched_credit.c, by csched_vcpu_acct(), if the vcpu has
   been active for a while

This means it is not triggered upon each and every vcpu wakeup (it
might be, but not for the vcpu that is waking up). So, seeing samples
of a vcpu not being scheduled according to optimal SMT load balancing,
especially right after it woke up, is to be expected. Then, after a
while, the logic should indeed trigger (via csched_vcpu_acct()) and
move the vcpu away to an idle core.

To tell how long the violation of perfect SMT balancing lasts, and
whether or not it happens as a consequence of task wakeups, we need
more records from the trace file, coming from around the point where
the violation happens.
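
Once more records are available, a first pass over them could look like the sketch below. It parses records in the format quoted above (`<dom>v<vcpu> <time> <pcpu>`) and flags every placement on a core where a different vcpu was placed at most `window` seconds earlier. Everything here is an assumption made for illustration: the function names, the window value, and the adjacent-ID sibling mapping (pcpus 2k and 2k+1 on the same core). It also only sees placement events, so it cannot tell how long a vcpu actually stays on a pcpu.

```python
import re

# Matches e.g. "35v2 9.881103815 7" anywhere in a line (quote marks allowed).
RECORD = re.compile(r"(\d+)v(\d+)\s+([\d.]+)\s+(\d+)")


def parse(lines):
    """Yield (dom, vcpu, time, pcpu) for every line containing a record."""
    for line in lines:
        m = RECORD.search(line)
        if m:
            yield int(m[1]), int(m[2]), float(m[3]), int(m[4])


def same_core_placements(records, window=0.001, threads_per_core=2):
    """Return (time, vcpu, pcpu, other_vcpu, other_pcpu) tuples for placements
    that follow a different vcpu's placement on the same core within window."""
    last = {}  # core -> (time, (dom, vcpu), pcpu) of the latest placement
    hits = []
    for dom, vcpu, t, pcpu in records:
        core = pcpu // threads_per_core  # ASSUMPTION: siblings are 2k, 2k+1
        prev = last.get(core)
        if prev and prev[1] != (dom, vcpu) and t - prev[0] <= window:
            hits.append((t, vcpu, pcpu, prev[1][1], prev[2]))
        last[core] = (t, (dom, vcpu), pcpu)
    return hits


sample = [
    "> 35v2 9.881103815 7",
    "> 35v0 9.881104013 6",
    "> 35v2 9.892746452 7",
    "> 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core",
    "> 35v0 9.904388175 6",
    "> 35v2 9.904388205 7 -> same here",
    "> 35v2 9.916029791 7",
    "> 35v0 9.916029992 6",
]
for hit in same_core_placements(parse(sample)):
    print(hit)
```

Feeding it a longer stretch of the trace around the violation, together with the wakeup events, would show whether the same-core placements cluster right after wakeups and how many accounting periods pass before the vcpu is moved.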

Does this make sense to you?

Regards, and thanks for sharing all this! :-)
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
