
Re: [Xen-devel] schedulers and topology exposing questions



On Wed, 2016-01-27 at 11:03 -0500, Elena Ufimtseva wrote:
> On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk
> wrote:
> > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote:
> > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote:
> > > > > On 22/01/16 16:54, Elena Ufimtseva wrote:
> > > > > > Hello all!
> > > > > > 
> > > > > > Dario, George, or anyone else, your help will be
> > > > > > appreciated.
> > > > > > 
> > > > > > Let me give a short intro to our findings. I may forget
> > > > > > something or not state it explicitly enough, so please
> > > > > > ask me.
> > > > > > 
> > > > > > A customer filed a bug where some of the applications
> > > > > > were running slowly in their HVM DomU setups.
> > > > > > These running times were compared against bare metal
> > > > > > running the same kernel version as the HVM DomU.
> > > > > > 
> > > > > > After some investigation by different parties, a test
> > > > > > case scenario was found where the problem was easily
> > > > > > seen. The test app is a UDP server/client pair where the
> > > > > > client sends a message n times.
> > > > > > The test case was executed on bare metal and a Xen DomU
> > > > > > with kernel version 2.6.39.
> > > > > > Bare metal showed 2x better results than the DomU.
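> > > > > >
> > > > > > (For reference, a minimal sketch of such a ping-pong
> > > > > > pair - not the actual test app; the port, message size
> > > > > > and iteration count below are made up:)
> > > > > >
> > > > > > /* udp_pingpong.c - hypothetical stand-in for the test app.
> > > > > >  * Server echoes each datagram; client sends n messages,
> > > > > >  * waiting for each echo before sending the next one.
> > > > > >  * Error handling omitted for brevity.
> > > > > >  * Usage: server: ./udp_pingpong
> > > > > >  *        client: ./udp_pingpong <n> <server-ip>          */
> > > > > > #include <arpa/inet.h>
> > > > > > #include <netinet/in.h>
> > > > > > #include <stdlib.h>
> > > > > > #include <sys/socket.h>
> > > > > >
> > > > > > int main(int argc, char **argv)
> > > > > > {
> > > > > >     int n = argc > 1 ? atoi(argv[1]) : 100000;
> > > > > >     char buf[64] = "ping";   /* one small datagram per round */
> > > > > >     struct sockaddr_in addr = { .sin_family = AF_INET,
> > > > > >                                 .sin_port = htons(5000) };
> > > > > >     socklen_t alen = sizeof(addr);
> > > > > >     int s = socket(AF_INET, SOCK_DGRAM, 0);
> > > > > >
> > > > > >     if (argc > 2) {          /* client */
> > > > > >         inet_pton(AF_INET, argv[2], &addr.sin_addr);
> > > > > >         for (int i = 0; i < n; i++) {
> > > > > >             sendto(s, buf, sizeof(buf), 0,
> > > > > >                    (struct sockaddr *)&addr, alen);
> > > > > >             recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
> > > > > >         }
> > > > > >     } else {                 /* server */
> > > > > >         addr.sin_addr.s_addr = htonl(INADDR_ANY);
> > > > > >         bind(s, (struct sockaddr *)&addr, alen);
> > > > > >         for (;;) {
> > > > > >             recvfrom(s, buf, sizeof(buf), 0,
> > > > > >                      (struct sockaddr *)&addr, &alen);
> > > > > >             sendto(s, buf, sizeof(buf), 0,
> > > > > >                    (struct sockaddr *)&addr, alen);
> > > > > >         }
> > > > > >     }
> > > > > >     return 0;
> > > > > > }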
> > > > > > 
> > > > > > Konrad came up with a workaround: setting a flag for the
> > > > > > scheduling domain in Linux.
> > > > > > As the guest is not aware of SMT-related topology, it is
> > > > > > initialized with a flat topology.
> > > > > > The kernel has the scheduler flags for the CPU scheduling
> > > > > > domain set to 4143 in 2.6.39.
> > > > > > Konrad discovered that changing the flag for the CPU sched
> > > > > > domain to 4655 works as a workaround and makes Linux think
> > > > > > that the topology has SMT threads.
> > > > > > This workaround makes the test complete in almost the same
> > > > > > time as on bare metal (or insignificantly worse).
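> > > > > >
> > > > > > (For the curious: decoding those two numbers against the
> > > > > > SD_* bits as I read them in 2.6.39's include/linux/sched.h
> > > > > > - please double-check the values against your tree - shows
> > > > > > the workaround flips exactly one extra bit:)
> > > > > >
> > > > > > /* sd_flags_diff.c - decode the 4143 -> 4655 flag change.
> > > > > >  * SD_* values copied from (a reading of) 2.6.39; they
> > > > > >  * moved around in later kernels, so verify before relying
> > > > > >  * on them. */
> > > > > > #include <stdio.h>
> > > > > >
> > > > > > static const struct { unsigned int bit; const char *name; }
> > > > > > sd_flags[] = {
> > > > > >     { 0x0001, "SD_LOAD_BALANCE" },
> > > > > >     { 0x0002, "SD_BALANCE_NEWIDLE" },
> > > > > >     { 0x0004, "SD_BALANCE_EXEC" },
> > > > > >     { 0x0008, "SD_BALANCE_FORK" },
> > > > > >     { 0x0010, "SD_BALANCE_WAKE" },
> > > > > >     { 0x0020, "SD_WAKE_AFFINE" },
> > > > > >     { 0x0040, "SD_PREFER_LOCAL" },
> > > > > >     { 0x0080, "SD_SHARE_CPUPOWER" },
> > > > > >     { 0x0100, "SD_POWERSAVINGS_BALANCE" },
> > > > > >     { 0x0200, "SD_SHARE_PKG_RESOURCES" },
> > > > > >     { 0x0400, "SD_SERIALIZE" },
> > > > > >     { 0x0800, "SD_ASYM_PACKING" },
> > > > > >     { 0x1000, "SD_PREFER_SIBLING" },
> > > > > > };
> > > > > >
> > > > > > static void decode(unsigned int flags)
> > > > > > {
> > > > > >     printf("%#06x:", flags);
> > > > > >     for (unsigned int i = 0;
> > > > > >          i < sizeof(sd_flags) / sizeof(sd_flags[0]); i++)
> > > > > >         if (flags & sd_flags[i].bit)
> > > > > >             printf(" %s", sd_flags[i].name);
> > > > > >     printf("\n");
> > > > > > }
> > > > > >
> > > > > > int main(void)
> > > > > > {
> > > > > >     decode(4143);        /* stock CPU-domain flags      */
> > > > > >     decode(4655);        /* the workaround value        */
> > > > > >     decode(4143 ^ 4655); /* the delta: a single bit set */
> > > > > >     return 0;
> > > > > > }
> > > > > >
> > > > > > (If those values are right, the delta is 0x0200, i.e. the
> > > > > > bit that tells the scheduler the domain's CPUs share
> > > > > > package resources, which would explain why Linux then
> > > > > > schedules as if the vCPUs were siblings.)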
> > > > > > 
> > > > > > As we discovered, this workaround is not suitable for
> > > > > > newer kernel versions.
> > > > > > 
> > > > > > The hackish way of making the domU Linux think that it has
> > > > > > SMT threads (along with a matching cpuid) made us think
> > > > > > that the problem comes from the fact that the CPU topology
> > > > > > is not exposed to the guest, so the Linux scheduler cannot
> > > > > > make intelligent scheduling decisions.
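> > > > > >
> > > > > > (One quick way to see which SMT topology the guest
> > > > > > believes it has is to look at CPUID leaf 1 from inside
> > > > > > the domU; a sketch using GCC's <cpuid.h>, bit meanings as
> > > > > > per the Intel SDM:)
> > > > > >
> > > > > > /* cpuid_topo.c - peek at the SMT-related bits the guest
> > > > > >  * sees in CPUID.  Build with gcc. */
> > > > > > #include <cpuid.h>
> > > > > > #include <stdio.h>
> > > > > >
> > > > > > int main(void)
> > > > > > {
> > > > > >     unsigned int eax, ebx, ecx, edx;
> > > > > >
> > > > > >     if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
> > > > > >         return 1;
> > > > > >
> > > > > >     /* EDX bit 28: HTT flag;
> > > > > >      * EBX[23:16]: max addressable logical CPUs/package */
> > > > > >     printf("HTT flag: %u\n", (edx >> 28) & 1);
> > > > > >     printf("logical CPUs per package: %u\n",
> > > > > >            (ebx >> 16) & 0xff);
> > > > > >     return 0;
> > > > > > }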
> > > > > > 
> > > > > > Joao Martins from Oracle developed a set of patches that
> > > > > > fix the SMT/core/cache topology numbering and provide
> > > > > > matching pinning of vCPUs; with the corresponding options
> > > > > > enabled, this allows exposing the correct topology to the
> > > > > > guest.
> > > > > > I guess Joao will be posting them at some point.
> > > > > > 
> > > > > > With these patches we decided to test the performance
> > > > > > impact on different kernel and Xen versions.
> > > > > > 
> > > > > > The test described above was labeled as an IO-bound test.
> > > > > 
> > > > > So just to clarify: The client sends a request (presumably
> > > > > not much more
> > > > > than a ping) to the server, and waits for the server to
> > > > > respond before
> > > > > sending another one; and the server does the reverse --
> > > > > receives a
> > > > > request, responds, and then waits for the next request. Is
> > > > > that right?
> > > > 
> > > > Yes.
> > > > > 
> > > > > How much data is transferred?
> > > > 
> > > > 1 packet, UDP
> > > > > 
> > > > > If the amount of data transferred is tiny, then the
> > > > > bottleneck for the
> > > > > test is probably the IPI time, and I'd call this a "ping-
> > > > > pong"
> > > > > benchmark[1]. I would only call this "io-bound" if you're
> > > > > actually
> > > > > copying large amounts of data.
> > > > 
> > > > What we found is that on baremetal the scheduler would put both
> > > > apps
> > > > on the same CPU and schedule them right after each other. This
> > > > would
> > > > have a high IPI as the scheduler would poke itself.
> > > > On Xen it would put the two applications on separate CPUs - and
> > > > there
> > > > would be hardly any IPI.
> > > 
> > > Sorry -- why would the scheduler send itself an IPI if it's on
> > > the same
> > > logical cpu (which seems pretty pointless), but *not* send an IPI
> > > to the
> > > *other* processor when it was actually waking up another task?
> > > 
> > > Or do you mean high context switch rate?
> > 
> > Yes, very high.
> > > 
> > > > Digging deeper into the code I found out that if you do a UDP
> > > > sendmsg
> > > > without any timeout - it would put the packet in a queue and just
> > > > call schedule.
> > > 
> > > You mean, it would mark the other process as runnable somehow,
> > > but not
> > > actually send an IPI to wake it up? Is that a new "feature"
> > > designed
> > 
> > Correct - because the other process was not on its vCPU runqueue.
> > 
> > > for large systems, to reduce the IPI traffic or something?
> > 
> > This is just the normal Linux scheduler. The only way it would
> > send an IPI
> > to the other CPU was if the UDP message had a timeout. The default
> > timeout is infinite, so it didn't bother to send an IPI.
> > 
> > > 
> > > > On baremetal the schedule() would result in the scheduler picking
> > > > up the other
> > > > task and starting it - which would dequeue it immediately.
> > > > 
> > > > On Xen - the schedule() would go HLT.. and then later be woken
> > > > up by the
> > > > VIRQ_TIMER. And since the two applications were on separate
> > > > CPUs - the
> > > > single packet would just stick in the queue until the
> > > > VIRQ_TIMER arrived.
> > > 
> > > I'm not sure I understand the situation right, but it sounds a
> > > bit like
> > > what you're seeing is just a quirk of the fact that Linux doesn't
> > > always
> > > send IPIs to wake other processes up (either by design or by
> > > accident),
> > 
> > It does and it does not :-)
> > 
> > > but relies on scheduling timers to check for work to
> > > do. Presumably
> > 
> > It .. I am not explaining it well. The Linux kernel scheduler, when
> > 'schedule' is called (from the UDP sendmsg), would either pick the
> > next
> > application and do a context switch - or, if there were none, go to
> > sleep.
> > [Kind of - it may also send an IPI to the other CPU if requested,
> > but that requires
> > some hints from underlying layers.]
> > Since there were only two apps on the runqueue - the UDP sender and
> > the UDP receiver -
> > it would run them back to back (this is on baremetal).
> > 
> > However, if SMT was not exposed - the Linux kernel scheduler would
> > put them
> > on separate CPU runqueues. Meaning each CPU only had one app on its
> > runqueue.
> > 
> > Hence there was no need to do a context switch.
> > [unless you modified the UDP message to have a timeout, then it
> > would
> > send an IPI]
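> >
> > [To be concrete - assuming the knob in question is the socket send
> > timeout, giving the sender a finite one is just a setsockopt();
> > a sketch:]
> >
> > /* Give the UDP sender a finite send timeout (assuming this is
> >  * the "timeout" referred to above); the default is to block
> >  * forever. */
> > #include <sys/socket.h>
> > #include <sys/time.h>
> >
> > static int set_send_timeout(int sock, long seconds)
> > {
> >     struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
> >
> >     return setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO,
> >                       &tv, sizeof(tv));
> > }
> >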
> > > they knew that low performance on ping-pong workloads might be a
> > > possibility when they wrote the code that way; I don't see a
> > > reason why
> > > we should try to work around that in Xen.
> > 
> > Which is not what I am suggesting.
> > 
> > Our first idea was that since this is a Linux kernel scheduler
> > characteristic
> > - let us give the guest all the information it needs to do this.
> > That is,
> > make it look as baremetal as possible - and that is where the vCPU
> > pinning and the exposing of SMT information came about. That (Elena,
> > please correct me if I am wrong) did indeed show that the guest was
> > doing
> > what we expected.
> > 
> > But naturally that requires pinning and all that - and while it is
> > a useful
> > case for those that have the vCPUs to spare and can do it - that is
> > not
> > a general use-case.
> > 
> > So Elena started looking at the CPU-bound case, seeing how Xen
> > behaves there
> > and whether we can improve the floating situation, as she saw some
> > abnormal
> > behaviour.
> 
> Maybe it's normal? :)
> 
> Having satisfactory results with the ping-pong test and having
> Joao's
> SMT patches in hand, we decided to try a cpu-bound workload.
> And that is where exposing SMT does not work that well.
> I mostly refer here to the case where two vCPUs are being placed on
> the same
> core while there are idle cores.
> 
> This, I think, is what Dario is asking me for more details about in
> another reply, and I am going to
> answer his questions.
> 
Yes, exactly. We need to see more trace entries around the one where we
see the vcpus being placed on SMT siblings. Feel free to send, or
upload somewhere, the full trace, and I'll have a look myself as soon
as I can. :-)

> > I do not see any way to fix the single-message UDP mechanism except
> > by modifying the Linux kernel scheduler - and indeed it looks like
> > later
> > kernels modified their behavior. Also, doing the vCPU pinning and
> > SMT exposing
> > did not hurt in those cases (Elena?).
> 
> Yes, the drastic performance differences with bare metal were only
> observed with the 2.6.39-based kernel.
> For this ping-pong UDP test, exposing the SMT topology to kernels
> of
> higher versions did help, as tests show about a 20 percent performance
> improvement compared to the tests where the SMT topology is not exposed.
> This assumes that SMT exposure goes along with pinning.
> 
> 
> kernel.
>
hypervisor.

:-D :-D :-D

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

