Re: [Xen-devel] [Question] PARSEC benchmark has smaller execution time in VM than in native?
Tuesday, March 1, 2016, 9:39:25 PM, you wrote:

> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>> Hi Elena,
>>
>> Thank you very much for sharing this! :-)
>>
>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>> <elena.ufimtseva@xxxxxxxxxx> wrote:
>> >
>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
>> > > <konrad.wilk@xxxxxxxxxx> wrote:
>> > > >> > Hey!
>> > > >> >
>> > > >> > CC-ing Elena.
>> > > >>
>> > > >> I think you forgot to cc her..
>> > > >> Anyway, let's cc her now... :-)
>> > > >>
>> > > >> >
>> > > >> >> We are measuring the execution time between the native machine
>> > > >> >> environment and the xen virtualization environment using the
>> > > >> >> PARSEC benchmark [1].
>> > > >> >>
>> > > >> >> In the virtualization environment, we run a domU with three VCPUs,
>> > > >> >> each of them pinned to a core; we pin the dom0 to another core
>> > > >> >> that is not used by the domU.
>> > > >> >>
>> > > >> >> Inside the Linux in domU in the virtualization environment and in
>> > > >> >> the native environment, we used cpuset to isolate a core (or VCPU)
>> > > >> >> for the system processes and to isolate a core for the benchmark
>> > > >> >> processes. We also configured the Linux boot command line with the
>> > > >> >> isolcpus= option to isolate the core for the benchmark from other
>> > > >> >> unnecessary processes.
>> > > >> >
>> > > >> > You may want to just offline them and also boot the machine with
>> > > >> > NUMA disabled.
>> > > >>
>> > > >> Right, the machine is booted up with NUMA disabled.
>> > > >> We will offline the unnecessary cores then.
>> > > >>
>> > > >> >
>> > > >> >> We expect the execution time of the benchmarks in the xen
>> > > >> >> virtualization environment to be larger than the execution time
>> > > >> >> in the native machine environment. However, the evaluation gave
>> > > >> >> us the opposite result.
>> > > >> >>
>> > > >> >> Below is the evaluation data for the canneal and streamcluster
>> > > >> >> benchmarks:
>> > > >> >>
>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 6.387s
>> > > >> >> Virtualization: 5.890s
>> > > >> >>
>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 5.276s
>> > > >> >> Virtualization: 5.240s
>> > > >> >>
>> > > >> >> Is there anything wrong with our evaluation that leads to these
>> > > >> >> abnormal performance results?
>> > > >> >
>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>> > > >> >
>> > > >> > :-)
>> > > >> >
>> > > >> > No clue sadly.
>> > > >>
>> > > >> Ah-ha. This is really surprising to me.... Why would adding one more
>> > > >> layer speed the system up? Unless the virtualization disables some
>> > > >> services that run in native and interfere with the benchmark.
>> > > >>
>> > > >> If virtualization is faster than baremetal by nature, why do we see
>> > > >> experiments showing that virtualization introduces overhead?
>> > > >
>> > > > Elena told me that there was some weird regression in Linux 4.1 -
>> > > > where CPU-burning workloads were _slower_ on baremetal than as guests.
>> > >
>> > > Hi Elena,
>> > > Would you mind sharing with us some of your experience of how you
>> > > found the real reason? Did you use some tool or some methodology to
>> > > pin down the reason (i.e., why CPU-burning workloads are _slower_ on
>> > > baremetal than as guests)?
>> > >
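As a point of reference for this kind of comparison: a minimal CPU-bound
timing run can be approximated entirely from user space. The sketch below is
an illustration only (it assumes user-space pinning via sched_setaffinity and
a placeholder core number and iteration count; it is not the actual PARSEC
runs or the kernel-thread test discussed in this thread):

#!/usr/bin/env python3
# Sketch: pin this process to one isolated core, burn a fixed amount of
# pure CPU work, and report the wall-clock time. Core number and iteration
# count are placeholders; adjust to match the isolated core on your box.
import os
import time

BENCH_CPU = 2            # hypothetical isolated core (cf. isolcpus=/cpuset)
ITERATIONS = 50_000_000  # placeholder amount of work

def burn(n: int) -> int:
    """Pure CPU work: no I/O or syscalls inside the loop."""
    acc = 0
    for i in range(n):
        acc = (acc * 1103515245 + 12345 + i) & 0xFFFFFFFF
    return acc

if __name__ == "__main__":
    os.sched_setaffinity(0, {BENCH_CPU})   # bind this process to one CPU
    start = time.monotonic()
    result = burn(ITERATIONS)
    elapsed = time.monotonic() - start
    print(f"cpus={os.sched_getaffinity(0)} checksum={result:#x} "
          f"wall={elapsed:.3f}s")

Running the same script once natively and once inside the domU, pinned the
same way, gives a like-for-like wall-clock comparison with no benchmark
infrastructure in the way.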
>> >
>> > Hi Meng
>> >
>> > Yes, sure!
>> >
>> > While working on performance tests for the smt-exposing patches from
>> > Joao, I ran a CPU-bound workload in an HVM guest and ran the same test
>> > on baremetal using the same kernel.
>> > While testing the cpu-bound workload on baremetal Linux (4.1.0-rc2),
>> > I found that the time to complete the same test is a few times longer
>> > than it takes under the HVM guest.
>> > I have tried tests with the kernel threads pinned to cores and without
>> > pinning.
>> > The execution times are most of the time about twice as long, sometimes
>> > 4 times longer than the HVM case.
>> >
>> > What is interesting is not only that it sometimes takes 3-4 times longer
>> > than the HVM guest, but also that the test with threads bound to cores
>> > takes almost 3 times longer to execute than the same cpu-bound test
>> > under HVM (in all configurations).
>>
>> wow~ I didn't expect the native performance could be so "bad".... ;-)
> Yes, quite a surprise :)
>>
>> >
>> > I ran each test 5 times and here are the execution times (seconds):
>> >
>> > -----------------------------------------------------
>> >          baremetal            |
>> > thread_bind  | thread unbind  | HVM pinned to cores
>> > -------------|----------------|----------------------
>> >      74      |       83       |         28
>> >      74      |       88       |         28
>> >      74      |       38       |         28
>> >      74      |       73       |         28
>> >      74      |       87       |         28
>> >
>> > Sometimes the unbound tests had better times, but not often enough to
>> > present them here. Some results are much worse and reach up to 120
>> > seconds.
>> >
>> > Each test has 8 kernel threads. In the baremetal case I tried the
>> > following:
>> > - numa off, on;
>> > - all cpus are on;
>> > - isolate cpus from the first node;
>> > - set intel_idle.max_cstate=1;
>> > - disable intel_pstate;
>> >
>> > I don't think I have exhausted all the options here, but it looked like
>> > the last two changes did improve performance, though it was still not
>> > comparable to the HVM case.
>> > I am trying to find where the regression happened. Performance on a
>> > newer kernel (I tried 4.5.0-rc4+) was close to or better than HVM.

Just a perhaps silly thought .. but could there be something in the time
measuring that could differ and explain the slightly surprising results?

--
Sander
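One rough way to sanity-check that, purely as a sketch (this is not the
harness used in any of the tests above): time one fixed busy loop against
several clock sources on both setups and see whether they agree.

#!/usr/bin/env python3
# Sketch: run one fixed busy loop and print wall-clock, raw monotonic and
# consumed CPU time side by side. If the three numbers track each other on
# both native and HVM, the timekeeping itself is unlikely to explain the gap.
import time

N = 20_000_000  # arbitrary, fixed amount of work

def busy(n: int) -> int:
    acc = 0
    for i in range(n):
        acc ^= (i * 2654435761) & 0xFFFFFFFF
    return acc

start_mono = time.monotonic()
start_raw = time.clock_gettime(time.CLOCK_MONOTONIC_RAW)  # Linux-only clock
start_cpu = time.process_time()

busy(N)

print(f"monotonic:     {time.monotonic() - start_mono:.3f}s")
print(f"monotonic_raw: {time.clock_gettime(time.CLOCK_MONOTONIC_RAW) - start_raw:.3f}s")
print(f"process_time:  {time.process_time() - start_cpu:.3f}s")

If wall-clock time is much larger than CPU time on baremetal but not in the
guest, the task is waiting or being moved around rather than being measured
wrongly; if the clocks themselves disagree, the measurement is suspect.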
>> > I am trying to find if there were some relevant regressions to
>> > understand the reason for this.
>>
>> I see. If this is only happening for SMT, it may be caused by the
>> SMT-related load balancing in the Linux scheduler.
>> However, I have disabled HT on my machine. Probably that's also the
>> reason why I didn't see so much difference in performance.
> I did enable tracing to see if maybe there is extensive migration:
> The test machine has two nodes, 8 cores each, 2 threads per core, 32
> logical cpus in total.
> The kernel threads are not bound, and here is the output for the life of
> one of the threads:
> cat ./t-komp_trace |grep t-kompressor|grep 18883
> t-kompressor-18883 [028] d... 69458.596403: sched_switch:
> prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==>
> next_comm=swapper/28 next_pid=0 next_prio=120
> insmod-18875 [027] dN.. 69458.669180: sched_migrate_task:
> comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
> <idle>-0 [009] d... 69458.669205: sched_switch:
> prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==>
> next_comm=t-kompressor next_pid=18883 next_prio=120
> t-kompressor-18883 [009] d... 69486.997626: sched_switch:
> prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==>
> next_comm=migration/9 next_pid=52 next_prio=0
> migration/9-52 [009] d... 69486.997632: sched_migrate_task:
> comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
> <idle>-0 [025] d... 69486.997641: sched_switch:
> prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==>
> next_comm=t-kompressor next_pid=18883 next_prio=120
> t-kompressor-18883 [025] d... 69486.997710: sched_switch:
> prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==>
> next_comm=swapper/25 next_pid=0 next_prio=120
> insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop:
> comm=t-kompressor pid=18883
> Threads are being spawned from two cores, then some of the threads migrate
> to other cores.
> In the example above the thread is spawned on cpu 27 and, when woken up,
> runs on cpu 009.
> Later it migrated to 025, which is the second thread of the same core (009).
> While I am not sure why this migration happens, it does not seem to
> contribute a lot.
> Anyway, this picture repeats for some other threads (some stay where they
> were woken up):
> t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald
> pid=3820 prio=120 orig_cpu=14 dest_cpu=11
> migration/13-72 [013] d... 69486.707459: sched_migrate_task:
> comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
> migration/14-77 [014] d... 69486.783818: sched_migrate_task:
> comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
> migration/8-47 [008] d... 69486.792667: sched_migrate_task:
> comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
> migration/15-82 [015] d... 69486.796429: sched_migrate_task:
> comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
> migration/10-57 [010] d... 69486.857848: sched_migrate_task:
> comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
> migration/9-52 [009] d... 69486.997632: sched_migrate_task:
> comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
> migration/28-147 [028] d... 69503.073577: sched_migrate_task:
> comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10
> All threads are running on their own cores and some migrate to the second
> smt-thread over time.
> I probably should have traced some other scheduling events, but I have not
> found any relevant ones yet.
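As an aside, migrations like these can also be tallied across the whole
trace instead of grepping one pid at a time. A rough sketch (the
t-komp_trace file name and the event format are taken from the excerpts
above, where each event is one line in the raw trace file; adjust the regex
for your own ftrace output):

#!/usr/bin/env python3
# Sketch: count sched_migrate_task events per task and list the cpu hops,
# to see at a glance which threads moved and where they went.
import re
import sys
from collections import Counter, defaultdict

# Matches sched_migrate_task events like the ones quoted above.
MIGRATE = re.compile(
    r"sched_migrate_task:\s+comm=(?P<comm>\S+)\s+pid=(?P<pid>\d+)"
    r"\s+prio=\d+\s+orig_cpu=(?P<orig>\d+)\s+dest_cpu=(?P<dest>\d+)"
)

counts = Counter()            # number of migrations per task
hops = defaultdict(list)      # ordered (orig_cpu, dest_cpu) pairs per task

path = sys.argv[1] if len(sys.argv) > 1 else "t-komp_trace"  # assumed name
with open(path) as trace:
    for line in trace:
        m = MIGRATE.search(line)
        if m:
            task = (m.group("comm"), int(m.group("pid")))
            counts[task] += 1
            hops[task].append((int(m.group("orig")), int(m.group("dest"))))

# Most-migrated tasks first, with their cpu-to-cpu paths.
for task, n in counts.most_common():
    comm, pid = task
    route = " ".join(f"{o}->{d}" for o, d in hops[task])
    print(f"{comm}-{pid}: {n} migration(s): {route}")

Cross-checking the destination cpus against
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list then shows
quickly whether the moves stay on the same core's SMT sibling or cross
cores/nodes.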
>>
>> >
>> > What kernel do you guys use?
>>
>> I'm using a quite old kernel, 3.10.31. The reason why I'm using this
>> kernel is that I want to use LITMUS^RT [1], which is a Linux testbed for
>> real-time scheduling research. (It has a new version though, and I can
>> upgrade to the latest version to see if the "problem" still occurs.)
> Yes, it will be interesting to see the outcome.
> What difference in numbers do you see?
> What machines are you seeing it on?
> Is your workload purely cpu-bound?
> Thanks!
>>
>> Thanks and Best Regards,
>>
>> Meng

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel