
Re: [Xen-devel] [Question] PARSEC benchmark has smaller execution time in VM than in native?



Tuesday, March 1, 2016, 9:39:25 PM, you wrote:

> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>> Hi Elena,
>> 
>> Thank you very much for sharing this! :-)
>> 
>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>> <elena.ufimtseva@xxxxxxxxxx> wrote:
>> >
>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
>> > > <konrad.wilk@xxxxxxxxxx> wrote:
>> > > >> > Hey!
>> > > >> >
>> > > >> > CC-ing Elena.
>> > > >>
>> > > >> I think you forgot to cc her..
>> > > >> Anyway, let's cc her now... :-)
>> > > >>
>> > > >> >
>> > > >> >> We are measuring the execution time between the native machine
>> > > >> >> environment and the Xen virtualization environment using the
>> > > >> >> PARSEC benchmark [1].
>> > > >> >>
>> > > >> >> In the virtualization environment, we run a domU with three VCPUs,
>> > > >> >> each of them pinned to a core; we pin dom0 to another core that is
>> > > >> >> not used by the domU.
>> > > >> >>
>> > > >> >> Inside the Linux in domU in the virtualization environment, and in
>> > > >> >> the native environment, we used cpusets to isolate a core (or VCPU)
>> > > >> >> for the system processes and to isolate a core for the benchmark
>> > > >> >> processes. We also configured the Linux boot command line with the
>> > > >> >> isolcpus= option to isolate the core used for the benchmark from
>> > > >> >> other unnecessary processes.
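For reference, a minimal sketch of this kind of isolation setup; the domain name, core numbers and benchmark path below are only placeholders, not the actual configuration:

    # Xen side: pin the three domU VCPUs and dom0 to disjoint physical cores
    xl vcpu-pin domU 0 1
    xl vcpu-pin domU 1 2
    xl vcpu-pin domU 2 3
    xl vcpu-pin Domain-0 all 0

    # Linux side (native or inside the guest): keep one core away from the
    # general scheduler at boot (kernel command line: isolcpus=2), then run
    # the benchmark on that core, e.g. with taskset or a cpuset
    taskset -c 2 ./run_benchmark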
>> > > >> >
>> > > >> > You may want to just offline them and also boot the machine with
>> > > >> > NUMA disabled.
>> > > >>
>> > > >> Right, the machine is booted up with NUMA disabled.
>> > > >> We will offline the unnecessary cores then.
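A minimal sketch of what that would look like (the CPU numbers are only an example):

    # take the unneeded logical CPUs offline and verify
    for c in 4 5 6 7; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
    cat /sys/devices/system/cpu/online

    # NUMA can be disabled on the Linux kernel command line with: numa=off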
>> > > >>
>> > > >> >
>> > > >> >>
>> > > >> >> We expect the execution time of the benchmarks in the Xen
>> > > >> >> virtualization environment to be larger than the execution time
>> > > >> >> in the native machine environment. However, the evaluation gave
>> > > >> >> us the opposite result.
>> > > >> >>
>> > > >> >> Below is the evaluation data for the canneal and streamcluster
>> > > >> >> benchmarks:
>> > > >> >>
>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 6.387s
>> > > >> >> Virtualization: 5.890s
>> > > >> >>
>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 5.276s
>> > > >> >> Virtualization: 5.240s
>> > > >> >>
>> > > >> >> Is there anything wrong with our evaluation that leads to these
>> > > >> >> abnormal performance results?
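For reference, these runs presumably correspond to something like the following parsecmgmt invocations; the exact invocation and the way the times were taken are assumptions on my part:

    # from the top of the PARSEC 3.0 tree, after sourcing its env.sh
    parsecmgmt -a run -p canneal       -i simlarge -c gcc-serial
    parsecmgmt -a run -p streamcluster -i simlarge -c gcc-serial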
>> > > >> >
>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>> > > >> >
>> > > >> > :-)
>> > > >> >
>> > > >> > No clue sadly.
>> > > >>
>> > > >> Ah-ha. This is really surprising to me.... Why would adding one more
>> > > >> layer speed up the system? Unless the virtualization environment
>> > > >> disables some services that run on native and interfere with the
>> > > >> benchmark.
>> > > >>
>> > > >> If virtualization is faster than baremetal by nature, why do some
>> > > >> experiments show that virtualization introduces overhead?
>> > > >
>> > > > Elena told me that there was a weird regression in Linux 4.1, where
>> > > > CPU-burning workloads were _slower_ on baremetal than as guests.
>> > >
>> > > Hi Elena,
>> > > Would you mind sharing with us some of your experience of how you
>> > > found the real reason? Did you use some tool or methodology to pin
>> > > down why the CPU-burning workloads are _slower_ on baremetal than as
>> > > guests?
>> > >
>> >
>> > Hi Meng
>> >
>> > Yes, sure!
>> >
>> > While working on performance tests for the smt-exposing patches from
>> > Joao, I ran a CPU-bound workload in an HVM guest and, using the same
>> > kernel on baremetal, ran the same test.
>> > While testing the cpu-bound workload on baremetal Linux (4.1.0-rc2),
>> > I found that the time to complete the test is a few times longer than
>> > it takes under the HVM guest.
>> > I have tried tests with the kernel threads pinned to cores and without
>> > pinning.
>> > Most of the time the execution times are about twice as long, sometimes
>> > 4 times as long as in the HVM case.
>> >
>> > What is interesting is not only that it sometimes takes 3-4 times longer
>> > than in the HVM guest, but also that the test with threads bound to
>> > cores takes almost 3 times longer to execute than the same cpu-bound
>> > test under HVM (in all configurations).
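Just to make the workload concrete: Elena's test spawns kernel threads from a module, but a rough user-space analogue of one pinned CPU-bound run would be something like the following; the loop count is arbitrary:

    # burn CPU on core 2 only, and time it
    time taskset -c 2 sh -c 'i=0; while [ "$i" -lt 50000000 ]; do i=$((i+1)); done'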
>> 
>> 
>> wow~ I didn't expect the native performance could be so "bad".... ;-)

> Yes, quite a surprise :)
>> 
>> >
>> >
>> > I ran each test 5 times and here are the execution times (seconds):
>> >
>> > -------------------------------------------------
>> >         baremetal           |
>> > thread_bind | thread_unbind | HVM pinned to cores
>> > ----------- |---------------|---------------------
>> >      74     |     83        |        28
>> >      74     |     88        |        28
>> >      74     |     38        |        28
>> >      74     |     73        |        28
>> >      74     |     87        |        28
>> >
>> > Sometimes the unbound tests had better times, but not often enough to
>> > present them here. Some results are much worse and reach up to 120
>> > seconds.
>> >
>> > Each test has 8 kernel threads. In the baremetal case I tried the following:
>> > - numa off, on;
>> > - all cpus on;
>> > - isolate the cpus of the first node;
>> > - set intel_idle.max_cstate=1;
>> > - disable intel_pstate;
>> >
>> > I don't think I have exhausted all the options here, but it looked like
>> > the last two changes did improve performance, though it was still not
>> > comparable to the HVM case.
>> > I am trying to find where the regression happened. Performance on a newer
>> > kernel (I tried 4.5.0-rc4+) was close to or better than HVM.
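For reference, a sketch of the boot-time settings listed above and of how to check which idle/frequency drivers are actually in use (standard sysfs paths; the isolcpus value is elided):

    # kernel command line additions tried:
    #   numa=off  intel_idle.max_cstate=1  intel_pstate=disable  isolcpus=...
    # check which drivers are actually active after boot:
    cat /sys/devices/system/cpu/cpuidle/current_driver
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver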

Just a perhaps silly thought, but could there be something in the way time
is measured that differs and explains the slightly surprising results?
--
Sander 
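One quick thing that might be worth checking along those lines is whether the guest and the baremetal runs end up on different clocksources:

    # baremetal typically reports "tsc"; an HVM guest may report "xen" or "tsc"
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource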

>> > I am trying to find out if there were some relevant regressions, to
>> > understand the reason for this.
>> 
>> 
>> I see. If this is only happening with SMT, it may be caused by the
>> SMT-related load balancing in the Linux scheduler.
>> However, I have disabled HT on my machine. Probably that's also the
>> reason why I didn't see so much difference in performance.

> I enabled tracing to see if maybe there is extensive migration.
> The test machine has two nodes, 8 cores each, 2 threads per core, 32
> logical cpus in total.
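The sched events shown below can be captured with something like the following; the paths assume debugfs is mounted at /sys/kernel/debug:

    cd /sys/kernel/debug/tracing
    echo 1 > events/sched/sched_switch/enable
    echo 1 > events/sched/sched_migrate_task/enable
    echo 1 > events/sched/sched_kthread_stop/enable
    echo 1 > tracing_on
    # ... run the test, then save the buffer ...
    cat trace > ./t-komp_trace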

> The kernel threads are not bound, and here is the output for the life of
> one of the threads:

> cat ./t-komp_trace |grep t-kompressor|grep 18883

>     t-kompressor-18883 [028] d... 69458.596403: sched_switch: 
> prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> 
> next_comm=swapper/28 next_pid=0 next_prio=120
>           insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: 
> comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
>           <idle>-0     [009] d... 69458.669205: sched_switch: 
> prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> 
> next_comm=t-kompressor next_pid=18883 next_prio=120
>     t-kompressor-18883 [009] d... 69486.997626: sched_switch: 
> prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> 
> next_comm=migration/9 next_pid=52 next_prio=0
>      migration/9-52    [009] d... 69486.997632: sched_migrate_task: 
> comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
>           <idle>-0     [025] d... 69486.997641: sched_switch: 
> prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> 
> next_comm=t-kompressor next_pid=18883 next_prio=120
>     t-kompressor-18883 [025] d... 69486.997710: sched_switch: 
> prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> 
> next_comm=swapper/25 next_pid=0 next_prio=120
>           insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: 
> comm=t-kompressor pid=18883


> Threads are being spawned from two cores, and then some of the threads
> migrate to other cores.
> In the example above the thread is spawned on cpu 27 and, when woken up,
> runs on cpu 009.
> Later it migrated to 025, which is the second SMT thread of the same core
> as 009.
> While I am not sure why this migration happens, it does not seem to
> contribute a lot.
> Anyway, this picture repeats for some other threads (some stay where they
> were woken up):

>     t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald 
> pid=3820 prio=120 orig_cpu=14 dest_cpu=11
>     migration/13-72    [013] d... 69486.707459: sched_migrate_task: 
> comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
>     migration/14-77    [014] d... 69486.783818: sched_migrate_task: 
> comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
>      migration/8-47    [008] d... 69486.792667: sched_migrate_task: 
> comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
>     migration/15-82    [015] d... 69486.796429: sched_migrate_task: 
> comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
>     migration/10-57    [010] d... 69486.857848: sched_migrate_task: 
> comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
>      migration/9-52    [009] d... 69486.997632: sched_migrate_task: 
> comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
>     migration/28-147   [028] d... 69503.073577: sched_migrate_task: 
> comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10

> All threads are running on their own cores, and some migrate to the
> second SMT thread over time.
> I probably should have traced some other scheduling events, but I have
> not found any relevant ones yet.
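Whether a migration target really is the SMT sibling of the original cpu (e.g. 25 for 9, as described above) can be confirmed from the topology files:

    cat /sys/devices/system/cpu/cpu9/topology/thread_siblings_list
    # on a 2-node, 8-core, 2-thread box as described above this would be
    # expected to print something like: 9,25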

>> 
>> >
>> >
>> >
>> > What kernel do you guys use?
>> 
>> 
>> I'm using a quite old kernel, 3.10.31. The reason I'm using this kernel
>> is that I want to use LITMUS^RT [1], which is a Linux testbed for
>> real-time scheduling research. (It has a newer version though, and I can
>> upgrade to the latest version to see if the "problem" still occurs.)

> Yes, it will be interesting to see the outcome.

> What difference in numbers do you see?
> What machines are you seeing it on?
> Is your workload purely cpu-bound?


> Thanks!

>> 
>> Thanks and Best Regards,
>> 
>> Meng




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

