
Re: [Xen-devel] [Question] PARSEC benchmark has smaller execution time in VM than in native?



On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
> Hi Elena,
> 
> Thank you very much for sharing this! :-)
> 
> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
> <elena.ufimtseva@xxxxxxxxxx> wrote:
> >
> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
> > > <konrad.wilk@xxxxxxxxxx> wrote:
> > > >> > Hey!
> > > >> >
> > > >> > CC-ing Elena.
> > > >>
> > > >> I think you forgot to cc her..
> > > >> Anyway, let's cc her now... :-)
> > > >>
> > > >> >
> > > >> >> We are comparing the execution time between the native machine
> > > >> >> environment and the Xen virtualization environment using the PARSEC
> > > >> >> benchmark [1].
> > > >> >>
> > > >> >> In the virtualization environment, we run a domU with three VCPUs,
> > > >> >> each of them pinned to a core; we pin dom0 to another core that is
> > > >> >> not used by the domU.
> > > >> >>
> > > >> >> Inside the Linux running in the domU in the virtualization
> > > >> >> environment, as well as in the native environment, we used cpuset to
> > > >> >> isolate a core (or VCPU) for the system processes and to isolate a
> > > >> >> core for the benchmark processes. We also configured the Linux boot
> > > >> >> command line with the isolcpus= option to isolate the benchmark core
> > > >> >> from other unnecessary processes.
> > > >> >
> > > >> > You may want to just offline them and also boot the machine with NUMA
> > > >> > disabled.
> > > >>
> > > >> Right, the machine is booted up with NUMA disabled.
> > > >> We will offline the unnecessary cores then.
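
(For reference, a rough sketch of this kind of setup; the domain name, the
vcpu/cpu numbers and the offlined cpu below are just an example, not the exact
configuration used in the experiments:)

    # pin the three domU vcpus and dom0 to dedicated physical cores
    xl vcpu-pin domU 0 1
    xl vcpu-pin domU 1 2
    xl vcpu-pin domU 2 3
    xl vcpu-pin Domain-0 all 0

    # in the guest / on native, keep one core for the benchmark only,
    # e.g. via the kernel boot line:  isolcpus=3 numa=off
    # and offline any core that is not needed at all:
    echo 0 > /sys/devices/system/cpu/cpu4/online
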
> > > >>
> > > >> >
> > > >> >>
> > > >> >> We expect that the execution time of the benchmarks in the Xen
> > > >> >> virtualization environment is larger than the execution time in the
> > > >> >> native machine environment. However, the evaluation gave us the
> > > >> >> opposite result.
> > > >> >>
> > > >> >> Below is the evaluation data for the canneal and streamcluster 
> > > >> >> benchmarks:
> > > >> >>
> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
> > > >> >> Native: 6.387s
> > > >> >> Virtualization: 5.890s
> > > >> >>
> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> > > >> >> Native: 5.276s
> > > >> >> Virtualization: 5.240s
> > > >> >>
> > > >> >> Is there anything wrong with our evaluation that leads to the
> > > >> >> abnormal performance results?
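
(Side note: I assume the runs above were launched with PARSEC's parsecmgmt
script along these lines; the exact invocation is my guess from the input/conf
names given above:)

    parsecmgmt -a run -p canneal       -c gcc-serial -i simlarge
    parsecmgmt -a run -p streamcluster -c gcc-serial -i simlarge
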
> > > >> >
> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
> > > >> >
> > > >> > :-)
> > > >> >
> > > >> > No clue sadly.
> > > >>
> > > >> Ah-ha. This is really surprising to me.... Why would adding one more
> > > >> layer speed up the system? Unless virtualization disables some services
> > > >> that run in the native case and interfere with the benchmark.
> > > >>
> > > >> If virtualization is faster than baremetal by nature, why do some
> > > >> experiments show that virtualization introduces overhead?
> > > >
> > > > Elena told me that there was a weird regression in Linux 4.1 - where
> > > > CPU-burning workloads were _slower_ on baremetal than as guests.
> > >
> > > Hi Elena,
> > > Would you mind sharing with us some of your experience of how you
> > > found the real reason? Did you use some tool or methodology to pin
> > > down the reason (i.e., why CPU-burning workloads are _slower_ on
> > > baremetal than as guests)?
> > >
> >
> > Hi Meng
> >
> > Yes, sure!
> >
> > While working on performance tests for the SMT-exposing patches from Joao,
> > I ran a CPU-bound workload in an HVM guest and then ran the same test on
> > baremetal with the same kernel.
> > While testing the cpu-bound workload on baremetal Linux (4.1.0-rc2),
> > I found that it takes a few times longer to complete the same test than
> > it does in the HVM guest.
> > I tried tests with the kernel threads pinned to cores and without pinning.
> > The execution times are most of the time about twice as long, sometimes 4
> > times longer, than in the HVM case.
> >
> > What is interesting is not only that it sometimes takes 3-4 times longer
> > than in the HVM guest, but also that the test with threads bound to cores
> > takes almost 3 times longer to execute than the same cpu-bound test under
> > HVM (in all configurations).
> 
> 
> wow~ I didn't expect the native performance could be so "bad".... ;-)

Yes, quite a surprise :)
> 
> >
> >
> > I run each test 5 times and here are the execution times (seconds):
> >
> > -----------------------------------------------------
> >          baremetal          |
> > thread_bind | thread_unbind | HVM pinned to cores
> > ------------|---------------|---------------------
> >      74     |     83        |        28
> >      74     |     88        |        28
> >      74     |     38        |        28
> >      74     |     73        |        28
> >      74     |     87        |        28
> >
> > Sometimes the unbound tests had better times, but not often enough
> > to present here. Some results are much worse and reach up to 120
> > seconds.
> >
> > Each test has 8 kernel threads. In the baremetal case I tried the following:
> > - numa off,on;
> > - all cpus are on;
> > - isolate cpus from the first node;
> > - set intel_idle.max_cstate=1;
> > - disable intel_pstate;
> >
> > I don't think I have exhausted all the options here, but it looked like
> > the last two changes did improve performance, though still not comparably
> > to the HVM case.
> > I am trying to find where the regression happened. Performance on a newer
> > kernel (I tried 4.5.0-rc4+) was close to or better than HVM.
> >
> > I am trying to find out if there were some relevant regressions, to
> > understand the reason for this.
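
(For completeness, the boot-line changes mentioned above were roughly of the
following form; the isolcpus range is only an example for one node of this box:)

    numa=off isolcpus=0-7 intel_idle.max_cstate=1 intel_pstate=disable
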
> 
> 
> I see. If this only happens with SMT, it may be caused by the
> SMT-related load balancing in the Linux scheduler.
> However, I have disabled HT on my machine. That is probably also
> the reason why I didn't see so much difference in performance.

I did enable tracing to see if maybe there is extensive migration.
The test machine has two nodes, 8 cores each, 2 threads per core, 32 logical
cpus in total.

The kernel threads are not bound, and here is the output for the lifetime of
one of the threads:

cat ./t-komp_trace |grep t-kompressor|grep 18883

    t-kompressor-18883 [028] d... 69458.596403: sched_switch: prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> next_comm=swapper/28 next_pid=0 next_prio=120
          insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
          <idle>-0     [009] d... 69458.669205: sched_switch: prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
    t-kompressor-18883 [009] d... 69486.997626: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> next_comm=migration/9 next_pid=52 next_prio=0
     migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
          <idle>-0     [025] d... 69486.997641: sched_switch: prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
    t-kompressor-18883 [025] d... 69486.997710: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> next_comm=swapper/25 next_pid=0 next_prio=120
          insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: comm=t-kompressor pid=18883


Threads are being spawned from two cores, and then some of them migrate to
other cores.
In the example above the thread is spawned on cpu 27 and, when woken up, runs
on cpu 009.
Later it migrates to 025, which is the second SMT thread of the same core as 009.
While I am not sure why this migration happens, it does not seem to contribute
a lot.
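
(As a quick sanity check that cpus 9 and 25 really are HT siblings of the same
core, the topology can be read from sysfs; on this box the output should be
something like "9,25":)

    cat /sys/devices/system/cpu/cpu9/topology/thread_siblings_list
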
Anyway, this picture repeats for some other threads (some stay where they were
woken up):

    t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
    migration/13-72    [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
    migration/14-77    [014] d... 69486.783818: sched_migrate_task: comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
     migration/8-47    [008] d... 69486.792667: sched_migrate_task: comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
    migration/15-82    [015] d... 69486.796429: sched_migrate_task: comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
    migration/10-57    [010] d... 69486.857848: sched_migrate_task: comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
     migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
    migration/28-147   [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10

All threads are running on their own cores, and some migrate to the second
SMT thread over time.
I probably should have traced some other scheduling events, but I have not
found any relevant ones yet.
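
(For the record, a minimal sketch of how I would enable a few more sched
events through ftrace; the output file name just mirrors the one grepped
above, and the test would run while trace_pipe is being captured:)

    cd /sys/kernel/debug/tracing
    echo 1 > events/sched/sched_switch/enable
    echo 1 > events/sched/sched_migrate_task/enable
    echo 1 > events/sched/sched_wakeup/enable
    echo 1 > events/sched/sched_wakeup_new/enable
    cat trace_pipe > ./t-komp_trace    # stop with Ctrl-C after the test run
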

> 
> >
> >
> >
> > What kernel do you guys use?
> 
> 
> I'm using quite an old kernel, 3.10.31. The reason I'm using this kernel
> is that I want to use LITMUS^RT [1], which is a Linux testbed for
> real-time scheduling research. (It has a newer version though, and I can
> upgrade to the latest version to see if the "problem" still occurs.)

Yes, it will be interesting to see the outcome.

What difference in numbers do you see?
What are the machines you are seeing it on?
Is your workload purely cpu-bound?


Thanks!

> 
> Thanks and Best Regards,
> 
> Meng

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

