Re: [Xen-devel] Question about high CPU load during iperf ethernet testing
On Thu, Sep 25, 2014 at 6:10 PM, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote:
> On Thu, 2014-09-25 at 17:12 +0300, Iurii Konovalenko wrote:
>> On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini
>> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
>> > On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
>> >> Hi, Stefano!
>> >> Thank you for your reply!
>> >>
>> >> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
>> >> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
>> >> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
>> >> >> Hello, all!
>> >> >>
>> >> >> I am running iperf ethernet tests on a DRA7XX_EVM board (OMAP5).
>> >> >> The Xen version is 4.4.
>> >> >> I run only Linux (kernel 3.8) as Dom0, with no other active domains
>> >> >> (to keep the test results clean I decided not to start a DomU).
>> >> >> The iperf server runs on the host, and the iperf client runs on the
>> >> >> board with the command line "iperf -c 192.168.2.10 -w 256k -m -f M -d -t 60".
>> >> >
>> >> > Just to double check: you are running the iperf test in Dom0, correct?
>> >>
>> >> Yes, iperf is running in Dom0.
>> >>
>> >> >> During the test I studied CPU load with the top tool in Dom0 and saw
>> >> >> that one VCPU is fully loaded, spending about 50% of its time in
>> >> >> software IRQs and 50% in system.
>> >> >> Running the same test on plain Linux without Xen, I saw that the CPU
>> >> >> load is about 2-4%.
>> >> >>
>> >> >> I decided to debug a bit, so I used "({register uint64_t _r; asm
>> >> >> volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r" (_r)); _r; })"
>> >> >> to read the timer counter before and after the operations I want to measure.
>> >> >>
>> >> >> In this way I found that most of the CPU time is spent in the
>> >> >> functions enable_irq/disable_irq_nosync and
>> >> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs %0, cpsr
>> >> >> @ arch_local_irq_save"/"msr cpsr_c, %0 @ local_irq_restore").
>> >> >> When running without Xen it does not take so much time.
>> >> >
>> >> > There is nothing Xen specific in the Linux ARM implementation of
>> >> > spin_lock_irqsave/spin_unlock_irqrestore and
>> >> > enable_irq/disable_irq_nosync.
>> >> >
>> >>
>> >> That is strange, because my measurements show that a lot of time is
>> >> spent there when running in Dom0, for example about 20% in
>> >> spin_unlock_irqrestore (mostly in the mrs instruction).
>> >
>> > Unless you are doing something wrong in your measurements, if you really
>> > narrowed it down to one instruction then I would try to do the same on a
>> > different SoC from another vendor to see if it is actually a hardware issue.
>> >
>> >
>> >> >> So, could anyone clarify a couple of questions for me:
>> >> >> 1. Is this normal behaviour?
>> >> >
>> >> > No, it is not normal.
>> >> > Assuming that you assign all the memory to Dom0 and as many vcpus as
>> >> > physical cpus on your platform, you should get the same numbers as
>> >> > native.
>> >>
>> >> OK, so I might be doing something wrong.
>> >>
>> >> >> 2. Does the hypervisor trap the cpsr register? I suppose that the
>> >> >> hypervisor traps access to the cpsr register, which leads to
>> >> >> additional overhead, but I can't find the place in the sources
>> >> >> where this happens.
>> >> >
>> >> > We don't trap cpsr.
>> >>
>> >> That is strange, because that was my only assumption about where the
>> >> time could be spent.
>> >> So could you please advise where to look to understand the reason for
>> >> such a high VCPU load?
>> >
>> > I don't know. When you say that arch_local_irq_save is the one taking
>> > all the time, do you actually have something like:
>> >
>> > time1 = read CNTVCT;
>> > arch_local_irq_save();
>> > time2 = read CNTVCT;
>> > printk(time2-time1);
>> >
>> > in your code?
>>
>> Almost like this, with the only difference being that I accumulate the
>> time difference in a variable, and when its value is greater than 1s I
>> print, like this:
>>
>> volatile u64 total;
>> #define SECONDS(_s) ((s64)((_s) * 1000000000ULL))
>> static inline s64 ticks_to_ns(uint64_t ticks)
>> {
>>     return muldiv64(ticks, SECONDS(1), 1000 * 6144);
>> }
>>
>> time1 = read CNTVCT;
>> arch_local_irq_save();
>> time2 = read CNTVCT;
>> total += time2 - time1;
>> if (ticks_to_ns(total) > 1000000000)
>> {
>>     printk("1 second spent here");
>>     total = 0;
>> }
>
> Doesn't that just tell you that this code is called frequently and hence
> clocks up the total faster? As opposed to telling you that this
> operation is slow.
>
> I'd expect IRQs to go on and off quite a lot under normal operation;
> certain spinlock variants will disable IRQs, for example, and might be
> frequently used.
>
> Ian.

Of course it could be because of frequent calls, but why such a difference
with and without Xen? I thought the number of spinlock calls is more or
less equal with Xen and without.
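One way to tell those two cases apart is to record the number of calls
alongside the accumulated ticks, so the average cost of a single
arch_local_irq_save() can be compared with and without Xen. Below is a
rough sketch of that idea, not the code actually used in these tests: it
reuses the generic-timer counter read quoted above and the 6.144 MHz
counter frequency implied by ticks_to_ns(); the read_counter(),
timed_irq_save(), lock_ticks and lock_calls names are only illustrative.

#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/irqflags.h>

/* Same ARM generic-timer counter read as the mrrc snippet above. */
static inline u64 read_counter(void)
{
        u64 c;

        asm volatile("mrrc p15, 0, %Q0, %R0, c14" : "=r" (c));
        return c;
}

/* Not SMP-safe; good enough for a rough measurement, like the original. */
static u64 lock_ticks;
static u64 lock_calls;

/* The counter runs at 6.144 MHz (see ticks_to_ns() above), so this is ~1s. */
#define ONE_SECOND_TICKS 6144000ULL

static inline unsigned long timed_irq_save(void)
{
        unsigned long flags;
        u64 t1, t2;

        t1 = read_counter();
        flags = arch_local_irq_save();
        t2 = read_counter();

        lock_ticks += t2 - t1;
        lock_calls++;

        /* Report once per accumulated second, including the per-call average. */
        if (lock_ticks > ONE_SECOND_TICKS) {
                printk("irq_save: %llu calls, %llu ticks, avg %llu ticks/call\n",
                       lock_calls, lock_ticks, lock_ticks / lock_calls);
                lock_ticks = 0;
                lock_calls = 0;
        }

        return flags;
}

If the average ticks per call turns out to be similar in both setups, the
extra load comes from how often IRQs are toggled rather than from the cost
of the mrs/msr instructions themselves.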
Best regards.

Iurii Konovalenko | Senior Software Engineer
GlobalLogic

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel