
Re: [Xen-devel] Xen on ARM IRQ latency and scheduler overhead



On Thu, 2017-02-09 at 16:54 -0800, Stefano Stabellini wrote:
> These are the results, in nanosec:
> 
>                         AVG     MIN     MAX     WARM MAX
> 
> NODEBUG no WFI          1890    1800    3170    2070
> NODEBUG WFI             4850    4810    7030    4980
> NODEBUG no WFI credit2  2217    2090    3420    2650
> NODEBUG WFI credit2     8080    7890    10320   8300
> 
> DEBUG no WFI            2252    2080    3320    2650
> DEBUG WFI               6500    6140    8520    8130
> DEBUG WFI, credit2      8050    7870    10680   8450
> 
> As you can see, depending on whether the guest issues a WFI or not
> while
> waiting for interrupts, the results change significantly.
> Interestingly,
> credit2 does worse than credit1 in this area.
> 
I did some measuring myself, on x86, with different tools. So,
cyclictest is basically something very similar to Stefano's app.

I've run it both within Dom0 and inside a guest. I also ran a Xen
build (in this case, only inside of the guest).
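
For reference, what cyclictest measures boils down to something like
the sketch below (not the actual cyclictest code, just the idea;
interval and loop count are made up): sleep until an absolute
deadline, then record how late the wakeup actually was.

/* Minimal sketch of a cyclictest-like measurement loop.
 * (Not the real cyclictest code; interval/loops are arbitrary.) */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

int main(void)
{
    const long interval_ns = 1000000;   /* 1ms, as in one of the runs */
    const int loops = 10000;
    struct timespec next, now;
    int64_t sum = 0, max = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < loops; i++) {
        /* Next absolute deadline. */
        next.tv_nsec += interval_ns;
        while (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);

        /* Wakeup latency = actual wakeup time - requested deadline. */
        int64_t lat = ts_to_ns(&now) - ts_to_ns(&next);
        sum += lat;
        if (lat > max)
            max = lat;
    }
    printf("avg %lld ns, max %lld ns\n",
           (long long)(sum / loops), (long long)max);
    return 0;
}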

> We are down to 2000-3000ns. Then, I started investigating the
> scheduler.
> I measured how long it takes to run "vcpu_unblock": 1050ns, which is
> significant. I don't know what is causing the remaining 1000-2000ns,
> but
> I bet on another scheduler function. Do you have any suggestions on
> which one?
> 
So, vcpu_unblock() calls vcpu_wake(), which then invokes the
scheduler's wakeup related functions.

If you time vcpu_unblock(), from beginning to end of the function, you
actually capture quite a few things. E.g., the scheduler lock is taken
inside vcpu_wake(), so you're basically including time spent waiting
on the lock in the estimation.

That is probably ok (as in, lock contention definitely is relevant to
latency), but it does mean you should expect things to be rather
different between Credit1 and Credit2.
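
Just to make the point concrete, here's a userspace analogy (nothing
Xen specific, all names made up): if the function you put your
timestamps around takes a contended lock internally, the time spent
waiting for that lock ends up in your numbers. The second thread below
plays the role of another pCPU holding the scheduler lock.

#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Stand-in for vcpu_unblock()->vcpu_wake(): the lock is taken inside. */
static void do_wake(void)
{
    pthread_mutex_lock(&sched_lock);
    /* the actual wakeup work would happen here */
    pthread_mutex_unlock(&sched_lock);
}

/* Stand-in for another pCPU grabbing and holding the scheduler lock. */
static void *lock_hog(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&sched_lock);
        usleep(100);                 /* hold the lock for a while */
        pthread_mutex_unlock(&sched_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, lock_hog, NULL);

    for (int i = 0; i < 10; i++) {
        int64_t t1 = now_ns();
        do_wake();
        int64_t t2 = now_ns();
        /* Includes any time spent blocked on sched_lock. */
        printf("do_wake took %lld ns\n", (long long)(t2 - t1));
        usleep(1000);
    }
    return 0;
}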

I've, OTOH, tried to time SCHED_OP(wake) and SCHED_OP(do_schedule),
and here are the results. Numbers are in cycles (I've used RDTSC) and,
to make sure I get consistent and comparable numbers, I've set the
frequency scaling governor to performance.

Dom0, [performance]
                cyclictest 1us  cyclictest 1ms  cyclictest 100ms
(cycles)        Credit1 Credit2 Credit1 Credit2 Credit1 Credit2
wakeup-avg      2429    2035    1980    1633    2535    1979
wakeup-max      14577   113682  15153   203136  12285   115164
sched-avg       1716    1860    2527    1651    2286    1670
sched-max       16059   15000   12297   101760  15831   13122
                                                                
VM, [performance]                                                       
                cyclictest 1us  cyclictest 1ms  cyclictest 100ms make -j xen    
(cycles)        Credit1 Credit2 Credit1 Credit2 Credit1 Credit2  Credit1 Credit2
wakeup-avg      2213    2128    1944    2342    2374    2213     2429    1618
wakeup-max      9990    10104   11262   9927    10290   10218    14430   15108
sched-avg       2437    2472    1620    1594    2498    1759     2449    1809
sched-max       14100   14634   10071   9984    10878   8748     16476   14220
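
(For reference, the instrumentation itself is nothing fancy. Outside
of the hypervisor, the technique would look roughly like the sketch
below; the stand-in function is where SCHED_OP(wake) or
SCHED_OP(do_schedule) would be in the actual measurements.)

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Stand-in for the code being measured. */
static void measured_op(void)
{
    asm volatile ("" ::: "memory");
}

int main(void)
{
    const int samples = 100000;
    uint64_t sum = 0, max = 0;

    for (int i = 0; i < samples; i++) {
        uint64_t t1 = __rdtsc();
        measured_op();
        uint64_t t2 = __rdtsc();

        uint64_t delta = t2 - t1;
        sum += delta;
        if (delta > max)
            max = delta;
    }

    printf("avg %llu cycles, max %llu cycles\n",
           (unsigned long long)(sum / samples), (unsigned long long)max);
    return 0;
}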

Actually, the TSC on this box should be stable and invariant, so I
guess I can also try with the default governor. I will do that on
Monday. Does ARM have frequency scaling (I remember something about it
on xen-devel, but I am not sure whether it landed upstream)?

But anyway. You're seeing big differences between Credit1 and Credit2,
while I, at least as far as the actual schedulers' code is concerned,
don't.

Credit2 shows higher wakeup-max values, but only in the cases where
the workload runs in dom0. But it also shows better (lower) averages,
in both the kinds of workload considered and in both the dom0 and VM
cases.

I therefore wonder what is actually responsible for the huge
differences between the two schedulers that you are seeing... it could
be lock contention, but with only 4 pCPUs and 2 active vCPUs, I
honestly doubt it...

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

