Re: [Xen-devel] [RFC v2 00/16] Old GIC (gic-vgic) optimizations for GICV2
Hello Andre,

On 02.01.19 20:33, André Przywara wrote:
> Many thanks for generating these numbers, this is very useful. But:
> could you make any sense out of them? I plotted them, but they don't
> seem to be very conclusive.

Those numbers are mostly intended to show per-patch effects. But I kept them applied cumulatively because of the several dependencies between the patches. Also, the number of possible patch combinations is quite big :), so I stuck with this one. I guess plotting is not very informative in this particular case.

> I am scratching my head about the following issues:
>
> - The behaviour seems to be somewhat different between the four cases.
>   Which one do you care about, in particular? Is vwfi=native an option
>   for your use case? Which scheduler do you need or want to use?
> - Which of the four columns (max, warm_max, min, avg) are you actually
>   interested in? For a hard realtime system I would expect max, or
>   maybe warm_max?

I think of max (warm_max) for RT, and of avg for multimedia (infotainment) use cases.

> - Some of the patches seem to *increase* the latency. Patch 08/16 for
>   instance sticks out here. Most of the times the latency decreases
>   again afterwards, with a later patch, but I wonder if you could just
>   pick the patches that help or somehow explain those outliers.

Surprisingly to me, some patches *do* increase the latency. But I kept them in the series in order to show all the effects.

> - Can you give some more background on how you generated the numbers?
>   I haven't looked with too much detail into the benchmark, but I
>   wonder about:
>   * What is the runtime of your test? It seems to run forever and
>     updates the stats twice a second, if I get this correctly. So for
>     how long did you let it run?

I ran it until avg stabilized near some minimal value, and then a bit longer. Please note that avg is a moving average here.

>   * Do we have any idea what the reliability of the values is? Can we
>     somehow calculate the standard deviation, for instance?

I guess TBM could be changed to do that.
> That would help to get an idea about the error range we are looking at.
>
> * Is this still the same system as in [4]?

Yes, sure. The system is the same.

> The resolution in there is only 120ns, some of the values for two
> patches differ exactly by that amount.

The timer resolution is 120ns, and I can't squeeze anything better out of my HW :(.

> So is this actually within the error margin?

Here the avg value gives a clue about the patch impact.

> Also it seems that this test is the only thing running on the system,
> beside the idle VCPU. Is that a reasonable way to assess real world
> interrupt latency?

Not really. It is a trivial, but reproducible, test bench. It gives numbers for the straightforward interrupt delivery path only: no contention on locks, no concurrent interrupt injection, and no processing of multiple interrupts per hypervisor entry are covered.

> For instance that means that the IRQ handler mostly hits even in the
> L1 cache, which I wouldn't expect in actual systems.

Agree. The method of separating warm_max from max seems to support this.

> Do you have some numbers from that 3D benchmark you mentioned earlier,
> to put this into perspective and show that improvements in one benefit
> the other as well?

I did not check this particular patch series with glmark2 yet.

> Also I looked at some code and started to cook up something myself [6].
> The first two patches there should replace your patch 01/16 in a much
> cleaner and easier way, along the lines I mentioned before in a reply
> to a former post of yours.

So, the numbers for those two patches on top of 7f28661f6a7ce3d82f881b9afedfebca7f2cf116 are as follows:

    xen/arm: VGIC: Optimise is_lpi() for GICv2-only configuration
    xen/arm: VGIC: wrap PRISTINE_LPI bit check

    sched=credit2 vwfi=trap:   max=9600 warm_max=6840 min=6600 avg=6608
    sched=credit2 vwfi=native: max=4320 warm_max=3120 min=2880 avg=2905
    sched=credit  vwfi=trap:   max=9120 warm_max=7200 min=6600 avg=6870
    sched=credit  vwfi=native: max=4320 warm_max=3000 min=2880 avg=2887

> Then I looked at the IRQ handler and stumbled upon the function
> pointers we are using.
> I was eyeing them before, because my hunch is they are costly,
> especially on big cores, as it might be hard for the CPU to speculate
> correctly. Basically something like a call to gic_hw_ops->gic_read_irq()
> translates into:
>
>         ldr     x0, [x20]
>         ldr     x0, [x0, #72]
>         blr     x0
>
> That contains two dependent loads. If one of them misses in the cache,
> the whole pipeline stalls, unless the CPU speculates both loads
> correctly (which it might, but we don't know). This is extra annoying
> since those function pointers never change, and always point to the
> GICv2 functions if CONFIG_GICV3 is not defined. On top of this, some of
> the functions are really trivial, so we pay a big price for something
> that might be a single MMIO read or even a system register access.
>
> I think the proper solution (TM) would be to patch these accesses once
> we know which GIC we are running on. Linux does something to this
> effect, which benefits GICv3 in particular. From quickly looking at
> it, Xen seems to lack the infrastructure (jump labels and more
> sophisticated runtime patching) to easily copy this method. So the top
> three patches in [6] address this in a somewhat hackish way, just to
> show whether this method improves anything. I have only compile tested
> this, so I would be curious if you could give it a try and test both
> functionality and performance. A nice side effect is that those
> patches benefit both the old and the new VGIC. The effect on the TBM
> benchmark might not be too visible, though, due to the hot caches.

The numbers for those three patches on top of 7f28661f6a7ce3d82f881b9afedfebca7f2cf116 are as follows:

    xen/arm: VGIC: always use GICv2 functions if GICV3 is not defined
    xen/arm: VGIC: export GICv2 functions
    xen/arm: VGIC: wrap gic_hw_ops calls

    sched=credit2 vwfi=trap:   max=9360 warm_max=6960 min=6600 avg=6684
    sched=credit2 vwfi=native: max=4320 warm_max=3120 min=2880 avg=2981
    sched=credit  vwfi=trap:   max=9600 warm_max=7440 min=6600 avg=7027
    sched=credit  vwfi=native: max=4560 warm_max=3000 min=2880 avg=2889

As you can see, the numbers are inconsistent: for the sched=credit vwfi=trap column the max actually increased.
I guess it should also be checked with glmark2, with several concurrent VMs.

> Cheers,
> Andre.
>
> [6] https://github.com/Andre-ARM/xen/commits/vgic-opt

Per-patch numbers, applied cumulatively on top of the baseline. Columns, in order:
sched=credit2 vwfi=trap | sched=credit2 vwfi=native | sched=credit vwfi=trap | sched=credit vwfi=native

7f28661f6a7ce3d82f881b9afedfebca7f2cf116 (baseline)
  max=9480 warm_max=7200 min=6600 avg=6743 | max=4680 warm_max=3240 min=3000 avg=3007 | max=9480 warm_max=7920 min=6720 avg=7009 | max=4560 warm_max=3000 min=2880 avg=2979

gic:gic-vgic: separate GIV3 code more thoroughly
  max=9720 warm_max=6960 min=6600 avg=6617 | max=5040 warm_max=3840 min=2880 avg=2905 | max=9480 warm_max=7200 min=6600 avg=6871 | max=4560 warm_max=3000 min=2880 avg=2887

gic-vgic:vgic: avoid excessive conversions
  max=9360 warm_max=6720 min=6480 avg=6578 | max=4800 warm_max=3120 min=2880 avg=2895 | max=9480 warm_max=7080 min=6600 avg=6804 | max=4800 warm_max=3120 min=2880 avg=2887

gic:vgic:gic-vgic: introduce non-atomic bitops
  max=9120 warm_max=6600 min=6480 avg=6546 | max=4920 warm_max=3000 min=2760 avg=2872 | max=9120 warm_max=6720 min=6480 avg=6574 | max=4200 warm_max=3120 min=2760 avg=2798

gic: drop interrupts enabling on interrupts processing
  max=9240 warm_max=7080 min=6360 avg=6492 | max=5040 warm_max=3240 min=2760 avg=2767 | max=9240 warm_max=6720 min=6480 avg=6491 | max=4440 warm_max=3000 min=2760 avg=2809

gic-vgic: skip irqs locking in gic_restore_pending_irqs()
  max=9000 warm_max=6720 min=6360 avg=6430 | max=4320 warm_max=3120 min=2640 avg=2671 | max=9240 warm_max=6720 min=6360 avg=6459 | max=4440 warm_max=2880 min=2640 avg=2668

vgic: move pause_flags check out of vgic spinlock
  max=9240 warm_max=6720 min=6360 avg=6431 | max=4800 warm_max=2880 min=2640 avg=2675 | max=9360 warm_max=6600 min=6360 avg=6435 | max=4440 warm_max=2760 min=2640 avg=2647

vgic: move irq_to_pending out of lock
  max=8520 warm_max=7440 min=6360 avg=6444 | max=4680 warm_max=3000 min=2640 avg=2753 | max=9480 warm_max=6720 min=6360 avg=6445 | max=4200 warm_max=3000 min=2640 avg=2667

gic-vgic:vgic: do not keep disabled IRQs in any of queues
  max=9120 warm_max=7920 min=6360 avg=6447 | max=4440 warm_max=2760 min=2760 avg=2767 | max=10440 warm_max=7560 min=6360 avg=6459 | max=4440 warm_max=3840 min=2640 avg=2669

xen/arm: Re-enable interrupt later in the trap path
  max=9720 warm_max=9120 min=6360 avg=6441 | max=4440 warm_max=2880 min=2760 avg=2767 | max=9360 warm_max=6960 min=6360 avg=6451 | max=4680 warm_max=2880 min=2640 avg=2675

gic-vgic: skip irqs locking in vgic_sync_from_lrs
  max=9240 warm_max=7080 min=6360 avg=6431 | max=4920 warm_max=3120 min=2640 avg=2678 | max=9480 warm_max=6960 min=6360 avg=6443 | max=4680 warm_max=2880 min=2640 avg=2667

gic-v2: Write HCR only on change
  max=9840 warm_max=6600 min=6360 avg=6459 | max=4440 warm_max=2760 min=2520 avg=2527 | max=9480 warm_max=7920 min=6360 avg=6445 | max=4320 warm_max=2760 min=2520 avg=2527

gic-v2: avoid HCR reading for GICv2
  max=9480 warm_max=7680 min=6360 avg=6443 | max=4320 warm_max=2880 min=2520 avg=2527 | max=9360 warm_max=7080 min=6720 avg=6750 | max=3960 warm_max=2640 min=2400 avg=2487

hack: arm/domain: simplify context restore from idle vcpu
  max=9360 warm_max=6720 min=6000 avg=6214 | max=5040 warm_max=2640 min=2520 avg=2527 | max=9480 warm_max=7080 min=6240 avg=6367 | max=4080 warm_max=2880 min=2400 avg=2527

hack: move gicv2 LRs reads and writes out of spinlocks
  max=9480 warm_max=6840 min=6600 avg=6612 | max=4800 warm_max=2760 min=2640 avg=2739 | max=9000 warm_max=7200 min=6600 avg=6636 | max=4560 warm_max=2760 min=2520 avg=2619

gic: vgic: align frequently accessed data by cache line size
  max=9840 warm_max=6600 min=6240 avg=6288 | max=4440 warm_max=2880 min=2640 avg=2682 | max=8280 warm_max=6720 min=6360 avg=6488 | max=4080 warm_max=2880 min=2640 avg=2678

gic: separate ppi processing
  NOT SUITABLE FOR EVALUATION WITH TBM

[1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03328.html
[2] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03291.html
[3] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03285.html
[4]
https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg00881.html
[5] https://docs.google.com/spreadsheets/d/1J_u9UDowaonnaKtgiugXqtIFT-c2E4Ss2vxgL6NnbNo/edit?usp=sharing

Andrii Anisov (15):
  gic:gic-vgic: separate GIV3 code more thoroughly
  gic-vgic:vgic: avoid excessive conversions
  gic:vgic:gic-vgic: introduce non-atomic bitops
  gic: drop interrupts enabling on interrupts processing
  gic-vgic: skip irqs locking in gic_restore_pending_irqs()
  vgic: move pause_flags check out of vgic spinlock
  vgic: move irq_to_pending out of lock
  gic-vgic:vgic: do not keep disabled IRQs in any of queues
  gic-vgic: skip irqs locking in vgic_sync_from_lrs
  gic-v2: Write HCR only on change
  gic-v2: avoid HCR reading for GICv2
  hack: arm/domain: simplify context restore from idle vcpu
  hack: move gicv2 LRs reads and writes out of spinlocks
  gic: vgic: align frequently accessed data by cache line size
  gic: separate ppi processing

Julien Grall (1):
  xen/arm: Re-enable interrupt later in the trap path

[11] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03282.html

 xen/arch/arm/arm64/entry.S   | 11 +++---
 xen/arch/arm/domain.c        | 25 +++++++-----
 xen/arch/arm/gic-v2.c        | 82 +++++++++++++++++++++++++-------------
 xen/arch/arm/gic-v3-its.c    |  2 +
 xen/arch/arm/gic-v3-lpi.c    |  2 +
 xen/arch/arm/gic-v3.c        |  4 +-
 xen/arch/arm/gic-vgic.c      | 87 +++++++++++++++++++++++++----------------
 xen/arch/arm/gic.c           | 32 +++++++++++++--
 xen/arch/arm/irq.c           | 32 +++++++++++++++
 xen/arch/arm/traps.c         |  6 +++
 xen/arch/arm/vgic-v3-its.c   |  2 +-
 xen/arch/arm/vgic.c          | 93 +++++++++++++++++++++++++++++++++++++-------
 xen/arch/arm/vgic/vgic.c     |  2 +
 xen/include/asm-arm/config.h |  2 +-
 xen/include/asm-arm/gic.h    | 10 ++---
 xen/include/asm-arm/irq.h    |  3 ++
 xen/include/asm-arm/vgic.h   | 24 ++++++++----
 xen/include/xen/sched.h      |  1 +
 18 files changed, 310 insertions(+), 110 deletions(-)

--
Sincerely,
Andrii Anisov.