Re: [Xen-devel] [RFC v2 00/16] Old GIC (gic-vgic) optimizations for GICV2
Hello Andre,

On 02.01.19 20:33, André Przywara wrote:
> Many thanks for generating these numbers, this is very useful. But:
> could you make any sense out of them? I plotted them, but they don't
> seem to be very conclusive.

Those numbers are mostly intended to show per-patch effects. But I kept them applied cumulatively because of the several dependencies between the patches. Also, the number of possible patch combinations is quite big :), so I stuck with this one. I guess plotting is not very informative in this particular case.

> I am scratching my head about the following issues:
>
> - The behaviour seems to be somewhat different between the four cases.
>   Which one do you care about, in particular? Is vwfi=native an option
>   for your use case? Which scheduler do you need or want to use?
> - Which of the four columns (max, warm_max, min, avg) are you actually
>   interested in? For a hard realtime system I would expect max, or
>   maybe warm_max?

I think of max (warm_max) for RT, and of avg for multimedia (infotainment) use cases.

> - Some of the patches seem to *increase* the latency. Patch 08/16 for
>   instance sticks out here. Most of the times the latency decreases
>   again afterwards, with a later patch, but I wonder if you could just
>   pick the patches that help or somehow explain those outliers.

Surprisingly to me, some patches *do* increase the latency. But I kept them in the series in order to show all the effects.

> - Can you give some more background on how you generated the numbers?
>   I haven't looked with too much detail into the benchmark, but I
>   wonder about:
>   * What is the runtime of your test? It seems to run forever and
>     updates the stats twice a second, if I get this correctly. So for
>     how long did you let it run?

I ran it until avg stabilized near some minimal value, and then a bit longer. Please note that avg is a moving average here.

>   * Do we have any idea what the reliability of the values is? Can we
>     somehow calculate the standard deviation, for instance?

I guess TBM could be changed to do that.
> That would help to get an idea about the error range we are looking at.
>
> * Is this still the same system as in [4]?

Yes, sure. The system is the same.

> The resolution in there is only 120ns, some of the values for two
> patches differ exactly by that amount.

The timer resolution is 120ns, and I can't squeeze anything better out of my HW :(.

> So is this actually within the error margin?

Here the avg value gives a clue about the patch impact.

> Also it seems that this test is the only thing running on the system,
> beside the idle VCPU. Is that a reasonable way to assess real world
> interrupt latency?

Not really. It is a trivial, but reproducible, test bench. It gives numbers for the straightforward interrupt delivery path only: no contention on locks, no concurrent interrupt injection, and no processing of multiple interrupts per hypervisor entry are covered.

> For instance that means that the IRQ handler mostly hits even in the
> L1 cache, which I wouldn't expect in actual systems.

Agree. The method of separating warm_max from max seems to support this.

> Do you have some numbers from that 3D benchmark you mentioned earlier,
> to put this into perspective and show that improvements in one benefit
> the other as well?

I did not check this particular patch series with glmark2 yet.

> Also I looked at some code and started to cook up something myself [6].
> The first two patches there should replace your patch 01/16 in a much
> cleaner and easier way, along the lines I mentioned before in a reply
> to a former post of yours.

So, the numbers for those two patches on top of 7f28661f6a7ce3d82f881b9afedfebca7f2cf116 are as follows:

    xen/arm: VGIC: Optimise is_lpi() for GICv2-only configuration
    xen/arm: VGIC: wrap PRISTINE_LPI bit check

    sched=credit2 vwfi=trap:   max=9600 warm_max=6840 min=6600 avg=6608
    sched=credit2 vwfi=native: max=4320 warm_max=3120 min=2880 avg=2905
    sched=credit  vwfi=trap:   max=9120 warm_max=7200 min=6600 avg=6870
    sched=credit  vwfi=native: max=4320 warm_max=3000 min=2880 avg=2887

> Then I looked at the IRQ handler and stumbled upon the function
> pointers we are using.
> I was eyeing them before, because my hunch is they are costly,
> especially on big cores, as it might be hard for the CPU to speculate
> correctly. Basically something like a call to gic_hw_ops->gic_read_irq()
> translates into:
>
>         ldr     x0, [x20]
>         ldr     x0, [x0, #72]
>         blr     x0
>
> That contains two dependent loads. If one of them misses in the cache,
> the whole pipeline stalls, unless the CPU speculates both loads
> correctly (which it might, but we don't know). This is extra annoying
> since those function pointers never change, and always point to the
> GICv2 functions if CONFIG_GICV3 is not defined. On top of this, some of
> the functions are really trivial, so we pay a big price for something
> that might be a single MMIO read or even a system register access.
>
> I think the proper solution (TM) would be to patch these accesses once
> we know which GIC we are running on. Linux does something to this
> effect, which benefits GICv3 in particular. From quickly looking at
> it, Xen seems to lack the infrastructure (jump labels and more
> sophisticated runtime patching) to easily copy this method. So the top
> three patches in [6] address this in a somewhat hackish way, just to
> show whether this method improves anything. I have only compile tested
> this, so I would be curious if you could give it a try and test both
> functionality and performance. A nice side effect is that those
> patches benefit both the old and the new VGIC. The effect on the TBM
> benchmark might not be too visible, though, due to the hot caches.

The numbers for those three patches on top of 7f28661f6a7ce3d82f881b9afedfebca7f2cf116 are as follows:

    xen/arm: VGIC: always use GICv2 functions if GICV3 is not defined
    xen/arm: VGIC: export GICv2 functions
    xen/arm: VGIC: wrap gic_hw_ops calls

    sched=credit2 vwfi=trap:   max=9360 warm_max=6960 min=6600 avg=6684
    sched=credit2 vwfi=native: max=4320 warm_max=3120 min=2880 avg=2981
    sched=credit  vwfi=trap:   max=9600 warm_max=7440 min=6600 avg=7027
    sched=credit  vwfi=native: max=4560 warm_max=3000 min=2880 avg=2889

As you can see, the numbers are inconsistent: for the sched=credit vwfi=trap column the max actually increased.
I guess it should also be checked with glmark2, with several concurrent VMs.

> Cheers,
> Andre.
>
> [6] https://github.com/Andre-ARM/xen/commits/vgic-opt

Per-patch numbers, applied cumulatively on top of the baseline. Columns, in order:
sched=credit2 vwfi=trap | sched=credit2 vwfi=native | sched=credit vwfi=trap | sched=credit vwfi=native

7f28661f6a7ce3d82f881b9afedfebca7f2cf116 (baseline)
  max=9480 warm_max=7200 min=6600 avg=6743 | max=4680 warm_max=3240 min=3000 avg=3007 | max=9480 warm_max=7920 min=6720 avg=7009 | max=4560 warm_max=3000 min=2880 avg=2979

gic:gic-vgic: separate GIV3 code more thoroughly
  max=9720 warm_max=6960 min=6600 avg=6617 | max=5040 warm_max=3840 min=2880 avg=2905 | max=9480 warm_max=7200 min=6600 avg=6871 | max=4560 warm_max=3000 min=2880 avg=2887

gic-vgic:vgic: avoid excessive conversions
  max=9360 warm_max=6720 min=6480 avg=6578 | max=4800 warm_max=3120 min=2880 avg=2895 | max=9480 warm_max=7080 min=6600 avg=6804 | max=4800 warm_max=3120 min=2880 avg=2887

gic:vgic:gic-vgic: introduce non-atomic bitops
  max=9120 warm_max=6600 min=6480 avg=6546 | max=4920 warm_max=3000 min=2760 avg=2872 | max=9120 warm_max=6720 min=6480 avg=6574 | max=4200 warm_max=3120 min=2760 avg=2798

gic: drop interrupts enabling on interrupts processing
  max=9240 warm_max=7080 min=6360 avg=6492 | max=5040 warm_max=3240 min=2760 avg=2767 | max=9240 warm_max=6720 min=6480 avg=6491 | max=4440 warm_max=3000 min=2760 avg=2809

gic-vgic: skip irqs locking in gic_restore_pending_irqs()
  max=9000 warm_max=6720 min=6360 avg=6430 | max=4320 warm_max=3120 min=2640 avg=2671 | max=9240 warm_max=6720 min=6360 avg=6459 | max=4440 warm_max=2880 min=2640 avg=2668

vgic: move pause_flags check out of vgic spinlock
  max=9240 warm_max=6720 min=6360 avg=6431 | max=4800 warm_max=2880 min=2640 avg=2675 | max=9360 warm_max=6600 min=6360 avg=6435 | max=4440 warm_max=2760 min=2640 avg=2647

vgic: move irq_to_pending out of lock
  max=8520 warm_max=7440 min=6360 avg=6444 | max=4680 warm_max=3000 min=2640 avg=2753 | max=9480 warm_max=6720 min=6360 avg=6445 | max=4200 warm_max=3000 min=2640 avg=2667

gic-vgic:vgic: do not keep disabled IRQs in any of queues
  max=9120 warm_max=7920 min=6360 avg=6447 | max=4440 warm_max=2760 min=2760 avg=2767 | max=10440 warm_max=7560 min=6360 avg=6459 | max=4440 warm_max=3840 min=2640 avg=2669

xen/arm: Re-enable interrupt later in the trap path
  max=9720 warm_max=9120 min=6360 avg=6441 | max=4440 warm_max=2880 min=2760 avg=2767 | max=9360 warm_max=6960 min=6360 avg=6451 | max=4680 warm_max=2880 min=2640 avg=2675

gic-vgic: skip irqs locking in vgic_sync_from_lrs
  max=9240 warm_max=7080 min=6360 avg=6431 | max=4920 warm_max=3120 min=2640 avg=2678 | max=9480 warm_max=6960 min=6360 avg=6443 | max=4680 warm_max=2880 min=2640 avg=2667

gic-v2: Write HCR only on change
  max=9840 warm_max=6600 min=6360 avg=6459 | max=4440 warm_max=2760 min=2520 avg=2527 | max=9480 warm_max=7920 min=6360 avg=6445 | max=4320 warm_max=2760 min=2520 avg=2527

gic-v2: avoid HCR reading for GICv2
  max=9480 warm_max=7680 min=6360 avg=6443 | max=4320 warm_max=2880 min=2520 avg=2527 | max=9360 warm_max=7080 min=6720 avg=6750 | max=3960 warm_max=2640 min=2400 avg=2487

hack: arm/domain: simplify context restore from idle vcpu
  max=9360 warm_max=6720 min=6000 avg=6214 | max=5040 warm_max=2640 min=2520 avg=2527 | max=9480 warm_max=7080 min=6240 avg=6367 | max=4080 warm_max=2880 min=2400 avg=2527

hack: move gicv2 LRs reads and writes out of spinlocks
  max=9480 warm_max=6840 min=6600 avg=6612 | max=4800 warm_max=2760 min=2640 avg=2739 | max=9000 warm_max=7200 min=6600 avg=6636 | max=4560 warm_max=2760 min=2520 avg=2619

gic: vgic: align frequently accessed data by cache line size
  max=9840 warm_max=6600 min=6240 avg=6288 | max=4440 warm_max=2880 min=2640 avg=2682 | max=8280 warm_max=6720 min=6360 avg=6488 | max=4080 warm_max=2880 min=2640 avg=2678

gic: separate ppi processing
  NOT SUITABLE FOR EVALUATION WITH TBM

[1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03328.html
[2] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03291.html
[3] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03285.html
[4]
https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg00881.html
[5] https://docs.google.com/spreadsheets/d/1J_u9UDowaonnaKtgiugXqtIFT-c2E4Ss2vxgL6NnbNo/edit?usp=sharing

Andrii Anisov (15):
  gic:gic-vgic: separate GIV3 code more thoroughly
  gic-vgic:vgic: avoid excessive conversions
  gic:vgic:gic-vgic: introduce non-atomic bitops
  gic: drop interrupts enabling on interrupts processing
  gic-vgic: skip irqs locking in gic_restore_pending_irqs()
  vgic: move pause_flags check out of vgic spinlock
  vgic: move irq_to_pending out of lock
  gic-vgic:vgic: do not keep disabled IRQs in any of queues
  gic-vgic: skip irqs locking in vgic_sync_from_lrs
  gic-v2: Write HCR only on change
  gic-v2: avoid HCR reading for GICv2
  hack: arm/domain: simplify context restore from idle vcpu
  hack: move gicv2 LRs reads and writes out of spinlocks
  gic: vgic: align frequently accessed data by cache line size
  gic: separate ppi processing

Julien Grall (1):
  xen/arm: Re-enable interrupt later in the trap path

[11] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03282.html

 xen/arch/arm/arm64/entry.S   | 11 +++---
 xen/arch/arm/domain.c        | 25 +++++++-----
 xen/arch/arm/gic-v2.c        | 82 +++++++++++++++++++++++++-------------
 xen/arch/arm/gic-v3-its.c    |  2 +
 xen/arch/arm/gic-v3-lpi.c    |  2 +
 xen/arch/arm/gic-v3.c        |  4 +-
 xen/arch/arm/gic-vgic.c      | 87 +++++++++++++++++++++++++----------------
 xen/arch/arm/gic.c           | 32 +++++++++++++--
 xen/arch/arm/irq.c           | 32 +++++++++++++++
 xen/arch/arm/traps.c         |  6 +++
 xen/arch/arm/vgic-v3-its.c   |  2 +-
 xen/arch/arm/vgic.c          | 93 +++++++++++++++++++++++++++++++++++++-------
 xen/arch/arm/vgic/vgic.c     |  2 +
 xen/include/asm-arm/config.h |  2 +-
 xen/include/asm-arm/gic.h    | 10 ++---
 xen/include/asm-arm/irq.h    |  3 ++
 xen/include/asm-arm/vgic.h   | 24 ++++++++----
 xen/include/xen/sched.h      |  1 +
 18 files changed, 310 insertions(+), 110 deletions(-)

--
Sincerely,
Andrii Anisov.