Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed
On Wed, 31 Jul 2019 12:02:20 +0100
Julien Grall <julien.grall@xxxxxxx> wrote:

Hi,

> On 30/07/2019 18:35, Andrii Anisov wrote:
> >
> > On 26.07.19 13:59, Julien Grall wrote:
> >> Hi,
> >>
> >> On 26/07/2019 11:37, Andrii Anisov wrote:
> >>> From: Andrii Anisov <andrii_anisov@xxxxxxxx>
> >>>
> >>> On ARM64 we know exactly if trap happened from hypervisor or guest, so
> >>> we do not need to take that decision. This reduces a condition for
> >>> all enter_hypervisor_head calls and the function call for traps from
> >>> the hypervisor mode.
> >>
> >> One condition lost but ...
> >
> > ...In the hot path (actually at any trap).
>
> Everything is in the hot path here, yet there are a lot of other branches. So
> why this branch in particular?
>
> As I have mentioned a few times before, there are a difference between the
> theory and the practice. In theory, removing a branch looks nice. But in
> practice this may not be the case.
>
> In this particular case, I don't believe this is going to have a real impact
> on the performance.
>
> The PSTATE has been saved a few instructions before in cpu_user_regs, so there
> are high chance the value will still be in the L1 cache.

I agree on this, and second the idea of *not* micro-optimising code
just for the sake of it. If you have numbers that back this up, it
would be a different story.

> The compiler may also decide to do the direct branch when not in guest_mode.
> A trap from the hypervisor is mostly for interrupts. So there are chance this
> is not going to have a real impact on the overall of the interrupt handling.
>
> If you are really worry of the impact of branch then there are a few more
> important places (with a greater benefits) to look:
>      1) It seems the compiler use a jump table for the switch case in
> do_trap_guest_sync(), so it will result to multiple direct branch everytime.

I don't think it's worth "fixing" this issue.
The compiler has done this for a reason, and I would guess it figured
that this is cheaper than other ways of solving this. If you are really
paranoid about this, I would try compiling with tuning for your
particular core (-mtune), so that the compiler can apply more
micro-architectural knowledge about the cost of certain instructions.

>      2) Indirect branch have a non-negligible cost compare to direct branch.
> This is a lot used in the interrupt code (see gic_hw_ops->read_irq()). All of
> them are known at boot time, so they could be replace with direct branch. x86
> recently introduced alternative_call() for this purpose. This could be
> re-used on Arm.

This is indeed something I was always worried about: it looks cheap and
elegant in the C source code, but is potentially expensive on hardware.
The particular snippet is:
...
  249024:       d5033fdf        isb
  249028:       f9401e80        ldr     x0, [x20, #56]
  24902c:       f9407801        ldr     x1, [x0, #240]
  249030:       2a1303e0        mov     w0, w19
  249034:       d63f0020        blr     x1
...
In case of an interrupt, the first load will probably miss the cache,
and the CPU is then stuck: because of the dependencies, it can't do
much else. It can't even predict the branch and speculatively execute
anything, because the destination address is yet another dependent load
away.
This might not matter for little cores like A53s, because they wouldn't
speculate anyway. But better cores (A72, for instance) would most
likely benefit from an optimisation in this area.

Cheers,
Andre.
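
For illustration, the two call styles discussed in point 2) can be
sketched in plain C as below. This is only a sketch under stated
assumptions: every sketch_* name is hypothetical and none of this is
Xen's actual GIC code. It also does not implement alternative_call(),
which patches call sites at boot; variant B merely approximates the
idea with a boot-time flag and direct calls, so one load and a
(predictable) conditional branch remain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for two GIC back ends; return values are dummies. */
static uint32_t sketch_gicv2_read_irq(void) { return 2; }
static uint32_t sketch_gicv3_read_irq(void) { return 3; }

/* Variant A: ops table, in the style of gic_hw_ops->read_irq(). */
struct sketch_hw_ops {
    uint32_t (*read_irq)(void);
};

static const struct sketch_hw_ops sketch_v2_ops = {
    .read_irq = sketch_gicv2_read_irq,
};
static const struct sketch_hw_ops sketch_v3_ops = {
    .read_irq = sketch_gicv3_read_irq,
};

/* Chosen once at boot, never changed afterwards. */
static const struct sketch_hw_ops *sketch_ops;

uint32_t sketch_read_irq_indirect(void)
{
    /*
     * Two dependent loads (the ops pointer, then the function pointer)
     * followed by an indirect branch, like the ldr/ldr/blr sequence above.
     */
    return sketch_ops->read_irq();
}

/* Variant B: boot-time flag plus direct calls with fixed targets. */
enum sketch_gic_version { SKETCH_GIC_V2, SKETCH_GIC_V3 };

static enum sketch_gic_version sketch_version;

uint32_t sketch_read_irq_direct(void)
{
    /* One load and a conditional branch; each call target is known statically. */
    if (sketch_version == SKETCH_GIC_V3)
        return sketch_gicv3_read_irq();
    return sketch_gicv2_read_irq();
}

/* Boot-time selection that sets up both variants consistently. */
void sketch_gic_init(enum sketch_gic_version v)
{
    assert(v == SKETCH_GIC_V2 || v == SKETCH_GIC_V3);
    sketch_version = v;
    sketch_ops = (v == SKETCH_GIC_V3) ? &sketch_v3_ops : &sketch_v2_ops;
}
```

After a call such as sketch_gic_init(SKETCH_GIC_V3), both variants
return the same value; only the generated branch sequence differs.
Whether variant B is actually faster depends on the core and on what
the compiler emits (it may inline or merge the calls), so it would need
measuring before drawing conclusions.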