
Re: [Xen-devel] [RFC 5/6] arm64: call enter_hypervisor_head only when it is needed



On Wed, 31 Jul 2019 12:02:20 +0100
Julien Grall <julien.grall@xxxxxxx> wrote:

Hi,

> On 30/07/2019 18:35, Andrii Anisov wrote:
> > 
> > On 26.07.19 13:59, Julien Grall wrote:  
> >> Hi,
> >>
> >> On 26/07/2019 11:37, Andrii Anisov wrote:  
> >>> From: Andrii Anisov <andrii_anisov@xxxxxxxx>
> >>>
> >>> On ARM64 we know exactly if trap happened from hypervisor or guest, so
> >>> we do not need to take that decision. This reduces a condition for
> >>> all enter_hypervisor_head calls and the function call for traps from
> >>> the hypervisor mode.  
> >>
> >> One condition lost but ...  
> > 
> > ...In the hot path (actually at any trap).  
> 
> Everything is in the hot path here, yet there are a lot of other branches. So 
> why this branch in particular?
> 
> As I have mentioned a few times before, there is a difference between
> theory and practice. In theory, removing a branch looks nice. But in
> practice this may not be the case.
> 
> In this particular case, I don't believe this is going to have a real
> impact on performance.
> 
> The PSTATE has been saved a few instructions before in cpu_user_regs, so
> there is a high chance the value will still be in the L1 cache.

I agree on this, and second the idea of *not* micro-optimising code just for 
the sake of it. If you have numbers that back this up, it would be a different 
story.

> The compiler may also decide to do the direct branch when not in guest_mode.
> A trap from the hypervisor is mostly for interrupts. So there is a chance
> this is not going to have a real impact on the overall interrupt handling.
> 
> If you are really worried about the impact of branches, then there are a few
> more important places (with greater benefits) to look:
>      1) It seems the compiler uses a jump table for the switch case in
> do_trap_guest_sync(), so it will result in multiple direct branches every
> time.

I don't think it's worth "fixing" this issue. The compiler has done this for a 
reason, and I would guess it figured that this is cheaper than other ways of 
solving this. If you are really paranoid about this, I would try to compile 
this with tuning for your particular core (-mtune), so that the compiler can 
throw in more micro-architectural knowledge about the cost of certain 
instructions.
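
As a rough illustration of the kind of code in question (a made-up sketch,
not the actual do_trap_guest_sync(); all demo_* names and enum values are
hypothetical), a dense switch like the one below is a typical candidate for a
jump table. Whether the compiler actually emits one depends on the compiler
version and options, so comparing the disassembly with and without something
like -mtune=cortex-a72 is the only way to know:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exception classes; NOT the real HSR_EC_* values from Xen. */
enum demo_ec {
    DEMO_EC_WFI = 0,
    DEMO_EC_HVC,
    DEMO_EC_SMC,
    DEMO_EC_SYSREG,
    DEMO_EC_DATA_ABORT,
    DEMO_EC_INSTR_ABORT,
    DEMO_EC_MAX
};

/* Per-class counters, standing in for the real handlers' side effects. */
struct demo_stats {
    unsigned int wfi, hvc, smc, sysreg, abort, unknown;
};

/*
 * Dense, consecutive case labels: the compiler is free to lower this to a
 * jump table, a compare-and-branch chain, or something else entirely.
 */
void demo_dispatch(struct demo_stats *s, enum demo_ec ec)
{
    assert(s != NULL);

    switch ( ec )
    {
    case DEMO_EC_WFI:         s->wfi++;     break;
    case DEMO_EC_HVC:         s->hvc++;     break;
    case DEMO_EC_SMC:         s->smc++;     break;
    case DEMO_EC_SYSREG:      s->sysreg++;  break;
    case DEMO_EC_DATA_ABORT:  /* both abort classes share one counter */
    case DEMO_EC_INSTR_ABORT: s->abort++;   break;
    default:                  s->unknown++; break;
    }
}
```

With GCC, building this once with -O2 and once with -O2 -fno-jump-tables
should show the two lowering strategies side by side in objdump -d output.
Which of them is faster on a given core still has to be measured.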

>      2) Indirect branches have a non-negligible cost compared to direct
> branches. They are used a lot in the interrupt code (see
> gic_hw_ops->read_irq()). All of them are known at boot time, so they could
> be replaced with direct branches. x86 recently introduced alternative_call()
> for this purpose. This could be re-used on Arm.

This is indeed something I have always been worried about: it looks cheap and elegant 
in the C source code, but is potentially expensive on hardware. The particular 
snippet is:
...
  249024:       d5033fdf        isb
  249028:       f9401e80        ldr     x0, [x20, #56]
  24902c:       f9407801        ldr     x1, [x0, #240]
  249030:       2a1303e0        mov     w0, w19
  249034:       d63f0020        blr     x1
...
In case of an interrupt, the first load will probably miss the cache, and the 
CPU is stuck now, because due to the dependencies it can't do much else. It 
can't even predict the branch and speculatively execute anything, because the 
destination address is yet another dependent load away.
This might not matter for little cores like A53s, because they wouldn't 
speculate anyway. But better cores (A72, for instance) would most likely 
benefit from an optimisation in this area.
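
The difference between the two call styles can be sketched in plain C. This
is only an illustration under assumptions: the demo_* names, the two back
ends and the returned IRQ numbers are made up, and it is not how
alternative_call() works (which, going by the x86 implementation, patches the
call site at boot rather than testing a variable). The indirect variant needs
two dependent loads before the branch target is known, while the direct
variant only branches to targets that are fixed at compile time:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ops table such as gic_hw_ops. */
struct demo_irq_ops {
    unsigned int (*read_irq)(void);
};

/* Two made-up back ends; the return values are arbitrary. */
static unsigned int demo_v2_read_irq(void) { return 27; }
static unsigned int demo_v3_read_irq(void) { return 30; }

static const struct demo_irq_ops demo_v2_ops = { .read_irq = demo_v2_read_irq };
static const struct demo_irq_ops demo_v3_ops = { .read_irq = demo_v3_read_irq };

enum demo_kind { DEMO_V2, DEMO_V3 };

/* Both are written once at "boot" and never changed afterwards. */
static const struct demo_irq_ops *demo_ops = &demo_v2_ops;
static enum demo_kind demo_boot_kind = DEMO_V2;

void demo_init(enum demo_kind kind)
{
    demo_boot_kind = kind;
    demo_ops = (kind == DEMO_V3) ? &demo_v3_ops : &demo_v2_ops;
}

/* Indirect: load the ops pointer, load the function pointer, then blr. */
unsigned int demo_read_irq_indirect(void)
{
    assert(demo_ops != NULL);
    return demo_ops->read_irq();
}

/*
 * Direct: one predictable conditional branch on a boot-time value, then a
 * call to a fixed target that the compiler may even inline.
 */
unsigned int demo_read_irq_direct(void)
{
    return (demo_boot_kind == DEMO_V3) ? demo_v3_read_irq()
                                       : demo_v2_read_irq();
}
```

Both functions return the same value for a given demo_init() choice; only
the generated call sequence differs. Any real gain would have to be confirmed
with measurements on the cores of interest.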

Cheers,
Andre.
