
Re: [PATCH 3/5] x86/pv: Optimise prefetching in svm_load_segs()



On 09.09.2020 11:59, Andrew Cooper wrote:
> Split into two functions.  Passing a load of zeros in results in somewhat poor
> register scheduling in __context_switch().

I'm afraid I don't understand why this would be, even though I
trust that you have observed it: The registers used for passing
parameters are all call-clobbered anyway, so the compiler can't
use them for anything across the call. And it would be pretty
poor code generation if the XORs to clear them (which effectively
have no latency at all) were scheduled far ahead of the call,
especially when there's better use for the registers. Could the
observation perhaps date from before your recent dropping of two
of the parameters, when they couldn't all be passed in registers
(albeit even then it would be odd, as that change should merely
have led to a slightly smaller stack frame for the function)?

> Update the prefetching comment to note that the main point is the TLB fill.
> 
> Reorder the writes in svm_load_segs() to access the VMCB fields in ascending
> order, which gets better next-line prefetch behaviour out of hardware.  Update
> the prefetch instruction to match.
> 
> The net delta is:
> 
>   add/remove: 1/0 grow/shrink: 0/2 up/down: 38/-39 (-1)
>   Function                                     old     new   delta
>   svm_load_segs_prefetch                         -      38     +38
>   __context_switch                             967     951     -16
>   svm_load_segs                                291     268     -23

A net win of 1 byte ;-)
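
FWIW, this is how I read the reordering and prefetch adjustment being
described; the struct and field names below are made up for illustration
and are not the real struct vmcb_struct layout:

struct vmcb_sketch {
    unsigned long fs_base;      /* lowest address written */
    unsigned long gs_base;
    unsigned long gs_shadow;
    unsigned long ldtr_base;    /* highest address written */
};

void load_segs_sketch(struct vmcb_sketch *vmcb, unsigned long fs_base,
                      unsigned long gs_base, unsigned long gs_shadow,
                      unsigned long ldtr_base)
{
    /* Fill the TLB and pull in the first cache line touched below. */
    __builtin_prefetch(&vmcb->fs_base, 1 /* prepare for write */);

    /* Write in ascending address order, so the hardware's next-line
     * prefetcher runs ahead of the stores. */
    vmcb->fs_base   = fs_base;
    vmcb->gs_base   = gs_base;
    vmcb->gs_shadow = gs_shadow;
    vmcb->ldtr_base = ldtr_base;
}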

> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1520,6 +1520,19 @@ static void svm_init_erratum_383(const struct cpuinfo_x86 *c)
>  }
>  
>  #ifdef CONFIG_PV
> +void svm_load_segs_prefetch(void)
> +{
> +    struct vmcb_struct *vmcb = this_cpu(host_vmcb_va);

const?

> +    if ( vmcb )
> +        /*
> +         * The main reason for this prefetch is for the TLB fill.  Use the
> +         * opporunity to fetch the lowest address used, to get the best
> +         * behaviour out of hardwares next-line prefetcher.

Nit: "opportunity" and "hardware's" ?

I'm not opposed to the change, but before giving my R-b I'd like to
understand the register scheduling background a little better.

Jan
