 
	
| [Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
 On 23.08.2024 01:06, Andrew Cooper wrote:
> A few RFC points.
> 
>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>    still in xen/lib/ for now.
We indeed have nothing like that yet. The file name should then imo not be
arch-* though, but x86-*. Or you could put it in xen/lib/x86/. It could even
be obj-y rather than lib-y, unless you expect we'll be able to get rid of
all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
my ABI-level series the call site in arch_hweightl() could then be made go
away for v2 and above, at which point it living in lib-y will be preferable.
>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>    __attribute__((no_caller_saved_registers)) and can forgo writing this in 
> asm.
> 
>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>    habit of complaining about incompatible instructions even when it's not
>    emitting them.
What is this part about? What incompatible instructions, in particular?
> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned 
> long x)
>  }
>  #define arch_flsl arch_flsl
>  
> +static always_inline unsigned int arch_hweightl(unsigned long x)
> +{
> +    unsigned int r;
> +
> +    /*
> +     * arch_generic_hweightl() is written in ASM in order to preserve all
> +     * registers, as the compiler can't see the call.
> +     *
> +     * This limits the POPCNT instruction to using the same ABI as a function
> +     * call (input in %rdi, output in %eax) but that's fine.
> +     */
> +    alternative_io("call arch_generic_hweightl",
> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
> +                   [val] "D" (x));
If you made [val] an output ("+D") you could avoid preserving the register
in the function. And I'd expect the register wouldn't be re-used often
afterwards, so its clobbering likely won't harm code quality very much.
> --- /dev/null
> +++ b/xen/lib/arch-generic-hweightl.S
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +        .file __FILE__
> +
> +#include <xen/linkage.h>
> +
> +/*
> + * An implementation of generic_hweightl() used on hardware without the 
> POPCNT
> + * instruction.
> + *
> + * This function is called from within an ALTERNATIVE in arch_hweightl().
> + * i.e. behind the back of the compiler.  Therefore all registers are callee
> + * preserved.
> + *
> + * The ASM is what GCC-12 emits for generic_hweightl() in a release build of
> + * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
> + */
> +FUNC(arch_generic_hweightl)
> +        push   %rdi
> +        push   %rdx
> +
> +        movabs $0x5555555555555555, %rdx
> +        mov    %rdi, %rax
> +        shr    $1, %rax
> +        and    %rdx, %rax
> +        sub    %rax, %rdi
> +        movabs $0x3333333333333333, %rax
> +        mov    %rdi, %rdx
> +        shr    $0x2, %rdi
Could I talk you into omitting the 0x here and ...
> +        and    %rax, %rdx
> +        and    %rax, %rdi
> +        add    %rdi, %rdx
> +        mov    %rdx, %rax
> +        shr    $0x4, %rax
... here, and maybe use ...
> +        add    %rdx, %rax
> +        movabs $0xf0f0f0f0f0f0f0f, %rdx
> +        and    %rdx, %rax
> +        movabs $0x101010101010101, %rdx
> +        imul   %rdx, %rax
> +        shr    $0x38, %rax
... 56 here (or even BITS_PER_LONG-8)?
Jan
 
 
 | 
|  | Lists.xenproject.org is hosted with RackSpace, monitoring our |