[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available



On 21.03.23 16:35, Jan Beulich wrote:
On 21.03.2023 15:57, Roger Pau Monné wrote:
On Mon, Mar 20, 2023 at 10:48:45AM +0100, Jan Beulich wrote:
On 17.03.2023 13:26, Andrew Cooper wrote:
On 17/03/2023 11:22 am, Roger Pau Monné wrote:
On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote:
This is faster than using the software implementation, and the insn is
available on all half-way recent hardware. Therefore convert
generic_hweight<N>() to out-of-line functions (without affecting Arm)
and use alternatives patching to replace the function calls.

Note that the approach doesn#t work for clang, due to it not recognizing
-ffixed-*.
I've been giving this a look, and I wonder if it would be fine to
simply push and pop the scratch registers in the 'call' path of the
alternative, as that won't require any specific compiler option.

Hmm, ...

It's been a long while, and in that time I've learnt a lot more about
performance, but my root objection to the approach taken here still
stands - it is penalising the common case to optimise some pointless
corner cases.

Yes - on the call path, an extra push/pop pair (or few) to get temp
registers is basically free.

... what is "a few"? We'd need to push/pop all call-clobbered registers
except %rax, i.e. a total of eight. I consider this too much. Unless,
as you suggest further down, we wrote the fallback in assembly. Which I
have to admit I'm surprised you propose when we strive to reduce the
amount of assembly we have to maintain.

AMD added popcnt in 2007 and Intel in 2008.  While we shouldn't
mandate popcnt support, I think we also shouldn't overly worry about
the non-popcnt path.

We agree here.

Also, how can you assert that the code generated without the scratch
registers being usable won't be worse than the penalty of pushing and
popping such registers on the stack and letting the routines use all
registers freely?

Irrespective of the -ffixed-* the compiler can make itself sufficiently
many registers available to use, by preserving just the ones it actually
uses. So that's pretty certainly not more PUSH/POP than when we would
blindly preserve all which may need preserving (no matter whether
they're actually used).

Yet then both this question and ...

I very much prefer to have a non-optimal non-popcnt path, but have
popcnt support for both gcc and clang, and without requiring any
compiler options.

... this makes me wonder whether we're thinking of fundamentally
different generated code that we would end up with. Since the
(apparently agreed upon) goal is to optimize for the "popcnt
available" case, mainline code should be not significantly longer that
what's needed for the single instruction itself, or alternatively a
CALL insn. Adding pushes/pops of all call clobbered registers around
the CALL would grow mainline code too much (for my taste), i.e.
irrespective of these PUSH/POP all getting NOP-ed out by alternatives
patching. (As an aside I consider fiddling with the stack pointer in
inline asm() risky, and hence I would prefer to avoid such whenever
possible. Yes, it can be written so it's independent of what the
compiler thinks the stack pointer points at, but such constructs are
fragile as soon as one wants to change them a little later on.)

Are you perhaps thinking of indeed having the PUSH/POP in mainline
code? Or some entirely different model?

What I could see us doing to accommodate Clang is to have wrappers
around the actual functions which do as you suggest and preserve all
relevant registers, no matter whether they're used.

You could just copy the PV_CALLEE_SAVE_REGS_THUNK() macro from the Linux
kernel, which is doing exactly that.


Juergen

Attachment: OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature
Description: OpenPGP digital signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.