[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available
On 17.03.2023 13:26, Andrew Cooper wrote: > On 17/03/2023 11:22 am, Roger Pau Monné wrote: >> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote: >>> This is faster than using the software implementation, and the insn is >>> available on all half-way recent hardware. Therefore convert >>> generic_hweight<N>() to out-of-line functions (without affecting Arm) >>> and use alternatives patching to replace the function calls. >>> >>> Note that the approach doesn#t work for clang, due to it not recognizing >>> -ffixed-*. >> I've been giving this a look, and I wonder if it would be fine to >> simply push and pop the scratch registers in the 'call' path of the >> alternative, as that won't require any specific compiler option. Hmm, ... > It's been a long while, and in that time I've learnt a lot more about > performance, but my root objection to the approach taken here still > stands - it is penalising the common case to optimise some pointless > corner cases. > > Yes - on the call path, an extra push/pop pair (or few) to get temp > registers is basically free. ... what is "a few"? We'd need to push/pop all call-clobbered registers except %rax, i.e. a total of eight. I consider this too much. Unless, as you suggest further down, we wrote the fallback in assembly. Which I have to admit I'm surprised you propose when we strive to reduce the amount of assembly we have to maintain. > Right now, the generic_hweight() helpers are static inline in > xen/bitops.h and this is nonsense. The absolute best they should be is > extern inline in our new lib/ and I'll bet that the compilers stop > inlining them there and then. That would be an orthogonal move, wouldn't it? I'm also not really willing to go as far as calling the present way of it working "nonsense". I could be talked into doing such a transformation in a separate patch, but only if it is halfway certain that this won't again be effort invested just to then face further opposition (other maintainers may not agree with the movement, as we've seen for other remotely similar changes to "extern inline"). > Given new abi's like x86_64-v2 (which guarantees POPCNT as an available > feature), it would be nice to arrange to use __builtin_popcnt() to let > the compiler optimise to its hearts content, but outside of this case, > it is actively damaging to try and optimise for memory operands or to > inline the 8/16 case. > > So, for x86 specifically, we want: > > if ( CONFIG_POPCNT ) > __builtin_popcnt() > else > ALT( "popcnt D -> a", > "call arch_popcnt$N" ) > > and we can write arch_popcnt* in x86's lib/ and in assembly, because > these are trivial enough and we do need to be careful with registers/etc. How does x86_64-v2 matter here? And how does using __builtin_popcnt() help, which would use the "popcnt" insn only if we passed -march=popcnt (or any wider option implying this one) to the compiler? As to the 8-/16-bit case - I've already accepted to drop that special casing. The main reason I had it separate was because the generic code also has them special cased. In any event I would like to make all agreed upon changes in one go, hence why I didn't submit a new version yet. > I'm not sure if a "+D" vs "D" will matter at the top level. Probably > not, so it might be an easy way to get one tempt register. Other temp > registers can come from push/pop. > > > While we're at it, we should split hweight out of bitops and write the > common header in such a way that it defaults to the generic > implementations in lib/, and that will subsume the ARM header and also > make this work on RISC-V for free. Yet another independent change you're asking for. I've taken note of both of these separate requests, but without any guarantee (yet) that I'm going to actually carry them out. Jan
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |