[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available
On 21.03.2023 17:31, Roger Pau Monné wrote: > On Tue, Mar 21, 2023 at 04:35:54PM +0100, Jan Beulich wrote: >> On 21.03.2023 15:57, Roger Pau Monné wrote: >>> On Mon, Mar 20, 2023 at 10:48:45AM +0100, Jan Beulich wrote: >>>> On 17.03.2023 13:26, Andrew Cooper wrote: >>>>> On 17/03/2023 11:22 am, Roger Pau Monné wrote: >>>>>> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote: >>>>>>> This is faster than using the software implementation, and the insn is >>>>>>> available on all half-way recent hardware. Therefore convert >>>>>>> generic_hweight<N>() to out-of-line functions (without affecting Arm) >>>>>>> and use alternatives patching to replace the function calls. >>>>>>> >>>>>>> Note that the approach doesn#t work for clang, due to it not recognizing >>>>>>> -ffixed-*. >>>>>> I've been giving this a look, and I wonder if it would be fine to >>>>>> simply push and pop the scratch registers in the 'call' path of the >>>>>> alternative, as that won't require any specific compiler option. >>>> >>>> Hmm, ... >>>> >>>>> It's been a long while, and in that time I've learnt a lot more about >>>>> performance, but my root objection to the approach taken here still >>>>> stands - it is penalising the common case to optimise some pointless >>>>> corner cases. >>>>> >>>>> Yes - on the call path, an extra push/pop pair (or few) to get temp >>>>> registers is basically free. >>>> >>>> ... what is "a few"? We'd need to push/pop all call-clobbered registers >>>> except %rax, i.e. a total of eight. I consider this too much. Unless, >>>> as you suggest further down, we wrote the fallback in assembly. Which I >>>> have to admit I'm surprised you propose when we strive to reduce the >>>> amount of assembly we have to maintain. >>> >>> AMD added popcnt in 2007 and Intel in 2008. While we shouldn't >>> mandate popcnt support, I think we also shouldn't overly worry about >>> the non-popcnt path. >> >> We agree here. >> >>> Also, how can you assert that the code generated without the scratch >>> registers being usable won't be worse than the penalty of pushing and >>> popping such registers on the stack and letting the routines use all >>> registers freely? >> >> Irrespective of the -ffixed-* the compiler can make itself sufficiently >> many registers available to use, by preserving just the ones it actually >> uses. So that's pretty certainly not more PUSH/POP than when we would >> blindly preserve all which may need preserving (no matter whether >> they're actually used). >> >> Yet then both this question and ... >> >>> I very much prefer to have a non-optimal non-popcnt path, but have >>> popcnt support for both gcc and clang, and without requiring any >>> compiler options. >> >> ... this makes me wonder whether we're thinking of fundamentally >> different generated code that we would end up with. Since the >> (apparently agreed upon) goal is to optimize for the "popcnt >> available" case, mainline code should be not significantly longer that >> what's needed for the single instruction itself, or alternatively a >> CALL insn. Adding pushes/pops of all call clobbered registers around >> the CALL would grow mainline code too much (for my taste), i.e. >> irrespective of these PUSH/POP all getting NOP-ed out by alternatives >> patching. (As an aside I consider fiddling with the stack pointer in >> inline asm() risky, and hence I would prefer to avoid such whenever >> possible. > > Is this because we are likely to not end up with a proper trace if we > mess up with the stack pointer before a function call? That's a related issue, but not what I was thinking about. > I would like > to better understand your concerns with the stack pointer and inline > asm(). You can't use local variables anymore with "m" or "rm" constraints, as they might be accessed via %rsp. > Other options would be using no_caller_saved_registers, but that's not > available in all supported gcc versions, likely the same with clang, > but I wouldn't mind upping the minimal clang version. Right, you did suggest using this attribute before. But we continue to support older gcc. > What about allocating the size of the registers that need saving as an > on-stack variable visible to the compiler and then moving to/from > there in the inline asm()? That would address the fiddling-with-%rsp concern, but would further grow mainline code size. >> Yes, it can be written so it's independent of what the >> compiler thinks the stack pointer points at, but such constructs are >> fragile as soon as one wants to change them a little later on.) >> >> Are you perhaps thinking of indeed having the PUSH/POP in mainline >> code? Or some entirely different model? >> >> What I could see us doing to accommodate Clang is to have wrappers >> around the actual functions which do as you suggest and preserve all >> relevant registers, no matter whether they're used. That'll still be >> assembly code to maintain, but imo less of a concern than in Andrew's >> model of writing hweight<N>() themselves in assembly (provided I >> understood him correctly), for being purely mechanical operations. > > If possible I would prefer to use the same model for both gcc and > clang, or else things tend to get more complicated than strictly > needed. We can always decide to accept the extra overhead even with gcc. > I also wonder whether we could also get away without disabling of > coverage and ubsan for the hweight object file. But I assume as long > ass we do the function call in inline asm() (and thus kind of hidden > from the compiler) we need to disable any instrumentation because the > changed function context. I don't think there's anyway we can > manually add the function call prefix/suffix in the inline asm()? I don't know whether that would be possible (and portable across versions). But what I'm more concerned about is that such functions may also end up clobbering certain call-clobbered registers. (Note that when writing these in assembly, as suggested by Andrew, no such hooks would be used either.) Jan
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |