[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available



On 20.05.2020 13:43, Roger Pau Monné wrote:
> On Wed, May 20, 2020 at 12:57:27PM +0200, Jan Beulich wrote:
>> On 20.05.2020 12:28, Roger Pau Monné wrote:
>>> On Wed, May 20, 2020 at 12:17:15PM +0200, Jan Beulich wrote:
>>>> On 20.05.2020 11:31, Roger Pau Monné wrote:
>>>>> On Wed, May 20, 2020 at 10:31:38AM +0200, Jan Beulich wrote:
>>>>>> On 14.05.2020 16:05, Roger Pau Monné wrote:
>>>>>>> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote:
>>>>>>>> @@ -251,6 +255,10 @@ boot/mkelf32: boot/mkelf32.c
>>>>>>>>   efi/mkreloc: efi/mkreloc.c
>>>>>>>>        $(HOSTCC) $(HOSTCFLAGS) -g -o $@ $<
>>>>>>>>   
>>>>>>>> +nocov-y += hweight.o
>>>>>>>> +noubsan-y += hweight.o
>>>>>>>> +hweight.o: CFLAGS += $(foreach reg,cx dx si 8 9 10 11,-ffixed-r$(reg))
>>>>>>>
>>>>>>> Why not use clobbers in the asm to list the scratch registers? Is it
>>>>>>> that much expensive?
>>>>>>
>>>>>> The goal is to disturb the call sites as little as possible. There's
>>>>>> no point avoiding the scratch registers when no call is made (i.e.
>>>>>> when the POPCNT insn can be used). Taking away from the compiler 7
>>>>>> out of 15 registers (that it can hold live data in) seems quite a
>>>>>> lot to me.
>>>>>
>>>>> IMO using -ffixed-reg for all those registers is even worse, as that
>>>>> prevents the generated code in hweight from using any of those, thus
>>>>> greatly limiting the amount of registers and likely making the
>>>>> generated code rely heavily on pushing an popping from the stack?
>>>>
>>>> Okay, that's the other side of the idea behind all this: Virtually no
>>>> hardware we run on will lack POPCNT support, hence the quality of
>>>> these fallback routines matters only the very old hardware, where we
>>>> likely don't perform optimally already anyway.
>>>>
>>>>> This also has the side effect to limiting the usage of popcnt to gcc,
>>>>> which IMO is also not ideal.
>>>>
>>>> Agreed. I don't know enough about clang to be able to think of
>>>> possible alternatives. In any event there's no change to current
>>>> behavior for hypervisors built with clang.
>>>>
>>>>> I also wondered, since the in-place asm before patching is a call
>>>>> instruction, wouldn't inline asm at build time already assume that the
>>>>> scratch registers are clobbered?
>>>>
>>>> That would imply the compiler peeks into the string literal of the
>>>> asm(). At least gcc doesn't, and even if it did it couldn't infer an
>>>> ABI from seeing a CALL insn.
>>>
>>> Please bear with me, but then I don't understand what Linux is doing
>>> in arch/x86/include/asm/arch_hweight.h. I see no clobbers there,
>>> neither it seems like the __sw_hweight{32/64} functions are built
>>> without the usage of the scratch registers.
>>
>> __sw_hweight{32,64} are implemented in assembly, avoiding most
>> scratch registers while pushing/popping the ones which do get
>> altered.
> 
> Oh right, I was looking at lib/hweight.c instead of the arch one.
> 
> Would you agree to use the no_caller_saved_registers attribute (which
> is available AFAICT for both gcc and clang) for generic_hweightXX and
> then remove the asm prefix code in favour of the defines for
> hweight{8/16}?

At least for gcc no_caller_saved_registers isn't old enough to be
used unconditionally (nor is its companion -mgeneral-regs-only).
If you tell me it's fine to use unconditionally with clang, then
I can see about making this the preferred variant, with the
present one as a fallback.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.