Re: [Xen-devel] [PATCH RFC 3/4] Arm64: further speed-up to hweight{32, 64}()



>>> On 04.06.19 at 18:11, <julien.grall@xxxxxxx> wrote:
> On 5/31/19 10:53 AM, Jan Beulich wrote:
>> According to Linux commit e75bef2a4f ("arm64: Select
>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>> variant using only bitwise operations on at least some hardware, and no
>> worse on other.
>> 
>> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>> ---
>> RFC: To be honest I'm not fully convinced this is a win in particular in
>>       the hweight32() case, as there's no actual shift insn which gets
>>       replaced by the multiplication. Even for hweight64() the compiler
>>       could emit better code and avoid the explicit shift by 32 (which it
>>       emits at least for me).
> 
> I can see multiplication instruction used in both hweight32() and 
> hweight64() with the compiler I am using.

That is for which exact implementation? What I was referring to as
"could emit better code" was the multiplication-free variant, where
the compiler fails to recognize (afaict) another opportunity to fold
a shift into an arithmetic instruction:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        add     x0, x0, x0, lsr #8
        add     x0, x0, x0, lsr #16
>>>     lsr     x1, x0, #32
>>>     add     w0, w1, w0
        and     w0, w0, #0xff
        ret

Afaict the two marked insns could be replaced by

        add     x0, x0, x0, lsr #32

With only a sequence of add-s then remaining, I'm having
difficulty seeing how the use of mul+lsr would actually help:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        mov     x1, #0x101010101010101
        mul     x0, x0, x1
        lsr     x0, x0, #56
        ret

But of course I know nothing about the throughput and latency
of such add-s with one of their operands shifted first. And
yes, the variant using mul is, compared with the better
optimized case, still one insn smaller.
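
For reference, a minimal C sketch of the two variants being compared.
The function names are hypothetical (not the ones used in the patch),
and the reduction steps follow the commonly used generic shape of
hweight64(), which is an assumption about what the compiler was fed
rather than a copy of the Xen source:

```c
#include <stdint.h>

/*
 * Hypothetical name: multiplication-free variant. The final byte sums
 * are folded with shift+add steps, which the compiler can turn into
 * "add x0, x0, x0, lsr #N" on Arm64.
 */
static unsigned int hweight64_shift(uint64_t w)
{
    /* Per 2-bit field: count of set bits in that field. */
    w -= (w >> 1) & 0x5555555555555555ULL;
    /* Per 4-bit field: sum of the two adjacent 2-bit counts. */
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    /* Per byte: sum of the two adjacent 4-bit counts (at most 8). */
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* Fold all eight byte counts into the low byte. */
    w += w >> 8;
    w += w >> 16;
    w += w >> 32;
    return w & 0xff;
}

/*
 * Hypothetical name: variant using a multiplication. Multiplying by
 * 0x0101...01 accumulates the sum of all byte counts in the top byte,
 * which is then extracted by a single shift.
 */
static unsigned int hweight64_mul(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (w * 0x0101010101010101ULL) >> 56;
}
```

Both functions share the first three steps; they differ only in how
the eight per-byte counts get summed, which is exactly the part the
two assembly listings above differ in.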

> I would expect the compiler could easily replace a multiply by a series 
> of shifts but it would be more difficult to do the inverse.
> 
> Also, this has been in Linux for a year now, so I am assuming Linux 
> folks are happy with changes (CCing Robin just in case I missed 
> anything). Therefore I am happy to give it a go on Xen as well.

In which case - can I take this as an ack, or do you want to first
pursue the discussion?

Jan




 

