
Re: [Xen-devel] [PATCH RFC 3/4] Arm64: further speed-up to hweight{32, 64}()



Hi Jan,

On 05/06/2019 08:42, Jan Beulich wrote:
On 04.06.19 at 18:11, <julien.grall@xxxxxxx> wrote:
On 5/31/19 10:53 AM, Jan Beulich wrote:
According to Linux commit e75bef2a4f ("arm64: Select
ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
variant using only bitwise operations on at least some hardware, and no
worse on others.

Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
---
RFC: To be honest I'm not fully convinced this is a win in particular in
       the hweight32() case, as there's no actual shift insn which gets
       replaced by the multiplication. Even for hweight64() the compiler
       could emit better code and avoid the explicit shift by 32 (which it
       emits at least for me).

I can see a multiplication instruction used in both hweight32() and
hweight64() with the compiler I am using.

That is for which exact implementation?

A simple call to hweight64().

What I was referring to as
"could emit better code" was the multiplication-free variant, where
the compiler fails to recognize (afaict) another opportunity to fold
a shift into an arithmetic instruction:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        add     x0, x0, x0, lsr #8
        add     x0, x0, x0, lsr #16
        lsr     x1, x0, #32     // *
        add     w0, w1, w0      // *
        and     w0, w0, #0xff
        ret

Afaict the two marked (*) insns could be replaced by

        add     x0, x0, x0, lsr #32

I am not a compiler expert. Anyway, this likely depends on the version of the compiler you are using; they are becoming smarter and smarter.


With only a sequence of adds remaining, I'm having
difficulty seeing how the use of mul+lsr would actually help:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        mov     x1, #0x101010101010101
        mul     x0, x0, x1
        lsr     x0, x0, #56
        ret

But of course I know nothing about the throughput and latency
of such adds with one of their operands shifted first. And
yes, the variant using mul is, compared with the better-optimized case,
still one insn smaller.
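For reference, the two reductions being compared correspond roughly to the following C. This is a sketch of the standard population-count construction, not the exact Xen implementation, and the function names are chosen here purely for illustration:

```c
#include <stdint.h>

/* Multiplication-free variant: once per-byte bit counts are in place,
 * the eight byte counts are folded together with shifts and adds -
 * this is the sequence of add/lsr insns shown above. */
static unsigned int hweight64_shifts(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;              /* 2-bit counts */
    x = (x & 0x3333333333333333ULL) +
        ((x >> 2) & 0x3333333333333333ULL);             /* 4-bit counts */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;         /* byte counts  */
    x += x >> 8;
    x += x >> 16;
    x += x >> 32;                                       /* fold bytes   */
    return x & 0xff;
}

/* ARCH_HAS_FAST_MULTIPLIER variant: a single multiply by 0x0101...01
 * sums all eight byte counts into the top byte (no overflow, since each
 * byte count is at most 8), and one shift extracts it - the mul+lsr
 * pair in the second listing. */
static unsigned int hweight64_mul(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;
    x = (x & 0x3333333333333333ULL) +
        ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (x * 0x0101010101010101ULL) >> 56;
}
```

Both return identical results for every input; the whole debate is only about which final reduction the target's multiplier makes cheaper.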
The commit message in Linux (and Robin's answer) is pretty clear: it may be an improvement on some cores and does not make things worse on others.


I would expect the compiler could easily replace a multiply by a series
of shifts, but doing the inverse would be more difficult.

Also, this has been in Linux for a year now, so I am assuming the Linux
folks are happy with the change (CCing Robin just in case I missed
anything). Therefore I am happy to give it a go in Xen as well.

In which case - can I take this as an ack, or do you want to first
pursue the discussion?

I will commit it later on with another bunch of patches.

Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
