Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available
- To: Jan Beulich <jbeulich@xxxxxxxx>
- From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
- Date: Tue, 21 Mar 2023 15:57:30 +0100
- Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
- Delivery-date: Tue, 21 Mar 2023 14:57:55 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On Mon, Mar 20, 2023 at 10:48:45AM +0100, Jan Beulich wrote:
> On 17.03.2023 13:26, Andrew Cooper wrote:
> > On 17/03/2023 11:22 am, Roger Pau Monné wrote:
> >> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote:
> >>> This is faster than using the software implementation, and the insn is
> >>> available on all half-way recent hardware. Therefore convert
> >>> generic_hweight<N>() to out-of-line functions (without affecting Arm)
> >>> and use alternatives patching to replace the function calls.
> >>>
> >>> Note that the approach doesn't work for clang, due to it not recognizing
> >>> -ffixed-*.
> >> I've been giving this a look, and I wonder if it would be fine to
> >> simply push and pop the scratch registers in the 'call' path of the
> >> alternative, as that won't require any specific compiler option.
>
> Hmm, ...
>
> > It's been a long while, and in that time I've learnt a lot more about
> > performance, but my root objection to the approach taken here still
> > stands - it is penalising the common case to optimise some pointless
> > corner cases.
> >
> > Yes - on the call path, an extra push/pop pair (or few) to get temp
> > registers is basically free.
>
> ... what is "a few"? We'd need to push/pop all call-clobbered registers
> except %rax, i.e. a total of eight. I consider this too much. Unless,
> as you suggest further down, we wrote the fallback in assembly. Which I
> have to admit I'm surprised you propose when we strive to reduce the
> amount of assembly we have to maintain.

AMD added popcnt in 2007 and Intel in 2008. While we shouldn't
mandate popcnt support, I think we also shouldn't overly worry about
the non-popcnt path.

Also, how can you assert that the code generated without the scratch
registers being usable won't be worse than the penalty of pushing and
popping such registers on the stack and letting the routines use all
registers freely?

I very much prefer to have a non-optimal non-popcnt path, but have
popcnt support for both gcc and clang, and without requiring any
compiler options.
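
For reference, the non-popcnt path under discussion is a plain software population count. A minimal sketch of such a fallback follows; it is a generic bit-twiddling version written for illustration, not the code from the patch, and the name sketch_hweight64 is hypothetical:

```c
#include <stdint.h>

/*
 * Hypothetical illustration only (not the patch's code): a portable
 * software population count, i.e. the kind of routine a "non-popcnt
 * path" would call when the POPCNT instruction is unavailable.
 */
unsigned int sketch_hweight64(uint64_t w)
{
    /* Sum adjacent bits into 2-bit fields. */
    w -= (w >> 1) & 0x5555555555555555ULL;
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit fields into per-byte counts. */
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* Add all byte counts together; the total lands in the top byte. */
    return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
}
```

A routine like this needs a few temporaries, which is why the register question matters: it either gets compiled with the scratch registers reserved, or the call site saves and restores them around the call.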
Thanks, Roger.