[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Fri, 17 Mar 2023 12:26:45 +0000
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=Th8UrnJfcbt6A5MhWI8OrQHQs93O4Oq4pWvWPHJR+sM=; b=KwTC84+YjmTQsM/5tNNXRl55Ko2Va6xrC3VcfHuvvJjttCNec62zLtKuGRJF/GUR9ZyPHJ2mUvotZ+EDFrecxFAZn1de9W0ARD7YIFtKCYTpLrbAB1EE2KaAWHT0IkaKjHTsX056V481Z50cPEDQAAqnRNMZxajWovcjxpSB+cZNnx4Om1w5rs6vB39yAjh1a66Mtl2O04utnybxq9kujY1nRG075Zj7O1rURUSKy+vma+fP/jafx3aJnjRG9LhPn/7jGx5kXetD6TKWfftuvJhcO6XpXfaQNDWkcaVNhIklSTQUWHHESSNww0IWv6Tesc+XoPSK2mijm/LHS4zByg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=YKWYhKpzgkNFPUdGrC/r7Ztdp48PGIW7L7YxVjF/GorMNl39MKJwfl3/nF+pK1sqA4FFllbXt+Xt//JjH2RaUOQpIkpXAToC5VxddRb38szrktShTlh1qnHepsiQwOjxWt/zv3+0w6+vauQcMihoW28BlbL2tmOTfCPBlwszle04bRux/yCCYc2RUwQhrsUlF+Pp0wsuqhHi2sQ2OfOgPuTNXdiiZDztJ+V3lV6L2UFquMQU0K3XkaIGWvipM48TN4Y8mhuDKbzH/X8cZUYLxDP8KRpsDBiVZb3ePP6mFXk60lEE84xn4BYItm6hM3P1DvsF/Lr8PPcODK86ym9ZrA==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Fri, 17 Mar 2023 12:27:29 +0000
  • Ironport-data: A9a23:Wx+f26laV2mtt379tEpp+f3o5gysJ0RdPkR7XQ2eYbSJt1+Wr1Gzt xJLCm7UMqzcMGameI8gao6/9R4HucfSx4RhSFZu/y08FSMWpZLJC+rCIxarNUt+DCFhoGFPt JxCN4aafKjYaleG+39B55C49SEUOZmgH+a6U6icfHgqH2eIcQ954Tp7gek1n4V0ttawBgKJq LvartbWfVSowFaYCEpNg064gE4p7aSaVA8w5ARkPqgQ5QCGzhH5MbpETU2PByqgKmVrNrbSq 9brlNmR4m7f9hExPdKp+p6TnpoiG+O60aCm0xK6aoD66vRwjnVaPpUTbZLwXXx/mTSR9+2d/ f0W3XCGpaXFCYWX8AgVe0Ew/yiTpsSq8pefSZS0mZT7I0Er7xIAahihZa07FdRwxwp5PY1B3 dUlNR4ybjyEvfqd2oOJS8ZMiZ9gKMa+aevzulk4pd3YJdAPZMiZBo/svJpf1jp2gd1SF/HDY cZfcSBocBnLfxxIPBEQFY46m+CrwHL4dlW0qnrM/fZxvzeVkVU3jOeF3Nn9I7RmQe18mEqCq 32A1GP+GhwAb/SUyCaf82LqjejK9c/+cNtKT+PoraM62DV/wEQUS0EfVHXhrcPptWSsYt9TL 08I/xgX+P1aGEuDC4OVsweDiFyuswMYWtFQO/Yn8wzLwa3Riy6JC25BQjNfZdgOsM4tWSdsx lKPh8nuBzFkrPuSU3313qiQhSO/P24SN2BqWMMfZQ4M4t2mqodqiBvKFoxnCPTt0oKzHizsy TeXqiR4n68UkcMAy6S8+xbAni6ooZ/KCAUy4207Q16Y0++wX6b9D6TA1LQRxa8fRGpFZjFtZ EQ5pvU=
  • Ironport-hdrordr: A9a23:ha3wtainzMbGEH7pc8CX/r3NOnBQXlMji2hC6mlwRA09TyX4ra yTdZEgviMc5wxwZJhNo7G90ey7MArhHLROkO4s1NSZMzUOxlHYSr2KhLGKq1eMJ8S9zJ8k6U 4HSdkENDSaNzZHZKjBkXWFOudl7N6b8L25wcfypk0dMj2CspsQlTuR3Dzrb3FedU19CZ0lD4 rZw8xIqTa6EE5nDPiTNz0+U+/fvM2OsZTpbxIcQzsq9wWK5AnYjYLSIlyj0hACSCMK+Kwl8m TOjmXCl8aemsD+8BPaynTCq69bgd7wjuZEbfb87vQoFg==
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 17/03/2023 11:22 am, Roger Pau Monné wrote:
> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote:
>> This is faster than using the software implementation, and the insn is
>> available on all half-way recent hardware. Therefore convert
>> generic_hweight<N>() to out-of-line functions (without affecting Arm)
>> and use alternatives patching to replace the function calls.
>>
>> Note that the approach doesn#t work for clang, due to it not recognizing
>> -ffixed-*.
> I've been giving this a look, and I wonder if it would be fine to
> simply push and pop the scratch registers in the 'call' path of the
> alternative, as that won't require any specific compiler option.

It's been a long while, and in that time I've learnt a lot more about
performance, but my root objection to the approach taken here still
stands - it is penalising the common case to optimise some pointless
corner cases.

Yes - on the call path, an extra push/pop pair (or few) to get temp
registers is basically free.


Right now, the generic_hweight() helpers are static inline in
xen/bitops.h and this is nonsense.  The absolute best they should be is
extern inline in our new lib/ and I'll bet that the compilers stop
inlining them there and then.

Given new abi's like x86_64-v2 (which guarantees POPCNT as an available
feature), it would be nice to arrange to use __builtin_popcnt() to let
the compiler optimise to its hearts content, but outside of this case,
it is actively damaging to try and optimise for memory operands or to
inline the 8/16 case.

So, for x86 specifically, we want:

if ( CONFIG_POPCNT )
    __builtin_popcnt()
else
    ALT( "popcnt D -> a",
         "call arch_popcnt$N" )

and we can write arch_popcnt* in x86's lib/ and in assembly, because
these are trivial enough and we do need to be careful with registers/etc.

I'm not sure if a "+D" vs "D" will matter at the top level.  Probably
not, so it might be an easy way to get one tempt register.  Other temp
registers can come from push/pop.


While we're at it, we should split hweight out of bitops and write the
common header in such a way that it defaults to the generic
implementations in lib/, and that will subsume the ARM header and also
make this work on RISC-V for free.

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.