Xen project Mailing List

Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Tue, 21 Mar 2023 18:02:08 +0100

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=BA+WzcR8xh0aD68iDNMC6IbMvR0wFWFv5Phxvyr4UnI=; b=lxVr/lxiXpzeLVFn6GKolp1RltecCRA8L33qmvQhABwlgRihp60e2ng7NzrarhrjVoG184SkR/pHcH1ojgR5DCyZiQzlG0bgyKLFTpHD7+Bt76VvPNImjdzn8EM0/yPrS/RNFgQnwSEEUW8VWNw+x7mwk4BAyDSHzdpg+NtDVFeuHRA5pD+jpwxxNPvOzc670RthbC0PJxpYLVzj101GDbmToqYYjC/k8/TFVu81LXI+rAzI1ETzgIO4HxoHqXEXpEc49N2xhVNc1wNlLDubSOLnKMH7y4ECJ8cxITx4jdPTc/80UhOpKqW9esK/C/+dRmVDiZmVena8uZ+/zWDfZA==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=aRfP5smR6/nr2/p8SPiQ0cNm+hIQUWmumzEbxRyP5sivf4OjK8aTUy31z8DVr+u+fxMCZLVk+a6oelNQE3OAFXC60d8wUDV1dbP+7WOgA4rYnGDMR18kBbVg/UfIXR4F+GU2RB9GkhsFPvwcpfB7T23kEoiO7MFh75zWYrJ60uuuTNMqGaW4g/al+lYm4vacLij7n8LdBVTj5lIYto8VLYuqQcGyDBHud5ASOg4FFk2ApA995cv1ugDLSttlbOmAbblmeseV4ILtrFlF72F+vYjrVzG0DO5tLkBVGgUJMfcA3fhK5p/46roMVKgN6mrsL24VeYBE3yPHoVBxXqBvBQ==

Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Wei Liu <wl@xxxxxxx>

Delivery-date: Tue, 21 Mar 2023 17:02:37 +0000

Ironport-data: A9a23:Hrf+Z6nOX5cG8UWQq4kuBCPo5gynJ0RdPkR7XQ2eYbSJt1+Wr1Gzt xJNDTvTbKzeZWfyKdknPI2w8E4P65KGy4MwTAM+pHhkRCMWpZLJC+rCIxarNUt+DCFhoGFPt JxCN4aafKjYaleG+39B55C49SEUOZmgH+a6U6icfHgqH2eIcQ954Tp7gek1n4V0ttawBgKJq LvartbWfVSowFaYCEpNg064gE4p7aSaVA8w5ARkPqgQ5gaGzRH5MbpETU2PByqgKmVrNrbSq 9brlNmR4m7f9hExPdKp+p6TnpoiG+O60aCm0xK6aoD66vRwjnVaPpUTbZLwXXx/mTSR9+2d/ f0W3XCGpaXFCYWX8AgVe0Ew/yiTpsSq8pefSZS0mZT7I0Er7xIAahihZa07FdRwxwp5PY1B3 aU7cS1WQkG4vrOzzeuRVel32tY8HMa+aevzulk4pd3YJdAPZMmZBo/stZpf1jp2gd1SF/HDY cZfcSBocBnLfxxIPBEQFY46m+CrwHL4dlW0qnrM/fZxvzeVkVE3iee2WDbWUoXiqcF9hEGXq 3iA523kKhobKMae2XyO9XfEaurnxHuhAd9MSuTgnhJsqAyM6WsaMTkvbAG2mMuyiGycavxDB nVBr0LCqoB3riRHVOLVXRe1vXqFtR40QMdLHqsx7wTl4rrZ5UOVC3YJShZFacc6r4kmSDoyz FiLktj1Qzt1v9W9Vna15rqS6zSoNkAowXQqYCYFSU4P5YnlqYRq1BbXFI88S+iyk8H/Hiz2z 3aSti8iir4PjMkNkaKm4VTAhDHqrZ/MJuIo2jjqsquexlsRTOaYi0aAtTA3Md4owF6lc2S8

Ironport-hdrordr: A9a23:ilTnBaDABQLktUPlHejLsseALOsnbusQ8zAXPh9KJCC9I/bzqy nxpp8mPH/P5wr5lktQ/OxoHJPwOU80kqQFmrX5XI3SJTUO3VHFEGgM1+vfKlHbak7DH6tmpN 1dmstFeaLN5DpB/KHHCWCDer5PoeVvsprY49s2p00dMT2CAJsQizuRZDzrcHGfE2J9dOcE/d enl4N6T33KQwVlUu2LQl0+G8TTrdzCk5zrJTYAGh4c8QGLyR+49bLgFBCc/xEGFxdC260r/2 TpmxHwovzLiYD39jbsk0voq7hGktrozdVOQOSKl8guMz3pziq4eYh7XLWGnTYt5MWi8kwjnt Xgqwope+5z93TSVGeopgaF4Xir7B8er1vZjXOIi3rqpsL0ABo8Fsp6nIpcNj/U8VApst1Q2L 9CmzvxjesdMTrw2ADGo/TYXRBjkUS55VIkjO4olnRaFa8TcqVYo4Az9F5cVL0AACX5woY6F/ QGNrCV2N9mNXehK1zJtGhmx9KhGlw1Axe9W0AH/veY1jBH9UoJu3cw9Yg6pDMt5Zg9Q55L66 DvKaJzjoxDSccQcOZUGPoBadHfMB2NfTv8dEapZXj3HqAOPHzA77Tt5q8u2e2scJsUiLMvhZ X6Vk9Cv2JaQTOhNSS35uwJzvnxehT+Ydy0ofsuoqSR+4eMC4YDCBfzCGzHyKCb0rEi6s6yYY fHBHsZOY6lEYLUI/c44+TPYegtFZAgarxlhj8aYSP4niuZEPydisXrNNDuGZHKLREIHkvCP1 prZkmAGCwH1DHmZkPF

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Mar 21, 2023 at 05:41:52PM +0100, Jan Beulich wrote: > On 21.03.2023 17:31, Roger Pau Monné wrote: > > On Tue, Mar 21, 2023 at 04:35:54PM +0100, Jan Beulich wrote: > >> On 21.03.2023 15:57, Roger Pau Monné wrote: > >>> On Mon, Mar 20, 2023 at 10:48:45AM +0100, Jan Beulich wrote: > >>>> On 17.03.2023 13:26, Andrew Cooper wrote: > >>>>> On 17/03/2023 11:22 am, Roger Pau Monné wrote: > >>>>>> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote: > >>>>>>> This is faster than using the software implementation, and the insn is > >>>>>>> available on all half-way recent hardware. Therefore convert > >>>>>>> generic_hweight<N>() to out-of-line functions (without affecting Arm) > >>>>>>> and use alternatives patching to replace the function calls. > >>>>>>> > >>>>>>> Note that the approach doesn#t work for clang, due to it not > >>>>>>> recognizing > >>>>>>> -ffixed-*. > >>>>>> I've been giving this a look, and I wonder if it would be fine to > >>>>>> simply push and pop the scratch registers in the 'call' path of the > >>>>>> alternative, as that won't require any specific compiler option. > >>>> > >>>> Hmm, ... > >>>> > >>>>> It's been a long while, and in that time I've learnt a lot more about > >>>>> performance, but my root objection to the approach taken here still > >>>>> stands - it is penalising the common case to optimise some pointless > >>>>> corner cases. > >>>>> > >>>>> Yes - on the call path, an extra push/pop pair (or few) to get temp > >>>>> registers is basically free. > >>>> > >>>> ... what is "a few"? We'd need to push/pop all call-clobbered registers > >>>> except %rax, i.e. a total of eight. I consider this too much. Unless, > >>>> as you suggest further down, we wrote the fallback in assembly. Which I > >>>> have to admit I'm surprised you propose when we strive to reduce the > >>>> amount of assembly we have to maintain. > >>> > >>> AMD added popcnt in 2007 and Intel in 2008. While we shouldn't > >>> mandate popcnt support, I think we also shouldn't overly worry about > >>> the non-popcnt path. > >> > >> We agree here. > >> > >>> Also, how can you assert that the code generated without the scratch > >>> registers being usable won't be worse than the penalty of pushing and > >>> popping such registers on the stack and letting the routines use all > >>> registers freely? > >> > >> Irrespective of the -ffixed-* the compiler can make itself sufficiently > >> many registers available to use, by preserving just the ones it actually > >> uses. So that's pretty certainly not more PUSH/POP than when we would > >> blindly preserve all which may need preserving (no matter whether > >> they're actually used). > >> > >> Yet then both this question and ... > >> > >>> I very much prefer to have a non-optimal non-popcnt path, but have > >>> popcnt support for both gcc and clang, and without requiring any > >>> compiler options. > >> > >> ... this makes me wonder whether we're thinking of fundamentally > >> different generated code that we would end up with. Since the > >> (apparently agreed upon) goal is to optimize for the "popcnt > >> available" case, mainline code should be not significantly longer that > >> what's needed for the single instruction itself, or alternatively a > >> CALL insn. Adding pushes/pops of all call clobbered registers around > >> the CALL would grow mainline code too much (for my taste), i.e. > >> irrespective of these PUSH/POP all getting NOP-ed out by alternatives > >> patching. (As an aside I consider fiddling with the stack pointer in > >> inline asm() risky, and hence I would prefer to avoid such whenever > >> possible. > > > > Is this because we are likely to not end up with a proper trace if we > > mess up with the stack pointer before a function call? > > That's a related issue, but not what I was thinking about. > > > I would like > > to better understand your concerns with the stack pointer and inline > > asm(). > > You can't use local variables anymore with "m" or "rm" constraints, as > they might be accessed via %rsp. > > > Other options would be using no_caller_saved_registers, but that's not > > available in all supported gcc versions, likely the same with clang, > > but I wouldn't mind upping the minimal clang version. > > Right, you did suggest using this attribute before. But we continue to > support older gcc. FWIW, I would be fine with not enabling the optimization if the attribute is not available, but then again this means adding more logic. At the end this is just an optimization, so I think it's fine to request a version of gcc greater than the supported baseline. > > What about allocating the size of the registers that need saving as an > > on-stack variable visible to the compiler and then moving to/from > > there in the inline asm()? > > That would address the fiddling-with-%rsp concern, but would further > grow mainline code size. I'm fine with such growth, unless you can prove the growth in code size shadows the performance win from the usage of popcnt. > >> Yes, it can be written so it's independent of what the > >> compiler thinks the stack pointer points at, but such constructs are > >> fragile as soon as one wants to change them a little later on.) > >> > >> Are you perhaps thinking of indeed having the PUSH/POP in mainline > >> code? Or some entirely different model? > >> > >> What I could see us doing to accommodate Clang is to have wrappers > >> around the actual functions which do as you suggest and preserve all > >> relevant registers, no matter whether they're used. That'll still be > >> assembly code to maintain, but imo less of a concern than in Andrew's > >> model of writing hweight<N>() themselves in assembly (provided I > >> understood him correctly), for being purely mechanical operations. > > > > If possible I would prefer to use the same model for both gcc and > > clang, or else things tend to get more complicated than strictly > > needed. > > We can always decide to accept the extra overhead even with gcc. > > > I also wonder whether we could also get away without disabling of > > coverage and ubsan for the hweight object file. But I assume as long > > ass we do the function call in inline asm() (and thus kind of hidden > > from the compiler) we need to disable any instrumentation because the > > changed function context. I don't think there's anyway we can > > manually add the function call prefix/suffix in the inline asm()? > > I don't know whether that would be possible (and portable across > versions). But what I'm more concerned about is that such functions > may also end up clobbering certain call-clobbered registers. (Note > that when writing these in assembly, as suggested by Andrew, no such > hooks would be used either.) Right, we would just pick the Linux assembly for those helpers. I have to admit it feels weird, because I generally try to avoid growing our usage of assembly, and hence this would seem like a step backwards towards that goal. Thanks, Roger.

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.