
Re: x86: memset() / clear_page() / page scrubbing


  • To: Jan Beulich <jbeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 13 Apr 2021 14:17:19 +0100
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Tue, 13 Apr 2021 13:17:40 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 08/04/2021 14:58, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present clear_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticeable on
> all hardware, and at least on Intel hardware more noticeable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.
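
(Purely illustrative, not from Jan's measurements or series: a minimal
sketch of the two clear_page() candidates being compared, written with
GNU inline asm; the function names are invented for the example.)

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    /* Non-temporal variant: MOVNTI stores bypass the cache hierarchy. */
    static void clear_page_movnti(void *page)
    {
        uint64_t *p = page;
        size_t i;

        for ( i = 0; i < PAGE_SIZE / sizeof(*p); i++ )
            asm volatile ( "movnti %1, %0" : "=m" (p[i]) : "r" (0UL) );

        /* Non-temporal stores are weakly ordered; fence before reuse. */
        asm volatile ( "sfence" ::: "memory" );
    }

    /* REP STOSQ variant: writes go through the cache; typically lower
     * latency, at the cost of evicting other data. */
    static void clear_page_stosq(void *page)
    {
        unsigned long cnt = PAGE_SIZE / 8;
        void *dst = page;

        asm volatile ( "rep stosq"
                       : "+c" (cnt), "+D" (dst)
                       : "a" (0UL)
                       : "memory" );
    }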
>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't a clear win on Skylake
> (perhaps newer hardware does better). CLZERO has a slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS. Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
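
(For reference, a sketch of a CLZERO-based clear under the same
assumptions as above; CLZERO is AMD-only and zeroes the 64-byte cache
line addressed by %rax.)

    /* CLZERO zeroes the cache line addressed by %rax (AMD-only). */
    static void clear_page_clzero(void *page)
    {
        char *p = page;
        unsigned int i;

        for ( i = 0; i < PAGE_SIZE / 64; i++, p += 64 )
            asm volatile ( "clzero" :: "a" (p) : "memory" );

        /* Like MOVNTI, CLZERO is weakly ordered; fence before reuse. */
        asm volatile ( "sfence" ::: "memory" );
    }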
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.
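
(Tying the above together, roughly what the proposed split might look
like; the helpers reuse the earlier sketches and the boot-time selection
is simplified, where Xen would more likely patch this via alternatives.)

    #include <stdbool.h>

    /* Latency matters here: the caller is about to touch the page. */
    void clear_page(void *page)
    {
        clear_page_stosq(page);
    }

    /* Background scrubbing from idle: avoid disturbing the caches. */
    static void (*scrub_page_fn)(void *) = clear_page_movnti;

    void scrub_page(void *page)
    {
        scrub_page_fn(page);
    }

    /* Boot-time selection, e.g. when CLZERO is advertised. */
    void init_scrub(bool has_clzero)
    {
        if ( has_clzero )
            scrub_page_fn = clear_page_clzero;
    }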
>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth considering bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
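
(A sketch of the suggested shape: bulk of the fill with REP STOSQ,
remainder with REP STOSB. Not Xen's current memset(); the name is made
up for the example.)

    #include <stddef.h>

    void *memset_stos(void *s, int c, size_t n)
    {
        /* Replicate the byte into all 8 lanes for STOSQ. */
        unsigned long pattern = 0x0101010101010101UL * (unsigned char)c;
        unsigned long qwords = n / 8, bytes = n % 8;
        void *dst = s;

        /* Main chunk, 8 bytes per iteration. */
        asm volatile ( "rep stosq"
                       : "+c" (qwords), "+D" (dst)
                       : "a" (pattern)
                       : "memory" );

        /* Tail, at most 7 bytes; only %al is consumed by STOSB. */
        asm volatile ( "rep stosb"
                       : "+c" (bytes), "+D" (dst)
                       : "a" (pattern)
                       : "memory" );

        return s;
    }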
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
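
(For comparison, the ERMS form really is just this; again a sketch with
an invented name.)

    #include <stddef.h>

    /* On ERMS hardware, microcode handles sizing/alignment internally. */
    void *memcpy_erms(void *dst, const void *src, size_t n)
    {
        void *d = dst;
        const void *s = src;

        asm volatile ( "rep movsb"
                       : "+D" (d), "+S" (s), "+c" (n)
                       :
                       : "memory" );
        return dst;
    }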
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
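
(As an illustration of what becomes plausible with "Fast Short REP
CMPSB/SCASB": strlen() reduced to a single REPNE SCASB. Sketch only;
whether it actually beats the existing C implementation would need
measuring.)

    #include <stddef.h>

    size_t strlen_scasb(const char *s)
    {
        unsigned long cnt = ~0UL;   /* effectively unbounded */
        const char *p = s;

        /* Scan for the NUL byte held in %al; %rdi stops one past it. */
        asm volatile ( "repne scasb"
                       : "+D" (p), "+c" (cnt)
                       : "a" (0)
                       : "cc", "memory" );

        return p - s - 1;
    }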
>
> Thoughts anyone, before I start creating actual patches?

Do you have actual numbers from these experiments?  I've seen your patch
from the thread, but at a minimum it's missing some hunks adding new
CPUID bits.  I do worry however whether the testing is likely to be
realistic for non-idle scenarios.

It is very little surprise that AVX-512 on Skylake is poor.  The
frequency hit from using %zmm is staggering.  IceLake is expected to be
better, but almost certainly won't exceed REP MOVSB, which is optimised
in microcode for the data width of the CPU.

For memset(), please don't move in the direction of memcpy().  memcpy()
is problematic because the common case is likely to be a multiple of 8
bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit worth
avoiding.  The "Fast Zero length $FOO" bits on future parts indicate
when passing %ecx=0 is likely to be faster than branching around the
invocation.
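
(Concretely, the tail handling in question looks something like the
sketch below; on parts without the fast-zero-length behaviour, the early
exit avoids paying REP's startup cost for a count of 0. The helper name
is hypothetical.)

    #include <stddef.h>

    static void copy_tail(void *d, const void *s, size_t bytes)
    {
        if ( !bytes )          /* common when the length is a multiple of 8 */
            return;            /* skip REP MOVSB's fixed startup cost */

        asm volatile ( "rep movsb"
                       : "+D" (d), "+S" (s), "+c" (bytes)
                       :
                       : "memory" );
    }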

With ERMS/etc, our logic should be a bare REP MOVSB/STOSB, without any
cleverness about larger word sizes.  The Linux forms do this fairly well
already, and probably better than Xen, although there might be some room
for improvement IMO.

It is worth noting that we have extra variations of memset/memcpy where
__builtin_memcpy() gets expanded inline, and the result is a
compiler-chosen sequence which doesn't hit any of our optimised
sequences.  I'm not sure what to do about this, because there is surely
a larger win from the cases which can be turned into a single mov, or an
elided store/copy, than using a potentially inefficient sequence in the
rare cases.  Maybe there is room for a fine-tuning option to say "just
call memset() if you're going to expand it inline".
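
(Roughly what is meant, as a sketch: for a small constant size the
compiler typically folds the builtin into one or two plain stores, which
is the bigger win, while a variable length becomes an inline sequence or
library call of the compiler's choosing rather than our optimised
routine.)

    #include <stdint.h>
    #include <string.h>

    struct foo { uint64_t a, b; };

    /* Fixed, small size: usually folded into plain stores; no call made. */
    void zero_foo(struct foo *f)
    {
        __builtin_memset(f, 0, sizeof(*f));
    }

    /* Variable size: the compiler picks between an inline sequence and a
     * call to memset(), according to its own heuristics. */
    void zero_buf(void *p, size_t n)
    {
        __builtin_memset(p, 0, n);
    }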


For all set/copy operations, whether you want non-temporal or not
depends on when/where the lines are next going to be consumed.  Page
scrubbing in idle context is the only example I can think of where we
aren't plausibly going to consume the destination imminently.  Even
clear/copy page in a hypercall doesn't want to be non-temporal, because
chances are good that the vcpu is going to touch the page on return.

~Andrew
