
Re: x86: memset() / clear_page() / page scrubbing


  • To: jbeulich@xxxxxxxx
  • From: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
  • Date: Thu, 8 Apr 2021 23:08:45 -0700
  • Cc: andrew.cooper3@xxxxxxxxxx, roger.pau@xxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Fri, 09 Apr 2021 06:13:04 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Jan,

I'm working on somewhat related clear_page() optimizations in Linux
(going in the opposite direction, from REP STOSB to MOVNT) and have
some comments/questions below.

(Discussion on v1 here:
https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@xxxxxxxxxx/)
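
For reference, the MOVNT variant I have in mind looks roughly like the
below -- just a simplified sketch, not the actual patch; the name
clear_page_movnt() and the 8-store unroll are purely illustrative:

    /*
     * Zero one 4K page with non-temporal stores.  The stores bypass the
     * cache hierarchy, so the page's lines are not pulled into L1/L2.
     */
    static inline void clear_page_movnt(void *page)
    {
            unsigned long *p = page;
            unsigned int i;

            for (i = 0; i < 4096 / sizeof(*p); i += 8)
                    asm volatile("movnti %1, (%0)\n\t"
                                 "movnti %1, 8(%0)\n\t"
                                 "movnti %1, 16(%0)\n\t"
                                 "movnti %1, 24(%0)\n\t"
                                 "movnti %1, 32(%0)\n\t"
                                 "movnti %1, 40(%0)\n\t"
                                 "movnti %1, 48(%0)\n\t"
                                 "movnti %1, 56(%0)\n\t"
                                 :
                                 : "r" (p + i), "r" (0UL)
                                 : "memory");

            /* NT stores are weakly ordered; fence before the page is used. */
            asm volatile("sfence" ::: "memory");
    }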

On 4/8/2021 6:58 AM, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticeable on
> all hardware, and at least on Intel hardware more noticeable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.

Could you give me any pointers on the cache effects here? This
obviously makes sense, but I couldn't come up with any benchmarks
that would show it in a straightforward fashion.
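
Concretely, the kind of probe I had in mind is something like the
sketch below (entirely hypothetical -- the names, the 32K working-set
size, and the 64-page batch are made up): prime an ~L1d-sized working
set, clear a batch of pages with the candidate primitive, then time a
re-walk of the working set and compare against a run without the
clears.

    #include <stddef.h>
    #include <stdint.h>
    #include <x86intrin.h>                  /* __rdtsc() */

    #define WSS     (32 * 1024)             /* roughly L1d-sized */
    #define NPAGES  64

    static uint8_t wset[WSS];
    static uint8_t pages[NPAGES * 4096] __attribute__((aligned(4096)));

    /* Cycles taken to touch every cache line of the working set. */
    static uint64_t walk_wset(void)
    {
            uint64_t t0 = __rdtsc();

            for (size_t i = 0; i < WSS; i += 64)
                    asm volatile("" :: "r" (wset[i]));  /* keep the load */

            return __rdtsc() - t0;
    }

    uint64_t eviction_cost(void (*clear_page_fn)(void *))
    {
            walk_wset();                             /* prime the cache */

            for (int i = 0; i < NPAGES; i++)
                    clear_page_fn(pages + i * 4096); /* candidate primitive */

            /* Higher than the no-clear baseline => more eviction. */
            return walk_wset();
    }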

>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS.

Could you elaborate on what kind of difference in L1 impact you are
talking about? Evacuation of cachelines?

> Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.

One other case to consider might be ~L3-sized (or larger) regions. In
my tests, MOVNT/CLZERO is almost always better (the one exception being
Skylake) with respect to both cache impact and latency for larger extents.

In the particular cases I was looking at (mmap+MAP_POPULATE and
page-fault path), that makes the choice of always using MOVNT/CLZERO
easy for GB pages, but fuzzier for 2MB pages.

Not sure if the large-page case is interesting for you though.
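
For concreteness, the CLZERO loop I'm comparing against for the larger
extents is roughly the following -- a simplified sketch, with the
function name and the hard-coded 64-byte line size purely illustrative:

    /* Zero a large, cacheline-aligned extent one line at a time. */
    static void clear_extent_clzero(void *start, size_t len)
    {
            char *p = start;
            char *end = p + len;

            for (; p < end; p += 64)
                    /*
                     * CLZERO zeroes the whole cache line addressed by
                     * %rax without first reading it into the cache.
                     */
                    asm volatile("clzero" :: "a" (p) : "memory");

            /* CLZERO is weakly ordered, like NT stores. */
            asm volatile("sfence" ::: "memory");
    }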


Thanks
Ankur

>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth considering bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?
>
> Jan
>



 

