Xen project Mailing List

Re: x86: memset() / clear_page() / page scrubbing

From: Ankur Arora <ankur.a.arora@xxxxxxxxxx>

Date: Thu, 8 Apr 2021 23:08:45 -0700

Cc: andrew.cooper3@xxxxxxxxxx, roger.pau@xxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Fri, 09 Apr 2021 06:13:04 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Jan,

I'm working on somewhat related optimizations on Linux (clear_page(),
going in the opposite direction, from REP STOSB to MOVNT) and have
some comments/questions below.

(Discussion on v1 here:
https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@xxxxxxxxxx/)

On 4/8/2021 6:58 AM, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticeable on
> all hardware, and at least on Intel hardware more noticeable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.

Could you give me any pointers on the cache-effects on this? This
obviously makes sense but I couldn't come up with any benchmarks
which would show this in a straightforward fashion.

>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS.

Could you elaborate on what kind of difference in L1 impact you are
talking about? Evacuation of cachelines?

> Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.

The one other case might be for ~L3 (or larger) regions. In my tests,
MOVNT/CLZERO is almost always better (the one exception being Skylake)
wrt both cache and latency for larger extents.

In the particular cases I was looking at (mmap+MAP_POPULATE and
page-fault path), that makes the choice of always using MOVNT/CLZERO
easy for GB pages, but fuzzier for 2MB pages.

Not sure if the large-page case is interesting for you though.

Thanks
Ankur

>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth to consider bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?
>
> Jan
>
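
For readers following the thread, the two main candidates discussed above
(REP STOSQ for latency-sensitive clearing, MOVNTI for cache-sparing
scrubbing) can be sketched in user-space C roughly as below. This is only
an illustration, not the Xen or Linux implementation: the function names,
the DEMO_PAGE_SIZE constant and the page_is_zero() helper are made up for
the example, and it assumes x86-64 with GCC or Clang and the direction
flag clear, as the ABI guarantees.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si64, _mm_sfence (SSE2, x86-64 only) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096  /* hypothetical name; 4 KiB x86 page */

/*
 * Latency-oriented variant: REP STOSQ writes 8 bytes per iteration and
 * goes through the cache hierarchy like ordinary stores.
 */
static void clear_page_rep_stosq(void *page)
{
    size_t n = DEMO_PAGE_SIZE / 8;

    assert(((uintptr_t)page & 7) == 0);  /* 8-byte alignment expected */
    __asm__ volatile ("rep stosq"
                      : "+D" (page), "+c" (n)
                      : "a" (0UL)
                      : "memory");
}

/*
 * Cache-sparing variant: non-temporal 8-byte stores (MOVNTI). The
 * trailing SFENCE orders the weakly-ordered stores before later ones.
 */
static void clear_page_movnti(void *page)
{
    long long *p = page;

    assert(((uintptr_t)page & 7) == 0);
    for (size_t i = 0; i < DEMO_PAGE_SIZE / 8; i++)
        _mm_stream_si64(&p[i], 0);
    _mm_sfence();
}

/* Helper for checking the result: returns 1 if every byte is zero. */
static int page_is_zero(const void *page)
{
    const unsigned char *p = page;

    for (size_t i = 0; i < DEMO_PAGE_SIZE; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

Both functions take a page obtained with, e.g.,
aligned_alloc(DEMO_PAGE_SIZE, DEMO_PAGE_SIZE). The sketch only shows the
store patterns; it says nothing about the relative latency or cache
impact, which per the thread varies by vendor and generation.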
