Re: x86: memset() / clear_page() / page scrubbing
Hi Jan,

I'm working on somewhat related optimizations on Linux (clear_page(),
going in the opposite direction, from REP STOSB to MOVNT) and have some
comments/questions below.

(Discussion on v1 here:
https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@xxxxxxxxxx/)

On 4/8/2021 6:58 AM, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we're going to pick here is going to more or less
> also apply to those.
>
> The present copy_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticeable on
> all hardware, and at least on Intel hardware more noticeable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.

Could you give me any pointers on the cache-effects on this? This
obviously makes sense but I couldn't come up with any benchmarks which
would show this in a straightforward fashion.

>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS.

Could you elaborate on what kind of difference in L1 impact you are
talking about? Evacuation of cachelines?

> Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.

The one other case might be for ~L3 (or larger) regions. In my tests,
MOVNT/CLZERO is almost always better (the one exception being Skylake)
wrt both cache and latency for larger extents.

In the particular cases I was looking at (mmap+MAP_POPULATE and
page-fault path), that makes the choice of always using MOVNT/CLZERO
easy for GB pages, but fuzzier for 2MB pages.

Not sure if the large-page case is interesting for you though.

Thanks
Ankur

>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth to consider bringing it
> closer to memcpy() - try to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?
>
> Jan
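
For illustration only, the clear-vs-scrub split discussed in the quoted
text could be sketched in user-space C roughly as below. This is not
code from Xen or Linux: the names PAGE_BYTES, clear_page_rep_stos(),
clear_page_movnti(), zero_page() and enum clear_kind are hypothetical,
the CLZERO path is left out, and non-x86-64 builds simply fall back to
memset().

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_stream_si64(), _mm_sfence() */
#endif

#define PAGE_BYTES 4096u   /* hypothetical name; 4 KiB page assumed */

/* Cached clear via REP STOSQ: low latency, but the page lands in the caches. */
static void clear_page_rep_stos(void *page)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *dst = page;
    size_t qwords = PAGE_BYTES / 8;

    __asm__ volatile ("rep stosq"
                      : "+D" (dst), "+c" (qwords)
                      : "a" (0UL)
                      : "memory");
#else
    memset(page, 0, PAGE_BYTES);   /* portable fallback */
#endif
}

/* Non-temporal clear via MOVNTI: aims to avoid disturbing the caches. */
static void clear_page_movnti(void *page)
{
#if defined(__x86_64__)
    long long *p = page;

    for (size_t i = 0; i < PAGE_BYTES / sizeof(*p); i++)
        _mm_stream_si64(&p[i], 0);
    _mm_sfence();   /* order the weakly-ordered stores before later accesses */
#else
    memset(page, 0, PAGE_BYTES);   /* portable fallback */
#endif
}

/* Hypothetical policy hook mirroring the proposed split. */
enum clear_kind { CLEAR_SYNC, CLEAR_BACKGROUND };

static void zero_page(void *page, enum clear_kind kind)
{
    if (kind == CLEAR_SYNC)
        clear_page_rep_stos(page);   /* latency matters */
    else
        clear_page_movnti(page);     /* background scrub */
}
```

A caller on a latency-sensitive path would use zero_page(p, CLEAR_SYNC),
an idle-time scrubber zero_page(p, CLEAR_BACKGROUND). Whether either
variant actually wins depends on the microarchitecture and on extent
size, as discussed above.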