Re: x86: memset() / clear_page() / page scrubbing
On 2021-04-08 11:38 p.m., Jan Beulich wrote:
> On 09.04.2021 08:08, Ankur Arora wrote:
>> I'm working on somewhat related optimizations on Linux (clear_page(),
>> going in the opposite direction, from REP STOSB to MOVNT) and have
>> some comments/questions below.
>
> Interesting.
>
>> On 4/8/2021 6:58 AM, Jan Beulich wrote:
>>> All,
>>>
>>> since over the years we've been repeatedly talking of changing the
>>> implementation of these fundamental functions, I've taken some time
>>> to do some measurements (just for possible clear_page() alternatives
>>> to keep things manageable). I'm not sure I want to spend as much time
>>> subsequently on memcpy() / copy_page() (or more, because there are
>>> yet more combinations of arguments to consider), so for the moment I
>>> think the route we're going to pick here is going to more or less
>>> also apply to those.
>>>
>>> The present copy_page() is the way it is because of the desire to
>>> avoid disturbing the cache. The effect of REP STOS on the L1 cache
>>> (compared to the present use of MOVNTI) is more or less noticeable on
>>> all hardware, and at least on Intel hardware more noticeable when the
>>> cache starts out clean. For L2 the results are more mixed when
>>> comparing cache-clean and cache-filled cases, but the difference
>>> between MOVNTI and REP STOS remains or (at least on Zen2 and older
>>> Intel hardware) becomes more prominent.
>>
>> Could you give me any pointers on the cache-effects on this? This
>> obviously makes sense but I couldn't come up with any benchmarks which
>> would show this in a straight-forward fashion.
>
> No benchmarks in that sense, but a local debugging patch measuring
> things before bringing up APs, to have a reasonably predictable
> environment. I have attached it for your reference.

Thanks, that does look like a pretty good predictable test. (Btw, there
might be an oversight in the clear_page_clzero() logic. I believe that
also needs an sfence.)

Just curious: you had commented out the local irq disable/enable
clauses. Is that because you decided that the code ran at an early
enough point that they were not required, or for some other reason?

>>> Otoh REP STOS, as was to be expected, in most cases has meaningfully
>>> lower latency than MOVNTI.
>>>
>>> Because I was curious I also included AVX (32-byte stores), AVX512
>>> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
>>> clear win except on the vendors' first generations implementing it
>>> (but I've left out any playing with CR0.TS, which is what I expect
>>> would take this out as an option), AVX512 isn't on Skylake (perhaps
>>> newer hardware does better). CLZERO has slightly higher impact on L1
>>> than MOVNTI, but lower than REP STOS.
>>
>> Could you elaborate on what kind of difference in L1 impact you are
>> talking about? Evacuation of cachelines?
>
> Replacement of ones, yes. As you may see from that patch, I prefill
> the cache, do the clearing, and then measure how much longer the same
> operation takes that was used for prefilling. If the clearing left the
> cache completely alone (or if the hw prefetcher was really good), there
> would be no difference.

Yeah, that does sound like a good way to get an idea of how much the
clear_page_x() does perturb the cache.
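For readers without the attached patch, the prefill / clear / re-measure idea can be approximated in user space. What follows is only a rough sketch under stated assumptions, not the code from the patch: the names clear_page_nt(), clear_page_cached(), touch_probe() and probe_cost_after() are made up for illustration, the 32 KiB probe size is an assumed L1d size, and clock_gettime() is far coarser and noisier than a measurement taken before the APs are brought up. It also shows the sfence that has to follow non-temporal stores before the page is handed to anyone else.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_stream_si64 (MOVNTI), _mm_sfence */
#endif

#define PAGE_BYTES  4096u
#define PROBE_BYTES (32u * 1024u)   /* assumption: L1d size, adjust per CPU */

/* Clear one page with non-temporal stores (falls back to memset off x86-64). */
static void clear_page_nt(void *page)
{
#if defined(__x86_64__)
    long long *p = page;

    for (size_t i = 0; i < PAGE_BYTES / sizeof(*p); i++)
        _mm_stream_si64(&p[i], 0);
    /* Non-temporal stores are weakly ordered: fence before publishing the page. */
    _mm_sfence();
#else
    memset(page, 0, PAGE_BYTES);
#endif
}

/* Clear one page with ordinary cached stores. */
static void clear_page_cached(void *page)
{
    memset(page, 0, PAGE_BYTES);
}

/* Read one byte per 64-byte line of the probe buffer; return elapsed ns. */
static uint64_t touch_probe(volatile unsigned char *probe)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < PROBE_BYTES; i += 64)
        (void)probe[i];
    clock_gettime(CLOCK_MONOTONIC, &b);

    int64_t ns = (int64_t)(b.tv_sec - a.tv_sec) * 1000000000
                 + (b.tv_nsec - a.tv_nsec);
    return (uint64_t)ns;
}

/* Prefill the cache, clear a page, then time the same touch loop again. */
static uint64_t probe_cost_after(void (*clear)(void *), void *page,
                                 unsigned char *probe)
{
    touch_probe(probe);          /* prefill */
    clear(page);                 /* operation under test */
    return touch_probe(probe);   /* larger if clearing replaced probe lines */
}
```

Calling probe_cost_after() once with clear_page_nt and once with clear_page_cached, many times over and ideally with a cycle counter instead of clock_gettime(), would give a crude picture of the difference being discussed; a single run proves nothing.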
Agreed, MOVNT/CLZERO do seem ideally suited for background scrubbing.
Alas, AFAICS Linux currently only does foreground cleaning. The only
reason I can think of for that "decision" is that there is one trusted
user with a significant footprint (the page cache) where pages can be
allocated without needing to be cleared.

That said, given that background scrubbing is a fairly cheap way of
time-shifting work to idle without negatively affecting the cache, it
does make sense to move towards it for at least a subset of pages. The
only potential negative could be higher power consumption, because idle
is spending less time in C-states. Then again, that also seems like a
wash, given that this only shifts when we do the clearing.

Would you have any intuition on whether the power consumption of the
non-temporal primitives is meaningfully different from REP STOS and
friends?

Ankur

> Jan