Re: x86: memset() / clear_page() / page scrubbing
On 08/04/2021 14:58, Jan Beulich wrote:
> All,
>
> since over the years we've been repeatedly talking of changing the
> implementation of these fundamental functions, I've taken some time
> to do some measurements (just for possible clear_page() alternatives,
> to keep things manageable). I'm not sure I want to spend as much time
> subsequently on memcpy() / copy_page() (or more, because there are
> yet more combinations of arguments to consider), so for the moment I
> think the route we pick here is more or less going to apply to those
> as well.
>
> The present clear_page() is the way it is because of the desire to
> avoid disturbing the cache. The effect of REP STOS on the L1 cache
> (compared to the present use of MOVNTI) is more or less noticeable on
> all hardware, and at least on Intel hardware more noticeable when the
> cache starts out clean. For L2 the results are more mixed when
> comparing cache-clean and cache-filled cases, but the difference
> between MOVNTI and REP STOS remains or (at least on Zen2 and older
> Intel hardware) becomes more prominent.
>
> Otoh REP STOS, as was to be expected, in most cases has meaningfully
> lower latency than MOVNTI.
>
> Because I was curious, I also included AVX (32-byte stores), AVX512
> (64-byte stores), and AMD's CLZERO in my testing. While AVX is a
> clear win except on the vendors' first generations implementing it
> (but I've left out any playing with CR0.TS, which is what I expect
> would take this out as an option), AVX512 isn't on Skylake (perhaps
> newer hardware does better). CLZERO has slightly higher impact on
> L1 than MOVNTI, but lower than REP STOS. Its latency is between
> both when the caches are warm, and better than both when the caches
> are cold.
>
> Therefore I think that we want to distinguish page clearing (where
> we care about latency) from (background) page scrubbing (where I
> think the goal ought to be to avoid disturbing the caches). That
> would make it
> - REP STOS{L,Q} for clear_page() (perhaps also to be used for
>   synchronous scrubbing),
> - MOVNTI for scrub_page() (when done from idle context), unless
>   CLZERO is available.
> Whether in addition we should take into consideration activity of
> other (logical) CPUs sharing caches I don't know - this feels like
> it could get complex pretty quickly.
>
> For memset() we already simply use REP STOSB. I don't see a strong
> need to change that, but it may be worth considering bringing it
> closer to memcpy() - trying to do the main chunk with REP STOS{L,Q}.
> They perform somewhat better in a number of cases (including when
> ERMS is advertised, i.e. on my Haswell and Skylake, which isn't
> what I would have expected). We may want to put the whole thing in
> a .S file though, seeing that the C function right now consists of
> little more than an asm().
>
> For memcpy() I'm inclined to suggest that we simply use REP MOVSB
> on ERMS hardware, and stay with what we have everywhere else.
>
> copy_page() (or really copy_domain_page()) doesn't have many uses,
> so I'm not sure how worthwhile it is to do much optimization there.
> It might be an option to simply expand it to memcpy(), like Arm
> does.
>
> Looking forward, on CPUs having "Fast Short REP CMPSB/SCASB" we
> may want to figure out whether using these for strlen(), strcmp(),
> strchr(), memchr(), and/or memcmp() would be a win.
>
> Thoughts anyone, before I start creating actual patches?

Do you have actual numbers from these experiments?
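For reference, the two candidate approaches under discussion boil down
to something like the following - a minimal sketch only, not Xen's
actual code, with illustrative function names and a hard-coded
4096-byte page size:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Latency-oriented clearing: REP STOSQ writes through the caches. */
static void clear_page_stosq(void *page)
{
    unsigned long dummy_di, dummy_cx;

    asm volatile ( "rep stosq"
                   : "=D" (dummy_di), "=c" (dummy_cx)
                   : "0" (page), "1" (PAGE_SIZE / 8), "a" (0UL)
                   : "memory" );
}

/*
 * Cache-friendly scrubbing: MOVNTI's non-temporal stores bypass the
 * cache hierarchy; the trailing SFENCE orders them with later stores.
 */
static void scrub_page_movnti(void *page)
{
    uint64_t *p = page;
    size_t i;

    for ( i = 0; i < PAGE_SIZE / 8; i += 4 )
        asm volatile ( "movnti %1, (%0)\n\t"
                       "movnti %1, 8(%0)\n\t"
                       "movnti %1, 16(%0)\n\t"
                       "movnti %1, 24(%0)"
                       :: "r" (p + i), "r" (0UL)
                       : "memory" );

    asm volatile ( "sfence" ::: "memory" );
}

The trade-off Jan describes is visible directly in the code: the REP
STOSQ variant leaves the zeroed lines in the cache (good when the page
is about to be consumed), while the MOVNTI variant avoids evicting
whatever is already cached (good for background scrubbing from idle
context).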
I've seen your patch from the thread, but at a minimum it's missing
some hunks adding new CPUID bits. I do worry, however, whether the
testing is likely to be realistic for non-idle scenarios.

It is very little surprise that AVX-512 on Skylake is poor. The
frequency hit from using %zmm is staggering. Ice Lake is expected to
be better, but almost certainly won't exceed REP MOVSB, which is
optimised in microcode for the data width of the CPU.

For memset(), please don't move in the direction of memcpy(). memcpy()
is problematic because the common case is likely to be a multiple of 8
bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit
worth avoiding. The "Fast Zero length $FOO" bits on future parts
indicate when passing %ecx=0 is likely to be faster than branching
around the invocation.

With ERMS/etc., our logic should be a REP MOVSB/STOSB only, without
any cleverness about larger word sizes. The Linux forms do this fairly
well already, and probably better than Xen, although there might be
some room for improvement IMO.

It is worth noting that we have extra variations of memset()/memcpy()
where __builtin_memcpy() gets expanded inline, and the result is a
compiler-chosen sequence which doesn't hit any of our optimised
sequences. I'm not sure what to do about this, because there is surely
a larger win from the cases which can be turned into a single MOV, or
an elided store/copy, than from using a potentially inefficient
sequence in the rare cases. Maybe there is room for a fine-tuning
option to say "just call memset() if you're going to expand it
inline".

For all set/copy operations, whether you want non-temporal or not
depends on when/where the lines are next going to be consumed. Page
scrubbing in idle context is the only example I can think of where we
aren't plausibly going to consume the destination imminently. Even
clear/copy page in a hypercall doesn't want to be non-temporal,
because chances are good that the vCPU is going to touch the page on
return.

~Andrew
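To make the "REP MOVSB/STOSB only" suggestion above concrete, the ERMS
forms would be little more than the following sketch - assuming the
CPU advertises ERMS (CPUID leaf 7, EBX bit 9); memset_erms() and
memcpy_erms() are illustrative names, not Xen's or Linux's actual
routines:

#include <stddef.h>

/*
 * On ERMS hardware the microcode picks the optimal internal width,
 * so a bare REP STOSB needs no word-size cleverness around it.
 */
static void *memset_erms(void *dst, int c, size_t n)
{
    void *d = dst;

    asm volatile ( "rep stosb"
                   : "+D" (d), "+c" (n)
                   : "a" (c)
                   : "memory" );

    return dst;
}

static void *memcpy_erms(void *dst, const void *src, size_t n)
{
    void *d = dst;

    asm volatile ( "rep movsb"
                   : "+D" (d), "+S" (src), "+c" (n)
                   :: "memory" );

    return dst;
}

Note the deliberate absence of a branch around the n == 0 case: the
"Fast Zero length" CPUID bits mentioned above are exactly about
whether feeding %ecx=0 straight into the instruction is cheaper than
branching around it.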