
Re: x86: memset() / clear_page() / page scrubbing



On 15.04.2021 18:21, Andrew Cooper wrote:
> On 14/04/2021 09:12, Jan Beulich wrote:
>> On 13.04.2021 15:17, Andrew Cooper wrote:
>>> Do you have actual numbers from these experiments?
>> Attached is the collected raw output from a number of systems.
> 
> Wow, Tulsa is vintage.  Is that new enough to have nonstop_tsc ?

No.

>>> For memset(), please don't move in the direction of memcpy().  memcpy()
>>> is problematic because the common case is likely to be a multiple of 8
>>> bytes, meaning that we feed 0 into the trailing REP MOVSB, and this is a
>>> hit we want to avoid.
>> And you say this despite me having pointed out that REP STOSL may
>> be faster in a number of cases? Or do you mean to suggest we should
>> branch around the trailing REP {MOV,STO}SB?
>>
>>>   The "Fast Zero length $FOO" bits on future parts indicate
>>> when passing %ecx=0 is likely to be faster than branching around the
>>> invocation.
>> IOW down the road we could use alternatives patching to remove such
>> branches. But this of course is only if we don't end up using
>> exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
>> suggesting ...
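
To illustrate the alternatives patching I'm thinking of - a minimal
asm sketch only, with the feature name invented for this purpose
(the real "Fast Zero length" CPUID bits would need wiring up first):

    /* Zero %rcx bytes at %rdi, %rax already zeroed.  On parts where
     * zero-length REP STOSB is fast, patch the branch into NOPs. */
    ALTERNATIVE "jrcxz 1f", "", X86_FEATURE_FAST_ZL_STOSB
    rep stosb
1:
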
>>
>>> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
>>> cleverness about larger word sizes.  The Linux forms do this fairly well
>>> already, and probably better than Xen, although there might be some room
>>> for improvement IMO.
>> ... here.
>>
>> As to the Linux implementations - for memcpy_erms() I don't think
>> I see any room for improvement in the function itself. We could do
>> alternatives patching somewhat differently (and I probably would).
>> For memset_erms(), the tiny bit of improvement over Linux's code
>> that I would see is to avoid the partial register access when
>> loading %al. But to be honest - in both cases I wouldn't have
>> bothered looking at their code anyway, if you hadn't pointed me
>> there.
> 
> Answering multiple of the points together.
> 
> Yes, the partial register access on %al was one thing I spotted, and
> movzbl would be an improvement.  The alternatives are a bit weird, but
> they're best as they are IMO.  It makes a useful enough difference to
> backtraces/etc, and unconditional jmp's are about as close to free as
> you can get these days.
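
For reference, Linux's memset_erms is, stripped of its assembler
macros, just

    memset_erms:
        movq %rdi, %r9
        movb %sil, %al      /* the partial-register write */
        movq %rdx, %rcx
        rep stosb
        movq %r9, %rax
        ret

where the movb would become movzbl %sil, %eax - only %al matters to
REP STOSB, so the zero-extension of the upper bits is harmless.
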
> 
> On an ERMS system, we want to use REP MOVSB unilaterally.  It is my
> understanding that it is faster across the board than any algorithm
> variation trying to use wider accesses.

Not according to the numbers I've collected. There are cases where
clearing a full page via REP STOS{L,Q} is (often just a little)
faster. Whether this also applies to MOVS I can't tell.
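
For reference, the kind of variant I've been timing - a minimal C
sketch, not what Xen currently has:

    /* Clear a 4k page via REP STOSQ; the REP STOSB form merely
     * changes the count to PAGE_SIZE and the instruction suffix. */
    static void clear_page_stosq(void *pg)
    {
        long dummy1, dummy2;

        asm volatile ( "rep stosq"
                       : "=D" (dummy1), "=c" (dummy2)
                       : "0" (pg), "1" (PAGE_SIZE / 8), "a" (0L)
                       : "memory" );
    }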

>>> It is worth noting that we have extra variations of memset/memcpy where
>>> __builtin_memcpy() gets expanded inline, and the result is a
>>> compiler-chosen sequence, and doesn't hit any of our optimised
>>> sequences.  I'm not sure what to do about this, because there is surely
>>> a larger win from the cases which can be turned into a single mov, or an
>>> elided store/copy, than using a potentially inefficient sequence in the
>>> rare cases.  Maybe there is room for a fine-tuning option to say "just
>>> call memset() if you're going to expand it inline".
>> You mean "just call memset() instead of expanding it inline"?
> 
> I think what I really mean is "if the result of optimising memset() is
> going to result in a REP instruction, call memset() instead".
> 
> You want the compiler to do conversion to single mov's/etc, but what
> you don't want is ...
> 
>> If the inline expansion is merely REP STOS, I'm not sure we'd
>> actually gain anything from keeping the compiler from expanding it
>> inline. But if the inline construct was more complicated (as I
>> observe e.g. in map_vcpu_info() with gcc 10), then it would likely
>> be nice if there was such a control. I'll take note to see if I
>> can find anything.
> 
> ... this.  What GCC currently expands inline is a REP MOVS{L,Q}, with
> the first and final element done manually ahead of the REP, presumably
> for prefetching/pagewalk reasons.

Not sure about the reasons, but the compiler doesn't always do it
like this - there are also cases of plain REP STOSQ. My initial
guess is that the splitting off of the first and last elements
happens when the compiler can't prove the buffer is 8-byte aligned
and a multiple of 8 bytes in size.
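
In C terms the split-off form amounts to something like this (just a
sketch of the shape of the generated code, assuming n >= 16):

    static void memset0_split(void *p, size_t n)
    {
        unsigned long d = ((unsigned long)p + 8) & ~7UL;
        unsigned long cnt = ((((unsigned long)p + n) & ~7UL) - d) / 8;

        *(unsigned long *)p = 0;                   /* head */
        *(unsigned long *)((char *)p + n - 8) = 0; /* tail */
        asm volatile ( "rep stosq"                 /* aligned middle */
                       : "+D" (d), "+c" (cnt)
                       : "a" (0UL)
                       : "memory" );
    }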

>>> For all set/copy operations, whether you want non-temporal or not
>>> depends on when/where the lines are next going to be consumed.  Page
>>> scrubbing in idle context is the only example I can think of where we
>>> aren't plausibly going to consume the destination imminently.  Even
>>> clear/copy page in a hypercall doesn't want to be non-temporal, because
>>> chances are good that the vcpu is going to touch the page on return.
>> I'm afraid the situation isn't as black-and-white. Take HAP or
>> IOMMU page table allocations, for example: They need to clear the
>> full page, yes. But often this is just to then insert one single
>> entry, i.e. re-use exactly one of the cache lines.
> 
> I consider this an orthogonal problem.  When we're not double-scrubbing
> most memory Xen uses, most of this goes away.
> 
> Even if we do need to scrub a pagetable to use, we're never(?) complete
> at the end of the scrub, and need to make further writes imminently. 

Right, but often to just one of the cache lines.

> These never want non-temporal accesses, because you never want to write
> into a recently-evicted line, and there's no plausible way that trying to
> mix and match temporal and non-temporal stores is going to be a
> worthwhile optimisation to try.

Is a single MOV following (at some distance, and with an SFENCE in
between) a sequence of MOVNTIs going to perform any worse than the
same MOV trying to store to a cache line that's not in the cache?
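
For context, the MOVNTI sequence in question looks roughly like this
(a sketch only, not our actual implementation):

    /* Clear a page with non-temporal stores; the SFENCE orders any
     * subsequent ordinary store (the single MOV above) after them. */
    static void clear_page_nt(void *pg)
    {
        unsigned long i;

        for ( i = 0; i < PAGE_SIZE / sizeof(long); i++ )
            asm volatile ( "movnti %1, %0"
                           : "=m" (((long *)pg)[i])
                           : "r" (0L) );
        asm volatile ( "sfence" ::: "memory" );
    }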

>> Or take initial
>> population of guest RAM: The larger the guest, the less likely it
>> is for every individual page to get accessed again before its
>> contents get evicted from the caches. Judging from what Ankur said,
>> once we get to around L3 capacity, MOVNT / CLZERO may be preferable
>> there.
> 
> Initial population of guests doesn't matter at all, because nothing
> (outside of the single threaded toolstack process issuing the
> construction hypercalls) is going to touch the pages until the VM is
> unpaused.  The only async accesses I can think of are xenstored and
> xenconsoled starting up, and those are after the RAM is populated.
> 
> In cases like this, current might be a good way of choosing between
> temporal and non-temporal accesses.
> 
> As before, not double scrubbing will further improve things.
> 
>> I think in cases where we don't know how the page is going to be
>> used subsequently, we ought to favor latency over cache pollution
>> avoidance.
> 
> I broadly agree.  I think the cases where it's reasonably safe to use the
> pollution-avoidance are fairly obvious, and there is a steep cost to
> wrongly-using non-temporal accesses.
> 
>> But in cases where we know the subsequent usage pattern,
>> we may want to direct scrubbing / zeroing accordingly. Yet of
>> course it's not very helpful that there's no way to avoid
>> polluting caches and still have reasonably low latency, so using
>> some heuristics may be unavoidable.
> 
> I don't think any heuristics beyond current, or possibly
> d->creation_finished are going to be worthwhile, but I think these alone
> can net us a decent win.
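
Something along these lines, then (a sketch - the helper name is
made up, while d->creation_finished is the real field):

    /* Use cache-bypassing stores only when the cleared page's next
     * consumer isn't plausibly the vcpu doing the clearing. */
    static bool want_nt_stores(const struct domain *d)
    {
        return !d->creation_finished || d != current->domain;
    }
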
> 
>> And of course another goal of mine would be to avoid double zeroing
>> of pages: When scrubbing uses clear_page() anyway, there's no point
>> in the caller then calling clear_page() again. IMO, just like we
>> have xzalloc(), we should also have MEMF_zero. Internally the page
>> allocator can know whether a page was already scrubbed, and it
>> does know for sure whether scrubbing means zeroing.
> 
> I think we've discussed this before.  I'm in favour, but I'm absolutely
> certain that that wants to be spelled MEMF_dirty (or equiv), so forgetting
> it fails safe, and code which is using dirty allocations is clearly
> identified and can be audited easily.

Well, there's a difference between scrubbing and zeroing. We already
have MEMF_no_scrub. And we already force callers to think about
whether they want zeroed memory (outside of the page allocator), by
having both xmalloc() and xzalloc() (and their relatives). So while
for scrubbing I could see your point, I'm not sure we should force
everyone who doesn't need zeroed pages to pass MEMF_dirty (or
whatever the name, as I don't particularly like this one). It's quite
the other way around - right now no pages come out of the page
allocator in a known state content-wise. Parties presently calling
clear_page() right afterwards could easily, cleanly, and in a risk-
free manner be converted to use MEMF_zero.
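
As a concrete (but hypothetical, while MEMF_zero doesn't exist yet)
example of such a conversion:

    /* Today: */
    pg = alloc_domheap_page(d, 0);
    clear_domain_page(page_to_mfn(pg));

    /* With MEMF_zero, where the allocator skips the clearing when it
     * knows scrubbing already zeroed the page: */
    pg = alloc_domheap_page(d, MEMF_zero);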

Jan