
Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching



>>> On 02.10.18 at 12:51, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 02/10/18 11:36, Jan Beulich wrote:
>>>>> On 25.09.18 at 16:14, <JBeulich@xxxxxxxx> wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
> 
> You've not fixed anything from my concerns with v1.
> 
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context (not that it would make the behaviour any closer to how hardware
> actually works).
> 
> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache.  This leads to a large
> amount of boilerplate code for every user.

So after the call yesterday I've been thinking about this some more,
and have come to the conclusion that it is actually what you ask for
that would end up being architecturally incorrect.

First of all, there's absolutely no way to implement a general purpose
cache that mimics hardware behavior. This is simply because we
cannot observe remote (v)CPUs' writes to the cached areas.

What we instead need is a store where we can retain the result of
every independent memory read. Let me illustrate this with the
example of the gather family of instructions: Let's assume such an
instruction has its [xyz]mm index register all zero. This will produce
up to 16 reads from exactly the same memory location. Each of
these reads is an independent one, and hence each of them is
liable to observe different values (due to the coherent nature of
the processor caches and their protocol), if another CPU updates
the location frequently enough. As a result, to correctly replay
(parts of) such an instruction, we'd have to store up to 16
different values all for the same physical memory slot. Obviously
to distinguish them we'd need to somehow tag them.
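
To illustrate (with entirely made-up names and types, not anything the
series implements): a replay store for such reads would need to key on
more than just the address, so that up to 16 reads of the very same
location can each retain the value they observed on the first pass.

struct replay_read {
    paddr_t      gpa;   /* guest physical address read */
    unsigned int tag;   /* which of the instruction's reads this was */
    unsigned int size;  /* access size in bytes */
    uint64_t     data;  /* value observed on the first pass */
};

struct replay_store {
    unsigned int       nents;
    struct replay_read ents[16];
};

/* On replay, only an entry with a matching tag may satisfy the read. */
static bool replay_lookup(const struct replay_store *store, paddr_t gpa,
                          unsigned int tag, unsigned int size,
                          uint64_t *data)
{
    unsigned int i;

    for ( i = 0; i < store->nents; ++i )
        if ( store->ents[i].gpa == gpa && store->ents[i].tag == tag &&
             store->ents[i].size == size )
        {
            *data = store->ents[i].data;
            return true;
        }

    return false;
}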

This is why the current series does not try to use the "cache" for
anything other than page table elements (but leaves it possible for
this to be added later, without renaming anything). We'd have to
have the insn emulator layer (or, at the latest, the top level
hvmemul_*() hooks) produce such tags, requiring an extension
to the emulator memory access hooks.
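
For instance (purely hypothetical - no such parameter exists in the
emulator interface today), the read hook would have to grow a tag
argument along these lines:

struct x86_emulate_ops_tagged {
    /* Like the existing ->read() hook, but with a per-access tag. */
    int (*read)(enum x86_segment seg, unsigned long offset, void *p_data,
                unsigned int bytes, unsigned int tag,
                struct x86_emulate_ctxt *ctxt);
};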

Now the same consideration applies to page table reads: Each
level's read is an independent one, and therefore may observe
a value different from a higher level's read even if the same
physical slot is referenced (via recursive page tables). Here,
however, we're in the "luxury" position of both having "natural"
tagging (the page table level) and not needing to care about
subsequent writes (unlike the general purpose cache, the paging
structure caches don't "snoop" later writes, but require active
invalidation). It is my understanding that
the different page table levels have, in hardware, entirely
independent paging structure caches, i.e. a level 3 read would
not be satisfied by a level 4 PSC entry; in fact I think the spec
does not exclude either option, but suggests such an
independent behavior as the current choice of implementation.
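
As a rough sketch (names invented, not the series' actual interface),
using the walk level as the tag means a lookup can only ever be
satisfied by an entry recorded for that same level:

struct pt_latch_ent {
    paddr_t  gpa;
    uint64_t pte;
};

struct pt_latch {
    bool                valid[5];  /* indexed by walk level; 1..4 used */
    struct pt_latch_ent ent[5];
};

static bool pt_latch_read(const struct pt_latch *latch, unsigned int level,
                          paddr_t gpa, uint64_t *pte)
{
    if ( !latch->valid[level] || latch->ent[level].gpa != gpa )
        return false;
    *pte = latch->ent[level].pte;
    return true;
}

static void pt_latch_write(struct pt_latch *latch, unsigned int level,
                           paddr_t gpa, uint64_t pte)
{
    latch->valid[level] = true;
    latch->ent[level].gpa = gpa;
    latch->ent[level].pte = pte;
}

With recursive page tables the same physical slot may thus legitimately
appear, with possibly different values, at more than one level.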

Similarly there's the write behavior: We specifically would
require insn operand writes to _not_ update the "cache", as
during replay we still need to see the original value; the
aforementioned tagging would prevent this. The exception
here is the setting of A/D bits during page table walks: On
replay we _do_ want to observe them set if the first run
through had to set them. (The same would imo apply to
setting descriptor table accessed bits, btw.)
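
Extending the pt_latch sketch above (again purely illustrative):
operand writes would simply bypass the latch, while the walk's own A/D
updates would be folded back in, so that a replayed walk observes the
bits set:

static void pt_latch_set_ad(struct pt_latch *latch, unsigned int level,
                            paddr_t gpa, uint64_t ad)
{
    /* "ad" would be e.g. _PAGE_ACCESSED | _PAGE_DIRTY. */
    if ( latch->valid[level] && latch->ent[level].gpa == gpa )
        latch->ent[level].pte |= ad;
}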

As to the term "cache" - would "latch" perhaps be a better name
to reflect the purpose?

Finally, a word on your desire to make this a transparent thing
rather than something handed through as function arguments:
Almost everything ends up in __hvm_copy(). Therefore,
without a respective function argument, we'd need to globally
record state on the vCPU as to whether a particular memory
access ought to consult / update the "cache". Memory accesses
done in the context of hypercalls, for example, are specifically
not supposed to touch that "cache" - not only because that's
not how hypercalls are supposed to behave (they are required
to read guest memory exactly once only anyway), but also
because whatever size we'd make that "cache", a large
enough batched hypercall could exceed that size.
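
A minimal sketch of the function argument route, re-using the pt_latch
helpers from above (the function and its signature are invented for
illustration only):

static int read_guest_pte(unsigned int level, paddr_t gpa, uint64_t *pte,
                          struct pt_latch *latch)
{
    /* Replay: a previously latched value must be returned unchanged. */
    if ( latch && pt_latch_read(latch, level, gpa, pte) )
        return 0;

    /* First pass (or no latch): read guest memory for real. */
    if ( hvm_copy_from_guest_phys(pte, gpa, sizeof(*pte)) != HVMTRANS_okay )
        return -EFAULT;

    if ( latch )
        pt_latch_write(latch, level, gpa, *pte);

    return 0;
}

Emulation paths would pass their per-vCPU latch; hypercall argument
copies would pass NULL and hence neither consult nor update it.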

I don't mean to say this is impossible to implement correctly,
but I think going the function argument route is far easier to
prove correct, since at the relevant layer(s) of a call tree you
can see whether "caching" is intended to be in effect or not.

Jan

