Xen project Mailing List

Re: [Xen-devel] Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching

To: "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Thu, 11 Oct 2018 00:51:01 -0600

Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Paul Durrant <paul.durrant@xxxxxxxxxx>

Delivery-date: Thu, 11 Oct 2018 06:51:27 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

>>> On 02.10.18 at 12:51, <andrew.cooper3@xxxxxxxxxx> wrote: > On 02/10/18 11:36, Jan Beulich wrote: >>>>> On 25.09.18 at 16:14, <JBeulich@xxxxxxxx> wrote: >>> Emulation requiring device model assistance uses a form of instruction >>> re-execution, assuming that the second (and any further) pass takes >>> exactly the same path. This is a valid assumption as far as use of CPU >>> registers goes (as those can't change without any other instruction >>> executing in between), but is wrong for memory accesses. In particular >>> it has been observed that Windows might page out buffers underneath >>> an instruction currently under emulation (hitting between two passes). >>> If the first pass translated a linear address successfully, any subsequent >>> pass needs to do so too, yielding the exact same translation. >>> >>> Introduce a cache (used just by guest page table accesses for now, i.e. >>> a form of "paging structure cache") to make sure above described >>> assumption holds. This is a very simplistic implementation for now: Only >>> exact matches are satisfied (no overlaps or partial reads or anything). >>> >>> There's also some seemingly unrelated cleanup here which was found >>> desirable on the way. >>> >>> 1: x86/mm: add optional cache to GLA->GFN translation >>> 2: x86/mm: use optional cache in guest_walk_tables() >>> 3: x86/HVM: implement memory read caching >>> 4: x86/HVM: prefill cache with PDPTEs when possible >>> >>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I >>> can't currently find enough time to carry out the requested further >>> rework. >> Andrew, George? > > You've not fixed anything from my concerns with v1. > > This doesn't behave like real hardware, and definitely doesn't behave as > named - "struct hvmemul_cache" is simply false. If it were named > hvmemul_psc (or some other variation on Paging Structure Cache) then it > wouldn't be so bad, as the individual levels do make more sense in that > context (not that it would make the behaviour any closer to how hardware > actually works). > > I'm also not overly happy with the conditional nature of the caching, or > that it isn't a transparent read-through cache. This leads to a large > amount of boilerplate code for every user. So after the call yesterday I've been thinking about this some more, coming to the conclusion that it is actually what you ask for that would end up being architecturally incorrect. First of all, there's absolutely no way to implement a general purpose cache that's mimic-ing hardware behavior. This is simply because we cannot observe remove (v)CPU-s' writes to the cached areas. What we instead need is a store where we can retain the result of every independent memory read. Let me illustrate this at the example of the gather family of instructions: Let's assume such an instruction has its [xyz]mm index register all zero. This will produce up to 16 reads from exactly the same memory location. Each of these reads is an independent one, and hence each of them is liable to observe different values (due to the coherent nature of the processor caches and their protocol), if another CPU updates the location frequently enough. As a result, to correctly replay (parts of) such an instruction, we'd have to store up to 16 different values all for the same physical memory slot. Obviously to distinguish them we'd need to somehow tag them. This is why the current series does not try to use the "cache" for other than page table elements (but leaves it possible for this to be added later, without renaming anything). We'd have to have the insn emulator layer (or the latest the top level hvmemul_*() hooks) produce such tags, requiring an extension to the emulator memory access hooks. Now the same consideration applies to page table reads: Each level's read is an independent one, and therefore may observe a value different from a higher level's read even if the same physical slot is referenced (via recursive page tables). Here, however, we're in the "luxury" position that we have both "natural" tagging (the page table level), and we don't need to care about subsequent writes (other than the general purpose cache, the paging structure caches don't "snoop" later writes, but require active invalidation). It is my understanding that the different page table levels have, in hardware, entirely independent paging structure caches, i.e. a level 3 read would not be satisfied by a level 4 PSC entry; in fact I think the spec does not exclude either option, but suggests such an independent behavior as the current choice of implementation. Similarly there's the write behavior: We specifically would require insn operand writes to _not_ update the "cache", as during replay we still need to see the original value; the aforementioned tagging would prevent this. The exception here is the setting of A/D bits during page table walks: On replay we _do_ want to observe them set if the first run through had to set them. (The same would imo apply to setting descriptor table accessed bits, btw.) As to the term "cache" - would "latch" perhaps be a better name to reflect the purpose? Finally a word on your desire of making this a transparent thing rather than something handed through as function arguments: Almost everything ends up in __hvm_copy(). Therefore, without a respective function argument, we'd need to globally record state on the vCPU as to whether a particular memory access ought to consult / update the "cache". Memory accesses done in the context of hypercalls, for example, are specifically not supposed to touch that "cache" - not only because that's not how hypercalls are supposed to behave (they are required to read guest memory exactly once only anyway), but also because whatever size we'd make that "cache", a large enough batched hypercall could exceed that size. I'm not meaning to say this is impossible to implement correctly, but I think going the function argument route is far easier to prove correct, since at the relevant layer(s) of a call tree you can see whether "caching" is intended to be in effect or not. Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.