
Re: [Xen-devel] [PATCH v5 2/4] x86/HVM: implement memory read caching for insn emulation



> From: Jan Beulich <jbeulich@xxxxxxxx>
> Sent: Tuesday, March 3, 2020 6:17 PM
> 
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU

Ah, I was not aware of this form of re-execution. I thought emulation was
split into two phases: decode and send the I/O request to the device model,
then complete the instruction emulation using the device model's response
together with the previously-decoded information...

> registers goes (as those can't change without any other instruction
> executing in between [1]), but is wrong for memory accesses. In
> particular it has been observed that Windows might page out buffers
> underneath an instruction currently under emulation (hitting between two
> passes). If the first pass read a memory operand successfully, any
> subsequent pass needs to get to see the exact same value.
> 
> Introduce a cache to make sure above described assumption holds. This
> is a very simplistic implementation for now: Only exact matches are
> satisfied (no overlaps or partial reads or anything); this is sufficient
> for the immediate purpose of making re-execution an exact replay. The
> cache also won't be used just yet for guest page walks; that'll be the
> subject of a subsequent change.

Does the use of a cache imply that the aforementioned two-pass problem is
only mitigated rather than completely fixed?

Btw, is there any performance impact from this patch?
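
To make sure I understand the exact-match behaviour described above, here is
a small standalone toy model (my own sketch, not the actual Xen code; names,
bounds and addresses are made up): a write records a (gpa, size, data) tuple,
and a read only hits when both gpa and size match exactly, so a re-execution
pass replays the value seen by the first pass even if the guest page changed
in between.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One entry per (gpa, size) pair, exactly as observed by the first pass. */
struct ent { uint64_t gpa; unsigned int size; uint64_t data; };

static struct ent ents[8];
static unsigned int num_ents;

static void write_cache(uint64_t gpa, const void *buf, unsigned int size)
{
    for ( unsigned int i = 0; i < num_ents; ++i )
        if ( ents[i].gpa == gpa && ents[i].size == size )
        {
            memcpy(&ents[i].data, buf, size);
            return;
        }

    ents[num_ents].gpa = gpa;
    ents[num_ents].size = size;
    memcpy(&ents[num_ents].data, buf, size);
    ++num_ents;
}

static int read_cache(uint64_t gpa, void *buf, unsigned int size)
{
    for ( unsigned int i = 0; i < num_ents; ++i )
        if ( ents[i].gpa == gpa && ents[i].size == size )
        {
            memcpy(buf, &ents[i].data, size);
            return 1;
        }

    return 0;               /* miss: the caller would read guest memory */
}

int main(void)
{
    uint32_t guest = 0x11223344, val;

    /* First pass: read "guest memory" and seed the cache. */
    val = guest;
    write_cache(0x1000, &val, sizeof(val));

    /* The guest page changes between the two passes (e.g. paged out). */
    guest = 0xdeadbeef;

    /* Second pass: the replay still sees the original value. */
    if ( read_cache(0x1000, &val, sizeof(val)) )
        printf("replayed %#x, guest memory now holds %#x\n", val, guest);

    /* A differently sized (or overlapping) read does not hit. */
    printf("2-byte read at the same gpa hits: %d\n",
           read_cache(0x1000, &val, 2));

    return 0;
}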

> 
> With the cache being generally transparent to upper layers, but with it
> having limited capacity yet being required for correctness, certain
> users of hvm_copy_from_guest_*() need to disable caching temporarily,
> without invalidating the cache. Note that the adjustments here to
> hvm_hypercall() and hvm_task_switch() are benign at this point; they'll
> become relevant once we start to be able to emulate respective insns
> through the main emulator (and more changes will then likely be needed
> to nested code).
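
Just to check my reading of the "disable temporarily, without invalidating"
part: the token returned by hvmemul_cache_disable() is simply the previous
num_ents, and hvmemul_cache_restore() puts that value back, so nested
disable/restore pairs compose and the entries recorded so far survive. A
minimal standalone sketch (again my own toy code, not the patch itself):

#include <stdio.h>

static unsigned int num_ents, max_ents = 8;

/* "Disabled" is encoded as num_ents > max_ents. */
static unsigned int cache_disable(void)
{
    unsigned int token = num_ents;

    num_ents = max_ents + 1;
    return token;
}

static void cache_restore(unsigned int token)
{
    num_ents = token;
}

static int cache_enabled(void)
{
    return num_ents <= max_ents;
}

int main(void)
{
    unsigned int outer, inner;

    num_ents = 3;                        /* cache currently holds 3 entries */

    outer = cache_disable();             /* e.g. around a hypercall */
    inner = cache_disable();             /* nested disable is harmless */
    printf("enabled while disabled: %d\n", cache_enabled());

    cache_restore(inner);                /* token > max_ents: stays disabled */
    printf("after inner restore: %d\n", cache_enabled());

    cache_restore(outer);                /* back to 3 entries, enabled again */
    printf("after outer restore: %d, num_ents=%u\n",
           cache_enabled(), num_ents);

    return 0;
}
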
> 
> As to the actual data page in a problematic scenario, there are a couple
> of aspects to take into consideration:
> - We must be talking about an insn accessing two locations (two memory
>   ones, one of which is MMIO, or a memory and an I/O one).
> - If the non I/O / MMIO side is being read, the re-read (if it occurs at
>   all) is having its result discarded, by taking the shortcut through
>   the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
>   how, among all the re-issue sanity checks there, we avoid comparing
>   the actual data.
> - If the non I/O / MMIO side is being written, it is the OS's
>   responsibility to avoid actually moving page contents to disk while
>   there might still be a write access in flight - this is no different
>   in behavior from bare hardware.
> - Read-modify-write accesses are, as always, complicated, and while we
>   deal with them better nowadays than we did in the past, we're still
>   not quite there to guarantee hardware-like behavior in all cases
>   anyway. Nothing is getting worse by the changes made here, afaict.
> 
> In __hvm_copy() also reduce p's scope and change its type to void *.
> 
> [1] Other than on actual hardware, actions like
>     XEN_DOMCTL_sethvmcontext, XEN_DOMCTL_setvcpucontext,
>     VCPUOP_initialise, INIT, or SIPI issued against the vCPU can occur
>     while the vCPU is blocked waiting for a device model to return data.
>     In such cases emulation now gets canceled, though, and hence re-
>     execution correctness is unaffected.
> 
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> TBD: In principle the caching here renders unnecessary the one used for
>      insn bytes (vio->mmio_insn{,_bytes}). However, to seed the cache
>      with the data SVM may have made available, we'd have to also know
>      the corresponding GPA. It's not safe, however, to re-walk the page
>      tables to find out, as the page tables may have changed in the
>      meantime. Therefore I guess we need to keep the duplicate
>      functionality for now. A possible solution to this could be to use
>      a physical-address-based cache for page table accesses (and looking
>      forward also e.g. SVM/VMX insn emulation), and a linear-address-
>      based one for all other reads.
> ---
> v5: Re-arrange bitfield. Use domain_crash() in hvmemul_write_cache().
>     Move hvmemul_{read,write}_cache() stubs to later patch. Also adjust
>     hvmemul_cancel(). Add / extend comments. Re-base.
> v4: Re-write for cache to become transparent to callers.
> v3: Add text about the actual data page to the description.
> v2: Re-base.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -28,6 +28,19 @@
>  #include <asm/iocap.h>
>  #include <asm/vm_event.h>
> 
> +struct hvmemul_cache
> +{
> +    /* The cache is disabled as long as num_ents > max_ents. */
> +    unsigned int num_ents;
> +    unsigned int max_ents;
> +    struct {
> +        paddr_t gpa:PADDR_BITS;
> +        unsigned int :BITS_PER_LONG - PADDR_BITS - 8;
> +        unsigned int size:8;
> +        unsigned long data;
> +    } ents[];
> +};
> +
>  static void hvmtrace_io_assist(const ioreq_t *p)
>  {
>      unsigned int size, event;
> @@ -136,6 +149,8 @@ void hvmemul_cancel(struct vcpu *v)
>      vio->mmio_access = (struct npfec){};
>      vio->mmio_retry = false;
>      vio->g2m_ioport = NULL;
> +
> +    hvmemul_cache_disable(v);
>  }
> 
>  static int hvmemul_do_io(
> @@ -1883,12 +1898,17 @@ static int hvmemul_rep_movs(
>          rc = HVMTRANS_okay;
>      }
>      else
> +    {
> +        unsigned int token = hvmemul_cache_disable(curr);
> +
>          /*
>           * We do a modicum of checking here, just for paranoia's sake and to
>           * definitely avoid copying an unitialised buffer into guest address
>           * space.
>           */
>          rc = hvm_copy_from_guest_phys(buf, sgpa, bytes);
> +        hvmemul_cache_restore(curr, token);
> +    }
> 
>      if ( rc == HVMTRANS_okay )
>          rc = hvm_copy_to_guest_phys(dgpa, buf, bytes, curr);
> @@ -2551,6 +2571,19 @@ static int _hvm_emulate_one(struct hvm_e
>      struct hvm_vcpu_io *vio = &curr->arch.hvm.hvm_io;
>      int rc;
> 
> +    /*
> +     * Enable caching if it's currently disabled, but leave the cache
> +     * untouched if it's already enabled, for re-execution to consume
> +     * entries populated by an earlier pass.
> +     */
> +    if ( vio->cache->num_ents > vio->cache->max_ents )
> +    {
> +        ASSERT(vio->io_req.state == STATE_IOREQ_NONE);
> +        vio->cache->num_ents = 0;
> +    }
> +    else
> +        ASSERT(vio->io_req.state == STATE_IORESP_READY);
> +
>      hvm_emulate_init_per_insn(hvmemul_ctxt, vio->mmio_insn,
>                                vio->mmio_insn_bytes);
> 
> @@ -2564,6 +2597,7 @@ static int _hvm_emulate_one(struct hvm_e
>      {
>          vio->mmio_cache_count = 0;
>          vio->mmio_insn_bytes = 0;
> +        hvmemul_cache_disable(curr);
>      }
>      else
>      {
> @@ -2856,6 +2890,123 @@ void hvm_dump_emulation_state(const char
>             hvmemul_ctxt->insn_buf);
>  }
> 
> +int hvmemul_cache_init(struct vcpu *v)
> +{
> +    /*
> +     * No insn can access more than 16 independent linear addresses (AVX512F
> +     * scatters/gathers being the worst). Each such linear range can span a
> +     * page boundary, i.e. may require two page walks. Account for each insn
> +     * byte individually, for simplicity.
> +     */
> +    const unsigned int nents = (CONFIG_PAGING_LEVELS + 1) *
> +                               (MAX_INST_LEN + 16 * 2);
> +    struct hvmemul_cache *cache = xmalloc_flex_struct(struct hvmemul_cache,
> +                                                      ents, nents);
> +
> +    if ( !cache )
> +        return -ENOMEM;
> +
> +    /* Cache is disabled initially. */
> +    cache->num_ents = nents + 1;
> +    cache->max_ents = nents;
> +
> +    v->arch.hvm.hvm_io.cache = cache;
> +
> +    return 0;
> +}
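
Just to double-check the sizing: with CONFIG_PAGING_LEVELS == 4 and
MAX_INST_LEN == 15 (if I have those right for x86) this works out to

    nents = (4 + 1) * (15 + 16 * 2) = 235

entries, and with each entry being 16 bytes (64 bits of bitfields plus the
8-byte data word, if I read the layout right) that's roughly 3.7KiB per
vCPU. So the memory cost is a small per-vCPU allocation, and the runtime
cost should just be the linear lookups while emulation is in progress -
which partly answers my own question about overhead above. Is that about
right?
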
> +
> +unsigned int hvmemul_cache_disable(struct vcpu *v)
> +{
> +    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +    unsigned int token = cache->num_ents;
> +
> +    cache->num_ents = cache->max_ents + 1;
> +
> +    return token;
> +}
> +
> +void hvmemul_cache_restore(struct vcpu *v, unsigned int token)
> +{
> +    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +
> +    ASSERT(cache->num_ents > cache->max_ents);
> +    cache->num_ents = token;
> +}
> +
> +bool hvmemul_read_cache(const struct vcpu *v, paddr_t gpa,
> +                        void *buffer, unsigned int size)
> +{
> +    const struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +    unsigned int i;
> +
> +    /* Cache unavailable? */
> +    if ( cache->num_ents > cache->max_ents )
> +        return false;
> +
> +    while ( size > sizeof(cache->ents->data) )
> +    {
> +        i = gpa & (sizeof(cache->ents->data) - 1)
> +            ? -gpa & (sizeof(cache->ents->data) - 1)
> +            : sizeof(cache->ents->data);
> +        if ( !hvmemul_read_cache(v, gpa, buffer, i) )
> +            return false;
> +        gpa += i;
> +        buffer += i;
> +        size -= i;
> +    }
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size )
> +        {
> +            memcpy(buffer, &cache->ents[i].data, size);
> +            return true;
> +        }
> +
> +    return false;
> +}
> +
> +void hvmemul_write_cache(const struct vcpu *v, paddr_t gpa,
> +                         const void *buffer, unsigned int size)
> +{
> +    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +    unsigned int i;
> +
> +    /* Cache unavailable? */
> +    if ( cache->num_ents > cache->max_ents )
> +        return;
> +
> +    while ( size > sizeof(cache->ents->data) )
> +    {
> +        i = gpa & (sizeof(cache->ents->data) - 1)
> +            ? -gpa & (sizeof(cache->ents->data) - 1)
> +            : sizeof(cache->ents->data);
> +        hvmemul_write_cache(v, gpa, buffer, i);
> +        gpa += i;
> +        buffer += i;
> +        size -= i;
> +    }
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size )
> +        {
> +            memcpy(&cache->ents[i].data, buffer, size);
> +            return;
> +        }
> +
> +    if ( unlikely(i >= cache->max_ents) )
> +    {
> +        domain_crash(v->domain);
> +        return;
> +    }
> +
> +    cache->ents[i].gpa  = gpa;
> +    cache->ents[i].size = size;
> +
> +    memcpy(&cache->ents[i].data, buffer, size);
> +
> +    cache->num_ents = i + 1;
> +}
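
And to confirm my reading of the splitting loop above: an access wider than
the 8-byte data word gets broken up at 8-byte boundaries before the
exact-match lookup, e.g. a 16-byte access at gpa 0x1003 is recorded as three
entries of 5, 8 and 3 bytes (at 0x1003, 0x1008 and 0x1010 respectively), and
a later read only hits if it gets chunked the same way?
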
> +
>  /*
>   * Local variables:
>   * mode: C
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -726,6 +726,8 @@ int hvm_domain_initialise(struct domain
>  /* This function and all its descendants need to be to be idempotent. */
>  void hvm_domain_relinquish_resources(struct domain *d)
>  {
> +    struct vcpu *v;
> +
>      if ( hvm_funcs.domain_relinquish_resources )
>          alternative_vcall(hvm_funcs.domain_relinquish_resources, d);
> 
> @@ -742,6 +744,9 @@ void hvm_domain_relinquish_resources(str
>      rtc_deinit(d);
>      pmtimer_deinit(d);
>      hpet_deinit(d);
> +
> +    for_each_vcpu ( d, v )
> +        hvmemul_cache_destroy(v);
>  }
> 
>  void hvm_domain_destroy(struct domain *d)
> @@ -1549,6 +1554,10 @@ int hvm_vcpu_initialise(struct vcpu *v)
> 
>      v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
> 
> +    rc = hvmemul_cache_init(v);
> +    if ( rc )
> +        goto fail4;
> +
>      rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
>      if ( rc != 0 )
>          goto fail4;
> @@ -1584,6 +1593,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
>   fail5:
>      free_compat_arg_xlat(v);
>   fail4:
> +    hvmemul_cache_destroy(v);
>      hvm_funcs.vcpu_destroy(v);
>   fail3:
>      vlapic_destroy(v);
> @@ -2945,6 +2955,7 @@ void hvm_task_switch(
>      unsigned int eflags, new_cpl;
>      pagefault_info_t pfinfo;
>      int exn_raised, rc;
> +    unsigned int token = hvmemul_cache_disable(v);
>      struct tss32 tss;
> 
>      hvm_get_segment_register(v, x86_seg_gdtr, &gdt);
> @@ -3152,6 +3163,8 @@ void hvm_task_switch(
>   out:
>      hvm_unmap_entry(optss_desc);
>      hvm_unmap_entry(nptss_desc);
> +
> +    hvmemul_cache_restore(v, token);
>  }
> 
>  enum hvm_translation_result hvm_translate_get_page(
> @@ -3242,7 +3255,6 @@ static enum hvm_translation_result __hvm
>      gfn_t gfn;
>      struct page_info *page;
>      p2m_type_t p2mt;
> -    char *p;
>      int count, todo = size;
> 
>      ASSERT(is_hvm_vcpu(v));
> @@ -3290,11 +3302,17 @@ static enum hvm_translation_result __hvm
>              return HVMTRANS_need_retry;
>          }
> 
> -        p = __map_domain_page(page) + pgoff;
> -
> -        if ( flags & HVMCOPY_to_guest )
> +        if ( (flags & HVMCOPY_to_guest) ||
> +             !hvmemul_read_cache(v, gfn_to_gaddr(gfn) | pgoff, buf, count) )
>          {
> -            if ( p2m_is_discard_write(p2mt) )
> +            void *p = __map_domain_page(page) + pgoff;
> +
> +            if ( !(flags & HVMCOPY_to_guest) )
> +            {
> +                memcpy(buf, p, count);
> +                hvmemul_write_cache(v, gfn_to_gaddr(gfn) | pgoff, buf, count);
> +            }
> +            else if ( p2m_is_discard_write(p2mt) )
>              {
>                  static unsigned long lastpage;
> 
> @@ -3311,13 +3329,9 @@ static enum hvm_translation_result __hvm
>                      memset(p, 0, count);
>                  paging_mark_pfn_dirty(v->domain, _pfn(gfn_x(gfn)));
>              }
> -        }
> -        else
> -        {
> -            memcpy(buf, p, count);
> -        }
> 
> -        unmap_domain_page(p);
> +            unmap_domain_page(p);
> +        }
> 
>          addr += count;
>          if ( buf )
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -22,6 +22,7 @@
>  #include <xen/hypercall.h>
>  #include <xen/nospec.h>
> 
> +#include <asm/hvm/emulate.h>
>  #include <asm/hvm/support.h>
> 
>  static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> @@ -159,6 +160,7 @@ int hvm_hypercall(struct cpu_user_regs *
>      struct domain *currd = curr->domain;
>      int mode = hvm_guest_x86_mode(curr);
>      unsigned long eax = regs->eax;
> +    unsigned int token;
> 
>      switch ( mode )
>      {
> @@ -183,7 +185,18 @@ int hvm_hypercall(struct cpu_user_regs *
>      }
> 
>      if ( (eax & 0x80000000) && is_viridian_domain(currd) )
> -        return viridian_hypercall(regs);
> +    {
> +        int ret;
> +
> +        /* See comment below. */
> +        token = hvmemul_cache_disable(curr);
> +
> +        ret = viridian_hypercall(regs);
> +
> +        hvmemul_cache_restore(curr, token);
> +
> +        return ret;
> +    }
> 
>      BUILD_BUG_ON(ARRAY_SIZE(hvm_hypercall_table) >
>                   ARRAY_SIZE(hypercall_args_table));
> @@ -202,6 +215,12 @@ int hvm_hypercall(struct cpu_user_regs *
>          return HVM_HCALL_completed;
>      }
> 
> +    /*
> +     * Caching is intended for instruction emulation only. Disable it
> +     * for any accesses by hypercall argument copy-in / copy-out.
> +     */
> +    token = hvmemul_cache_disable(curr);
> +
>      curr->hcall_preempted = false;
> 
>      if ( mode == 8 )
> @@ -295,6 +314,8 @@ int hvm_hypercall(struct cpu_user_regs *
>  #endif
>      }
> 
> +    hvmemul_cache_restore(curr, token);
> +
>      HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
> 
>      if ( curr->hcall_preempted )
> --- a/xen/arch/x86/hvm/intercept.c
> +++ b/xen/arch/x86/hvm/intercept.c
> @@ -20,6 +20,7 @@
>  #include <xen/types.h>
>  #include <xen/sched.h>
>  #include <asm/regs.h>
> +#include <asm/hvm/emulate.h>
>  #include <asm/hvm/hvm.h>
>  #include <asm/hvm/support.h>
>  #include <asm/hvm/domain.h>
> @@ -163,6 +164,9 @@ int hvm_process_io_intercept(const struc
>          {
>              if ( p->data_is_ptr )
>              {
> +                struct vcpu *curr = current;
> +                unsigned int token = hvmemul_cache_disable(curr);
> +
>                  data = 0;
>                  switch ( hvm_copy_from_guest_phys(&data, p->data + step * i,
>                                                    p->size) )
> @@ -179,9 +183,11 @@ int hvm_process_io_intercept(const struc
>                      ASSERT_UNREACHABLE();
>                      /* fall through */
>                  default:
> -                    domain_crash(current->domain);
> +                    domain_crash(curr->domain);
>                      return X86EMUL_UNHANDLEABLE;
>                  }
> +
> +                hvmemul_cache_restore(curr, token);
>              }
>              else
>                  data = p->data;
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1040,6 +1040,8 @@ void svm_vmenter_helper(const struct cpu
>      struct vcpu *curr = current;
>      struct vmcb_struct *vmcb = curr->arch.hvm.svm.vmcb;
> 
> +    ASSERT(hvmemul_cache_disabled(curr));
> +
>      svm_asid_handle_vmrun();
> 
>      if ( unlikely(tb_init_done) )
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -35,6 +35,7 @@
>  #include <xen/irq.h>
>  #include <xen/vpci.h>
>  #include <public/hvm/ioreq.h>
> +#include <asm/hvm/emulate.h>
>  #include <asm/hvm/io.h>
>  #include <asm/hvm/vpic.h>
>  #include <asm/hvm/vlapic.h>
> @@ -607,6 +608,7 @@ void msix_write_completion(struct vcpu *
>      if ( !ctrl_address && snoop_addr &&
>           v->arch.hvm.hvm_io.msix_snoop_gpa )
>      {
> +        unsigned int token = hvmemul_cache_disable(v);
>          const struct msi_desc *desc;
>          uint32_t data;
> 
> @@ -621,6 +623,8 @@ void msix_write_completion(struct vcpu *
>                                        sizeof(data)) == HVMTRANS_okay &&
>               !(data & PCI_MSIX_VECTOR_BITMASK) )
>              ctrl_address = snoop_addr;
> +
> +        hvmemul_cache_restore(v, token);
>      }
> 
>      if ( !ctrl_address )
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -4362,6 +4362,8 @@ bool vmx_vmenter_helper(const struct cpu
>      struct hvm_vcpu_asid *p_asid;
>      bool_t need_flush;
> 
> +    ASSERT(hvmemul_cache_disabled(curr));
> +
>      /* Shadow EPTP can't be updated here because irqs are disabled */
>       if ( nestedhvm_vcpu_in_guestmode(curr) && vcpu_nestedhvm(curr).stale_np2m )
>           return false;
> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -13,6 +13,7 @@
>  #define __ASM_X86_HVM_EMULATE_H__
> 
>  #include <xen/err.h>
> +#include <xen/sched.h>
>  #include <asm/hvm/hvm.h>
>  #include <asm/x86_emulate.h>
> 
> @@ -97,6 +98,39 @@ int hvmemul_do_pio_buffer(uint16_t port,
>                            uint8_t dir,
>                            void *buffer);
> 
> +#ifdef CONFIG_HVM
> +/*
> + * The cache controlled by the functions below is not like an ordinary CPU
> + * cache, i.e. aiming to help performance, but a "secret store" which is
> + * needed for correctness.  The issue it helps addressing is the need for
> + * re-execution of an insn (after data was provided by a device model) to
> + * observe the exact same memory state, i.e. to specifically not observe any
> + * updates which may have occurred in the meantime by other agents.
> + * Therefore this cache gets
> + * - enabled when emulation of an insn starts,
> + * - disabled across processing secondary things like a hypercall resulting
> + *   from insn emulation,
> + * - disabled again when an emulated insn is known to not require any
> + *   further re-execution.
> + */
> +int __must_check hvmemul_cache_init(struct vcpu *v);
> +static inline void hvmemul_cache_destroy(struct vcpu *v)
> +{
> +    XFREE(v->arch.hvm.hvm_io.cache);
> +}
> +bool hvmemul_read_cache(const struct vcpu *, paddr_t gpa,
> +                        void *buffer, unsigned int size);
> +void hvmemul_write_cache(const struct vcpu *, paddr_t gpa,
> +                         const void *buffer, unsigned int size);
> +unsigned int hvmemul_cache_disable(struct vcpu *);
> +void hvmemul_cache_restore(struct vcpu *, unsigned int token);
> +/* For use in ASSERT()s only: */
> +static inline bool hvmemul_cache_disabled(struct vcpu *v)
> +{
> +    return hvmemul_cache_disable(v) == hvmemul_cache_disable(v);
> +}
> +#endif
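
The hvmemul_cache_disabled() construct took me a moment: if I read it right,
the first hvmemul_cache_disable() returns the pre-call num_ents and leaves
the cache disabled (num_ents = max_ents + 1), while the second returns
max_ents + 1, so the two results compare equal only when the cache was
already disabled beforehand. E.g. starting from num_ents == 3 the calls
return 3 and max_ents + 1 (unequal), whereas starting from an already
disabled cache both return max_ents + 1 - which is exactly what the vmenter
assertions want to check.
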
> +
>  void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
>                                struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
> 
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -77,6 +77,8 @@ struct hvm_vcpu_io {
>      /* For retries we shouldn't re-fetch the instruction. */
>      unsigned int mmio_insn_bytes;
>      unsigned char mmio_insn[16];
> +    struct hvmemul_cache *cache;
> +
>      /*
>       * For string instruction emulation we need to be able to signal a
>       * necessary retry through other than function return codes.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

