
Re: [Xen-devel] [PATCH v5 2/4] x86/HVM: implement memory read caching for insn emulation



> From: Jan Beulich <jbeulich@xxxxxxxx>
> Sent: Tuesday, March 3, 2020 6:17 PM
> 
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU

Ah, I was not aware of this form of re-execution. I thought emulation was
split into two phases: decode and send the I/O request to the device model,
then complete the instruction emulation using the device model's response
together with the previously-decoded information...

> registers goes (as those can't change without any other instruction
> executing in between [1]), but is wrong for memory accesses. In
> particular it has been observed that Windows might page out buffers
> underneath an instruction currently under emulation (hitting between two
> passes). If the first pass read a memory operand successfully, any
> subsequent pass needs to get to see the exact same value.
> 
> Introduce a cache to make sure above described assumption holds. This
> is a very simplistic implementation for now: Only exact matches are
> satisfied (no overlaps or partial reads or anything); this is sufficient
> for the immediate purpose of making re-execution an exact replay. The
> cache also won't be used just yet for guest page walks; that'll be the
> subject of a subsequent change.

Does the use of a cache imply that the aforementioned two-pass problem is
only mitigated rather than completely fixed?

Btw, is there any performance impact from this patch?
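
To make sure I understand the exact-match behaviour described above, here is
a small standalone toy model (my own sketch, not the actual Xen code; names,
bounds and addresses are made up): a write records a (gpa, size, data) tuple,
and a read only hits when both gpa and size match exactly, so a re-execution
pass replays the value seen by the first pass even if the guest page changed
in between.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One entry per (gpa, size) pair, exactly as observed by the first pass. */
struct ent { uint64_t gpa; unsigned int size; uint64_t data; };

static struct ent ents[8];
static unsigned int num_ents;

static void write_cache(uint64_t gpa, const void *buf, unsigned int size)
{
    for ( unsigned int i = 0; i < num_ents; ++i )
        if ( ents[i].gpa == gpa && ents[i].size == size )
        {
            memcpy(&ents[i].data, buf, size);
            return;
        }

    ents[num_ents].gpa = gpa;
    ents[num_ents].size = size;
    memcpy(&ents[num_ents].data, buf, size);
    ++num_ents;
}

static int read_cache(uint64_t gpa, void *buf, unsigned int size)
{
    for ( unsigned int i = 0; i < num_ents; ++i )
        if ( ents[i].gpa == gpa && ents[i].size == size )
        {
            memcpy(buf, &ents[i].data, size);
            return 1;
        }

    return 0;               /* miss: the caller would read guest memory */
}

int main(void)
{
    uint32_t guest = 0x11223344, val;

    /* First pass: read "guest memory" and seed the cache. */
    val = guest;
    write_cache(0x1000, &val, sizeof(val));

    /* The guest page changes between the two passes (e.g. paged out). */
    guest = 0xdeadbeef;

    /* Second pass: the replay still sees the original value. */
    if ( read_cache(0x1000, &val, sizeof(val)) )
        printf("replayed %#x, guest memory now holds %#x\n", val, guest);

    /* A differently sized (or overlapping) read does not hit. */
    printf("2-byte read at the same gpa hits: %d\n",
           read_cache(0x1000, &val, 2));

    return 0;
}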

> 
> With the cache being generally transparent to upper layers, but with it
> having limited capacity yet being required for correctness, certain
> users of hvm_copy_from_guest_*() need to disable caching temporarily,
> without invalidating the cache. Note that the adjustments here to
> hvm_hypercall() and hvm_task_switch() are benign at this point; they'll
> become relevant once we start to be able to emulate respective insns
> through the main emulator (and more changes will then likely be needed
> to nested code).
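
Just to check my reading of the "disable temporarily, without invalidating"
part: the token returned by hvmemul_cache_disable() is simply the previous
num_ents, and hvmemul_cache_restore() puts that value back, so nested
disable/restore pairs compose and the entries recorded so far survive. A
minimal standalone sketch (again my own toy code, not the patch itself):

#include <stdio.h>

static unsigned int num_ents, max_ents = 8;

/* "Disabled" is encoded as num_ents > max_ents. */
static unsigned int cache_disable(void)
{
    unsigned int token = num_ents;

    num_ents = max_ents + 1;
    return token;
}

static void cache_restore(unsigned int token)
{
    num_ents = token;
}

static int cache_enabled(void)
{
    return num_ents <= max_ents;
}

int main(void)
{
    unsigned int outer, inner;

    num_ents = 3;                        /* cache currently holds 3 entries */

    outer = cache_disable();             /* e.g. around a hypercall */
    inner = cache_disable();             /* nested disable is harmless */
    printf("enabled while disabled: %d\n", cache_enabled());

    cache_restore(inner);                /* token > max_ents: stays disabled */
    printf("after inner restore: %d\n", cache_enabled());

    cache_restore(outer);                /* back to 3 entries, enabled again */
    printf("after outer restore: %d, num_ents=%u\n",
           cache_enabled(), num_ents);

    return 0;
}
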
> 
> As to the actual data page in a problematic scenario, there are a couple
> of aspects to take into consideration:
> - We must be talking about an insn accessing two locations (two memory
>   ones, one of which is MMIO, or a memory and an I/O one).
> - If the non I/O / MMIO side is being read, the re-read (if it occurs at
>   all) is having its result discarded, by taking the shortcut through
>   the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
>   how, among all the re-issue sanity checks there, we avoid comparing
>   the actual data.
> - If the non I/O / MMIO side is being written, it is the OS's
>   responsibility to avoid actually moving page contents to disk while
>   there might still be a write access in flight - this is no different
>   in behavior from bare hardware.
> - Read-modify-write accesses are, as always, complicated, and while we
>   deal with them better nowadays than we did in the past, we're still
>   not quite there to guarantee hardware-like behavior in all cases
>   anyway. Nothing is getting worse by the changes made here, afaict.
> 
> In __hvm_copy() also reduce p's scope and change its type to void *.
> 
> [1] Other than on actual hardware, actions like
>     XEN_DOMCTL_sethvmcontext, XEN_DOMCTL_setvcpucontext,
>     VCPUOP_initialise, INIT, or SIPI issued against the vCPU can occur
>     while the vCPU is blocked waiting for a device model to return data.
>     In such cases emulation now gets canceled, though, and hence re-
>     execution correctness is unaffected.
> 
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> TBD: In principle the caching here renders unnecessary the one used for
>      insn bytes (vio->mmio_insn{,_bytes}). However, to seed the cache
>      with the data SVM may have made available, we'd have to also know
>      the corresponding GPA. It's not safe, however, to re-walk the page
>      tables to find out, as the page tables may have changed in the
>      meantime. Therefore I guess we need to keep the duplicate
>      functionality for now. A possible solution to this could be to use
>      a physical-address-based cache for page table accesses (and looking
>      forward also e.g. SVM/VMX insn emulation), and a linear-address-
>      based one for all other reads.
> ---
> v5: Re-arrange bitfield. Use domain_crash() in hvmemul_write_cache().
>     Move hvmemul_{read,write}_cache() stubs to later patch. Also adjust
>     hvmemul_cancel(). Add / extend comments. Re-base.
> v4: Re-write for cache to become transparent to callers.
> v3: Add text about the actual data page to the description.
> v2: Re-base.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -28,6 +28,19 @@
>  #include <asm/iocap.h>
>  #include <asm/vm_event.h>
> 
> +struct hvmemul_cache
> +{
> +    /* The cache is disabled as long as num_ents > max_ents. */
> +    unsigned int num_ents;
> +    unsigned int max_ents;
> +    struct {
> +        paddr_t gpa:PADDR_BITS;
> +        unsigned int :BITS_PER_LONG - PADDR_BITS - 8;
> +        unsigned int size:8;
> +        unsigned long data;
> +    } ents[];
> +};
> +
>  static void hvmtrace_io_assist(const ioreq_t *p)
>  {
>      unsigned int size, event;
> @@ -136,6 +149,8 @@ void hvmemul_cancel(struct vcpu *v)
>      vio->mmio_access = (struct npfec){};
>      vio->mmio_retry = false;
>      vio->g2m_ioport = NULL;
> +
> +    hvmemul_cache_disable(v);
>  }
> 
>  static int hvmemul_do_io(
> @@ -1883,12 +1898,17 @@ static int hvmemul_rep_movs(
>          rc = HVMTRANS_okay;
>      }
>      else
> +    {
> +        unsigned int token = hvmemul_cache_disable(curr);
> +
>          /*
>           * We do a modicum of checking here, just for paranoia's sake and to
>           * definitely avoid copying an unitialised buffer into guest address
>           * space.
>           */
>          rc = hvm_copy_from_guest_phys(buf, sgpa, bytes);
> +        hvmemul_cache_restore(curr, token);
> +    }
> 
>      if ( rc == HVMTRANS_okay )
>          rc = hvm_copy_to_guest_phys(dgpa, buf, bytes, curr);
> @@ -2551,6 +2571,19 @@ static int _hvm_emulate_one(struct hvm_e
>      struct hvm_vcpu_io *vio = &curr->arch.hvm.hvm_io;
>      int rc;
> 
> +    /*
> +     * Enable caching if it's currently disabled, but leave the cache
> +     * untouched if it's already enabled, for re-execution to consume
> +     * entries populated by an earlier pass.
> +     */
> +    if ( vio->cache->num_ents > vio->cache->max_ents )
> +    {
> +        ASSERT(vio->io_req.state == STATE_IOREQ_NONE);
> +        vio->cache->num_ents = 0;
> +    }
> +    else
> +        ASSERT(vio->io_req.state == STATE_IORESP_READY);
> +
>      hvm_emulate_init_per_insn(hvmemul_ctxt, vio->mmio_insn,
>                                vio->mmio_insn_bytes);
> 
> @@ -2564,6 +2597,7 @@ static int _hvm_emulate_one(struct hvm_e
>      {
>          vio->mmio_cache_count = 0;
>          vio->mmio_insn_bytes = 0;
> +        hvmemul_cache_disable(curr);
>      }
>      else
>      {
> @@ -2856,6 +2890,123 @@ void hvm_dump_emulation_state(const char
>             hvmemul_ctxt->insn_buf);
>  }
> 
> +int hvmemul_cache_init(struct vcpu *v)
> +{
> +    /*
> +     * No insn can access more than 16 independent linear addresses (AVX512F
> +     * scatters/gathers being the worst). Each such linear range can span a
> +     * page boundary, i.e. may require two page walks. Account for each insn
> +     * byte individually, for simplicity.
> +     */
> +    const unsigned int nents = (CONFIG_PAGING_LEVELS + 1) *
> +                               (MAX_INST_LEN + 16 * 2);
> +    struct hvmemul_cache *cache = xmalloc_flex_struct(struct hvmemul_cache,
> +                                                      ents, nents);
> +
> +    if ( !cache )
> +        return -ENOMEM;
> +
> +    /* Cache is disabled initially. */
> +    cache->num_ents = nents + 1;
> +    cache->max_ents = nents;
> +
> +    v->arch.hvm.hvm_io.cache = cache;
> +
> +    return 0;
> +}
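
Just to double-check the sizing: with CONFIG_PAGING_LEVELS == 4 and
MAX_INST_LEN == 15 (if I have those right for x86) this works out to

    nents = (4 + 1) * (15 + 16 * 2) = 235

entries, and with each entry being 16 bytes (64 bits of bitfields plus the
8-byte data word, if I read the layout right) that's roughly 3.7KiB per
vCPU. So the memory cost is a small per-vCPU allocation, and the runtime
cost should just be the linear lookups while emulation is in progress -
which partly answers my own question about overhead above. Is that about
right?
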
> +
> +unsigned int hvmemul_cache_disable(struct vcpu *v)
> +{
> +    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +    unsigned int token = cache->num_ents;
> +
> +    cache->num_ents = cache->max_ents + 1;
> +
> +    return token;
> +}
> +
> +void hvmemul_cache_restore(struct vcpu *v, unsigned int token)
> +{
> +    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +
> +    ASSERT(cache->num_ents > cache->max_ents);
> +    cache->num_ents = token;
> +}
> +
> +bool hvmemul_read_cache(const struct vcpu *v, paddr_t gpa,
> +                        void *buffer, unsigned int size)
> +{
> +    const struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +    unsigned int i;
> +
> +    /* Cache unavailable? */
> +    if ( cache->num_ents > cache->max_ents )
> +        return false;
> +
> +    while ( size > sizeof(cache->ents->data) )
> +    {
> +        i = gpa & (sizeof(cache->ents->data) - 1)
> +            ? -gpa & (sizeof(cache->ents->data) - 1)
> +            : sizeof(cache->ents->data);
> +        if ( !hvmemul_read_cache(v, gpa, buffer, i) )
> +            return false;
> +        gpa += i;
> +        buffer += i;
> +        size -= i;
> +    }
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size )
> +        {
> +            memcpy(buffer, &cache->ents[i].data, size);
> +            return true;
> +        }
> +
> +    return false;
> +}
> +
> +void hvmemul_write_cache(const struct vcpu *v, paddr_t gpa,
> +                         const void *buffer, unsigned int size)
> +{
> +    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
> +    unsigned int i;
> +
> +    /* Cache unavailable? */
> +    if ( cache->num_ents > cache->max_ents )
> +        return;
> +
> +    while ( size > sizeof(cache->ents->data) )
> +    {
> +        i = gpa & (sizeof(cache->ents->data) - 1)
> +            ? -gpa & (sizeof(cache->ents->data) - 1)
> +            : sizeof(cache->ents->data);
> +        hvmemul_write_cache(v, gpa, buffer, i);
> +        gpa += i;
> +        buffer += i;
> +        size -= i;
> +    }
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size )
> +        {
> +            memcpy(&cache->ents[i].data, buffer, size);
> +            return;
> +        }
> +
> +    if ( unlikely(i >= cache->max_ents) )
> +    {
> +        domain_crash(v->domain);
> +        return;
> +    }
> +
> +    cache->ents[i].gpa  = gpa;
> +    cache->ents[i].size = size;
> +
> +    memcpy(&cache->ents[i].data, buffer, size);
> +
> +    cache->num_ents = i + 1;
> +}
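
And to confirm my reading of the splitting loop above: an access wider than
the 8-byte data word gets broken up at 8-byte boundaries before the
exact-match lookup, e.g. a 16-byte access at gpa 0x1003 is recorded as three
entries of 5, 8 and 3 bytes (at 0x1003, 0x1008 and 0x1010 respectively), and
a later read only hits if it gets chunked the same way?
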
> +
>  /*
>   * Local variables:
>   * mode: C
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -726,6 +726,8 @@ int hvm_domain_initialise(struct domain
>  /* This function and all its descendants need to be to be idempotent. */
>  void hvm_domain_relinquish_resources(struct domain *d)
>  {
> +    struct vcpu *v;
> +
>      if ( hvm_funcs.domain_relinquish_resources )
>          alternative_vcall(hvm_funcs.domain_relinquish_resources, d);
> 
> @@ -742,6 +744,9 @@ void hvm_domain_relinquish_resources(str
>      rtc_deinit(d);
>      pmtimer_deinit(d);
>      hpet_deinit(d);
> +
> +    for_each_vcpu ( d, v )
> +        hvmemul_cache_destroy(v);
>  }
> 
>  void hvm_domain_destroy(struct domain *d)
> @@ -1549,6 +1554,10 @@ int hvm_vcpu_initialise(struct vcpu *v)
> 
>      v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
> 
> +    rc = hvmemul_cache_init(v);
> +    if ( rc )
> +        goto fail4;
> +
>      rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
>      if ( rc != 0 )
>          goto fail4;
> @@ -1584,6 +1593,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
>   fail5:
>      free_compat_arg_xlat(v);
>   fail4:
> +    hvmemul_cache_destroy(v);
>      hvm_funcs.vcpu_destroy(v);
>   fail3:
>      vlapic_destroy(v);
> @@ -2945,6 +2955,7 @@ void hvm_task_switch(
>      unsigned int eflags, new_cpl;
>      pagefault_info_t pfinfo;
>      int exn_raised, rc;
> +    unsigned int token = hvmemul_cache_disable(v);
>      struct tss32 tss;
> 
>      hvm_get_segment_register(v, x86_seg_gdtr, &gdt);
> @@ -3152,6 +3163,8 @@ void hvm_task_switch(
>   out:
>      hvm_unmap_entry(optss_desc);
>      hvm_unmap_entry(nptss_desc);
> +
> +    hvmemul_cache_restore(v, token);
>  }
> 
>  enum hvm_translation_result hvm_translate_get_page(
> @@ -3242,7 +3255,6 @@ static enum hvm_translation_result __hvm
>      gfn_t gfn;
>      struct page_info *page;
>      p2m_type_t p2mt;
> -    char *p;
>      int count, todo = size;
> 
>      ASSERT(is_hvm_vcpu(v));
> @@ -3290,11 +3302,17 @@ static enum hvm_translation_result __hvm
>              return HVMTRANS_need_retry;
>          }
> 
> -        p = __map_domain_page(page) + pgoff;
> -
> -        if ( flags & HVMCOPY_to_guest )
> +        if ( (flags & HVMCOPY_to_guest) ||
> +             !hvmemul_read_cache(v, gfn_to_gaddr(gfn) | pgoff, buf, count) )
>          {
> -            if ( p2m_is_discard_write(p2mt) )
> +            void *p = __map_domain_page(page) + pgoff;
> +
> +            if ( !(flags & HVMCOPY_to_guest) )
> +            {
> +                memcpy(buf, p, count);
> +                hvmemul_write_cache(v, gfn_to_gaddr(gfn) | pgoff, buf, count);
> +            }
> +            else if ( p2m_is_discard_write(p2mt) )
>              {
>                  static unsigned long lastpage;
> 
> @@ -3311,13 +3329,9 @@ static enum hvm_translation_result __hvm
>                      memset(p, 0, count);
>                  paging_mark_pfn_dirty(v->domain, _pfn(gfn_x(gfn)));
>              }
> -        }
> -        else
> -        {
> -            memcpy(buf, p, count);
> -        }
> 
> -        unmap_domain_page(p);
> +            unmap_domain_page(p);
> +        }
> 
>          addr += count;
>          if ( buf )
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -22,6 +22,7 @@
>  #include <xen/hypercall.h>
>  #include <xen/nospec.h>
> 
> +#include <asm/hvm/emulate.h>
>  #include <asm/hvm/support.h>
> 
>  static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> @@ -159,6 +160,7 @@ int hvm_hypercall(struct cpu_user_regs *
>      struct domain *currd = curr->domain;
>      int mode = hvm_guest_x86_mode(curr);
>      unsigned long eax = regs->eax;
> +    unsigned int token;
> 
>      switch ( mode )
>      {
> @@ -183,7 +185,18 @@ int hvm_hypercall(struct cpu_user_regs *
>      }
> 
>      if ( (eax & 0x80000000) && is_viridian_domain(currd) )
> -        return viridian_hypercall(regs);
> +    {
> +        int ret;
> +
> +        /* See comment below. */
> +        token = hvmemul_cache_disable(curr);
> +
> +        ret = viridian_hypercall(regs);
> +
> +        hvmemul_cache_restore(curr, token);
> +
> +        return ret;
> +    }
> 
>      BUILD_BUG_ON(ARRAY_SIZE(hvm_hypercall_table) >
>                   ARRAY_SIZE(hypercall_args_table));
> @@ -202,6 +215,12 @@ int hvm_hypercall(struct cpu_user_regs *
>          return HVM_HCALL_completed;
>      }
> 
> +    /*
> +     * Caching is intended for instruction emulation only. Disable it
> +     * for any accesses by hypercall argument copy-in / copy-out.
> +     */
> +    token = hvmemul_cache_disable(curr);
> +
>      curr->hcall_preempted = false;
> 
>      if ( mode == 8 )
> @@ -295,6 +314,8 @@ int hvm_hypercall(struct cpu_user_regs *
>  #endif
>      }
> 
> +    hvmemul_cache_restore(curr, token);
> +
>      HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
> 
>      if ( curr->hcall_preempted )
> --- a/xen/arch/x86/hvm/intercept.c
> +++ b/xen/arch/x86/hvm/intercept.c
> @@ -20,6 +20,7 @@
>  #include <xen/types.h>
>  #include <xen/sched.h>
>  #include <asm/regs.h>
> +#include <asm/hvm/emulate.h>
>  #include <asm/hvm/hvm.h>
>  #include <asm/hvm/support.h>
>  #include <asm/hvm/domain.h>
> @@ -163,6 +164,9 @@ int hvm_process_io_intercept(const struc
>          {
>              if ( p->data_is_ptr )
>              {
> +                struct vcpu *curr = current;
> +                unsigned int token = hvmemul_cache_disable(curr);
> +
>                  data = 0;
>                  switch ( hvm_copy_from_guest_phys(&data, p->data + step * i,
>                                                    p->size) )
> @@ -179,9 +183,11 @@ int hvm_process_io_intercept(const struc
>                      ASSERT_UNREACHABLE();
>                      /* fall through */
>                  default:
> -                    domain_crash(current->domain);
> +                    domain_crash(curr->domain);
>                      return X86EMUL_UNHANDLEABLE;
>                  }
> +
> +                hvmemul_cache_restore(curr, token);
>              }
>              else
>                  data = p->data;
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1040,6 +1040,8 @@ void svm_vmenter_helper(const struct cpu
>      struct vcpu *curr = current;
>      struct vmcb_struct *vmcb = curr->arch.hvm.svm.vmcb;
> 
> +    ASSERT(hvmemul_cache_disabled(curr));
> +
>      svm_asid_handle_vmrun();
> 
>      if ( unlikely(tb_init_done) )
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -35,6 +35,7 @@
>  #include <xen/irq.h>
>  #include <xen/vpci.h>
>  #include <public/hvm/ioreq.h>
> +#include <asm/hvm/emulate.h>
>  #include <asm/hvm/io.h>
>  #include <asm/hvm/vpic.h>
>  #include <asm/hvm/vlapic.h>
> @@ -607,6 +608,7 @@ void msix_write_completion(struct vcpu *
>      if ( !ctrl_address && snoop_addr &&
>           v->arch.hvm.hvm_io.msix_snoop_gpa )
>      {
> +        unsigned int token = hvmemul_cache_disable(v);
>          const struct msi_desc *desc;
>          uint32_t data;
> 
> @@ -621,6 +623,8 @@ void msix_write_completion(struct vcpu *
>                                        sizeof(data)) == HVMTRANS_okay &&
>               !(data & PCI_MSIX_VECTOR_BITMASK) )
>              ctrl_address = snoop_addr;
> +
> +        hvmemul_cache_restore(v, token);
>      }
> 
>      if ( !ctrl_address )
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -4362,6 +4362,8 @@ bool vmx_vmenter_helper(const struct cpu
>      struct hvm_vcpu_asid *p_asid;
>      bool_t need_flush;
> 
> +    ASSERT(hvmemul_cache_disabled(curr));
> +
>      /* Shadow EPTP can't be updated here because irqs are disabled */
>       if ( nestedhvm_vcpu_in_guestmode(curr) && vcpu_nestedhvm(curr).stale_np2m )
>           return false;
> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -13,6 +13,7 @@
>  #define __ASM_X86_HVM_EMULATE_H__
> 
>  #include <xen/err.h>
> +#include <xen/sched.h>
>  #include <asm/hvm/hvm.h>
>  #include <asm/x86_emulate.h>
> 
> @@ -97,6 +98,39 @@ int hvmemul_do_pio_buffer(uint16_t port,
>                            uint8_t dir,
>                            void *buffer);
> 
> +#ifdef CONFIG_HVM
> +/*
> + * The cache controlled by the functions below is not like an ordinary CPU
> + * cache, i.e. aiming to help performance, but a "secret store" which is
> + * needed for correctness.  The issue it helps addressing is the need for
> + * re-execution of an insn (after data was provided by a device model) to
> + * observe the exact same memory state, i.e. to specifically not observe any
> + * updates which may have occurred in the meantime by other agents.
> + * Therefore this cache gets
> + * - enabled when emulation of an insn starts,
> + * - disabled across processing secondary things like a hypercall resulting
> + *   from insn emulation,
> + * - disabled again when an emulated insn is known to not require any
> + *   further re-execution.
> + */
> +int __must_check hvmemul_cache_init(struct vcpu *v);
> +static inline void hvmemul_cache_destroy(struct vcpu *v)
> +{
> +    XFREE(v->arch.hvm.hvm_io.cache);
> +}
> +bool hvmemul_read_cache(const struct vcpu *, paddr_t gpa,
> +                        void *buffer, unsigned int size);
> +void hvmemul_write_cache(const struct vcpu *, paddr_t gpa,
> +                         const void *buffer, unsigned int size);
> +unsigned int hvmemul_cache_disable(struct vcpu *);
> +void hvmemul_cache_restore(struct vcpu *, unsigned int token);
> +/* For use in ASSERT()s only: */
> +static inline bool hvmemul_cache_disabled(struct vcpu *v)
> +{
> +    return hvmemul_cache_disable(v) == hvmemul_cache_disable(v);
> +}
> +#endif
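
The hvmemul_cache_disabled() construct took me a moment: if I read it right,
the first hvmemul_cache_disable() returns the pre-call num_ents and leaves
the cache disabled (num_ents = max_ents + 1), while the second returns
max_ents + 1, so the two results compare equal only when the cache was
already disabled beforehand. E.g. starting from num_ents == 3 the calls
return 3 and max_ents + 1 (unequal), whereas starting from an already
disabled cache both return max_ents + 1 - which is exactly what the vmenter
assertions want to check.
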
> +
>  void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
>                                struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
> 
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -77,6 +77,8 @@ struct hvm_vcpu_io {
>      /* For retries we shouldn't re-fetch the instruction. */
>      unsigned int mmio_insn_bytes;
>      unsigned char mmio_insn[16];
> +    struct hvmemul_cache *cache;
> +
>      /*
>       * For string instruction emulation we need to be able to signal a
>       * necessary retry through other than function return codes.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

