Re: [Xen-devel] [PATCH v6 24/29] xen/x86: allow HVM guests to use hypercalls to bring up vCPUs



>>> On 04.09.15 at 14:09, <roger.pau@xxxxxxxxxx> wrote:
> Allow the usage of the VCPUOP_initialise, VCPUOP_up, VCPUOP_down and
> VCPUOP_is_up hypercalls from HVM guests.
> 
> This patch introduces a new structure (vcpu_hvm_context) that should be used
> in conjunction with the VCPUOP_initialise hypercall in order to initialize
> vCPUs for HVM guests.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

So this bi-modal thing doesn't look too bad, but a concern I have is
with the now different contexts used by XEN_DOMCTL_setvcpucontext
and VCPUOP_initialise.

> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -752,6 +752,30 @@ int arch_set_info_guest(
>      return 0;
>  }
>  
> +int arch_initialize_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
> +{
> +    struct vcpu_guest_context *ctxt;
> +    struct domain *d = v->domain;
> +    int rc;
> +
> +    if ( (ctxt = alloc_vcpu_guest_context()) == NULL )
> +        return -ENOMEM;
> +
> +    if ( copy_from_guest(ctxt, arg, 1) )
> +    {
> +        free_vcpu_guest_context(ctxt);
> +        return -EFAULT;
> +    }
> +
> +    domain_lock(d);
> +    rc = v->is_initialised ? -EEXIST : arch_set_info_guest(v, ctxt);
> +    domain_unlock(d);
> +
> +    free_vcpu_guest_context(ctxt);
> +
> +    return rc;
> +}

I wonder whether this shouldn't instead be kept in common code,
with arch code calling it as needed (e.g. as default_initialize_vcpu()),
since afaict the code now duplicates the x86-side PV handling.
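
Something along these lines, just as a sketch (the name and the exact
placement in common code are of course up for discussion; the body is
simply the quoted ARM logic with the error path folded):

/* Sketch only: possible common default_initialize_vcpu(), hoisted from
 * the quoted ARM implementation. */
int default_initialize_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
{
    struct vcpu_guest_context *ctxt;
    struct domain *d = v->domain;
    int rc;

    if ( (ctxt = alloc_vcpu_guest_context()) == NULL )
        return -ENOMEM;

    if ( copy_from_guest(ctxt, arg, 1) )
        rc = -EFAULT;
    else
    {
        domain_lock(d);
        rc = v->is_initialised ? -EEXIST : arch_set_info_guest(v, ctxt);
        domain_unlock(d);
    }

    free_vcpu_guest_context(ctxt);

    return rc;
}

with arch_initialize_vcpu() on ARM (and the x86 PV path) then reduced
to a call of this helper.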

> @@ -1140,6 +1141,201 @@ int arch_set_info_guest(
>  #undef c
>  }
>  
> +/* Called by VCPUOP_initialise for HVM guests. */
> +static int arch_set_info_hvm_guest(struct vcpu *v, vcpu_hvm_context_t *ctx)
> +{
> +    struct cpu_user_regs *uregs = &v->arch.user_regs;
> +    struct segment_register cs, ds, ss, tr;
> +
> +#define SEG(s, r)                                                       \
> +    (struct segment_register){ .base = (r)->s ## _base,                 \
> +            .limit = (r)->s ## _limit, .attr.bytes = (r)->s ## _ar }
> +
> +    switch ( ctx->mode )
> +    {
> +    default:
> +        return -EINVAL;
> +
> +    case VCPU_HVM_MODE_16B:

I think "MODE" is misleading here, not just because of the register
size issue (see further down) but also because you don't seem to
enforce the respective mode to be chosen in cs_ar. I'm also missing
at least some simple consistency checks (like CS.DPL == SS.DPL,
rIP within CS limit, rSP within SS limit); leaving these to the first VM
entry would likely result in much harder to analyze issues in case of
bad input.
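
To illustrate, checks of roughly this shape are what I have in mind
(placed after the switch() below, once cs/ss have been filled in;
field names as in struct segment_register). This is only a sketch -
it doesn't account for expand-down segments, and the limit checks
are only meaningful outside long mode:

    /* Illustrative consistency checks on the incoming state. */
    if ( cs.attr.fields.dpl != ss.attr.fields.dpl )
        return -EINVAL;
    if ( (uregs->rip > cs.limit) || (uregs->rsp > ss.limit) )
        return -EINVAL;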

> +    {
> +        const struct vcpu_hvm_x86_16 *regs = &ctx->cpu_regs.x86_16;
> +
> +        uregs->rax    = regs->ax;
> +        uregs->rcx    = regs->cx;
> +        uregs->rdx    = regs->dx;
> +        uregs->rbx    = regs->bx;
> +        uregs->rsp    = regs->sp;
> +        uregs->rbp    = regs->bp;
> +        uregs->rsi    = regs->si;
> +        uregs->rdi    = regs->di;
> +        uregs->rip    = regs->ip;
> +        uregs->rflags = regs->flags;
> +
> +        v->arch.hvm_vcpu.guest_cr[0] = regs->cr0;
> +        v->arch.hvm_vcpu.guest_cr[4] = regs->cr4;
> +
> +        cs = SEG(cs, regs);
> +        ds = SEG(ds, regs);
> +        ss = SEG(ss, regs);
> +        tr = SEG(tr, regs);
> +    }
> +    break;
> +
> +    case VCPU_HVM_MODE_32B:
> +    {
> +        const struct vcpu_hvm_x86_32 *regs = &ctx->cpu_regs.x86_32;
> +
> +        uregs->rax    = regs->eax;
> +        uregs->rcx    = regs->ecx;
> +        uregs->rdx    = regs->edx;
> +        uregs->rbx    = regs->ebx;
> +        uregs->rsp    = regs->esp;
> +        uregs->rbp    = regs->ebp;
> +        uregs->rsi    = regs->esi;
> +        uregs->rdi    = regs->edi;
> +        uregs->rip    = regs->eip;
> +        uregs->rflags = regs->eflags;
> +
> +        v->arch.hvm_vcpu.guest_cr[0] = regs->cr0;
> +        v->arch.hvm_vcpu.guest_cr[3] = regs->cr3;
> +        v->arch.hvm_vcpu.guest_cr[4] = regs->cr4;
> +        v->arch.hvm_vcpu.guest_efer  = regs->efer;
> +
> +        cs = SEG(cs, regs);
> +        ds = SEG(ds, regs);
> +        ss = SEG(ss, regs);
> +        tr = SEG(tr, regs);
> +    }
> +    break;
> +
> +    case VCPU_HVM_MODE_64B:
> +    {
> +        const struct vcpu_hvm_x86_64 *regs = &ctx->cpu_regs.x86_64;
> +
> +        uregs->rax    = regs->rax;
> +        uregs->rcx    = regs->rcx;
> +        uregs->rdx    = regs->rdx;
> +        uregs->rbx    = regs->rbx;
> +        uregs->rsp    = regs->rsp;
> +        uregs->rbp    = regs->rbp;
> +        uregs->rsi    = regs->rsi;
> +        uregs->rdi    = regs->rdi;
> +        uregs->rip    = regs->rip;
> +        uregs->rflags = regs->rflags;
> +        uregs->r8     = regs->r8;
> +        uregs->r9     = regs->r9;
> +        uregs->r10    = regs->r10;
> +        uregs->r11    = regs->r11;
> +        uregs->r12    = regs->r12;
> +        uregs->r13    = regs->r13;
> +        uregs->r14    = regs->r14;
> +        uregs->r15    = regs->r15;
> +
> +        v->arch.hvm_vcpu.guest_cr[0] = regs->cr0;
> +        v->arch.hvm_vcpu.guest_cr[3] = regs->cr3;
> +        v->arch.hvm_vcpu.guest_cr[4] = regs->cr4;
> +        v->arch.hvm_vcpu.guest_efer  = regs->efer;
> +
> +        cs = SEG(cs, regs);
> +        ds = SEG(ds, regs);
> +        ss = SEG(ss, regs);
> +        tr = SEG(tr, regs);
> +    }
> +    break;
> +
> +    }
> +
> +    if ( !paging_mode_hap(v->domain) )
> +        v->arch.guest_table = pagetable_null();

A comment with the reason for this would be nice. I can't immediately
see why this is here.

> +    hvm_update_guest_cr(v, 0);
> +    hvm_update_guest_cr(v, 4);
> +
> +    if ( (ctx->mode == VCPU_HVM_MODE_32B) ||
> +         (ctx->mode == VCPU_HVM_MODE_64B) )
> +    {
> +        hvm_update_guest_cr(v, 3);
> +        hvm_update_guest_efer(v);
> +    }
> +
> +    if ( hvm_paging_enabled(v) && !paging_mode_hap(v->domain) )
> +    {
> +        /* Shadow-mode CR3 change. Check PDBR and update refcounts. */
> +        struct page_info *page = get_page_from_gfn(v->domain,
> +                                 v->arch.hvm_vcpu.guest_cr[3] >> PAGE_SHIFT,
> +                                 NULL, P2M_ALLOC);

P2M_ALLOC but not P2M_UNSHARE?

> +        if ( !page )
> +        {
> +            gdprintk(XENLOG_ERR, "Invalid CR3: %#lx\n",
> +                     v->arch.hvm_vcpu.guest_cr[3]);
> +            domain_crash(v->domain);
> +            return -EINVAL;
> +        }
> +
> +        v->arch.guest_table = pagetable_from_page(page);
> +    }
> +
> +    hvm_set_segment_register(v, x86_seg_cs, &cs);
> +    hvm_set_segment_register(v, x86_seg_ds, &ds);
> +    hvm_set_segment_register(v, x86_seg_ss, &ss);
> +    hvm_set_segment_register(v, x86_seg_tr, &tr);
> +
> +    /* Sync AP's TSC with BSP's. */
> +    v->arch.hvm_vcpu.cache_tsc_offset =
> +        v->domain->vcpu[0]->arch.hvm_vcpu.cache_tsc_offset;
> +    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset,
> +                             v->domain->arch.hvm_domain.sync_tsc);
> +
> +    v->arch.hvm_vcpu.msr_tsc_adjust = 0;

The need for this one I also can't see right away - another comment
perhaps?

> +    paging_update_paging_modes(v);
> +
> +    v->is_initialised = 1;
> +    set_bit(_VPF_down, &v->pause_flags);

No VGCF_online equivalent?

> +    return 0;
> +#undef SEG

I think this should move up (right after the switch() using it).

> +#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__
> +#define __XEN_PUBLIC_HVM_HVM_VCPU_H__
> +
> +#include "../xen.h"
> +
> +struct vcpu_hvm_x86_16 {
> +    uint16_t ax;
> +    uint16_t cx;
> +    uint16_t dx;
> +    uint16_t bx;
> +    uint16_t sp;
> +    uint16_t bp;
> +    uint16_t si;
> +    uint16_t di;
> +    uint16_t ip;
> +    uint16_t flags;
> +
> +    uint32_t cr0;
> +    uint32_t cr4;
> +
> +    uint32_t cs_base;
> +    uint32_t ds_base;
> +    uint32_t ss_base;
> +    uint32_t tr_base;
> +    uint32_t cs_limit;
> +    uint32_t ds_limit;
> +    uint32_t ss_limit;
> +    uint32_t tr_limit;
> +    uint16_t cs_ar;
> +    uint16_t ds_ar;
> +    uint16_t ss_ar;
> +    uint16_t tr_ar;
> +};

I doubt this is very useful: While one ought to be able to start a
guest in 16-bit mode, its GPRs are still supposed to be 32 bits wide.
The mode used would really depend on the cs_ar setting. (Having
the structure just for a 16-bit IP would seem insane.)

> +struct vcpu_hvm_x86_32 {
> +    uint32_t eax;
> +    uint32_t ecx;
> +    uint32_t edx;
> +    uint32_t ebx;
> +    uint32_t esp;
> +    uint32_t ebp;
> +    uint32_t esi;
> +    uint32_t edi;
> +    uint32_t eip;
> +    uint16_t eflags;

uint32_t for sure.

> +    uint32_t cr0;
> +    uint32_t cr3;
> +    uint32_t cr4;
> +    uint64_t efer;

What again was the point of having EFER here?

> +    uint32_t cs_base;
> +    uint32_t ds_base;
> +    uint32_t ss_base;

I continue to question why we have DS here, but not ES (and maybe
FS and GS too). I.e. either just CS and SS (which are architecturally
required) or at least all four traditional x86 segment registers. And
you're also clearly not targeting minimal state, or else there likely
wouldn't be a need for e.g. R8-R15 in the 64-bit variant.

> +struct vcpu_hvm_x86_64 {
> +    uint64_t rax;
> +    uint64_t rcx;
> +    uint64_t rdx;
> +    uint64_t rbx;
> +    uint64_t rsp;
> +    uint64_t rbp;
> +    uint64_t rsi;
> +    uint64_t rdi;
> +    uint64_t r8;
> +    uint64_t r9;
> +    uint64_t r10;
> +    uint64_t r11;
> +    uint64_t r12;
> +    uint64_t r13;
> +    uint64_t r14;
> +    uint64_t r15;
> +    uint64_t rip;
> +    uint64_t rflags;
> +
> +    uint64_t cr0;
> +    uint64_t cr3;
> +    uint64_t cr4;
> +    uint64_t efer;
> +
> +    uint32_t cs_base;
> +    uint32_t ds_base;
> +    uint32_t ss_base;
> +    uint32_t tr_base;
> +    uint32_t cs_limit;
> +    uint32_t ds_limit;
> +    uint32_t ss_limit;
> +    uint32_t tr_limit;

I can see the need for the TR ones here, but what's the point of the
CS, SS, and DS ones?

> +/*
> + * The layout of the _ar fields of the segment registers is the
> + * following:
> + *
> + * Bits [0,3]: type (bits 40-43).
> + * Bit      4: s    (descriptor type, bit 44).
> + * Bits [5,6]: dpl  (descriptor privilege level, bits 45-46).
> + * Bit      7: p    (segment-present, bit 47).
> + * Bit      8: avl  (available for system software, bit 52).
> + * Bit      9: l    (64-bit code segment, bit 53).
> + * Bit     10: db   (meaning depends on the segment, bit 54).
> + * Bit     11: g    (granularity, bit 55).
> + *
> + * A more complete description of the meaning of these fields can be
> + * obtained from the Intel SDM, Volume 3, section 3.4.5.
> + */

Please make explicit what state bits 12-15 are expected to be in
(hopefully they're being checked to be zero rather than getting
ignored).

Please also clarify whether the limit specified for the various
segment registers is the one present in descriptor tables (subject
to scaling by the g bit) or (more likely) the byte-granular one. In
any event I suppose certain restrictions apply.
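
For reference, under the quoted layout the attribute packing works out
as follows (illustrative helper only, not part of the patch):

static inline uint16_t pack_segment_ar(unsigned int type, unsigned int s,
                                       unsigned int dpl, unsigned int p,
                                       unsigned int avl, unsigned int l,
                                       unsigned int db, unsigned int g)
{
    /* Bits 12-15 left zero, per the (hopefully enforced) expectation above. */
    return (uint16_t)((type & 0xf) | (s << 4) | ((dpl & 3) << 5) | (p << 7) |
                      (avl << 8) | (l << 9) | (db << 10) | (g << 11));
}

so that e.g. a flat, present, ring-0, 32-bit code segment would be
pack_segment_ar(0xb, 1, 0, 1, 0, 0, 1, 1) == 0xc9b, and the matching
flat data/stack segment (type 0x3) 0xc93.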

Jan
