
Re: [Xen-devel] [PATCH v6 24/29] xen/x86: allow HVM guests to use hypercalls to bring up vCPUs



On 21/09/15 at 17:44, Jan Beulich wrote:
>>>> On 04.09.15 at 14:09, <roger.pau@xxxxxxxxxx> wrote:
>> Allow the usage of the VCPUOP_initialise, VCPUOP_up, VCPUOP_down and
>> VCPUOP_is_up hypercalls from HVM guests.
>>
>> This patch introduces a new structure (vcpu_hvm_context) that should be used
>> in conjunction with the VCPUOP_initialise hypercall in order to initialize
>> vCPUs for HVM guests.
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> 
> So this bi-modal thing doesn't look too bad, but a concern I have is
> with the now different contexts used by XEN_DOMCTL_setvcpucontext
> and VCPUOP_initialise.

Yes, that's far from ideal. I was going to say that
XEN_DOMCTL_{set/get}vcpucontext should return EOPNOTSUPP when executed
against HVM guests, but that's going to break current toolstack code
that relies on this in order to perform suspend/resume of HVM guests
(and gdbsx would probably also be affected).

If you feel that would be a better solution I could fix current tools
code in order to use XEN_DOMCTL_{get/set}hvmcontext instead of
XEN_DOMCTL_{set/get}vcpucontext when dealing with HVM guests.

>> --- a/xen/arch/arm/domain.c
>> +++ b/xen/arch/arm/domain.c
>> @@ -752,6 +752,30 @@ int arch_set_info_guest(
>>      return 0;
>>  }
>>  
>> +int arch_initialize_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
>> +{
>> +    struct vcpu_guest_context *ctxt;
>> +    struct domain *d = v->domain;
>> +    int rc;
>> +
>> +    if ( (ctxt = alloc_vcpu_guest_context()) == NULL )
>> +        return -ENOMEM;
>> +
>> +    if ( copy_from_guest(ctxt, arg, 1) )
>> +    {
>> +        free_vcpu_guest_context(ctxt);
>> +        return -EFAULT;
>> +    }
>> +
>> +    domain_lock(d);
>> +    rc = v->is_initialised ? -EEXIST : arch_set_info_guest(v, ctxt);
>> +    domain_unlock(d);
>> +
>> +    free_vcpu_guest_context(ctxt);
>> +
>> +    return rc;
>> +}
> 
> I wonder whether this shouldn't instead be kept in common code,
> with arch code calling it as needed (e.g. as default_initialize_vcpu()),
> since afaict the code is now duplicate with the x86 side PV handling.

Ack.

>> @@ -1140,6 +1141,201 @@ int arch_set_info_guest(
>>  #undef c
>>  }
>>  
>> +/* Called by VCPUOP_initialise for HVM guests. */
>> +static int arch_set_info_hvm_guest(struct vcpu *v, vcpu_hvm_context_t *ctx)
>> +{
>> +    struct cpu_user_regs *uregs = &v->arch.user_regs;
>> +    struct segment_register cs, ds, ss, tr;
>> +
>> +#define SEG(s, r)                                                       \
>> +    (struct segment_register){ .base = (r)->s ## _base,                 \
>> +            .limit = (r)->s ## _limit, .attr.bytes = (r)->s ## _ar }
>> +
>> +    switch ( ctx->mode )
>> +    {
>> +    default:
>> +        return -EINVAL;
>> +
>> +    case VCPU_HVM_MODE_16B:
> 
> I think "MODE" is misleading here, not just because of the register
> size issue (see further down) but also because you don't seem to
> enforce the respective mode to be chosen in cs_ar. I'm also missing
> at least some simple consistency checks (like CS.DPL == SS.DPL,
> rIP within CS limit, rSP within SS limit); leaving these to the first VM
> entry would likely result in much harder to analyze issues in case of
> bad input.

See the comment about the 16-bit registers in vcpu_hvm_context_t.

>> +    {
>> +        const struct vcpu_hvm_x86_16 *regs = &ctx->cpu_regs.x86_16;
>> +
>> +        uregs->rax    = regs->ax;
>> +        uregs->rcx    = regs->cx;
>> +        uregs->rdx    = regs->dx;
>> +        uregs->rbx    = regs->bx;
>> +        uregs->rsp    = regs->sp;
>> +        uregs->rbp    = regs->bp;
>> +        uregs->rsi    = regs->si;
>> +        uregs->rdi    = regs->di;
>> +        uregs->rip    = regs->ip;
>> +        uregs->rflags = regs->flags;
>> +
>> +        v->arch.hvm_vcpu.guest_cr[0] = regs->cr0;
>> +        v->arch.hvm_vcpu.guest_cr[4] = regs->cr4;
>> +
>> +        cs = SEG(cs, regs);
>> +        ds = SEG(ds, regs);
>> +        ss = SEG(ss, regs);
>> +        tr = SEG(tr, regs);
>> +    }
>> +    break;
>> +
>> +    case VCPU_HVM_MODE_32B:
>> +    {
>> +        const struct vcpu_hvm_x86_32 *regs = &ctx->cpu_regs.x86_32;
>> +
>> +        uregs->rax    = regs->eax;
>> +        uregs->rcx    = regs->ecx;
>> +        uregs->rdx    = regs->edx;
>> +        uregs->rbx    = regs->ebx;
>> +        uregs->rsp    = regs->esp;
>> +        uregs->rbp    = regs->ebp;
>> +        uregs->rsi    = regs->esi;
>> +        uregs->rdi    = regs->edi;
>> +        uregs->rip    = regs->eip;
>> +        uregs->rflags = regs->eflags;
>> +
>> +        v->arch.hvm_vcpu.guest_cr[0] = regs->cr0;
>> +        v->arch.hvm_vcpu.guest_cr[3] = regs->cr3;
>> +        v->arch.hvm_vcpu.guest_cr[4] = regs->cr4;
>> +        v->arch.hvm_vcpu.guest_efer  = regs->efer;
>> +
>> +        cs = SEG(cs, regs);
>> +        ds = SEG(ds, regs);
>> +        ss = SEG(ss, regs);
>> +        tr = SEG(tr, regs);
>> +    }
>> +    break;
>> +
>> +    case VCPU_HVM_MODE_64B:
>> +    {
>> +        const struct vcpu_hvm_x86_64 *regs = &ctx->cpu_regs.x86_64;
>> +
>> +        uregs->rax    = regs->rax;
>> +        uregs->rcx    = regs->rcx;
>> +        uregs->rdx    = regs->rdx;
>> +        uregs->rbx    = regs->rbx;
>> +        uregs->rsp    = regs->rsp;
>> +        uregs->rbp    = regs->rbp;
>> +        uregs->rsi    = regs->rsi;
>> +        uregs->rdi    = regs->rdi;
>> +        uregs->rip    = regs->rip;
>> +        uregs->rflags = regs->rflags;
>> +        uregs->r8     = regs->r8;
>> +        uregs->r9     = regs->r9;
>> +        uregs->r10    = regs->r10;
>> +        uregs->r11    = regs->r11;
>> +        uregs->r12    = regs->r12;
>> +        uregs->r13    = regs->r13;
>> +        uregs->r14    = regs->r14;
>> +        uregs->r15    = regs->r15;
>> +
>> +        v->arch.hvm_vcpu.guest_cr[0] = regs->cr0;
>> +        v->arch.hvm_vcpu.guest_cr[3] = regs->cr3;
>> +        v->arch.hvm_vcpu.guest_cr[4] = regs->cr4;
>> +        v->arch.hvm_vcpu.guest_efer  = regs->efer;
>> +
>> +        cs = SEG(cs, regs);
>> +        ds = SEG(ds, regs);
>> +        ss = SEG(ss, regs);
>> +        tr = SEG(tr, regs);
>> +    }
>> +    break;
>> +
>> +    }
>> +
>> +    if ( !paging_mode_hap(v->domain) )
>> +        v->arch.guest_table = pagetable_null();
> 
> A comment with the reason for this would be nice. I can't immediately
> see why this is here.

guest_table contains uninitialized data at this point, which makes
pagetable_is_null return false. Other places that set guest_table check
whether the previous guest_table is null, and if it is not, Xen will try
to free it, causing a bug. hvm_vcpu_reset_state does something similar
when setting the initial vCPU state.

I will add a comment to point this out.

>> +    hvm_update_guest_cr(v, 0);
>> +    hvm_update_guest_cr(v, 4);
>> +
>> +    if ( (ctx->mode == VCPU_HVM_MODE_32B) ||
>> +         (ctx->mode == VCPU_HVM_MODE_64B) )
>> +    {
>> +        hvm_update_guest_cr(v, 3);
>> +        hvm_update_guest_efer(v);
>> +    }
>> +
>> +    if ( hvm_paging_enabled(v) && !paging_mode_hap(v->domain) )
>> +    {
>> +        /* Shadow-mode CR3 change. Check PDBR and update refcounts. */
>> +        struct page_info *page = get_page_from_gfn(v->domain,
>> +                                 v->arch.hvm_vcpu.guest_cr[3] >> PAGE_SHIFT,
>> +                                 NULL, P2M_ALLOC);
> 
> P2M_ALLOC but not P2M_UNSHARE?

This is a copy of what's done in hvm_set_cr3 when shadow mode is enabled.

>> +        if ( !page )
>> +        {
>> +            gdprintk(XENLOG_ERR, "Invalid CR3: %#lx\n",
>> +                     v->arch.hvm_vcpu.guest_cr[3]);
>> +            domain_crash(v->domain);
>> +            return -EINVAL;
>> +        }
>> +
>> +        v->arch.guest_table = pagetable_from_page(page);
>> +    }
>> +
>> +    hvm_set_segment_register(v, x86_seg_cs, &cs);
>> +    hvm_set_segment_register(v, x86_seg_ds, &ds);
>> +    hvm_set_segment_register(v, x86_seg_ss, &ss);
>> +    hvm_set_segment_register(v, x86_seg_tr, &tr);
>> +
>> +    /* Sync AP's TSC with BSP's. */
>> +    v->arch.hvm_vcpu.cache_tsc_offset =
>> +        v->domain->vcpu[0]->arch.hvm_vcpu.cache_tsc_offset;
>> +    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset,
>> +                             v->domain->arch.hvm_domain.sync_tsc);
>> +
>> +    v->arch.hvm_vcpu.msr_tsc_adjust = 0;
> 
> The need for this one I also can't see right away - another comment
> perhaps?

AFAICT we need to initialize this to a sane value, as is done in
hvm_vcpu_reset_state.

>> +    paging_update_paging_modes(v);
>> +
>> +    v->is_initialised = 1;
>> +    set_bit(_VPF_down, &v->pause_flags);
> 
> No VGCF_online equivalent?

No, this interface doesn't allow setting up the vCPU state and starting
the vCPU at the same time. Users have to call VCPUOP_up after
VCPUOP_initialise in order to start the vCPU. vcpu_hvm_context no longer
even has the "flags" field where this used to be set.
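For illustration, a toolstack-side sketch of that bring-up sequence might
look like the following. The struct layout is a trimmed transcription of
the proposed public header, the VCPU_HVM_MODE_32B value is assumed, and
the hypercall invocations themselves are only shown as comments since
they need a privileged interface (e.g. libxc):

```c
#include <stdint.h>
#include <string.h>

/* Trimmed transcription of the proposed hvm_vcpu.h; the numeric value
 * of VCPU_HVM_MODE_32B is an assumption for this sketch. */
#define VCPU_HVM_MODE_32B 1

struct vcpu_hvm_x86_32 {
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi, eip, eflags;
    uint32_t cr0, cr3, cr4;
    uint64_t efer;
    uint32_t cs_base, ds_base, ss_base;
    uint32_t cs_limit, ds_limit, ss_limit;
    uint16_t cs_ar, ds_ar, ss_ar;
};

struct vcpu_hvm_context {
    uint32_t mode;
    union {
        struct vcpu_hvm_x86_32 x86_32;
    } cpu_regs;
};

/* Build a context that starts an AP in flat 32-bit protected mode,
 * paging disabled, with the entry point at 'eip'. */
static struct vcpu_hvm_context build_ap_ctx(uint32_t eip)
{
    struct vcpu_hvm_context ctx;

    memset(&ctx, 0, sizeof(ctx));
    ctx.mode = VCPU_HVM_MODE_32B;
    ctx.cpu_regs.x86_32.eip = eip;
    ctx.cpu_regs.x86_32.cr0 = 0x1;              /* CR0.PE set, paging off */
    ctx.cpu_regs.x86_32.cs_limit = 0xffffffff;  /* byte-granular flat limits */
    ctx.cpu_regs.x86_32.ds_limit = 0xffffffff;
    ctx.cpu_regs.x86_32.ss_limit = 0xffffffff;
    ctx.cpu_regs.x86_32.cs_ar = 0xc9b;          /* code: type=0xb, s, p, db, g */
    ctx.cpu_regs.x86_32.ds_ar = 0xc93;          /* data: type=0x3, s, p, db, g */
    ctx.cpu_regs.x86_32.ss_ar = 0xc93;
    return ctx;
}

/* The caller would then issue, in this order:
 *   VCPUOP_initialise(vcpu_id, &ctx);  -- load the state built above
 *   VCPUOP_up(vcpu_id);                -- actually unpause the vCPU
 */
```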

>> +    return 0;
>> +#undef SEG
> 
> I think this should move up (right after the switch() using it).

Ok, I don't mind.

>> +#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__
>> +#define __XEN_PUBLIC_HVM_HVM_VCPU_H__
>> +
>> +#include "../xen.h"
>> +
>> +struct vcpu_hvm_x86_16 {
>> +    uint16_t ax;
>> +    uint16_t cx;
>> +    uint16_t dx;
>> +    uint16_t bx;
>> +    uint16_t sp;
>> +    uint16_t bp;
>> +    uint16_t si;
>> +    uint16_t di;
>> +    uint16_t ip;
>> +    uint16_t flags;
>> +
>> +    uint32_t cr0;
>> +    uint32_t cr4;
>> +
>> +    uint32_t cs_base;
>> +    uint32_t ds_base;
>> +    uint32_t ss_base;
>> +    uint32_t tr_base;
>> +    uint32_t cs_limit;
>> +    uint32_t ds_limit;
>> +    uint32_t ss_limit;
>> +    uint32_t tr_limit;
>> +    uint16_t cs_ar;
>> +    uint16_t ds_ar;
>> +    uint16_t ss_ar;
>> +    uint16_t tr_ar;
>> +};
> 
> I doubt this is very useful: While one ought to be able to start a
> guest in 16-bit mode, its GPRs still are supposed to be 32 bits wide.
> The mode used really would depend on the cs_ar setting. (Having
> the structure just for a 16-bit IP would seem insane.)

So, would you prefer to go with just the 64-bit structure, with the mode
simply set by the values of the control registers / segment selectors?

>> +struct vcpu_hvm_x86_32 {
>> +    uint32_t eax;
>> +    uint32_t ecx;
>> +    uint32_t edx;
>> +    uint32_t ebx;
>> +    uint32_t esp;
>> +    uint32_t ebp;
>> +    uint32_t esi;
>> +    uint32_t edi;
>> +    uint32_t eip;
>> +    uint16_t eflags;
> 
> uint32_t for sure.

Right.

> 
>> +    uint32_t cr0;
>> +    uint32_t cr3;
>> +    uint32_t cr4;
>> +    uint64_t efer;
> 
> What again was the point of having EFER here?

For the NX bit.

>> +    uint32_t cs_base;
>> +    uint32_t ds_base;
>> +    uint32_t ss_base;
> 
> I continue to question why we have DS here, but not ES (and maybe
> FS and GS too). I.e. either just CS and SS (which are architecturally
> required) or at least all four traditional x86 segment registers. And
> you're also clearly not targeting minimal state, or else there likely
> wouldn't be a need for e.g. R8-R15 in the 64-bit variant.

I'm fine with removing r8-r15. Regarding the segment registers, I don't
have a problem with allowing only CS and SS to be set, or with allowing
all of them, including FS and GS. But I would like to reach a consensus
on this; we have already gone back and forth several times on how this
structure should look, and TBH I was hoping this was the last time.

Andrew, Jan, what would you prefer: should DS be removed, or should ES,
FS and GS be added as well?

>> +struct vcpu_hvm_x86_64 {
>> +    uint64_t rax;
>> +    uint64_t rcx;
>> +    uint64_t rdx;
>> +    uint64_t rbx;
>> +    uint64_t rsp;
>> +    uint64_t rbp;
>> +    uint64_t rsi;
>> +    uint64_t rdi;
>> +    uint64_t r8;
>> +    uint64_t r9;
>> +    uint64_t r10;
>> +    uint64_t r11;
>> +    uint64_t r12;
>> +    uint64_t r13;
>> +    uint64_t r14;
>> +    uint64_t r15;
>> +    uint64_t rip;
>> +    uint64_t rflags;
>> +
>> +    uint64_t cr0;
>> +    uint64_t cr3;
>> +    uint64_t cr4;
>> +    uint64_t efer;
>> +
>> +    uint32_t cs_base;
>> +    uint32_t ds_base;
>> +    uint32_t ss_base;
>> +    uint32_t tr_base;
>> +    uint32_t cs_limit;
>> +    uint32_t ds_limit;
>> +    uint32_t ss_limit;
>> +    uint32_t tr_limit;
> 
> I can see the need for the TR ones here, but what's the point of the
> CS, SS, and DS ones?

In case the guest wants to start in 64-bit compatibility mode?

>> +/*
>> + * The layout of the _ar fields of the segment registers is the
>> + * following:
>> + *
>> + * Bits [0,3]: type (bits 40-43).
>> + * Bit      4: s    (descriptor type, bit 44).
>> + * Bit  [5,6]: dpl  (descriptor privilege level, bits 45-46).
>> + * Bit      7: p    (segment-present, bit 47).
>> + * Bit      8: avl  (available for system software, bit 52).
>> + * Bit      9: l    (64-bit code segment, bit 53).
>> + * Bit     10: db   (meaning depends on the segment, bit 54).
>> + * Bit     11: g    (granularity, bit 55)
>> + *
>> + * A more complete description of the meaning of this fields can be
>> + * obtained from the Intel SDM, Volume 3, section 3.4.5.
>> + */
> 
> Please make explicit what state bits 12-15 are expected to be in
> (hopefully they're being checked to be zero rather than getting
> ignored).

I will add a comment regarding bits 12-15. IMHO that checking should be
done in hvm_set_segment_register, and it's not part of this patch.
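As a worked example of the layout quoted above, an _ar value can be
composed from its fields like this (a sketch following the bit layout in
the proposed comment; bits 12-15 are left zero, matching the expectation
that they be clear):

```c
#include <stdint.h>

/* Compose a segment _ar value per the proposed layout:
 * bits [0,3] type, 4 s, [5,6] dpl, 7 p, 8 avl, 9 l, 10 db, 11 g.
 * Bits 12-15 stay zero. */
static uint16_t seg_ar(unsigned type, unsigned s, unsigned dpl,
                       unsigned p, unsigned avl, unsigned l,
                       unsigned db, unsigned g)
{
    return (type & 0xf) | (s << 4) | ((dpl & 3) << 5) | (p << 7) |
           (avl << 8) | (l << 9) | (db << 10) | (g << 11);
}

/* Flat 32-bit code segment: type=0xb (execute/read, accessed), s=1,
 * dpl=0, present, db=1, g=1 -- while g is set in the attributes, the
 * separate limit field itself stays byte-granular (0xffffffff):
 *   seg_ar(0xb, 1, 0, 1, 0, 0, 1, 1) == 0xc9b
 * Flat data segment, type=0x3 (read/write, accessed):
 *   seg_ar(0x3, 1, 0, 1, 0, 0, 1, 1) == 0xc93
 * 64-bit code segment sets l=1 and clears db:
 *   seg_ar(0xb, 1, 0, 1, 0, 1, 0, 1) == 0xa9b
 */
```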

> Please also clarify whether the limit specified for the various
> segment registers is the one present in descriptor tables (subject
> to scaling by the g bit) or (more likely) the byte granular one. In
> any event I suppose certain restrictions apply.

The limit is not subject to the g bit; the limit set here is the one
that ends up present in the descriptor tables (or, in this case, in the
cached part of the segment registers). IMHO that's implicit, and it's
what's done on bare metal.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

