[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH for-4.5 v8 4/7] xen: Add vmware_port support



>>> On 03.10.14 at 00:40, <dslutz@xxxxxxxxxxx> wrote:
> This is a new domain_create() flag, DOMCRF_vmware_port.  It is
> passed to domctl as XEN_DOMCTL_CDF_vmware_port.

Can you explain why a HVM param isn't suitable here?

> This is both a more complete support then in currently provided by
> QEMU and/or KVM and less.  The missing part requires QEMU changes
> and has been left out until the QEMU patches are accepted upstream.

I vaguely recall the question having been asked before, but I can't
find it to the answer to it: If qemu has support for this, why can't
you build on that rather than adding everything in the hypervisor?

> For AMD (svm) the max instruction length of 15 is hard coded.  This
> is because __get_instruction_length_from_list() has issues that when
> called from #GP handler NRIP is not available, or that NRIP may not
> be available at all on a particular HW, leading to the need read the
> instruction twice --- once in __get_instruction_length_from_list()
> and then again in vmport_gp_check(). Which is bad because memory may
> change between the reads.

I don't get the connection between the first sentence (which just
states an architectural fact) and the rest of this paragraph.

> @@ -2111,6 +2112,31 @@ svm_vmexit_do_vmsave(struct vmcb_struct *vmcb,
>       return;
>   }
> 
> +static void svm_vmexit_gp_intercept(struct cpu_user_regs *regs,
> +                                    struct vcpu *v)
> +{
> +    struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
> +    /*
> +     * Just use 15 for the instruction length; vmport_gp_check will
> +     * adjust it.  This is because
> +     * __get_instruction_length_from_list() has issues, and may
> +     * require a double read of the instruction bytes.  At some
> +     * point a new routine could be added that is based on the code
> +     * in vmport_gp_check with extensions to make it more general.
> +     * Since that routine is the only user of this code this can be
> +     * done later.
> +     */
> +    unsigned long inst_len = 15;

Surely this can be unsigned int? And the value be MAX_INST_LEN?

> --- /dev/null
> +++ b/xen/arch/x86/hvm/vmware/vmport.c
> @@ -0,0 +1,262 @@
> +/*
> + * HVM VMPORT emulation
> + *
> + * Copyright (C) 2012 Verizon Corporation
> + *
> + * This file is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License Version 2 (GPLv2)
> + * as published by the Free Software Foundation.
> + *
> + * This file is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details. <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/config.h>

No need for this.

> +#define MAX_INST_LEN 15

Please move SVM's identical definition into e.g. asm-x86/processor.h
or even x86_emulate/x86_emulate.h (so it can also be used in
x86_emulate/x86_emulate.c), and avoid adding another instance
here.

> +#ifndef NDEBUG
> +unsigned int opt_vmport_debug __read_mostly;
> +integer_param("vmport_debug", opt_vmport_debug);
> +#endif

If this was used anywhere, the variable ought to be static. But
since it seems unused, it ought to be dropped.

> +/* More VMware defines */
> +
> +#define VMWARE_GUI_AUTO_GRAB              0x001
> +#define VMWARE_GUI_AUTO_UNGRAB            0x002
> +#define VMWARE_GUI_AUTO_SCROLL            0x004
> +#define VMWARE_GUI_AUTO_RAISE             0x008
> +#define VMWARE_GUI_EXCHANGE_SELECTIONS    0x010
> +#define VMWARE_GUI_WARP_CURSOR_ON_UNGRAB  0x020
> +#define VMWARE_GUI_FULL_SCREEN            0x040
> +
> +#define VMWARE_GUI_TO_FULL_SCREEN         0x080
> +#define VMWARE_GUI_TO_WINDOW              0x100
> +
> +#define VMWARE_GUI_AUTO_RAISE_DISABLED    0x200
> +
> +#define VMWARE_GUI_SYNC_TIME              0x400

What do all of the above mean? Without any explanation it is
impossible to understand why reporting any of them set below
is correct/acceptable.

> +int vmport_ioport(int dir, uint32_t port, uint32_t bytes, uint32_t *val)
> +{
> +    struct cpu_user_regs *regs = guest_cpu_user_regs();
> +    uint16_t cmd = regs->rcx;

As you already have most other variables needed only inside the if()
below declared in that scope, please be consistent with this one.
Albeit the value of this variable is questionable anyway - it's being
used exactly once.

> +    int rc = X86EMUL_OKAY;
> +
> +    if ( regs->_eax == BDOOR_MAGIC )

With this, is handling other than 32-bit in/out really meaningful/
correct?

> +        case BDOOR_CMD_GETHWVERSION:
> +            /* vmware_hw */
> +            regs->_eax = 0;
> +            if ( is_hvm_vcpu(curr) )

Since you can't get here for PV, I can't see what you need this
conditional for.

> +            {
> +                struct hvm_domain *hd = &d->arch.hvm_domain;
> +
> +                regs->_eax = hd->params[HVM_PARAM_VMWARE_HW];
> +            }
> +            if ( !regs->_eax )
> +                regs->_eax = 4;  /* Act like version 4 */

Why version 4?

> +        case BDOOR_CMD_GETTIME:
> +            value = get_localtime_us(d) - d->time_offset_seconds * 
> 1000000ULL;
> +            /* hostUsecs */
> +            regs->_ebx = value % 1000000UL;
> +            /* hostSecs */
> +            regs->_eax = value / 1000000ULL;
> +            /* maxTimeLag */
> +            regs->_ecx = 1000000;
> +            /* offset to GMT in minutes */
> +            regs->_edx = d->time_offset_seconds / 60;
> +            break;
> +        case BDOOR_CMD_GETTIMEFULL:
> +            value = get_localtime_us(d) - d->time_offset_seconds * 
> 1000000ULL;
> +            /* ... */

???

> +            regs->_eax = BDOOR_MAGIC;

regs->_eax already has this value.

> +            /* hostUsecs */
> +            regs->_ebx =value / 1000000ULL;
> +            /* maxTimeLag */
> +            regs->_ecx = 1000000;
> +            break;

Perhaps this should share code with BDOOR_CMD_GETTIME; I have
to admit though that I can't make any sense of why the latter one
has a FULL suffix when it returns _less_ information.

> +        if ( dir == IOREQ_READ )
> +        {
> +            switch ( bytes )
> +            {
> +            case 1:
> +                regs->rax = (saved_rax & 0xffffff00) | (regs->rax & 0xff);
> +                break;
> +            case 2:
> +                regs->rax = (saved_rax & 0xffff0000) | (regs->rax & 0xffff);
> +                break;

Both of these zero the high 32 bits when they shouldn't. But also see
below.

> +            case 4:
> +                regs->rax = regs->_eax;
> +                break;
> +            }
> +            *val = regs->rax;
> +        }
> +        else
> +            regs->rax = saved_rax;

This is all rather dubious - instead of clobbering reg->rax within the
earlier switch, write the value to a local variable and then merge it
here. But as much as above, the question on what to do with
operand size being other than 32-bit - in particular for the cases
where other registers get modified - is relevant here too. Even more
so that the "port" function parameter isn't even checked (and hence
you'd also handle e.g. "in %dx,%al" with %dx being BDOOR_PORT
+ 1, 2, or 3 afaict).

> +int vmport_gp_check(struct cpu_user_regs *regs, struct vcpu *v,
> +                    unsigned long *inst_len, unsigned long inst_addr,
> +                    unsigned long ei1, unsigned long ei2)
> +{
> +    if ( !v->domain->arch.hvm_domain.is_vmware_port_enabled )
> +        return X86EMUL_VMPORT_NOT_ENABLED;
> +
> +    if ( *inst_len && *inst_len <= MAX_INST_LEN &&
> +         (regs->rdx & 0xffff) == BDOOR_PORT && ei1 == 0 && ei2 == 0 &&

regs->_edx may yield slightly better code; I wonder whether we
shouldn't extend __DECL_REG() to also give us ->dx (and maybe
even ->dl, ->dh, etc).

These ei1/ei2 checks belong in the callers imo - even if both SVM
and VMX happen to have them be zero in the cases you're
interested in, these are still vendor dependent values which
shouldn't be interpreted by vendor independent code.

> +         regs->_eax == BDOOR_MAGIC )
> +    {
> +        int i = 0;

unsigned int

> +        uint32_t val;
> +        uint32_t byte_cnt = hvm_guest_x86_mode(v);

Not the best variable name - x86emul uses op_bytes. Which gets me
to a fundamental question: With all this custom decoding and
linearizing of CS:RIP, did you investigate using x86emul with suitably
set up callbacks instead? That would e.g. at once make you properly
ignore benign instruction prefixes (you look for 0x66 only below).

> +        unsigned char bytes[MAX_INST_LEN];
> +        unsigned int fetch_len;
> +        int frc;
> +
> +        /* in or out are limited to 32bits */
> +        if ( byte_cnt > 4 )
> +            byte_cnt = 4;
> +
> +        /*
> +         * Fetch up to the next page break; we'll fetch from the
> +         * next page later if we have to.
> +         */
> +        fetch_len = min_t(unsigned int, *inst_len,
> +                          PAGE_SIZE - (inst_addr  & ~PAGE_MASK));
> +        frc = hvm_fetch_from_guest_virt_nofault(bytes, inst_addr, fetch_len,
> +                                                PFEC_page_present);
> +        if ( frc != HVMCOPY_okay )
> +        {
> +            gdprintk(XENLOG_WARNING,
> +                     "Bad instruction fetch at %#lx (frc=%d il=%lu fl=%u)\n",
> +                     (unsigned long) inst_addr, frc, *inst_len, fetch_len);

Pointless cast. But the value of log messages like this one is
questionable anyway.

> +            return X86EMUL_VMPORT_FETCH_ERROR_BYTE1;
> +        }
> +
> +        /* Check for operand size prefix */
> +        while ( (i < MAX_INST_LEN) && (bytes[i] == 0x66) )
> +        {
> +            i++;
> +            if ( i >= fetch_len )
> +            {
> +                frc = hvm_fetch_from_guest_virt_nofault(
> +                    &bytes[fetch_len], inst_addr + fetch_len,
> +                    MAX_INST_LEN - fetch_len, PFEC_page_present);
> +                if ( frc != HVMCOPY_okay )
> +                {
> +                    gdprintk(XENLOG_WARNING,
> +                             "Bad instruction fetch at %#lx + %#x 
> (frc=%d)\n",
> +                             inst_addr, fetch_len, frc);
> +                    return X86EMUL_VMPORT_FETCH_ERROR_BYTE2;
> +                }
> +                fetch_len = MAX_INST_LEN;
> +            }
> +        }
> +        *inst_len = i + 1;

i may be MAX_INST_LEN already when you get here.

> +
> +        /* Only adjust byte_cnt 1 time */
> +        if ( bytes[0] == 0x66 )     /* operand size prefix */
> +        {
> +            if ( byte_cnt == 4 )
> +                byte_cnt = 2;
> +            else
> +                byte_cnt = 4;
> +        }

Iirc REX.W set following 0x66 cancels the effect of the latter. Another
thing x86emul would be taking care of for you if you used it.

Also this byte_cnt handling isn't correct for the real and VM86 mode
cases (where hvm_guest_x86_mode() returns 0/1 respectively).

> +        if ( bytes[i] == 0xed )     /* in (%dx),%eax or in (%dx),%ax */
> +            return vmport_ioport(IOREQ_READ, BDOOR_PORT, byte_cnt, &val);
> +        else if ( bytes[i] == 0xec )     /* in (%dx),%al */
> +            return vmport_ioport(IOREQ_READ, BDOOR_PORT, 1, &val);
> +        else if ( bytes[i] == 0xef )     /* out %eax,(%dx) or out %ax,(%dx) 
> */
> +            return vmport_ioport(IOREQ_WRITE, BDOOR_PORT, byte_cnt, &val);
> +        else if ( bytes[i] == 0xee )     /* out %al,(%dx) */
> +            return vmport_ioport(IOREQ_WRITE, BDOOR_PORT, 1, &val);
> +        else
> +        {
> +            *inst_len = 0; /* This is unknown. */
> +            return X86EMUL_VMPORT_BAD_OPCODE;
> +        }

switch() please.

> +static void vmx_vmexit_gp_intercept(struct cpu_user_regs *regs,
> +                                    struct vcpu *v)
> +{
> +    unsigned long exit_qualification;
> +    unsigned long inst_len;
> +    unsigned long inst_addr = vmx_rip2pointer(regs, v);
> +    unsigned long ecode;
> +    int rc;
> +#ifndef NDEBUG
> +    unsigned long orig_inst_len;
> +    unsigned long vector;
> +
> +    __vmread(VM_EXIT_INTR_INFO, &vector);
> +    BUG_ON(!(vector & INTR_INFO_VALID_MASK));
> +    BUG_ON(!(vector & INTR_INFO_DELIVER_CODE_MASK));
> +#endif

If you use ASSERT() instead of BUG_ON(), I think you can avoid most
of this preprocessor conditional.

> +    __vmread(EXIT_QUALIFICATION, &exit_qualification);
> +    __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);

get_instruction_length(). But is it architecturally defined that
#GP intercept vmexits actually set this field?

> @@ -2182,6 +2183,8 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
>               if ( v->fpu_dirtied )
>                   nvcpu->nv_vmexit_pending = 1;
>           }
> +        else if ( vector == TRAP_gp_fault )
> +            nvcpu->nv_vmexit_pending = 1;

Doesn't that mean an unconditional vmexit even if the L1 hypervisor
didn't ask for such?

> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -25,7 +25,7 @@
>   #include <public/hvm/ioreq.h>
>   #include <public/event_channel.h>
> 
> -#define MAX_IO_HANDLER             16
> +#define MAX_IO_HANDLER             17

If you're really getting beyond 16 (which I don't see, I'm counting
14 current users) this should be bumped by more than just 1.

> +/*
> + * Additional return values from vmport_gp_check.
> + *
> + * Note: return values include:
> + *   X86EMUL_OKAY
> + *   X86EMUL_UNHANDLEABLE
> + *   X86EMUL_EXCEPTION
> + *   X86EMUL_RETRY
> + *   X86EMUL_CMPXCHG_FAILED
> + *
> + * The additional do not overlap any of the above.
> + */
> +#define X86EMUL_VMPORT_NOT_ENABLED              10
> +#define X86EMUL_VMPORT_FETCH_ERROR_BYTE1        11
> +#define X86EMUL_VMPORT_FETCH_ERROR_BYTE2        12
> +#define X86EMUL_VMPORT_BAD_OPCODE               13
> +#define X86EMUL_VMPORT_BAD_STATE                14

Going through the patch, you only ever return these, but never
check for any of them. Why do you add these in the first place,
risking future collisions even if there are none now?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.