[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v2 3/3] Add limited support of VMware's hyper-call



On 09/02/14 04:16, Jan Beulich wrote:
On 01.09.14 at 17:33,<dslutz@xxxxxxxxxxx>  wrote:
This is both a more complete support then in currently provided by
QEMU and/or KVM and less.  The missing part requires QEMU changes
and has been left out until the QEMU patches are accepted upstream.

VMware's hyper-call is also known as VMware Backdoor I/O Port.

Note: this support does not depend on vmware_hw being non-zero.

Summary is that VMware treats "IN EAX, DX" (or "OUT DX, EAX"; or
"inl %dx, %eax" in AT&T syntax ) to port 0x5658 specially.  Note:
since many operations return data in EAX, "OUT DX, EAX" does not
work for them on VMware, I did not support the "OUT DX, EAX", but it
would not be hard to add.

Also this instruction is allowed to be used from ring 3.  To
support this the vmexit for GP needs to be enabled.  I have not
fully tested that nested HVM is doing the right thing for this.

An open source example of using this is:

http://open-vm-tools.sourceforge.net/

Which only uses "IN EAX, DX".  Also

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=disp
layKC&externalId=1009458

Lists just "inl (%%dx)" (I assume this is AT&T syntax and is the
same as "inl %dx, %eax").

The support included is enough to allow VMware tools to install in a
HVM domU and provide guestinfo support.  guestinfo support is
provide by what is known as VMware RPC support.  This guestinfo
support is provided via libxc.  libxl support has not be written.

Note: VMware RPC support is only available on HVM domU.

This interface is an extension of __HYPERVISOR_HVM_op.  It was
picked because xc_get_hvm_param() also uses it and VMware guest
info is a lot like a hvm param.

The HVMOP_get_vmport_guest_info is used by two libxc functions,
xc_get_vmport_guest_info and xc_fetch_vmport_guest_info.
xc_fetch_vmport_guest_info is designed to be used to fetch all
currently set guestinfo values.

To save on hypervisor heap memory, the guestinfo support in done in
two sizes, normal and jumbo.  Normal is used to handle up to 128
byte values and jumbo is used to handle up to 4096 byte values.

Since all this is work is done when the guest is doing a single
instruction; it was designed to not use the hypervisor heap to
allocate the memory at this time.  Instead a few are allocated at
the create domain time and during the xen's hyper-call to get or set
them.  This was picked in that if a tool stack is using the VMware
guest info support, it should be using either of both of the get and
set.  And so in this case the guest should only see an out of memory
error when the compile max amount of hypervisor heap memory is in
use.

Doing it this way does lead to a lot of pointer use and many
sub structures.

If the domU is running VMware tools, then the "build version" of
the tools is also available via xc_get_HVM_param().  This also
enables the use of new triggers that will use the VMware hyper-call
to do some limited control of the domU.  The most useful are
poweroff and reboot.  Since a guest process needs to be running
for these to work, a tool stack should check that the build version
is non zero before assuming these will work.

The 2 hvm param's HVM_PARAM_VMPORT_BUILD_NUMBER_TIME and
HVM_PARAM_VMPORT_BUILD_NUMBER_VALUE are how "build version" is
accessed.  These 2 params are only allowed to be set to zero.  The
HVM_PARAM_VMPORT_BUILD_NUMBER_TIME can be used to track the last
time the VMware tools in the guest responded.  One such use would
be the health of the tools in the guest.  The hvm param
HVM_PARAM_VMPORT_RESET_TIME controls how often to request them in
seconds minus 1.  The minus 1 is to handle to 0 case.  I.E. the
fastest that can be selected is every second.  The default is 4
times a minute.

The VMware RPC support includes the notion of channels that are
opened, active and closed.  All RPC messages sent via a channel
starts with normal ASCII text.  The message some times does include
binary data.

Currently there are 2 protocols defined for VMware RPC.  They
determine the direction for data flow, domU to tool stack or
tool stack to domU.

There is no provided interrupt for VMware RPC.

The VMware's hyper-call state is included in live migration and
save/restore.  Because the max size of the VMware guestinfo is
large, then data is compressed and expanded in the
vmport_save_domain_ctxt and vmport_load_domain_ctxt.

For a debug=y build there is a new command line option
vmport_debug=.  It enabled output to the console of various
stages of handling the "IN EAX, DX" instruction.  Most uses
are the summary ones that show complete RPC actions.

Signed-off-by: Don Slutz<dslutz@xxxxxxxxxxx>
---
  tools/libxc/xc_domain.c                      |  115 +++
  tools/libxc/xenctrl.h                        |   24 +
  tools/misc/xen-hvmctx.c                      |  229 ++++
  tools/xentrace/formats                       |    8 +
  xen/arch/x86/domctl.c                        |   34 +
  xen/arch/x86/hvm/Makefile                    |    1 +
  xen/arch/x86/hvm/hvm.c                       |  480 ++++++++-
  xen/arch/x86/hvm/io.c                        |    1 +
  xen/arch/x86/hvm/svm/emulate.c               |    3 +
  xen/arch/x86/hvm/svm/svm.c                   |   79 ++
  xen/arch/x86/hvm/svm/vmcb.c                  |    1 +
  xen/arch/x86/hvm/vmport/Makefile             |    1 +
  xen/arch/x86/hvm/vmport/includeCheck.h       |   17 +
  xen/arch/x86/hvm/vmport/vmport.c             | 1436 ++++++++++++++++++++++++++
  xen/arch/x86/hvm/vmx/vmcs.c                  |    1 +
  xen/arch/x86/hvm/vmx/vmx.c                   |   90 ++
  xen/arch/x86/hvm/vmx/vvmx.c                  |   14 +
  xen/include/asm-x86/hvm/domain.h             |    4 +
  xen/include/asm-x86/hvm/svm/emulate.h        |    1 +
  xen/include/asm-x86/hvm/trace.h              |   31 +
  xen/include/asm-x86/hvm/vmport.h             |   90 ++
  xen/include/public/arch-x86/hvm/save.h       |   39 +-
  xen/include/public/arch-x86/hvm/vmporttype.h |  105 ++
  xen/include/public/domctl.h                  |    3 +
  xen/include/public/hvm/hvm_op.h              |   18 +
  xen/include/public/hvm/params.h              |    5 +-
  xen/include/public/trace.h                   |    7 +
  27 files changed, 2834 insertions(+), 3 deletions(-)
  create mode 100644 xen/arch/x86/hvm/vmport/Makefile
  create mode 100644 xen/arch/x86/hvm/vmport/includeCheck.h
  create mode 100644 xen/arch/x86/hvm/vmport/vmport.c
  create mode 100644 xen/include/asm-x86/hvm/vmport.h
  create mode 100644 xen/include/public/arch-x86/hvm/vmporttype.h
As I think was said on v1 already - this should be split into smaller
pieces if at all possible. I'm therefore not going to do a full review
at this time, just give a couple of remarks.

The last time (v1's) split was much worse (and so I think that "split into smaller" was not said). I could check be most were "combine this patch with that patch";
and more comment message.

Having been over this code many times in the last month, a possible
quick split (which would not delay v3 too long) would be:

1) Add simple vmport commands like BDOOR_CMD_GETVERSION
2) Add the hypervisor side of VMware guest info.
3) Add the hvm params that come from VMware guest info.
4) Add the hyper calls for libxc to handle VMware guest info.

I will attempt to split the commit message to follow this also. I expect that
there will be duplication of text in the commit messages.


And should I add the unit tests also (as optional)?


Anything you can provide helps.

@@ -579,6 +580,39 @@ long arch_do_domctl(
          }
          break;
+ case XEN_DOMCTL_SENDTRIGGER_VTPOWER:
+        {
+            ret = -EINVAL;
+            if ( is_hvm_domain(d) )
+            {
+                ret = 0;
+                vmport_ctrl_send(&d->arch.hvm_domain, "OS_Halt");
+            }
+        }
+        break;
+
+        case XEN_DOMCTL_SENDTRIGGER_VTREBOOT:
+        {
+            ret = -EINVAL;
+            if ( is_hvm_domain(d) )
+            {
+                ret = 0;
+                vmport_ctrl_send(&d->arch.hvm_domain, "OS_Reboot");
+            }
+        }
+        break;
+
+        case XEN_DOMCTL_SENDTRIGGER_VTPING:
+        {
+            ret = -EINVAL;
+            if ( is_hvm_domain(d) )
+            {
+                ret = 0;
+                vmport_ctrl_send(&d->arch.hvm_domain, "ping");
+            }
+        }
+        break;
Rather than adding three new domctls, wouldn't a single VMware
one with suitable sub-operations do?

As far as I can tell these are already a "sub-operation":


case XEN_DOMCTL_sendtrigger:
{
struct vcpu *v;

ret = -EINVAL;
if ( domctl->u.sendtrigger.vcpu >= MAX_VIRT_CPUS )
break;

ret = -ESRCH;
if ( domctl->u.sendtrigger.vcpu >= d->max_vcpus ||
(v = d->vcpu[domctl->u.sendtrigger.vcpu]) == NULL )
break;

switch ( domctl->u.sendtrigger.trigger )
{
case XEN_DOMCTL_SENDTRIGGER_NMI:
...


So not sure I can do any thing here.



@@ -1480,6 +1481,7 @@ int hvm_domain_initialise(struct domain *d)
      INIT_LIST_HEAD(&d->arch.hvm_domain.ioreq_server.list);
      spin_lock_init(&d->arch.hvm_domain.irq_lock);
      spin_lock_init(&d->arch.hvm_domain.uc_lock);
+    spin_lock_init(&d->arch.hvm_domain.vmport_lock);
INIT_LIST_HEAD(&d->arch.hvm_domain.msixtbl_list);
      spin_lock_init(&d->arch.hvm_domain.msixtbl_list_lock);
@@ -1492,11 +1494,36 @@ int hvm_domain_initialise(struct domain *d)
d->arch.hvm_domain.params = xzalloc_array(uint64_t, HVM_NR_PARAMS);
      d->arch.hvm_domain.io_handler = xmalloc(struct hvm_io_handler);
+    d->arch.hvm_domain.vmport_data = xzalloc(struct vmport_state);
      rc = -ENOMEM;
-    if ( !d->arch.hvm_domain.params || !d->arch.hvm_domain.io_handler )
+    if ( !d->arch.hvm_domain.params || !d->arch.hvm_domain.io_handler ||
+         !d->arch.hvm_domain.vmport_data )
          goto fail1;
      d->arch.hvm_domain.io_handler->num_slot = 0;
+ /*
+     * Any value is fine here. In fact a random number may better.
+     * It is used to help validate that a both sides are talking
+     * about the same channel.
+     */
+    d->arch.hvm_domain.vmport_data->open_cookie = 435;
+
+    d->arch.hvm_domain.vmport_data->used_guestinfo = 10;
+    for (rc = 0;
+         rc < d->arch.hvm_domain.vmport_data->used_guestinfo;
+         rc++)
+        d->arch.hvm_domain.vmport_data->guestinfo[rc] =
+            xzalloc(vmport_guestinfo_t);
+
+    d->arch.hvm_domain.vmport_data->used_guestinfo_jumbo = 2;
+    for (rc = 0;
+         rc < d->arch.hvm_domain.vmport_data->used_guestinfo_jumbo;
+         rc++)
+        d->arch.hvm_domain.vmport_data->guestinfo_jumbo[rc] =
+            xzalloc(vmport_guestinfo_jumbo_t);
+
+    vmport_flush(&d->arch.hvm_domain);
+
All this would very likely better go into a separate function placed in
vmport.c. Same for many of the helper functions further down.
Please realize that this file is already huge, so any new addition
having a reasonable other home would be nice to be kept out of
this file.

ok, Will move a all of this into vmport.c and add more functions to
handle the code I added here.


@@ -1520,6 +1548,7 @@ int hvm_domain_initialise(struct domain *d)
register_portio_handler(d, 0xe9, 1, hvm_print_line);
      register_portio_handler(d, 0xcf8, 4, hvm_access_cf8);
+    register_portio_handler(d, VMPORT_PORT, 4, vmport_ioport);
So from a cursory look I wasn't able to spot any code making sure
that port doesn't get used for e.g. a passed through device.

There is no such code.

And in any event I'm rather uncomfortable about this getting
enabled unconditionally, also due to (but not limited to this) the
up front allocation of various memory blocks that may end up
never being needed. The main issue I see with this approach is
that guests could end up being misguided into thinking they're
running on VMware when in fact they have code in place to
optimize when running on Xen. We had such detection ordering
issues with Linux checking for both Hyper-V and Xen.


Ok, I do feel strongly that this would need to be independent on
vmware_hw "version". And so I will add a new xl.cfg item to control this.

QEMU on 5/19/2014 added a way to turn it's version off:

+@item vmport=on|off
+Enables emulation of VMWare IO port, for vmmouse etc. (enabled by default)

So should I name it vmware_vmport or just vmport ( I lean to just vmport )?
and unlike QEMU I am fine with the default of off.

@@ -1532,6 +1561,17 @@ int hvm_domain_initialise(struct domain *d)
      stdvga_deinit(d);
      vioapic_deinit(d);
   fail1:
+    if ( d->arch.hvm_domain.vmport_data )
+    {
+        struct vmport_state *vs = d->arch.hvm_domain.vmport_data;
+        int idx;
+
+        for (idx = 0; idx < vs->used_guestinfo; idx++)
You'll have to go through and fix coding style issues.

Do to the many coding styles I have used in the last 20 years, I can have
style "blindness". I do not find a style issue here based on CODING_STYLE.
All hints are welcome.

If I look in tools/libxl/CODING_STYLE, my best guess is s/idx/i/.


--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -42,6 +42,7 @@
  #include <asm/hvm/vlapic.h>
  #include <asm/hvm/trace.h>
  #include <asm/hvm/emulate.h>
+#include <asm/hvm/vmport.h>
This again being the only change to this file - why?


Looks like a left over of when I was not using register_portio_handler()

Will fix.


@@ -106,6 +107,7 @@ MAKE_INSTR(VMSAVE, 3, 0x0f, 0x01, 0xdb);
  MAKE_INSTR(STGI,   3, 0x0f, 0x01, 0xdc);
  MAKE_INSTR(CLGI,   3, 0x0f, 0x01, 0xdd);
  MAKE_INSTR(INVLPGA,3, 0x0f, 0x01, 0xdf);
+MAKE_INSTR(IN,     1, 0xed);
This name is ambiguous.

There are 4 opcodes and 6 instructions. Not at all sure how to name
the one I need. Here is the info about the IN instruction.


IN—Input from Port
Opcode Instruction Op/ 64-Bit Compat/ Description
En Mode Leg Mode
E4 ib IN AL, imm8 A Valid Valid Input byte from imm8 I/O
port address into AL.
E5 ib IN AX, imm8 A Valid Valid Input word from imm8 I/O
port address into AX.
E5 ib IN EAX, imm8 A Valid Valid Input dword from imm8 I/O
port address into EAX.
EC IN AL,DX B Valid Valid Input byte from I/O port in
DX into AL.
ED IN AX,DX B Valid Valid Input word from I/O port in
DX into AX.
ED IN EAX,DX B Valid Valid Input doubleword from I/O
port in DX into EAX.


So would "inl" be any better? or is inw_dx or inl_dx better?


+static void svm_vmexit_gp_intercept(struct cpu_user_regs *regs, struct vcpu *v)
+{
+    struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
+    unsigned long inst_len = __get_instruction_length(v, INSTR_IN);
+
+    regs->error_code = vmcb->exitinfo1;
+    if ( hvm_long_mode_enabled(v) )
+        HVMTRACE_LONG_C4D(TRAP_GP, TRAP_gp_fault, inst_len,
+                          regs->error_code,
+                          TRC_PAR_LONG(vmcb->exitinfo2));
+    else
+        HVMTRACE_C4D(TRAP_GP, TRAP_gp_fault, inst_len,
+                     regs->error_code, vmcb->exitinfo2);
+
+    if ( inst_len <= 1 && (regs->rdx & 0xffff) == VMPORT_PORT &&
What would inst_len being zero mean here?

It use to mean that AMD was missing the feature:


(XEN) [2014-08-14 20:17:28] - Next-RIP Saved on #VMEXIT

and so I needed to look at the opcode.

Now that I am using __get_instruction_length() I think it could be changed
to a 1. However I can only test on an AMD server that does have this
feature. Will switch to == 1 and hope for the best.

As far as I can tell __get_instruction_length() does not insure that the opcode
list passed is found, just that it might be.


+         vmcb->exitinfo2 == 0 && regs->error_code == 0 )
+    {
+        uint32_t magic = regs->rax;
+
+        if ( magic == VMPORT_MAGIC )
This ought to be folded with the previous if(), at once reducing
code redundancy further down in the function.

OK. But that does make the debug log message more complex to code.

+        {
+            unsigned char bytes[1] = { 0 };
Pointless initializer (and it being an array looks suspicious too).

Will drop the initializer. hvm_fetch_from_guest_virt_nofault() does take an array.
So it is better to say "&tmp, regs->rip, 1" then "bytes, regs->rip, 1" ?

Also by dropping the initializer, you are stating that
hvm_fetch_from_guest_virt_nofault() does initialize it's output on error (
because the debug log output does include bytes[0]).


@@ -2565,6 +2567,82 @@ static void vmx_idtv_reinject(unsigned long idtv_info)
      }
  }
+void do_gp_fault(struct cpu_user_regs *regs, struct vcpu *v)
A VMX-specific function shouldn't be named this way. Plus I can't see
why it's not static (in which case the name may actually be okay). And
then a lot of the code further down looks rather similar to SVM's -
please avoid such duplication.

I will look into having a common routine.


@@ -2675,6 +2753,15 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
                   && vector != TRAP_nmi
                   && vector != TRAP_machine_check )
              {
+                if ( vector == TRAP_gp_fault )
+                {
+                    VMPORT_DBG_LOG(VMPORT_LOG_REALMODE_GP,
+                                   "realmode gp: ip=%"PRIx64" ax=%"PRIx64
+                                   " bx=%"PRIx64" cx=%"PRIx64" dx=%"PRIx64
+                                   " si=%"PRIx64" di=%"PRIx64,
+                                   regs->rip, regs->rax, regs->rbx, regs->rcx,
+                                   regs->rdx, regs->rsi, regs->rdi);
+                }
Is this really useful (no matter that it'd guarded by a !NDEBUG-only
command line option?

I had found it so at the start. Not so sure it is now.

  Speaking of which - why can't you just define
a new DBG_LEVEL_VMPORT and use the existing HVM_DBG_LOG()?

I did define 23 bits to control various part of the debug. A lot of this
is because there are many times you go through the code. Basicly
a string is 1st passed a length, and the 4 bytes at a time until the
length is handled, followed by state changes...



@@ -2182,6 +2183,19 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
              if ( v->fpu_dirtied )
                  nvcpu->nv_vmexit_pending = 1;
          }
+        else if ( vector == TRAP_gp_fault )
+        {
+#ifndef NDEBUG
+            struct cpu_user_regs *ur = guest_cpu_user_regs();
+#endif
+            VMPORT_DBG_LOG(VMPORT_LOG_VGP_UNKNOWN,
+                           "Unexpected gp: ip=%"PRIx64" ax=%"PRIx64
+                           " bx=%"PRIx64" cx=%"PRIx64" dx=%"PRIx64
+                           " si=%"PRIx64" di=%"PRIx64,
+                           ur->rip, ur->rax, ur->rbx, ur->rcx, ur->rdx,
+                           ur->rsi, ur->rdi);
The #endif above ought to go here for the code to look consistent.

Ok.

+#ifndef NDEBUG
+#define VMPORT_LOG_RPC             (1 << 0)
+#define VMPORT_LOG_RECV_STATUS     (1 << 1)
+#define VMPORT_LOG_SKIP_SEND       (1 << 2)
+#define VMPORT_LOG_SEND            (1 << 3)
+#define VMPORT_LOG_SEND_SIZE_ALL   (1 << 4)
+#define VMPORT_LOG_SEND_SIZE       (1 << 5)
+#define VMPORT_LOG_RECV_SIZE_ALL   (1 << 6)
+#define VMPORT_LOG_RECV_SIZE       (1 << 7)
+#define VMPORT_LOG_CLOSE           (1 << 8)
+#define VMPORT_LOG_OPEN            (1 << 9)
+#define VMPORT_LOG_FLUSH           (1 << 10)
+#define VMPORT_LOG_TRACE           (1 << 11)
+#define VMPORT_LOG_PING            (1 << 12)
+#define VMPORT_LOG_SWEEP           (1 << 13)
+#define VMPORT_LOG_BUILD           (1 << 14)
+
+#define VMPORT_LOG_ERROR           (1 << 16)
+
+#define VMPORT_LOG_INFO_GET        (1 << 17)
+#define VMPORT_LOG_INFO_SET        (1 << 18)
+
+#define VMPORT_LOG_SKIP_MESSAGE    (1 << 19)
+
+#define VMPORT_LOG_GP_UNKNOWN      (1 << 20)
+#define VMPORT_LOG_GP_NOT_VMWARE   (1 << 21)
+#define VMPORT_LOG_GP_FAIL_RD_INST (1 << 22)
+
+#define VMPORT_LOG_GP_VMWARE_AFTER (1 << 23)
+
+#define VMPORT_LOG_VGP_UNKNOWN     (1 << 24)
+#define VMPORT_LOG_REALMODE_GP     (1 << 27)
+
+#define VMPORT_LOG_VMWARE_AFTER    (1 << 28)
Ah, here we go (as to using HVM_DBG_LOG()): Isn't this _way_ too
fine grained?

It was not when I was checking out all the code. And I do leave a lot
of the debug code in. I can drop some of them, but I find it hard
to guess at a good subset.

I have found that VMPORT_LOG_RPC, VMPORT_LOG_SEND, VMPORT_LOG_ERROR,
and VMPORT_LOG_SKIP_SEND are the 4 I most commonly turn on now. That
amount of logging seems fine on a 9600 baud serial console.


VMPORT_LOG_SKIP_MESSAGE is useful to see all the not BDOOR_CMD_MESSAGE (the 
guest info
stuff mostly, or RPC as it is also called).


And all the ones with VMPORT_LOG_GP prefix are all about the #gp handling.


Since it now includes xentrace stuff, I could drop some of the ones like
VMPORT_LOG_VMWARE_AFTER


    -Don Slutz

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.