
[Xen-devel] [PATCH RFC v3 6/6] HVM x86 deprivileged mode: Move VPIC to deprivileged mode



First steps of moving the VPIC into deprivileged mode.

For the VPIC, some of its functions are called from both privileged and
deprivileged code, and some are also called for non-HVM domains. This means
that we cannot simply convert an entire function into a depriv-only one; we
need to handle both kinds of caller. vpic_get_priority() shows one way of
doing this (see the sketch below), though there may be others. The main aim
is to minimise both code duplication and the logic needed to determine where
a call is coming from. This problem will not be unique to the VPIC, so some
thought is needed as to the best way to resolve it in general.
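
In sketch form (taken from the changes below, with the shared body shown
once), the pattern is: put the real logic in an inline helper, keep the lock
ASSERT in the privileged wrapper, and skip it in the depriv entry point,
where the lock is already held and its parent structure is unmapped anyway:

    static inline int vpic_get_priority_inline(struct hvm_hw_vpic *vpic,
                                               uint8_t mask)
    {
        int prio;

        if ( mask == 0 )
            return VPIC_PRIO_NONE;

        /* prio = ffs(mask ROR vpic->priority_add); */
        asm ( "ror %%cl,%b1 ; rep; bsf %1,%0"
              : "=r" (prio) : "q" ((uint32_t)mask), "c" (vpic->priority_add) );
        return prio;
    }

    /* Privileged callers: the lock's parent structure is reachable. */
    static int vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
    {
        ASSERT(vpic_is_locked(vpic));
        return vpic_get_priority_inline(vpic, mask);
    }

    /* Depriv entry point: lock already held, and unreachable anyway. */
    int do_depriv_vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
    {
        return vpic_get_priority_inline(vpic, mask);
    }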

A clean-up method handles a deprivileged mode domain crashing whilst holding
resources. For example, if we hold a lock, or have allocated memory which Xen
will not clean up for us when we crash, we need to release these ourselves.
Otherwise we fail the ASSERT_NOT_IN_ATOMIC in vmx_asm_vmexit_handler, due to
the unreleased lock keeping the preempt count raised, and then panic. We could
also leak memory if we allocate from a pool which Xen does not clean up for us
on crashing the domain.
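
Condensed from the hvm_deprivileged_clean_up() added below, the clean-up is
a dispatch on the operation that was in flight when the crash was requested:

    void hvm_deprivileged_clean_up(struct vcpu *vcpu, unsigned long op)
    {
        switch ( op )
        {
        case DEPRIV_OPERATION_vpic_ioport_write:
        {
            /* The vpic pointer was cached before entering depriv mode. */
            struct hvm_hw_vpic *vpic = vcpu->arch.hvm_vcpu.depriv_cleanup_data;

            spin_unlock(&container_of(vpic, struct hvm_domain,
                                      vpic[!vpic->is_master])->irq_lock);
            break;
        }
        default:
            /* Nothing held for other operations. */
            break;
        }
    }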

TODO
----
Patches 5 & 6:
 - Fix the GCC switch statement issue which causes a page fault

Patch 6:
 - Fix vpic lock release on domain crash.
 - Finish moving parts of the VPIC into deprivileged mode

KNOWN ISSUES
------------
 - Page fault in vpic_ioport_write due to GCC switch statements placing the
   jump table in .rodata, which is in the privileged mode area.

   This has been traced to the first of the switch statements in the function,
   though other switches in that function may also be affected.
   Compiled using GCC 4.9.2-10.

   You can get the offset into this function by doing:
   (RIP - (depriv_vpic_ioport_write - __hvm_deprivileged_text_start))

   It appears to be a built-in default of GCC to put switch jump tables in
   .rodata or .text, and there does not appear to be a way to change where
   they are placed (short of patching the compiler), though -fno-jump-tables
   should suppress them entirely at the cost of turning every switch into a
   compare-and-branch sequence. Note that GCC will not necessarily generate a
   jump table for every switch statement; it depends on the optimiser.

   Thus, when we relocate a deprivileged method containing a switch statement
   for which GCC has created a jump table, we take a page fault: we have not
   mapped the .rodata section into deprivileged mode, nor should we, as
   depriv must not have access to it.

   A workaround would be to patch the generated assembly so that this table
   is moved into .hvm_deprivileged.rodata. This can be done by wrapping the
   generated table in a .section .hvm_deprivileged.rodata directive; we can
   then relocate it along with the rest of the deprivileged sections (see the
   sketch below).

   Note that GCC uses RIP-relative addressing for this, so the offset of the
   depriv .rodata from the depriv .text segment will need to be the same when
   it is mapped in.
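
   For illustration only (hypothetical, hand-written labels rather than real
   GCC output), the patched assembly would have this shape:

       .section .hvm_deprivileged.rodata, "a"   # depriv-mapped rodata
   .L_depriv_jump_table:
       .quad   .L_case_0                        # one entry per switch case
       .quad   .L_case_1
       .quad   .L_case_2
       .previous                                # back to the depriv text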

Signed-off-by: Ben Catterall <Ben.Catterall@xxxxxxxxxx>
---
 xen/arch/x86/hvm/deprivileged.c         |  49 +++++++++++
 xen/arch/x86/hvm/deprivileged_syscall.c |   4 +-
 xen/arch/x86/hvm/vpic.c                 | 151 ++++++++++++++++++++++++++++----
 xen/arch/x86/traps.c                    |   5 +-
 xen/include/asm-x86/hvm/vcpu.h          |   2 +
 xen/include/xen/hvm/deprivileged.h      |   3 +
 6 files changed, 192 insertions(+), 22 deletions(-)

diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 5606f9a..9561054 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -20,7 +20,14 @@
 #include <xen/hvm/deprivileged.h>
 #include <xen/hvm/deprivileged_syscall.h>
 
+/* TODO: move to a better place than here */
+int depriv_vpic_ioport_write(unsigned long *ret_data_ptr,
+                             struct hvm_hw_vpic *vpic, int32_t addr,
+                             uint32_t val) DEPRIV_TEXT_SEGMENT;
+
+
 static depriv_syscall_t depriv_operation_table[] = {
+    DEPRIV_OPERATION(vpic_ioport_write, 4)
 };
 
 void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
@@ -641,6 +648,11 @@ int hvm_deprivileged_user_mode(unsigned long operation, register_t a,
      */
     if ( vcpu->arch.hvm_vcpu.depriv_destroy )
     {
+        /*
+         * Release any resources held for the operation we were performing
+         * before we crash the domain.
+         */
+        hvm_deprivileged_clean_up(vcpu, operation);
         domain_crash(vcpu->domain);
         return 1;
     }
@@ -763,3 +775,40 @@ int hvm_deprivileged_check_trap(const char* func_name)
 
     return 0;
 }
+
+/*
+ * Clean up when destroying the domain.
+ * When we destroy the domain whilst performing a deprivileged mode operation,
+ * we must make sure that we do not hold any locks, or any memory allocated
+ * for the deprivileged operation, which Xen will not release automatically
+ * as part of domain destruction.
+ *
+ * An example is when we crash whilst holding a lock: we need to release the
+ * lock here.
+ */
+void hvm_deprivileged_clean_up(struct vcpu *vcpu, unsigned long op)
+{
+    struct hvm_hw_vpic *vpic;
+
+    /*
+     * The vpic lock is not released if we crash the domain. This means that
+     * the preempt count is not decremented, so we fail an
+     * ASSERT_NOT_IN_ATOMIC in vmx_asm_vmexit_handler on our way back to the
+     * guest. After this, we would test for a SOFTIRQ to deschedule and then
+     * destroy the guest, but the failed assertion results in a panic
+     * instead. The solution is to release the lock when we crash the domain,
+     * once we have determined that we were performing a vpic operation.
+     */
+    switch ( op )
+    {
+    case DEPRIV_OPERATION_vpic_ioport_write:
+        /* We have cached the current vpic so that we can clean up */
+        vpic = (struct hvm_hw_vpic *)vcpu->arch.hvm_vcpu.depriv_cleanup_data;
+        printk("Cleaning up... %lx\n", (unsigned long)vpic);
+        spin_unlock(&container_of((vpic), struct hvm_domain,
+                                  vpic[!(vpic)->is_master])->irq_lock);
+        break;
+    default:
+        /* No clean-up needed for operations not covered above */
+        break;
+    }
+}
diff --git a/xen/arch/x86/hvm/deprivileged_syscall.c b/xen/arch/x86/hvm/deprivileged_syscall.c
index 34dfee9..c98ff96 100644
--- a/xen/arch/x86/hvm/deprivileged_syscall.c
+++ b/xen/arch/x86/hvm/deprivileged_syscall.c
@@ -73,11 +73,11 @@
  * Used for handling a syscall from deprivileged mode or dispatching a
  * deprivileged mode operation.
  */
-
+int do_depriv_vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask);
 
 /* This table holds the functions which can be called from deprivileged mode. */
 static depriv_syscall_t depriv_syscall_table[] = {
-
+    DEPRIV_SYSCALL(vpic_get_priority, 2),
 };
 
 /* Handle a syscall from deprivileged mode */
diff --git a/xen/arch/x86/hvm/vpic.c b/xen/arch/x86/hvm/vpic.c
index 7c2edc8..8939924 100644
--- a/xen/arch/x86/hvm/vpic.c
+++ b/xen/arch/x86/hvm/vpic.c
@@ -34,6 +34,8 @@
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/io.h>
 #include <asm/hvm/support.h>
+#include <xen/hvm/deprivileged.h>
+#include <xen/hvm/deprivileged_syscall.h>
 
 #define vpic_domain(v) (container_of((v), struct domain, \
                         arch.hvm_domain.vpic[!vpic->is_master]))
@@ -44,23 +46,63 @@
 #define vpic_is_locked(v) spin_is_locked(__vpic_lock(v))
 #define vpic_elcr_mask(v) (vpic->is_master ? (uint8_t)0xf8 : (uint8_t)0xde);
 
+/* DEPRIV operations */
+
+
 /* Return the highest priority found in mask. Return 8 if none. */
 #define VPIC_PRIO_NONE 8
-static int vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
+
+/*
+ * We need this as a separate stub because it is called from both deprivileged
+ * and privileged code. When calling it from deprivileged code, all of those
+ * code paths already hold the lock, so we don't need to take it; privileged
+ * code paths may not already hold it.
+ *
+ * NOTE: This is a general problem: BOTH deprivileged and privileged mode
+ *   sometimes access the same functions, and we need to be aware of this and
+ *   handle it.
+ *
+ * As the lock lives in the parent structure of the vpic, and we have not
+ * mapped that structure into deprivileged memory (as it should not be), we
+ * will page fault if we try to access it through the depriv vpic pointer.
+ * Thus, as we know we already hold the lock, we can avoid testing it and so
+ * avoid the page fault.
+ *
+ * TODO:
+ * To be fair, this is so small that it shouldn't be a syscall (it began as a
+ * test of the syscall mechanism); we should map it into depriv mode instead.
+ * However, it demonstrates the considerations that need to be taken into
+ * account, so I'll leave it here as an example for future implementors.
+ * Hopefully it's helpful...
+ */
+static inline int vpic_get_priority_inline(struct hvm_hw_vpic *vpic,
+                                           uint8_t mask)
 {
     int prio;
 
-    ASSERT(vpic_is_locked(vpic));
-
     if ( mask == 0 )
         return VPIC_PRIO_NONE;
 
     /* prio = ffs(mask ROR vpic->priority_add); */
     asm ( "ror %%cl,%b1 ; rep; bsf %1,%0"
           : "=r" (prio) : "q" ((uint32_t)mask), "c" (vpic->priority_add) );
+
     return prio;
 }
 
+static int vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
+{
+    /* Privileged mode access needs to test the lock */
+    ASSERT(vpic_is_locked(vpic));
+    return vpic_get_priority_inline(vpic, mask);
+}
+
+/* Deprivileged mode syscall handler for vpic_get_priority */
+int do_depriv_vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
+{
+    /* deprivileged mode already has the lock: no need to assert it */
+    return vpic_get_priority_inline(vpic, mask);
+}
+
 /* Return the PIC's highest priority pending interrupt. Return -1 if none. */
 static int vpic_get_highest_priority_irq(struct hvm_hw_vpic *vpic)
 {
@@ -181,14 +223,84 @@ static int vpic_intack(struct hvm_hw_vpic *vpic)
     return irq;
 }
 
+int depriv_vpic_ioport_write(unsigned long *ret_data_ptr,
+                             struct hvm_hw_vpic *vpic, int32_t addr,
+                             uint32_t val) DEPRIV_TEXT_SEGMENT;
+
 static void vpic_ioport_write(
     struct hvm_hw_vpic *vpic, uint32_t addr, uint32_t val)
 {
-    int priority, cmd, irq;
-    uint8_t mask, unmasked = 0;
+    struct vcpu *vcpu = get_current();
+    void *p;
+    unsigned long *ret_data_ptr;
+    int ret, irq, unmasked;
+
+    /* Cache the current vpic so we can clean up */
+    vcpu->arch.hvm_vcpu.depriv_cleanup_data = vpic;
 
     vpic_lock(vpic);
 
+    p = hvm_deprivileged_copy_data_to(vcpu, vpic, sizeof(struct hvm_hw_vpic));
+    ret_data_ptr = (unsigned long *)((unsigned long)p +
+                                     sizeof(struct hvm_hw_vpic));
+    printk("Entering..%lx\n", (unsigned long)ret_data_ptr);
+    ret = depriv4(DEPRIV_OPERATION_vpic_ioport_write,
+                  (unsigned long)ret_data_ptr, (unsigned long)p, addr, val);
+
+    printk("BACK\n");
+    if ( ret )
+        return;
+
+    /*
+     * TODO:
+     * In general, we may have to deal with structures being updated after we
+     * have done the copy, which is why we should move to aliasing where
+     * possible. As we hold a lock for the vpic here, it's not a problem in
+     * this case. Though, the lock held may be for the other vpic, so this
+     * _may_ actually be a problem; see the vpic_lock internals...
+     */
+    hvm_deprivileged_copy_data_from(vcpu, vpic, p, sizeof(struct hvm_hw_vpic));
+    vcpu->arch.hvm_vcpu.depriv_data_offset = 0;
+
+    /*
+     * Additional return values are placed after our copied data. We read
+     * them out here and then zero that memory in the data page, to prevent
+     * information leakage between deprivileged operations.
+     */
+    irq = *ret_data_ptr;
+    *ret_data_ptr = 0;
+
+    unmasked = *(++ret_data_ptr);
+    *ret_data_ptr = 0;
+
+    /* Which completion path did the depriv operation select? */
+    if ( vcpu->arch.hvm_vcpu.depriv_return_code == 0 )
+    {
+        vpic_update_int_output(vpic);
+
+        vpic_unlock(vpic);
+
+        if ( unmasked )
+            pt_may_unmask_irq(vpic_domain(vpic), NULL);
+    }
+    else
+    {
+        /* Release lock and EOI the physical interrupt (if any). */
+        vpic_update_int_output(vpic);
+        vpic_unlock(vpic);
+        hvm_dpci_eoi(current->domain,
+                     hvm_isa_irq_to_gsi((addr >> 7) ? (irq|8) : irq),
+                     NULL);
+    }
+}
+
+int depriv_vpic_ioport_write(unsigned long *ret_data_ptr,
+                             struct hvm_hw_vpic *vpic, int32_t addr,
+                             uint32_t val)
+{
+    int priority, cmd, irq;
+    uint8_t mask, unmasked = 0;
+
     if ( (addr & 1) == 0 )
     {
         if ( val & 0x10 )
@@ -243,7 +355,12 @@ static void vpic_ioport_write(
                 mask = vpic->isr;
                 if ( vpic->special_mask_mode )
                     mask &= ~vpic->imr; /* SMM: ignore masked IRs. */
-                priority = vpic_get_priority(vpic, mask);
+
+                /* Deprivileged mode system call */
+                DEPRIV_SYSCALL_CALL2(DEPRIV_SYSCALL_vpic_get_priority,
+                                     priority, /* Return value */
+                                     vpic, mask);
+
                 if ( priority == VPIC_PRIO_NONE )
                     break;
                 irq = (priority + vpic->priority_add) & 7;
@@ -257,13 +374,12 @@ static void vpic_ioport_write(
                 vpic->isr &= ~(1 << irq);
                 if ( cmd == 7 )
                     vpic->priority_add = (irq + 1) & 7;
-                /* Release lock and EOI the physical interrupt (if any). */
-                vpic_update_int_output(vpic);
-                vpic_unlock(vpic);
-                hvm_dpci_eoi(current->domain,
-                             hvm_isa_irq_to_gsi((addr >> 7) ? (irq|8) : irq),
-                             NULL);
-                return; /* bail immediately */
+
+                /* Pass data back; returning 1 selects the EOI path */
+                *ret_data_ptr = irq;
+                *(++ret_data_ptr) = unmasked;
+
+                return 1; /* bail immediately */
             case 6: /* Set Priority                */
                 vpic->priority_add = (val + 1) & 7;
                 break;
@@ -301,12 +417,9 @@ static void vpic_ioport_write(
         }
     }
 
-    vpic_update_int_output(vpic);
-
-    vpic_unlock(vpic);
-
-    if ( unmasked )
-        pt_may_unmask_irq(vpic_domain(vpic), NULL);
+    /* Pass data back; returning 0 selects the default completion path */
+    *(++ret_data_ptr) = unmasked;
+    return 0;
 }
 
 static uint32_t vpic_ioport_read(struct hvm_hw_vpic *vpic, uint32_t addr)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index f14a845..1a449a5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1525,7 +1525,10 @@ void do_page_fault(struct cpu_user_regs *regs)
     perfc_incr(page_faults);
 
     /* If we get a page fault whilst in HVM deprivileged mode */
-    if( hvm_deprivileged_check_trap(__func__) )
+    if ( is_hvm_deprivileged_vcpu() )
+        printk("Addr: %lx code: %d, rip: %lx\n", addr, error_code, regs->rip);
+
+    if ( hvm_deprivileged_check_trap(__func__) )
         return;
 
     if ( unlikely(fixup_page_fault(addr, regs) != 0) )
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index dcdecf1..482b6ae 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -217,6 +217,8 @@ struct hvm_vcpu {
     unsigned long depriv_destroy;
     unsigned long depriv_watchdog_count;
     unsigned long depriv_return_code;
+    /* Pointer to data to help with clean up if we have to crash the domain */
+    void *depriv_cleanup_data;
     /* Offset into our data page where we can put data for depriv operations */
     unsigned long depriv_data_offset;
 
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index d7228be..c640788 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -115,6 +115,9 @@ void hvm_deprivileged_restore_stacks(void);
 /* Check if we are in deprivileged mode */
 int is_hvm_deprivileged_vcpu(void);
 
+/* Clean up when destroying the domain */
+void hvm_deprivileged_clean_up(struct vcpu *vcpu, unsigned long op);
+
 /* The ring 3 code */
 void hvm_deprivileged_ring3(void);
 
-- 
2.1.4

