
Re: [Xen-devel] [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path



 >>> On 12.12.12 at 12:35, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>  xen/arch/x86/crash.c            |  116 ++++++++++++++++++++++++++++++++++-----
>  xen/arch/x86/machine_kexec.c    |   19 ++++++
>  xen/arch/x86/x86_64/entry.S     |   34 +++++++++++
>  xen/include/asm-x86/desc.h      |   45 +++++++++++++++
>  xen/include/asm-x86/processor.h |    4 +
>  5 files changed, 203 insertions(+), 15 deletions(-)
> 
> 
> Experimentally, certain crash kernels will triple fault very early if
> they are started with NMIs disabled.  This was discovered while
> experimenting with a debug keyhandler which deliberately created a
> reentrant NMI, causing stack corruption.
> 
> Because of this bug, and because future changes to the NMI handling
> will make the kexec path more fragile, take the time now to
> bullet-proof the kexec behaviour so it is safer in more circumstances.
> 
> This patch adds three new low level routines:
>  * nmi_crash
>     This is a special NMI handler for use during a kexec crash.
>  * enable_nmis
>     This function enables NMIs by executing an iret-to-self, to
>     disengage the hardware NMI latch (see the frame sketch after
>     this list).
>  * trap_nop
>     This is a no-op handler which irets immediately.  It is placed in
>     the middle of enable_nmis to reuse the iret instruction, rather
>     than have a single lone aligned iret inflating the code size.
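> 
> (For illustration only, not part of the patch: assuming the standard
> x86-64 exception frame layout, the five-quadword frame which
> enable_nmis hand-crafts before its iretq looks like the sketch below,
> with illustrative field names.  Executing iretq with this frame
> resumes at the following instruction, and as a side effect the CPU
> drops the internal NMI-blocked latch.)
> 
>     #include <stdint.h>
> 
>     /* Sketch of the frame enable_nmis pushes, lowest address first. */
>     struct iret_frame {
>         uint64_t rip;    /* label '1:', just after the iretq */
>         uint64_t cs;     /* __HYPERVISOR_CS */
>         uint64_t rflags; /* saved by pushfq */
>         uint64_t rsp;    /* %rsp value from before the pushes */
>         uint64_t ss;     /* 0, as in the patch - a NULL SS is
>                           * acceptable for a CPL0-to-CPL0 iret */
>     };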
> 
> And adds three new IDT entry helper routines:
>  * _write_gate_lower
>     This is a substitute for using cmpxchg16b to update a 128-bit
>     structure at once.  It assumes that the top 64 bits are unchanged
>     (and ASSERT()s the fact) and performs a regular write on the lower
>     64 bits (the gate layout this relies on is sketched after this
>     list).
>  * _set_gate_lower
>     This is functionally equivalent to the already present _set_gate(),
>     except it uses _write_gate_lower rather than updating both 64bit
>     values.
>  * _update_gate_addr_lower
>     This is designed to update an IDT entry handler only, without
>     altering any other settings in the entry.  It also uses
>     _write_gate_lower.
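> 
> (As background, not part of the patch: a 64-bit IDT gate spreads the
> handler address across both quadwords of the 16-byte descriptor, while
> every other field lives in the low quadword.  This is why, whenever
> the top 32 address bits are unchanged, a single aligned 8-byte write
> to the low quadword suffices and is safe against concurrent NMI
> delivery.  A sketch of the decode, with illustrative names:)
> 
>     #include <stdint.h>
> 
>     /* Unpack a 64-bit IDT gate; bit positions per the Intel SDM. */
>     static uint64_t gate_handler_addr(uint64_t a, uint64_t b)
>     {
>         uint64_t lo  = a & 0xFFFFULL;         /* address bits 15..0  */
>         uint64_t mid = (a >> 48) & 0xFFFFULL; /* address bits 31..16 */
>         uint64_t hi  = b & 0xFFFFFFFFULL;     /* address bits 63..32 */
>         /* Also in 'a': selector = (a >> 16) & 0xFFFF,
>          * IST index = (a >> 32) & 7, type = (a >> 40) & 0xF,
>          * dpl = (a >> 45) & 3, present = (a >> 47) & 1. */
>         return lo | (mid << 16) | (hi << 32);
>     }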
> 
> The IDT entry helpers are required because:
>   * It is unsafe to attempt a disable/update/re-enable cycle on the NMI
>     or MCE IDT entries.
>   * We need to be able to update NMI handlers without changing the IST
>     entry.
> 
> 
> As a result, the new behaviour of the kexec_crash path is:
> 
> nmi_shootdown_cpus() will:
> 
>  * Disable the crashing cpu's NMI/MCE interrupt stack tables.
>     Disabling the stack tables removes race conditions which would lead
>     to corrupt exception frames and infinite loops.  As this pcpu is
>     never planning to execute a sysret back to a pv vcpu, the update is
>     safe from a security point of view.
> 
>  * Swap the NMI trap handlers.
>     The crashing pcpu gets the nop handler, to prevent it from getting
>     stuck in an NMI context and causing a hang instead of a crash.  The
>     non-crashing pcpus all get the nmi_crash handler, which is designed
>     never to return.
> 
> do_nmi_crash() will:
> 
>  * Save the crash notes and shut the pcpu down.
>     There is now an extra per-cpu variable to prevent us from executing
>     this multiple times.  If we re-enter midway through, attempt the
>     whole operation again, in preference to leaving it incomplete.
> 
>  * Set up another NMI at the LAPIC.
>     Even when the LAPIC has been disabled, the ID and command registers
>     are still usable.  As a result, we can deliberately queue up a new
>     NMI to re-interrupt us later if NMIs get unlatched (a register-level
>     sketch follows this list).  Because of the call to __stop_this_cpu(),
>     we have to hand-craft self_nmi() to be safe from General Protection
>     Faults.
> 
>  * Fall into infinite loop.
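> 
> (For reference, not part of the patch: the register-level form of the
> x2APIC case, assuming the MSR numbers documented in the Intel SDM.
> The wrmsr64()/rdmsr64() wrappers are stand-ins for the hypervisor's
> own accessors; the xAPIC case instead uses the MMIO ICR/ICR2 window,
> as the hunk below shows.)
> 
>     #include <stdint.h>
> 
>     #define MSR_X2APIC_ID  0x802        /* 32-bit local APIC ID  */
>     #define MSR_X2APIC_ICR 0x830        /* single 64-bit ICR     */
>     #define APIC_DM_NMI    0x00000400u  /* delivery mode = NMI   */
> 
>     static inline uint64_t rdmsr64(uint32_t msr)
>     {
>         uint32_t lo, hi;
>         __asm__ __volatile__ ( "rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr) );
>         return ((uint64_t)hi << 32) | lo;
>     }
> 
>     static inline void wrmsr64(uint32_t msr, uint64_t val)
>     {
>         __asm__ __volatile__ ( "wrmsr" ::
>             "c" (msr), "a" ((uint32_t)val), "d" ((uint32_t)(val >> 32)) );
>     }
> 
>     /* Queue an NMI to ourselves; it stays pending behind the hardware
>      * NMI latch and fires only if NMIs become unlatched. */
>     static void self_nmi_x2apic(void)
>     {
>         uint64_t id = rdmsr64(MSR_X2APIC_ID) & 0xFFFFFFFFULL;
>         wrmsr64(MSR_X2APIC_ICR, APIC_DM_NMI | (id << 32));
>     }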
> 
> machine_kexec() will:
> 
>   * Swap the MCE handlers to be a nop.
>      We cannot prevent MCEs from being delivered when we pass off to the
>      crash kernel, and the less Xen context is touched, the better.
> 
>   * Explicitly enable NMIs.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Acked-by: Jan Beulich <jbeulich@xxxxxxxx>

(but for committing I'd want at least one other knowledgeable
person's ack)

> --
> 
> Changes since v4:
>  * Updated stale comments, corrected spelling, and adjusted coding style.
> 
> Changes since v3:
>  * Added IDT entry helpers to safely update NMI/MCE entries.
> 
> Changes since v2:
> 
>  * Rework the alteration of the stack tables to completely remove the
>    possibility of a PV domain getting very lucky with the "NMI or MCE in
>    a 1 instruction race window on sysret" and managing to execute code
>    in the hypervisor context.
>  * Make use of set_ist() from the previous patch in the series to avoid
>    open-coding the IST manipulation.
> 
> Changes since v1:
>  * Reintroduce atomic_dec(&waiting_for_crash_ipi); which got missed
>    during the original refactoring.
>  * Fold trap_nop into the middle of enable_nmis to reuse the iret.
>  * Expand comments in areas as per Tim's suggestions.
> 
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/crash.c
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,127 @@
>  
>  static atomic_t waiting_for_crash_ipi;
>  static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>  
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>  {
> -    /* Don't do anything if this handler is invoked on crashing cpu.
> -     * Otherwise, system will completely hang. Crashing cpu can get
> -     * an NMI if system was initially booted with nmi_watchdog parameter.
> +    int cpu = smp_processor_id();
> +
> +    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> +    ASSERT(cpu != crashing_cpu);
> +
> +    /* Save crash information and shut down CPU.  Attempt only once. */
> +    if ( !this_cpu(crash_save_done) )
> +    {
> +        /* Disable the interrupt stack table for the MCE handler.  This
> +         * prevents race conditions between clearing MCIP and receiving a
> +         * new MCE, during which the exception frame would be clobbered
> +         * and the MCE handler fall into an infinite loop.  We are soon
> +         * going to disable the NMI watchdog, so the loop would not be
> +         * caught.
> +         *
> +         * We do not need to change the NMI IST, as the nmi_crash
> +         * handler is immune to corrupt exception frames, by virtue of
> +         * being designed never to return.
> +         *
> +         * This update is safe from a security point of view, as this
> +         * pcpu is never going to try to sysret back to a PV vcpu.
> +         */
> +        set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
> +
> +        kexec_crash_save_cpu();
> +        __stop_this_cpu();
> +
> +        this_cpu(crash_save_done) = 1;
> +        atomic_dec(&waiting_for_crash_ipi);
> +    }
> +
> +    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
> +     * back to its boot state, so we are unable to rely on the regular
> +     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> +     * (The likely scenario is that we have reverted from x2apic mode to
> +     * xapic, at which point #GPFs will occur if we use the apic_*
> +     * functions)
> +     *
> +     * The ICR and APIC ID of the LAPIC are still valid even during
> +     * software disable (Intel SDM Vol 3, 10.4.7.2).  As a result, we
> +     * can deliberately queue up another NMI at the LAPIC which will not
> +     * be delivered as the hardware NMI latch is currently in effect.
> +     * This means that if NMIs become unlatched (e.g. following a
> +     * non-fatal MCE), the LAPIC will force us back here rather than
> +     * wandering back into regular Xen code.
>       */
> -    if ( cpu == crashing_cpu )
> -        return 1;
> -    local_irq_disable();
> +    switch ( current_local_apic_mode() )
> +    {
> +        u32 apic_id;
>  
> -    kexec_crash_save_cpu();
> +    case APIC_MODE_X2APIC:
> +        apic_id = apic_rdmsr(APIC_ID);
>  
> -    __stop_this_cpu();
> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> +        break;
>  
> -    atomic_dec(&waiting_for_crash_ipi);
> +    case APIC_MODE_XAPIC:
> +        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> +        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> +            cpu_relax();
> +
> +        apic_mem_write(APIC_ICR2, apic_id << 24);
> +        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> +        break;
> +
> +    default:
> +        break;
> +    }
>  
>      for ( ; ; )
>          halt();
> -
> -    return 1;
>  }
>  
>  static void nmi_shootdown_cpus(void)
>  {
>      unsigned long msecs;
> +    int i, cpu = smp_processor_id();
>  
>      local_irq_disable();
>  
> -    crashing_cpu = smp_processor_id();
> +    crashing_cpu = cpu;
>      local_irq_count(crashing_cpu) = 0;
>  
>      atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> -    /* Would it be better to replace the trap vector here? */
> -    set_nmi_callback(crash_nmi_callback);
> +
> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
> +     * invokes do_nmi_crash (above), which causes them to write state and
> +     * fall into a loop.  The crashing pcpu gets the nop handler to
> +     * cause it to return to this function ASAP.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +        {
> +
> +            if ( i == cpu )
> +            {
> +                /* Disable the interrupt stack tables for this cpu's MCE
> +                 * and NMI handlers, and alter the NMI handler to have
> +                 * no operation.  Disabling the stack tables prevents
> +                 * stack corruption race conditions, while changing the
> +                 * handler helps prevent cascading faults; we are
> +                 * certainly going to crash by this point.
> +                 *
> +                 * This update is safe from a security point of view, as
> +                 * this pcpu is never going to try to sysret back to a
> +                 * PV vcpu.
> +                 */
> +                _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
> +            }
> +            else
> +                /* Do not update stack table for other pcpus. */
> +                _update_gate_addr_lower(&idt_tables[i][TRAP_nmi], &nmi_crash);
> +        }
> +
>      /* Ensure the new callback function is set before sending out the NMI. */
>      wmb();
>  
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/machine_kexec.c
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -81,12 +81,31 @@ void machine_kexec(xen_kexec_image_t *im
>          .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
>          .limit = LAST_RESERVED_GDT_BYTE
>      };
> +    int i;
>  
>      /* We are about to permanently jump out of the Xen context into the kexec
>       * purgatory code.  We really don't want to be still servicing interrupts.
>       */
>      local_irq_disable();
>  
> +    /* Now regular interrupts are disabled, we need to reduce the impact
> +     * of interrupts not disabled by 'cli'.
> +     *
> +     * The NMI handlers have already been set up by nmi_shootdown_cpus().  All
> +     * pcpus other than us have the nmi_crash handler, while we have the nop
> +     * handler.
> +     *
> +     * The MCE handlers touch extensive areas of Xen code and data.  At this
> +     * point, there is nothing we can usefully do, so set the nop handler.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +            _update_gate_addr_lower(&idt_tables[i][TRAP_machine_check], &trap_nop);
> +
> +    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
> +     * not like running with NMIs disabled. */
> +    enable_nmis();
> +
>      /*
>       * compat_machine_kexec() returns to idle pagetables, which requires us
>       * to be running on a static GDT mapping (idle pagetables have no GDT
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/x86_64/entry.S
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -635,11 +635,45 @@ ENTRY(nmi)
>          movl  $TRAP_nmi,4(%rsp)
>          jmp   handle_ist_exception
>  
> +ENTRY(nmi_crash)
> +        cli
> +        pushq $0
> +        movl $TRAP_nmi,4(%rsp)
> +        SAVE_ALL
> +        movq %rsp,%rdi
> +        callq do_nmi_crash /* Does not return */
> +        ud2
> +
>  ENTRY(machine_check)
>          pushq $0
>          movl  $TRAP_machine_check,4(%rsp)
>          jmp   handle_ist_exception
>  
> +/* Enable NMIs.  No special register assumptions.  Only %rax is not preserved. */
> +ENTRY(enable_nmis)
> +        movq  %rsp, %rax /* Grab RSP before pushing */
> +
> +        /* Set up stack frame */
> +        pushq $0               /* SS */
> +        pushq %rax             /* RSP */
> +        pushfq                 /* RFLAGS */
> +        pushq $__HYPERVISOR_CS /* CS */
> +        leaq  1f(%rip),%rax
> +        pushq %rax             /* RIP */
> +
> +        iretq /* Disable the hardware NMI latch */
> +1:
> +        retq
> +
> +/* No-op trap handler.  Required for the kexec crash path.  This is not
> + * declared with the ENTRY() macro to avoid wasted alignment space.
> + */
> +.globl trap_nop
> +trap_nop:
> +        iretq
> +
> +
> +
>  .section .rodata, "a", @progbits
>  
>  ENTRY(exception_table)
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/include/asm-x86/desc.h
> --- a/xen/include/asm-x86/desc.h
> +++ b/xen/include/asm-x86/desc.h
> @@ -106,6 +106,21 @@ typedef struct {
>      u64 a, b;
>  } idt_entry_t;
>  
> +/* Write the lower 64 bits of an IDT Entry. This relies on the upper 32
> + * bits of the address not changing, which is a safe assumption as all
> + * functions we are likely to load will live inside the 1GB
> + * code/data/bss address range.
> + *
> + * Ideally, we would use cmpxchg16b, but this is not supported on some
> + * old AMD 64bit capable processors, and has no safe equivalent.
> + */
> +static inline void _write_gate_lower(volatile idt_entry_t *gate,
> +                                     const idt_entry_t *new)
> +{
> +    ASSERT(gate->b == new->b);
> +    gate->a = new->a;
> +}
> +
>  #define _set_gate(gate_addr,type,dpl,addr)               \
>  do {                                                     \
>      (gate_addr)->a = 0;                                  \
> @@ -122,6 +137,36 @@ do {                                    
>          (1UL << 47);                                     \
>  } while (0)
>  
> +static inline void _set_gate_lower(idt_entry_t *gate, unsigned long type,
> +                                   unsigned long dpl, void *addr)
> +{
> +    idt_entry_t idte;
> +    idte.b = gate->b;
> +    idte.a =
> +        (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
> +        ((unsigned long)(dpl) << 45) |
> +        ((unsigned long)(type) << 40) |
> +        ((unsigned long)(addr) & 0xFFFFUL) |
> +        ((unsigned long)__HYPERVISOR_CS64 << 16) |
> +        (1UL << 47);
> +    _write_gate_lower(gate, &idte);
> +}
> +
> +/* Update the lower half handler of an IDT Entry, without changing any
> + * other configuration. */
> +static inline void _update_gate_addr_lower(idt_entry_t *gate, void *addr)
> +{
> +    idt_entry_t idte;
> +    idte.a = gate->a;
> +
> +    idte.b = ((unsigned long)(addr) >> 32);
> +    idte.a &= 0x0000FFFFFFFF0000ULL;
> +    idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
> +        ((unsigned long)(addr) & 0xFFFFUL);
> +
> +    _write_gate_lower(gate, &idte);
> +}
> +
>  #define _set_tssldt_desc(desc,addr,limit,type)           \
>  do {                                                     \
>      (desc)[0].b = (desc)[1].b = 0;                       \
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/include/asm-x86/processor.h
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -527,6 +527,7 @@ void do_ ## _name(struct cpu_user_regs *
>  DECLARE_TRAP_HANDLER(divide_error);
>  DECLARE_TRAP_HANDLER(debug);
>  DECLARE_TRAP_HANDLER(nmi);
> +DECLARE_TRAP_HANDLER(nmi_crash);
>  DECLARE_TRAP_HANDLER(int3);
>  DECLARE_TRAP_HANDLER(overflow);
>  DECLARE_TRAP_HANDLER(bounds);
> @@ -545,6 +546,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
>  DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
>  #undef DECLARE_TRAP_HANDLER
>  
> +void trap_nop(void);
> +void enable_nmis(void);
> +
>  void syscall_enter(void);
>  void sysenter_entry(void);
>  void sysenter_eflags_saved(void);



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

