
Re: [Xen-devel] [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path



 >>> On 12.12.12 at 12:35, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>  xen/arch/x86/crash.c            |  116 ++++++++++++++++++++++++++++++++++-----
>  xen/arch/x86/machine_kexec.c    |   19 ++++++
>  xen/arch/x86/x86_64/entry.S     |   34 +++++++++++
>  xen/include/asm-x86/desc.h      |   45 +++++++++++++++
>  xen/include/asm-x86/processor.h |    4 +
>  5 files changed, 203 insertions(+), 15 deletions(-)
> 
> 
> Experimentally, certain crash kernels will triple fault very early if
> they are started with NMIs disabled.  This was discovered while
> experimenting with a debug keyhandler which deliberately created a
> reentrant NMI, causing stack corruption.
> 
> Because of this bug, and because future changes to the NMI handling
> will make the kexec path more fragile, take the time now to
> bullet-proof the kexec behaviour so it is safer in more circumstances.
> 
> This patch adds three new low level routines:
>  * nmi_crash
>     This is a special NMI handler for use during a kexec crash.
>  * enable_nmis
>     This function enables NMIs by executing an iret-to-self, to
>     disengage the hardware NMI latch (see the frame sketch after
>     this list).
>  * trap_nop
>     This is a no-op handler which irets immediately.  It is placed in
>     the middle of enable_nmis to reuse the iret instruction, rather
>     than have a single lone aligned iret inflating the code size.
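> 
> (For illustration only, not part of the patch: assuming the standard
> x86-64 exception frame layout, the five-quadword frame which
> enable_nmis hand-crafts before its iretq looks like the sketch below,
> with illustrative field names.  Executing iretq with this frame
> resumes at the following instruction, and as a side effect the CPU
> drops the internal NMI-blocked latch.)
> 
>     #include <stdint.h>
> 
>     /* Sketch of the frame enable_nmis pushes, lowest address first. */
>     struct iret_frame {
>         uint64_t rip;    /* label '1:', just after the iretq */
>         uint64_t cs;     /* __HYPERVISOR_CS */
>         uint64_t rflags; /* saved by pushfq */
>         uint64_t rsp;    /* %rsp value from before the pushes */
>         uint64_t ss;     /* 0, as in the patch - a NULL SS is
>                           * acceptable for a CPL0-to-CPL0 iret */
>     };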
> 
> And adds three new IDT entry helper routines:
>  * _write_gate_lower
>     This is a substitute for using cmpxchg16b to update a 128-bit
>     structure at once.  It assumes that the top 64 bits are unchanged
>     (and ASSERT()s the fact) and performs a regular write on the lower
>     64 bits (the gate layout this relies on is sketched after this
>     list).
>  * _set_gate_lower
>     This is functionally equivalent to the already present _set_gate(),
>     except it uses _write_gate_lower rather than updating both 64bit
>     values.
>  * _update_gate_addr_lower
>     This is designed to update an IDT entry handler only, without
>     altering any other settings in the entry.  It also uses
>     _write_gate_lower.
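> 
> (As background, not part of the patch: a 64-bit IDT gate spreads the
> handler address across both quadwords of the 16-byte descriptor, while
> every other field lives in the low quadword.  This is why, whenever
> the top 32 address bits are unchanged, a single aligned 8-byte write
> to the low quadword suffices and is safe against concurrent NMI
> delivery.  A sketch of the decode, with illustrative names:)
> 
>     #include <stdint.h>
> 
>     /* Unpack a 64-bit IDT gate; bit positions per the Intel SDM. */
>     static uint64_t gate_handler_addr(uint64_t a, uint64_t b)
>     {
>         uint64_t lo  = a & 0xFFFFULL;         /* address bits 15..0  */
>         uint64_t mid = (a >> 48) & 0xFFFFULL; /* address bits 31..16 */
>         uint64_t hi  = b & 0xFFFFFFFFULL;     /* address bits 63..32 */
>         /* Also in 'a': selector = (a >> 16) & 0xFFFF,
>          * IST index = (a >> 32) & 7, type = (a >> 40) & 0xF,
>          * dpl = (a >> 45) & 3, present = (a >> 47) & 1. */
>         return lo | (mid << 16) | (hi << 32);
>     }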
> 
> The IDT entry helpers are required because:
>   * It is unsafe to attempt a disable/update/re-enable cycle on the NMI
>     or MCE IDT entries.
>   * We need to be able to update NMI handlers without changing the IST
>     entry.
> 
> 
> As a result, the new behaviour of the kexec_crash path is:
> 
> nmi_shootdown_cpus() will:
> 
>  * Disable the crashing cpu's NMI/MCE interrupt stack tables.
>     Disabling the stack tables removes race conditions which would lead
>     to corrupt exception frames and infinite loops.  As this pcpu is
>     never planning to execute a sysret back to a pv vcpu, the update is
>     safe from a security point of view.
> 
>  * Swap the NMI trap handlers.
>     The crashing pcpu gets the nop handler, to prevent it from getting
>     stuck in an NMI context and causing a hang instead of a crash.  The
>     non-crashing pcpus all get the nmi_crash handler, which is designed
>     never to return.
> 
> do_nmi_crash() will:
> 
>  * Save the crash notes and shut the pcpu down.
>     There is now an extra per-cpu variable to prevent us from executing
>     this multiple times.  If we re-enter midway through, attempt the
>     whole operation again, in preference to leaving it incomplete.
> 
>  * Set up another NMI at the LAPIC.
>     Even when the LAPIC has been disabled, the ID and command registers
>     are still usable.  As a result, we can deliberately queue up a new
>     NMI to re-interrupt us later if NMIs get unlatched (a register-level
>     sketch follows this list).  Because of the call to __stop_this_cpu(),
>     we have to hand-craft self_nmi() to be safe from General Protection
>     Faults.
> 
>  * Fall into infinite loop.
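> 
> (For reference, not part of the patch: the register-level form of the
> x2APIC case, assuming the MSR numbers documented in the Intel SDM.
> The wrmsr64()/rdmsr64() wrappers are stand-ins for the hypervisor's
> own accessors; the xAPIC case instead uses the MMIO ICR/ICR2 window,
> as the hunk below shows.)
> 
>     #include <stdint.h>
> 
>     #define MSR_X2APIC_ID  0x802        /* 32-bit local APIC ID  */
>     #define MSR_X2APIC_ICR 0x830        /* single 64-bit ICR     */
>     #define APIC_DM_NMI    0x00000400u  /* delivery mode = NMI   */
> 
>     static inline uint64_t rdmsr64(uint32_t msr)
>     {
>         uint32_t lo, hi;
>         __asm__ __volatile__ ( "rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr) );
>         return ((uint64_t)hi << 32) | lo;
>     }
> 
>     static inline void wrmsr64(uint32_t msr, uint64_t val)
>     {
>         __asm__ __volatile__ ( "wrmsr" ::
>             "c" (msr), "a" ((uint32_t)val), "d" ((uint32_t)(val >> 32)) );
>     }
> 
>     /* Queue an NMI to ourselves; it stays pending behind the hardware
>      * NMI latch and fires only if NMIs become unlatched. */
>     static void self_nmi_x2apic(void)
>     {
>         uint64_t id = rdmsr64(MSR_X2APIC_ID) & 0xFFFFFFFFULL;
>         wrmsr64(MSR_X2APIC_ICR, APIC_DM_NMI | (id << 32));
>     }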
> 
> machine_kexec() will:
> 
>   * Swap the MCE handlers to be a nop.
>      We cannot prevent MCEs from being delivered when we pass off to the
>      crash kernel, and the less Xen context is touched, the better.
> 
>   * Explicitly enable NMIs.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Acked-by: Jan Beulich <jbeulich@xxxxxxxx>

(but for committing I'd want at least one other knowledgeable
person's ack)

> --
> 
> Changes since v4:
>  * Updated stale comments, corrected spelling, and adjusted coding style.
> 
> Changes since v3:
>  * Added IDT entry helpers to safely update NMI/MCE entries.
> 
> Changes since v2:
> 
>  * Rework the alteration of the stack tables to completely remove the
>    possibility of a PV domain getting very lucky with the "NMI or MCE in
>    a 1 instruction race window on sysret" and managing to execute code
>    in the hypervisor context.
>  * Make use of set_ist() from the previous patch in the series to avoid
>    open-coding the IST manipulation.
> 
> Changes since v1:
>  * Reintroduce atomic_dec(&waiting_for_crash_ipi); which got missed
>    during the original refactoring.
>  * Fold trap_nop into the middle of enable_nmis to reuse the iret.
>  * Expand comments in areas as per Tim's suggestions.
> 
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/crash.c
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,127 @@
>  
>  static atomic_t waiting_for_crash_ipi;
>  static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>  
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>  {
> -    /* Don't do anything if this handler is invoked on crashing cpu.
> -     * Otherwise, system will completely hang. Crashing cpu can get
> -     * an NMI if system was initially booted with nmi_watchdog parameter.
> +    int cpu = smp_processor_id();
> +
> +    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> +    ASSERT(cpu != crashing_cpu);
> +
> +    /* Save crash information and shut down CPU.  Attempt only once. */
> +    if ( !this_cpu(crash_save_done) )
> +    {
> +        /* Disable the interrupt stack table for the MCE handler.  This
> +         * prevents race conditions between clearing MCIP and receiving a
> +         * new MCE, during which the exception frame would be clobbered
> +         * and the MCE handler fall into an infinite loop.  We are soon
> +         * going to disable the NMI watchdog, so the loop would not be
> +         * caught.
> +         *
> +         * We do not need to change the NMI IST, as the nmi_crash
> +         * handler is immune to corrupt exception frames, by virtue of
> +         * being designed never to return.
> +         *
> +         * This update is safe from a security point of view, as this
> +         * pcpu is never going to try to sysret back to a PV vcpu.
> +         */
> +        set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
> +
> +        kexec_crash_save_cpu();
> +        __stop_this_cpu();
> +
> +        this_cpu(crash_save_done) = 1;
> +        atomic_dec(&waiting_for_crash_ipi);
> +    }
> +
> +    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
> +     * back to its boot state, so we are unable to rely on the regular
> +     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> +     * (The likely scenario is that we have reverted from x2apic mode to
> +     * xapic, at which point #GPFs will occur if we use the apic_*
> +     * functions)
> +     *
> +     * The ICR and APIC ID of the LAPIC are still valid even during
> +     * software disable (Intel SDM Vol 3, 10.4.7.2).  As a result, we
> +     * can deliberately queue up another NMI at the LAPIC which will not
> +     * be delivered as the hardware NMI latch is currently in effect.
> +     * This means that if NMIs become unlatched (e.g. following a
> +     * non-fatal MCE), the LAPIC will force us back here rather than
> +     * wandering back into regular Xen code.
>       */
> -    if ( cpu == crashing_cpu )
> -        return 1;
> -    local_irq_disable();
> +    switch ( current_local_apic_mode() )
> +    {
> +        u32 apic_id;
>  
> -    kexec_crash_save_cpu();
> +    case APIC_MODE_X2APIC:
> +        apic_id = apic_rdmsr(APIC_ID);
>  
> -    __stop_this_cpu();
> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> +        break;
>  
> -    atomic_dec(&waiting_for_crash_ipi);
> +    case APIC_MODE_XAPIC:
> +        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> +        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> +            cpu_relax();
> +
> +        apic_mem_write(APIC_ICR2, apic_id << 24);
> +        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> +        break;
> +
> +    default:
> +        break;
> +    }
>  
>      for ( ; ; )
>          halt();
> -
> -    return 1;
>  }
>  
>  static void nmi_shootdown_cpus(void)
>  {
>      unsigned long msecs;
> +    int i, cpu = smp_processor_id();
>  
>      local_irq_disable();
>  
> -    crashing_cpu = smp_processor_id();
> +    crashing_cpu = cpu;
>      local_irq_count(crashing_cpu) = 0;
>  
>      atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> -    /* Would it be better to replace the trap vector here? */
> -    set_nmi_callback(crash_nmi_callback);
> +
> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
> +     * invokes do_nmi_crash (above), which causes them to write state and
> +     * fall into a loop.  The crashing pcpu gets the nop handler to
> +     * cause it to return to this function ASAP.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +        {
> +
> +            if ( i == cpu )
> +            {
> +                /* Disable the interrupt stack tables for this cpu's MCE
> +                 * and NMI handlers, and alter the NMI handler to have
> +                 * no operation.  Disabling the stack tables prevents
> +                 * stack corruption race conditions, while changing the
> +                 * handler helps prevent cascading faults; we are
> +                 * certainly going to crash by this point.
> +                 *
> +                 * This update is safe from a security point of view, as
> +                 * this pcpu is never going to try to sysret back to a
> +                 * PV vcpu.
> +                 */
> +                _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
> +            }
> +            else
> +                /* Do not update stack table for other pcpus. */
> +                _update_gate_addr_lower(&idt_tables[i][TRAP_nmi], &nmi_crash);
> +        }
> +
>      /* Ensure the new callback function is set before sending out the NMI. */
>      wmb();
>  
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/machine_kexec.c
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -81,12 +81,31 @@ void machine_kexec(xen_kexec_image_t *im
>          .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
>          .limit = LAST_RESERVED_GDT_BYTE
>      };
> +    int i;
>  
>      /* We are about to permanently jump out of the Xen context into the kexec
>       * purgatory code.  We really don't want to be still servicing interrupts.
>       */
>      local_irq_disable();
>  
> +    /* Now regular interrupts are disabled, we need to reduce the impact
> +     * of interrupts not disabled by 'cli'.
> +     *
> +     * The NMI handlers have already been set up by nmi_shootdown_cpus().  All
> +     * pcpus other than us have the nmi_crash handler, while we have the nop
> +     * handler.
> +     *
> +     * The MCE handlers touch extensive areas of Xen code and data.  At this
> +     * point, there is nothing we can usefully do, so set the nop handler.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +            _update_gate_addr_lower(&idt_tables[i][TRAP_machine_check], &trap_nop);
> +
> +    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
> +     * not like running with NMIs disabled. */
> +    enable_nmis();
> +
>      /*
>       * compat_machine_kexec() returns to idle pagetables, which requires us
>       * to be running on a static GDT mapping (idle pagetables have no GDT
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/x86_64/entry.S
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -635,11 +635,45 @@ ENTRY(nmi)
>          movl  $TRAP_nmi,4(%rsp)
>          jmp   handle_ist_exception
>  
> +ENTRY(nmi_crash)
> +        cli
> +        pushq $0
> +        movl $TRAP_nmi,4(%rsp)
> +        SAVE_ALL
> +        movq %rsp,%rdi
> +        callq do_nmi_crash /* Does not return */
> +        ud2
> +
>  ENTRY(machine_check)
>          pushq $0
>          movl  $TRAP_machine_check,4(%rsp)
>          jmp   handle_ist_exception
>  
> +/* Enable NMIs.  No special register assumptions.  Only %rax is not preserved. */
> +ENTRY(enable_nmis)
> +        movq  %rsp, %rax /* Grab RSP before pushing */
> +
> +        /* Set up stack frame */
> +        pushq $0               /* SS */
> +        pushq %rax             /* RSP */
> +        pushfq                 /* RFLAGS */
> +        pushq $__HYPERVISOR_CS /* CS */
> +        leaq  1f(%rip),%rax
> +        pushq %rax             /* RIP */
> +
> +        iretq /* Disable the hardware NMI latch */
> +1:
> +        retq
> +
> +/* No-op trap handler.  Required for the kexec crash path.  This is not
> + * declared with the ENTRY() macro to avoid wasted alignment space.
> + */
> +.globl trap_nop
> +trap_nop:
> +        iretq
> +
> +
> +
>  .section .rodata, "a", @progbits
>  
>  ENTRY(exception_table)
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/include/asm-x86/desc.h
> --- a/xen/include/asm-x86/desc.h
> +++ b/xen/include/asm-x86/desc.h
> @@ -106,6 +106,21 @@ typedef struct {
>      u64 a, b;
>  } idt_entry_t;
>  
> +/* Write the lower 64 bits of an IDT Entry. This relies on the upper 32
> + * bits of the address not changing, which is a safe assumption as all
> + * functions we are likely to load will live inside the 1GB
> + * code/data/bss address range.
> + *
> + * Ideally, we would use cmpxchg16b, but this is not supported on some
> + * old AMD 64bit capable processors, and has no safe equivalent.
> + */
> +static inline void _write_gate_lower(volatile idt_entry_t *gate,
> +                                     const idt_entry_t *new)
> +{
> +    ASSERT(gate->b == new->b);
> +    gate->a = new->a;
> +}
> +
>  #define _set_gate(gate_addr,type,dpl,addr)               \
>  do {                                                     \
>      (gate_addr)->a = 0;                                  \
> @@ -122,6 +137,36 @@ do {                                    
>          (1UL << 47);                                     \
>  } while (0)
>  
> +static inline void _set_gate_lower(idt_entry_t *gate, unsigned long type,
> +                                   unsigned long dpl, void *addr)
> +{
> +    idt_entry_t idte;
> +    idte.b = gate->b;
> +    idte.a =
> +        (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
> +        ((unsigned long)(dpl) << 45) |
> +        ((unsigned long)(type) << 40) |
> +        ((unsigned long)(addr) & 0xFFFFUL) |
> +        ((unsigned long)__HYPERVISOR_CS64 << 16) |
> +        (1UL << 47);
> +    _write_gate_lower(gate, &idte);
> +}
> +
> +/* Update the lower half handler of an IDT Entry, without changing any
> + * other configuration. */
> +static inline void _update_gate_addr_lower(idt_entry_t *gate, void *addr)
> +{
> +    idt_entry_t idte;
> +    idte.a = gate->a;
> +
> +    idte.b = ((unsigned long)(addr) >> 32);
> +    idte.a &= 0x0000FFFFFFFF0000ULL;
> +    idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
> +        ((unsigned long)(addr) & 0xFFFFUL);
> +
> +    _write_gate_lower(gate, &idte);
> +}
> +
>  #define _set_tssldt_desc(desc,addr,limit,type)           \
>  do {                                                     \
>      (desc)[0].b = (desc)[1].b = 0;                       \
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/include/asm-x86/processor.h
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -527,6 +527,7 @@ void do_ ## _name(struct cpu_user_regs *
>  DECLARE_TRAP_HANDLER(divide_error);
>  DECLARE_TRAP_HANDLER(debug);
>  DECLARE_TRAP_HANDLER(nmi);
> +DECLARE_TRAP_HANDLER(nmi_crash);
>  DECLARE_TRAP_HANDLER(int3);
>  DECLARE_TRAP_HANDLER(overflow);
>  DECLARE_TRAP_HANDLER(bounds);
> @@ -545,6 +546,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
>  DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
>  #undef DECLARE_TRAP_HANDLER
>  
> +void trap_nop(void);
> +void enable_nmis(void);
> +
>  void syscall_enter(void);
>  void sysenter_entry(void);
>  void sysenter_eflags_saved(void);



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

