Re: [Xen-devel] PAT-related crash booting Linux 4.4 + Xen 4.5 on VMware ESXi

Good question. I ran my tests again, and found I'd misinterpreted the Fusion behavior. On Fusion 8.1.1, MSR_IA32_CR_PAT returns a reasonable value:

(XEN) Freed 308kB init memory.
mapping kernel into physical memory
cpu_has_pat=0 cpuid_edx(1)=f89cbf5 pat=65536
pat_init_cache_modes pat=50100070406
pat_init_cache_modes i=7 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=6 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=5 pat_val=5 cache=5
pat_init_cache_modes ok
pat_init_cache_modes i=4 pat_val=1 cache=1
pat_init_cache_modes ok
pat_init_cache_modes i=3 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=2 pat_val=7 cache=2
pat_init_cache_modes ok
pat_init_cache_modes i=1 pat_val=4 cache=4
pat_init_cache_modes ok
pat_init_cache_modes i=0 pat_val=6 cache=0
pat_init_cache_modes ok
pat_init_cache_modes pat_msg=WB WT UC- UC WC WP UC UC
about to get started...
[ 0.000000] x86/PAT: Configuration [0-7]: WB WT UC- UC WC WP UC UC

On ESXi 5.5.0, MSR_IA32_CR_PAT returns 0, and we are indeed hitting the BUG_ON in update_cache_mode_entry():

(XEN) Freed 312kB init memory.
mapping kernel into physical memory
cpu_has_pat=0 cpuid_edx(1)=f89cbf5 pat=65536
pat_init_cache_modes pat=0
pat_init_cache_modes i=7 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=6 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=5 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=4 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=3 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=2 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=1 pat_val=0 cache=3
pat_init_cache_modes ok
pat_init_cache_modes i=0 pat_val=0 cache=3
(XEN) traps.c:459:d0v0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S: fault at ffff82d0802276c3 create_bounce_frame+0x12b/0x13a

In both cases, the PAT CPUID feature bit is set, and cpu_has_pat is always 0 at this early point (so my RFC patch is wrong). The simplest fix is to call pat_init_cache_modes(pat) only if pat != 0.

This is starting to look like the same logic that's in pat_bsp_init(), which doesn't seem to be called when booting on Xen. Should it be? Was Xen deliberately excluded from this PAT emulation change?
https://groups.google.com/d/msg/linux.kernel/JoJKbCOxV0U/PM0I9d1v60kJ

--Ed

On Mon, May 23, 2016 at 1:13 PM, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx> wrote:
> On 05/23/2016 10:15 AM, Konrad Rzeszutek Wilk wrote:
>> On Fri, May 20, 2016 at 04:58:09PM -0700, Ed Swierk wrote:
>>> (XEN) traps.c:459:d0v0 Unhandled invalid opcode fault/trap [#6] on VCPU 0
>>> [ec=0000]
>>> (XEN) domain_crash_sync called from entry.S: fault at ffff82d0802286c3
>>> create_bounce_frame+0x12b/0x13a
>>> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>>> (XEN) ----[ Xen-4.5.4-pre x86_64 debug=n Not tainted ]----
>>> (XEN) CPU: 0
>>> (XEN) RIP: e033:[<ffffffff81053cbd>]
>>> (XEN) RFLAGS: 0000000000000206 EM: 1 CONTEXT: pv guest (d0v0)
>>> (XEN) rax: 0000000000000022 rbx: 00000000ffffffff rcx: 0000000000000000
>>> (XEN) rdx: 0000000000000022 rsi: 0000000000000003 rdi: 0000000000000000
>>> (XEN) rbp: ffffffff81b67ea8 rsp: ffffffff81b67e68 r8: 0000000000000001
>>> (XEN) r9: 0000000000000001 r10: ffffffff81b67f20 r11: 6c61765f74617020
>>> (XEN) r12: 0000000000000000 r13: 0000000000000003 r14: 0000000000000000
>>> (XEN) r15: ffffffff81b67ebb cr0: 000000008005003b cr4: 00000000001526b0
>>> (XEN) cr3: 00000001b16eb000 cr2: 0000000000000000
>>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
>>> (XEN) Guest stack trace from rsp=ffffffff81b67e68:
>>> (XEN) 0000000000000000 6c61765f74617020 ffffffff81053cbd 000000010000e030
>>> (XEN) 0000000000010006 ffffffff81b67ea8 000000000000e02b ffffffff81b67f20
>>> (XEN) ffffffff81b67f10 ffffffff8105b339 55ffffff81b67f10 5520204355202043
>>> (XEN) 5520204355202043 5520204355202043 0020204355202043 0000000000000000
>>> (XEN) 0000000000000000 ffffffff81b67f38 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 ffffffff81b67ff0 ffffffff82010d0a 0000000000000000
>>> (XEN) 000306f200000000 fed8320300010800 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 ffffffff81b68008 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 00000000fffedb08
>>> (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
>>> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
>>>
>>> The crash occurs in pat_init_cache_modes(), called by
>>> xen_start_kernel(). The pat value from MSR_IA32_CR_PAT is 0.
>>> Strangely, the same kernel and Xen boot just fine on VMware Fusion
>>> 8.1.1, even though the MSR is 0 there as well.
>
> Are you hitting BUG_ON in update_cache_mode_entry()? I don't think I can
> see how you can avoid it when MSR read returns 0.
>
>
>>>
>>> Anyway, guessing that it's pointless to call pat_init_cache_modes()
>>> when the CPU doesn't support PAT, I added a check for cpu_has_pat.
>>> This resolves the problem on ESXi and doesn't seem to break real
>>> hardware, though I'm not sure how to verify PAT functionality. So
>>> this is just an RFC.
>
> Can you start an HVM guest in Xen after your patch below?
>
>> Cc-ing maintainers.
>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>> index 9a29803..209f680 100644
>>> --- a/arch/x86/xen/enlighten.c
>>> +++ b/arch/x86/xen/enlighten.c
>>> @@ -1633,8 +1633,12 @@ asmlinkage __visible void __init
>>> xen_start_kernel(void)
>>>          * Modify the cache mode translation tables to match Xen's PAT
>>>          * configuration.
>>>          */
>>> -       rdmsrl(MSR_IA32_CR_PAT, pat);
>>> -       pat_init_cache_modes(pat);
>>> +       if (cpu_has_pat) {
>>> +               rdmsrl(MSR_IA32_CR_PAT, pat);
>>> +               pat_init_cache_modes(pat);
>>> +       } else {
>>> +               xen_raw_console_write("CPU does not support PAT\n");
>>> +       }
>>>
>>>         /* keep using Xen gdt for now; no urgent need to change it */
>>>
>
> This looks OK to me but I think we should first understand why you don't
> crash on Fusion.
>
> Also, PAT initialization code has been rewritten in Linux (for 4.5?) so
> I suspect this problem is only observed on earlier kernels.
>
> -boris
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel