Xen project Mailing List

[Xen-devel] 4.10.1 Xen crash and reboot

From: Andy Smith <andy@xxxxxxxxxxxxxx>

Date: Mon, 10 Dec 2018 15:58:41 +0000

Delivery-date: Mon, 10 Dec 2018 15:58:48 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Openpgp: id=BF15490B; url=http://strugglers.net/~andy/pubkey.asc

Hi,

Up front information:

Today one of my Xen hosts crashed with this logging on the serial console:

(XEN) ----[ Xen-4.10.1 x86_64 debug=n Not tainted ]----
(XEN) CPU: 15
(XEN) RIP: e008:[<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
(XEN) RFLAGS: 0000000000010246 CONTEXT: hypervisor (d31v1)
(XEN) rax: ffff82e01ecfae80 rbx: 0000000f67d74025 rcx: 0000000000000000
(XEN) rdx: ffff82e000000000 rsi: ffff81bfd79f12d8 rdi: 00000000ffffffff
(XEN) rbp: 0000000000f67d74 rsp: ffff83202628fbd8 r8: 00000000010175c6
(XEN) r9: 0000000000000000 r10: ffff830079592000 r11: 0000000000000000
(XEN) r12: 0000000f67d74025 r13: ffff832020549000 r14: 0000000000f67d74
(XEN) r15: ffff81bfd79f12d8 cr0: 0000000080050033 cr4: 0000000000372660
(XEN) cr3: 0000001fd5b8d001 cr2: ffff81bfd79f12d8
(XEN) fsb: 00007faf3e71f700 gsb: 0000000000000000 gss: ffff88007f300000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen code around <ffff82d08033db45> (guest_4.o#shadow_set_l1e+0x75/0x6a0):
(XEN) 0f 20 0f 85 23 01 00 00 <4d> 8b 37 4c 39 f3 0f 84 97 01 00 00 49 89 da 89
(XEN) Xen stack trace from rsp=ffff83202628fbd8:
(XEN) 0000000f67d74000 00000000010175c6 0000000000000000 ffff832000000002
(XEN) ffff830079592000 ffff832020549000 ffff81bfd79f12d8 ffff83202628fef8
(XEN) 00000000010175c6 0000000000f67d74 ffff830079592000 ffff82d08033fc82
(XEN) 8000000fad0dc125 00007faf3e25bba0 ffff832020549600 0000000000f67d74
(XEN) 0000000000f67d74 0000000000f67d74 ffff83202628fd70 ffff83202628fd20
(XEN) 00000007faf3e25b 00000000000000c0 ffff82d0805802c0 0000000220549000
(XEN) 00000000000007f8 00000000000005e0 0000000000000f88 ffff82d0805802c0
(XEN) 00000000010175c6 00007faf3e25bba0 00000000000002d8 000000000000005b
(XEN) ffff81c0dfebcf88 01ff82d000000000 0000000f67d74025 ffff82d000000000
(XEN) ffff832020549000 000000010000000d ffff83202628ffff ffff83202628fd20
(XEN) 00000000000000e9 00007faf3e25bba0 0000000f472df067 0000000f49296067
(XEN) 0000000f499f1067 0000000f67d74125 0000000000f498cf 0000000000f472df
(XEN) 0000000000f49296 0000000000f499f1 0000000000000015 ffffffffffffffff
(XEN) ffff82e03fab71a0 ffff830079593000 ffff82d0803557eb ffff82d08020bf4a
(XEN) 0000000000000000 ffff830079592000 ffff832020549000 ffff83202628fef8
(XEN) 0000000000000002 ffff82d08034e9b0 0000000000633400 ffff82d08034a330
(XEN) ffff830079592000 ffff83202628ffff ffff830079592000 ffff82d08034eaae
(XEN) ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
(XEN) ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
(XEN) ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
(XEN) Xen call trace:
(XEN) [<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
(XEN) [<ffff82d08033fc82>] guest_4.o#sh_page_fault__guest_4+0x8f2/0x2060
(XEN) [<ffff82d0803557eb>] common_interrupt+0x9b/0x120
(XEN) [<ffff82d08020bf4a>] evtchn_check_pollers+0x1a/0xb0
(XEN) [<ffff82d08034e9b0>] do_iret+0/0x1a0
(XEN) [<ffff82d08034a330>] toggle_guest_pt+0x30/0x160
(XEN) [<ffff82d08034eaae>] do_iret+0xfe/0x1a0
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d0802a16b2>] do_page_fault+0x1a2/0x4e0
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d080355907>] handle_exception+0x8f/0xf9
(XEN) [<ffff82d080355913>] handle_exception+0x9b/0xf9
(XEN) [<ffff82d0803559d9>] x86_64/entry.S#handle_exception_saved+0x68/0x94
(XEN)
(XEN) Pagetable walk from ffff81bfd79f12d8:
(XEN) L4[0x103] = 8000001fd5b8d063 ffffffffffffffff
(XEN) L3[0x0ff] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

The same host also crashed about 2 weeks ago, but I had nothing in place to record the serial console so I have no logs. There has also been one crash on a different host, but again no information was collected.

Longer background:

Around the weekend of 18 November I deployed a hypervisor built from staging-4.10 plus the outstanding XSA patches, including XSA-273, which I had held off on until then.

As described in:

https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg02811.html

within a few days I began noticing sporadic memory corruption issues in some guests. We established there was a bug in the L1TF fixes, and I was able to avoid the problem in affected guests by making sure to upgrade their guest kernels so they have Linux's L1TF fixes.

During the first reboot into that hypervisor one of my hosts crashed and rebooted, but it went by too fast for me to get any information and there wasn't enough scrollback on the serial console.

Since then, a different host has crashed and rebooted twice. The first time I have managed to log it is above.

I don't think it's a hardware fault, or at least if it is, it is only being tickled by something added recently. I have absolutely no idea whether that is the case, but I can't help feeling it's going to be related to L1TF again.

Do my logs above help at all? Is it worth me trying to work out what d31 was at the time and taking a closer look at that?

Production system, problem that occurs weeks apart… could be a bit tricky to get to the bottom of.

The host is a Debian jessie dom0 running kernel version linux-image-3.16.0-7-amd64 3.16.59-1.
The hardware is a single socket Xeon D-1540. The xl info is:

host                   : hobgoblin
release                : 3.16.0-7-amd64
version                : #1 SMP Debian 3.16.59-1 (2018-10-03)
machine                : x86_64
nr_cpus                : 16
max_cpu_id             : 15
nr_nodes               : 1
cores_per_socket       : 8
threads_per_core       : 2
cpu_mhz                : 2000
hw_caps                : bfebfbff:77fef3ff:2c100800:00000121:00000001:001cbfbb:00000000:00000100
virt_caps              : hvm hvm_directio
total_memory           : 130969
free_memory            : 4646
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 10
xen_extra              : .1
xen_version            : 4.10.1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : fe50b33b07fd447949-x86: write to correct variable in parse_pv_l
xen_commandline        : placeholder dom0_mem=2048M dom0_max_vcpus=2 com1=115200,8n1,0x2f8,10 console=com1,vga ucode=scan serial_tx_buffer=256k
cc_compiler            : gcc (Debian 4.9.2-10+deb8u1) 4.9.2
cc_compile_by          : andy
cc_compile_domain      : prymar56.org
cc_compile_date        : Wed Nov 7 16:52:19 UTC 2018
build_id               : 091f7ab43ab0b6ef9208a2e593c35496517fbe91
xend_config_format     : 4

Are there any other hypervisor command line options that would be beneficial to set for next time? Unfortunately unless we are very sure to get somewhere, or I can isolate a guest that is triggering this and put it on test hardware, I don't really want to keep rebooting this system. But I can set something so it boots into it next time.

Thanks,
Andy

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
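
As a cross-check on the log, the "Pagetable walk from ffff81bfd79f12d8" section reports L4[0x103] and L3[0x0ff], and the same address appears in cr2, rsi and r15. The following is a minimal sketch, assuming ordinary 4-level x86-64 paging (9 index bits per level, 12-bit page offset), of how those indices fall out of the faulting address. The function name pt_indices is made up for illustration and is not part of Xen or any Xen tooling.

```python
# Sketch: split an x86-64 virtual address into 4-level page-table indices.
# Assumes standard 4-level paging: 9 bits per level, 12-bit page offset.

FAULT_ADDR = 0xFFFF81BFD79F12D8  # cr2 / "Pagetable walk from" value in the log


def pt_indices(vaddr):
    """Return (l4, l3, l2, l1, page_offset) for a 4-level x86-64 address.

    pt_indices is a hypothetical helper name used only for this example.
    """
    return (
        (vaddr >> 39) & 0x1FF,  # L4 index: bits 47..39
        (vaddr >> 30) & 0x1FF,  # L3 index: bits 38..30
        (vaddr >> 21) & 0x1FF,  # L2 index: bits 29..21
        (vaddr >> 12) & 0x1FF,  # L1 index: bits 20..12
        vaddr & 0xFFF,          # byte offset within the 4 KiB page
    )


l4, l3, l2, l1, offset = pt_indices(FAULT_ADDR)
print(f"L4[{l4:#05x}] L3[{l3:#05x}] L2[{l2:#05x}] L1[{l1:#05x}] offset {offset:#x}")
```

The first two values agree with the L4[0x103] and L3[0x0ff] lines Xen printed. The walk in the log stops at L3, where the entry reads 0000000000000000, so the L2 and L1 indices computed here are only what the address would select if the walk had continued.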
