
Re: [Xen-devel] 4.10.1 Xen crash and reboot


  • To: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Andy Smith <andy@xxxxxxxxxxxxxx>
  • Date: Fri, 21 Dec 2018 18:55:38 +0000
  • Delivery-date: Fri, 21 Dec 2018 18:55:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: id=BF15490B; url=http://strugglers.net/~andy/pubkey.asc

Hello,

And again today:

(XEN) ----[ Xen-4.10.3-pre  x86_64  debug=n   Not tainted ]----
(XEN) CPU:    4
(XEN) RIP:    e008:[<ffff82d08033f50b>] guest_4.o#sh_page_fault__guest_4+0x70b/0x2060
(XEN) RFLAGS: 0000000000010203   CONTEXT: hypervisor (d61v1)
(XEN) rax: 000000c422641dd0   rbx: ffff832005c49000   rcx: ffff81c0e0600000
(XEN) rdx: 0000000000000000   rsi: ffff832005c49000   rdi: 000000c422641dd0
(XEN) rbp: ffff81c0e0601880   rsp: ffff83207e607c38   r8:  0000000000000310
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff83207e607ef8   r13: 0000000000f9cea7   r14: 0000000000000000
(XEN) r15: ffff830079592000   cr0: 0000000080050033   cr4: 0000000000372660
(XEN) cr3: 0000001ffab1a001   cr2: ffff81c0e0601880
(XEN) fsb: 00007f89c67fc700   gsb: 0000000000000000   gss: ffff88007f300000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d08033f50b> (guest_4.o#sh_page_fault__guest_4+0x70b/0x2060):
(XEN)  49 c1 e8 1e 4a 8d 2c c1 <48> 8b 4d 00 f6 c1 01 0f 84 f8 06 00 00 48 c1 e1
(XEN) Xen stack trace from rsp=ffff83207e607c38:
(XEN)    ffff830f68748208 000000c422641dd0 ffff832005c49600 0000000000f9cea7
(XEN)    ffff832005c49660 ffff832005c49000 ffff83207e607d70 ffff83207e607d20
(XEN)    000000000c422641 0000000000000090 ffff82d0805802c0 0000000205c49000
(XEN)    0000000000000008 0000000000000880 0000000000000898 ffff82d0805802c0
(XEN)    0000000001fd58a1 0000000001ffab1a 0000000000000208 0000000000000041
(XEN)    8000000f9cea7825 01ff82d000000000 000000000000000d ffff82d000000000
(XEN)    ffff832005c49000 000000010000000d ffff83207e607fff ffff83207e607d20
(XEN)    00000000000000a1 000000c422641dd0 0000000f86569067 0000000f86544067
(XEN)    0000000f68748067 8000000f9cea7925 0000000000f87171 0000000000f86569
(XEN)    0000000000f86544 0000000000f68748 0000000000000005 ffffffffffffffff
(XEN)    ffff82e03ff56340 ffff832005c49000 0000000500007ff0 0000000000000000
(XEN)    ffff83207e607e18 ffff830079592000 ffff832005c49000 ffff83207e607ef8
(XEN)    000000c422641dd0 ffff82d08034e4b0 0000000000000000 ffff82d080349e20
(XEN)    0000000000000000 ffff83207e607fff ffff830079592000 ffff82d08034e5ae
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN)    ffff82d080354913 ffff83207e607ef8 ffff830079592000 000000c422641dd0
(XEN)    ffff832005c49000 0000000000000004 0000000000000000 ffff82d0802a1842
(XEN)    ffff82d080354913 ffff82d080354907 ffff82d080354913 ffff82d080354907
(XEN) Xen call trace:
(XEN)    [<ffff82d08033f50b>] guest_4.o#sh_page_fault__guest_4+0x70b/0x2060
(XEN)    [<ffff82d08034e4b0>] do_iret+0/0x1a0
(XEN)    [<ffff82d080349e20>] toggle_guest_pt+0x30/0x160
(XEN)    [<ffff82d08034e5ae>] do_iret+0xfe/0x1a0
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0802a1842>] do_page_fault+0x1a2/0x4e0
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d080354907>] handle_exception+0x8f/0xf9
(XEN)    [<ffff82d080354913>] handle_exception+0x9b/0xf9
(XEN)    [<ffff82d0803549d9>] x86_64/entry.S#handle_exception_saved+0x68/0x94
(XEN)
(XEN) Pagetable walk from ffff81c0e0601880:
(XEN)  L4[0x103] = 8000001ffab1a063 ffffffffffffffff
(XEN)  L3[0x103] = 8000001ffab1a063 ffffffffffffffff
(XEN)  L2[0x103] = 8000001ffab1a063 ffffffffffffffff
(XEN)  L1[0x001] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff81c0e0601880
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

The host has now been rebooted into a hypervisor with pcid=0 on its
command line.

I note that:

(XEN) RFLAGS: 0000000000010203   CONTEXT: hypervisor (d61v1)

and the previous incident (below):

(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d31v1)

These are the same guest.
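(For anyone following along: domain IDs like d31/d61 are reassigned
whenever a guest restarts, so they have to be matched against `xl list`
output captured around the time of each crash. A minimal sketch of that
lookup; the sample listing below is hypothetical, not from this host:)

```python
def domain_name_for_id(listing: str, domid: int):
    """Return the guest name for a numeric domain ID, given `xl list` output.

    `listing` is text in the shape `xl list` produces: a header row
    followed by one row per domain (Name ID Mem VCPUs State Time).
    """
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit() and int(fields[1]) == domid:
            return fields[0]
    return None  # no domain with that ID in this listing

# Hypothetical sample output; real output comes from running `xl list`.
sample = """\
Name                    ID   Mem VCPUs      State   Time(s)
Domain-0                 0  2048     2     r-----   1234.5
guest-a                 31  4096     2     -b----    987.6
"""
print(domain_name_for_id(sample, 31))  # prints guest-a
```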

Is it worth moving this guest to a test host booted without pcid=0
to see whether it crashes there, while keeping the production hosts
on pcid=0? And then setting pcid=0 on the test host to see whether
it survives longer?
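(One way to persist pcid=0 across reboots, assuming a Debian-style
GRUB setup; the variable name and regeneration command may differ on
other distros:)

```shell
# /etc/default/grub -- Debian-style GRUB configuration (assumed layout).
# Append pcid=0 to the Xen hypervisor's command line (not the dom0
# kernel's), alongside any options already set there:
GRUB_CMDLINE_XEN_DEFAULT="pcid=0"

# Then regenerate grub.cfg so it takes effect on the next boot:
#   update-grub && reboot
```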

This will take quite a long time to gain confidence in, since the
incidents are about two weeks apart each time.

Thanks,
Andy

On Mon, Dec 10, 2018 at 03:58:41PM +0000, Andy Smith wrote:
> Hi,
> 
> Up front information:
> 
> Today one of my Xen hosts crashed with this logging on the serial:
> 
> (XEN) ----[ Xen-4.10.1  x86_64  debug=n   Not tainted ]----
> (XEN) CPU:    15
> (XEN) RIP:    e008:[<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
> (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d31v1)
> (XEN) rax: ffff82e01ecfae80   rbx: 0000000f67d74025   rcx: 0000000000000000
> (XEN) rdx: ffff82e000000000   rsi: ffff81bfd79f12d8   rdi: 00000000ffffffff
> (XEN) rbp: 0000000000f67d74   rsp: ffff83202628fbd8   r8:  00000000010175c6
> (XEN) r9:  0000000000000000   r10: ffff830079592000   r11: 0000000000000000
> (XEN) r12: 0000000f67d74025   r13: ffff832020549000   r14: 0000000000f67d74
> (XEN) r15: ffff81bfd79f12d8   cr0: 0000000080050033   cr4: 0000000000372660
> (XEN) cr3: 0000001fd5b8d001   cr2: ffff81bfd79f12d8
> (XEN) fsb: 00007faf3e71f700   gsb: 0000000000000000   gss: ffff88007f300000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d08033db45> (guest_4.o#shadow_set_l1e+0x75/0x6a0):
> (XEN)  0f 20 0f 85 23 01 00 00 <4d> 8b 37 4c 39 f3 0f 84 97 01 00 00 49 89 da 89
> (XEN) Xen stack trace from rsp=ffff83202628fbd8:
> (XEN)    0000000f67d74000 00000000010175c6 0000000000000000 ffff832000000002
> (XEN)    ffff830079592000 ffff832020549000 ffff81bfd79f12d8 ffff83202628fef8
> (XEN)    00000000010175c6 0000000000f67d74 ffff830079592000 ffff82d08033fc82
> (XEN)    8000000fad0dc125 00007faf3e25bba0 ffff832020549600 0000000000f67d74
> (XEN)    0000000000f67d74 0000000000f67d74 ffff83202628fd70 ffff83202628fd20
> (XEN)    00000007faf3e25b 00000000000000c0 ffff82d0805802c0 0000000220549000
> (XEN)    00000000000007f8 00000000000005e0 0000000000000f88 ffff82d0805802c0
> (XEN)    00000000010175c6 00007faf3e25bba0 00000000000002d8 000000000000005b
> (XEN)    ffff81c0dfebcf88 01ff82d000000000 0000000f67d74025 ffff82d000000000
> (XEN)    ffff832020549000 000000010000000d ffff83202628ffff ffff83202628fd20
> (XEN)    00000000000000e9 00007faf3e25bba0 0000000f472df067 0000000f49296067
> (XEN)    0000000f499f1067 0000000f67d74125 0000000000f498cf 0000000000f472df
> (XEN)    0000000000f49296 0000000000f499f1 0000000000000015 ffffffffffffffff
> (XEN)    ffff82e03fab71a0 ffff830079593000 ffff82d0803557eb ffff82d08020bf4a
> (XEN)    0000000000000000 ffff830079592000 ffff832020549000 ffff83202628fef8
> (XEN)    0000000000000002 ffff82d08034e9b0 0000000000633400 ffff82d08034a330
> (XEN)    ffff830079592000 ffff83202628ffff ffff830079592000 ffff82d08034eaae
> (XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
> (XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
> (XEN)    ffff82d080355913 ffff82d080355907 ffff82d080355913 ffff82d080355907
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08033db45>] guest_4.o#shadow_set_l1e+0x75/0x6a0
> (XEN)    [<ffff82d08033fc82>] guest_4.o#sh_page_fault__guest_4+0x8f2/0x2060
> (XEN)    [<ffff82d0803557eb>] common_interrupt+0x9b/0x120
> (XEN)    [<ffff82d08020bf4a>] evtchn_check_pollers+0x1a/0xb0
> (XEN)    [<ffff82d08034e9b0>] do_iret+0/0x1a0
> (XEN)    [<ffff82d08034a330>] toggle_guest_pt+0x30/0x160
> (XEN)    [<ffff82d08034eaae>] do_iret+0xfe/0x1a0
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d0802a16b2>] do_page_fault+0x1a2/0x4e0
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d080355907>] handle_exception+0x8f/0xf9
> (XEN)    [<ffff82d080355913>] handle_exception+0x9b/0xf9
> (XEN)    [<ffff82d0803559d9>] x86_64/entry.S#handle_exception_saved+0x68/0x94
> (XEN) 
> (XEN) Pagetable walk from ffff81bfd79f12d8:
> (XEN)  L4[0x103] = 8000001fd5b8d063 ffffffffffffffff
> (XEN)  L3[0x0ff] = 0000000000000000 ffffffffffffffff
> (XEN) 
> (XEN) Reboot in five seconds...
> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
> 
> The same host also crashed about 2 weeks ago but I had nothing in
> place to record the serial console so I have no logs. There has also
> been one other host crash on a different host but again no
> information collected.
> 
> Longer background:
> 
> Around the weekend of 18 November I deployed a hypervisor built from
> staging-4.10 plus the outstanding XSA patches including XSA-273
> which I had up until then held off on.
> 
> As described in:
> 
>     https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg02811.html
> 
> within a few days I began noticing sporadic memory corruption in
> some guests; we established there was a bug in the L1TF fixes, and
> I was able to avoid the problem in affected guests by upgrading
> their kernels to versions carrying Linux's L1TF fixes.
> 
> During first reboot into that hypervisor one of my hosts crashed and
> rebooted, but it went by too fast for me to get any information and
> there wasn't enough scrollback on the serial console.
> 
> Since then, a different host has crashed and rebooted twice. The
> first time I have managed to log it is above.
> 
> I don't think it's a hardware fault, or at least if it is, it is
> only being tickled by something added recently. I have no evidence
> for it, but I can't help feeling it's going to be related to L1TF
> again.
> 
> Do my logs above help at all?
> 
> Is it worth me trying to work out what d31 was at the time and
> taking a closer look at that?
> 
> Production system, problem that occurs weeks apart… could be a bit
> tricky to get to the bottom of.
> 
> The host is a Debian jessie dom0 running kernel version
> linux-image-3.16.0-7-amd64 3.16.59-1. The hardware is a single
> socket Xeon D-1540. The xl info is:
> 
> host                   : hobgoblin
> release                : 3.16.0-7-amd64
> version                : #1 SMP Debian 3.16.59-1 (2018-10-03)
> machine                : x86_64
> nr_cpus                : 16
> max_cpu_id             : 15
> nr_nodes               : 1
> cores_per_socket       : 8
> threads_per_core       : 2
> cpu_mhz                : 2000
> hw_caps                : bfebfbff:77fef3ff:2c100800:00000121:00000001:001cbfbb:00000000:00000100
> virt_caps              : hvm hvm_directio
> total_memory           : 130969
> free_memory            : 4646
> sharing_freed_memory   : 0
> sharing_used_memory    : 0
> outstanding_claims     : 0
> free_cpus              : 0
> xen_major              : 4
> xen_minor              : 10
> xen_extra              : .1
> xen_version            : 4.10.1
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : fe50b33b07fd447949-x86: write to correct variable in parse_pv_l
> xen_commandline        : placeholder dom0_mem=2048M dom0_max_vcpus=2 com1=115200,8n1,0x2f8,10 console=com1,vga ucode=scan serial_tx_buffer=256k
> cc_compiler            : gcc (Debian 4.9.2-10+deb8u1) 4.9.2
> cc_compile_by          : andy
> cc_compile_domain      : prymar56.org
> cc_compile_date        : Wed Nov  7 16:52:19 UTC 2018
> build_id               : 091f7ab43ab0b6ef9208a2e593c35496517fbe91
> xend_config_format     : 4
> 
> Are there any other hypervisor command line options that would be
> beneficial to set for next time? Unfortunately unless we are very
> sure to get somewhere, or I can isolate a guest that is triggering
> this and put it on test hardware, I don't really want to keep
> rebooting this system. But I can set something so it boots into it
> next time.
> 
> Thanks,
> Andy

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

