[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Random crash during guest start on x86


  • To: Bertrand Marquis <Bertrand.Marquis@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 25 Jul 2022 18:16:00 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=G3nTpai3WeGvgri+l+g5JnF3+TSlBuZUsjH2tkTfrDY=; b=jp16fZxwWcy7nJQa4UFo3EPYLrc4VNQdxuhFayUOWqcY6z107tV3zgOn7WFgYCSqXveNNAqfOWMYdKO0+OP31ZQwlH6baURDbDnoDOQrfhpW/IMtL+5q/YMQsv5n4sYn8jyFkd9Am9TKTu6Q2hpljZZlwnjushDh3++Qxap0WqOnW5pD2IdKrfE5agwFanDoEyrl/HLB1zeWk1UXeySvtUBd6rNfXyinwzWud9VwpnDSJu17g9CFNawHVDRjPJFVwXuhcJGLKCgW6tjQcTJiJ1cBTUzN/INYlDghHu5HniRi9t8gEA3sTOZGjUx8fBNTmEQciNVlIl4IHHTkn9wQvg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=GYV9Jcvl4ByIgm1wGejF8a9+71qljUKI01wDc88Lg1+nyade4t7lyn3cyPEMjIY09h9O0qIqHrSmMT7Q7HfqZwNmNWUu2Oy5ZJaJFW5nOG+RezUMJ+zT6cluJoIlMK/QUoYnzu8xMCAGecvE6eS6XBOSwbPZMPnUSUtGKlNqeMa4+CR5sop50b1/JeLPOSeCpwwB/ZST6xE8ZKJMbtrNCIlDZgYQNDfk+R+s5M42QeavNVharOvYRv8kuIUCawsYDZPrBsKlP7coiqHn5QK0pl7eKSFRnTqfo/Qih3Kdt2/i6ZgVWY2/1ALrmR/5awtPTZ/oFvoAviZ6nmLMyK0Q3g==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Mon, 25 Jul 2022 16:16:12 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 25.07.2022 17:51, Bertrand Marquis wrote:
> On our CI we have randomly a crash during guest boot on x86.

Afaict of a PV guest.

> We are running on qemu x86_64 using Xen staging.

Which may introduce unusual timing. An issue never hit on actual hardware
_may_ (but doesn't have to be) one in qemu itself.

> The crash is happening randomly (something like 1 out of 20 times).
> 
> This is always happening on the first guest we start, we never got it after 
> first guest was successfully started.
> 
> Please tell me if you need any other info.
> 
> Here is the guest kernel log:
>[...]
> [    6.679020] general protection fault, maybe for address 0x8800: 0000 [#1] 
> PREEMPT SMP NOPTI
> [    6.679020] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.17.6 #1
> [    6.679020] RIP: e030:error_entry+0xaf/0xe0
> [    6.679020] Code: 29 89 c8 48 39 84 24 88 00 00 00 74 15 48 81 bc 24 88 00 
> 00 00 63 10 e0 81 75 03 0f 01 f8 90 90 90 c3 48 89 8c 24 88 00 00 00 <0f> 01 
> f8 90 90 90 eb 11 0f 20 d8 90 90 90 90 90 48 25 ff e7 ff ff

This is SWAPGS, which supposedly a PV guest should never hit. Data further
down suggests the kernel is still in the process of patching alternatives,
which may be the reason for the insn to still be there (being at a point
where exceptions are still unexpected).

> [    6.679020] RSP: e02b:ffffffff82803a90 EFLAGS: 00000046
> [    6.679020] RAX: 0000000000008800 RBX: 0000000000000000 RCX: 
> ffffffff81e00fa7
> [    6.679020] RDX: 0000000000000000 RSI: ffffffff81e009f8 RDI: 
> 00000000000000eb
> [    6.679020] RBP: 0000000000000000 R08: 0000000000000000 R09: 
> 0000000000000000
> [    6.679020] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [    6.679020] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 0000000000000000
> [    6.679020] FS:  0000000000000000(0000) GS:ffff88801f200000(0000) 
> knlGS:0000000000000000
> [    6.679020] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    6.679020] CR2: 0000000000000000 CR3: 000000000280c000 CR4: 
> 0000000000050660
> [    6.679020] Call Trace:
> [    6.679020]  <TASK>
> [    6.679020] RIP: e030:native_irq_return_iret+0x0/0x2
> [    6.679020] Code: 41 5d 41 5c 5d 5b 41 5b 41 5a 41 59 41 58 58 59 5a 5e 5f 
> 48 83 c4 08 eb 0a 0f 1f 00 90 66 0f 1f 44 00 00 f6 44 24 20 04 75 02 <48> cf 
> 57 0f 01 f8 eb 12 0f 20 df 90 90 90 90 90 48 81 e7 ff e7 ff
> [    6.679020] RSP: e02b:ffffffff82803b48 EFLAGS: 00000046 ORIG_RAX: 
> 000000000000e030
> [    6.679020] RAX: 0000000000008800 RBX: ffffffff82803be0 RCX: 
> ffffffff81e00f95
> [    6.679020] RDX: ffffffff81e00f94 RSI: ffffffff81e00f95 RDI: 
> 00000000000000eb
> [    6.679020] RBP: 00000000000000eb R08: 0000000090001f0f R09: 
> 0000000000000007
> [    6.679020] R10: ffffffff81e00f94 R11: ffffffff8285a6c0 R12: 
> 0000000000000000
> [    6.679020] R13: ffffffff81e00f94 R14: 0000000000000006 R15: 
> 0000000000000006
> [    6.679020]  ? asm_exc_general_protection+0x8/0x30
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1b/0x27
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1b/0x27
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1c/0x27
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1b/0x27
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1c/0x27
> [    6.679020] RIP: e030:insn_get_opcode.part.0+0xab/0x180
> [    6.679020] Code: 00 00 8b 43 4c a9 c0 07 08 00 0f 84 bf 00 00 00 c6 43 1c 
> 01 31 c0 5b 5d c3 83 e2 03 be 01 00 00 00 eb b7 89 ef e8 65 e4 ff ff <89> 43 
> 4c a8 30 75 21 e9 8e 00 00 00 0f b6 7b 03 40 84 ff 75 73 8b
> [    6.679020] RSP: e02b:ffffffff82803b70 EFLAGS: 00000246
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1b/0x27
> [    6.679020]  insn_get_modrm+0x6c/0x120
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1b/0x27
> [    6.679020]  insn_get_sib+0x40/0x80
> [    6.679020]  insn_get_displacement+0x82/0x100
> [    6.679020]  insn_decode+0xf8/0x100
> [    6.679020]  optimize_nops+0x60/0x1e0
> [    6.679020]  ? rcu_nmi_exit+0x2b/0x140
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1b/0x27
> [    6.679020]  ? native_iret+0x3/0x7
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1c/0x27
> [    6.679020]  ? restore_regs_and_return_to_kernel+0x1b/0x27
> [    6.679020]  apply_alternatives+0x165/0x2e0

I have to admit that I'm a little lost with these "modern" stack traces,
where contexts apparently switch without being clearly annotated. It is
looking a little as if a #GP fault was happening somewhere here (hence
the asm_exc_general_protection further up), but I cannot work out where
(what insn) that would have come from.

You may want to add some debugging code to the hypervisor to tell you
where exactly that #GP (if there is one in the first place) originates
from. With that it may then become a little more clear what's actually
going on (and why the behavior is random).

As a final remark - you've Cc-ed the x86 hypervisor maintainers, but at
least from the data which is available so far this is more likely a
kernel issue. So kernel folks might be of more help ...

Jan

> [    6.679020]  ? insn_get_opcode.part.0+0xab/0x180
> [    6.679020]  ? insn_get_modrm+0x6c/0x120
> [    6.679020]  ? rcu_nmi_enter+0x2b/0xf0
> [    6.679020]  ? rcu_nmi_exit+0x2b/0x140
> [    6.679020]  ? irqentry_exit+0x29/0x30
> [    6.679020]  ? exc_xen_hypervisor_callback+0x8/0x10
> [    6.679020]  ? relocate_restore_code+0x22a/0x240
> [    6.679020]  alternative_instructions+0x8b/0x13a
> [    6.679020]  check_bugs+0xa87/0xadf
> [    6.679020]  ? __get_locked_pte+0xa8/0xf0
> [    6.679020]  start_kernel+0x64c/0x680
> [    6.679020]  xen_start_kernel+0x592/0x59f
> [    6.679020]  startup_xen+0x3e/0x3e
> [    6.679020]  </TASK>
> [    6.679020] Modules linked in:
> [    6.679020] ---[ end trace 0000000000000000 ]---
> [    6.679020] RIP: e030:error_entry+0xaf/0xe0
> [    6.679020] Code: 29 89 c8 48 39 84 24 88 00 00 00 74 15 48 81 bc 24 88 00 
> 00 00 63 10 e0 81 75 03 0f 01 f8 90 90 90 c3 48 89 8c 24 88 00 00 00 <0f> 01 
> f8 90 90 90 eb 11 0f 20 d8 90 90 90 90 90 48 25 ff e7 ff ff
> [    6.679020] RSP: e02b:ffffffff82803a90 EFLAGS: 00000046
> [    6.679020] RAX: 0000000000008800 RBX: 0000000000000000 RCX: 
> ffffffff81e00fa7
> [    6.679020] RDX: 0000000000000000 RSI: ffffffff81e009f8 RDI: 
> 00000000000000eb
> [    6.679020] RBP: 0000000000000000 R08: 0000000000000000 R09: 
> 0000000000000000
> [    6.679020] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [    6.679020] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 0000000000000000
> [    6.679020] FS:  0000000000000000(0000) GS:ffff88801f200000(0000) 
> knlGS:0000000000000000
> [    6.679020] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    6.679020] CR2: 0000000000000000 CR3: 000000000280c000 CR4: 
> 0000000000050660
> [    6.679020] Kernel panic - not syncing: Attempted to kill the idle task!
> 
> Cheers
> Bertrand
> 




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.