[Xen-users] domU kernel crash after live migration on Debian 7.5
Hello,

(My initial post can be found in the Ganeti group at https://groups.google.com/forum/#!topic/ganeti/W9LJeD8cLxc; I was told that the Xen mailing list would be a better place to post this issue.)

Today I planned to reboot all nodes of my 6-node Ganeti cluster after a Debian apt-get update/upgrade. Unfortunately, after initiating the first gnt-node migrate to push a node's instances onto their secondary node so that I could reboot it, every single instance crashed. I use mirrored (DRBD) instances exactly for this purpose: to avoid any downtime of my instances when doing this kind of admin work. I have done the very same procedure in the past without encountering any problems, so my hypothesis is that there might be a bug in the specific Linux kernel version my Debian nodes are currently running (i.e. before the apt-get upgrade).
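For reference, the per-node procedure I follow looks roughly like the sketch below (node and instance names are placeholders, not the real ones):

  # live-migrate the node's primary (DRBD-mirrored) instances to their secondaries
  gnt-node migrate node1.example.com

  # update and reboot the now-drained node
  apt-get update && apt-get upgrade
  reboot

  # afterwards, attach to an instance's console (this is how I captured the output below)
  gnt-instance console instance1.example.com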
My nodes are currently running Debian 7.5; the Linux kernel in use is 3.2.57-3+deb7u1, the Xen hypervisor package is 4.1.4-3+deb7u1, and the Ganeti version is 2.9.6. I tried a second node and exactly the same thing happened, so I aborted the cluster upgrade/reboot process at that point, leaving 2 nodes updated to Debian 7.8 and 4 nodes untouched on Debian 7.5.

For your reference I have pasted below the kernel output of one of the instances that crashed (captured with gnt-instance console). Has anyone seen this behaviour before? What could be wrong here? Any clues would be appreciated, and if you need more information just ask.

Best regards,
John
[  223.547286] PM: early restore of devices complete after 0.021 msecs
[  223.557790] invalid opcode: 0000 [#1] SMP
[  223.557798] CPU 0
[  223.557801] Modules linked in: ext4 crc16 jbd2 mbcache dm_mod md_mod xen_netfront xen_blkfront
[  223.557813]
[  223.557817] Pid: 18, comm: kworker/0:1 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.57-3+deb7u1
[  223.557825] RIP: e030:[<ffffffff81243dbd>]  [<ffffffff81243dbd>] arch_get_random_long+0x5/0x15
[  223.557837] RSP: e02b:ffff88003e27fbe8  EFLAGS: 00010286
[  223.557842] RAX: 00000000cf22cf22 RBX: ffff88003e27fc58 RCX: 0000000000000000
[  223.557847] RDX: 000000000000000a RSI: 000000005d86b6d3 RDI: ffff88003e27fc50
[  223.557852] RBP: ffff88003e27fbf8 R08: 000000009a26ca0b R09: 00000000430e4169
[  223.557857] R10: 0000000079074851 R11: 00000000136505cd R12: ffff88003e27fd0e
[  223.557862] R13: ffffffff816529f4 R14: ffff88003e27fc38 R15: 0000000000000200
[  223.557870] FS:  00007f78832ac720(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  223.557876] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  223.557881] CR2: 0000000000000000 CR3: 0000000003649000 CR4: 0000000000002660
[  223.557888] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  223.557893] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  223.557899] Process kworker/0:1 (pid: 18, threadinfo ffff88003e27e000, task ffff88003e224080)
[  223.557904] Stack:
[  223.557907]  ffffffff8124410b 0000000000000000 0000000000000000 0000000000000000
[  223.557915]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  223.557923]  0000000000000000 0000000000000000 a756f87accce6bf9 54e33a9acf22cf22
[  223.557930] Call Trace:
[  223.557936]  [<ffffffff8124410b>] ? extract_buf+0xdf/0x153
[  223.557942]  [<ffffffff8124461a>] ? extract_entropy+0x75/0x12b
[  223.557951]  [<ffffffff812b3468>] ? rt_cache_invalidate+0x17/0x3b
[  223.557958]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[  223.557965]  [<ffffffff8134e81c>] ? _cond_resched+0x7/0x1c
[  223.557971]  [<ffffffff812b4c8a>] ? rt_cache_flush+0xe/0x3b
[  223.557978]  [<ffffffff812e35df>] ? fib_netdev_event+0x9c/0xac
[  223.557986]  [<ffffffff81352b41>] ? notifier_call_chain+0x2e/0x5b
[  223.557994]  [<ffffffff8128faa8>] ? netdev_state_change+0x1a/0x2c
[  223.558001]  [<ffffffff8129d532>] ? linkwatch_do_dev+0x9a/0xa8
[  223.558006]  [<ffffffff8129d7d4>] ? __linkwatch_run_queue+0x10e/0x150
[  223.558012]  [<ffffffff8129d834>] ? linkwatch_event+0x1e/0x25
[  223.558020]  [<ffffffff8105b5cf>] ? process_one_work+0x161/0x269
[  223.558026]  [<ffffffff8105c598>] ? worker_thread+0xc2/0x145
[  223.558031]  [<ffffffff8105c4d6>] ? manage_workers.isra.25+0x15b/0x15b
[  223.558037]  [<ffffffff8105f6d9>] ? kthread+0x76/0x7e
[  223.558045]  [<ffffffff81356cb4>] ? kernel_thread_helper+0x4/0x10
[  223.558050]  [<ffffffff81354d73>] ? int_ret_from_sys_call+0x7/0x1b
[  223.558056]  [<ffffffff8134fe7c>] ? retint_restore_args+0x5/0x6
[  223.558062]  [<ffffffff81356cb0>] ? gs_change+0x13/0x13
[  223.558066] Code: 43 81 48 89 e9 48 89 df e8 13 e9 e8 ff 83 f8 01 19 d2 f7 d2 83 e2 f5 5b 5d 89 d0 41 5c c3 e8 d2 02 dd ff 66 90 c3 ba 0a 00 00 00 <48> 0f c7 f0 72 04 ff ca 75 f6 48 89 07 89 d0 c3 41 57 49 89 ca
[  223.558108] RIP  [<ffffffff81243dbd>] arch_get_random_long+0x5/0x15
[  223.558114]  RSP <ffff88003e27fbe8>
[  223.558122] ---[ end trace 72eaa08f794af2c9 ]---
[  223.558170] BUG: unable to handle kernel paging request at fffffffffffffff8
[  223.558177] IP: [<ffffffff8105f8f2>] kthread_data+0x7/0xc
[  223.558184] PGD 1607067 PUD 1608067 PMD 0
[  223.558190] Oops: 0000 [#2] SMP
[  223.558195] CPU 0
[  223.558197] Modules linked in: ext4 crc16 jbd2 mbcache dm_mod md_mod xen_netfront xen_blkfront
[  223.558208]
[  223.558212] Pid: 18, comm: kworker/0:1 Tainted: G      D      3.2.0-4-amd64 #1 Debian 3.2.57-3+deb7u1
[  223.558219] RIP: e030:[<ffffffff8105f8f2>]  [<ffffffff8105f8f2>] kthread_data+0x7/0xc
[  223.558227] RSP: e02b:ffff88003e27f950  EFLAGS: 00010002
[  223.558232] RAX: 0000000000000000 RBX: ffff88003fc13780 RCX: 0000000000000000
[  223.558237] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003e224080
[  223.558242] RBP: 0000000000000000 R08: 0000000000000400 R09: ffffffff8123931e
[  223.558247] R10: dead000000200200 R11: ffffffff8123931e R12: ffff88003e27fa20
[  223.558252] R13: ffff88003e1b3740 R14: 0000000000000000 R15: ffff88003e224380
[  223.558260] FS:  00007f78832ac720(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  223.558266] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  223.558271] CR2: fffffffffffffff8 CR3: 0000000003649000 CR4: 0000000000002660
[  223.558276] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  223.558281] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  223.558287] Process kworker/0:1 (pid: 18, threadinfo ffff88003e27e000, task ffff88003e224080)
[  223.558292] Stack:
[  223.558295]  ffffffff8105c8c4 ffff88003fc13780 ffff88003e224080 ffff88003e27fa20
[  223.558303]  ffffffff8134e300 ffff88003fc0edc0 ffffffff81095bdb 0000000000013780
[  223.558310]  ffff88003e27ffd8 ffff88003e27ffd8 ffff88003e224080 ffffffff81095117
[  223.558318] Call Trace:
[  223.558323]  [<ffffffff8105c8c4>] ? wq_worker_sleeping+0xb/0x6f
[  223.558328]  [<ffffffff8134e300>] ? __schedule+0x138/0x610
[  223.558334]  [<ffffffff81095bdb>] ? __call_rcu+0x11d/0x12c
[  223.558341]  [<ffffffff81095117>] ? arch_local_irq_restore+0x7/0x8
[  223.558348]  [<ffffffff81048cfa>] ? release_task+0x31b/0x331
[  223.558354]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[  223.558359]  [<ffffffff8104a423>] ? do_exit+0x711/0x713
[  223.558365]  [<ffffffff81071057>] ? arch_local_irq_disable+0x7/0x8
[  223.558372]  [<ffffffff8134fb77>] ? _raw_spin_unlock_irqrestore+0xe/0xf
[  223.558378]  [<ffffffff8135098e>] ? oops_end+0xb1/0xb6
[  223.558384]  [<ffffffff8100e961>] ? do_invalid_op+0x87/0x91
[  223.558390]  [<ffffffff81243dbd>] ? arch_get_random_long+0x5/0x15
[  223.558396]  [<ffffffff810072b8>] ? get_phys_to_machine+0x16/0x58
[  223.558403]  [<ffffffff81004c0a>] ? pfn_to_mfn+0x12/0x27
[  223.558408]  [<ffffffff81004c32>] ? phys_to_machine+0x13/0x1c
[  223.558414]  [<ffffffff81003f67>] ? arch_local_irq_restore+0x7/0x8
[  223.558419]  [<ffffffff81004105>] ? xen_mc_flush+0x124/0x153
[  223.558425]  [<ffffffff81356b2b>] ? invalid_op+0x1b/0x20
[  223.558430]  [<ffffffff81243dbd>] ? arch_get_random_long+0x5/0x15
[  223.558435]  [<ffffffff8124410b>] ? extract_buf+0xdf/0x153
[  223.558441]  [<ffffffff8124461a>] ? extract_entropy+0x75/0x12b
[  223.558447]  [<ffffffff812b3468>] ? rt_cache_invalidate+0x17/0x3b
[  223.558452]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[  223.558457]  [<ffffffff8134e81c>] ? _cond_resched+0x7/0x1c
[  223.558463]  [<ffffffff812b4c8a>] ? rt_cache_flush+0xe/0x3b
[  223.558468]  [<ffffffff812e35df>] ? fib_netdev_event+0x9c/0xac
[  223.558474]  [<ffffffff81352b41>] ? notifier_call_chain+0x2e/0x5b
[  223.558480]  [<ffffffff8128faa8>] ? netdev_state_change+0x1a/0x2c
[  223.558485]  [<ffffffff8129d532>] ? linkwatch_do_dev+0x9a/0xa8
[  223.558491]  [<ffffffff8129d7d4>] ? __linkwatch_run_queue+0x10e/0x150
[  223.558497]  [<ffffffff8129d834>] ? linkwatch_event+0x1e/0x25
[  223.558502]  [<ffffffff8105b5cf>] ? process_one_work+0x161/0x269
[  223.558508]  [<ffffffff8105c598>] ? worker_thread+0xc2/0x145
[  223.558514]  [<ffffffff8105c4d6>] ? manage_workers.isra.25+0x15b/0x15b
[  223.558519]  [<ffffffff8105f6d9>] ? kthread+0x76/0x7e
[  223.558525]  [<ffffffff81356cb4>] ? kernel_thread_helper+0x4/0x10
[  223.558530]  [<ffffffff81354d73>] ? int_ret_from_sys_call+0x7/0x1b
[  223.558536]  [<ffffffff8134fe7c>] ? retint_restore_args+0x5/0x6
[  223.558542]  [<ffffffff81356cb0>] ? gs_change+0x13/0x13
[  223.558546] Code: 3f 48 c1 e5 03 48 c1 e0 06 48 8d b0 e0 5d 40 81 48 29 ee e8 11 32 fe ff 81 4b 14 00 00 00 04 41 59 5b 5d c3 48 8b 87 a8 02 00 00 <48> 8b 40 f8 c3 48 3b 3d ea c7 72 00 75 08 0f bf 87 72 06 00 00
[  223.558587] RIP  [<ffffffff8105f8f2>] kthread_data+0x7/0xc
[  223.558594]  RSP <ffff88003e27f950>
[  223.558597] CR2: fffffffffffffff8
[  223.558600] ---[ end trace 72eaa08f794af2ca ]---
[  223.558604] Fixing recursive fault but reboot is needed!

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users