[Xen-users] domU kernel crash after live migration on Debian 7.5
Hello,

(My initial post can be found in the Ganeti group at https://groups.google.com/forum/#!topic/ganeti/W9LJeD8cLxc; I was told that the Xen mailing list would be a better place to post this issue.)

Today I planned to reboot all nodes of my 6-node Ganeti cluster after a Debian apt-get update/upgrade. Unfortunately, after initiating the first gnt-node migrate to push a node's instances onto their secondary node so that I could reboot it, every single instance crashed. I use mirrored (DRBD) instances exactly for this purpose: to avoid any downtime of my instances when doing this kind of admin work. I have done the very same procedure in the past without encountering any problems, so my hypothesis is that there might be a bug in the specific Linux kernel version my Debian nodes are currently running (i.e. before the apt-get upgrade).
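For reference, the per-node procedure I follow looks roughly like the sketch below (node and instance names are placeholders, not the real ones):

  # live-migrate the node's primary (DRBD-mirrored) instances to their secondaries
  gnt-node migrate node1.example.com

  # update and reboot the now-drained node
  apt-get update && apt-get upgrade
  reboot

  # afterwards, attach to an instance's console (this is how I captured the output below)
  gnt-instance console instance1.example.com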
My nodes are currently running Debian 7.5; the Linux kernel in use is 3.2.57-3+deb7u1, the Xen hypervisor package is 4.1.4-3+deb7u1, and the Ganeti version is 2.9.6. I tried a second node and exactly the same thing happened, so I aborted the cluster upgrade/reboot process at that point, leaving 2 nodes updated to Debian 7.8 and 4 nodes untouched on Debian 7.5.

For your reference I have pasted below the kernel output of one of the instances that crashed (captured with gnt-instance console). Has anyone seen this behaviour before? What could be wrong here? Any clues would be appreciated, and if you need more information just ask.

Best regards,
John
[  223.547286] PM: early restore of devices complete after 0.021 msecs
[  223.557790] invalid opcode: 0000 [#1] SMP
[  223.557798] CPU 0
[  223.557801] Modules linked in: ext4 crc16 jbd2 mbcache dm_mod md_mod xen_netfront xen_blkfront
[  223.557813]
[  223.557817] Pid: 18, comm: kworker/0:1 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.57-3+deb7u1
[  223.557825] RIP: e030:[<ffffffff81243dbd>]  [<ffffffff81243dbd>] arch_get_random_long+0x5/0x15
[  223.557837] RSP: e02b:ffff88003e27fbe8  EFLAGS: 00010286
[  223.557842] RAX: 00000000cf22cf22 RBX: ffff88003e27fc58 RCX: 0000000000000000
[  223.557847] RDX: 000000000000000a RSI: 000000005d86b6d3 RDI: ffff88003e27fc50
[  223.557852] RBP: ffff88003e27fbf8 R08: 000000009a26ca0b R09: 00000000430e4169
[  223.557857] R10: 0000000079074851 R11: 00000000136505cd R12: ffff88003e27fd0e
[  223.557862] R13: ffffffff816529f4 R14: ffff88003e27fc38 R15: 0000000000000200
[  223.557870] FS:  00007f78832ac720(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  223.557876] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  223.557881] CR2: 0000000000000000 CR3: 0000000003649000 CR4: 0000000000002660
[  223.557888] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  223.557893] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  223.557899] Process kworker/0:1 (pid: 18, threadinfo ffff88003e27e000, task ffff88003e224080)
[  223.557904] Stack:
[  223.557907]  ffffffff8124410b 0000000000000000 0000000000000000 0000000000000000
[  223.557915]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  223.557923]  0000000000000000 0000000000000000 a756f87accce6bf9 54e33a9acf22cf22
[  223.557930] Call Trace:
[  223.557936]  [<ffffffff8124410b>] ? extract_buf+0xdf/0x153
[  223.557942]  [<ffffffff8124461a>] ? extract_entropy+0x75/0x12b
[  223.557951]  [<ffffffff812b3468>] ? rt_cache_invalidate+0x17/0x3b
[  223.557958]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[  223.557965]  [<ffffffff8134e81c>] ? _cond_resched+0x7/0x1c
[  223.557971]  [<ffffffff812b4c8a>] ? rt_cache_flush+0xe/0x3b
[  223.557978]  [<ffffffff812e35df>] ? fib_netdev_event+0x9c/0xac
[  223.557986]  [<ffffffff81352b41>] ? notifier_call_chain+0x2e/0x5b
[  223.557994]  [<ffffffff8128faa8>] ? netdev_state_change+0x1a/0x2c
[  223.558001]  [<ffffffff8129d532>] ? linkwatch_do_dev+0x9a/0xa8
[  223.558006]  [<ffffffff8129d7d4>] ? __linkwatch_run_queue+0x10e/0x150
[  223.558012]  [<ffffffff8129d834>] ? linkwatch_event+0x1e/0x25
[  223.558020]  [<ffffffff8105b5cf>] ? process_one_work+0x161/0x269
[  223.558026]  [<ffffffff8105c598>] ? worker_thread+0xc2/0x145
[  223.558031]  [<ffffffff8105c4d6>] ? manage_workers.isra.25+0x15b/0x15b
[  223.558037]  [<ffffffff8105f6d9>] ? kthread+0x76/0x7e
[  223.558045]  [<ffffffff81356cb4>] ? kernel_thread_helper+0x4/0x10
[  223.558050]  [<ffffffff81354d73>] ? int_ret_from_sys_call+0x7/0x1b
[  223.558056]  [<ffffffff8134fe7c>] ? retint_restore_args+0x5/0x6
[  223.558062]  [<ffffffff81356cb0>] ? gs_change+0x13/0x13
[  223.558066] Code: 43 81 48 89 e9 48 89 df e8 13 e9 e8 ff 83 f8 01 19 d2 f7 d2 83 e2 f5 5b 5d 89 d0 41 5c c3 e8 d2 02 dd ff 66 90 c3 ba 0a 00 00 00 <48> 0f c7 f0 72 04 ff ca 75 f6 48 89 07 89 d0 c3 41 57 49 89 ca
[  223.558108] RIP  [<ffffffff81243dbd>] arch_get_random_long+0x5/0x15
[  223.558114]  RSP <ffff88003e27fbe8>
[  223.558122] ---[ end trace 72eaa08f794af2c9 ]---
[  223.558170] BUG: unable to handle kernel paging request at fffffffffffffff8
[  223.558177] IP: [<ffffffff8105f8f2>] kthread_data+0x7/0xc
[  223.558184] PGD 1607067 PUD 1608067 PMD 0
[  223.558190] Oops: 0000 [#2] SMP
[  223.558195] CPU 0
[  223.558197] Modules linked in: ext4 crc16 jbd2 mbcache dm_mod md_mod xen_netfront xen_blkfront
[  223.558208]
[  223.558212] Pid: 18, comm: kworker/0:1 Tainted: G      D      3.2.0-4-amd64 #1 Debian 3.2.57-3+deb7u1
[  223.558219] RIP: e030:[<ffffffff8105f8f2>]  [<ffffffff8105f8f2>] kthread_data+0x7/0xc
[  223.558227] RSP: e02b:ffff88003e27f950  EFLAGS: 00010002
[  223.558232] RAX: 0000000000000000 RBX: ffff88003fc13780 RCX: 0000000000000000
[  223.558237] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003e224080
[  223.558242] RBP: 0000000000000000 R08: 0000000000000400 R09: ffffffff8123931e
[  223.558247] R10: dead000000200200 R11: ffffffff8123931e R12: ffff88003e27fa20
[  223.558252] R13: ffff88003e1b3740 R14: 0000000000000000 R15: ffff88003e224380
[  223.558260] FS:  00007f78832ac720(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  223.558266] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  223.558271] CR2: fffffffffffffff8 CR3: 0000000003649000 CR4: 0000000000002660
[  223.558276] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  223.558281] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  223.558287] Process kworker/0:1 (pid: 18, threadinfo ffff88003e27e000, task ffff88003e224080)
[  223.558292] Stack:
[  223.558295]  ffffffff8105c8c4 ffff88003fc13780 ffff88003e224080 ffff88003e27fa20
[  223.558303]  ffffffff8134e300 ffff88003fc0edc0 ffffffff81095bdb 0000000000013780
[  223.558310]  ffff88003e27ffd8 ffff88003e27ffd8 ffff88003e224080 ffffffff81095117
[  223.558318] Call Trace:
[  223.558323]  [<ffffffff8105c8c4>] ? wq_worker_sleeping+0xb/0x6f
[  223.558328]  [<ffffffff8134e300>] ? __schedule+0x138/0x610
[  223.558334]  [<ffffffff81095bdb>] ? __call_rcu+0x11d/0x12c
[  223.558341]  [<ffffffff81095117>] ? arch_local_irq_restore+0x7/0x8
[  223.558348]  [<ffffffff81048cfa>] ? release_task+0x31b/0x331
[  223.558354]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[  223.558359]  [<ffffffff8104a423>] ? do_exit+0x711/0x713
[  223.558365]  [<ffffffff81071057>] ? arch_local_irq_disable+0x7/0x8
[  223.558372]  [<ffffffff8134fb77>] ? _raw_spin_unlock_irqrestore+0xe/0xf
[  223.558378]  [<ffffffff8135098e>] ? oops_end+0xb1/0xb6
[  223.558384]  [<ffffffff8100e961>] ? do_invalid_op+0x87/0x91
[  223.558390]  [<ffffffff81243dbd>] ? arch_get_random_long+0x5/0x15
[  223.558396]  [<ffffffff810072b8>] ? get_phys_to_machine+0x16/0x58
[  223.558403]  [<ffffffff81004c0a>] ? pfn_to_mfn+0x12/0x27
[  223.558408]  [<ffffffff81004c32>] ? phys_to_machine+0x13/0x1c
[  223.558414]  [<ffffffff81003f67>] ? arch_local_irq_restore+0x7/0x8
[  223.558419]  [<ffffffff81004105>] ? xen_mc_flush+0x124/0x153
[  223.558425]  [<ffffffff81356b2b>] ? invalid_op+0x1b/0x20
[  223.558430]  [<ffffffff81243dbd>] ? arch_get_random_long+0x5/0x15
[  223.558435]  [<ffffffff8124410b>] ? extract_buf+0xdf/0x153
[  223.558441]  [<ffffffff8124461a>] ? extract_entropy+0x75/0x12b
[  223.558447]  [<ffffffff812b3468>] ? rt_cache_invalidate+0x17/0x3b
[  223.558452]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[  223.558457]  [<ffffffff8134e81c>] ? _cond_resched+0x7/0x1c
[  223.558463]  [<ffffffff812b4c8a>] ? rt_cache_flush+0xe/0x3b
[  223.558468]  [<ffffffff812e35df>] ? fib_netdev_event+0x9c/0xac
[  223.558474]  [<ffffffff81352b41>] ? notifier_call_chain+0x2e/0x5b
[  223.558480]  [<ffffffff8128faa8>] ? netdev_state_change+0x1a/0x2c
[  223.558485]  [<ffffffff8129d532>] ? linkwatch_do_dev+0x9a/0xa8
[  223.558491]  [<ffffffff8129d7d4>] ? __linkwatch_run_queue+0x10e/0x150
[  223.558497]  [<ffffffff8129d834>] ? linkwatch_event+0x1e/0x25
[  223.558502]  [<ffffffff8105b5cf>] ? process_one_work+0x161/0x269
[  223.558508]  [<ffffffff8105c598>] ? worker_thread+0xc2/0x145
[  223.558514]  [<ffffffff8105c4d6>] ? manage_workers.isra.25+0x15b/0x15b
[  223.558519]  [<ffffffff8105f6d9>] ? kthread+0x76/0x7e
[  223.558525]  [<ffffffff81356cb4>] ? kernel_thread_helper+0x4/0x10
[  223.558530]  [<ffffffff81354d73>] ? int_ret_from_sys_call+0x7/0x1b
[  223.558536]  [<ffffffff8134fe7c>] ? retint_restore_args+0x5/0x6
[  223.558542]  [<ffffffff81356cb0>] ? gs_change+0x13/0x13
[  223.558546] Code: 3f 48 c1 e5 03 48 c1 e0 06 48 8d b0 e0 5d 40 81 48 29 ee e8 11 32 fe ff 81 4b 14 00 00 00 04 41 59 5b 5d c3 48 8b 87 a8 02 00 00 <48> 8b 40 f8 c3 48 3b 3d ea c7 72 00 75 08 0f bf 87 72 06 00 00
[  223.558587] RIP  [<ffffffff8105f8f2>] kthread_data+0x7/0xc
[  223.558594]  RSP <ffff88003e27f950>
[  223.558597] CR2: fffffffffffffff8
[  223.558600] ---[ end trace 72eaa08f794af2ca ]---
[  223.558604] Fixing recursive fault but reboot is needed!

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users