Re: Cpu on/offlining crash with core scheduling
On 29/04/2020 09:09, Jürgen Groß wrote:
> On 27.04.20 15:49, Sergey Dyasli wrote:
>> Hi Juergen,
>>
>> When I'm testing vcpu pinning with something like:
>>
>> # xl vcpu-pin 0 0 2
>> # xen-hptool cpu-offline 3
>>
>> (offline / online CPUs {2,3} if the above is successful)
>>
>> I'm reliably seeing the following crash on the latest staging:
>>
>> (XEN) Watchdog timer detects that CPU1 is stuck!
>> (XEN) ----[ Xen-4.14-unstable x86_64 debug=y Not tainted ]----
>> (XEN) CPU: 1
>> (XEN) RIP: e008:[<ffff82d08025266d>]
>> common/sched/core.c#sched_wait_rendezvous_in+0x16c/0x385
>> (XEN) RFLAGS: 0000000000000002 CONTEXT: hypervisor
>> (XEN) rax: 000000000000f001 rbx: ffff82d0805c9118 rcx: ffff83085e750301
>> (XEN) rdx: 0000000000000001 rsi: ffff83086499b972 rdi: ffff83085e7503a6
>> (XEN) rbp: ffff83085e7dfe28 rsp: ffff83085e7dfdd8 r8: ffff830864985440
>> (XEN) r9: ffff83085e714068 r10: 0000000000000014 r11: 00000056b6a1aab2
>> (XEN) r12: ffff83086499e490 r13: ffff82d0805f26e0 r14: ffff83085e7503a0
>> (XEN) r15: 0000000000000001 cr0: 0000000080050033 cr4: 0000000000362660
>> (XEN) cr3: 0000000823a8e000 cr2: 00006026000f6fc0
>> (XEN) fsb: 0000000000000000 gsb: ffff888138dc0000 gss: 0000000000000000
>> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
>> (XEN) Xen code around <ffff82d08025266d>
>> (common/sched/core.c#sched_wait_rendezvous_in+0x16c/0x385):
>> (XEN) 4c 89 f7 e8 dc a5 fd ff <4b> 8b 44 fd 00 48 8b 04 18 4c 3b 70 10 0f
>> 85 3f
>> (XEN) Xen stack trace from rsp=ffff83085e7dfdd8:
>> (XEN) 00000056b42128a6 ffff83086499ff30 ffff83086498a000 ffff83085e7dfe48
>> (XEN) 0000000100000001 00000056b42128a6 ffff83086499e490 0000000000000000
>> (XEN) 0000000000000001 0000000000000001 ffff83085e7dfe78 ffff82d080252ae8
>> (XEN) ffff83086498a000 0000000180230434 ffff83085e7503a0 ffff82d0805ceb00
>> (XEN) ffffffffffffffff ffff82d0805cea80 0000000000000000 ffff82d0805dea80
>> (XEN) ffff83085e7dfeb0 ffff82d08022c232 0000000000000001 ffff82d0805ceb00
>> (XEN) 0000000000000001 0000000000000001 0000000000000001 ffff83085e7dfec0
>> (XEN) ffff82d08022c2cd ffff83085e7dfef0 ffff82d08031cae9 ffff83086498a000
>> (XEN) ffff83086498a000 0000000000000001 0000000000000001 ffff83085e7dfde8
>> (XEN) ffff88813021d700 ffff88813021d700 0000000000000000 0000000000000000
>> (XEN) 0000000000000007 ffff88813021d700 0000000000000246 0000000000007ff0
>> (XEN) 0000000000000000 000000000001ca00 0000000000000000 ffffffff810013aa
>> (XEN) ffffffff8203d210 deadbeefdeadf00d deadbeefdeadf00d 0000010000000000
>> (XEN) ffffffff810013aa 000000000000e033 0000000000000246 ffffc900400dfeb0
>> (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000
>> (XEN) 0000000000000000 0000e01000000001 ffff83086498a000 00000037e43bd000
>> (XEN) 0000000000362660 0000000000000000 8000000864980002 0000060100000000
>> (XEN) 0000000000000000
>> (XEN) Xen call trace:
>> (XEN) [<ffff82d08025266d>] R
>> common/sched/core.c#sched_wait_rendezvous_in+0x16c/0x385
>> (XEN) [<ffff82d080252ae8>] F common/sched/core.c#sched_slave+0x262/0x31e
>> (XEN) [<ffff82d08022c232>] F common/softirq.c#__do_softirq+0x8a/0xbc
>> (XEN) [<ffff82d08022c2cd>] F do_softirq+0x13/0x15
>> (XEN) [<ffff82d08031cae9>] F arch/x86/domain.c#idle_loop+0x57/0xa7
>> (XEN)
>> (XEN) CPU0 @ e008:ffff82d08022c2b7 (process_pending_softirqs+0x53/0x56)
>> (XEN) CPU4 @ e008:ffff82d08022bc40
>> (common/rcupdate.c#rcu_process_callbacks+0x22e/0x24b)
>> (XEN) CPU2 @ e008:ffff82d08022c26f (process_pending_softirqs+0xb/0x56)
>> (XEN) CPU7 @ e008:ffff82d08022bc40
>> (common/rcupdate.c#rcu_process_callbacks+0x22e/0x24b)
>> (XEN) CPU3 @ e008:ffff82d08022bc40
>> (common/rcupdate.c#rcu_process_callbacks+0x22e/0x24b)
>> (XEN) CPU5 @ e008:ffff82d08022cc34 (_spin_lock+0x4d/0x62)
>> (XEN) CPU6 @ e008:ffff82d08022c264 (process_pending_softirqs+0/0x56)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 1:
>> (XEN) FATAL TRAP: vector = 2 (nmi)
>> (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>> (XEN) Executing kexec image on cpu1
>> (XEN) Shot down all CPUs
>>
>>
>> Is this something you can reproduce?
>
> Yes, I was able to hit this.
>
> Attached patch is fixing it for me. Could you give it a try?

The patch fixes the immediate issue:

Tested-by: Sergey Dyasli <sergey.dyasli@xxxxxxxxxx>

Thanks!

However, when running the following script:

while :; do xen-hptool cpu-offline 3; xen-hptool cpu-offline 2; xen-hptool cpu-online 3; xen-hptool cpu-online 2; sleep 0.1; done

there was some weirdness with the utility on some invocations:

xen-hptool: symbol lookup error: /lib64/libxenctrl.so.4.14: undefined symbol: xc__hypercall_buffer_free
Segmentation fault (core dumped)
xen-hptool: symbol lookup error: /lib64/libxenctrl.so.4.14: undefined symbol: xc__hypercall_bounce_post
xen-hptool: relocation error: /lib64/libxenctrl.so.4.14: symbol xencall_free_buffer, version VERS_1.0 not defined in file libxencall.so.1 with link time reference

And after a while it all ended up in:

[ 634.817181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 634.817197] PGD 67866067 P4D 67866067 PUD 4cb6067 PMD 0
[ 634.817208] Oops: 0000 [#1] SMP NOPTI
[ 634.817215] CPU: 6 PID: 17284 Comm: xen-hptool Tainted: G O 4.19.0+1 #1
[ 634.817224] Hardware name: Supermicro MBI-6119G-T4/B2SS1-F, BIOS 2.0a 06/10/2017
[ 634.817237] RIP: e030:wq_worker_waking_up+0xd/0x30
[ 634.817301] Code: 59 fb ff ff b8 01 00 00 00 48 83 c4 08 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 89 f3 e8 53 51 00 00 <f7> 40 60 c8 01 00 00 75 10 48 8b 40 40 39 58 04 75 09 f0 ff 80 00
[ 634.817322] RSP: e02b:ffffc90044117c58 EFLAGS: 00010002
[ 634.817329] RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffff888138d21700
[ 634.817338] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88812a8dba00
[ 634.817347] RBP: ffff888138d21700 R08: ffff88812a8dba80 R09: 0000000000000000
[ 634.817357] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 634.817366] R13: ffffc90044117c98 R14: 0000000000000000 R15: 0000000000000004
[ 634.817386] FS: 00007f175d011740(0000) GS:ffff888138d80000(0000) knlGS:0000000000000000
[ 634.817394] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 634.817399] CR2: 0000000000000060 CR3: 0000000067974000 CR4: 0000000000040660
[ 634.817410] Call Trace:
[ 634.817417] ttwu_do_activate+0x5f/0x80
[ 634.817422] try_to_wake_up+0x1e1/0x450
[ 634.817427] __queue_work+0x116/0x360
[ 634.817432] queue_work_on+0x24/0x40
[ 634.817438] pty_write+0x8f/0xa0
[ 634.817443] n_tty_write+0x1c5/0x480
[ 634.817448] ? do_wait_intr_irq+0xa0/0xa0
[ 634.817452] tty_write+0x154/0x2c0
[ 634.817457] ? process_echoes+0x70/0x70
[ 634.817462] __vfs_write+0x36/0x1a0
[ 634.817468] ? do_vfs_ioctl+0xa9/0x630
[ 634.817472] vfs_write+0xad/0x1a0
[ 634.817477] ksys_write+0x52/0xc0
[ 634.817482] do_syscall_64+0x4e/0x100
[ 634.817488] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 634.817494] RIP: 0033:0x7f175c0b9cd0
[ 634.817499] Code: 73 01 c3 48 8b 0d c0 61 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d cd c2 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ee cb 01 00 48 89 04 24
[ 634.817514] RSP: 002b:00007ffc6651bfd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 634.817521] RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00007f175c0b9cd0
[ 634.817528] RDX: 000000000000001b RSI: 00007f175d021000 RDI: 0000000000000001
[ 634.817535] RBP: 00007f175d021000 R08: 0a796c6c75667373 R09: 00007f175c01716d
[ 634.817542] R10: 00007ffc6651c0a0 R11: 0000000000000246 R12: 00007f175c391400
[ 634.817548] R13: 000000000000001b R14: 0000000000000d70 R15: 00007f175c38c858
[ 634.817556] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc dm_mod intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd ipmi_si cryptd glue_helper ipmi_devintf ipmi_msghandler mei_me mei intel_rapl_perf sg intel_pch_thermal ie31200_edac i2c_i801 video backlight acpi_power_meter xen_wdt ip_tables x_tables hid_generic usbhid hid sd_mod ahci libahci xhci_pci libata xhci_hcd intel_ish_ipc igb(O) intel_ishtp scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua
[ 634.817636] scsi_mod ipv6 crc_ccitt
[ 634.817642] CR2: 0000000000000060
[ 634.817647] ---[ end trace b370af17485413d2 ]---
[ 634.872560] RIP: e030:wq_worker_waking_up+0xd/0x30
[ 634.872566] Code: 59 fb ff ff b8 01 00 00 00 48 83 c4 08 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 89 f3 e8 53 51 00 00 <f7> 40 60 c8 01 00 00 75 10 48 8b 40 40 39 58 04 75 09 f0 ff 80 00
[ 634.872582] RSP: e02b:ffffc90044117c58 EFLAGS: 00010002
[ 634.872587] RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffff888138d21700
[ 634.872594] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88812a8dba00
[ 634.872601] RBP: ffff888138d21700 R08: ffff88812a8dba80 R09: 0000000000000000
[ 634.872608] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 634.872614] R13: ffffc90044117c98 R14: 0000000000000000 R15: 0000000000000004
[ 634.872627] FS: 00007f175d011740(0000) GS:ffff888138d80000(0000) knlGS:0000000000000000
[ 634.872634] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 634.872640] CR2: 0000000000000060 CR3: 0000000067974000 CR4: 0000000000040660

--
Thanks,
Sergey