
Re: Cpu on/offlining crash with core scheduling


  • To: Jürgen Groß <jgross@xxxxxxxx>
  • From: Sergey Dyasli <sergey.dyasli@xxxxxxxxxx>
  • Date: Wed, 29 Apr 2020 09:15:33 +0000
  • Accept-language: en-GB, en-US
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Sergey Dyasli <sergey.dyasli@xxxxxxxxxx>
  • Delivery-date: Wed, 29 Apr 2020 09:15:51 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: Cpu on/offlining crash with core scheduling

On 29/04/2020 09:09, Jürgen Groß wrote:
> On 27.04.20 15:49, Sergey Dyasli wrote:
>> Hi Juergen,
>>
>> When I'm testing vcpu pinning with something like:
>>
>>       # xl vcpu-pin 0 0 2
>>       # xen-hptool cpu-offline 3
>>
>>       (offline / online CPUs {2,3} if the above is successful)
>>
>> I'm reliably seeing the following crash on the latest staging:
>>
>> (XEN) Watchdog timer detects that CPU1 is stuck!
>> (XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
>> (XEN) CPU:    1
>> (XEN) RIP:    e008:[<ffff82d08025266d>] common/sched/core.c#sched_wait_rendezvous_in+0x16c/0x385
>> (XEN) RFLAGS: 0000000000000002   CONTEXT: hypervisor
>> (XEN) rax: 000000000000f001   rbx: ffff82d0805c9118   rcx: ffff83085e750301
>> (XEN) rdx: 0000000000000001   rsi: ffff83086499b972   rdi: ffff83085e7503a6
>> (XEN) rbp: ffff83085e7dfe28   rsp: ffff83085e7dfdd8   r8:  ffff830864985440
>> (XEN) r9:  ffff83085e714068   r10: 0000000000000014   r11: 00000056b6a1aab2
>> (XEN) r12: ffff83086499e490   r13: ffff82d0805f26e0   r14: ffff83085e7503a0
>> (XEN) r15: 0000000000000001   cr0: 0000000080050033   cr4: 0000000000362660
>> (XEN) cr3: 0000000823a8e000   cr2: 00006026000f6fc0
>> (XEN) fsb: 0000000000000000   gsb: ffff888138dc0000   gss: 0000000000000000
>> (XEN) ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen code around <ffff82d08025266d> (common/sched/core.c#sched_wait_rendezvous_in+0x16c/0x385):
>> (XEN)  4c 89 f7 e8 dc a5 fd ff <4b> 8b 44 fd 00 48 8b 04 18 4c 3b 70 10 0f 85 3f
>> (XEN) Xen stack trace from rsp=ffff83085e7dfdd8:
>> (XEN)    00000056b42128a6 ffff83086499ff30 ffff83086498a000 ffff83085e7dfe48
>> (XEN)    0000000100000001 00000056b42128a6 ffff83086499e490 0000000000000000
>> (XEN)    0000000000000001 0000000000000001 ffff83085e7dfe78 ffff82d080252ae8
>> (XEN)    ffff83086498a000 0000000180230434 ffff83085e7503a0 ffff82d0805ceb00
>> (XEN)    ffffffffffffffff ffff82d0805cea80 0000000000000000 ffff82d0805dea80
>> (XEN)    ffff83085e7dfeb0 ffff82d08022c232 0000000000000001 ffff82d0805ceb00
>> (XEN)    0000000000000001 0000000000000001 0000000000000001 ffff83085e7dfec0
>> (XEN)    ffff82d08022c2cd ffff83085e7dfef0 ffff82d08031cae9 ffff83086498a000
>> (XEN)    ffff83086498a000 0000000000000001 0000000000000001 ffff83085e7dfde8
>> (XEN)    ffff88813021d700 ffff88813021d700 0000000000000000 0000000000000000
>> (XEN)    0000000000000007 ffff88813021d700 0000000000000246 0000000000007ff0
>> (XEN)    0000000000000000 000000000001ca00 0000000000000000 ffffffff810013aa
>> (XEN)    ffffffff8203d210 deadbeefdeadf00d deadbeefdeadf00d 0000010000000000
>> (XEN)    ffffffff810013aa 000000000000e033 0000000000000246 ffffc900400dfeb0
>> (XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
>> (XEN)    0000000000000000 0000e01000000001 ffff83086498a000 00000037e43bd000
>> (XEN)    0000000000362660 0000000000000000 8000000864980002 0000060100000000
>> (XEN)    0000000000000000
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d08025266d>] R common/sched/core.c#sched_wait_rendezvous_in+0x16c/0x385
>> (XEN)    [<ffff82d080252ae8>] F common/sched/core.c#sched_slave+0x262/0x31e
>> (XEN)    [<ffff82d08022c232>] F common/softirq.c#__do_softirq+0x8a/0xbc
>> (XEN)    [<ffff82d08022c2cd>] F do_softirq+0x13/0x15
>> (XEN)    [<ffff82d08031cae9>] F arch/x86/domain.c#idle_loop+0x57/0xa7
>> (XEN)
>> (XEN) CPU0 @ e008:ffff82d08022c2b7 (process_pending_softirqs+0x53/0x56)
>> (XEN) CPU4 @ e008:ffff82d08022bc40 (common/rcupdate.c#rcu_process_callbacks+0x22e/0x24b)
>> (XEN) CPU2 @ e008:ffff82d08022c26f (process_pending_softirqs+0xb/0x56)
>> (XEN) CPU7 @ e008:ffff82d08022bc40 (common/rcupdate.c#rcu_process_callbacks+0x22e/0x24b)
>> (XEN) CPU3 @ e008:ffff82d08022bc40 (common/rcupdate.c#rcu_process_callbacks+0x22e/0x24b)
>> (XEN) CPU5 @ e008:ffff82d08022cc34 (_spin_lock+0x4d/0x62)
>> (XEN) CPU6 @ e008:ffff82d08022c264 (process_pending_softirqs+0/0x56)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 1:
>> (XEN) FATAL TRAP: vector = 2 (nmi)
>> (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>> (XEN) Executing kexec image on cpu1
>> (XEN) Shot down all CPUs
>>
>>
>> Is this something you can reproduce?
> 
> Yes, I was able to hit this.
> 
> Attached patch is fixing it for me. Could you give it a try?

The patch fixes the immediate issue:

        Tested-by: Sergey Dyasli <sergey.dyasli@xxxxxxxxxx>
        
Thanks!

However, when running the following script:

        while :; do xen-hptool cpu-offline 3; xen-hptool cpu-offline 2; xen-hptool cpu-online 3; xen-hptool cpu-online 2; sleep 0.1; done
        
some invocations of xen-hptool failed with dynamic linker errors or a segmentation fault:

        xen-hptool: symbol lookup error: /lib64/libxenctrl.so.4.14: undefined symbol: xc__hypercall_buffer_free
        Segmentation fault (core dumped)
        xen-hptool: symbol lookup error: /lib64/libxenctrl.so.4.14: undefined symbol: xc__hypercall_bounce_post
        xen-hptool: relocation error: /lib64/libxenctrl.so.4.14: symbol xencall_free_buffer, version VERS_1.0 not defined in file libxencall.so.1 with link time reference
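
To see which invocation fails first, the loop above could be wrapped so that it stops at the first non-zero exit status instead of carrying on. The following is only a sketch under assumptions: the HPTOOL variable, the function names cycle and run, and the optional cycle limit are illustrative and not part of xen-hptool. The CPU numbers and subcommands are the ones from the loop above.

```shell
#!/bin/sh
# Sketch only: HPTOOL, cycle, run and the cycle limit are illustrative names.
HPTOOL=${HPTOOL:-xen-hptool}

# One offline/online round over CPUs 3 and 2, in the same order as the loop above.
# Returns non-zero at the first invocation that fails.
cycle() {
    for step in "cpu-offline 3" "cpu-offline 2" "cpu-online 3" "cpu-online 2"; do
        # $step is left unquoted on purpose so it splits into subcommand and CPU.
        $HPTOOL $step || { echo "failed: $HPTOOL $step"; return 1; }
    done
}

# Repeat cycle until it fails, or until max rounds have completed (0 = no limit).
run() {
    max=${1:-0}
    i=0
    while cycle; do
        i=$((i + 1))
        if [ "$max" -gt 0 ] && [ "$i" -ge "$max" ]; then
            break
        fi
        sleep 0.1
    done
    echo "completed $i cycles"
}
```

Calling run with no argument loops until the first failure and then reports the failing step and the number of complete rounds. Setting HPTOOL=echo gives a dry run on a machine without Xen.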
        
With the loop left running, the kernel eventually hit a NULL pointer dereference:

[  634.817181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[  634.817197] PGD 67866067 P4D 67866067 PUD 4cb6067 PMD 0
[  634.817208] Oops: 0000 [#1] SMP NOPTI
[  634.817215] CPU: 6 PID: 17284 Comm: xen-hptool Tainted: G           O      4.19.0+1 #1
[  634.817224] Hardware name: Supermicro MBI-6119G-T4/B2SS1-F, BIOS 2.0a 06/10/2017
[  634.817237] RIP: e030:wq_worker_waking_up+0xd/0x30
[  634.817301] Code: 59 fb ff ff b8 01 00 00 00 48 83 c4 08 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 89 f3 e8 53 51 00 00 <f7> 40 60 c8 01 00 00 75 10 48 8b 40 40 39 58 04 75 09 f0 ff 80 00
[  634.817322] RSP: e02b:ffffc90044117c58 EFLAGS: 00010002
[  634.817329] RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffff888138d21700
[  634.817338] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88812a8dba00
[  634.817347] RBP: ffff888138d21700 R08: ffff88812a8dba80 R09: 0000000000000000
[  634.817357] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  634.817366] R13: ffffc90044117c98 R14: 0000000000000000 R15: 0000000000000004
[  634.817386] FS:  00007f175d011740(0000) GS:ffff888138d80000(0000) knlGS:0000000000000000
[  634.817394] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  634.817399] CR2: 0000000000000060 CR3: 0000000067974000 CR4: 0000000000040660
[  634.817410] Call Trace:
[  634.817417]  ttwu_do_activate+0x5f/0x80
[  634.817422]  try_to_wake_up+0x1e1/0x450
[  634.817427]  __queue_work+0x116/0x360
[  634.817432]  queue_work_on+0x24/0x40
[  634.817438]  pty_write+0x8f/0xa0
[  634.817443]  n_tty_write+0x1c5/0x480
[  634.817448]  ? do_wait_intr_irq+0xa0/0xa0
[  634.817452]  tty_write+0x154/0x2c0
[  634.817457]  ? process_echoes+0x70/0x70
[  634.817462]  __vfs_write+0x36/0x1a0
[  634.817468]  ? do_vfs_ioctl+0xa9/0x630
[  634.817472]  vfs_write+0xad/0x1a0
[  634.817477]  ksys_write+0x52/0xc0
[  634.817482]  do_syscall_64+0x4e/0x100
[  634.817488]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  634.817494] RIP: 0033:0x7f175c0b9cd0
[  634.817499] Code: 73 01 c3 48 8b 0d c0 61 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d cd c2 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ee cb 01 00 48 89 04 24
[  634.817514] RSP: 002b:00007ffc6651bfd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  634.817521] RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00007f175c0b9cd0
[  634.817528] RDX: 000000000000001b RSI: 00007f175d021000 RDI: 0000000000000001
[  634.817535] RBP: 00007f175d021000 R08: 0a796c6c75667373 R09: 00007f175c01716d
[  634.817542] R10: 00007ffc6651c0a0 R11: 0000000000000246 R12: 00007f175c391400
[  634.817548] R13: 000000000000001b R14: 0000000000000d70 R15: 00007f175c38c858
[  634.817556] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc dm_mod intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd ipmi_si cryptd glue_helper ipmi_devintf ipmi_msghandler mei_me mei intel_rapl_perf sg intel_pch_thermal ie31200_edac i2c_i801 video backlight acpi_power_meter xen_wdt ip_tables x_tables hid_generic usbhid hid sd_mod ahci libahci xhci_pci libata xhci_hcd intel_ish_ipc igb(O) intel_ishtp scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua
[  634.817636]  scsi_mod ipv6 crc_ccitt
[  634.817642] CR2: 0000000000000060
[  634.817647] ---[ end trace b370af17485413d2 ]---
[  634.872560] RIP: e030:wq_worker_waking_up+0xd/0x30
[  634.872566] Code: 59 fb ff ff b8 01 00 00 00 48 83 c4 08 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 89 f3 e8 53 51 00 00 <f7> 40 60 c8 01 00 00 75 10 48 8b 40 40 39 58 04 75 09 f0 ff 80 00
[  634.872582] RSP: e02b:ffffc90044117c58 EFLAGS: 00010002
[  634.872587] RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffff888138d21700
[  634.872594] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88812a8dba00
[  634.872601] RBP: ffff888138d21700 R08: ffff88812a8dba80 R09: 0000000000000000
[  634.872608] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  634.872614] R13: ffffc90044117c98 R14: 0000000000000000 R15: 0000000000000004
[  634.872627] FS:  00007f175d011740(0000) GS:ffff888138d80000(0000) knlGS:0000000000000000
[  634.872634] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  634.872640] CR2: 0000000000000060 CR3: 0000000067974000 CR4: 0000000000040660

--
Thanks,
Sergey
