
Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic



On Wed, May 29, 2024 at 03:08:49PM +0200, Jürgen Groß wrote:
> On 29.05.24 14:46, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote:
> > > On 28.05.24 13:22, Roger Pau Monné wrote:
> > > > Hello,
> > > > 
> > > > When the stop_machine_run() call in cpu_down() fails and calls the CPU
> > > > notifier CPU_DOWN_FAILED hook the following assert triggers in the
> > > > scheduling code:
> > > > 
> > > > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at 
> > > > common/sched/cred1
> > > > ----[ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C    ]----
> > > > CPU:    0
> > > > RIP:    e008:[<ffff82d040248299>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
> > > > RFLAGS: 0000000000010093   CONTEXT: hypervisor
> > > > rax: 0000000000000000   rbx: ffff83202ecc2f80   rcx: ffff83202f3e64c0
> > > > rdx: 0000000000000001   rsi: 0000000000000002   rdi: ffff83202ecc2f88
> > > > rbp: ffff83203ffffd58   rsp: ffff83203ffffd30   r8:  0000000000000000
> > > > r9:  ffff83202f3e6e01   r10: 0000000000000000   r11: 0f0f0f0f0f0f0f0f
> > > > r12: ffff83202ecb80b0   r13: 0000000000000001   r14: 0000000000000282
> > > > r15: ffff83202ecbbf00   cr0: 000000008005003b   cr4: 00000000007526e0
> > > > cr3: 00000000574c2000   cr2: 0000000000000000
> > > > fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> > > > ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> > > > Xen code around <ffff82d040248299> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177):
> > > >    fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e
> > > > Xen stack trace from rsp=ffff83203ffffd30:
> > > >      ffff83202d74d100 0000000000000001 ffff82d0404c4430 0000000000000006
> > > >      0000000000000000 ffff83203ffffd78 ffff82d040257454 0000000000000000
> > > >      0000000000000001 ffff83203ffffda8 ffff82d04021f303 ffff82d0404c4628
> > > >      ffff82d0404c4620 ffff82d0404c4430 0000000000000006 ffff83203ffffdf0
> > > >      ffff82d04022bc4c ffff83203ffffe18 0000000000000001 0000000000000001
> > > >      00000000fffffff0 0000000000000000 0000000000000000 ffff82d0405e6500
> > > >      ffff83203ffffe08 ffff82d040204fd5 0000000000000001 ffff83203ffffe30
> > > >      ffff82d0402054f0 ffff82d0404c5860 0000000000000001 ffff83202ec75000
> > > >      ffff83203ffffe48 ffff82d040348c25 ffff83202d74d0d0 ffff83203ffffe68
> > > >      ffff82d0402071aa ffff83202ec751d0 ffff82d0405ce210 ffff83203ffffe80
> > > >      ffff82d0402343c9 ffff82d0405ce200 ffff83203ffffeb0 ffff82d040234631
> > > >      0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210
> > > >      ffff83203ffffee8 ffff82d040321411 ffff82d040321399 ffff83202f3a9000
> > > >      0000000000000000 0000001d91a6fa2d ffff82d0405e6500 ffff83203ffffde0
> > > >      ffff82d040324391 0000000000000000 0000000000000000 0000000000000000
> > > >      0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > >      0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > >      0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > >      0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > >      0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > > Xen call trace:
> > > >      [<ffff82d040248299>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
> > > >      [<ffff82d040257454>] F free_cpu_rm_data+0x41/0x58
> > > >      [<ffff82d04021f303>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466
> > > >      [<ffff82d04022bc4c>] F notifier_call_chain+0x6c/0x96
> > > >      [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
> > > >      [<ffff82d0402054f0>] F cpu_down+0xa7/0x143
> > > >      [<ffff82d040348c25>] F cpu_down_helper+0x11/0x27
> > > >      [<ffff82d0402071aa>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
> > > >      [<ffff82d0402343c9>] F common/tasklet.c#do_tasklet_work+0x76/0xaf
> > > >      [<ffff82d040234631>] F do_tasklet+0x5b/0x8d
> > > >      [<ffff82d040321411>] F arch/x86/domain.c#idle_loop+0x78/0xe6
> > > >      [<ffff82d040324391>] F continue_running+0x5b/0x5d
> > > > 
> > > > 
> > > > ****************************************
> > > > Panic on CPU 0:
> > > > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:4111
> > > > ****************************************
> > > > 
> > > > The issue seems to be that since the CPU hasn't been removed, it's
> > > > still part of prv->initialized and the assert in csched2_free_pdata()
> > > > called as part of free_cpu_rm_data() triggers.
> > > > 
> > > > It's easy to reproduce by substituting the stop_machine_run() call in
> > > > cpu_down() with an error.
> > > 
> > > Could you please give the attached patch a try?
> > 
> > I still get the following assert:
> 
> Oh, silly me. Without core scheduling active nr_sr_unused will be 0 all
> the time. :-(
> 
> Next try.

I'm afraid I have a new trace for you:

Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:3987
----[ Xen-4.19-unstable  x86_64  debug=y  Not tainted ]----
CPU:    0
RIP:    e008:[<ffff82d040247d27>] common/sched/credit2.c#csched2_switch_sched+0x115/0x339
RFLAGS: 0000000000010093   CONTEXT: hypervisor
rax: 000000000000c000   rbx: 0000000000000001   rcx: ffff82d0405e6500
rdx: 0000004feee13000   rsi: 0000000000000004   rdi: ffff83202ecc2f88
rbp: ffff83203ffffc80   rsp: ffff83203ffffc38   r8:  0000000000000000
r9:  ffff83202ecbbf01   r10: 0000000000000000   r11: 0f0f0f0f0f0f0f0f
r12: ffff83202ecc2f80   r13: ffff83402ca50100   r14: ffff83402ca50140
r15: ffff83202ecc2f88   cr0: 000000008005003b   cr4: 00000000007526e0
cr3: 00000000574c2000   cr2: 0000000000000000
fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
Xen code around <ffff82d040247d27> (common/sched/credit2.c#csched2_switch_sched+0x115/0x339):
 7c ff ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 41 8b 56 30 89 de 48 8d 3d e8 00 1a
Xen stack trace from rsp=ffff83203ffffc38:
   ffff83203ffffc48 ffff82d0402332ba ffff83203ffffc68 ffff82d04023343d
   0000000000000001 ffff82d0405cf398 ffff83402ca50100 ffff82d0405e6500
   ffff83202ecbbdb0 ffff83203ffffd18 ffff82d040256e1a ffff83203fff386c
   ffff83203fff2000 0000000000000005 ffff83202ecbbf00 ffff83402ca50140
   ffff83203fff3868 0000000000000282 0000000040233509 ffff83202ecbbdb0
   ffff83402ca50100 ffff83202f3e6d80 ffff83202ecc2ec0 ffff83202ecc2ec0
   0000000000000001 ffff82d0403da460 0000000000000048 0000000000000000
   ffff83203ffffd48 ffff82d0402414b7 0000000000000001 0000000000000000
   ffff82d0403da460 0000000000000006 ffff83203ffffd70 ffff82d04024173d
   0000000000000000 0000000000000001 ffff82d0404c4430 ffff83203ffffda0
   ffff82d04021f1f9 ffff82d0404c4628 ffff82d0404c4620 ffff82d0404c4430
   0000000000000006 ffff83203ffffde8 ffff82d04022bb2f ffff83203ffffe10
   0000000000000001 0000000000000001 0000000000000000 ffff83203ffffe10
   0000000000000000 ffff82d0405e6500 ffff83203ffffe00 ffff82d040204fd5
   0000000000000001 ffff83203ffffe30 ffff82d040205464 ffff82d0404c5860
   0000000000000001 ffff83202ec86000 0000000000000000 ffff83203ffffe48
   ffff82d040348c32 ffff83402ca500d0 ffff83203ffffe68 ffff82d04020708d
   ffff83202ec861d0 ffff82d0405ce210 ffff83203ffffe80 ffff82d0402342a3
   ffff82d0405ce200 ffff83203ffffeb0 ffff82d04023450b 0000000000000000
   0000000000007fff ffff82d0405d5080 ffff82d0405ce210 ffff83203ffffee8
Xen call trace:
   [<ffff82d040247d27>] R common/sched/credit2.c#csched2_switch_sched+0x115/0x339
   [<ffff82d040256e1a>] F schedule_cpu_add+0x1a4/0x463
   [<ffff82d0402414b7>] F common/sched/cpupool.c#cpupool_assign_cpu_locked+0x5a/0x17e
   [<ffff82d04024173d>] F common/sched/cpupool.c#cpupool_cpu_add+0x162/0x16c
   [<ffff82d04021f1f9>] F common/sched/cpupool.c#cpu_callback+0x10e/0x466
   [<ffff82d04022bb2f>] F notifier_call_chain+0x6c/0x96
   [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
   [<ffff82d040205464>] F cpu_down+0x60/0x83
   [<ffff82d040348c32>] F cpu_down_helper+0x11/0x27
   [<ffff82d04020708d>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
   [<ffff82d0402342a3>] F common/tasklet.c#do_tasklet_work+0x76/0xaf
   [<ffff82d04023450b>] F do_tasklet+0x5b/0x8d
   [<ffff82d040321372>] F arch/x86/domain.c#idle_loop+0x78/0xe6
   [<ffff82d0403242f2>] F continue_running+0x5b/0x5d


****************************************
Panic on CPU 0:
Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:3987
****************************************

This time it is one of the asserts in init_pdata().
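
To make the failing invariant concrete, here is a minimal standalone sketch.
It is a toy model, not Xen code: every toy_* name, the single-word bitmask
standing in for prv->initialized, and the injected error value are
assumptions made up for illustration. It only shows why re-adding a CPU
that was never removed trips the "!cpumask_test_cpu(cpu, &prv->initialized)"
check.

```c
#include <stdbool.h>

/* Toy stand-in for prv->initialized: one bit per CPU (assumption). */
unsigned long toy_initialized;

bool toy_cpu_is_initialized(unsigned int cpu)
{
    return (toy_initialized & (1UL << cpu)) != 0;
}

/*
 * Models the precondition seen in the trace: the CPU must not already
 * be marked initialized. Returns false where the real code would ASSERT.
 */
bool toy_init_pdata(unsigned int cpu)
{
    if (toy_cpu_is_initialized(cpu))
        return false;
    toy_initialized |= 1UL << cpu;
    return true;
}

/* Models removal of the CPU from the scheduler's initialized mask. */
void toy_deinit_pdata(unsigned int cpu)
{
    toy_initialized &= ~(1UL << cpu);
}

/* Same precondition on the free side: the CPU must already be gone. */
bool toy_free_pdata(unsigned int cpu)
{
    return !toy_cpu_is_initialized(cpu);
}

/* Hypothetical fault injection: pretend the stop-machine step failed. */
int toy_stop_machine_run(bool inject_failure)
{
    return inject_failure ? -1 : 0;
}

/*
 * Returns true if the offline attempt violates one of the preconditions.
 * On success the CPU is deinitialized before its data is freed. On
 * failure the CPU was never removed, yet the rollback tries to add it
 * again, which is the pattern in the trace above.
 */
bool toy_cpu_down(unsigned int cpu, bool inject_failure)
{
    if (toy_stop_machine_run(inject_failure) == 0) {
        toy_deinit_pdata(cpu);
        return !toy_free_pdata(cpu);
    }
    return !toy_init_pdata(cpu);
}
```

With CPU 1 marked initialized, toy_cpu_down(1, true) reports a violation,
while toy_cpu_down(1, false) does not. The real fix may of course look
quite different from anything in this model.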

Roger.



 

