
Re: [PATCH-for-4.17] xen/sched: migrate timers to correct cpus after suspend



On 27.10.22 20:13, Marek Marczykowski-Górecki wrote:
On Fri, Oct 21, 2022 at 04:53:57PM +0200, Juergen Gross wrote:
Today all timers are migrated to cpu 0 when the system is being
suspended. They are not migrated back after resuming the system again.

This results (at least) in problems with the credit scheduler, as the
timer isn't handled on the cpu where it was expected to fire.

Add migrating the scheduling related timers of a specific cpu from cpu
0 back to that cpu once it has come up again when resuming the system.

Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
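
For context, a rough sketch of what such a helper could look like is
below. This is not the actual patch: migrate_timer() is Xen's existing
timer interface, while the sched_resource lookup, its fields and the
per-scheduler move_timers hook are assumptions made purely for
illustration.

/*
 * Sketch only - not the actual patch. Move a cpu's scheduling related
 * timers back from cpu 0 once that cpu is up again after resume.
 * migrate_timer() is Xen's existing timer API; the sched_resource
 * lookup, its fields and the move_timers hook are assumed here.
 */
void sched_migrate_timers(unsigned int cpu)
{
    struct sched_resource *sr;

    rcu_read_lock(&sched_res_rculock);

    sr = get_sched_res(cpu);

    /* The resource can be absent (e.g. a parked cpu): guard against NULL. */
    if ( sr && sr->master_cpu == cpu )
    {
        /* The per-cpu scheduling timer was moved to cpu 0 for suspend. */
        migrate_timer(&sr->s_timer, cpu);

        /* Let the active scheduler move its own per-cpu timers as well. */
        if ( sr->scheduler->move_timers )
            sr->scheduler->move_timers(sr->scheduler, sr);
    }

    rcu_read_unlock(&sched_res_rculock);
}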

I tested it in my setup, but it crashed:

(XEN) arch/x86/cpu/mcheck/mce_intel.c:770: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
(XEN) CPU0 CMCI LVT vector (0xf1) already installed
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) Platform timer appears to have unexpectedly wrapped 3 times.
(XEN) ----[ Xen-4.17-rc  x86_64  debug=y  Tainted:   C    ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d040250c7e>] sched_migrate_timers+0x4d/0xc9
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
(XEN) rax: ffff82d0405c5298   rbx: 0000000000000000   rcx: 0000000000000001
(XEN) rdx: 0000003211219000   rsi: 0000000000000004   rdi: 0000000000000001
(XEN) rbp: ffff830256227d20   rsp: ffff830256227d18   r8:  ffff82d0405d2f78
(XEN) r9:  ffff82d0405ef8a0   r10: 00000000ffffffff   r11: 00000000002191c0
(XEN) r12: 0000000000000000   r13: 0000000000000001   r14: 0000000000000004
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
(XEN) cr3: 0000000049677000   cr2: 0000000000000070
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d040250c7e> (sched_migrate_timers+0x4d/0xc9):
(XEN)  48 8b 14 ca 48 8b 1c 02 <39> 7b 70 74 51 48 8d 05 56 34 37 00 48 89 e2 48
(XEN) Xen stack trace from rsp=ffff830256227d18:
(XEN)    0000000000000001 ffff830256227d58 ffff82d04023f1a0 ffff82d04047a308
(XEN)    ffff82d04047a300 ffff82d04047a060 0000000000000004 0000000000000000
(XEN)    ffff830256227da0 ffff82d040226a04 0000000000000000 0000000000000001
(XEN)    0000000000000001 0000000000000000 0000000000000001 ffff830256227fff
(XEN)    ffff82d04046c520 ffff830256227db8 ffff82d040207e75 0000000000000001
(XEN)    ffff830256227de0 ffff82d040208243 ffff82d04047a220 0000000000000001
(XEN)    0000000000000010 ffff830256227e18 ffff82d040208428 0000000000000200
(XEN)    0000000000000000 0000000000000003 ffff830256227ef8 ffff82d0405de6c0
(XEN)    ffff830256227e48 ffff82d04027a2df ffff830251491490 ffff830251757000
(XEN)    0000000000000000 0000000000000000 ffff830256227e68 ffff82d040209c73
(XEN)    ffff8302517571b8 ffff82d040479618 ffff830256227e88 ffff82d04022e484
(XEN)    ffff82d0405c41a0 ffff82d0405c41b0 ffff830256227eb8 ffff82d04022e76e
(XEN)    0000000000000000 0000000000007fff ffff82d0405caf00 ffff82d0405c41b0
(XEN)    ffff830256227ef0 ffff82d0402f455d ffff82d0402f44e5 ffff830251757000
(XEN)    ffff830256227ef8 ffff8302517f5000 0000000000000000 ffff830256227e18
(XEN)    0000000000000000 ffffc90040b43d60 0000000000003403 0000000000000000
(XEN)    0000000000000003 ffffffff82e37868 0000000000000246 0000000000000003
(XEN)    0000000000003403 0000000000003403 0000000000000000 ffffffff81e4a0ea
(XEN)    0000000000003403 0000000000000010 deadbeefdeadf00d 0000010000000000
(XEN)    ffffffff81e4a0ea 000000000000e033 0000000000000246 ffffc90040b43c30
(XEN) Xen call trace:
(XEN)    [<ffff82d040250c7e>] R sched_migrate_timers+0x4d/0xc9
(XEN)    [<ffff82d04023f1a0>] F cpupool.c#cpu_callback+0x13d/0x47e
(XEN)    [<ffff82d040226a04>] F notifier_call_chain+0x6c/0x96
(XEN)    [<ffff82d040207e75>] F cpu.c#cpu_notifier_call_chain+0x1b/0x36
(XEN)    [<ffff82d040208243>] F cpu_up+0xaf/0xc8
(XEN)    [<ffff82d040208428>] F enable_nonboot_cpus+0x7b/0x1ef
(XEN)    [<ffff82d04027a2df>] F power.c#enter_state_helper+0x156/0x5dc
(XEN)    [<ffff82d040209c73>] F domain.c#continue_hypercall_tasklet_handler+0x50/0xbf
(XEN)    [<ffff82d04022e484>] F tasklet.c#do_tasklet_work+0x7b/0xac
(XEN)    [<ffff82d04022e76e>] F do_tasklet+0x58/0x8a
(XEN)    [<ffff82d0402f455d>] F domain.c#idle_loop+0x78/0xe6
(XEN)
(XEN) Pagetable walk from 0000000000000070:
(XEN)  L4[0x000] = 00000002517fb063 ffffffffffffffff
(XEN)  L3[0x000] = 00000002517fa063 ffffffffffffffff
(XEN)  L2[0x000] = 00000002517f9063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000070
(XEN) ****************************************


This is very weird. The data suggests that the scheduling resource pointer
for cpu 1 was NULL, but I can't see how that could be the case without
causing similar crashes even without this patch.
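
FWIW, the faulting code bytes (<39> 7b 70, i.e. cmp %edi,0x70(%rbx) with
rbx == 0 and edi == 1) look consistent with an unguarded field access
through a NULL per-cpu pointer, roughly along these lines (field and
helper names assumed for illustration only):

/* Illustration of the suspected failure mode, not the actual code. */
struct sched_resource *sr = get_sched_res(cpu);  /* seemingly NULL for cpu 1 */

if ( sr->master_cpu == cpu )         /* reads a field at offset 0x70 of NULL */
    migrate_timer(&sr->s_timer, cpu);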

Are there any additional patches related to cpu on/offlining or suspend/resume
in the hypervisor?


Juergen
