Hello,
I upgraded one of my long-running Xen dom0 machines from Xen 4.11.3 to 4.12.1. It worked fine for several days, but then one of the domUs started failing its monitoring checks and could no longer be reached via ssh. The monitoring shows the load growing linearly (memory consumption grows too, because new monitoring processes are spawned and never finish), and the machine is simply stuck. This has happened 3 times in the last 3 weeks. The dom0 keeps working fine; it is always the same domU that gets stuck.
Xen was upgraded on 19.12.2019. The first lockup happened on 26.12.2019, followed by further lockups on 28.12.2019 and 5.1.2020.
The domU kernel log is full of these messages:
Jan 5 13:19:20 kernel: [680493.141103] INFO: rcu_sched detected stalls on CPUs/tasks:
Jan 5 13:19:20 kernel: [680493.141107] (detected by 12, t=147012 jiffies, g=72555998, c=72555997, q=89937)
Jan 5 13:19:20 kernel: [680493.141112] All QSes seen, last rcu_sched kthread activity 147012 (4975178416-4975031404), jiffies_till_next_fqs=3, root ->qsmask 0x0
Jan 5 13:19:20 kernel: [680493.141114] php-fpm R running task 14024 17581 2249 0x00000000
Jan 5 13:19:20 kernel: [680493.141120] Call Trace:
Jan 5 13:19:20 kernel: [680493.141124] <IRQ>
Jan 5 13:19:20 kernel: [680493.141131] sched_show_task.cold+0xb4/0xcb
Jan 5 13:19:20 kernel: [680493.141135] rcu_check_callbacks.cold+0x36d/0x3ba
Jan 5 13:19:20 kernel: [680493.141138] update_process_times+0x24/0x60
Jan 5 13:19:20 kernel: [680493.141143] tick_sched_handle+0x30/0x50
Jan 5 13:19:20 kernel: [680493.141145] tick_sched_timer+0x30/0x70
Jan 5 13:19:20 kernel: [680493.141147] ? tick_sched_do_timer+0x40/0x40
Jan 5 13:19:20 kernel: [680493.141149] __hrtimer_run_queues+0xbc/0x1f0
Jan 5 13:19:20 kernel: [680493.141153] hrtimer_interrupt+0xa0/0x1d0
Jan 5 13:19:20 kernel: [680493.141158] xen_timer_interrupt+0x1e/0x30
Jan 5 13:19:20 kernel: [680493.141162] __handle_irq_event_percpu+0x3d/0x160
Jan 5 13:19:20 kernel: [680493.141164] handle_irq_event_percpu+0x1c/0x60
Jan 5 13:19:20 kernel: [680493.141168] handle_percpu_irq+0x32/0x50
Jan 5 13:19:20 kernel: [680493.141171] generic_handle_irq+0x1f/0x30
Jan 5 13:19:20 kernel: [680493.141175] __evtchn_fifo_handle_events+0x13f/0x150
Jan 5 13:19:20 kernel: [680493.141181] __xen_evtchn_do_upcall+0x53/0x90
Jan 5 13:19:20 kernel: [680493.141186] xen_evtchn_do_upcall+0x22/0x40
Jan 5 13:19:20 kernel: [680493.141191] xen_hvm_callback_vector+0x85/0x90
Jan 5 13:19:20 kernel: [680493.141192] </IRQ>
Jan 5 13:19:20 kernel: [680493.141194] RIP: 0033:0x56398dc8a959
Jan 5 13:19:20 kernel: [680493.141195] RSP: 002b:00007ffdd588d3d0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
Jan 5 13:19:20 kernel: [680493.141197] RAX: 0000000000000060 RBX: 00007f6ea3aa02e0 RCX: 0000000000000000
Jan 5 13:19:20 kernel: [680493.141198] RDX: 00007f6ea3aa02a0 RSI: 00007ffdd588d3d8 RDI: 00007ffdd588d3e0
Jan 5 13:19:20 kernel: [680493.141199] RBP: 00007f6ea3a9b5b0 R08: 00007f6ea49be770 R09: 00007f6ea483cdc0
Jan 5 13:19:20 kernel: [680493.141200] R10: 00007f6eae520a40 R11: 00007f6eae4933c0 R12: 00005639902892a0
Jan 5 13:19:20 kernel: [680493.141201] R13: 0000000000000000 R14: 00007f6eae41e930 R15: 00007f6ea7077138
Jan 5 13:19:20 kernel: [680493.141204] rcu_sched kthread starved for 147012 jiffies! g72555998 c72555997 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x200 ->cpu=3
Jan 5 13:19:20 kernel: [680493.141205] rcu_sched R15016 8 2 0x80000000
Jan 5 13:19:20 kernel: [680493.141210] Call Trace:
Jan 5 13:19:20 kernel: [680493.141215] ? __schedule+0x24e/0x710
Jan 5 13:19:20 kernel: [680493.141216] schedule+0x2d/0x80
Jan 5 13:19:20 kernel: [680493.141219] schedule_timeout+0x16c/0x340
Jan 5 13:19:20 kernel: [680493.141221] ? call_timer_fn+0x130/0x130
Jan 5 13:19:20 kernel: [680493.141222] rcu_gp_kthread+0x486/0xd60
Jan 5 13:19:20 kernel: [680493.141224] kthread+0xfd/0x130
Jan 5 13:19:20 kernel: [680493.141226] ? force_qs_rnp+0x170/0x170
Jan 5 13:19:20 kernel: [680493.141227] ? __kthread_parkme+0x90/0x90
Jan 5 13:19:20 kernel: [680493.141228] ret_from_fork+0x35/0x40
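A minimal sketch for pulling the fields out of the "detected by" lines above, which may help when comparing the three lockups. The helper names are hypothetical, and the conversion to seconds depends on the domU kernel's CONFIG_HZ, which is not stated here, so it is passed in as a parameter rather than assumed.

```python
import re

# Matches the summary line printed by the RCU stall detector, e.g.
# "(detected by 12, t=147012 jiffies, g=72555998, c=72555997, q=89937)"
STALL_RE = re.compile(
    r"\(detected by (\d+), t=(\d+) jiffies, g=(\d+), c=(\d+), q=(\d+)\)"
)


def parse_stall(line):
    """Return the stall fields from one log line, or None if it does not match."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    cpu, jiffies, g, c, q = (int(x) for x in m.groups())
    return {
        "detecting_cpu": cpu,   # CPU that reported the stall
        "jiffies": jiffies,     # time since the grace period started
        "g": g,                 # g= value as printed
        "c": c,                 # c= value as printed
        "q": q,                 # q= value as printed
    }


def stall_seconds(jiffies, hz):
    """Convert jiffies to seconds; hz must match the guest's CONFIG_HZ."""
    return jiffies / hz


if __name__ == "__main__":
    sample = ("Jan 5 13:19:20 kernel: [680493.141107] "
              "(detected by 12, t=147012 jiffies, g=72555998, "
              "c=72555997, q=89937)")
    info = parse_stall(sample)
    print(info)
    # Both HZ values below are only examples, not the actual guest setting.
    for hz in (250, 1000):
        print(hz, stall_seconds(info["jiffies"], hz))
```

Running this over the full domU log for each of the three dates would show whether the stall duration and the detecting CPU are the same every time.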
The dom0 is running kernel 4.14.158 with these Xen command line options:
GRUB_CMDLINE_XEN="dom0_mem=4G gnttab_max_frames=256 ucode=scan loglvl=all guest_loglvl=all console_to_ring console_timestamps=date conring_size=1m smt=true iommu=no-intremap"
The domU config:
name = "machine"
kernel = "kernel-4.14.159-gentoo-xen"
memory = 10000
vcpus = 16
vif = [ '' ]
disk = [
'...root,raw,xvda,rw',
'...opt,raw,xvdc,rw',
'...home,raw,xvdb,rw',
'...tmp,raw,xvdd,rw',
'...var,raw,xvde,rw',
]
extra = "root=/dev/xvda net.ifnames=0 console=ttyS0 console=ttyS0,38400n8"
type = "hvm"
sdl = 0
vnc = 0
serial='pty'
xen_platform_pci=1
max_grant_frames = 256
I've had similar issues in the past with the grant frames (basically this issue:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554). Maybe some other value needs to be raised too?
Thanks,
Tomas