
Re: [Xen-users] xen domU stall on 4.12.1





On Thu, Feb 6, 2020 at 8:06 AM Tomas Mozes <hydrapolic@xxxxxxxxx> wrote:



On Mon, Jan 27, 2020 at 2:42 PM Tomas Mozes <hydrapolic@xxxxxxxxx> wrote:


On Tue, Jan 7, 2020 at 8:29 AM Tomas Mozes <hydrapolic@xxxxxxxxx> wrote:
Hello,
I've tried upgrading one of my long-running Xen dom0 machines from Xen 4.11.3 to 4.12.1. It worked fine for several days, but then one of the domUs failed its monitoring checks and could no longer be reached via ssh. The monitoring shows the load growing linearly (memory consumption grows too, as newly spawned monitoring processes never finish) while the machine is simply stuck. This has happened 3 times during the last 3 weeks. The dom0 keeps working fine; only one of the domUs gets stuck, and it is always the same domU.

Xen was upgraded on 19.12.2019; the first lockup happened on 26.12.2019, then on 28.12.2019 and 5.1.2020.

The domU kernel log is full of these messages:
Jan  5 13:19:20 kernel: [680493.141103] INFO: rcu_sched detected stalls on CPUs/tasks:
Jan  5 13:19:20 kernel: [680493.141107]         (detected by 12, t=147012 jiffies, g=72555998, c=72555997, q=89937)
Jan  5 13:19:20 kernel: [680493.141112] All QSes seen, last rcu_sched kthread activity 147012 (4975178416-4975031404), jiffies_till_next_fqs=3, root ->qsmask 0x0
Jan  5 13:19:20 kernel: [680493.141114] php-fpm         R  running task    14024 17581   2249 0x00000000
Jan  5 13:19:20 kernel: [680493.141120] Call Trace:
Jan  5 13:19:20 kernel: [680493.141124]  <IRQ>
Jan  5 13:19:20 kernel: [680493.141131]  sched_show_task.cold+0xb4/0xcb
Jan  5 13:19:20 kernel: [680493.141135]  rcu_check_callbacks.cold+0x36d/0x3ba
Jan  5 13:19:20 kernel: [680493.141138]  update_process_times+0x24/0x60
Jan  5 13:19:20 kernel: [680493.141143]  tick_sched_handle+0x30/0x50
Jan  5 13:19:20 kernel: [680493.141145]  tick_sched_timer+0x30/0x70
Jan  5 13:19:20 kernel: [680493.141147]  ? tick_sched_do_timer+0x40/0x40
Jan  5 13:19:20 kernel: [680493.141149]  __hrtimer_run_queues+0xbc/0x1f0
Jan  5 13:19:20 kernel: [680493.141153]  hrtimer_interrupt+0xa0/0x1d0
Jan  5 13:19:20 kernel: [680493.141158]  xen_timer_interrupt+0x1e/0x30
Jan  5 13:19:20 kernel: [680493.141162]  __handle_irq_event_percpu+0x3d/0x160
Jan  5 13:19:20 kernel: [680493.141164]  handle_irq_event_percpu+0x1c/0x60
Jan  5 13:19:20 kernel: [680493.141168]  handle_percpu_irq+0x32/0x50
Jan  5 13:19:20 kernel: [680493.141171]  generic_handle_irq+0x1f/0x30
Jan  5 13:19:20 kernel: [680493.141175]  __evtchn_fifo_handle_events+0x13f/0x150
Jan  5 13:19:20 kernel: [680493.141181]  __xen_evtchn_do_upcall+0x53/0x90
Jan  5 13:19:20 kernel: [680493.141186]  xen_evtchn_do_upcall+0x22/0x40
Jan  5 13:19:20 kernel: [680493.141191]  xen_hvm_callback_vector+0x85/0x90
Jan  5 13:19:20 kernel: [680493.141192]  </IRQ>
Jan  5 13:19:20 kernel: [680493.141194] RIP: 0033:0x56398dc8a959
Jan  5 13:19:20 kernel: [680493.141195] RSP: 002b:00007ffdd588d3d0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
Jan  5 13:19:20 kernel: [680493.141197] RAX: 0000000000000060 RBX: 00007f6ea3aa02e0 RCX: 0000000000000000
Jan  5 13:19:20 kernel: [680493.141198] RDX: 00007f6ea3aa02a0 RSI: 00007ffdd588d3d8 RDI: 00007ffdd588d3e0
Jan  5 13:19:20 kernel: [680493.141199] RBP: 00007f6ea3a9b5b0 R08: 00007f6ea49be770 R09: 00007f6ea483cdc0
Jan  5 13:19:20 kernel: [680493.141200] R10: 00007f6eae520a40 R11: 00007f6eae4933c0 R12: 00005639902892a0
Jan  5 13:19:20 kernel: [680493.141201] R13: 0000000000000000 R14: 00007f6eae41e930 R15: 00007f6ea7077138
Jan  5 13:19:20 kernel: [680493.141204] rcu_sched kthread starved for 147012 jiffies! g72555998 c72555997 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x200 ->cpu=3
Jan  5 13:19:20 kernel: [680493.141205] rcu_sched       R15016     8      2 0x80000000
Jan  5 13:19:20 kernel: [680493.141210] Call Trace:
Jan  5 13:19:20 kernel: [680493.141215]  ? __schedule+0x24e/0x710
Jan  5 13:19:20 kernel: [680493.141216]  schedule+0x2d/0x80
Jan  5 13:19:20 kernel: [680493.141219]  schedule_timeout+0x16c/0x340
Jan  5 13:19:20 kernel: [680493.141221]  ? call_timer_fn+0x130/0x130
Jan  5 13:19:20 kernel: [680493.141222]  rcu_gp_kthread+0x486/0xd60
Jan  5 13:19:20 kernel: [680493.141224]  kthread+0xfd/0x130
Jan  5 13:19:20 kernel: [680493.141226]  ? force_qs_rnp+0x170/0x170
Jan  5 13:19:20 kernel: [680493.141227]  ? __kthread_parkme+0x90/0x90
Jan  5 13:19:20 kernel: [680493.141228]  ret_from_fork+0x35/0x40
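
The "rcu_sched kthread starved" line suggests the RCU grace-period kthread is not getting any CPU time at all, which in a guest usually points at a vCPU that is no longer being scheduled rather than at RCU itself. If it happens again I'll try to capture task states from the outside; a rough sketch of what I have in mind, assuming the domU's PV control interface is still responsive and "machine" is the domain name:

  # send Magic SysRq 't' (dump all task states) and 'l' (active CPU backtraces)
  # to the guest; the output should show up in the guest's kernel/serial log
  xl sysrq machine t
  xl sysrq machine l

  # dump the hypervisor's own view of domain/vCPU states into the console ring
  xl debug-keys q
  xl dmesg | tail -n 200

The debug-key letters are from memory, so "xl debug-keys h" (print the key list) is the safer first step.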

The Xen dom0 is running kernel 4.14.158 with these Xen command line options: GRUB_CMDLINE_XEN="dom0_mem=4G gnttab_max_frames=256 ucode=scan loglvl=all guest_loglvl=all console_to_ring console_timestamps=date conring_size=1m smt=true iommu=no-intremap"
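
In case it helps rule out boot-config mix-ups, this is how I double-check which options the hypervisor actually booted with (a minimal sketch, assuming a standard xl toolstack):

  xl info | grep -E 'xen_version|xen_commandline'
  xl dmesg | grep -i 'command line'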

Xen-domU config:
name = "machine"
kernel = "kernel-4.14.159-gentoo-xen"
memory = 10000
vcpus = 16
vif = [ '' ]
disk = [
'...root,raw,xvda,rw',
'...opt,raw,xvdc,rw',
'...home,raw,xvdb,rw',
'...tmp,raw,xvdd,rw',
'...var,raw,xvde,rw',
]
extra = "root=/dev/xvda net.ifnames=0 console=ttyS0 console=ttyS0,38400n8"
type = "hvm"
sdl = 0
vnc = 0
serial='pty'
xen_platform_pci=1
max_grant_frames = 256

I've had issues like this in the past with grant frames (basically this issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554); maybe some other value needs to be raised too?
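
If grant frames turn out to be a suspect again, the hypervisor can dump per-domain grant-table usage on demand; a rough sketch, assuming the 'g' debug key still maps to the grant-table dump as in earlier Xen versions:

  # ask Xen to print grant-table usage for all domains, then read the console ring
  xl debug-keys g
  xl dmesg | grep -i grant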

Thanks,
Tomas

Reproduced on another HP machine. The previous one was an HP ProLiant DL360 G7 with 2x Intel Xeon E5620 @ 2.40GHz; the other is an HP ProLiant DL360p Gen8 with 2x Intel Xeon E5-2630 @ 2.30GHz.

The strange thing is that this does not happen on our testing machine (which of course has a lower load); that's a Supermicro X10DRW with 1x Intel Xeon E5-2620 v3 @ 2.40GHz.


Just an update: I've tried Xen 4.12 and the latest staging Xen 4.13, and both behave the same, regardless of whether kernel 4.14 or 5.4 is used. As soon as the Xen version is reverted to 4.11, everything works just fine; nothing else needs to be changed.

I've tried adding "mitigations=off" to the kernel options and "spec-ctrl=false xpti=false pv-l1tf=false tsx=true" to the Xen options, but neither helped.
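
For reference, that test amounted to booting with roughly the following; the exact files are distro-specific, so this is just a sketch of a GRUB-style setup:

  # hypervisor options (appended to the GRUB_CMDLINE_XEN shown above)
  GRUB_CMDLINE_XEN="... spec-ctrl=false xpti=false pv-l1tf=false tsx=true"
  # Linux kernel options (dom0 via GRUB_CMDLINE_LINUX, the domU via its extra= line)
  GRUB_CMDLINE_LINUX="... mitigations=off"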

Thanks,
Tomas


As reported in https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00361.html and https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00042.html, switching back to the credit1 scheduler seems to make it work again. I've migrated 6 machines to Xen 4.12 with the sched=credit Xen option and haven't observed a hang for more than a week now.
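
For anyone trying the same workaround: the active scheduler can be confirmed after reboot; a minimal sketch, assuming the xl toolstack (4.12 defaults to credit2):

  # should now report "credit" instead of the 4.12 default "credit2"
  xl info | grep xen_scheduler
  # per-domain weights/caps under the credit scheduler
  xl sched-credit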

Thanks,
Tomas

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users

 

