 
	
Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU
On 03/25/2016 10:05 AM, Steven Haigh wrote:
On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
On 03/24/2016 10:53 PM, Steven Haigh wrote:

Hi all,

Firstly, I've cross-posted this to xen-devel and the lkml - as this problem seems to only exist when using kernel 4.4 as a Xen DomU kernel. I have also CC'ed Greg KH for his awesome insight as maintainer.

Please CC myself into replies - as I'm not a member of the kernel mailing list - I may miss replies from monitoring the archives.

I've noticed recently that heavy disk IO is causing rcu_sched to detect stalls. The process mentioned usually goes to 100% CPU usage, and eventually processes start segfaulting and dying. The only fix to recover the system is to use 'xl destroy' to force-kill the VM and to start it again.

The majority of these issues seem to mention ext4 in the trace. This may indicate an issue there - or may be a red herring.

The gritty details:

INFO: rcu_sched self-detected stall on CPU
    0-...: (20999 ticks this GP) idle=327/140000000000001/0 softirq=1101493/1101493 fqs=6973
     (t=21000 jiffies g=827095 c=827094 q=524)
Task dump for CPU 0:
rsync R running task 0 2446 2444 0x00000088
 ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
 ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
 ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
Call Trace:
 <IRQ>
 [<ffffffff810a625f>] sched_show_task+0xaf/0x110
 [<ffffffff810a8699>] dump_cpu_task+0x39/0x40
 [<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
 [<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
 [<ffffffff810a91e1>] ? account_system_time+0x81/0x110
 [<ffffffff810a9481>] ? account_process_tick+0x61/0x160
 [<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
 [<ffffffff810d9749>] update_process_times+0x39/0x60
 [<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
 [<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
 [<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
 [<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
 [<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
 [<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
 [<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
 [<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
 [<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
 [<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
 [<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
 [<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
 [<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
 <EOI>
 [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
 [<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30

This looks to me like ext4 failing to grab a lock. Everything above it (in Xen code) is regular tick interrupt handling which detects the stall.

Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates any possible issues with pv locks.

Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?

That doesn't look like a full log. In any case, the RCU stall may be a secondary problem --- there is a bunch of splats before the stall.

-boris

Not sure if it makes any difference at all, but my DomU config is:

# cat /etc/xen/backup.vm
name = "backup.vm"
memory = 2048
vcpus = 2
cpus = "1-3"
disk = [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
vif = [ "mac=00:11:36:35:35:09, bridge=br203, vifname=vm.backup, script=vif-bridge" ]
bootloader = 'pygrub'
pvh = 1
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'restart'
cpu_weight = 64

I never had this problem when running kernel 4.1.x - it only started when I upgraded everything to 4.4 - not exactly a great help - but may help narrow things down?
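
On the "are you oversubscribing (CPU-wise)" question: a minimal Python sketch, not part of the original thread, that compares the vcpus count in a guest config against the pCPUs listed in its cpus setting. The helper names parse_cpu_list and pinning_summary are hypothetical, and the parser only handles plain numbers and ranges such as "1-3", not every form xl accepts. It looks at a single guest only; whether the host is really oversubscribed depends on every guest sharing those pCPUs (xl vcpu-list shows the live placement).

```python
import re


def parse_cpu_list(spec):
    """Expand a simple CPU list such as "1-3" or "0,2-3" into a set of ints.

    Assumption: only plain numbers and ranges are handled. Other forms that
    xl accepts (for example "all" or "^N" exclusions) are not covered here.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pinning_summary(config_text):
    """Return (vcpus, pinned_pcpus, more_vcpus_than_pcpus) for one guest config.

    pinned_pcpus is None when the config has no quoted cpus = "..." line.
    """
    vcpus_match = re.search(r"^\s*vcpus\s*=\s*(\d+)", config_text, re.M)
    if vcpus_match is None:
        raise ValueError("no vcpus setting found")
    vcpus = int(vcpus_match.group(1))

    cpus_match = re.search(r"""^\s*cpus\s*=\s*["']([^"']+)["']""", config_text, re.M)
    if cpus_match is None:
        return vcpus, None, False

    pinned = sorted(parse_cpu_list(cpus_match.group(1)))
    return vcpus, pinned, vcpus > len(pinned)


# The relevant lines from the DomU config quoted above.
CONFIG = """
name = "backup.vm"
memory = 2048
vcpus = 2
cpus = "1-3"
"""

if __name__ == "__main__":
    print(pinning_summary(CONFIG))
```

For the config in this thread the guest has 2 vCPUs restricted to pCPUs 1-3, so this guest on its own does not have more vCPUs than the pCPUs it may run on.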
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
 
 