[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] 4.9.52: INFO: task XXX blocked for more than 300 seconds.



On Wed, Oct 04, 2017 at 08:26:27PM +0200, Philipp Hahn wrote:
> Hello,
> 

Adding Ankur to this as I think he saw something similar. 

But in the meantime - do you see this with the latest version of Linux?
> with linux-4.9.52 running on Debian-Wheezy with Xen-4.1 I observed
> several stuck processes: Here is one (truncated) dump of the Linux
> kernel messages:
> 
> >  [<ffffffff8160db6d>] ? __schedule+0x23d/0x6d0
> >  [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160e032>] ? schedule+0x32/0x80
> >  [<ffffffff8161150c>] ? schedule_timeout+0x1ec/0x360
> >  [<ffffffff8130f277>] ? __blk_mq_run_hw_queue+0x327/0x3e0* see below
> >  [<ffffffff8101b7f1>] ? xen_clocksource_get_cycles+0x11/0x20
> >  [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160d8b4>] ? io_schedule_timeout+0xb4/0x130
> >  [<ffffffff810bb6f7>] ? prepare_to_wait+0x57/0x80
> >  [<ffffffff8160e8b7>] ? bit_wait_io+0x17/0x60
> >  [<ffffffff8160e3ac>] ? __wait_on_bit+0x5c/0x90
> >  [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160e50e>] ? out_of_line_wait_on_bit+0x7e/0xa0
> >  [<ffffffff810bba20>] ? autoremove_wake_function+0x40/0x40
> >  [<ffffffffc00abd18>] ? jbd2_journal_commit_transaction+0xd48/0x17e0 [jbd2]
> >  [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
> >  [<ffffffff810e67fd>] ? try_to_del_timer_sync+0x4d/0x80
> >  [<ffffffffc00b099d>] ? kjournald2+0xdd/0x280 [jbd2]
> >  [<ffffffff810bb9e0>] ? wake_up_atomic_t+0x30/0x30
> >  [<ffffffffc00b08c0>] ? commit_timeout+0x10/0x10 [jbd2]
> >  [<ffffffff81097660>] ? kthread+0xf0/0x110
> >  [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
> >  [<ffffffff81097570>] ? kthread_park+0x60/0x60
> >  [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
> > NMI backtrace for cpu 2
> > CPU: 2 PID: 35 Comm: khungtaskd Not tainted 4.9.0-ucs105-amd64 #1 Debian 
> > 4.9.30-2A~4.2.0.201709271649
> >  0000000000000000 ffffffff81331935 0000000000000000 0000000000000002
> >  ffffffff81335e60 0000000000000002 ffffffff8104cb70 ffff8801f0c90e80
> >  ffffffff81335f6a ffff8801f0c90e80 00000000003fffbc ffffffff81128048
> > Call Trace:
> >  [<ffffffff81331935>] ? dump_stack+0x5c/0x77
> >  [<ffffffff81335e60>] ? nmi_cpu_backtrace+0x90/0xa0
> >  [<ffffffff8104cb70>] ? irq_force_complete_move+0x140/0x140
> >  [<ffffffff81335f6a>] ? nmi_trigger_cpumask_backtrace+0xfa/0x130
> >  [<ffffffff81128048>] ? watchdog+0x2b8/0x330
> >  [<ffffffff81127d90>] ? reset_hung_task_detector+0x10/0x10
> >  [<ffffffff81097660>] ? kthread+0xf0/0x110
> >  [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
> >  [<ffffffff81097570>] ? kthread_park+0x60/0x60
> >  [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
> > Sending NMI from CPU 2 to CPUs 0-1,3:
> > NMI backtrace for cpu 1
> > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-ucs105-amd64 #1 Debian 
> > 4.9.30-2A~4.2.0.201709271649
> > task: ffff8801f4a02ec0 task.stack: ffffc90040ca4000
> > RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] 
> > xen_hypercall_sched_op+0xa/0x20
> > RSP: e02b:ffffc90040ca7ed0  EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ffff8801f4a02ec0 RCX: ffffffff810013aa
> > RDX: ffffffff81c4ba70 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801f4a02ec0 R15: ffff8801f4a02ec0
> > FS:  00007f23ac595700(0000) GS:ffff8801f5a80000(0000) knlGS:0000000000000000
> > CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
> > CR2: 00007f52537d6d46 CR3: 00000001bba23000 CR4: 0000000000002660
> > Stack:
> >  ffff8801bb832201 0000000000000001 ffffffff8101b55c ffffffff81611ec8
> >  ffff8801f4a02ec0 0000000000000001 ffffffff810bc280 ffff8801f4a02ec0
> >  ffff8801f4a02ec0 c0a995961d41240f addce6dcadd009c9 0000000000000000
> > Call Trace:
> >  [<ffffffff8101b55c>] ? xen_safe_halt+0xc/0x20
> >  [<ffffffff81611ec8>] ? default_idle+0x18/0xd0
> >  [<ffffffff810bc280>] ? cpu_startup_entry+0x1f0/0x260
> > Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc 
> > cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 
> > cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
> > NMI backtrace for cpu 3
> > CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.9.0-ucs105-amd64 #1 Debian 
> > 4.9.30-2A~4.2.0.201709271649
> > task: ffff8801f4a24f00 task.stack: ffffc90040cb4000
> > RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] 
> > xen_hypercall_sched_op+0xa/0x20
> > RSP: e02b:ffffc90040cb7ed0  EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ffff8801f4a24f00 RCX: ffffffff810013aa
> > RDX: ffffffff81c4ba70 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801f4a24f00 R15: ffff8801f4a24f00
> > FS:  00007f1a2af19700(0000) GS:ffff8801f5b80000(0000) knlGS:0000000000000000
> > CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
> > CR2: 00007f4a5416b000 CR3: 00000001d83ec000 CR4: 0000000000002660
> > Stack:
> >  0000000000000001 0000000000000001 ffffffff8101b55c ffffffff81611ec8
> >  ffff8801f4a24f00 0000000000000003 ffffffff810bc280 ffff8801f4a24f00
> >  ffff8801f4a24f00 77816deb133b9979 addce6dcadd009c9 0000000000000000
> > Call Trace:
> >  [<ffffffff8101b55c>] ? xen_safe_halt+0xc/0x20
> >  [<ffffffff81611ec8>] ? default_idle+0x18/0xd0
> >  [<ffffffff810bc280>] ? cpu_startup_entry+0x1f0/0x260
> > Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc 
> > cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 
> > cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
> > NMI backtrace for cpu 0
> > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-ucs105-amd64 #1 Debian 
> > 4.9.30-2A~4.2.0.201709271649
> > task: ffffffff81c0e540 task.stack: ffffffff81c00000
> > RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] 
> > xen_hypercall_sched_op+0xa/0x20
> > RSP: e02b:ffffffff81c03e90  EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ffffffff81c0e540 RCX: ffffffff810013aa
> > RDX: ffffffff81c4ba70 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffffffff81c0e540 R15: ffffffff81c0e540
> > FS:  00007f977d3c2700(0000) GS:ffff8801f5a00000(0000) knlGS:0000000000000000
> > CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 000055ae9becabc8 CR3: 00000001d6a40000 CR4: 0000000000002660
> > Stack:
> >  0000000000000001 0000000000000001 ffffffff8101b55c ffffffff81611ec8
> >  ffffffff81c0e540 0000000000000000 ffffffff810bc280 ffffffff81c0e540
> >  ffffffff81c0e540 ebeac214e330bd3f addce6dcadd009c9 ffffffffffffffff
> > Call Trace:
> >  [<ffffffff8101b55c>] ? xen_safe_halt+0xc/0x20
> >  [<ffffffff81611ec8>] ? default_idle+0x18/0xd0
> >  [<ffffffff810bc280>] ? cpu_startup_entry+0x1f0/0x260
> >  [<ffffffff81d4bf84>] ? start_kernel+0x46d/0x48d
> >  [<ffffffff81d510e6>] ? xen_start_kernel+0x52e/0x538
> > Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc 
> > cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 
> > cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
> 
> 
> Looking at the dis-assembly of xen_clocksource_get_cycles() in
> arch/x86/xen/time.c I see no path how that should call
> __blk_mq_run_hw_queue():
> 
> > 00000000000001a0 <xen_clocksource_get_cycles> mov    %gs:0x0(%rip),%rdi     
> >    # 00000000000001a8 <xen_clocksource_get_cycles+0x8>
> > 00000000000001a8 <xen_clocksource_get_cycles+0x8> add    $0x20,%rdi
> >         ret = pvclock_clocksource_read(src);
> > 00000000000001ac <xen_clocksource_get_cycles+0xc> callq  00000000000001b1 
> > <xen_clocksource_get_cycles+0x11>
> > }
> > 
> > static cycle_t xen_clocksource_get_cycles(struct clocksource *cs)
> > {
> >         return xen_clocksource_read();
> > }
> > 00000000000001b1 <xen_clocksource_get_cycles+0x11> retq   
> > 00000000000001b2 <xen_clocksource_get_cycles+0x12> data16 data16 data16 
> > data16 nopw %cs:0x0(%rax,%rax,1)
> > 
> > static void xen_read_wallclock(struct timespec *ts)
> > {
> >         struct shared_info *s = HYPERVISOR_shared_info;
> >         struct pvclock_wall_clock *wall_clock = &(s->wc);
> > 00000000000001c0 <xen_get_wallclock> mov    0x0(%rip),%rax        # 
> > 00000000000001c7 <xen_get_wallclock+0x7>
> 
> 
> Here's another dump, which diverges from xen_clocksource_get_cycles() to
> some completely other function:
> 
> > INFO: task btrfs-transacti:522 blocked for more than 300 seconds.
> > btrfs-transacti D    0   522      2 0x00000000
> >  ffff8801f3836000 0000000000000000 ffff8801f4a24f00 ffff8801f32da1c0
> >  ffff8801f5b18780 ffffc9004199fa40 ffffffff8160cd2d 0000000000000000
> >  ffff8801741a58a8 0000000000000102 000000000000000e ffff8801f32da1c0
> > Call Trace:
> >  [<ffffffff8160cd2d>] ? __schedule+0x23d/0x6d0
> >  [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160d1f2>] ? schedule+0x32/0x80
> >  [<ffffffff81610729>] ? schedule_timeout+0x249/0x300
> ***
> >  [<ffffffff8101b7d1>] ? xen_clocksource_get_cycles+0x11/0x20
> >  [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160ca74>] ? io_schedule_timeout+0xb4/0x130
> >  [<ffffffff810bb5e7>] ? prepare_to_wait+0x57/0x80
> >  [<ffffffff8160da77>] ? bit_wait_io+0x17/0x60
> >  [<ffffffff8160d56c>] ? __wait_on_bit+0x5c/0x90
> >  [<ffffffff8117e18e>] ? find_get_pages_tag+0x15e/0x300
> >  [<ffffffff8117d016>] ? wait_on_page_bit+0x86/0xa0
> >  [<ffffffff810bb910>] ? autoremove_wake_function+0x40/0x40
> >  [<ffffffff8117d107>] ? __filemap_fdatawait_range+0xd7/0x150
> >  [<ffffffff8117d18f>] ? filemap_fdatawait_range+0xf/0x30
> >  [<ffffffffc0175643>] ? btrfs_wait_ordered_range+0x73/0x110 [btrfs]
> >  [<ffffffffc01a2a0d>] ? btrfs_wait_cache_io+0x5d/0x1f0 [btrfs]
> >  [<ffffffffc0141a36>] ? btrfs_write_dirty_block_groups+0x106/0x380 [btrfs]
> >  [<ffffffffc0140c5d>] ? btrfs_run_delayed_refs+0x1fd/0x2b0 [btrfs]
> >  [<ffffffffc01551d7>] ? commit_cowonly_roots+0x257/0x2f0 [btrfs]
> >  [<ffffffffc0157a24>] ? btrfs_commit_transaction+0x4e4/0xa40 [btrfs]
> >  [<ffffffffc015801d>] ? start_transaction+0x9d/0x4a0 [btrfs]
> >  [<ffffffffc01523a2>] ? transaction_kthread+0x1b2/0x1f0 [btrfs]
> >  [<ffffffffc01521f0>] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> >  [<ffffffff810975c0>] ? kthread+0xf0/0x110
> >  [<ffffffff8102476b>] ? __switch_to+0x2bb/0x700
> >  [<ffffffff810974d0>] ? kthread_park+0x60/0x60
> >  [<ffffffff81611d35>] ? ret_from_fork+0x25/0x30
> 
> And another one:
> > INFO: task smbd:20101 blocked for more than 300 seconds.
> > smbd            D    0 20101   1714 0x00000000
> >  ffff8801f01cac00 0000000000000000 ffff8801f4a241c0 ffff88007f9a5240
> >  ffff8801f5b98780 ffffc90049a33bc0 ffffffff8160cd2d ffff8800f78bb5b0
> >  fd2236313fb5274b 000000008dc3c1a0 ffff8801f3ef0a40 ffff88007f9a5240
> > Call Trace:
> >  [<ffffffff8160cd2d>] ? __schedule+0x23d/0x6d0
> >  [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160d1f2>] ? schedule+0x32/0x80
> >  [<ffffffff81610729>] ? schedule_timeout+0x249/0x300
> >  [<ffffffff81336486>] ? __radix_tree_lookup+0x76/0xe0
> ***
> >  [<ffffffff8101b7d1>] ? xen_clocksource_get_cycles+0x11/0x20
> >  [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160ca74>] ? io_schedule_timeout+0xb4/0x130
> >  [<ffffffff810bb5e7>] ? prepare_to_wait+0x57/0x80
> >  [<ffffffff8160da77>] ? bit_wait_io+0x17/0x60
> >  [<ffffffff8160d56c>] ? __wait_on_bit+0x5c/0x90
> >  [<ffffffffc00f290b>] ? __jbd2_journal_file_buffer+0xcb/0x180 [jbd2]
> >  [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> >  [<ffffffff8160d6ce>] ? out_of_line_wait_on_bit+0x7e/0xa0
> >  [<ffffffff810bb910>] ? autoremove_wake_function+0x40/0x40
> >  [<ffffffffc00f2bc8>] ? do_get_write_access+0x208/0x420 [jbd2]
> >  [<ffffffffc00f2e0e>] ? jbd2_journal_get_write_access+0x2e/0x60 [jbd2]
> >  [<ffffffffc02c78a6>] ? __ext4_journal_get_write_access+0x36/0x70 [ext4]
> >  [<ffffffffc02a3913>] ? ext4_orphan_add+0xd3/0x230 [ext4]
> >  [<ffffffffc0296b4a>] ? ext4_mark_inode_dirty+0x6a/0x200 [ext4]
> >  [<ffffffffc02a4b8a>] ? ext4_unlink+0x36a/0x380 [ext4]
> >  [<ffffffff81211677>] ? vfs_unlink+0xe7/0x180
> >  [<ffffffff81214439>] ? do_unlinkat+0x289/0x300
> >  [<ffffffff81611abb>] ? system_call_fast_compare_end+0xc/0x9b
> 
> This does not look normal to me or did I miss something?
> 
> Where can I get more information on why there is no progress for 300s,
> what should I do to debug which task is waiting for what?
> 
> The traces of the of other CPUs look normal to me: the one posted first
> above is the shortest, in all other cases they were sooner or later
> waiting for IO (my interpretation, but I can post them if necessary.)
> 
> This problem occurs since the upgrade of the Linux kernel inside the VM
> from 4.1.x to 4.9.32 and now 4.9.52.
> 
> Any help is appreciated.
> Philipp Hahn
> -- 
> Philipp Hahn
> Open Source Software Engineer
> 
> Univention GmbH
> be open.
> Mary-Somerville-Str. 1
> D-28359 Bremen
> Tel.: +49 421 22232-0
> Fax : +49 421 22232-99
> hahn@xxxxxxxxxxxxx
> 
> http://www.univention.de/
> Geschäftsführer: Peter H. Ganten
> HRB 20755 Amtsgericht Bremen
> Steuer-Nr.: 71-597-02876
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> https://lists.xen.org/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.