Xen project Mailing List

Re: [Xen-devel] 4.9.52: INFO: task XXX blocked for more than 300 seconds.

To: Jan Beulich <JBeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>

From: Philipp Hahn <hahn@xxxxxxxxxxxxx>

Date: Thu, 5 Oct 2017 14:04:44 +0200

Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>

Delivery-date: Thu, 05 Oct 2017 12:04:55 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hello Jan, thank you for you answer. Am 05.10.2017 um 12:12 schrieb Jan Beulich: >>>> On 04.10.17 at 20:26, <hahn@xxxxxxxxxxxxx> wrote: >> with linux-4.9.52 running on Debian-Wheezy with Xen-4.1 I observed >> several stuck processes: Here is one (truncated) dump of the Linux >> kernel messages: >> >>> [<ffffffff8160db6d>] ? __schedule+0x23d/0x6d0 >>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90 >>> [<ffffffff8160e032>] ? schedule+0x32/0x80 >>> [<ffffffff8161150c>] ? schedule_timeout+0x1ec/0x360 >>> [<ffffffff8130f277>] ? __blk_mq_run_hw_queue+0x327/0x3e0* see below >>> [<ffffffff8101b7f1>] ? xen_clocksource_get_cycles+0x11/0x20 >>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90 >>> [<ffffffff8160d8b4>] ? io_schedule_timeout+0xb4/0x130 >>> [<ffffffff810bb6f7>] ? prepare_to_wait+0x57/0x80 >>> [<ffffffff8160e8b7>] ? bit_wait_io+0x17/0x60 >>> [<ffffffff8160e3ac>] ? __wait_on_bit+0x5c/0x90 >>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90 >>> [<ffffffff8160e50e>] ? out_of_line_wait_on_bit+0x7e/0xa0 >>> [<ffffffff810bba20>] ? autoremove_wake_function+0x40/0x40 >>> [<ffffffffc00abd18>] ? jbd2_journal_commit_transaction+0xd48/0x17e0 [jbd2] >>> [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720 >>> [<ffffffff810e67fd>] ? try_to_del_timer_sync+0x4d/0x80 >>> [<ffffffffc00b099d>] ? kjournald2+0xdd/0x280 [jbd2] >>> [<ffffffff810bb9e0>] ? wake_up_atomic_t+0x30/0x30 >>> [<ffffffffc00b08c0>] ? commit_timeout+0x10/0x10 [jbd2] >>> [<ffffffff81097660>] ? kthread+0xf0/0x110 >>> [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720 >>> [<ffffffff81097570>] ? kthread_park+0x60/0x60 >>> [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30 >>> NMI backtrace for cpu 2 >>> CPU: 2 PID: 35 Comm: khungtaskd Not tainted 4.9.0-ucs105-amd64 #1 Debian >>> 4.9.30-2A~4.2.0.201709271649 >>> 0000000000000000 ffffffff81331935 0000000000000000 0000000000000002 >>> ffffffff81335e60 0000000000000002 ffffffff8104cb70 ffff8801f0c90e80 >>> ffffffff81335f6a ffff8801f0c90e80 00000000003fffbc ffffffff81128048 >>> Call Trace: >>> [<ffffffff81331935>] ? dump_stack+0x5c/0x77 >>> [<ffffffff81335e60>] ? nmi_cpu_backtrace+0x90/0xa0 >>> [<ffffffff8104cb70>] ? irq_force_complete_move+0x140/0x140 >>> [<ffffffff81335f6a>] ? nmi_trigger_cpumask_backtrace+0xfa/0x130 >>> [<ffffffff81128048>] ? watchdog+0x2b8/0x330 >>> [<ffffffff81127d90>] ? reset_hung_task_detector+0x10/0x10 >>> [<ffffffff81097660>] ? kthread+0xf0/0x110 >>> [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720 >>> [<ffffffff81097570>] ? kthread_park+0x60/0x60 >>> [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30 ... >> Looking at the dis-assembly of xen_clocksource_get_cycles() in >> arch/x86/xen/time.c I see no path how that should call >> __blk_mq_run_hw_queue(): > > Hence the question marks ahead of the stack entries: What you see > there are likely leftovers from prior call trees. It just so happens > that the old return address slots haven't got overwritten yet. You > need to first sanitize the stack trace e.g. by having the kernel > dump more of the stack in raw hex form, and then looking at the > disassembly to figure out how large each stack frame is, starting > at the top-most address (i.e. the one in RIP). That explains the strange call trace for me, thank you for the enlightenment. > On Wed, Oct 04, 2017 at 06:26:27PM +0000, Philipp Hahn wrote: >>> INFO: task btrfs-transacti:522 blocked for more than 300 seconds. > [...] >> And another one: >>> INFO: task smbd:20101 blocked for more than 300 seconds. > [...] >> This does not look normal to me or did I miss something? > > So I see that both of the stuck processes listed above (smbd and > btrfs-*) are disk related processes. Might I ask how many disk/nics > (PV) do you have attached to this DomU, and how many queues does each > have? Nothing special configured, how would I best fetch that info? Which leads me back to my original problem: How can I diagnose *why* the task is blocked for that time? From my understanding this can happen if IO is too slow and task just have to wait for too long. ¹ Even if IO is slow the system should stabilize itself when no new IO is generated and the old one has been processed, right? So looking at `vmstat` or `blktrace` should tell me, that Xen/Linux/whatever is busy with IO and it is simply not fast enough to keep up with the load. Thanks again, but any hint how to diagnose this does help. Philipp ¹ <https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/> _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.