Re: [Xen-devel] 4.9.52: INFO: task X blocked for more than 300 seconds.
Hello Ankur,

On 05.10.2017 at 19:59, Ankur Arora wrote:
> On 2017-10-05 06:20 AM, Konrad Rzeszutek Wilk wrote:
>> On Wed, Oct 04, 2017 at 08:26:27PM +0200, Philipp Hahn wrote:
...
>> Adding Ankur to this as I think he saw something similar.
>>
>> But in the meantime - do you see this with the latest version of Linux?
>>> with linux-4.9.52 running on Debian-Wheezy with Xen-4.1 I observed
>>> several stuck processes. Here is one (truncated) dump of the Linux
>>> kernel messages:
>>>
>>>> [<ffffffff8160db6d>] ? __schedule+0x23d/0x6d0
>>>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>> [<ffffffff8160e032>] ? schedule+0x32/0x80
>>>> [<ffffffff8161150c>] ? schedule_timeout+0x1ec/0x360
>>>> [<ffffffff8130f277>] ? __blk_mq_run_hw_queue+0x327/0x3e0  * see below
>>>> [<ffffffff8101b7f1>] ? xen_clocksource_get_cycles+0x11/0x20
>>>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>> [<ffffffff8160d8b4>] ? io_schedule_timeout+0xb4/0x130
>>>> [<ffffffff810bb6f7>] ? prepare_to_wait+0x57/0x80
>>>> [<ffffffff8160e8b7>] ? bit_wait_io+0x17/0x60
>>>> [<ffffffff8160e3ac>] ? __wait_on_bit+0x5c/0x90
>>>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>> [<ffffffff8160e50e>] ? out_of_line_wait_on_bit+0x7e/0xa0
>>>> [<ffffffff810bba20>] ? autoremove_wake_function+0x40/0x40
>>>> [<ffffffffc00abd18>] ? jbd2_journal_commit_transaction+0xd48/0x17e0 [jbd2]
>>>> [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>> [<ffffffff810e67fd>] ? try_to_del_timer_sync+0x4d/0x80
>>>> [<ffffffffc00b099d>] ? kjournald2+0xdd/0x280 [jbd2]
>>>> [<ffffffff810bb9e0>] ? wake_up_atomic_t+0x30/0x30
>>>> [<ffffffffc00b08c0>] ? commit_timeout+0x10/0x10 [jbd2]
>>>> [<ffffffff81097660>] ? kthread+0xf0/0x110
>>>> [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>> [<ffffffff81097570>] ? kthread_park+0x60/0x60
>>>> [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
> This looks like this race: https://patchwork.kernel.org/patch/9853443/

I built a new kernel, for which I picked that patch on top of 4.9.56.
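For anyone reproducing this, a patchwork patch can be fetched as a raw mbox and applied on top of a stable tag with git am. A minimal sketch (the branch name and the assumption of a linux-stable checkout are mine, not from the thread):

```shell
# Build the patchwork mbox URL for a given patch id;
# patchwork serves each patch as raw mbox under <patch-url>mbox/.
mbox_url() {
    echo "https://patchwork.kernel.org/patch/$1/mbox/"
}

# Hypothetical workflow inside a linux-stable checkout:
#   git checkout -b blkmq-race-test v4.9.56
#   curl -sL "$(mbox_url 9853443)" | git am
```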
We are currently testing that, but it crashed again yesterday evening. Here is the dmesg output:

> INFO: task systemd:1 blocked for more than 120 seconds.
>       Not tainted 4.9.0-ucs105-amd64 #1 Debian 4.9.30-2A~4.2.0.201710161640
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> systemd         D    0     1      0 0x00000000
>  ffff8801f1797c00 0000000000000000 ffffffff81c0e540 ffff8801f4956040
>  ffff8801f5a187c0 ffffc90040c4f880 ffffffff8160ebbd ffffffff81186295
>  ffffffff81350b39 0000000002cebf80 ffffffff811863c8 ffff8801f4956040
> Call Trace:
>  [<ffffffff8160ebbd>] ? __schedule+0x23d/0x6d0
>  [<ffffffff81186295>] ? move_freepages+0x95/0xd0
>  [<ffffffff81350b39>] ? list_del+0x9/0x20
>  [<ffffffff811863c8>] ? __rmqueue+0x88/0x3e0
>  [<ffffffff8160f082>] ? schedule+0x32/0x80
>  [<ffffffff8161255c>] ? schedule_timeout+0x1ec/0x360
>  [<ffffffff81187530>] ? get_page_from_freelist+0x350/0xad0
>  [<ffffffff8160e904>] ? io_schedule_timeout+0xb4/0x130
>  [<ffffffff81364fd4>] ? __sbitmap_queue_get+0x24/0x90
>  [<ffffffff813114f9>] ? bt_get.isra.6+0x129/0x1c0
>  [<ffffffff81350b39>] ? list_del+0x9/0x20
>  [<ffffffff810bba00>] ? wake_up_atomic_t+0x30/0x30
>  [<ffffffff81311843>] ? blk_mq_get_tag+0x23/0x90
>  [<ffffffff8130d1aa>] ? __blk_mq_alloc_request+0x1a/0x220
>  [<ffffffff8130dffd>] ? blk_mq_map_request+0xcd/0x170
>  [<ffffffff8131016a>] ? blk_sq_make_request+0xca/0x4c0
>  [<ffffffff81303e1a>] ? generic_make_request_checks+0x22a/0x4f0
>  [<ffffffff81304761>] ? generic_make_request+0x121/0x2c0
>  [<ffffffff8117e823>] ? __add_to_page_cache_locked+0x183/0x230
>  [<ffffffff81304976>] ? submit_bio+0x76/0x150
>  [<ffffffff8117e964>] ? add_to_page_cache_lru+0x84/0xe0
>  [<ffffffffc017b9d9>] ? ext4_mpage_readpages+0x2b9/0x8b0 [ext4]
>  [<ffffffff811d94fa>] ? alloc_pages_current+0x8a/0x110
>  [<ffffffff8118e745>] ? __do_page_cache_readahead+0x195/0x240
>  [<ffffffff8117f627>] ? pagecache_get_page+0x27/0x2b0
>  [<ffffffff811818e6>] ? filemap_fault+0x276/0x590
>  [<ffffffffc0138491>] ? ext4_filemap_fault+0x31/0x50 [ext4]
>  [<ffffffff811b23a4>] ? __do_fault+0x84/0x190
>  [<ffffffff811b783e>] ? handle_mm_fault+0xede/0x1680
>  [<ffffffff8124d25e>] ? ep_poll+0x13e/0x360
>  [<ffffffff8105feda>] ? __do_page_fault+0x26a/0x500
>  [<ffffffff81205ba2>] ? SyS_read+0x52/0xc0
>  [<ffffffff81614d58>] ? page_fault+0x28/0x30

I haven't been able to get the address of the queue object yet to read its state. (timeout for today)

> Can you dump the output of: cat /sys/block/$xen-frontend-device/mq/*/tags
>
> If you've hit this bug, one or more of the MQs would be wedged and
> the nr_free in one or more of the queues would be 0 and will not
> change.

As soon as the bug occurs, we can no longer access the VM via ssh or the Xen (serial) console: the connection stalls after entering 2-3 characters. I have a Xen crash-dump file of one such crash, but following /sys/block/xvd?/mq/*/tags manually to get the kobject address using "crash" (gdb) is very time consuming. So far I have only done it once, for xvda, but I would have to do that for the other 15 block devices as well to find the culprit.

Philipp Hahn
-- 
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
hahn@xxxxxxxxxxxxx
http://www.univention.de/

Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-0287
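As a postscript: while a guest is still responsive, the check Ankur suggests above can be scripted. This sketch assumes the 4.9-era blk-mq sysfs layout, where each per-queue tags file contains an `nr_free=` field (the `xvd*` glob and the helper name are mine); a queue wedged by the tag-leak race would report nr_free=0 on every pass:

```shell
# Extract the nr_free value from the text of one blk-mq "tags" file.
parse_nr_free() {
    grep -o 'nr_free=[0-9]*' | head -n1 | cut -d= -f2
}

# Poll every queue of every Xen block frontend (hypothetical device glob);
# a wedged queue stays at nr_free=0 across repeated runs:
#   for t in /sys/block/xvd*/mq/*/tags; do
#       echo "$t: nr_free=$(parse_nr_free < "$t")"
#   done
```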
Attachment: blk_sq_make_request.txt
Attachment: blk_sq_make_request.s

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel