
Re: [Xen-users] io hang with lvm on md raid1



Bump? I've now replicated this on raid10 and raid6 as well, so it's not caused by the raid level. An example of a stuck blkback process is below, in case that offers any additional insight. In every case, though, dmeventd is the first thing I see get stuck.
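
For what it's worth, this is roughly how I'm spotting which task wedges first (dmeventd in every case so far). It's nothing clever, just a loop over D-state processes; the 30s interval is arbitrary:

  # rough check for tasks stuck in uninterruptible sleep (state D)
  while true; do
      date
      ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
      sleep 30
  done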

Thanks
--Glenn

<6>1 2016-10-09T02:33:28.676226+00:00 host764 kernel - - Showing busy workqueues and worker pools:
<6>1 2016-10-09T02:33:28.676227+00:00 host764 kernel - - workqueue events_power_efficient: flags=0x80
<6>1 2016-10-09T02:33:28.676228+00:00 host764 kernel - - pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
<6>1 2016-10-09T02:33:28.676229+00:00 host764 kernel - - pending: fb_flashcursor
<6>1 2016-10-09T02:33:28.676230+00:00 host764 kernel - - workqueue kcopyd: flags=0x8
<6>1 2016-10-09T02:33:28.676231+00:00 host764 kernel - - pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
<6>1 2016-10-09T02:33:28.676232+00:00 host764 kernel - - in-flight: 31551:do_work [dm_mod]
...
<6>1 2016-10-09T02:41:52.678298+00:00 host764 kernel - - blkback.16.xvda D ffff8801302b3738 0 16029 2 0x00000000
<4>1 2016-10-09T02:41:52.678300+00:00 host764 kernel - - ffff8801302b3738 ffffffff818b9500 ffff88013014a400 ffff8800039de870
<4>1 2016-10-09T02:41:52.678302+00:00 host764 kernel - - ffff880049649988 ffffe8ffffa69200 ffff880140455c00 0000000000000000
<4>1 2016-10-09T02:41:52.678304+00:00 host764 kernel - - 0000000000000001 ffff8801302b3658 ffffffff810a2e05 ffff8801302b3668
<4>1 2016-10-09T02:41:52.678305+00:00 host764 kernel - - Call Trace:
<4>1 2016-10-09T02:41:52.678308+00:00 host764 kernel - - [<ffffffff810a2e05>] ? wake_up_process+0x15/0x20
<4>1 2016-10-09T02:41:52.678310+00:00 host764 kernel - - [<ffffffff81090264>] ? wake_up_worker+0x24/0x30
<4>1 2016-10-09T02:41:52.678311+00:00 host764 kernel - - [<ffffffff810929d4>] ? insert_work+0x74/0xc0
<4>1 2016-10-09T02:41:52.678313+00:00 host764 kernel - - [<ffffffff811b58d2>] ? kmem_cache_alloc+0x72/0x160
<4>1 2016-10-09T02:41:52.678315+00:00 host764 kernel - - [<ffffffff810abec0>] ? update_curr+0x110/0x1a0
<4>1 2016-10-09T02:41:52.678317+00:00 host764 kernel - - [<ffffffff81681e20>] schedule+0x40/0x90
<4>1 2016-10-09T02:41:52.678319+00:00 host764 kernel - - [<ffffffff816845ad>] rwsem_down_write_failed+0x1fd/0x360
<4>1 2016-10-09T02:41:52.678321+00:00 host764 kernel - - [<ffffffff816850f6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2016-10-09T02:41:52.678322+00:00 host764 kernel - - [<ffffffff810086ee>] ? xen_pte_val+0xe/0x10
<4>1 2016-10-09T02:41:52.678324+00:00 host764 kernel - - [<ffffffff8133ff53>] call_rwsem_down_write_failed+0x13/0x20
<4>1 2016-10-09T02:41:52.678325+00:00 host764 kernel - - [<ffffffff81683d84>] ? down_write+0x24/0x40
<4>1 2016-10-09T02:41:52.678327+00:00 host764 kernel - - [<ffffffffa063a89d>] __origin_write+0x6d/0x2d0 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678330+00:00 host764 kernel - - [<ffffffff8115cefc>] ? mempool_alloc+0x5c/0x160
<4>1 2016-10-09T02:41:52.678332+00:00 host764 kernel - - [<ffffffffa063ab66>] do_origin+0x66/0x90 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678333+00:00 host764 kernel - - [<ffffffffa063afbf>] origin_map+0x6f/0x90 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678335+00:00 host764 kernel - - [<ffffffffa000427a>] __map_bio+0x4a/0x130 [dm_mod]
<4>1 2016-10-09T02:41:52.678337+00:00 host764 kernel - - [<ffffffffa0004867>] __split_and_process_bio+0x327/0x3f0 [dm_mod]
<4>1 2016-10-09T02:41:52.678339+00:00 host764 kernel - - [<ffffffffa00049a4>] dm_make_request+0x74/0xe0 [dm_mod]
<4>1 2016-10-09T02:41:52.678340+00:00 host764 kernel - - [<ffffffff8130922f>] generic_make_request+0xff/0x1d0
<4>1 2016-10-09T02:41:52.678342+00:00 host764 kernel - - [<ffffffff81006cbd>] ? xen_mc_flush+0xad/0x1b0
<4>1 2016-10-09T02:41:52.678344+00:00 host764 kernel - - [<ffffffff81309370>] submit_bio+0x70/0x140
<4>1 2016-10-09T02:41:52.678346+00:00 host764 kernel - - [<ffffffff814965a5>] dispatch_rw_block_io+0x615/0xb10
<4>1 2016-10-09T02:41:52.678348+00:00 host764 kernel - - [<ffffffff81681586>] ? __schedule+0x306/0xa30
<4>1 2016-10-09T02:41:52.678349+00:00 host764 kernel - - [<ffffffff816879c9>] ? error_exit+0x9/0x20
<4>1 2016-10-09T02:41:52.678355+00:00 host764 kernel - - [<ffffffff816853cf>] ? _raw_spin_lock_irqsave+0x1f/0x50
<4>1 2016-10-09T02:41:52.678357+00:00 host764 kernel - - [<ffffffff810d83ea>] ? lock_timer_base+0x5a/0x80
<4>1 2016-10-09T02:41:52.678358+00:00 host764 kernel - - [<ffffffff81496cb1>] __do_block_io_op+0x211/0x650
<4>1 2016-10-09T02:41:52.678361+00:00 host764 kernel - - [<ffffffff810d8410>] ? lock_timer_base+0x80/0x80
<4>1 2016-10-09T02:41:52.678363+00:00 host764 kernel - - [<ffffffff816853cf>] ? _raw_spin_lock_irqsave+0x1f/0x50
<4>1 2016-10-09T02:41:52.678365+00:00 host764 kernel - - [<ffffffff816850f6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2016-10-09T02:41:52.678366+00:00 host764 kernel - - [<ffffffff8149720d>] xen_blkif_schedule+0x11d/0xad0
<4>1 2016-10-09T02:41:52.678368+00:00 host764 kernel - - [<ffffffff810b6680>] ? woken_wake_function+0x20/0x20
<4>1 2016-10-09T02:41:52.678369+00:00 host764 kernel - - [<ffffffff814970f0>] ? __do_block_io_op+0x650/0x650
<4>1 2016-10-09T02:41:52.678371+00:00 host764 kernel - - [<ffffffff814970f0>] ? __do_block_io_op+0x650/0x650
<4>1 2016-10-09T02:41:52.678373+00:00 host764 kernel - - [<ffffffff810993ac>] kthread+0xcc/0xf0
<4>1 2016-10-09T02:41:52.678374+00:00 host764 kernel - - [<ffffffff810a1b1e>] ? schedule_tail+0x1e/0xc0
<4>1 2016-10-09T02:41:52.678376+00:00 host764 kernel - - [<ffffffff810992e0>] ? kthread_freezable_should_stop+0x70/0x70
<4>1 2016-10-09T02:41:52.678378+00:00 host764 kernel - - [<ffffffff81685a4f>] ret_from_fork+0x3f/0x70
<4>1 2016-10-09T02:41:52.678380+00:00 host764 kernel - - [<ffffffff810992e0>] ? kthread_freezable_should_stop+0x70/0x70


On 05/10/16 14:37, Glenn Enright wrote:
Hi there

I'm seeing an issue across multiple hosts when copying content from an LVM
snapshot. The cp command appears to hang indefinitely and cannot be
killed (state D). Other commands that require IO (e.g. lvs, dd, touch
$file) also end up queued behind it. Eventually, depending on how busy
the host is but normally within 48 hours, all IO on the host blocks and
the host becomes completely unresponsive.
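
When one of the copies wedges, this is more or less what I look at before the
whole host locks up (the pgrep is only illustrative, normally I already know
the PID, and it assumes the kernel exposes /proc/<pid>/stack):

  # confirm the cp really is in D state and see what it is waiting on
  ps -o pid,stat,wchan:30,args -C cp
  # kernel stack of the stuck process (as root)
  cat /proc/$(pgrep -x cp | head -n1)/stack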

The machines are all stock CentOS 6 using Xen packages from xen.crc.id.au.
I also have a report posted at https://xen.crc.id.au/bugs/view.php?id=75

The IO stack is always spindle drives -> md raid1 -> LVM -> LV. In this
case the cp target was a sparse image file on a separate raid1 array (on
different drives), but that detail varies between incidents.
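
Roughly what the workload looks like when it triggers; the VG/LV names, sizes
and paths below are made up, the real ones vary per host:

  # snapshot a guest LV that lives on the md raid1 PV, then copy it out to a
  # sparse image file on a different array; the cp is what gets stuck in D state
  lvcreate -s -L 10G -n guest1-snap /dev/vg_guests/guest1
  cp --sparse=always /dev/vg_guests/guest1-snap /mnt/backup/guest1.img
  lvremove -f /dev/vg_guests/guest1-snap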

We also have raid6 on some hosts, but to date have not seen this issue
occur there, which points at raid1 as the problem.

Although the problem commands are not directly Xen related, I'm posting
here first, before asking the other subsystem lists, because Xen
hypercalls show up in the traces below.

Has anyone seen anything like this recently? Does anyone have any insight
into what might be causing it, or suggestions for ways I could debug this
to provide further useful details?

Below is output from the host after running "echo w > /proc/sysrq-trigger".
We also have 't' output if needed.
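
Roughly how that is being captured (the file name here is just an example):

  # 'w' dumps blocked (D state) tasks, 't' dumps every task; both land in the
  # kernel log, which is where the traces in this thread were pulled from
  echo w > /proc/sysrq-trigger
  echo t > /proc/sysrq-trigger
  dmesg > /root/sysrq-dump.txt    # or pull it out of /var/log/messages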

<snipped long output>
Thanks
--Glenn

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
https://lists.xen.org/xen-users
