Re: [Xen-users] io hang with lvm on md raid1
Bump? I've now replicated this on raid10 and raid6 as well, so it's not caused by the raid level. An example of a stuck blkback process is below, if that offers any additional insight. In all cases I'm seeing dmeventd get stuck first, though.

Thanks

--Glenn

<6>1 2016-10-09T02:33:28.676226+00:00 host764 kernel - - Showing busy workqueues and worker pools:
<6>1 2016-10-09T02:33:28.676227+00:00 host764 kernel - - workqueue events_power_efficient: flags=0x80
<6>1 2016-10-09T02:33:28.676228+00:00 host764 kernel - - pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
<6>1 2016-10-09T02:33:28.676229+00:00 host764 kernel - - pending: fb_flashcursor
<6>1 2016-10-09T02:33:28.676230+00:00 host764 kernel - - workqueue kcopyd: flags=0x8
<6>1 2016-10-09T02:33:28.676231+00:00 host764 kernel - - pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
<6>1 2016-10-09T02:33:28.676232+00:00 host764 kernel - - in-flight: 31551:do_work [dm_mod]
...
<6>1 2016-10-09T02:41:52.678298+00:00 host764 kernel - - blkback.16.xvda D ffff8801302b3738 0 16029 2 0x00000000
<4>1 2016-10-09T02:41:52.678300+00:00 host764 kernel - - ffff8801302b3738 ffffffff818b9500 ffff88013014a400 ffff8800039de870
<4>1 2016-10-09T02:41:52.678302+00:00 host764 kernel - - ffff880049649988 ffffe8ffffa69200 ffff880140455c00 0000000000000000
<4>1 2016-10-09T02:41:52.678304+00:00 host764 kernel - - 0000000000000001 ffff8801302b3658 ffffffff810a2e05 ffff8801302b3668
<4>1 2016-10-09T02:41:52.678305+00:00 host764 kernel - - Call Trace:
<4>1 2016-10-09T02:41:52.678308+00:00 host764 kernel - - [<ffffffff810a2e05>] ? wake_up_process+0x15/0x20
<4>1 2016-10-09T02:41:52.678310+00:00 host764 kernel - - [<ffffffff81090264>] ? wake_up_worker+0x24/0x30
<4>1 2016-10-09T02:41:52.678311+00:00 host764 kernel - - [<ffffffff810929d4>] ? insert_work+0x74/0xc0
<4>1 2016-10-09T02:41:52.678313+00:00 host764 kernel - - [<ffffffff811b58d2>] ? kmem_cache_alloc+0x72/0x160
<4>1 2016-10-09T02:41:52.678315+00:00 host764 kernel - - [<ffffffff810abec0>] ? update_curr+0x110/0x1a0
<4>1 2016-10-09T02:41:52.678317+00:00 host764 kernel - - [<ffffffff81681e20>] schedule+0x40/0x90
<4>1 2016-10-09T02:41:52.678319+00:00 host764 kernel - - [<ffffffff816845ad>] rwsem_down_write_failed+0x1fd/0x360
<4>1 2016-10-09T02:41:52.678321+00:00 host764 kernel - - [<ffffffff816850f6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2016-10-09T02:41:52.678322+00:00 host764 kernel - - [<ffffffff810086ee>] ? xen_pte_val+0xe/0x10
<4>1 2016-10-09T02:41:52.678324+00:00 host764 kernel - - [<ffffffff8133ff53>] call_rwsem_down_write_failed+0x13/0x20
<4>1 2016-10-09T02:41:52.678325+00:00 host764 kernel - - [<ffffffff81683d84>] ? down_write+0x24/0x40
<4>1 2016-10-09T02:41:52.678327+00:00 host764 kernel - - [<ffffffffa063a89d>] __origin_write+0x6d/0x2d0 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678330+00:00 host764 kernel - - [<ffffffff8115cefc>] ? mempool_alloc+0x5c/0x160
<4>1 2016-10-09T02:41:52.678332+00:00 host764 kernel - - [<ffffffffa063ab66>] do_origin+0x66/0x90 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678333+00:00 host764 kernel - - [<ffffffffa063afbf>] origin_map+0x6f/0x90 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678335+00:00 host764 kernel - - [<ffffffffa000427a>] __map_bio+0x4a/0x130 [dm_mod]
<4>1 2016-10-09T02:41:52.678337+00:00 host764 kernel - - [<ffffffffa0004867>] __split_and_process_bio+0x327/0x3f0 [dm_mod]
<4>1 2016-10-09T02:41:52.678339+00:00 host764 kernel - - [<ffffffffa00049a4>] dm_make_request+0x74/0xe0 [dm_mod]
<4>1 2016-10-09T02:41:52.678340+00:00 host764 kernel - - [<ffffffff8130922f>] generic_make_request+0xff/0x1d0
<4>1 2016-10-09T02:41:52.678342+00:00 host764 kernel - - [<ffffffff81006cbd>] ? xen_mc_flush+0xad/0x1b0
<4>1 2016-10-09T02:41:52.678344+00:00 host764 kernel - - [<ffffffff81309370>] submit_bio+0x70/0x140
<4>1 2016-10-09T02:41:52.678346+00:00 host764 kernel - - [<ffffffff814965a5>] dispatch_rw_block_io+0x615/0xb10
<4>1 2016-10-09T02:41:52.678348+00:00 host764 kernel - - [<ffffffff81681586>] ? __schedule+0x306/0xa30
<4>1 2016-10-09T02:41:52.678349+00:00 host764 kernel - - [<ffffffff816879c9>] ? error_exit+0x9/0x20
<4>1 2016-10-09T02:41:52.678355+00:00 host764 kernel - - [<ffffffff816853cf>] ? _raw_spin_lock_irqsave+0x1f/0x50
<4>1 2016-10-09T02:41:52.678357+00:00 host764 kernel - - [<ffffffff810d83ea>] ? lock_timer_base+0x5a/0x80
<4>1 2016-10-09T02:41:52.678358+00:00 host764 kernel - - [<ffffffff81496cb1>] __do_block_io_op+0x211/0x650
<4>1 2016-10-09T02:41:52.678361+00:00 host764 kernel - - [<ffffffff810d8410>] ? lock_timer_base+0x80/0x80
<4>1 2016-10-09T02:41:52.678363+00:00 host764 kernel - - [<ffffffff816853cf>] ? _raw_spin_lock_irqsave+0x1f/0x50
<4>1 2016-10-09T02:41:52.678365+00:00 host764 kernel - - [<ffffffff816850f6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2016-10-09T02:41:52.678366+00:00 host764 kernel - - [<ffffffff8149720d>] xen_blkif_schedule+0x11d/0xad0
<4>1 2016-10-09T02:41:52.678368+00:00 host764 kernel - - [<ffffffff810b6680>] ? woken_wake_function+0x20/0x20
<4>1 2016-10-09T02:41:52.678369+00:00 host764 kernel - - [<ffffffff814970f0>] ? __do_block_io_op+0x650/0x650
<4>1 2016-10-09T02:41:52.678371+00:00 host764 kernel - - [<ffffffff814970f0>] ? __do_block_io_op+0x650/0x650
<4>1 2016-10-09T02:41:52.678373+00:00 host764 kernel - - [<ffffffff810993ac>] kthread+0xcc/0xf0
<4>1 2016-10-09T02:41:52.678374+00:00 host764 kernel - - [<ffffffff810a1b1e>] ? schedule_tail+0x1e/0xc0
<4>1 2016-10-09T02:41:52.678376+00:00 host764 kernel - - [<ffffffff810992e0>] ? kthread_freezable_should_stop+0x70/0x70
<4>1 2016-10-09T02:41:52.678378+00:00 host764 kernel - - [<ffffffff81685a4f>] ret_from_fork+0x3f/0x70
<4>1 2016-10-09T02:41:52.678380+00:00 host764 kernel - - [<ffffffff810992e0>] ? kthread_freezable_should_stop+0x70/0x70
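For completeness, the blocked-task dump above comes from the magic sysrq interface. Something along these lines is enough to capture the same information on an affected host; the sysctl step may be unnecessary and the log path is only an example:

  # make sure magic sysrq is fully enabled (it may already be)
  echo 1 > /proc/sys/kernel/sysrq

  # dump tasks stuck in uninterruptible (D) state to the kernel log;
  # this is where the blkback/dmeventd traces above come from
  echo w > /proc/sysrq-trigger

  # dump all tasks if the fuller 't' output is wanted as well
  echo t > /proc/sysrq-trigger

  # quick userspace view of which processes are in D state and what they wait on
  ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'

  # save the ring buffer before the traces scroll away
  dmesg > /root/hung-io-$(date +%Y%m%d%H%M%S).log
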
On 05/10/16 14:37, Glenn Enright wrote:

Hi there

I'm seeing an issue across multiple hosts when copying content from an lvm snapshot. The cp command appears to hang indefinitely and cannot be killed (state D). Other commands that require IO (e.g. lvs, dd, touch $file) may also get queued; how quickly that happens depends on how busy the host is, but normally within 48 hours all IO on the host blocks and the host becomes completely unresponsive.

The machines are all stock CentOS 6 using Xen packages from xen.crc.id.au. I also have a report posted at https://xen.crc.id.au/bugs/view.php?id=75

The IO stack is always spindle drives -> md raid1 -> lvm -> lv. In this case the cp target was a sparse image on a separate raid1 drive array (on different drives), but that varies between incidents. We also have raid6 on some hosts, but to date have not seen this issue occur there, which is suggestive of raid1 as the problem.

Although the problem commands are not directly Xen related, I'm posting here first before asking other subsystem lists, since Xen hypercalls show up in the traces below. Has anyone seen anything like this recently? Does anyone have any insight into what might be causing it, or suggestions for ways I could debug this and provide further useful details?

Below is the output on the host from "echo w > /proc/sysrq-trigger". We also have 't' output if needed.

<snipped long output>

Thanks

--Glenn
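To make the workload in the original report a bit more concrete, it is essentially a snapshot-then-copy sequence along the lines of the sketch below. The vg/lv names, sizes and paths are placeholders, not our actual layout:

  # take a snapshot of the origin LV (the origin sits on the md raid1 PV)
  lvcreate --snapshot --size 10G --name snap0 /dev/vg0/guest-disk

  # copy the snapshot contents to a sparse image file on a separate raid1
  # array (different physical drives); this is the cp that ends up in D state
  cp --sparse=always /dev/vg0/snap0 /mnt/array2/guest-disk.img

  # remove the snapshot afterwards (never reached once the hang starts)
  lvremove -f /dev/vg0/snap0

The guest keeps writing to the origin LV while the copy runs, which is presumably why blkback shows up blocked in __origin_write in the trace above.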
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
https://lists.xen.org/xen-users