
Re: [Xen-users] io hang with lvm on md raid1



Bump? I've now replicated this on raid10 and raid6 as well, so it's not caused by the raid level. An example of a stuck blkback process is below, in case that offers any additional insight. In every case, though, dmeventd is the first thing I see get stuck.
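
For what it's worth, this is roughly how I'm spotting which task wedges first (dmeventd in every case so far). It's nothing clever, just a loop over D-state processes; the 30s interval is arbitrary:

  # rough check for tasks stuck in uninterruptible sleep (state D)
  while true; do
      date
      ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
      sleep 30
  done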

Thanks
--Glenn

<6>1 2016-10-09T02:33:28.676226+00:00 host764 kernel - - Showing busy workqueues and worker pools:
<6>1 2016-10-09T02:33:28.676227+00:00 host764 kernel - - workqueue events_power_efficient: flags=0x80
<6>1 2016-10-09T02:33:28.676228+00:00 host764 kernel - - pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
<6>1 2016-10-09T02:33:28.676229+00:00 host764 kernel - - pending: fb_flashcursor
<6>1 2016-10-09T02:33:28.676230+00:00 host764 kernel - - workqueue kcopyd: flags=0x8
<6>1 2016-10-09T02:33:28.676231+00:00 host764 kernel - - pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
<6>1 2016-10-09T02:33:28.676232+00:00 host764 kernel - - in-flight: 31551:do_work [dm_mod]
...
<6>1 2016-10-09T02:41:52.678298+00:00 host764 kernel - - blkback.16.xvda D ffff8801302b3738 0 16029 2 0x00000000
<4>1 2016-10-09T02:41:52.678300+00:00 host764 kernel - - ffff8801302b3738 ffffffff818b9500 ffff88013014a400 ffff8800039de870
<4>1 2016-10-09T02:41:52.678302+00:00 host764 kernel - - ffff880049649988 ffffe8ffffa69200 ffff880140455c00 0000000000000000
<4>1 2016-10-09T02:41:52.678304+00:00 host764 kernel - - 0000000000000001 ffff8801302b3658 ffffffff810a2e05 ffff8801302b3668
<4>1 2016-10-09T02:41:52.678305+00:00 host764 kernel - - Call Trace:
<4>1 2016-10-09T02:41:52.678308+00:00 host764 kernel - - [<ffffffff810a2e05>] ? wake_up_process+0x15/0x20
<4>1 2016-10-09T02:41:52.678310+00:00 host764 kernel - - [<ffffffff81090264>] ? wake_up_worker+0x24/0x30
<4>1 2016-10-09T02:41:52.678311+00:00 host764 kernel - - [<ffffffff810929d4>] ? insert_work+0x74/0xc0
<4>1 2016-10-09T02:41:52.678313+00:00 host764 kernel - - [<ffffffff811b58d2>] ? kmem_cache_alloc+0x72/0x160
<4>1 2016-10-09T02:41:52.678315+00:00 host764 kernel - - [<ffffffff810abec0>] ? update_curr+0x110/0x1a0
<4>1 2016-10-09T02:41:52.678317+00:00 host764 kernel - - [<ffffffff81681e20>] schedule+0x40/0x90
<4>1 2016-10-09T02:41:52.678319+00:00 host764 kernel - - [<ffffffff816845ad>] rwsem_down_write_failed+0x1fd/0x360
<4>1 2016-10-09T02:41:52.678321+00:00 host764 kernel - - [<ffffffff816850f6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2016-10-09T02:41:52.678322+00:00 host764 kernel - - [<ffffffff810086ee>] ? xen_pte_val+0xe/0x10
<4>1 2016-10-09T02:41:52.678324+00:00 host764 kernel - - [<ffffffff8133ff53>] call_rwsem_down_write_failed+0x13/0x20
<4>1 2016-10-09T02:41:52.678325+00:00 host764 kernel - - [<ffffffff81683d84>] ? down_write+0x24/0x40
<4>1 2016-10-09T02:41:52.678327+00:00 host764 kernel - - [<ffffffffa063a89d>] __origin_write+0x6d/0x2d0 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678330+00:00 host764 kernel - - [<ffffffff8115cefc>] ? mempool_alloc+0x5c/0x160
<4>1 2016-10-09T02:41:52.678332+00:00 host764 kernel - - [<ffffffffa063ab66>] do_origin+0x66/0x90 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678333+00:00 host764 kernel - - [<ffffffffa063afbf>] origin_map+0x6f/0x90 [dm_snapshot]
<4>1 2016-10-09T02:41:52.678335+00:00 host764 kernel - - [<ffffffffa000427a>] __map_bio+0x4a/0x130 [dm_mod]
<4>1 2016-10-09T02:41:52.678337+00:00 host764 kernel - - [<ffffffffa0004867>] __split_and_process_bio+0x327/0x3f0 [dm_mod]
<4>1 2016-10-09T02:41:52.678339+00:00 host764 kernel - - [<ffffffffa00049a4>] dm_make_request+0x74/0xe0 [dm_mod]
<4>1 2016-10-09T02:41:52.678340+00:00 host764 kernel - - [<ffffffff8130922f>] generic_make_request+0xff/0x1d0
<4>1 2016-10-09T02:41:52.678342+00:00 host764 kernel - - [<ffffffff81006cbd>] ? xen_mc_flush+0xad/0x1b0
<4>1 2016-10-09T02:41:52.678344+00:00 host764 kernel - - [<ffffffff81309370>] submit_bio+0x70/0x140
<4>1 2016-10-09T02:41:52.678346+00:00 host764 kernel - - [<ffffffff814965a5>] dispatch_rw_block_io+0x615/0xb10
<4>1 2016-10-09T02:41:52.678348+00:00 host764 kernel - - [<ffffffff81681586>] ? __schedule+0x306/0xa30
<4>1 2016-10-09T02:41:52.678349+00:00 host764 kernel - - [<ffffffff816879c9>] ? error_exit+0x9/0x20
<4>1 2016-10-09T02:41:52.678355+00:00 host764 kernel - - [<ffffffff816853cf>] ? _raw_spin_lock_irqsave+0x1f/0x50
<4>1 2016-10-09T02:41:52.678357+00:00 host764 kernel - - [<ffffffff810d83ea>] ? lock_timer_base+0x5a/0x80
<4>1 2016-10-09T02:41:52.678358+00:00 host764 kernel - - [<ffffffff81496cb1>] __do_block_io_op+0x211/0x650
<4>1 2016-10-09T02:41:52.678361+00:00 host764 kernel - - [<ffffffff810d8410>] ? lock_timer_base+0x80/0x80
<4>1 2016-10-09T02:41:52.678363+00:00 host764 kernel - - [<ffffffff816853cf>] ? _raw_spin_lock_irqsave+0x1f/0x50
<4>1 2016-10-09T02:41:52.678365+00:00 host764 kernel - - [<ffffffff816850f6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2016-10-09T02:41:52.678366+00:00 host764 kernel - - [<ffffffff8149720d>] xen_blkif_schedule+0x11d/0xad0
<4>1 2016-10-09T02:41:52.678368+00:00 host764 kernel - - [<ffffffff810b6680>] ? woken_wake_function+0x20/0x20
<4>1 2016-10-09T02:41:52.678369+00:00 host764 kernel - - [<ffffffff814970f0>] ? __do_block_io_op+0x650/0x650
<4>1 2016-10-09T02:41:52.678371+00:00 host764 kernel - - [<ffffffff814970f0>] ? __do_block_io_op+0x650/0x650
<4>1 2016-10-09T02:41:52.678373+00:00 host764 kernel - - [<ffffffff810993ac>] kthread+0xcc/0xf0
<4>1 2016-10-09T02:41:52.678374+00:00 host764 kernel - - [<ffffffff810a1b1e>] ? schedule_tail+0x1e/0xc0
<4>1 2016-10-09T02:41:52.678376+00:00 host764 kernel - - [<ffffffff810992e0>] ? kthread_freezable_should_stop+0x70/0x70
<4>1 2016-10-09T02:41:52.678378+00:00 host764 kernel - - [<ffffffff81685a4f>] ret_from_fork+0x3f/0x70
<4>1 2016-10-09T02:41:52.678380+00:00 host764 kernel - - [<ffffffff810992e0>] ? kthread_freezable_should_stop+0x70/0x70


On 05/10/16 14:37, Glenn Enright wrote:
Hi there

I'm seeing an issue across multiple hosts when copying content from an LVM
snapshot. The cp command appears to hang indefinitely and cannot be
killed (state D). Other commands that require IO (e.g. lvs, dd, touch
$file) also end up queued behind it. Eventually, depending on how busy
the host is but normally within 48 hours, all IO on the host blocks and
the host becomes completely unresponsive.
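
When one of the copies wedges, this is more or less what I look at before the
whole host locks up (the pgrep is only illustrative, normally I already know
the PID, and it assumes the kernel exposes /proc/<pid>/stack):

  # confirm the cp really is in D state and see what it is waiting on
  ps -o pid,stat,wchan:30,args -C cp
  # kernel stack of the stuck process (as root)
  cat /proc/$(pgrep -x cp | head -n1)/stack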

The machines are all stock CentOS 6 using Xen packages from xen.crc.id.au.
I also have a report posted at https://xen.crc.id.au/bugs/view.php?id=75

The IO stack is always spindle drives -> md raid1 -> LVM -> LV. In this
case the cp target was a sparse image file on a separate raid1 array (on
different drives), but that detail varies between incidents.
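
Roughly what the workload looks like when it triggers; the VG/LV names, sizes
and paths below are made up, the real ones vary per host:

  # snapshot a guest LV that lives on the md raid1 PV, then copy it out to a
  # sparse image file on a different array; the cp is what gets stuck in D state
  lvcreate -s -L 10G -n guest1-snap /dev/vg_guests/guest1
  cp --sparse=always /dev/vg_guests/guest1-snap /mnt/backup/guest1.img
  lvremove -f /dev/vg_guests/guest1-snap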

We also have raid6 on some hosts, but to date have not seen this issue
occur there, which points at raid1 as the problem.

Although the problem commands are not directly Xen related, I'm posting
here first, before asking the other subsystem lists, because Xen
hypercalls show up in the traces below.

Has anyone seen anything like this recently? Does anyone have any insight
into what might be causing it, or suggestions for ways I could debug this
to provide further useful details?

Below is output from the host after running "echo w > /proc/sysrq-trigger".
We also have 't' output if needed.
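
Roughly how that is being captured (the file name here is just an example):

  # 'w' dumps blocked (D state) tasks, 't' dumps every task; both land in the
  # kernel log, which is where the traces in this thread were pulled from
  echo w > /proc/sysrq-trigger
  echo t > /proc/sysrq-trigger
  dmesg > /root/sysrq-dump.txt    # or pull it out of /var/log/messages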

<snipped long output>
Thanks
--Glenn

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
https://lists.xen.org/xen-users
