Re: [Xen-devel] linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was: Re: test report for Xen 4.3 RC1
Yes, the patch fixed the dom0 hang issue during reboot with a guest PCI device conflict. Thanks.

Regards
Songtao

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Saturday, November 09, 2013 12:21 AM
> To: Ren, Yongjie; george.dunlap@xxxxxxxxxxxxx; xen@xxxxxxxxxxxxxxxxxxx
> Cc: Xu, YongweiX; Liu, SongtaoX; Tian, Yongxue; xen-devel@xxxxxxxxxxxxx
> Subject: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung."
> Was: Re: [Xen-devel] test report for Xen 4.3 RC1
>
> On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > > >    http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> > >
> > > Ok, I can reproduce that too.
> >
> > This is what dom0 tells me:
> >
> > [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> > [  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  483.620747] init            D ffff880062b59c78  5904  4163      1 0x00000000
> > [  483.637699]  ffff880062b59bc8 0000000000000
> > [  483.655189]  ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> > [  483.672505]  ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> > [  483.689527] Call Trace:
> > [  483.706298]  [<ffffffff816a0814>] schedule+0x24/0x70
> > [  483.723604]  [<ffffffff813bb0dd>] read_reply+0xad/0x160
> > [  483.741162]  [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> > [  483.758572]  [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> > [  483.775741]  [<ffffffff813bb3c6>] xs_single+0x46/0x60
> > [  483.792791]  [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> > [  483.809929]  [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
> > [  483.826947]  [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> > [  483.843792]  [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> > [  483.860412]  [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> > [  483.877312]  [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> > [  483.894036]  [<ffffffff8142e275>] device_shutdown+0x15/0x180
> > [  483.910605]  [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> > [  483.927100]  [<ffffffff810a88a1>] kernel_restart+0x11
> > [  483.943262]  [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> > [  483.959480]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> > [  483.975786]  [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> > [  483.991819]  [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> > [  484.007675]  [<ffffffff8115c725>] ? __free_pages+0x25/0x
> > [  484.023336]  [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> > [  484.039176]  [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> > [  484.055174]  [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> > [  484.070747]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> > [  484.086121]  [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> > [  484.101318]  [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> > [  484.116585] 3 locks held by init/4163:
> > [  484.131650]  #0:  (reboot_mutex){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> > [  484.147704]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> > [  484.164359]  #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
>
> A bit of debugging shows that when we are in this state:
>
> Sent SIGKILL to
> [  100.454603] xen-pciback pci-1-0: shutdown
>
> telnet> send brk
> [  110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c)
> terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i)
> thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
> show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p)
> show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V)
> show-blocked-tasks(w) dump-ftrace-buffer(z)
>
> ...snip..
>
> xenstored       x 0000000000000002  5504  3437      1 0x00000006
>  ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
>  ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
>  ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
> Call Trace:
>  [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
>  [<ffffffff816b1594>] schedule+0x24/0x70
>  [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
>  [<ffffffff8109c981>] do_group_exit+0x51/0x140
>  [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
>  [<ffffffff8104c49f>] do_signal+0x4f/0x610
>  [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
>  [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
>  [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
>  [<ffffffff816bc372>] int_signal+0x12/0x17
>
> The 'x' means that the task has been killed.
> (The other two threads, 'xenbus' and 'xenwatch', are sleeping.)
>
> Since xenstored can nowadays live in a domain, not just in the initial
> domain, and can be restarted at any time, we cannot depend on the task
> pid. Nor can we depend on the other domain telling us that it is dead.
>
> The best we can do is to get out of the way of the shutdown process and
> not hang on forever.
>
> This patch should solve it:
>
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
>
> 'read_reply' works with 'process_msg' to read a reply from XenBus.
> 'process_msg' runs from within the 'xenbus' thread. Whenever a message
> shows up in XenBus it is put on the xs_state.reply_list and 'read_reply'
> picks it up.
>
> The problem is if the backend domain or the xenstored process is killed.
> In that case 'xenbus' is still waiting, and 'read_reply', if called, is
> stuck forever waiting for the reply_list to gain some contents.
> This is normally not a problem, as the backend domain can come back or
> the xenstored process can be restarted. However, if the domain is in
> the process of being powered off/restarted/halted, there is no point in
> waiting for it to come back: we are effectively being terminated and
> should not impede the progress.
>
> This patch solves the problem by checking the 'system_state' value to
> see whether we are heading towards death. We also make the wait
> mechanism a bit more asynchronous.
>
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  drivers/xen/xenbus/xenbus_xs.c | 24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));
> +
> +		/*
> +		 * If we are in the process of being shut down there is
> +		 * no point in trying to contact XenBus - it is either
> +		 * killed (the xenstored application) or the other domain
> +		 * has been killed or is unreachable.
> +		 */
> +		switch (system_state) {
> +		case SYSTEM_POWER_OFF:
> +		case SYSTEM_RESTART:
> +		case SYSTEM_HALT:
> +			return ERR_PTR(-EIO);
> +		default:
> +			break;
> +		}
>  		spin_lock(&xs_state.reply_lock);
>  	}
>
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
>
>  	mutex_unlock(&xs_state.request_mutex);
>
> +	if (IS_ERR(ret))
> +		return ret;
> +
>  	if ((msg->type == XS_TRANSACTION_END) ||
>  	    ((req_msg.type == XS_TRANSACTION_START) &&
>  	     (msg->type == XS_ERROR)))
> --
> 1.7.7.6

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel