
Re: [Xen-devel] linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1



Yes, the patch fixed the dom0 hang issue when rebooting while a PCI device was
still assigned to a guest.
Thanks.


Regards
Songtao

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Saturday, November 09, 2013 12:21 AM
> To: Ren, Yongjie; george.dunlap@xxxxxxxxxxxxx; xen@xxxxxxxxxxxxxxxxxxx
> Cc: Xu, YongweiX; Liu, SongtaoX; Tian, Yongxue; xen-devel@xxxxxxxxxxxxx
> Subject: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung."
> Was:Re: [Xen-devel] test report for Xen 4.3 RC1
> 
> On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > > >   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> > >
> > > Ok, I can reproduce that too.
> >
> > This is what dom0 tells me:
> >
> > [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> > [  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  483.620747] init            D ffff880062b59c78  5904  4163      1 0x00000000
> >  ffff880062b59bc8 0000000000000...
> >  ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> >  ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> > Call Trace:
> >  [<ffffffff816a0814>] schedule+0x24/0x70
> >  [<ffffffff813bb0dd>] read_reply+0xad/0x160
> >  [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> >  [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> >  [<ffffffff813bb3c6>] xs_single+0x46/0x60
> >  [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> >  [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
> >  [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> >  [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> >  [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> >  [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> >  [<ffffffff8142e275>] device_shutdown+0x15/0x180
> >  [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> >  [<ffffffff810a88a1>] kernel_restart+0x11...
> >  [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> >  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0...
> >  [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> >  [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> >  [<ffffffff8115c725>] ? __free_pages+0x25/0x...
> >  [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> >  [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> >  [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> >  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> >  [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> >  [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> > 3 locks held by init/4163:
> >  #0:  (...){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> >  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> >  #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
> >
> 
> A bit of debugging shows that when we are in this state:
> 
> 
> Sent SIGKILL to
> [  100.454603] xen-pciback pci-1-0: shutdown
> 
> telnet> send brk
> [  110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c)
> terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i)
> thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
> show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p)
> show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V)
> show-blocked-tasks(w) dump-ftrace-buffer(z)
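> 
> (The task states below were presumably dumped with SysRq 't'; from
> inside a running domain the same dump can be triggered with
> 'echo t > /proc/sysrq-trigger', assuming the kernel was built with
> CONFIG_MAGIC_SYSRQ.)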
> 
> ... snip..
> 
>  xenstored       x 0000000000000002  5504  3437      1 0x00000006
>   ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
>   ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
>   ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
>  Call Trace:
>   [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
>   [<ffffffff816b1594>] schedule+0x24/0x70
>   [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
>   [<ffffffff8109c981>] do_group_exit+0x51/0x140
>   [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
>   [<ffffffff8104c49f>] do_signal+0x4f/0x610
>   [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
>   [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
>   [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
>   [<ffffffff816bc372>] int_signal+0x12/0x17
> 
> 
> The 'x' means that the task has been killed.
> 
> (The other two threads 'xenbus' and 'xenwatch' are sleeping).
> 
> Since xenstored can nowadays run in a separate domain rather than
> only in the initial domain, and can be restarted at any time, we
> can't depend on the task PID. Nor can we depend on the other domain
> telling us that it is dead.
> 
> The best we can do is get out of the way of the shutdown process
> rather than hang forever.
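> 
> To see why that wait can strand us forever, here is a minimal sketch
> (not the exact kernel code; read_one_reply_from_ring() is a
> hypothetical helper) of the producer side that read_reply() depends
> on:
> 
>     /*
>      * process_msg() runs in the 'xenbus' kernel thread: it queues
>      * each incoming reply on xs_state.reply_list and wakes any waiter
>      * in read_reply().  Once xenstored (or its domain) is dead, no
>      * reply ever arrives, the wake-up never comes, and a synchronous
>      * wait in read_reply() blocks forever.
>      */
>     static int process_msg(void)
>     {
>             struct xs_stored_msg *msg;
> 
>             msg = read_one_reply_from_ring(); /* hypothetical helper */
>             if (!msg)
>                     return -EAGAIN;
> 
>             spin_lock(&xs_state.reply_lock);
>             list_add_tail(&msg->list, &xs_state.reply_list);
>             spin_unlock(&xs_state.reply_lock);
> 
>             wake_up(&xs_state.reply_waitq); /* unblocks read_reply() */
>             return 0;
>     }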
> 
> This patch should solve it:
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
> 
> 'read_reply' works together with 'process_msg' to read a reply from
> XenBus. 'process_msg' runs within the 'xenbus' thread. Whenever a
> message shows up in XenBus it is put on the xs_state.reply_list list
> and 'read_reply' picks it up.
> 
> The problem arises if the backend domain or the xenstored process is
> killed: 'xenbus' keeps waiting, and 'read_reply', if called, is stuck
> forever waiting for the reply_list to gain some contents.
> 
> This is normally not a problem, as the backend domain can come back
> or the xenstored process can be restarted. However, if the domain is
> in the process of being powered off/restarted/halted, there is no
> point in waiting for it to come back, as we are effectively being
> terminated and should not impede the progress.
> 
> This patch solves the problem by checking the 'system_state' value to
> see whether we are heading towards shutdown. We also make the wait
> mechanism a bit more asynchronous.
> 
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
> 
>       while (list_empty(&xs_state.reply_list)) {
>               spin_unlock(&xs_state.reply_lock);
> -             /* XXX FIXME: Avoid synchronous wait for response here. */
> -             wait_event(xs_state.reply_waitq,
> -                        !list_empty(&xs_state.reply_list));
> +             wait_event_timeout(xs_state.reply_waitq,
> +                                !list_empty(&xs_state.reply_list),
> +                                msecs_to_jiffies(500));
> +
> +             /*
> +              * If we are in the process of being shut-down there is
> +              * no point of trying to contact XenBus - it is either
> +              * killed (xenstored application) or the other domain
> +              * has been killed or is unreachable.
> +              */
> +             switch (system_state) {
> +             case SYSTEM_POWER_OFF:
> +             case SYSTEM_RESTART:
> +             case SYSTEM_HALT:
> +                     return ERR_PTR(-EIO);
> +             default:
> +                     break;
> +             }
>               spin_lock(&xs_state.reply_lock);
>       }
> 
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
> 
>       mutex_unlock(&xs_state.request_mutex);
> 
> +     if (IS_ERR(ret))
> +             return ret;
> +
>       if ((msg->type == XS_TRANSACTION_END) ||
>           ((req_msg.type == XS_TRANSACTION_START) &&
>            (msg->type == XS_ERROR)))
> --
> 1.7.7.6
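> 
> For completeness, a rough sketch (simplified from
> drivers/xen/xenbus/xenbus_xs.c, not verbatim) of how the new -EIO
> propagates to the caller at the top of the hung task's backtrace:
> 
>     /*
>      * read_reply() now returns ERR_PTR(-EIO) once system_state says
>      * we are powering off/restarting/halting; xs_talkv()/xs_single()
>      * hand that pointer back, and xenbus_transaction_start() turns it
>      * into a plain errno - so xenbus_dev_shutdown() no longer parks
>      * forever holding xs_state.request_mutex.
>      */
>     int xenbus_transaction_start(struct xenbus_transaction *t)
>     {
>             char *id_str;
> 
>             id_str = xs_single(XBT_NIL, XS_TRANSACTION_START, "", NULL);
>             if (IS_ERR(id_str))
>                     return PTR_ERR(id_str); /* -EIO during shutdown */
> 
>             t->id = simple_strtoul(id_str, NULL, 0);
>             kfree(id_str);
>             return 0;
>     }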

