
Re: [Xen-users] Dom0 crashed when rebooting whilst DomU are running



On Mon, 2012-09-10 at 16:00 +0100, Maik Brauer wrote:
> On Sep 10, 2012, at 10:39 AM, Ian Campbell wrote:
> 
> > On Sat, 2012-09-08 at 15:50 +0100, Maik Brauer wrote:
> >> On Sep 4, 2012, at 10:11 AM, Ian Campbell wrote:
> >> 
> >>> Could you not top post please, it makes it rather hard to follow the
> >>> flow of the conversation.
> >>> On Mon, 2012-09-03 at 18:10 +0100, Casey DeLorme wrote:
> >>>> As stated, you can alias shutdown to do exactly what you need, it can
> >>>> be as simple as a series of hard-coded operations to a complex custom
> >>>> shell script that parses your domains and closes each with feedback.
> >>> 
> >>> Xen ships the "xendomains" initscript which can halt guests on shutdown
> >>> as well as automatically start specific guests on boot. It can also be
> >>> configured to suspend/resume them or (I think) migrate them away.
> >>> 
> >>> For diagnosing the crash itself more details will be required than were
> >>> provided in the original post. Please see
> >>> http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for some guidance.
> >>> At a minimum we would need a capture (serial console or photo) of the
> >>> crash backtrace.
> >>> 
> >>> Ian.
> >>> 
> >>> 
> >>  I found out that it hangs during reboot of dom0 when more network
> >> interfaces are involved, like:
> >>      vif = [ 'mac=06:46:AB:CC:11:01, ip=<myIPadress>', '', '',
> >> 'mac=06:04:AB:BB:11:03, bridge=VLAN20, script=vif-bridge', '',
> >> 'mac=06:04:AB:BB:11:05, bridge=VLAN40, script=vif-bridge' ]
> > 
> > 6 interfaces total, 3 of which have a random mac on each reboot and all
> > get put on the default bridge?
> 
> No, not really. The bridge is different for each interface.

You have three lots of '' which will all go onto the same bridge AFAICT
(whichever one is determined to be the default)
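If the intent is a distinct bridge per interface, each vif entry needs an explicit bridge=. A minimal sketch of what that might look like (the VLAN20/VLAN40 bridge names and MACs are taken from the config quoted above; the br0 name for the first interface is an assumption, and the elided <myIPadress> placeholder is left as-is):

```
vif = [ 'mac=06:46:AB:CC:11:01, ip=<myIPadress>, bridge=br0',
        'mac=06:04:AB:BB:11:03, bridge=VLAN20, script=vif-bridge',
        'mac=06:04:AB:BB:11:05, bridge=VLAN40, script=vif-bridge' ]
```

Dropping the empty '' entries also avoids generating a fresh random MAC for them on every guest start.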

> > If it is a hang then you might have some luck using the magic sysrq keys
> > to print lists of blocked tasks. I'm not sure about Squeeze, but you might
> > need to enable this as described in Documentation/sysrq.txt in the Linux
> > source.
> > 
> > Blocked tasks are listed with SysRq-'w'. If you have a serial console then
> > 't' will list all tasks, but that list can be quite long, so it is not
> > much use without a serial console.
> 
> List is empty. SysRq-w and SysRq-t show nothing at all.

You might need to increase the log verbosity with SysRQ-9 first?
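For reference, a rough sketch of doing the same thing from a root shell rather than the keyboard (this assumes a stock Linux kernel with CONFIG_MAGIC_SYSRQ; writing a key to /proc/sysrq-trigger is equivalent to pressing the corresponding SysRq combination):

```
# Enable all sysrq functions (Debian may ship a restricted mask)
echo 1 > /proc/sys/kernel/sysrq

# Raise the console log level to maximum (same as SysRq-9),
# so the task dumps are not filtered out
echo 9 > /proc/sysrq-trigger

# Dump blocked tasks ('w'), then all tasks ('t'), to the kernel log
echo w > /proc/sysrq-trigger
echo t > /proc/sysrq-trigger
dmesg | tail -n 100
```

These commands need root, and the output lands in the kernel ring buffer, so on a hung box a serial console is still the only reliable way to capture it.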

>  There is nothing running anymore.
> It shows periodically:  INFO: task xenwatch:12 blocked for more than 120 
> seconds

What is the very last thing printed before this?

> It seems that xenwatch is blocking the reboot here, is that assumption 
> correct? Strangely enough, I can't see any processes any more with
> SysRq-t or SysRq-w.

The xenwatch thread ought to count as a process for at least the
purposes of SysRQ-t if not -w.

> 
> >>  In the Logfile of /var/log/message you can find this as the last line: 
> >>        Sep  8 15:44:28 rootsrv01 shutdown[2445]: shutting down for system 
> >> reboot
> >>    Sep  8 15:44:31 rootsrv01 kernel: [   73.716246] VLAN20: port 1(vif2.3) 
> >> entering forwarding state
> >>    Sep  8 15:44:31 rootsrv01 kernel: [   74.500111] VLAN40: port 1(vif2.5) 
> >> entering forwarding state
> >>    Sep  8 15:44:34 rootsrv01 kernel: [   77.317431] VLAN20: port 1(vif2.3) 
> >> entering disabled state
> >>    Sep  8 15:44:34 rootsrv01 kernel: [   77.317490] VLAN20: port 1(vif2.3) 
> >> entering disabled state
> >>    Sep  8 15:44:36 rootsrv01 kernel: [   79.368685] VLAN40: port 1(vif2.5) 
> >> entering disabled state
> >>    Sep  8 15:44:36 rootsrv01 kernel: [   79.369156] VLAN40: port 1(vif2.5) 
> >> entering disabled state
> >>    Sep  8 15:44:37 rootsrv01 kernel: Kernel logging (proc) stopped.
> >>    Sep  8 15:44:37 rootsrv01 rsyslogd: [origin software="rsyslogd" 
> >> swVersion="4.6.4" x-pid="890" x-info="http://www.rsyslog.com"] exiting on 
> >> signal 15.
> >> 
> >> In the /var/log/daemon.log you can find this message:
> >>         Sep  8 15:44:37 rootsrv01 acpid: exiting
> >>         Sep  8 15:44:37 rootsrv01 rpc.statd[750]: Caught signal 15, 
> >> un-registering and exiting
> > 
> > All the above (both message and daemon.log) look like normal parts of
> > shutting down to me.
> > 
> >>         Sep  8 15:44:37 rootsrv01 udevd-work[2276]: 
> >> '/etc/xen/scripts/vif-setup offline type_if=vif' unexpected exit with 
> >> status 0x000f
> > 
> > This might be worth following up on.
> 
> When putting a "sleep 5" in the stop section of /etc/init.d/xendomains:
> case "$1" in
>     start)
>         start
>         rc_status
>         if test -f $LOCKFILE; then rc_status -v; fi
>         ;;
> 
>     stop)
>         stop
>         rc_status -v
>         sleep 5
>         ;;
> 
> then the system shuts down as expected and is rebooting properly.
> In the daemon.log file I couldn't find the error
> Sep  8 15:44:37 rootsrv01 udevd-work[2276]: '/etc/xen/scripts/vif-setup
> offline type_if=vif' unexpected exit with status 0x000f
> any more; it seems to have disappeared after adding the delay. Could it
> be a race condition during shutdown with the udev daemon?

It could be a race between the guests actually shutting down and the rest
of the initscripts running.

Really the initscript ought to wait; at least with the script shipped
with Xen the default is to do so, by using shutdown --wait. Can you
confirm whether or not this is happening for you?
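If the stock waiting logic is being skipped on your system, you can approximate it by hand. A minimal sketch of such a wait loop (the use of `xm list` to count domains and the 60-second timeout are assumptions for illustration, not taken from the Debian script):

```shell
#!/bin/sh
# Wait up to 60s for all domUs to finish shutting down before the
# rest of the initscripts (udev, xenstored, networking) tear down.
timeout=60
while [ "$timeout" -gt 0 ]; do
    # 'xm list' prints a header line plus one line per domain;
    # 2 lines or fewer means only Domain-0 is left.
    ndoms=$(xm list 2>/dev/null | wc -l)
    [ "$ndoms" -le 2 ] && break
    sleep 1
    timeout=$((timeout - 1))
done
```

Unlike a fixed "sleep 5", this returns as soon as the guests are actually gone, and bounds how long a wedged guest can stall the reboot.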

Possibly someone is trying to talk to xenstore after xenstored has
exited -- I expect that would cause the sort of "blocked for more than
120 seconds" messages you are seeing.



_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

