
Re: [Xen-users] Dom0 crashed when rebooting whilst DomU are running



On Sep 10, 2012, at 5:10 PM, Ian Campbell wrote:

> On Mon, 2012-09-10 at 16:00 +0100, Maik Brauer wrote:
>> On Sep 10, 2012, at 10:39 AM, Ian Campbell wrote:
>> 
>>> On Sat, 2012-09-08 at 15:50 +0100, Maik Brauer wrote:
>>>> On Sep 4, 2012, at 10:11 AM, Ian Campbell wrote:
>>>> 
>>>>> Could you not top-post, please? It makes it rather hard to follow
>>>>> the flow of the conversation.
>>>>> On Mon, 2012-09-03 at 18:10 +0100, Casey DeLorme wrote:
>>>>>> As stated, you can alias shutdown to do exactly what you need, it can
>>>>>> be as simple as a series of hard-coded operations to a complex custom
>>>>>> shell script that parses your domains and closes each with feedback.
>>>>> 
>>>>> Xen ships the "xendomains" initscript, which can halt guests on shutdown
>>>>> as well as automatically start specific guests on boot. It can also be
>>>>> configured to suspend/resume them or (I think) migrate them away.
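
For reference, those behaviours are driven from the xendomains defaults file (/etc/default/xendomains on Debian, /etc/sysconfig/xendomains elsewhere). A minimal sketch, assuming the variable names from the stock script (check your distribution's copy, as paths and defaults vary):

```
# /etc/default/xendomains -- sketch only, defaults differ per distribution
XENDOMAINS_AUTO=/etc/xen/auto        # guests (re)started automatically on boot
XENDOMAINS_SAVE=/var/lib/xen/save    # suspend guests here on shutdown ...
XENDOMAINS_RESTORE=true              # ... and resume them on the next boot
XENDOMAINS_SHUTDOWN="--halt --wait"  # otherwise do a clean, blocking shutdown
XENDOMAINS_MIGRATE=""                # or migrate guests to another host instead
```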
>>>>> 
>>>>> For diagnosing the crash itself more details will be required than were
>>>>> provided in the original post. Please see
>>>>> http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for some guidance.
>>>>> At a minimum we would need a capture (serial console or photo) of the
>>>>> crash backtrace.
>>>>> 
>>>>> Ian.
>>>>> 
>>>>> 
>>>> I found out that it hangs during re-boot of dom0 when having more
>>>> Network interfaces involved, like:
>>>>     vif = [ 'mac=06:46:AB:CC:11:01, ip=<myIPadress>',
>>>>             '',
>>>>             '',
>>>>             'mac=06:04:AB:BB:11:03, bridge=VLAN20, script=vif-bridge',
>>>>             '',
>>>>             'mac=06:04:AB:BB:11:05, bridge=VLAN40, script=vif-bridge' ]
>>> 
>>> Six interfaces in total, three of which get a random MAC on each reboot
>>> and all of which get put on the default bridge?
>> 
>> No, not really. The bridge is different for each interface.
> 
> You have three lots of '' which will all go onto the same bridge AFAICT
> (whichever one is determined to be the default)

That is right. As long as I don't specify a different script to execute, it 
will use the default bridge for ''.
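
To rule the default bridge out, the empty entries can be pinned down explicitly. A sketch of the same vif line with every interface given a bridge (the name xenbr0 is an assumption; substitute whatever your default bridge is actually called):

```
vif = [ 'mac=06:46:AB:CC:11:01, ip=<myIPadress>',
        'bridge=xenbr0',                                   # previously ''
        'bridge=xenbr0',                                   # previously ''
        'mac=06:04:AB:BB:11:03, bridge=VLAN20, script=vif-bridge',
        'bridge=xenbr0',                                   # previously ''
        'mac=06:04:AB:BB:11:05, bridge=VLAN40, script=vif-bridge' ]
```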
> 
>>> If it is a hang then you might have some luck using the magic SysRq keys
>>> to print lists of blocked tasks. I'm not sure about Squeeze, but you might
>>> need to enable this as described in Documentation/sysrq.txt in the Linux
>>> source.
>>> 
>>> Blocked tasks are listed with SysRq-'w'. If you have a serial console then
>>> 't' will list all tasks, but that list can be quite long, so it is of
>>> little use without a serial console.
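
On Squeeze the SysRq keys may indeed be disabled by default. Assuming the standard sysctl interface, enabling them looks like this (needs root):

```
# one-off, effective until the next reboot
echo 1 > /proc/sys/kernel/sysrq

# persistent, via a line in /etc/sysctl.conf
kernel.sysrq = 1

# without a working keyboard, the same requests can be injected from a shell
echo w > /proc/sysrq-trigger
```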
>> 
>> The list is empty. SysRq-w and SysRq-t show nothing at all.
> 
> You might need to increase the log verbosity with SysRQ-9 first?

I did, and now I get more information. But due to the amount of data that 
scrolls past on the console screen, I am not able to record it properly. 
Can you advise what to do here?
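
Short of filming the screen, the usual way to capture everything is a serial console. A sketch of the boot parameters, assuming COM1 at 115200 baud (adjust device and speed for your hardware and GRUB version):

```
# Xen hypervisor command line
com1=115200,8n1 console=com1,vga loglvl=all guest_loglvl=all

# dom0 kernel command line
console=hvc0 console=tty0
```

With that in place, the full SysRq-t dump can be logged on a second machine, e.g. with minicom or `screen /dev/ttyS0 115200`.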
> 
>> There is nothing running anymore. It shows periodically:
>>     INFO: task xenwatch:12 blocked for more than 120 seconds
> 
> What is the very last thing printed before this?

There is nothing before. Just that message pops up periodically.
> 
>> It seems that xenwatch is blocking the reboot here; is that assumption 
>> correct? But strangely, I can't see any process anymore with SysRq-t 
>> or SysRq-w.
> 
> The xenwatch thread ought to count as a process for at least the
> purposes of SysRQ-t if not -w.

Could be, but the output scrolls over the screen so fast that I am not able 
to read it line by line. Please advise a procedure to record it.
> 
>> 
>>>> In the logfile /var/log/messages you can find these as the last lines:
>>>>    Sep  8 15:44:28 rootsrv01 shutdown[2445]: shutting down for system reboot
>>>>    Sep  8 15:44:31 rootsrv01 kernel: [   73.716246] VLAN20: port 1(vif2.3) entering forwarding state
>>>>    Sep  8 15:44:31 rootsrv01 kernel: [   74.500111] VLAN40: port 1(vif2.5) entering forwarding state
>>>>    Sep  8 15:44:34 rootsrv01 kernel: [   77.317431] VLAN20: port 1(vif2.3) entering disabled state
>>>>    Sep  8 15:44:34 rootsrv01 kernel: [   77.317490] VLAN20: port 1(vif2.3) entering disabled state
>>>>    Sep  8 15:44:36 rootsrv01 kernel: [   79.368685] VLAN40: port 1(vif2.5) entering disabled state
>>>>    Sep  8 15:44:36 rootsrv01 kernel: [   79.369156] VLAN40: port 1(vif2.5) entering disabled state
>>>>    Sep  8 15:44:37 rootsrv01 kernel: Kernel logging (proc) stopped.
>>>>    Sep  8 15:44:37 rootsrv01 rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="890" x-info="http://www.rsyslog.com"] exiting on signal 15.
>>>> 
>>>> In /var/log/daemon.log you can find these messages:
>>>>        Sep  8 15:44:37 rootsrv01 acpid: exiting
>>>>        Sep  8 15:44:37 rootsrv01 rpc.statd[750]: Caught signal 15, un-registering and exiting
>>> 
>>> All the above (both message and daemon.log) look like normal parts of
>>> shutting down to me.
>>> 
>>>>        Sep  8 15:44:37 rootsrv01 udevd-work[2276]: '/etc/xen/scripts/vif-setup offline type_if=vif' unexpected exit with status 0x000f
>>> 
>>> This might be worth following up on.
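
If the logged value is the raw wait status, as udevd-work appears to report it, then 0x000f means the vif-setup call did not exit on its own: the exit-code byte is zero and the low seven bits are 15, i.e. it was killed by signal 15 (SIGTERM), presumably during the shutdown itself. A quick decode of that POSIX encoding:

```shell
# Decode a raw wait status: high byte is the normal exit code,
# low 7 bits the terminating signal (if any).
status=0x000f
sig=$(( status & 0x7f ))          # 15 -> SIGTERM
code=$(( (status >> 8) & 0xff ))  # 0  -> no normal exit code
echo "terminated by signal $sig, exit code field $code"
```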
>> 
>> When putting a "sleep 5" in the stop section of /etc/init.d/xendomains:
>> case "$1" in
>>    start)
>>        start
>>        rc_status
>>        if test -f $LOCKFILE; then rc_status -v; fi
>>        ;;
>> 
>>    stop)
>>        stop
>>        rc_status -v
>>        sleep 5
>>        ;;
>> 
>> then the system shuts down as expected and reboots properly.
>> In the daemon.log file I couldn't find the error
>>     Sep  8 15:44:37 rootsrv01 udevd-work[2276]: '/etc/xen/scripts/vif-setup offline type_if=vif' unexpected exit with status 0x000f
>> anymore. It seems to have disappeared after putting the delay in. Could it 
>> be a race condition here during shutdown, with the udev daemon?
> 
> It could be a race between the guests actually shutting down and the rest
> of the initscripts running.
> 
> Really the initscript ought to wait; the default, at least with the script
> shipped with Xen, is to do so by using shutdown --wait. Can you confirm
> whether or not this is happening for you?

At least I can see that shutdown --wait is in the scripts, so it seems that 
the init script is waiting. But independent of that, something must still be 
in use, which blocks the reboot process.
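
If the stock --wait path is suspect, the same check can be scripted independently and run before the rest of the shutdown continues. A minimal sketch, assuming the xm toolstack (substitute `xl list` if you use xl) and its usual output of one header line plus one line per domain, Domain-0 included; the function names here are made up:

```shell
#!/bin/sh
# Explicitly wait for all guests to disappear before letting the
# shutdown sequence proceed, instead of papering over the race
# with a fixed "sleep 5".

guest_count() {
    # Number of running guests, i.e. domains other than Domain-0.
    xm list 2>/dev/null | tail -n +2 | grep -cv '^Domain-0'
}

wait_for_guests() {
    # Poll once per second until all guests are gone,
    # or give up after $1 seconds (default 60).
    t=${1:-60}
    while [ "$t" -gt 0 ] && [ "$(guest_count)" -gt 0 ]; do
        sleep 1
        t=$((t - 1))
    done
    [ "$(guest_count)" -eq 0 ]
}
```

Called from the stop) branch of the initscript, this would make the wait explicit and log-able, and its exit status tells you whether guests really were gone when the rest of the shutdown ran.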
> 
> Possibly someone is trying to talk to xenstore after xenstored has
> exited -- I expect that would cause the sort of "blocked for more than
> 120 seconds" messages you are seeing.
> 
Could be, but we need to find out what is blocking the shutdown. I do not know 
what else I can do in order to measure and collect data for investigation; 
let me know what else I can do. You can easily reproduce this issue when using 
more than three network devices. I have installed this now on several machines 
at home, and all of them show the same issue when using more than 2-3 network 
interfaces.



_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

