[Xen-users] CentOS domU hangs on "Restarting system" - didn't you have that one, too?


I'm still trying to pin down one of the last issues on some systems here.

I'm interested in input from people who *recognize* the following:

Sending all processes the TERM signal...     [  OK  ]
Sending all processes the KILL signal...     [  OK  ]
Saving random seed:                          [  OK  ]
Syncing hardware clock to system time        [FAILED]
Turning off swap:                            [  OK  ] 
Unmounting file systems:                     [  OK  ] 
Please stand by while rebooting the system. 
Restarting the system.                      
   \_______ this is a lie, no restart ever happens.

This error will occur sometimes, not always.
It reliably goes away upon a XenD restart.

OS: CentOS 5.4 / 32bit / Xen 3 (outdatedness grade indicator:

All guests (around 80) & hosts (10ish) run the same release, but I also
have done a test with one host running the latest and greatest Xen
version from CentOS 5.7

Things that I tried to blame so far:
= Old Xen version (switching to less old one didn't help)
= qemu VFB due to

= the event channel issue where the dom0 and domU are using different
vcpus while talking to each other
This could possibly be sorted with a nightmarish hack that maps all vcpus
onto one cpu at shutdown time by sshing into dom0. One would have to ensure
the mapping is restored after the reboot.
Err, you can imagine how much I "like" this idea.
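
For illustration only, the pinning hack could look roughly like the sketch
below, run in dom0. The function names are made up, and the "all" keywords
for vcpu-pin should be checked against "xm help vcpu-pin" on the installed
Xen version. Whether pinning actually avoids the hang is untested.

```shell
# Hypothetical helpers for dom0. XM can be overridden (e.g. for testing).
XM="${XM:-xm}"

# Pin every vcpu of a domain onto a single physical CPU before shutdown.
# $1 = domain name, $2 = physical CPU number (defaults to 0)
pin_all_vcpus() {
    domain="$1"
    cpu="${2:-0}"
    $XM vcpu-pin "$domain" all "$cpu"
}

# Undo the pinning once the guest is back up, so vcpus may float again.
# $1 = domain name
unpin_all_vcpus() {
    domain="$1"
    $XM vcpu-pin "$domain" all all
}
```

A domU would have to trigger pin_all_vcpus over ssh from its shutdown
scripts, and something would have to call unpin_all_vcpus after the reboot.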

= domU kernel: as yet untested. I have hardly any chance of updating it
and would rather need to backport the fix (if there is one) to 5.4

I found some posts by people who no longer got the error after
moving to something newer than CentOS 5.2, but that doesn't seem to have
done away with it completely.

So far I have failed to make this issue 100% reproducible. It will show up
minutes after freshly installing a Xen host, or it will not show up for a
week on another one. It may affect all VMs on a host, or it may affect only

You can work around it by:
1. xm destroy
2. killing any stuck qemu vfb processes (which is one of the reasons for
   pointing at the VFB)
3. service xend restart
4. xm create vm
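
The workaround steps above could be wrapped up roughly as follows. This is
a sketch, not a tested tool: recover_stuck_domu and run are made-up names,
and the pkill pattern assumes the stuck qemu process is called qemu-dm and
carries "-domain-name <name>" on its command line, which should be verified
with ps first. With DRY_RUN=1 the commands are only printed.

```shell
# Execute a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = domain name, $2 = path to the domain config file
recover_stuck_domu() {
    name="$1"
    config="$2"
    run xm destroy "$name"
    # Assumption: the leftover qemu process is named qemu-dm and was started
    # with "-domain-name <name>". pkill exits non-zero when nothing matches,
    # which is not an error here.
    run pkill -f "qemu-dm .*-domain-name $name( |\$)" || true
    run service xend restart
    run xm create "$config"
}
```

For example, "DRY_RUN=1 recover_stuck_domu vm1 /etc/xen/vm1" shows what
would be run without touching anything.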

but the xend restart introduces other issues, e.g. any unaffected VM
that is rebooted during the restart will be gone with the wind, and
you'd need some magic way to detect a stuck VM and trigger the
restart. Also, I'm not at all sure that a few hundred xend
restarts would do no harm over time...

The low chance of reproducing the issue is one of the big problems with
it[*], so if you remember this issue and did any successful troubleshooting
on it (or fixed it...), let me know.

Thanks :)

[*]Let alone systems that won't even make a reboot and what it makes me
think about the QA
