
Re: [Xen-users] Debugging (possible) Xen-related hang-issues without the possibility of attaching serial console to capture Xen output



On 28.03.2012 15:03, Tom Snowhorn wrote:
Got it.  I've seen hundreds, if not thousands of Kingston sticks and
never encountered a bad one, let alone 3 times in a row from a
reputable manufacturer.  I'm guessing your problem is not confined to
the box's hardware itself.

No, it's not confined to the specific hardware that I originally noted. As detailed in the last mail, I've had the same hangs on three similar but different boxes (i.e. same MoBo type, same RAM [by amount, but not make], same Adaptec RAID PCIe controller, same PSU type); only the HDs were always carried over. I won't get another (different) one to test on any time soon; the datacenter is pretty adamant about that. ;-)

When switching boxes or as emergency measures, I've tried several combinations of Xen/Dom0 kernels (Kernel: xenified 2.6.x, vanilla 3.0, vanilla 3.1.x, vanilla 3.2.6/.9; Xen: 4.0.1 up to 4.1.2, all revisions, no -RCs), and all of them fail similarly; see below for some more detail.

To detail the hardware: the board is an MSI X58 Pro-E (MS-7522) with current BIOS (I guess that's V8.15, but I don't have a chance to look now). The CPUs I tested were all i7 920+. No disks are attached to the mainboard; they are attached to a PCIe SATA RAID controller, an Adaptec 5405, configured as RAID-10. The network card is an Intel-based dual-port 1GBit card, also connected via PCIe. RAM is/was either Kingston or Samsung, 4GB modules, 24GB in total socketed on the mainboard in 6 banks.

I'm having trouble "reading between the lines" of your description.
What occurs with regard to the machine when this problem happens?  A
reboot?  You mentioned the word "hang", but you also mentioned that
the datacenter might be losing power or similar, so I'm guessing the
machine reboots some of the time?  If that's the case, are you using
the "noreboot" kernel flag?

Sorry that I wasn't really clear in the description (I've talked too much with colleagues of mine about this, and as such probably take "info" for granted): the system does hang hard, i.e. it doesn't reboot. I've turned off console blanking for the Dom0 kernel, and the Dom0 login prompt remains visible on the IP-KVM connected to the host, but the system is otherwise "frozen": pressing Caps/Numlock doesn't change the state of the corresponding keyboard LED (which the datacenter staff have "physically" confirmed for me), and there is no backtrace on the console. So I'm pretty sure that it's _not_ the Dom0 kernel which is panicking (at least it's not showing signs of a panic).

Whether Xen panics, I can't say. That's why I tried to get the datacenter to attach a serial console to the system in question, so that I might be able to grab the output of Xen and actually diagnose a possible hardware incompatibility. But as the original mail stated, I can't get the data-center provider to attach a serial cable... leading to my "cry for help." :-)

Why I hinted at the data-center possibly being responsible for the hangs: I wouldn't actually be surprised if voltage fluctuations, caused by too many systems being connected to the same circuit, made servers that were under high stress at that moment freeze. I've seen that in our own data-center, albeit very exceptionally: we had similar symptoms (non-reproducible "hangs") when too many servers were connected to a 16A circuit and their combined peak draw seemed to be just above that. Moving some of them to a different circuit, without restacking the rack(s), cleared the hangs for all of them. The same goes for temperature: I've not seen exceptionally high temperatures on the system sensors, but operating at 70°C for the southbridge is somewhat high in my book - a recipe for hangs.

The thing I'm currently trying to do is to exclude Xen from the loop (by making sure that it's not Xen that's hanging/going into "debug" mode), which would then leave a discussion with the hosting provider about rehousing the corresponding server(s). That is why a memory dump, or some other form of access to Xen's memory after I've reset the system to get it back up and running, would be extremely valuable.

Thanks for your help!

--
--- Heiko.

