RE: [Xen-devel] disk io errors possibly caused by high network load?
A strange thing is that not a single non-Xen machine went down. I will set up two Xen machines on a switch with a loop and see what I get ;)

-----Original Message-----
From: James Harper [mailto:james.harper@xxxxxxxxxxxxxxxx]
Sent: Friday, September 19, 2008 3:22 PM
To: Moritz Möller; Ian Pratt; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] disk io errors possibly caused by high network load?

>
> We rebooted the machines really quickly because it was a production
> system, so I didn't have the time to copy the logs, and on the disks I
> see nothing about this in the logfiles, probably because the IO was
> already down.
>
> The machines are Supermicro, Intel Xeon Quad or Dual-Quadcore, 8 to 32
> GB RAM, and some have an mdraid setup with two SATA drives with the on
> board sata controller (intel ICH), others have a dedicated 3ware / AMCC
> 9660 or similar.
>
> The machines that crashed were on different power lines and connected to
> different switches, although on the same network segment. Also there
> was no physical interference.
>
> The error was reported by domU and dom0 - both saying the disk would
> give an I/O error, but no specific information.
>
> Network card is intel e1000.

The error wasn't a timeout, was it?

We had a similar problem under Windows (no Xen involved at all) where the switch the server was plugged into was looped back to itself one evening. Any broadcast packet sent to the switch would just circulate around the switch indefinitely, until there were enough broadcast packets looping around that everything ground to a halt. The server was an HP DL380, so a more than capable machine, but there were enough interrupts occurring due to a completely saturated network that everything was reporting timeouts.

In this case the server didn't require a reboot. It sat in that state the whole night, reporting disk timeouts etc., but the moment we rectified the cabling fault in the morning it instantly bounced back to life.
It could be that Linux treats timeout errors a little more severely? Can anyone say if the layer above blkfront in the Linux kernel will report timeouts? Or would the errors have been coming through from Dom0?

Anyway, do you have a test environment you can reproduce the problem on? If the problem is as simple as a looped switch then it shouldn't be too hard to reproduce...

James

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
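
When reproducing a looped-switch scenario like the one described above, it may help to record how fast packets arrive on each interface in dom0 or domU while the disk errors appear. The following is only a minimal sketch, assuming a Linux host where /proc/net/dev is available; the function names `parse_rx_packets` and `rx_rate` and the 5-second interval are illustrative choices, not anything from this thread. It counts all received packets per interface and does not single out broadcasts.

```python
import time


def parse_rx_packets(proc_net_dev_text):
    """Return {interface: received packet count} from /proc/net/dev text."""
    counts = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        values = fields.split()
        # Receive columns start with: bytes, packets, errs, drop, ...
        counts[name.strip()] = int(values[1])
    return counts


def rx_rate(before, after, seconds):
    """Packets received per second for interfaces present in both samples."""
    return {
        iface: (after[iface] - before[iface]) / seconds
        for iface in after
        if iface in before
    }


if __name__ == "__main__":
    interval = 5  # seconds between samples (hypothetical choice)
    with open("/proc/net/dev") as f:
        first = parse_rx_packets(f.read())
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        second = parse_rx_packets(f.read())
    for iface, rate in sorted(rx_rate(first, second, interval).items()):
        print(f"{iface}: {rate:.0f} packets/s received")
```

Running this in a loop on both a Xen and a non-Xen machine during the test would show whether the two see a comparable packet rate when the I/O errors start.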