[Xen-users] Re: [Linux-HA] very odd iowait problem
2010/6/19 Miles Fidelman <mfidelman@xxxxxxxxxxxxxxxx>:
> Hi Folks,
>
> I'm experiencing a very odd, daily, high-load situation that seems to
> localize in my disk stack. I'm directing this to the xen-users, linux-raid,
> and linux-ha lists, as I expect there's a pretty high degree of experience
> on these lists with complicated disk driver stacks.
>
> I recently virtualized a production system, and have been slowly wringing
> out the bugs that have shown up. This seems to be the last one, and it's
> a doozie.
>
> Basic setup: two identical machines, except for the DomUs they're running.
>
> Two machines, slightly older Pentium 4 processors, 4GB RAM each (the max),
> 2 CPUs each, 4 SATA drives each.
> Debian Lenny install for Dom0 and DomUs (2.6.26-2-xen-686)
>
> Disk setup on each:
> - 4 partitions on each drive
> - 3 RAID-1s set up across the 4 drives (4 drives in each - yes, it's
>   silly, but easy) - for Dom0 /boot, /, and swap
> - 1 RAID-6 set up across the 4 drives, set up as an LVM PV - underlies
>   all my DomUs
>   (note: all the RAIDs are set up with internal metadata and a chunk size
>   of 131072KB - per advice here - works like a charm)
> - pairs of LVs - / and swap per VM
> - each LV is linked with its counterpart on the other machine, using DRBD
> - the LVs are specified as drbd: devices in the DomU .cfg files
> - the LVs are mounted with the noatime option inside the production DomU
>   - makes a big difference
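For reference, here is a minimal sketch of the layering just described, on
one node. Every device name, size, hostname, and resource name below is
illustrative, not taken from the poster's actual configuration:

    # 3 four-way RAID-1 mirrors across all drives (hypothetical partitions)
    mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1  # /boot
    mdadm --create /dev/md1 --level=1 --raid-devices=4 /dev/sd[abcd]2  # /
    mdadm --create /dev/md2 --level=1 --raid-devices=4 /dev/sd[abcd]3  # swap

    # RAID-6 across the 4th partitions, used as the LVM physical volume
    mdadm --create /dev/md3 --level=6 --raid-devices=4 /dev/sd[abcd]4
    pvcreate /dev/md3
    vgcreate vg_domu /dev/md3

    # one root + one swap LV per DomU (names and sizes hypothetical)
    lvcreate -L 20G -n prod-root vg_domu
    lvcreate -L 2G  -n prod-swap vg_domu

Each LV pair then backs a DRBD resource replicated to the twin machine, and
the DomU config points at the DRBD resource, along these lines:

    # hypothetical DRBD resource definition (8.x-style syntax)
    resource prod-root {
        protocol C;
        on nodea {
            device    /dev/drbd0;
            disk      /dev/vg_domu/prod-root;
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on nodeb {
            device    /dev/drbd0;
            disk      /dev/vg_domu/prod-root;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }

    # in the DomU .cfg, DRBD's block-drbd script attaches the resource:
    disk = [ 'drbd:prod-root,xvda1,w', 'drbd:prod-swap,xvda2,w' ]

One consequence of this layering is worth keeping in mind for what follows:
with protocol C (synchronous replication, the usual choice for an HA pair),
a DomU write completes only after it has reached the RAID arrays on both
machines - so a struggling disk on either node can stall the DomU.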
> A few DomUs - currently started and stopped either via links in
> /etc/xen/auto or manually - I've temporarily turned off heartbeat and
> pacemaker until I get the underlying stuff stable.
>
> ------
> now to the problem:
>
> For several days in a row, at 2:05am, iowait on the production DomU went
> from averaging 10% or so to 100% (I've been running vmstat 1 in a window
> and watching the iowait column).
>
> For the past two days, this has happened at 2:26am instead of 2:05.
>
> Rebooting the VM fixes the problem, though it has occurred again within
> 20 minutes of the reboot, and then another reboot fixes the problem until
> 2am the next day.
>
> Killing a bunch of processes also fixed things, but at that point so
> little was running that I just rebooted the DomU - unfortunately, one
> night it looked like lwresd was eating up resources, the next night it
> was something else.
>
> ------
> ok... so I'm thinking there's a cron job that's doing something that eats
> up all my I/O - I've traced a couple of other issues back to cron jobs -
> but I can't seem to find either a cron job that runs around this time, or
> anything in my logs.
>
> So now I've set up a bunch of things to watch what's going on - copies of
> atop running in Dom0 on both servers, and in the production DomU (note:
> I caught a couple more bugs by running top in a window and seeing what
> was frozen in the window after the machine crashed).
>
> ok - so I'm up at 2am for the 4th day in a row (along with a couple of
> proposals I'm writing during the day, and a couple of fires with my kids'
> computers - I've discovered that Mozy is perhaps the world's worst backup
> service - it's impossible to restore things) - anyway.... 2:26 rolls
> around, the iowait goes to 100%, and I start looking, using ps and iostat
> and lsof and such, to try to locate whatever process is locking up my
> DomU, when I notice:
>
> --- on one server, atop is showing one drive (/dev/sdb) maxing out at 98%
> busy - sort of suggestive of a drive failure, and something that would
> certainly ripple through all the layers of RAID, LVM, and DRBD to slow
> down everything on top of it (which is everything).
>
> Now this is pretty weird - given the way my system is set up, I'd expect
> a dying disk to show up as very high iowaits, but....
> - it's a relatively new drive
> - I've been running smartd, and smartctl doesn't yield any results
>   suggesting a drive problem
> - the problem goes away when I reboot the DomU
>
> One more symptom: I migrated the DomU to my other server, and there's
> still a correlation between seeing the 98% busy on /dev/sdb and seeing
> iowait of 100% on the DomU - even though we're now talking about a disk
> on one machine dragging down a VM on the other machine. (Presumably it's
> impacting DRBD replication.)
>
> So....
> - on the one hand, the 98% busy on /dev/sdb is rippling up through md,
>   lvm, drbd, and dom0 - and causing 100% iowait in the production DomU -
>   which is to be expected in a raided, drbd'd environment - a low-level
>   delay ripples all the way up
> - on the other hand, it's only affecting the one DomU, and it's not
>   affecting the Dom0 on that machine
> - there seems to be something going on at 2:25am, give or take a few
>   minutes, that kicks everything into the high-iowait state (but I can't
>   find a job running at that time - though I guess someone could be
>   hitting me with some spam that's kicking amavisd or clam into a
>   high-resource mode)
>
> All of which leads to two questions:
> - if it's a disk going bad, why does it manifest nightly, at roughly the
>   same time, and affect only one DomU?
> - if it's something in the DomU, by what mechanism is that rippling all
>   the way down to a component of a RAID array, hidden below several
>   layers of stuff that's supposed to isolate virtual volumes from
>   hardware?
>
> The only thought that occurs to me is that perhaps there's a bad record
> or block on that one drive that only gets exercised when one particular
> process runs.
> - is that a possibility?
> - if yes, why isn't drbd or md or something catching it and fixing it
>   (or adding the block to the bad-block table)?
> - any suggestions on diagnostic or drive-rebuilding steps to take next?
>   (including things I can do before staying up until 2am tomorrow!)
>
> If it weren't hitting me, I'd be intrigued by this one. Unfortunately,
> it IS hitting me, and I'm getting tireder and crankier by the minute,
> hour, and day. And it's now 4:26am. Sigh...
>
> Thanks very much for any ideas or suggestions.
>
> Off to bed....
>
> Miles Fidelman
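On the "diagnostic steps to take next" question, a few things can be run
during the day. All the command names here are standard, but the device and
array names (/dev/sdb, /dev/md3) are stand-ins for whatever the real layout
uses, and this is a sketch, not a prescription:

    # 1. Look for anything scheduled around 2am, in Dom0 and in the DomU:
    grep -r . /etc/crontab /etc/cron.d/ 2>/dev/null    # system-wide jobs
    for u in $(cut -d: -f1 /etc/passwd); do
        echo "== $u =="; crontab -l -u "$u" 2>/dev/null # per-user crontabs
    done

    # 2. Log per-disk utilization with timestamps, so the 98%-busy spike
    #    on /dev/sdb gets captured even with nobody awake to watch atop:
    while true; do
        date
        iostat -dxk 1 5        # extended device stats, 5 1-second samples
        sleep 55
    done >> /var/log/iostat-watch.log

    # 3. Ask the drive itself: a long SMART self-test reads every sector
    #    and can surface problems that routine smartd polling misses.
    smartctl -t long /dev/sdb    # start the test (takes hours, runs online)
    smartctl -a /dev/sdb         # afterwards: check the self-test log and
                                 # the Current_Pending_Sector and
                                 # Reallocated_Sector_Ct attributes

    # 4. Force md to read and verify every sector of the RAID-6; with the
    #    redundancy available, an unreadable block gets rewritten from
    #    parity, which also triggers the drive's own sector remapping:
    echo check > /sys/block/md3/md/sync_action
    cat /proc/mdstat                        # watch progress
    cat /sys/block/md3/md/mismatch_cnt      # afterwards: should be 0

On the bad-block theory: md only learns about an unreadable sector when
something actually reads it, so a nightly job walking rarely-touched files
is consistent with a stall at the same hour every night - and the "check"
pass in step 4 forces all of those reads to happen now, on your schedule.
Note also that a drive which retries internally and eventually succeeds
never reports an error upward, so md and smartd can both stay silent while
the disk gets slow; the SMART attributes and self-test log are the better
tell for that case.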
> --
> In theory, there is no difference between theory and practice.
> In<fnord> practice, there is.  .... Yogi Berra
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@xxxxxxxxxxxxxxxxxx
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

--
Ciro Iriarte
http://cyruspy.wordpress.com

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users