
Re: [Xen-users] how to start VMs in a particular order



Joost Roeleveld <joost@xxxxxxxxxxxx> writes:

>> On Monday 30 June 2014 13:43:16 lee wrote:
>>> Joost Roeleveld <joost@xxxxxxxxxxxx> writes:
>> 
>> IIRC, there is a way to somehow display the SMART info, probably with
>> arcconf.  I'd rather not use that if it might cause problems.
>> Connecting them to SATA ports would be going to some lengths.  In any
>> case, I'd get some numbers that won't tell me anything, and that three
>> disks would suddenly go bad only because they are connected to a
>> different controller seems very unlikely.
>
> Check the howtos for smartctl, they explain how to interpret the data.
> I'd recommend:
> http://www.smartmontools.org/

Ok, if I get to see the numbers, I can look there.  I've never believed
in this SMART thing ...
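In case it helps, here's the sort of thing I'd look for once the numbers
are visible.  This is only a sketch: the attribute lines below are
invented sample values mimicking `smartctl -A` output, and on a RAID
controller a pass-through option such as `-d sat` may be needed (an
assumption on my part; check the smartmontools docs for your controller).

```shell
# Invented sample of `smartctl -A /dev/sdX` attribute lines, for
# illustration only -- real output comes from e.g. `smartctl -A -d sat /dev/sdX`.
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0'

# Flag the usual suspects (attributes 5, 197, 198) when their raw value
# (field 10) is non-zero -- those are the numbers that actually mean something.
echo "$sample" | awk '$1==5 || $1==197 || $1==198 { if ($10 > 0) print $2, "raw =", $10 }'
```

With the sample above, only `Current_Pending_Sector raw = 3` would be
flagged, which is the kind of number worth watching.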

>> >> I know, they aren't suited for this purpose.  Yet they have been working
>> >> fine on the P800, and that three disks should decide to go bad in a way
>> >> that blocks the controller (or whatever happens) every now and then
>> >> seems unlikely.
>> > 
>> > No, it doesn't.
>> 
>> Why not?
>
> Because I've seen it happen.

You have seen three (or more) disks going bad all at the same time just
because they were connected to a different controller?

> WD makes good disks, but those 2TB green drives you are using gave me
> the largest amount of failures I ever experienced. I don't even bother
> sending them back for warranty replacement anymore.

They really aren't the greatest disks one can imagine.  I'd say they are
ok for what they are and better than their reputation, considering the
price --- you could get them for EUR 65 new a few years ago, maybe even
less, before all disk prices increased.  I'll replace them with
something more suitable when they fail.

>> > Does the error occur after the server has been idle for a while? Or when
>> > the disks are being stressed?
>> 
>> I haven't seen any relation between disk usage and crashes.  There seem
>> to have been different reasons for crashing, i.e. first it would crash
>> with "swiotlb is full",
>
> That happens when the buffer is full, from a very quick read on the
> subject (so please, someone with more knowledge, correct me if I am
> mistaken), this can be caused when the underlying I/O system is not
> able to keep up.

It's probably more complicated than that.  Systems would go down all the
time if exceeding their I/O capacity made them crash.
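For reference --- this is from the kernel's boot-parameter documentation,
not something I've tried on this box --- the swiotlb bounce buffer
defaults to 64 MB and can be enlarged at boot, which is sometimes
suggested when dom0 keeps running out of it:

```shell
# /etc/default/grub -- enlarge the swiotlb bounce buffer (a sketch, not
# a tested fix for this machine).  The swiotlb= parameter takes the
# number of 2 KB I/O TLB slabs: the default 32768 slabs = 64 MB, so
# 131072 slabs = 256 MB.  Run update-grub and reboot afterwards.
GRUB_CMDLINE_LINUX="swiotlb=131072"
```

Whether that merely delays the "buffer is full" message or actually
avoids it presumably depends on why the buffer fills up in the first
place.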

>> then with "arcconf seems to hang" and now with
>> "scsi bus hanging?".
>
> These might be different ways of showing the same error, just being
> passed on to a different subsystem.

Possibly --- I'd think it also means that something must have changed
when a particular error that repeatedly showed up as X suddenly and
repeatedly shows up as Z instead.

(The kernel version changed.  I haven't figured out what in particular
changed in the kernel code around the place where that message is
generated, and the relevant changes may very well be somewhere else.
What did change is that the swiotlb message is now printed by different
means to prevent a flood of messages that makes the system unusable.  I
don't know exactly what those means do; it's possible that the message
isn't printed to the console anymore, so that the "scsi bus hangs"
message has now become visible where it previously wasn't.)

>> It still crashed with PHY on 1, and I'm on 2 now.  It hasn't crashed in
>> over a day yet [knocks on wood].  If it works now, I'll leave it at 2;
>> if it crashes again, I'll increase to 3 ...
>
> Interesting: while googling for the PHY setting, I came across the
> following URL:
> http://serverfault.com/questions/95190/ibm-serverraid-8k-and-sata1-issue
>
> The following comes from there:
> ***
> The reason your Sata drives are running at 1.5Gb/s vs 3.0Gb/s on your
> server is because there was a bug in the backplane that caused 30 second
> freezes under heavy workloads.
> [...]
>
> You might want to look into that, as it's the same server and raid-card
> as you are using.
> Do note, the website for that IBM-link does not work at the moment.

Yes, I had found the same page.  I'm not sure that statement is true,
because the P800 also links SATA at 1.5 and SAS at 3 Gbit/sec, without
a backplane in the way.  It is probably true that IBM --- and/or
Adaptec --- ran into problems with SATA drives connected to the
controller which they couldn't really solve, for otherwise there
wouldn't be a need to implement different PHY settings, or a utility in
the controllers' BIOS to let users change them.

The documentation speaks of "different SATA channels" and claims that
improvements have been made to the PHY settings, apparently hiding
what's actually going on.

Anyway, server uptime is 3 days, 9 hours now.  That's a great
improvement :)

So, for what it's worth: for WD20EARS on a ServeRaid 8k, try different
PHY settings.  PHY 2 seems to work much better than 0, 1 and 5.

> True, but, SATA drives don't always work when used with port
> multipliers, which, from the above, I think you are actually using.

Hm, I doubt it.  The drive slots are numbered 0--5, and I can set a PHY
setting for each drive individually.  Would I be able to do that if a
PMP was used?  And can a single port keep up with 6 SAS drives?
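The bandwidth question can be answered with back-of-the-envelope
numbers (mine, not from the thread): a single 3 Gbit/s link carries
roughly 300 MB/s after 8b/10b encoding, while six spinning disks at a
ballpark ~100 MB/s sequential each would want about twice that.

```shell
# Rough arithmetic: can one 3 Gbit/s link feed six disks?
# (The per-disk figure is an assumed ballpark for 2 TB spinning disks.)
link_mbps=300       # ~3 Gbit/s minus 8b/10b encoding overhead
per_disk_mbps=100   # assumed sequential throughput per disk
disks=6

echo "demand: $((per_disk_mbps * disks)) MB/s vs link: $link_mbps MB/s"
```

So a single port behind a multiplier would be oversubscribed roughly
2:1 under sequential load --- another reason to doubt a PMP is in play.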

I'd have to take it all apart to see how this backplane is made.  ---
You may think I'm silly, but I really marvelled at the drive caddies.
They are anything but simple and *not* easy to manufacture.

>> But then, it seems that a SATA link goes down, or can go down, when a
>> disk saves power.  So you might be right: disk goes to sleep, controller
>> cannot re-establish the link because of PHY settings, and then things hang.
>
> Yep, it all depends on what is happening; without proper error logs
> and reproducible crashes, it will be difficult to determine exactly
> what is happening.

Yes --- I have two PHY settings left I can try if I have to.  If that
doesn't help, I can look into disabling power saving.
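If it comes to that, something along these lines is what I'd try --- a
sketch only, with /dev/sda as a placeholder and the commands echoed
rather than run (drop the echo and run as root to actually apply them).
WD's Green drives also have their own idle3 head-parking timer, which
the idle3-tools package can reportedly adjust.

```shell
# Hedged sketch for disabling drive power saving; DISK is a placeholder.
# The hdparm calls are echoed so this is harmless to paste as-is.
DISK=/dev/sda

echo hdparm -B 255 "$DISK"   # APM off: no head parking / low-power states
echo hdparm -S 0 "$DISK"     # standby (spin-down) timer off
```

Whether the controller honours these through its own firmware is another
question, of course.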


-- 
Knowledge is volatile and fluid.  Software is power.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

