Xen project Mailing List

Re: [Xen-users] how to start VMs in a particular order

From: Joost Roeleveld <joost@xxxxxxxxxxxx>

Date: Wed, 02 Jul 2014 08:45:16 +0200

Delivery-date: Wed, 02 Jul 2014 06:45:40 +0000

List-id: Xen user discussion <xen-users.lists.xen.org>

On Tuesday 01 July 2014 23:48:51 lee wrote:
> Joost Roeleveld <joost@xxxxxxxxxxxx> writes:
> > Check the howtos for smartctl, they explain how to interpret the data.
> > I'd recommend:
> > http://www.smartmontools.org/
>
> Ok, if I get to see the numbers, I can look there. I never believed in
> this smart thing ...

You just wait for disks to die suddenly?

> >> >> I know, they aren't suited for this purpose. Yet they have been
> >> >> working fine on the P800, and that three disks should decide to go
> >> >> bad in a way that blocks the controller (or whatever happens) every
> >> >> now and then seems unlikely.
> >> >
> >> > No, it doesn't.
> >>
> >> Why not?
> >
> > Because I've seen it happen.
>
> You have seen three (or more) disks going bad all at the same time just
> because they were connected to a different controller?

Yes, it was a cheap controller though, but it did actually kill any disk
I connected to it. I was working at a computer shop at the time and the
owner wanted us to try different disks even though the first 2(!) died
and those wouldn't work on any other system anymore.

> > WD makes good disks, but those 2TB green drives you are using gave me
> > the largest amount of failures I ever experienced. I don't even bother
> > sending them back for warranty replacement anymore.
>
> They really aren't the greatest disk one can imagine. I'd say they are
> ok for what they are and better than their reputation, considering the
> price --- you could get them for EUR 65 new a few years ago, maybe even
> less, before all disk prices increased. I'll replace them with
> something suitable when they fail.

For twice that, I got 3TB WD Red drives a few years ago, after the
factories came back online.

> > That happens when the buffer is full, from a very quick read on the
> > subject (so please, someone with more knowledge, please correct me if I
> > am mistaken), this can be caused when the underlying I/O system is not
> > able to keep up.
>
> It's probably more complicated than that. Systems would go down all the
> time if exceeding their I/O capacity would make them crash.

It depends on how big the capacity is and how the underlying hardware
handles it.

> >> then with "arcconf seems to hang" and now with
> >> "scsi bus hanging?".
> >
> > These might be different ways of showing the same error, just being passed
> > on to a different subsystem.
>
> Possibly --- I'd think it also means that something must have changed
> when a particular error that repeatedly showed up as X suddenly and
> repeatedly shows up as Z instead.
>
> (The kernel version changed. I haven't figured out what in particular
> changed in the kernel code around the place where that message is
> generated, and it may very well be changes somewhere else which are
> relevant. What did change is that the swiotbl message is now printed by
> different means to prevent a flood of messages that makes the system
> unusable. I don't know what these means exactly do; it's possible that
> the message isn't printed to console anymore so that now the "scsi bus
> hangs" message has become visible after it had been not.)

I saw some changes listing that the kernel should report it rather than
panic. Possibly the next step was to handle the errors at a different
point.

> >> It still crashed with PHY on 1, and I'm on 2 now. It hasn't crashed in
> >> over a day yet [knocks on wood]. If it works now, I'll leave it at 2;
> >> if it crashes again, I'll increase to 3 ...
> >
> > Interesting, while googling for the PHY setting, I come across the
> > following URL:
> > http://serverfault.com/questions/95190/ibm-serverraid-8k-and-sata1-issue
> >
> > The following comes from there:
> > ***
> > The reason your Sata drives are running at 1.5Gb/s vs 3.0Gb/s on your
> > server is because their was a bug in the backplane that caused 30 second
> > freezes under heavy workloads.
> > [...]
> >
> > You might want to look into that, as it's the same server and raid-card as
> > you are using.
> > Do note, the website for that IBM-link does not work at the moment.
>
> Yes, I had found the same page. I'm not sure if that statement is true
> because the P800 also links SATA with 1.5 and SAS with 3Gbit/sec,
> without a backplane in the way. It is probably true that IBM --- and/or
> Adaptec

I believe you are using an IBM RAID controller, not an Adaptec part. At
least, I can't see Adaptec in any of the documentation I saw online.

> --- ran into problems with SATA drives connected to the
> controller they couldn't really solve, for otherwise there wouldn't be a
> need to implement different PHY settings and even a utility in the
> controllers' BIOS to let users change them.

The backplane used in these systems, from my understanding, has a port
multiplier built in. I think it is that part causing the problem.

> The documentation speaks of "different SATA channels" and claims that
> improvements have been made to the PHY settings, apparently hiding
> what's actually going on.

SAS and SATA controllers often talk about SATA channels. My RAID
controller even still calls them IDE channels. It's just a name.

> Anyway, server uptime is 3 days, 9 hours now. That's a great
> improvement :)
>
> So for what's it worth: For WD20EARS on a ServeRaid 8k, try different
> PHY settings. PHY 2 seems to work much better than 0, 1 and 5.

That is useful news, especially if that keeps the system running.
Maybe post that online somewhere, including on that page?

> > True, but, SATA drives don't always work when used with port multipliers,
> > which from the above, I think you are actually using.
>
> Hm, I doubt it. The drive slots are numbered 0--5, and I can set a PHY
> setting for each drive individually. Would I be able to do that if a
> PMP was used?

Yes, the question is, does the PMP used handle that correctly?

> And can a single port keep up with 6 SAS drives?

How many drives do you know of that can provide a sustained datastream
of 3Gb/s? Or, in the case of 6 drives, 500Mb/s?
Assuming you have a drive that can sustain 200Mb/s, that still means a
single port can theoretically handle 3000 / 200 = 15 disks.

With SSDs the picture is slightly different. With a sustained read speed
of 550Mb/s, you would get nearly 5.5 disks.

So, yes, a single port can easily keep up with 6 SAS drives.

> I'd have to take it all apart to see how this backplane is made. ---
> Think I'm silly, but I really marvelled at the drive caddies. They are
> anything but simple and *not* easy to manufacture.

IOW, expensive replacements if they break.

> >> But then, it seems that an SATA link goes down or can go down when a
> >> disk saves power. So you might be right: disk goes to sleep, controller
> >> cannot re-establish link because of PHY settings, and then things hang.
> >
> > Yep, it all depends on what is happening, without proper errorlogs and
> > reproducable crashes, it will be difficult to determine exactly what is
> > happening.
> Yes --- I have two PHY settings left I can try if I have to. If that
> doesn't help, I can look into disabling power saving.

I hope setting 2, as you mentioned above, keeps it stable.

--
Joost

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
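The link-bandwidth estimate in the message above (3000 / 200 = 15 disks) can be reproduced with a short Python sketch. It takes the figures as written, in megabits per second. As a separate and clearly labeled assumption, it also shows what the same division gives if the per-drive figures were meant as megabytes per second: a 3 Gb/s SAS/SATA link uses 8b/10b encoding, so it carries roughly 300 MB/s of payload. The function and variable names are illustrative only and do not come from the thread.

```python
def disks_per_link(link_rate, per_disk_rate):
    """How many disks' sustained throughput fits in one link.

    Both arguments must be in the same unit (both Mb/s or both MB/s).
    """
    return link_rate / per_disk_rate


LINK_MBIT = 3000  # 3 Gb/s line rate expressed in Mb/s, as in the message

# Per-drive share if six drives split one link evenly.
share_of_six = LINK_MBIT / 6

# Figures read exactly as written in the message (megabits per second).
hdd_as_written = disks_per_link(LINK_MBIT, 200)
ssd_as_written = disks_per_link(LINK_MBIT, 550)

# Assumption (not stated in the message): the drive figures are MB/s.
# With 8b/10b encoding, 10 line bits carry 1 data byte, so 3 Gb/s
# corresponds to about 300 MB/s of payload.
LINK_MBYTE = LINK_MBIT / 10
hdd_if_mbytes = disks_per_link(LINK_MBYTE, 200)
ssd_if_mbytes = disks_per_link(LINK_MBYTE, 550)

if __name__ == "__main__":
    print(f"share per drive with 6 drives: {share_of_six:.0f} Mb/s")
    print(f"as written:  HDD {hdd_as_written:.2f}, SSD {ssd_as_written:.2f} disks per link")
    print(f"if MB/s:     HDD {hdd_if_mbytes:.2f}, SSD {ssd_if_mbytes:.2f} disks per link")
```

Which reading applies depends on whether the drive throughput numbers are megabits or megabytes, so it is worth checking the units on the drive's datasheet before relying on either result.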
