
Re: [Xen-users] how to start VMs in a particular order



Joost Roeleveld <joost@xxxxxxxxxxxx> writes:

>> On Monday 30 June 2014 13:43:16 lee wrote:
>>> Joost Roeleveld <joost@xxxxxxxxxxxx> writes:
>> 
>> IIRC, there is a way to somehow display the SMART info, probably with
>> arcconf.  I'd rather not use that if it might cause problems.
>> Connecting them to SATA ports would be going to some lengths.  In any
>> case, I'd get some numbers that won't tell me anything, and that three
>> disks would suddenly go bad only because they are connected to a
>> different controller seems very unlikely.
>
> Check the howtos for smartctl, they explain how to interpret the data.
> I'd recommend:
> http://www.smartmontools.org/

Ok, if I get to see the numbers, I can look there.  I've never believed
in this SMART thing ...
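In case it helps, here's the sort of thing I'd look for once the numbers
are visible.  This is only a sketch: the attribute lines below are
invented sample values mimicking `smartctl -A` output, and on a RAID
controller a pass-through option such as `-d sat` may be needed (an
assumption on my part; check the smartmontools docs for your controller).

```shell
# Invented sample of `smartctl -A /dev/sdX` attribute lines, for
# illustration only -- real output comes from e.g. `smartctl -A -d sat /dev/sdX`.
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0'

# Flag the usual suspects (attributes 5, 197, 198) when their raw value
# (field 10) is non-zero -- those are the numbers that actually mean something.
echo "$sample" | awk '$1==5 || $1==197 || $1==198 { if ($10 > 0) print $2, "raw =", $10 }'
```

With the sample above, only `Current_Pending_Sector raw = 3` would be
flagged, which is the kind of number worth watching.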

>> >> I know, they aren't suited for this purpose.  Yet they have been working
>> >> fine on the P800, and that three disks should decide to go bad in a way
>> >> that blocks the controller (or whatever happens) every now and then
>> >> seems unlikely.
>> > 
>> > No, it doesn't.
>> 
>> Why not?
>
> Because I've seen it happen.

You have seen three (or more) disks going bad all at the same time just
because they were connected to a different controller?

> WD makes good disks, but those 2TB green drives you are using gave me
> the largest amount of failures I ever experienced. I don't even bother
> sending them back for warranty replacement anymore.

They really aren't the greatest disks one can imagine.  I'd say they are
ok for what they are and better than their reputation, considering the
price --- you could get them for EUR 65 new a few years ago, maybe even
less, before all disk prices increased.  I'll replace them with
something more suitable when they fail.

>> > Does the error occur after the server has been idle for a while? Or when
>> > the disks are being stressed?
>> 
>> I haven't seen any relation between disk usage and crashes.  There seem
>> to have been different reasons for crashing, i.e. first it would crash
>> with "swiotlb is full",
>
> That happens when the buffer is full, from a very quick read on the
> subject (so please, someone with more knowledge, correct me if I am
> mistaken), this can be caused when the underlying I/O system is not
> able to keep up.

It's probably more complicated than that.  Systems would go down all the
time if exceeding their I/O capacity made them crash.
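For reference --- this is from the kernel's boot-parameter documentation,
not something I've tried on this box --- the swiotlb bounce buffer
defaults to 64 MB and can be enlarged at boot, which is sometimes
suggested when dom0 keeps running out of it:

```shell
# /etc/default/grub -- enlarge the swiotlb bounce buffer (a sketch, not
# a tested fix for this machine).  The swiotlb= parameter takes the
# number of 2 KB I/O TLB slabs: the default 32768 slabs = 64 MB, so
# 131072 slabs = 256 MB.  Run update-grub and reboot afterwards.
GRUB_CMDLINE_LINUX="swiotlb=131072"
```

Whether that merely delays the "buffer is full" message or actually
avoids it presumably depends on why the buffer fills up in the first
place.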

>> then with "arcconf seems to hang" and now with
>> "scsi bus hanging?".
>
> These might be different ways of showing the same error, just being
> passed on to a different subsystem.

Possibly --- I'd think it also means that something must have changed
when a particular error that repeatedly showed up as X suddenly and
repeatedly shows up as Z instead.

(The kernel version changed.  I haven't figured out what in particular
changed in the kernel code around the place where that message is
generated, and the relevant changes may very well be somewhere else.
What did change is that the swiotlb message is now printed by different
means to prevent a flood of messages that makes the system unusable.  I
don't know exactly what those means do; it's possible that the message
isn't printed to the console anymore, so that the "scsi bus hangs"
message has now become visible where it previously wasn't.)

>> It still crashed with PHY on 1, and I'm on 2 now.  It hasn't crashed in
>> over a day yet [knocks on wood].  If it works now, I'll leave it at 2;
>> if it crashes again, I'll increase to 3 ...
>
> Interesting: while googling for the PHY setting, I came across the
> following URL:
> http://serverfault.com/questions/95190/ibm-serverraid-8k-and-sata1-issue
>
> The following comes from there:
> ***
> The reason your Sata drives are running at 1.5Gb/s vs 3.0Gb/s on your
> server is because there was a bug in the backplane that caused 30 second
> freezes under heavy workloads.
> [...]
>
> You might want to look into that, as it's the same server and raid-card
> as you are using.
> Do note, the website for that IBM-link does not work at the moment.

Yes, I had found the same page.  I'm not sure that statement is true,
because the P800 also links SATA at 1.5 and SAS at 3 Gbit/sec, without
a backplane in the way.  It is probably true that IBM --- and/or
Adaptec --- ran into problems with SATA drives connected to the
controller which they couldn't really solve, for otherwise there
wouldn't be a need to implement different PHY settings, or a utility in
the controllers' BIOS to let users change them.

The documentation speaks of "different SATA channels" and claims that
improvements have been made to the PHY settings, apparently hiding
what's actually going on.

Anyway, server uptime is 3 days, 9 hours now.  That's a great
improvement :)

So, for what it's worth: for WD20EARS on a ServeRaid 8k, try different
PHY settings.  PHY 2 seems to work much better than 0, 1 and 5.

> True, but, SATA drives don't always work when used with port
> multipliers, which, from the above, I think you are actually using.

Hm, I doubt it.  The drive slots are numbered 0--5, and I can set a PHY
setting for each drive individually.  Would I be able to do that if a
PMP was used?  And can a single port keep up with 6 SAS drives?
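The bandwidth question can be answered with back-of-the-envelope
numbers (mine, not from the thread): a single 3 Gbit/s link carries
roughly 300 MB/s after 8b/10b encoding, while six spinning disks at a
ballpark ~100 MB/s sequential each would want about twice that.

```shell
# Rough arithmetic: can one 3 Gbit/s link feed six disks?
# (The per-disk figure is an assumed ballpark for 2 TB spinning disks.)
link_mbps=300       # ~3 Gbit/s minus 8b/10b encoding overhead
per_disk_mbps=100   # assumed sequential throughput per disk
disks=6

echo "demand: $((per_disk_mbps * disks)) MB/s vs link: $link_mbps MB/s"
```

So a single port behind a multiplier would be oversubscribed roughly
2:1 under sequential load --- another reason to doubt a PMP is in play.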

I'd have to take it all apart to see how this backplane is made.  ---
You may think I'm silly, but I really marvelled at the drive caddies.
They are anything but simple and *not* easy to manufacture.

>> But then, it seems that a SATA link goes down, or can go down, when a
>> disk saves power.  So you might be right: disk goes to sleep, controller
>> cannot re-establish the link because of PHY settings, and then things hang.
>
> Yep, it all depends on what is happening; without proper error logs
> and reproducible crashes, it will be difficult to determine exactly
> what is happening.

Yes --- I have two PHY settings left I can try if I have to.  If that
doesn't help, I can look into disabling power saving.
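If it comes to that, something along these lines is what I'd try --- a
sketch only, with /dev/sda as a placeholder and the commands echoed
rather than run (drop the echo and run as root to actually apply them).
WD's Green drives also have their own idle3 head-parking timer, which
the idle3-tools package can reportedly adjust.

```shell
# Hedged sketch for disabling drive power saving; DISK is a placeholder.
# The hdparm calls are echoed so this is harmless to paste as-is.
DISK=/dev/sda

echo hdparm -B 255 "$DISK"   # APM off: no head parking / low-power states
echo hdparm -S 0 "$DISK"     # standby (spin-down) timer off
```

Whether the controller honours these through its own firmware is another
question, of course.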


-- 
Knowledge is volatile and fluid.  Software is power.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

