
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Wed, 02 Jul 2014 08:25:45 +0100
  • Delivery-date: Wed, 02 Jul 2014 07:26:30 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 07/01/2014 08:23 PM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/29/2014 08:12 AM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/28/2014 08:45 AM, lee wrote:

The hardware RAID controller gives me 10fps more with my favourite game
I'm playing, compared to software raid.  Since fps rates can be rather
low (because I'm CPU limited), that means a significant difference.

If your game is grinding on disk I/O during play, all is lost
anyway. If your CPU and RAM are _that_ constrained, there is probably
a better way to spend whatever you might pay for a new caching RAID
controller these days.

Only I didn't buy the controller new, and I bought it to have a decent
amount of ports.

Fair re: number of ports, but 2nd hand CPUs don't go for that much on
ebay, either, if you are concerned about the CPU hit.

What would you put into an AM3 socket which is so much faster than a
Phenom 965 to be worthwhile and doesn't have an enormous power
consumption?

I'd have to switch to Intel --- and that would bring about UEFI which,
besides the tremendous hassle, comes with serious security risks.  And
after reading [1], what could you still buy?


[1]: https://www.fsf.org/blogs/community/active-management-technology

UEFI comes with extra security risks? Do tell. If anything it is more secure, because it requires all firmware to be cryptographically signed, so nothing can compromise the firmware without being detected.

It's not disk I/O or a lack of RAM that limits the fps rates, it's
actually the CPU (or the whole combination of CPU, board and RAM) not
being able to feed the graphics card fast enough --- or the graphics
card being too fast for the CPU, if you want to see it that way.  To get
a significantly faster system, I'd have to spend ten times or more than
what I paid for the controller.

Fair enough - you must have got the controller _really_ cheap. Expect
to spend a small fortune on a battery replacement when that fails,
though. They typically only last a couple of years.

Yes, it was a good deal --- you can still get them off ebay for about
that price.  If the BBU fails, it might be advisable to buy another
controller with BBU because it doesn't cost much more than the BBU, and
I'd have a spare controller.  Or I could retire the controller and
perhaps use ZFS or mdraid or, since I have the server now, just get a
pair of SSDs.

My intention is to reduce the number of disks anyway.  The fewer disks
you have, the fewer can fail.

Or you could use them for extra redundancy.

The CPU alone would cost more.  I
didn't expect any change in fps rates and got the improvement as a
surprising side effect.

That will only apply when you are doing disk I/O at the same time,
surely. If you aren't doing disk I/O, then your CPU isn't doing
checksumming.

Well, I was really surprised about the improvement in fps and have a
hard time believing it myself.  There's no good explanation for it
other than the theory that the system somehow gets plugged up when having
to deal with so many disks in software raids, and that would seem to
only apply when so much disk I/O is going on.  "So much" would probably
mean "only very little".

Possibly a reduced number of interrupts hitting the CPU.

As I said above - clearly you got the RAID controller very
cheap. Within that cost envelope it may well have been a reasonable
solution - but that doesn't take away the points about data safety and
disk feature requirements for use with hardware RAID (TLER).

True --- luckily, the WD20EARS hold up remarkably well.  One out of
three failed, after about three years, probably because it would take
too long to recover an error.  I replaced it and haven't checked it any
further yet; it might still be usable.

Isn't that the green model that spins down so aggressively that it causes premature spindle and actuator failure? I have a duff one on the shelf that randomly loses different sectors on every scan.

But because WD firmwares lie about reallocations (or worse, it just keeps reusing bad sectors as long as it thinks the data has stuck), it never actually reads as failed in SMART. I have observed this on WD disks going back several generations, which is why I avoid them wherever possible. Samsungs seem to do something similar. The only "honest" disks seem to be HGST and Seagate, but experience with Seagate over the last 6 years shows them to be far too unreliable for my liking. As to whether I prefer honest but unreliable vs. dishonest but more reliable disks - it's a tough call, but luckily I don't have to make that decision at the moment.
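
If you want to keep an eye on it yourself, something as crude as the
sketch below is enough - it just shells out to smartctl and pulls the
reallocated/pending counters. This assumes smartmontools is installed;
attribute IDs 5/197/198 are the common ones, but as noted above you
cannot entirely trust what the firmware chooses to report there.

#!/usr/bin/env python3
# Rough sketch: pull reallocated/pending sector counts via smartctl.
# Treat the numbers as a hint, not gospel - vendor firmwares differ in
# what they actually expose in these attributes.
import subprocess
import sys

WATCHED = {"5": "Reallocated_Sector_Ct",
           "197": "Current_Pending_Sector",
           "198": "Offline_Uncorrectable"}

def smart_counters(dev):
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True, check=False).stdout
    counters = {}
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] in WATCHED:
            # RAW_VALUE is the last column of smartctl's attribute table
            counters[WATCHED[fields[0]]] = fields[-1]
    return counters

if __name__ == "__main__":
    for dev in sys.argv[1:] or ["/dev/sda"]:
        print(dev, smart_counters(dev))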

Disks with crippled firmware will deliberately not respond to resets
while trying to do sector recovery, again, purely to prevent working
around the fact the disk doesn't report TLER.

When this happens, one of two things will occur:
1) The controller will kick out the disk and carry on without it, thus
losing the redundancy.

That's probably what happened with the failed disk.

2) The controller might be clever enough to not kick out the disk that
doesn't support TLER (theoretical - you are at the mercy of the
closed-spec RAID card firmware that may or may not do something
sensible), but the only other thing it can do is wait for the disk to
return. But until the disk returns, the card will block further I/O.

So in reality you have a choice between losing redundancy on the first
pending sector you encounter and the machine becoming unresponsive for
prolonged periods whenever you encounter a pending sector.

With software RAID you can at least choose between the two (by
selecting the disk command timeout appropriately), and managing the
situation (e.g. setting up a process to monitor for disks that have
been kicked out of the array and automatically re-adding them (if they
become responsive again) to restore the redundancy).
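
Something along these lines is all that "managing the situation" really
amounts to (an untested sketch - the 180 second timeout, the poll
interval and the device paths are illustrative, tune them for your
drives):

#!/usr/bin/env python3
# Sketch of the software-RAID approach described above:
#  - raise the kernel's per-device command timeout above the drive's
#    worst-case internal error recovery time, so the link does not get
#    reset mid-recovery
#  - watch /proc/mdstat for members md has marked faulty and try a
#    remove/re-add once the drive answers again
import glob
import re
import subprocess
import time

CMD_TIMEOUT_SECS = "180"   # longer than the drive's worst-case recovery
POLL_SECS = 60

def raise_timeouts():
    for path in glob.glob("/sys/block/sd*/device/timeout"):
        with open(path, "w") as f:
            f.write(CMD_TIMEOUT_SECS)

def faulty_members():
    """Yield (md_device, member) pairs marked failed, e.g. sdb1[1](F)."""
    with open("/proc/mdstat") as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if not m:
                continue
            md = "/dev/" + m.group(1)
            for member in re.findall(r"(\w+)\[\d+\]\(F\)", line):
                yield md, "/dev/" + member

def try_readd(md, member):
    subprocess.run(["mdadm", md, "--remove", member], check=False)
    subprocess.run(["mdadm", md, "--re-add", member], check=False)

if __name__ == "__main__":
    raise_timeouts()
    while True:
        for md, member in faulty_members():
            try_readd(md, member)
        time.sleep(POLL_SECS)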

Is it a good idea to continue using a disk that spends quite a while
on error recovery?

Sector failures shouldn't be that common. If the disk keeps losing sectors that regularly it will eventually run out of spare sectors (unless it's a WD), and SMART will eventually just report the disk as failed. And you will probably notice the errors in the log (unless you use a RAID card that has no logging facilities) and the degraded performance long before the disk deems itself to have failed.

And have you seen what happens when a SATA disk becomes unresponsive?
The kernel will, apparently indefinitely, keep trying to reset the SATA
link and flood you with messages about it to the point where your only
choice is to try to shut the system down.

Not indefinitely - it will try to reset it a few times, and if that fails it will just stop trying to talk to it. I had a Seagate become unresponsive on one of my servers yesterday, and ZFS dealt with it by kicking out the disk from the pool without any external intervention.

Hot unplug-replug of the disk woke it up. I probably should replace that disk, but I have enough redundancy and mirrored copies of that server to not have to worry about that sort of thing too much.

I had that happen a while ago with a very old disk.  I wonder why the
kernel behaves that badly.  The disk that failed wasn't relevant at all
for keeping the system running.  I'd rather have the kernel try maybe
ten times and then give up, while leaving you the option to make it
try again.

I wonder what happens when that happens with a disk on a PMP.  Will the
whole set on the PMP become inaccessible because the port gets blocked?

At least my PMPs are reasonably well behaved. You lose a disk, but you don't lose the whole set.

How come you use such a card?  Couldn't you use the on-board SATA ports
and connect a multiplier to them?

Most south-bridge based SATA controllers don't support FIS, and

My understanding of [2] is that the SATA protocol doesn't work at all
without FIS.  All SATA controllers would have to support it then.


[2]: https://en.wikipedia.org/wiki/Frame_Information_Structure#FIS

FIS _switching_ definitely doesn't work on all SATA controllers. Intel ICH doesn't support it, and many others don't, either.

performance takes a massive nosedive when you lose command
interleaving between the disks.

Huh?

You mean if a SATA controller doesn't support FIS, it would leave you
unable to use all but one disk behind a PMP connected to it because
switching between the disks won't be possible?

No, I mean it will issue a command and have to wait for it to complete. Then switch to a different disk, issue a command and wait for it. As opposed to issuing a command, switching to a different disk and issuing a command, then waiting for both of them to return.
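
A back-of-the-envelope model of what that costs (the 8ms figure and the
disk/request counts are purely illustrative):

# Why losing FIS-based switching hurts: with command-based switching
# the PMP services one outstanding command at a time, so per-disk
# latencies add up; with FBS the commands overlap across disks.
disks = 5
latency_ms = 8.0          # average seek + rotation per random read
reads_per_disk = 100

serialized = disks * reads_per_disk * latency_ms   # command-based switching
overlapped = reads_per_disk * latency_ms           # FIS-based switching (ideal)
print(f"command-based: {serialized:.0f} ms, FBS (ideal): {overlapped:.0f} ms")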

I have a theory that when you have a software
RAID-5 with three disks and another RAID-1 with two disks, you have to
move so much data around that it plugs up the system, causing slowdowns.

Yes, you have to send twice the data down the PCIe bus, but in
practice this creates a negligible overhead unless your machine is
massively short of CPU and PCIe bandwidth.

With five disks in total in different arrays you'd have to send five
times the data, wouldn't you?

In RAID5? No, you only send the redundant data extra. Mirroring is the worst case, and even that only doubles the I/O.
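
To put rough numbers on it for a full-stripe write (this ignores the
read-modify-write of partial stripes, which makes small RAID5 writes
worse in practice):

# Bus traffic for a full-stripe write, relative to the payload size.
def raid5_write_factor(n_disks):
    data_chunks = n_disks - 1
    return (data_chunks + 1) / data_chunks   # payload plus one parity chunk

def raid1_write_factor(n_copies=2):
    return float(n_copies)                   # every byte goes to each mirror

print(raid5_write_factor(3))   # 1.5x for a 3-disk RAID5
print(raid1_write_factor())    # 2.0x for a 2-way mirror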

and even on RAID6 I don't trust the closed, opaque, undocumented
implementation that might be in the firmware, is

It's a big disadvantage of hardware RAID that you can't read the data
when the controller has failed, unless you have another, compatible
controller at hand.  Did you check the sources of ZFS so that you can
trust it?

I have looked at the ZoL and zfs-fuse sources in passing when looking
at various patches I wanted to apply (and writing a small one of my
own for zfs-fuse, to always force 4KB ashift), but I wouldn't say that
I have looked through the source enough to trust it based purely on my
reading of the code.

But I suspect that orders of magnitude more people have looked at the
ZFS source than have looked at the RAID controller firmware source.

It's entirely possible, that, in total, 5000 people have looked at the
ZFS sources, and that 4995 of them have looked at the same 5% of it,
with the exception of the 5 ZFS developers (or how many there are) who
have read it all.  It's a bitchy argument --- but it's possible :)

Which is still probably more than the number of people that looked at the firmware source. :)

For a typical example of the sort of errors I'm talking about, say you
have a hardware RAID5 array and a ZFS RAIDZ1 pool.

Take one disk off each controller, write some random data over parts
of it (let's be kind, don't overwrite the RAID card's headers, which a
phantom write could theoretically do).

Now put the disks back into their pools, and read the files whose data
you just overwrote.

ZFS will spot the error, restore the data and hand you a good copy back.

RAID controller will most likely give you back duff data without even
noticing something is wrong with it.
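
You can reproduce the ZFS half of that experiment without sacrificing
real disks, on file-backed vdevs (scratch machine only - the pool name
and paths are just examples, and since the scrub runs in the background
you may need to re-run zpool status until it finishes):

#!/usr/bin/env python3
# Corrupt one vdev of a raidz1 behind ZFS's back, then scrub and watch
# the CKSUM column / "repaired" counters in zpool status.
import subprocess

VDEVS = ["/var/tmp/vdev0", "/var/tmp/vdev1", "/var/tmp/vdev2"]
POOL = "testpool"

def sh(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# 1. Build a raidz1 pool out of three 256MB sparse files.
for v in VDEVS:
    with open(v, "wb") as f:
        f.truncate(256 * 1024 * 1024)
sh("zpool", "create", "-f", POOL, "raidz1", *VDEVS)

# 2. Put some data on it.
sh("cp", "-a", "/usr/share/doc", "/" + POOL + "/")

# 3. Corrupt one vdev behind ZFS's back (export first; the offset skips
#    the front labels so the pool still imports cleanly).
sh("zpool", "export", POOL)
sh("dd", "if=/dev/urandom", "of=" + VDEVS[1], "bs=1M", "seek=32",
   "count=8", "conv=notrunc")
sh("zpool", "import", "-d", "/var/tmp", POOL)

# 4. Scrub and inspect the damage ZFS found and repaired.
sh("zpool", "scrub", POOL)
sh("zpool", "status", "-v", POOL)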

It would start a rebuild once you plug the disk back in --- provided
that you hotplugged it or had the array up while the disk was removed.
Otherwise, we'd have to try out what would happen, which can be
different from controller to controller.

I was talking about swapping disks while the machine was shut down. The RAID controller would never know the disk was even unplugged. The aim was to simulate silent data corruption.

You could run ZFS on top of the RAID logical volume, but because ZFS
would have no visibility of the raw disks and redundancy underneath,
there is nothing it can do to scrub out the bad data. But even in that
stack, ZFS would still at least notice the data is corrupted and
refuse to return it to the application.

Only if you do not happen to write correct checksums for the data that
was overwritten --- though it seems unlikely enough to happen.

That, too, would get detected because the checksum wouldn't match the data.
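
The reason it is so unlikely is that ZFS does not store a block's
checksum next to the block itself - it lives in the parent block
pointer, which is in turn checksummed by its parent, all the way up to
the uberblock. To plant "correct checksums" for corrupted data you
would have to rewrite that whole chain. A toy illustration of the idea
(nothing like the real on-disk format):

# Merkle-style check: each parent stores the hashes of its children,
# so corrupting a leaf - even together with its stored hash - is still
# caught one level up, and so on to the root.
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

leaves = [b"block A", b"block B"]
parent = [h(leaf) for leaf in leaves]    # parent stores child checksums
root = h(b"".join(parent))               # ...and is itself checksummed

leaves[0] = b"forged A"                  # overwrite a leaf
print(h(leaves[0]) == parent[0])         # False -> corruption detected

parent[0] = h(leaves[0])                 # forge the parent entry too
print(h(b"".join(parent)) == root)       # False -> still detected, one level up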

What does ZFS do when it finds block A and block B of data D, both
blocks with a correct checksum but containing different data?  Will it
let you decide which block to use, or refuse to deliver the data, or not
notice and return some data instead of D which is composed of whatever
is in the blocks A and B?

I'm not sure. You would have to really hand craft that kind of a data corruption. A scrub would find the discrepancy, but I'm not sure what it would do about it. One block would probably win since both are "valid". What would a hardware RAID controller do? :)

no longer fit for purpose with disks of the kind of size that ship
today.

How would that depend on the capacity of the disks?  More data --> more
potential for errors --> more security required?

The data error rates have been stagnant at one unrecoverable sector per
10^14 bits read. That's roughly one bad sector per 10TB read. If you
are using 4TB drives in 3-disk RAID5, if you lose a disk you have to
read back 8TB of data to rebuild the parity onto the replaced disk. If
you are statistically going to get one bad block every 10TB of reads,
it means you have 80% chance of losing some data during rebuilding
that array.
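
For what it's worth, the 80% comes from the crude linear estimate
(8TB read / ~10TB per error); if you model the errors as independent
events it works out nearer 50%, which is still far too high for
comfort:

# Odds of hitting an unrecoverable read error (URE) during a rebuild,
# assuming the spec'd rate of 1 per 1e14 bits and 8TB to re-read.
import math

ure_per_bit = 1e-14
bits_to_read = 8e12 * 8                          # 8TB re-read during rebuild

expected_errors = bits_to_read * ure_per_bit     # ~0.64
p_at_least_one = 1 - math.exp(-expected_errors)  # ~47%, UREs independent
print(f"expected UREs: {expected_errors:.2f}, "
      f"P(>=1 during rebuild): {p_at_least_one:.0%}")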

Additionally, rebuilding times have been going up with disk size which
increases both the time of degraded performance during rebuild and the
probability of failure of another disk during the rebuild.

It makes me wonder why such errors haven't become a widespread serious
problem yet --- not everyone is using ZFS.

They have become a widespread serious problem. People just don't notice the data corruption often.

So with VMware, you'd have to get certified hardware.

You wouldn't _have_ to get certified hardware. It just means that if
you find that there is a total of one motherboard that fits your
requirements and it's not on the certified list, you can plausibly
take your chances with it even if it doesn't work out of the box. I
did that with the SR-2 and got it working eventually in a way that
would never have been possible with ESX.

Are you saying that for your requirements you couldn't use VMware, which
makes it irrelevant whether the hardware is certified for it or not?

Both. What I wanted to do couldn't be done on ESX within the same
resource constraints. I wanted to use a deduplicated ZFS pool for my
VM images, and there is no ESX port of ZFS.

Additionally, there would have been no way for me to work around my
hardware bugs with ESX because I couldn't write patches to work around
the problem.

The idea is that you wouldn't have had to work around hardware bugs
when you'd be using certified hardware.

Sure, but there was no certified hardware at the time that met all of my requirements. So I went with the best compromise I could: hardware with the features I needed and an open source hypervisor that I might have a remote chance of patching if I had to work around hardware bugs.

Why would they think that virtualization benefits things that require
high performance?

I don't. But many people think it makes negligible difference because
they never did their testing properly.

How much difference does it make, actually?  I mean just the
virtualization, without a number of other VMs doing something.  Like you
could plan on a particular hardware for a particular workload and get
performance X with it.  Let's say you plan 4 CPUs/8GB RAM and buy
hardware that has 2x4 CPUs/64GB RAM.  Now you somehow limit the software
to use only 4 CPUs/8GB of that hardware and measure the performance ---
perhaps physically take out one CPU and install only 8GB.

Then you plug in the second CPU and all the RAM, but use a VM, give it 4
CPUs/8GB and measure the performance again.  How much less performance
will you get?

The difference is quite substantial. Here are some tests I did a couple of years ago:
http://www.altechnative.net/2012/08/04/virtual-performance-part-1-vmware/

I did some testing (ESX) with MySQL 3 years ago for a customer and the results were approximately a 40% degradation at saturation point, clock for clock.

I also did testing (also ESX) with MySQL for a different client last week, as they are planning to move to "the cloud", and the performance drop, clock-for-clock, was about 35%.

It is definitely not negligible.





_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

