
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance

  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Mon, 30 Jun 2014 11:06:46 +0100
  • Delivery-date: Mon, 30 Jun 2014 10:07:16 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 06/29/2014 08:12 AM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/28/2014 08:45 AM, lee wrote:

The hardware RAID controller gives me 10fps more with my favourite game
I'm playing, compared to software raid.  Since fps rates can be rather
low (because I'm CPU limited), that means a significant difference.

If your game is grinding onto disk I/O during play all is lost
anyway. If your CPU and RAM are _that_ constrained, there is probably
a better way to spend whatever you might pay for a new caching RAID
controller these days.

Only I didn't buy the controller new, and I bought it to have a decent
amount of ports.

Fair enough re: the number of ports, but second-hand CPUs don't go for that much on eBay either, if you are concerned about the CPU hit. The main difference is that when you aren't doing disk I/O, you still get to use the extra CPU, whereas the RAID card just sits idle. Extra CPU is more generic and flexible.

It's not disk I/O or a lack of RAM that limits the fps rates, it's
actually the CPU (or the whole combination of CPU, board and RAM) not
being able to feed the graphics card fast enough --- or the graphics
card being too fast for the CPU, if you want to see it that way.  To get
a significantly faster system, I'd have to spend ten times or more than
what I paid for the controller.

Fair enough - you must have got the controller _really_ cheap. Expect to spend a small fortune on a battery replacement when it fails, though; they typically only last a couple of years.

The CPU alone would cost more.  I
didn't expect any change in fps rates and got the improvement as a
surprising side effect.

That will only apply when you are doing disk I/O at the same time, surely. If you aren't doing disk I/O, then your CPU isn't doing checksumming.

I don't know about ZFS, though, never used that.  How much CPU overhead
is involved with that?  I don't need any more CPU overhead like what
comes with software RAID.

If you are that CPU constrained, tuning the storage is the wrong thing
to be looking at.

What would you tune without buying a new CPU, board and RAM, and without
running into the same problem of too few SATA ports?

As I said above - clearly you got the RAID controller very cheap. Within that cost envelope it may well have been a reasonable solution - but that doesn't take away the points about data safety and disk feature requirements for use with hardware RAID (TLER).

The reason TLER is important is that it allows you to limit the time the disk will spend trying to recover an unreadable sector. When a sector goes bad and the data read doesn't match the sector ECC, the disk will try reading it over and over until it gets a read that is recoverable. A disk without TLER enabled might go away for a very long time, sometimes a minute or two. Disks with deliberately crippled firmware will even refuse to respond to resets while attempting sector recovery, purely to stop you from working around the lack of TLER support.

When this happens, one of two things will occur:
1) The controller will kick out the disk and carry on without it, thus losing the redundancy.

2) The controller might be clever enough to not kick out the disk that doesn't support TLER (theoretical - you are at the mercy of the closed-spec RAID card firmware that may or may not do something sensible), but the only other thing it can do is wait for the disk to return. But until the disk returns, the card will block further I/O.

So in reality you have a choice between losing redundancy on the first pending sector you encounter, and the machine becoming unresponsive for prolonged periods whenever a pending sector is hit.

With software RAID you can at least choose between the two (by setting the disk command timeout appropriately) and manage the situation, e.g. by setting up a process that monitors for disks that have been kicked out of the array and automatically re-adds them (if they become responsive again) to restore redundancy.
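For illustration, the tuning described above can be sketched roughly as follows. This assumes the standard Linux sysfs layout for SCSI disks and md arrays; the helper names are mine, made up for the sketch, not from any existing tool:

```python
# Sketch of the software-RAID timeout tuning described above.
# Assumptions: Linux sysfs layout; function names are hypothetical.
from pathlib import Path

def command_timeout_path(disk: str) -> Path:
    """sysfs file holding the SCSI command timeout (in seconds) for a disk."""
    return Path("/sys/block") / disk / "device" / "timeout"

def array_degraded(sysfs_value: str) -> bool:
    """Interpret /sys/block/md0/md/degraded: '1' means a member was kicked out."""
    return sysfs_value.strip() == "1"

# On a real system (as root), a monitor process would lower the timeout so a
# stalled non-TLER disk is failed quickly instead of blocking all I/O, e.g.:
#   command_timeout_path("sda").write_text("7\n")
# and, once the disk responds again, re-add it with:
#   mdadm /dev/md0 --re-add /dev/sda
print(command_timeout_path("sda"))   # /sys/block/sda/device/timeout
print(array_degraded("1\n"))         # True
```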

expensive ones.  Perhaps the lack of ports is not so much of a problem
with the available disk capacities nowadays; however, it is what
made me
get a hardware raid controller.

Hardware RAID is, IMO, far too much of a liability with
modern disks. Latent sector errors happen a lot more
often than most people realize, and there are error
situations that hardware RAID cannot meaningfully handle.

So far, it works very well here.  Do you think that software RAID can
handle errors better?

Possibly in some cases.

Cases like?  ZFS, as you described it, might.

ZFS is one example. I listed another example above.

And where do you find a mainboard that has like
12 SAS/SATA ports?

I use a Marvell 88SX7042 4-port card with two SIL3726 SATA port
multipliers on it. This works very well for me and provides more
bandwidth than my 12 disks can consume in a realistic usage pattern.

Do you mean a card like this one:

Yes, although €80 is more than I paid for mine. IIRC mine were around £40.

This card alone costs almost as much as I paid for the RAID controller.

You got the RAID controller _REALLY_ cheap.

How come you use such a card?  Couldn't you use the on-board SATA ports
and connect a multiplier to them?

Most south-bridge based SATA controllers don't support FIS-based switching, and performance takes a massive nosedive when you lose command interleaving between the disks. Some motherboards come with secondary SATA controllers that behave much better, but those often have other problems. For example, two of the motherboards I have used recently have the following secondary SATA controllers for extra ports:

Silicon Image SIL3132, 2 ports: Has FIS+NCQ, but the card bottlenecks at about 170MB/s, which is very low.

Marvell 88SE9123, 2 ports: Has FIS+NCQ, but it has a PMP handling bug that prevents the use of more than one PMP. You can use a PMP on one port with 5 disks attached to the PMP, plus one disk attached directly to the other port. If you attach a PMP to each port, no disks show up at all.

I needed more than one PMP's worth of extra disks, so I used a card with the older Marvell chipset that works correctly.

In contrast, I have three SAS RAID cards, two LSI and one Adaptec,
none of which work at all on my motherboard with the IOMMU enabled.

Hmmm, how come, and what are the symptoms?  Perhaps I should try to
force NUMA to be enabled for the server.

The symptoms are that all the disk commands get truncated and none of the disks show up. DMA to/from the controller doesn't work. I'm pretty sure it has nothing to do with NUMA, though - most likely it's a side effect of the Nvidia NF200 PCIe bridges on my motherboard.

It seems that things are getting more and more complicated --- even
though they don't need to --- and that people are getting more and more
clueless.  More bugs might be a side effect of that, and things aren't
done as thoroughly as they used to be.

Indeed. The chances of getting a filed Fedora bug fixed, or even
acknowledged, before Fedora's 6-month EOL bug zapper closes it for
you are vanishingly small, in my experience.

Yes, it's ridiculous.  I find it really stupid to close a bug because
some amount of time has passed, rather than because the bug was looked
into and fixed, or at least checked to see whether it still exists.
That's no way to handle bug reports.  People will simply stop filing
them because that's the best they can do.

As far as I can tell that already happened some years ago.

I find that on my motherboard most RAID controllers don't work
at all with the IOMMU enabled. Something about the transparent
bridges used to connect native PCI-X RAID ASICs to PCIe makes
things not work.

Perhaps that's a problem of your board, not of the controllers.

It may well be, but it does show that the idea that a SAS RAID
controller with many ports is a better solution does not universally hold.

I never said it would :)  I was looking at what's available to increase
the number of disks I could connect, and I found you can get relatively
cheap cards with only two ports which may work or not.  More expensive
cards would have four ports and might work or not, and cards with more
than four ports were mostly RAID controllers.  For the 4/4+ cards, the
prices were higher than what I could get the fully featured SAS/SATA
RAID controller with 8 internal ports for, so I got that one --- and
it's been working flawlessly for two years or so now.  Only the server
has problems ...

The battery packs on the memory modules do expire eventually, and if you force write-caching on without the BBU, you will probably thoroughly destroy the file system the first time you have a power cut in the middle of a heavy write operation.

If that happens you would probably be better off putting the controller into a plain HBA/JBOD mode, but that would mean rebuilding the RAID array.

Cheap SAS cards, OTOH, work just fine, and at a fraction of
the cost.

And they provide only a fraction of the ports and features.

When I said SAS above I meant SATA. And PMPs help.

Well, which ones do work?  I didn't find anything to that when I looked
and didn't come across multipliers.

As I said, I use Marvell 88SX7042 based cards (IIRC the StarTech branded models, but I may have more than one brand - the cards all look and function pretty much identically) with SIL3726 based port multipliers (IIRC my multipliers are made by Lycom, model ST-126, but there are other similar models).

The combination of SATA card and PMPs supports FIS and NCQ, which means
that the SATA controller's bandwidth per port is used very efficiently.

Is that a good thing?

Very. The alternative would be to wait for each command to complete, which would massively cripple the disk I/O throughput.

I have a theory that when you have a software
RAID-5 with three disks and another RAID-1 with two disks, you have to
move so much data around that it plugs up the system, causing slowdowns.

Yes, you have to send twice the data down the PCIe bus, but in practice this creates a negligible overhead unless your machine is massively short of CPU and PCIe bandwidth.

My 7 year old Core2 machine manages about 8GB/s on memory I/O (limited by the MCH, not the CPU). Let's say a modern SATA disk might (optimistically) manage 200MB/s on linear transfers. So you are talking about 200MB/s of extra I/O overhead that your CPU might have to handle. That is 1/40 of what the CPU can manage, or about 2.5%, and that is assuming you are completely limited by disk I/O (in which case most of the other 97.5% of your CPU will probably be idle anyway).

A more recent machine with the MCH integrated on the CPU die will manage several times more memory bandwidth than my old Core2 that I used as an example.
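The back-of-the-envelope arithmetic above can be checked in a couple of lines; the 8GB/s and 200MB/s figures are the ones quoted in the post:

```python
# Overhead of mirroring writes in software RAID, per the figures above.
mem_bw_mb_s = 8000   # memory bandwidth of the old Core2 MCH, in MB/s
disk_bw_mb_s = 200   # optimistic linear transfer rate of one SATA disk, MB/s

# Extra data the CPU pushes for the mirror copy, as a share of memory bandwidth
overhead = disk_bw_mb_s / mem_bw_mb_s
print(f"{overhead:.1%}")  # 2.5%
```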

Even a software RAID-1 with two disks can create slowdowns, depending on
what data transfer rates the disks can sustain.  I do not have such
slowdowns when the disks are on the hardware RAID controller.
Perhaps it's a problem with the particular board I have and/or the CPU
being too slow to be able to deal with the overhead in such a way that
it's not noticeable, and perhaps it's much better to fill the bandwidth
of a single SATA port rather than using some of the bandwidth of five
SATA ports.

If the total amount of I/O consumed is the same it shouldn't make much difference.

Or perhaps filling the bandwidth of one SATA port plus the
CPU handling the overhead ZFS brings about isn't any better, who knows.

More saturation of a SATA port isn't what gives you more performance - the fact that you are managing to saturate the SATA port more is the _consequence_ of getting more performance out of the disk subsystem.

Anyway, I have come to like hardware RAID better than software RAID.

Whatever works for you. My view is that traditional RAID, certainly
anything below RAID6,

Well, you have to afford all the disks for such RAID levels.  Is ZFS any
better in this regard?

It doesn't make the disks cheaper, no. :)

and even on RAID6 I don't trust the closed, opaque, undocumented
implementation that might be in the firmware, is

It's a big disadvantage of hardware RAID that you can't read the data
when the controller has failed, unless you have another, compatible
controller at hand.  Did you check the sources of ZFS so that you can
trust it?

I have looked at the ZoL and zfs-fuse sources in passing when looking at various patches I wanted to apply (and writing a small one of my own for zfs-fuse, to always force 4KB ashift), but I wouldn't say that I have looked through the source enough to trust it based purely on my reading of the code.

But I suspect that orders of magnitude more people have looked at the ZFS source than have looked at the RAID controller firmware source.

For a typical example of the sort of errors I'm talking about, say you have a hardware RAID5 array and a ZFS RAIDZ1 pool.

Take one disk off each controller, write some random data over parts of it (let's be kind, don't overwrite the RAID card's headers, which a phantom write could theoretically do).

Now put the disks back into their pools, and read the files whose data you just overwrote.

ZFS will spot the error, restore the data and hand you a good copy back.

The RAID controller will most likely give you back duff data without even noticing that something is wrong with it.

You could run ZFS on top of the RAID logical volume, but because ZFS would have no visibility of the raw disks and redundancy underneath, there is nothing it can do to scrub out the bad data. But even in that stack, ZFS would still at least notice the data is corrupted and refuse to return it to the application.
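A toy sketch of why a checksumming filesystem catches this while plain RAID does not: the checksum of each block is stored in its parent, so corrupted data fails verification on read. This is purely conceptual, not ZFS's actual on-disk format or code:

```python
# Conceptual sketch only: block checksums stored out-of-band, ZFS-style.
import hashlib

def write_block(data: bytes):
    """Store the data along with a checksum kept in the parent block."""
    return data, hashlib.sha256(data).digest()

def read_block(data: bytes, checksum: bytes) -> bytes:
    """Refuse to return data whose checksum does not match."""
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: corruption detected")
    return data

data, csum = write_block(b"payload")
assert read_block(data, csum) == b"payload"   # a clean read passes

# A plain RAID controller would happily return the corrupted bytes;
# the checksum layer detects the mismatch instead:
try:
    read_block(b"paXload", csum)
except IOError as err:
    print(err)  # checksum mismatch: corruption detected
```

With redundancy underneath (as in RAIDZ), the filesystem can go further and rewrite the bad copy from a good one; without it, detection is the best it can do.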

no longer fit for purpose with disks of the kind of size that ship today.

How would that depend on the capacity of the disks?  More data --> more
potential for errors --> more security required?

The unrecoverable error rates of disks have been stagnant at around one sector per 10^14 bits read - roughly one bad sector per 10TB or so of reads. If you are using 4TB drives in a 3-disk RAID5 and you lose a disk, you have to read back 8TB of data to rebuild the parity onto the replacement disk. If you statistically get one bad block every 10TB of reads, that means roughly an 80% chance of losing some data while rebuilding the array.
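As a rough check of that figure, using the spec rate of one error per 10^14 bits and an 8TB rebuild read:

```python
# Rebuild-failure odds for a 3x4TB RAID5, per the figures above.
import math

ure_per_bit = 1e-14       # typical consumer-disk spec: 1 URE per 10^14 bits
rebuild_bits = 8e12 * 8   # 8 TB read back during the rebuild

expected_errors = rebuild_bits * ure_per_bit                      # naive expectation
p_at_least_one = 1 - math.exp(rebuild_bits * math.log1p(-ure_per_bit))

print(round(expected_errors, 2))  # 0.64
print(round(p_at_least_one, 2))   # 0.47
```

The 80% in the post comes from the rounder "one error per 10TB" approximation (8TB/10TB); with the exact 10^14-bit spec the expected error count is 0.64 and the chance of at least one error about 47% - still alarmingly high for a single rebuild.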

Additionally, rebuilding times have been going up with disk size which increases both the time of degraded performance during rebuild and the probability of failure of another disk during the rebuild.

So with VMware, you'd have to get certified hardware.

You wouldn't _have_ to get certified hardware. It just means that if
you find that there is a total of one motherboard that fits your
requirements and it's not on the certified list, you can plausibly
take your chances with it even if it doesn't work out of the box. I
did that with the SR-2 and got it working eventually in a way that
would never have been possible with ESX.

Are you saying that for your requirements you couldn't use VMware, which
makes it irrelevant whether the hardware is certified for it or not?

Both. What I wanted to do couldn't be done on ESX within the same resource constraints. I wanted to use a deduplicated ZFS pool for my VM images, and there is no ESX port of ZFS.

Additionally, there would have been no way for me to work around my hardware bugs with ESX because I couldn't write patches to work around the problem.

After all, I'm not convinced that virtualization as it's done with xen
and the like is the right way to go.

I am not a fan of virtualization for most workloads, but sometimes
it is convenient, not least in order to work around deficiencies of
other OS-es you might want to run. For example, I don't want to
maintain 3 separate systems - partitioning up one big system is
much more convenient. And I can run Windows gaming VMs while
still having the advantages of easy full system rollbacks by
having my domU disks backed by ZFS volumes. It's not for HPC
workloads, but for some things it is the least unsuitable solution.

Not even for most?  It seems as if everyone is using it quite a lot,
make it sense or not.

Most people haven't realized yet that the king's clothes are not
suitable for every occasion, so to speak. In terms of the hype cycle,
different users are at different stages. Many are still around the
point of "peak of inflated expectations". Those that do the testing
for their particular high performance workloads they were hoping to
virtualize hit the "trough of disillusionment" pretty quickly most of
the time.

Why would they think that virtualization benefits things that require
high performance?

I don't. But many people think it makes negligible difference because they never did their testing properly.

When I need the most/best performance possible, it's
obviously counter productive.

Apparently it's not obvious to many, many people.

Xen-users mailing list


