
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance

  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Mon, 30 Jun 2014 11:06:46 +0100
  • Delivery-date: Mon, 30 Jun 2014 10:07:16 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 06/29/2014 08:12 AM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/28/2014 08:45 AM, lee wrote:

The hardware RAID controller gives me 10fps more with my favourite game
I'm playing, compared to software raid.  Since fps rates can be rather
low (because I'm CPU limited), that means a significant difference.

If your game is grinding onto disk I/O during play all is lost
anyway. If your CPU and RAM are _that_ constrained, there is probably
a better way to spend whatever you might pay for a new caching RAID
controller these days.

Only I didn't buy the controller new, and I bought it to have a decent
amount of ports.

Fair enough re: the number of ports, but second-hand CPUs don't go for that much on eBay either, if you are concerned about the CPU hit. The main difference is that when you aren't doing disk I/O, you still get to use the extra CPU, whereas the RAID card just sits idle. Extra CPU is more generic and flexible.

It's not disk I/O or a lack of RAM that limits the fps rates, it's
actually the CPU (or the whole combination of CPU, board and RAM) not
being able to feed the graphics card fast enough --- or the graphics
card being too fast for the CPU, if you want to see it that way.  To get
a significantly faster system, I'd have to spend ten times or more than
what I paid for the controller.

Fair enough - you must have got the controller _really_ cheap. Expect to spend a small fortune on a battery replacement when it fails, though; they typically only last a couple of years.

The CPU alone would cost more.  I
didn't expect any change in fps rates and got the improvement as a
surprising side effect.

That will only apply when you are doing disk I/O at the same time, surely. If you aren't doing disk I/O, then your CPU isn't doing checksumming.

I don't know about ZFS, though, never used that.  How much CPU overhead
is involved with that?  I don't need any more CPU overhead like what
comes with software RAID.

If you are that CPU constrained, tuning the storage is the wrong thing
to be looking at.

What would you tune without buying a new CPU, board and RAM, and without
running into the same problem of too few SATA ports?

As I said above - clearly you got the RAID controller very cheap. Within that cost envelope it may well have been a reasonable solution - but that doesn't take away the points about data safety and disk feature requirements for use with hardware RAID (TLER).

The reason TLER is important is that it allows you to limit the time the disk will spend trying to recover an unreadable sector. When a sector goes bad and the data read doesn't match the sector ECC, the disk will try reading it over and over until it gets a read that is recoverable. A disk without TLER enabled might go away for a very long time, sometimes a minute or two. Disks with deliberately crippled firmware will even refuse to respond to resets while attempting sector recovery, purely to stop you from working around the lack of TLER support.

When this happens, one of two things will occur:
1) The controller will kick out the disk and carry on without it, thus losing the redundancy.

2) The controller might be clever enough to not kick out the disk that doesn't support TLER (theoretical - you are at the mercy of the closed-spec RAID card firmware that may or may not do something sensible), but the only other thing it can do is wait for the disk to return. But until the disk returns, the card will block further I/O.

So in reality you have a choice between losing redundancy on the first pending sector you encounter, and the machine becoming unresponsive for prolonged periods whenever a pending sector is hit.

With software RAID you can at least choose between the two (by setting the disk command timeout appropriately) and manage the situation, e.g. by setting up a process that monitors for disks that have been kicked out of the array and automatically re-adds them (if they become responsive again) to restore redundancy.
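For illustration, the tuning described above can be sketched roughly as follows. This assumes the standard Linux sysfs layout for SCSI disks and md arrays; the helper names are mine, made up for the sketch, not from any existing tool:

```python
# Sketch of the software-RAID timeout tuning described above.
# Assumptions: Linux sysfs layout; function names are hypothetical.
from pathlib import Path

def command_timeout_path(disk: str) -> Path:
    """sysfs file holding the SCSI command timeout (in seconds) for a disk."""
    return Path("/sys/block") / disk / "device" / "timeout"

def array_degraded(sysfs_value: str) -> bool:
    """Interpret /sys/block/md0/md/degraded: '1' means a member was kicked out."""
    return sysfs_value.strip() == "1"

# On a real system (as root), a monitor process would lower the timeout so a
# stalled non-TLER disk is failed quickly instead of blocking all I/O, e.g.:
#   command_timeout_path("sda").write_text("7\n")
# and, once the disk responds again, re-add it with:
#   mdadm /dev/md0 --re-add /dev/sda
print(command_timeout_path("sda"))   # /sys/block/sda/device/timeout
print(array_degraded("1\n"))         # True
```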

expensive ones.  Perhaps the lack of ports is not so much of a problem
with the available disk capacities nowadays; however, it is what
made me
get a hardware raid controller.

Hardware RAID is, IMO, far too much of a liability with
modern disks. Latent sector errors happen a lot more
often than most people realize, and there are error
situations that hardware RAID cannot meaningfully handle.

So far, it works very well here.  Do you think that software RAID can
handle errors better?

Possibly in some cases.

Cases like?  ZFS, as you described it, might.

ZFS is one example. I listed another example above.

And where do you find a mainboard that has like
12 SAS/SATA ports?

I use a Marvell 88SX7042 4-port card with two SIL3726 SATA port
multipliers on it. This works very well for me and provides more
bandwidth than my 12 disks can consume in a realistic usage pattern.

Do you mean a card like this one:

Yes, although €80 is more than I paid for mine. IIRC mine were around £40.

This card alone costs almost as much as I paid for the RAID controller.

You got the RAID controller _REALLY_ cheap.

How come you use such a card?  Couldn't you use the on-board SATA ports
and connect a multiplier to them?

Most south-bridge based SATA controllers don't support FIS-based switching, and performance takes a massive nosedive when you lose command interleaving between the disks. Some motherboards come with secondary SATA controllers that behave much better, but those often have other problems. For example, two of the motherboards I have used recently have the following secondary SATA controllers for extra ports:

Silicon Image SIL3132, 2 ports: Has FIS+NCQ, but the card bottlenecks at about 170MB/s, which is very low.

Marvell 88SE9123, 2 ports: Has FIS+NCQ, but it has a PMP handling bug that prevents the use of more than one PMP. You can use a PMP on one port with 5 disks attached to the PMP, plus one disk attached directly to the other port. If you attach a PMP to each port, no disks show up at all.

I needed more than one PMP's worth of extra disks, so I used a card with the older Marvell chipset that works correctly.

In contrast, I have three SAS RAID cards, two LSI and one Adaptec,
none of which work at all on my motherboard with the IOMMU enabled.

Hmmm, how come, and what are the symptoms?  Perhaps I should try to
force NUMA to be enabled for the server.

The symptoms are that all the disk commands get truncated and none of the disks show up. DMA to/from the controller doesn't work. I'm pretty sure it has nothing to do with NUMA, though - most likely it's a side effect of the Nvidia NF200 PCIe bridges on my motherboard.

It seems that things are getting more and more complicated --- even
though they don't need to --- and that people are getting more and more
clueless.  More bugs might be a side effect of that, and things aren't
done as thoroughly as they used to be.

Indeed. The chances of getting a filed Fedora bug fixed, or even
acknowledged, before Fedora's 6-month EOL bug zapper closes it for
you are vanishingly small, in my experience.

Yes, it's ridiculous.  I find it really stupid to close a bug because
some amount of time has passed, rather than because the bug was looked
into and fixed, or at least checked to see whether it still exists.
That's no way to handle bug reports.  People will simply stop filing
them because that's the best they can do.

As far as I can tell that already happened some years ago.

I find that on my motherboard most RAID controllers don't work
at all with the IOMMU enabled. Something about the transparent
bridges used to connect native PCI-X RAID ASICs to PCIe makes
things not work.

Perhaps that's a problem of your board, not of the controllers.

It may well be, but it does show that the idea that a SAS RAID
controller with many ports is a better solution does not universally hold.

I never said it would :)  I was looking at what's available to increase
the number of disks I could connect, and I found you can get relatively
cheap cards with only two ports which may work or not.  More expensive
cards would have four ports and might work or not, and cards with more
than four ports were mostly RAID controllers.  For the 4/4+ cards, the
prices were higher than what I could get the fully featured SAS/SATA
RAID controller with 8 internal ports for, so I got that one --- and
it's been working flawlessly for two years or so now.  Only the server
has problems ...

The battery packs on the memory modules do expire eventually, and if you force write-caching on without the BBU, you will probably thoroughly destroy the file system the first time you have a power cut in the middle of a heavy write operation.

If that happens you would probably be better off putting the controller into a plain HBA/JBOD mode, but that would mean rebuilding the RAID array.

Cheap SAS cards, OTOH, work just fine, and at a fraction of
the cost.

And they provide only a fraction of the ports and features.

When I said SAS above I meant SATA. And PMPs help.

Well, which ones do work?  I didn't find anything to that when I looked
and didn't come across multipliers.

As I said, I use Marvell 88SX7042 based cards (IIRC the StarTech branded models, but I may have more than one brand - the cards all look and function pretty much identically) with SIL3726 based port multipliers (IIRC my multipliers are made by Lycom, model ST-126, but there are other similar models).

The combination of SATA card and PMPs supports FIS and NCQ, which means
that the SATA controller's bandwidth per port is used very efficiently.

Is that a good thing?

Very. The alternative would be to wait for each command to complete, which would massively cripple the disk I/O throughput.

I have a theory that when you have a software
RAID-5 with three disks and another RAID-1 with two disks, you have to
move so much data around that it plugs up the system, causing slowdowns.

Yes, you have to send twice the data down the PCIe bus, but in practice this creates a negligible overhead unless your machine is massively short of CPU and PCIe bandwidth.

My 7 year old Core2 machine manages about 8GB/s on memory I/O (limited by the MCH, not the CPU). Let's say a modern SATA disk might (optimistically) manage 200MB/s on linear transfers. So you are talking about 200MB/s of extra I/O overhead that your CPU might have to handle. That is 1/40 of what the CPU can manage, or about 2.5%, and that is assuming you are completely limited by disk I/O (in which case most of the other 97.5% of your CPU will probably be idle anyway).

A more recent machine with the MCH integrated on the CPU die will manage several times more memory bandwidth than my old Core2 that I used as an example.
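The back-of-the-envelope arithmetic above can be checked in a couple of lines; the 8GB/s and 200MB/s figures are the ones quoted in the post:

```python
# Overhead of mirroring writes in software RAID, per the figures above.
mem_bw_mb_s = 8000   # memory bandwidth of the old Core2 MCH, in MB/s
disk_bw_mb_s = 200   # optimistic linear transfer rate of one SATA disk, MB/s

# Extra data the CPU pushes for the mirror copy, as a share of memory bandwidth
overhead = disk_bw_mb_s / mem_bw_mb_s
print(f"{overhead:.1%}")  # 2.5%
```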

Even a software RAID-1 with two disks can create slowdowns, depending on
what data transfer rates the disks can sustain.  I do not have such
slowdowns when the disks are on the hardware RAID controller.
Perhaps it's a problem with the particular board I have and/or the CPU
being too slow to be able to deal with the overhead in such a way that
it's not noticeable, and perhaps it's much better to fill the bandwidth
of a single SATA port rather than using some of the bandwidth of five
SATA ports.

If the total amount of I/O consumed is the same it shouldn't make much difference.

Or perhaps filling the bandwidth of one SATA port plus the
CPU handling the overhead ZFS brings about isn't any better, who knows.

More saturation of a SATA port isn't what gives you more performance - the fact that you are managing to saturate the SATA port more is the _consequence_ of getting more performance out of the disk subsystem.

Anyway, I have come to like hardware RAID better than software RAID.

Whatever works for you. My view is that traditional RAID, certainly
anything below RAID6,

Well, you have to afford all the disks for such RAID levels.  Is ZFS any
better in this regard?

It doesn't make the disks cheaper, no. :)

and even on RAID6 I don't trust the closed, opaque, undocumented
implementation that might be in the firmware, is

It's a big disadvantage of hardware RAID that you can't read the data
when the controller has failed, unless you have another, compatible
controller at hand.  Did you check the sources of ZFS so that you can
trust it?

I have looked at the ZoL and zfs-fuse sources in passing when looking at various patches I wanted to apply (and writing a small one of my own for zfs-fuse, to always force 4KB ashift), but I wouldn't say that I have looked through the source enough to trust it based purely on my reading of the code.

But I suspect that orders of magnitude more people have looked at the ZFS source than have looked at the RAID controller firmware source.

For a typical example of the sort of errors I'm talking about, say you have a hardware RAID5 array and a ZFS RAIDZ1 pool.

Take one disk off each controller, write some random data over parts of it (let's be kind, don't overwrite the RAID card's headers, which a phantom write could theoretically do).

Now put the disks back into their pools, and read the files whose data you just overwrote.

ZFS will spot the error, restore the data and hand you a good copy back.

The RAID controller will most likely give you back duff data without even noticing that something is wrong with it.

You could run ZFS on top of the RAID logical volume, but because ZFS would have no visibility of the raw disks and redundancy underneath, there is nothing it can do to scrub out the bad data. But even in that stack, ZFS would still at least notice the data is corrupted and refuse to return it to the application.
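A toy sketch of why a checksumming filesystem catches this while plain RAID does not: the checksum of each block is stored in its parent, so corrupted data fails verification on read. This is purely conceptual, not ZFS's actual on-disk format or code:

```python
# Conceptual sketch only: block checksums stored out-of-band, ZFS-style.
import hashlib

def write_block(data: bytes):
    """Store the data along with a checksum kept in the parent block."""
    return data, hashlib.sha256(data).digest()

def read_block(data: bytes, checksum: bytes) -> bytes:
    """Refuse to return data whose checksum does not match."""
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: corruption detected")
    return data

data, csum = write_block(b"payload")
assert read_block(data, csum) == b"payload"   # a clean read passes

# A plain RAID controller would happily return the corrupted bytes;
# the checksum layer detects the mismatch instead:
try:
    read_block(b"paXload", csum)
except IOError as err:
    print(err)  # checksum mismatch: corruption detected
```

With redundancy underneath (as in RAIDZ), the filesystem can go further and rewrite the bad copy from a good one; without it, detection is the best it can do.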

no longer fit for purpose with disks of the kind of size that ship today.

How would that depend on the capacity of the disks?  More data --> more
potential for errors --> more security required?

The unrecoverable error rates of disks have been stagnant at around one sector per 10^14 bits read - roughly one bad sector per 10TB or so of reads. If you are using 4TB drives in a 3-disk RAID5 and you lose a disk, you have to read back 8TB of data to rebuild the parity onto the replacement disk. If you statistically get one bad block every 10TB of reads, that means roughly an 80% chance of losing some data while rebuilding the array.
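As a rough check of that figure, using the spec rate of one error per 10^14 bits and an 8TB rebuild read:

```python
# Rebuild-failure odds for a 3x4TB RAID5, per the figures above.
import math

ure_per_bit = 1e-14       # typical consumer-disk spec: 1 URE per 10^14 bits
rebuild_bits = 8e12 * 8   # 8 TB read back during the rebuild

expected_errors = rebuild_bits * ure_per_bit                      # naive expectation
p_at_least_one = 1 - math.exp(rebuild_bits * math.log1p(-ure_per_bit))

print(round(expected_errors, 2))  # 0.64
print(round(p_at_least_one, 2))   # 0.47
```

The 80% in the post comes from the rounder "one error per 10TB" approximation (8TB/10TB); with the exact 10^14-bit spec the expected error count is 0.64 and the chance of at least one error about 47% - still alarmingly high for a single rebuild.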

Additionally, rebuilding times have been going up with disk size which increases both the time of degraded performance during rebuild and the probability of failure of another disk during the rebuild.

So with VMware, you'd have to get certified hardware.

You wouldn't _have_ to get certified hardware. It just means that if
you find that there is a total of one motherboard that fits your
requirements and it's not on the certified list, you can plausibly
take your chances with it even if it doesn't work out of the box. I
did that with the SR-2 and got it working eventually in a way that
would never have been possible with ESX.

Are you saying that for your requirements you couldn't use VMware, which
makes it irrelevant whether the hardware is certified for it or not?

Both. What I wanted to do couldn't be done on ESX within the same resource constraints. I wanted to use a deduplicated ZFS pool for my VM images, and there is no ESX port of ZFS.

Additionally, there would have been no way for me to work around my hardware bugs with ESX because I couldn't write patches to work around the problem.

After all, I'm not convinced that virtualization as it's done with xen
and the like is the right way to go.

I am not a fan of virtualization for most workloads, but sometimes
it is convenient, not least in order to work around deficiencies of
other OS-es you might want to run. For example, I don't want to
maintain 3 separate systems - partitioning up one big system is
much more convenient. And I can run Windows gaming VMs while
still having the advantages of easy full system rollbacks by
having my domU disks backed by ZFS volumes. It's not for HPC
workloads, but for some things it is the least unsuitable solution.

Not even for most?  It seems as if everyone is using it quite a lot,
make it sense or not.

Most people haven't realized yet that the king's clothes are not
suitable for every occasion, so to speak. In terms of the hype cycle,
different users are at different stages. Many are still around the
point of "peak of inflated expectations". Those that do the testing
for their particular high performance workloads they were hoping to
virtualize hit the "trough of disillusionment" pretty quickly most of
the time.

Why would they think that virtualization benefits things that require
high performance?

I don't. But many people think it makes negligible difference because they never did their testing properly.

When I need the most/best performance possible, it's
obviously counter productive.

Apparently it's not obvious to many, many people.

Xen-users mailing list


