
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance

Gordan Bobic <gordan@xxxxxxxxxx> writes:

> On 06/29/2014 08:12 AM, lee wrote:
>> Gordan Bobic <gordan@xxxxxxxxxx> writes:
>>>> On 06/28/2014 08:45 AM, lee wrote:
>>>> The hardware RAID controller gives me 10fps more with my favourite game
>>>> I'm playing, compared to software raid.  Since fps rates can be rather
>>>> low (because I'm CPU limited), that means a significant difference.
>>> If your game is grinding onto disk I/O during play all is lost
>>> anyway. If your CPU and RAM are _that_ constrained, there is probably
>>> a better way to spend whatever you might pay for a new caching RAID
>>> controller these days.
>> Only I didn't buy the controller new, and I bought it to have a decent
>> amount of ports.
> Fair re: number of ports, but 2nd hand CPUs don't go for that much on
> ebay, either, if you are concerned about the CPU hit.

What would you put into an AM3 socket which is so much faster than a
Phenom 965 to be worthwhile and doesn't have an enormous power

I'd have to switch to Intel --- and that would bring about UEFI which,
besides the tremendous hassle, comes with serious security risks.  And
after reading [1], what could you still buy?

[1]: https://www.fsf.org/blogs/community/active-management-technology

> The main difference being that when you aren't doing disk I/O, you
> still get the use of the extra CPU, whereas the RAID card sits
> idle. Having extra CPU is more generic and flexible.

Yes, and when you have more RAM, it's also to your advantage regardless
of disk I/O.  Only it would cost more --- even off ebay --- than what I
paid for the controller; I'd be throwing 8GB away, and it won't give
me more ports, either ...

I considered all these options and didn't find any that would have been
more advantageous for the money.  Nothing is perfect ...

>> It's not disk I/O or a lack of RAM that limits the fps rates, it's
>> actually the CPU (or the whole combination of CPU, board and RAM) not
>> being able to feed the graphics card fast enough --- or the graphics
>> card being too fast for the CPU, if you want to see it that way.  To get
>> a significantly faster system, I'd have to spend ten times or more than
>> what I paid for the controller.
> Fair enough - you must have got the controller _really_ cheap. Expect
> to spend a small fortune on a battery replacement when that fails,
> though. They only typically last a couple of years.

Yes, it was a good deal --- you can still get them off ebay for about
that price.  If the BBU fails, it might be advisable to buy another
controller with BBU because it doesn't cost much more than the BBU, and
I'd have a spare controller.  Or I could retire the controller and
perhaps use ZFS or mdraid or, since I have the server now, just get a
pair of SSDs.

My intention is to reduce the number of disks anyway.  The fewer disks
you have, the fewer can fail.

>> The CPU alone would cost more.  I
>> didn't expect any change in fps rates and got the improvement as a
>> surprising side effect.
> That will only apply when you are doing disk I/O at the same time,
> surely. If you aren't doing disk I/O, then your CPU isn't doing
> checksumming.

Well, I was really surprised about the improvement in fps and have a
hard time believing it myself.  There's no good explanation for it
other than the theory that the system somehow gets plugged when having
to deal with so many disks in software raids, and that would seem to
only apply when so much disk I/O is going on.  "So much" would probably
mean "only very little".

I just noticed the improvement and didn't do any further investigation
since that would have required completely reverting the move to the
controller.  Now I have the backup disks on software raid-1, and when
copying data, the load is noticeable.

> As I said above - clearly you got the RAID controller very
> cheap. Within that cost envelope it may well have been a reasonable
> solution - but that doesn't take away the points about data safety and
> disk feature requirements for use with hardware RAID (TLER).

True --- luckily, the WD20EARS hold up remarkably well.  One out of
three failed, after about three years, probably because it would take
too long to recover an error.  I replaced it and haven't checked it any
further yet; it might still be usable.

> Disks with crippled firmware will deliberately not respond to resets
> while trying to do sector recovery, again, purely to prevent working
> around the fact the disk doesn't report TLER.
> When this happens, one of two things will occur:
> 1) The controller will kick out the disk and carry on without it, thus
> losing the redundancy.

That's probably what happened with the failed disk.

> 2) The controller might be clever enough to not kick out the disk that
> doesn't support TLER (theoretical - you are at the mercy of the
> closed-spec RAID card firmware that may or may not do something
> sensible), but the only other thing it can do is wait for the disk to
> return. But until the disk returns, the card will block further I/O.
> So in reality you have a choice between losing redundancy on the first
> pending sector you encounter and the machine becoming unresponsive for
> prolonged periods whenever you encounter a pending sector.
> With software RAID you can at least choose between the two (by
> selecting the disk command timeout appropriately), and managing the
> situation (e.g. setting up a process to monitor for disks that have
> been kicked out of the array and automatically re-adding them (if they
> become responsive again) to restore the redundancy).
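
The monitoring idea described above could be sketched roughly as follows.
This is only an illustration, not a tested tool: it parses text in the
format of /proc/mdstat and reports arrays that have fewer active members
than expected.  The sample text and the function name degraded_arrays are
made up for the example.  On a real system you would read /proc/mdstat
itself and decide separately whether a kicked disk deserves a re-add with
mdadm.

```python
import re

# Made-up sample in the format of /proc/mdstat; md1 has lost one member.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      1953382336 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sde1[3](F) sdd1[1] sdc1[0]
      3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array: (expected_members, active_members)} for degraded arrays."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = (expected, active)
            current = None
    return result


if __name__ == "__main__":
    for name, (expected, active) in degraded_arrays(SAMPLE_MDSTAT).items():
        print(f"{name}: {active} of {expected} members active")
```

A cron job or small daemon could call something like this periodically and
raise an alert, leaving the decision to re-add a disk to the admin.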

Is that a good idea to do, continue to use a disk that spends quite a
while on error recovery?

And have you seen what happens when a SATA disk becomes unresponsive?
The kernel will, apparently indefinitely, try heavily to reset the SATA
link and flood you with messages about it to the point where your only
choice is to try to shut the system down.

I've had that happen a while ago with a very old disk.  I wonder why the
kernel behaves that badly.  The disk that failed wasn't relevant at all
for keeping the system running.  I'd rather have the kernel try maybe
ten times and then give up, and let you have an option to make it
try again.

I wonder what happens when that happens with a disk on a PMP.  Will the
whole set on the PMP become inaccessible because the port gets blocked?

>> Do you mean a card like this one:
>> http://www.hardware-rogge.com/product_info.php?products_id=15226
> Yes, although €80 is more than I paid for mine. IIRC mine were around £40.

That's about EUR 65?

>> This card alone costs almost as much as I paid for the RAID controller.
> You got the RAID controller _REALLY_ cheap.


You can get one, too :)  IIRC a P800 doesn't do JBOD, though.

>> How come you use such a card?  Couldn't you use the on-board SATA ports
>> and connect a multiplier to them?
> Most south-bridge based SATA controllers don't support FIS, and

My understanding of [2] is that the SATA protocol doesn't work at all
without FIS.  All SATA controllers would have to support it then.

[2]: https://en.wikipedia.org/wiki/Frame_Information_Structure#FIS

> performance takes a massive nosedive when you lose command
> interleaving between the disks.


You mean if a SATA controller doesn't support FIS, it would leave you
unable to use all but one disk behind a PMP connected to it because
switching between the disks won't be possible?

> Some motherboards come with secondary
> SATA controllers which behave much better but they often have other
> problems. For example, two of the motherboards I have used recently
> have the following secondary SATA controllers for extra ports:
> Silicon Image SIL3132, 2 ports: Has FIS+NCQ, but the card bottlenecks
> at about 170MB/s, which is very low.
> Marvell 88SE9123, 2 ports: Has FIS+NCQ, but it has a PMP handling bug
> that prevents the use of more than one PMP. You can use a PMP on one
> port with 5 disks attached to the PMP, plus one disk attached directly
> to the other port. If you attach a PMP to each port, no disks show up
> at all.
> I needed more than one PMP's worth of extra disks, so I used a card
> with the older Marvell chipset that works correctly.

The problem is finding all this out so you can buy the right hardware.
I might have bought just an additional SATA controller instead of the
P800, but there was no telling which one would be good, or even work at

>>> In contrast, I have three SAS RAID cards, two LSI and one Adaptec,
>>> none of which work at all on my motherboard with the IOMMU enabled.
>> Hmmm, how come, and what are the symptoms?  Perhaps I should try to
>> force NUMA to be enabled for the server.
> The symptoms are that all the disk commands get truncated and none of
> the disks show up. DMA to/from the controller doesn't work. I'm pretty
> sure it has nothing to do with NUMA, though - most likely a side effect
> of the Nvidia NF200 PCIe bridges on my motherboard.

Hm :(  It becomes increasingly difficult to find a good board, other
than for gaming maybe.  That, and things like UEFI and AMT, make good
reasons not to buy anything new until it cannot be avoided.

> The battery packs on the memory modules do expire eventually, and if
> you force write-caching to on without the BBU, you are probably going
> to thoroughly destroy the file system the first time you have a power
> cut in the middle of a heavy writing operation.

The write cache is supposed to become disabled when the BBU fails.  It
probably doesn't even take a power outage; when booting, the P800
sometimes says it still has data to write to the disks after the system
was shut down nicely.  Interestingly, 'shutdown -h now' instead of
'halt' seems to prevent that.

> If that happens you would probably be better off putting the
> controller into a plain HBA/JBOD mode, but that would mean rebuilding
> the RAID array.

IIRC, you can't do that with a P800.  It's a pretty serious card that
can drive 256 disks or so, supports multipath, volume sharing and
migration and whatever you may think of.  You can still buy them new for
over $800 even though they're kinda ancient.

>> I have a theory that when you have a software
>> RAID-5 with three disks and another RAID-1 with two disks, you have to
>> move so much data around that it plugs up the system, causing slowdowns.
> Yes, you have to send twice the data down the PCIe bus, but in
> practice this creates a negligible overhead unless your machine is
> massively short of CPU and PCIe bandwidth.

With five disks in total in different arrays you'd have to send five
times the data, wouldn't you?
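
As a rough way to reason about this question, here is a back-of-the-envelope
sketch.  It assumes ideal full-stripe writes and ignores read-modify-write
cycles, metadata and controller quirks, so real numbers will differ.  The
function name bytes_on_bus is made up for the example.  Under these
assumptions the multiplier depends on the layout of the array being written
to, not on the total number of disks in the machine.

```python
def bytes_on_bus(payload_bytes, level, disks):
    """Bytes the host must send to the disks for one write of payload_bytes.

    Idealised model: RAID-1 sends a full copy to every mirror, RAID-5 sends
    the data plus one parity block per stripe (full-stripe writes only).
    """
    if level == "raid1":
        return payload_bytes * disks
    if level == "raid5":
        return payload_bytes * disks / (disks - 1)
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    one_gb = 10**9
    # Two-disk mirror: every byte goes out twice.
    print(bytes_on_bus(one_gb, "raid1", 2) / one_gb)
    # Three-disk RAID-5: two data blocks plus one parity block per stripe.
    print(bytes_on_bus(one_gb, "raid5", 3) / one_gb)
```

With a hardware controller the host sends the payload once and the card
fans it out, which is the difference being discussed here.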

It might be short of bandwidth if the CPU is busy and heavily accesses
the RAM and feeds the graphics card at the same time.  Or the SATA
controller has problems with it.  I really don't know.

>> Or perhaps filling the bandwidth of one SATA port plus the
>> CPU handling the overhead ZFS brings about isn't any better, who knows.
> More saturation of a SATA port isn't what gives you more performance - 
> the fact that you are managing to saturate the SATA port more is the
> _consequence_ of getting more performance out of the disk subsystem.

Getting more performance out of the disk subsystem doesn't necessarily
mean getting more performance out of the system (like higher fps
rates).  It might even reduce performance (i. e. lower fps rates).

In the end, you only have "this much" in terms of resources.  Adding a
hardware RAID controller increases the resources.

Imagine how nice it might be to have a hardware ZFS controller :)
Manufacture one, and you might make money with it, replacing all the
RAID cards.  But I want at least a 10% share.

>>>> Anyway, I have come to like hardware RAID better than software RAID.
>>> Whatever works for you. My view is that traditional RAID, certainly
>>> anything below RAID6,
>> Well, you have to afford all the disks for such RAID levels.  Is ZFS any
>> better in this regard?
> It doesn't make the disks cheaper, no. :)

I mean in requiring a smaller number of disks :)

And since it can save space, ZFS kinda makes the disks a little cheaper.

>>> and even on RAID6 I don't trust the closed, opaque, undocumented
>>> implementation that might be in the firmware, is
>> It's a big disadvantage of hardware RAID that you can't read the data
>> when the controller has failed, unless you have another, compatible
>> controller at hand.  Did you check the sources of ZFS so that you can
>> trust it?
> I have looked at the ZoL and zfs-fuse sources in passing when looking
> at various patches I wanted to apply (and writing a small one of my
> own for zfs-fuse, to always force 4KB ashift), but I wouldn't say that
> I have looked through the source enough to trust it based purely on my
> reading of the code.
> But I suspect that orders of magnitude more people have looked at the
> ZFS source than have looked at the RAID controller firmware source.

It's entirely possible that, in total, 5000 people have looked at the
ZFS sources, and that 4995 of them have looked at the same 5% of it,
with the exception of the 5 ZFS developers (or how many there are) who
have read it all.  It's a bitchy argument --- but it's possible :)

It doesn't eliminate that even if 5000 people wanted to look at any of
the sources for what software is in a raid controller, they couldn't.
However, unless the 5000 people actually read 100% of the ZFS sources,
that's irrelevant.

> For a typical example of the sort of errors I'm talking about, say you
> have a hardware RAID5 array and a ZFS RAIDZ1 pool.
> Take one disk off each controller, write some random data over parts
> of it (let's be kind, don't overwrite the RAID card's headers, which a
> phantom write could theoretically do).
> Now put the disks back into their pools, and read the files whose data
> you just overwrote.
> ZFS will spot the error, restore the data and hand you a good copy back.
> RAID controller will most likely give you back duff data without even
> noticing something is wrong with it.
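
The difference described above can be illustrated with a toy model.  This is
a sketch only, not how ZFS or any RAID firmware is actually implemented: it
uses XOR parity over two data blocks and SHA-256 checksums kept separately
from the data.  All names are made up for the example.  Parity alone can
show that a stripe is inconsistent, but not which block is wrong; a checksum
per block identifies the bad block, so it can be rebuilt from the rest.

```python
import hashlib


def xor_blocks(*blocks):
    """XOR equally sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def read_with_repair(index, blocks, parity, checksums):
    """Return block `index`, rebuilding it from parity if its checksum fails."""
    data = blocks[index]
    if hashlib.sha256(data).digest() == checksums[index]:
        return data
    others = [b for i, b in enumerate(blocks) if i != index]
    rebuilt = xor_blocks(parity, *others)
    if hashlib.sha256(rebuilt).digest() != checksums[index]:
        raise IOError("block cannot be reconstructed")
    return rebuilt


# One stripe of a three-disk layout: two data blocks plus parity.
d0, d1 = b"AAAAAAAA", b"BBBBBBBB"
parity = xor_blocks(d0, d1)
checksums = [hashlib.sha256(d0).digest(), hashlib.sha256(d1).digest()]

# Simulate random data written over part of one disk.
corrupted = [d0, b"BBBBXXXX"]

# Parity only says "something in this stripe is inconsistent".
parity_mismatch = xor_blocks(*corrupted) != parity

# The checksum pins the damage to block 1, which is then rebuilt.
repaired = read_with_repair(1, corrupted, parity, checksums)

if __name__ == "__main__":
    print(parity_mismatch, repaired)
```

A controller that does not verify parity on ordinary reads would not even
get as far as the mismatch and would simply hand back the corrupted block.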

It would start a rebuild once you plug the disk back in --- provided
that you hotplugged it or had the array up while the disk was removed.
Otherwise, we'd have to try out what would happen, which can be
different from controller to controller.

> You could run ZFS on top of the RAID logical volume, but because ZFS
> would have no visibility of the raw disks and redundancy underneath,
> there is nothing it can do to scrub out the bad data. But even in that
> stack, ZFS would still at least notice the data is corrupted and
> refuse to return it to the application.

Only if you do not happen to write correct checksums for the data that
was overwritten --- though it seems unlikely enough to happen.

What does ZFS do when it finds block A and block B of data D, both
blocks with a correct checksum but containing different data?  Will it
let you decide which block to use, or refuse to deliver the data, or not
notice and return some data instead of D which is composed of whatever
is in the blocks A and B?

>>> no longer fit for purpose with disks of the kind of size that ship
>>> today.
>> How would that depend on the capacity of the disks?  More data --> more
>> potential for errors --> more security required?
> The data error rates have been stagnant at one unrecoverable sector in
> 10^14 bits read. That's one bad sector on about 10TB of data. If you
> are using 4TB drives in 3-disk RAID5, if you lose a disk you have to
> read back 8TB of data to rebuild the parity onto the replaced disk. If
> you are statistically going to get one bad block every 10TB of reads,
> it means you have 80% chance of losing some data during rebuilding
> that array.
> Additionally, rebuilding times have been going up with disk size which
> increases both the time of degraded performance during rebuild and the
> probability of failure of another disk during the rebuild.
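
The arithmetic above can be checked with a small sketch.  It assumes read
errors are independent and occur at exactly the quoted rate of one per 10^14
bits, which is a spec-sheet figure rather than a measured one, so treat the
result as an estimate.  The function name is made up for the example.  Under
a Poisson model, reading 8TB gives an expected 0.64 errors and roughly a 47%
chance of hitting at least one.  The 80% figure is what you get from dividing
8TB by 10TB directly.  Either way the risk is far from negligible.

```python
import math


def p_at_least_one_ure(bytes_read, bits_per_error=1e14):
    """Chance of one or more unrecoverable read errors (Poisson model)."""
    expected_errors = bytes_read * 8 / bits_per_error
    return 1.0 - math.exp(-expected_errors)


if __name__ == "__main__":
    # Rebuilding a 3-disk RAID5 of 4TB drives reads the two surviving disks.
    rebuild_bytes = 2 * 4e12
    print(f"expected errors: {rebuild_bytes * 8 / 1e14:.2f}")
    print(f"P(at least one URE): {p_at_least_one_ure(rebuild_bytes):.0%}")
```

Larger disks or wider arrays push the expected error count up linearly,
which is why the argument gets stronger as capacities grow.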

It makes me wonder why such errors haven't become a widespread serious
problem yet --- not everyone is using ZFS.  Rebuild times are really an
issue; it took days to get it done with the WD20EARS in the server, and
I thought it would never finish because it might start over with each
reboot though it shouldn't.  It didn't and finished eventually, but
stressing the disks for days like that really isn't something you want
to see after replacing one.

>>>> So with VMware, you'd have to get certified hardware.
>>> You wouldn't _have_ to get certified hardware. It just means that if
>>> you find that there is a total of one motherboard that fits your
>>> requirements and it's not on the certified list, you can plausibly
>>> take your chances with it even if it doesn't work out of the box. I
>>> did that with the SR-2 and got it working eventually in a way that
>>> would never have been possible with ESX.
>> Are you saying that for your requirements you couldn't use VMware, which
>> makes it irrelevant whether the hardware is certified for it or not?
> Both. What I wanted to do couldn't be done on ESX within the same
> resource constraints. I wanted to use a deduplicated ZFS pool for my
> VM images, and there is no ESX port of ZFS.
> Additionally, there would have been no way for me to work around my
> hardware bugs with ESX because I couldn't write patches to work around
> the problem.

The idea is that you wouldn't have had to work around hardware bugs
when you'd be using certified hardware.

>> Why would they think that virtualization benefits things that require
>> high performance?
> I don't. But many people think it makes negligible difference because
> they never did their testing properly.

How much difference does it make, actually?  I mean just the
virtualization, without a number of other VMs doing something.  Like you
could plan on a particular hardware for a particular workload and get
performance X with it.  Let's say you plan 4 CPUs/8GB RAM and buy
hardware that has 2x4 CPUs/64GB RAM.  Now you somehow limit the software
to use only 4 CPUs/8GB of that hardware and measure the performance ---
perhaps physically take out one CPU and install only 8GB.

Then you plug in the second CPU and all the RAM, but use a VM, give it 4
CPUs/8GB and measure the performance again.  How much less performance
will you get?

>> When I need the most/best performance possible, it's
>> obviously counter productive.
> Apparently it's not obvious to many, many people.

Thinking of it, you might even get better performance from the VM in the
example above.  You'd be cheating because there are 4 CPUs and 56GB of
RAM underneath the VM to do some work it would otherwise do itself ---
but corrected for that?  What would you get?

And depending on the workload, it might not benefit from more CPUs
and/or more RAM.  But it might benefit from CPUs and RAM underneath a VM
that handles the workload.  So if you'd have all the 8 CPUs/64GB for
your workload, you wouldn't get better performance, but you would by
using a VM for it.  So it's not as obvious as I thought.

Knowledge is volatile and fluid.  Software is power.
