
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance



Gordan Bobic <gordan@xxxxxxxxxx> writes:

> On 06/29/2014 06:07 AM, lee wrote:
>> Gordan Bobic <gordan@xxxxxxxxxx> writes:
>>
>>> On 06/28/2014 12:25 PM, lee wrote:
>>>> Kuba <kuba.0000@xxxxx> writes:
>>
>>>> and ZFS doesn't increase the number of SAS/SATA ports you have.
>>>
>>> No, but it does deprecate the RAID and caching parts of a controller,
>>
>> Why does it deprecate them?
>
> Because its RAID is far more advanced, and it makes far better use of
> the caches built into the disks.

When solutions A and B to a problem X are both available, that doesn't
mean either one deprecates the other, "more advanced" or not.

ZFS has its advantages, and it would seem a bad idea to combine it with
hardware RAID.  It's tempting to try it out, and I really like the
checksumming it does, but it's also confusing: there's (at least) ZFS
and OpenZFS, and Debian requires you to use FUSE if you want ZFS, which
adds more complexity.  There's also uncertainty about the changes
currently being made to ZFS, which makes me wonder whether my data
might become unreadable after a software update, or after a software
change when I install the disks in a different computer; I've read
reports of that happening, though it shouldn't.  And perhaps the day
after I switch to ZFS, a new feature comes out which would require me
to re-create the volumes and copy the data over yet again, at least if
I wanted to use that feature.
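
At least the "data unreadable on another machine" part seems checkable
up front: as I understand it, what matters is which feature flags a
pool actually has active, since those are the ones the importing
system's ZFS has to understand.  A rough Python sketch (the pool name
"tank" is just an example, and it assumes the default column layout of
"zpool get"):

  #!/usr/bin/env python
  # Rough sketch: list the feature flags a ZFS pool uses, so you can check
  # whether another machine's ZFS supports them before moving the disks.
  # The pool name "tank" is only an example.
  import subprocess

  POOL = "tank"

  # "zpool get all <pool>" prints one property per line:
  #   NAME  PROPERTY                VALUE    SOURCE
  #   tank  feature@lz4_compress    active   local
  out = subprocess.check_output(["zpool", "get", "all", POOL])

  for line in out.decode().splitlines():
      fields = line.split()
      if len(fields) >= 3 and fields[1].startswith("feature@"):
          prop, value = fields[1], fields[2]
          # "active" features must be understood by the importing system;
          # "enabled" ones only matter once something actually uses them.
          if value in ("active", "enabled"):
              print("%-30s %s" % (prop, value))

That wouldn't remove the uncertainty about future changes, but it would
at least take the surprise out of moving the disks to another box.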

It seems that ZFS isn't yet mature enough for me to use.  I haven't
learned much about it so far, but that's my impression.

And how about ZFS with JBOD on a hardware RAID controller?

> Since ZFS uses variable width stripes, every write is always a single
> operation.

Which may or may not complete?  And what about the on-disk caches and
power failures?
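
Thinking out loud about the first part: as far as I understand it (a
toy illustration of the idea, not how the ZFS code actually works), a
variable-width stripe means there is never a partial-stripe
read-modify-write, so there is no window in which data and parity on
disk disagree:

  #!/usr/bin/env python
  # Toy illustration (not ZFS code): why a full-stripe, copy-on-write update
  # avoids the classic RAID-5 "write hole", while an in-place partial-stripe
  # update has a window where data and parity disagree.

  def xor_parity(blocks):
      parity = 0
      for b in blocks:
          parity ^= b
      return parity

  # Traditional RAID-5 style: one block of an existing stripe is updated
  # in place, and the parity has to be rewritten separately.
  stripe = [0x11, 0x22, 0x33]
  parity = xor_parity(stripe)
  stripe[1] = 0x99
  # <-- a power failure here leaves stale parity that no longer matches
  assert xor_parity(stripe) != parity

  # ZFS style: the whole (variable-width) stripe, parity included, goes to
  # fresh blocks, and only an atomic pointer update makes it live, so the
  # old stripe stays valid until the new one is complete.
  old = {"data": [0x11, 0x22, 0x33]}
  old["parity"] = xor_parity(old["data"])
  new = {"data": [0x11, 0x99, 0x33]}
  new["parity"] = xor_parity(new["data"])
  live = new                       # the "pointer flip"
  assert xor_parity(live["data"]) == live["parity"]

The on-disk caches are the part I'd still want confirmed; my
understanding is that ZFS issues cache flushes when it commits, but
that presumably depends on the disks honouring them.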

>> Can you connect SAS disks to them as well?
>
> No. You cannot plug SAS disks into SATA ports, with or without multipliers.

Hm, I wish they would make some with which you can do that.

>> Aren't they getting into each others ways, filling up the bandwidth of
>> the port?
>
> If your HBA/PMP/HDD don't support FIS+NCQ, then yes.

It seems that FIS would have to be supported by every HBA because it's
the second layer of the SATA protocol.  And I thought that NCQ is a
feature of the disk itself, which either supports it or not.  Why/how
would an HBA or PMP interfere with NCQ?

> If your HBA/PMP/HDD do support FIS+NCQ (the models I mentioned do),
> then the bandwidth is effectively multiplexed on demand. It works a
> bit like VLAN tagging. A command gets issued, but while that command
> is completing (~8ms on a 7200rpm disk) you can issue commands to other
> disks, so multiple commands on multiple disks can be completing at the
> same time. As each disk completes the command and returns data, the
> same happens in reverse.

And there aren't any (potential) problems with the disks?  Each disk
would have to happily wait around until it can communicate with the HBA
again.  SCSI disks were designed for that, but SATA disks?

>>>> How does it do the checksumming?
> [...]
>>> with nothing else to be done. If the checksum doesn't match the data
>>> (silent corruption), or read of one of the disks containing a piece of
>>> the block fails (non-silent corruption, failed sector), ZFS will go
>>> and
>>
>> And? Correct the error?
>
> Sorry, did that get truncated?

It did.

> Yes, indeed, ZFS will initiate recovery procedures, find a combination
> of blocks which, when assembled, match the checksum, return the data,
> re-calculate the damaged block and write it back to the disk that
> didn't return the correct data.

I'd really like to have that.  There is only so much point in using ECC
RAM when your data may get silently damaged on disk.
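
If I understand the principle, it boils down to something like this
deliberately simplified sketch (a two-way mirror instead of RAIDZ to
keep it short, and nothing like the real ZFS code):

  #!/usr/bin/env python
  # Simplified sketch of the self-healing idea for a two-way mirror: the
  # checksum stored in the parent block pointer decides which copy is good,
  # and any bad copy is rewritten from the good one.
  import hashlib

  def checksum(data):
      # ZFS uses fletcher4 or sha256; sha256 stands in for both here.
      return hashlib.sha256(data).digest()

  class Mirror(object):
      def __init__(self, copies):
          self.copies = copies                   # one block per "disk"

      def read(self, expected):
          for data in self.copies:
              if checksum(data) == expected:     # found a good copy
                  for i, other in enumerate(self.copies):
                      if checksum(other) != expected:
                          self.copies[i] = data  # heal the damaged copy
                  return data
          raise IOError("all copies damaged; only detection is possible")

  block = b"important data"
  m = Mirror([b"important data", b"important dXta"])  # one copy corrupted
  assert m.read(checksum(block)) == block
  assert m.copies[1] == block                         # bad copy repaired

With RAIDZ the reconstruction step is combinatorial rather than "take
the other copy", as described above, but the principle is the same.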

>> So it's like RAID built into the file system?  What about all the CPU
>> overhead?
>
> It's expensive - for a file system. In reality, I use a number of
> 1.3GHz N36L HP Microservers with 4-8 disks in each in RAIDZ2 (n+2
> redundancy similar to RAID6), and even on weekly disk scrubs they
> never get anywhere near running out of CPU.

I considered getting one of those; they seem perfect for a NAS.  Some
CPU overhead on the server for ZFS won't hurt.

>> rest is WD SATAs --- and I'm starting to suspect that the RAID
>> controller in the server doesn't like the WD disks at all, which causes
>> the crashes.  Those disks weren't made at all for this application.
>
> This is another problem with clever controllers, especially hardware
> RAID. RAID controllers typically wait around 8-9 seconds for the disk
> to return the data. If it doesn't, they kick the disk out of the
> array.

Well, yes, the disk has failed when it doesn't return data reliably, so
I don't consider that a problem but a desirable feature.

What does ZFS do?  Continue to use an unreliable disk?

Or how unreliable is a disk that spends significant amounts of time on
error correction?
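
If I follow the timeout interaction correctly, it comes down to
something like this (made-up but typical-order-of-magnitude numbers):

  #!/usr/bin/env python
  # Toy timeline: what happens when a single sector needs deep error
  # recovery, depending on whether the drive gives up (TLER) before the
  # controller gives up on the drive.  Numbers are rough assumptions.

  CONTROLLER_TIMEOUT_S = 8.0   # typical hardware RAID patience, as above
  TLER_LIMIT_S = 7.0           # recovery cap on "RAID edition" drives
  DESKTOP_RECOVERY_S = 60.0    # desktop drives may retry internally longer

  def outcome(drive_recovery_time):
      if drive_recovery_time <= CONTROLLER_TIMEOUT_S:
          return "drive reports the bad sector; only that sector is rebuilt"
      return "controller kicks the whole drive out of the array"

  print("TLER drive:    " + outcome(TLER_LIMIT_S))
  print("desktop drive: " + outcome(DESKTOP_RECOVERY_S))

Which, I suppose, partly answers my own question: the drive isn't
necessarily dying, it may just never get the chance to report one bad
sector before the controller drops it.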

> HGST are one exception to the rule - I have a bunch of their 4TB
> drives, and they only make one 4TB model, which has TLER. Most other
> manufacturers make multiple variants of the same drive, and most are
> selectively feature-crippled.

You seem to like the HGST ones a lot.  They appear to cost more than
the WD Reds.


-- 
Knowledge is volatile and fluid.  Software is power.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

