
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance

Gordan Bobic <gordan@xxxxxxxxxx> writes:

>> On 07/01/2014 05:24 PM, lee wrote:
>> ZFS has its advantages, and it would seem a bad idea to use it with
>> RAID.
> That isn't true. While it is better to use it with bare disks, using
> it on top of RAID is still better than using something else because
> you still at least get to know about errors that creep in, even if ZFS
> can no longer fix them for you.

That's why I'm saying that it seems a bad idea: you have redundancy,
but you can't fully use it because it sits in the wrong layer.  If it
were in the right place, the errors could be corrected.

>> It's tempting to try it out, and I really like the checksumming
>> it does, and it's also confusing: There's (at least) ZFS and OpenZFS,
>> and Debian requires you to use fuse if you want ZFS, adding more
>> complexity.
> You haven't done your research thoroughly enough.

No, I haven't looked into it thoroughly at all.

> On Linux there is for all intents and purposes one implementation.

Where is this implementation?  Is it available by default?  I only saw
that there's a Debian package for ZFS, and that one involves FUSE.
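For what it's worth, which implementation is actually installed can be checked directly: zfs-fuse is a separate userspace package, while ZFS on Linux (ZoL) ships a native kernel module (the exact package names vary by distribution):

```shell
# Native ZoL shows up as a loaded kernel module:
lsmod | grep zfs
# On Debian, see which ZFS-related packages are present
# (zfs-fuse vs. the native ZoL packages):
dpkg -l | grep -i zfs
```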

>> There's also uncertainty about changes currently being made
>> to ZFS which makes me wonder if my data might become unreadable after a
>> software update or a software change when I install the disks in a
> different computer --- I've read reports of that happening, though it
>> shouldn't.
> If you've read about it I'd like to know where.

It was some blog post somewhere --- unfortunately, I can't find it
anymore.

> I've been using ZoL since back when the only POSIX layer
> implementation was from KQ Infotech, which was a rather early alpha
> grade bodge, and I never saw any forward incompatibility, nor have I
> ever lost any data to ZFS, which is more than I can say for most other
> file systems.

A very long time ago, I lost data with xfs once.  It probably was my own
fault, using some mount parameters wrongly.  That taught me to be very
careful with file systems and to prefer ones that are easy to use, that
don't have many (or any) parameters that need to be considered, and that
basically just do what they are supposed to do right out of the box.

Does ZFS do that?  Since it's about keeping the data safe, it might have
a good deal of protection against user errors.
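From what I've seen so far (a sketch, not a recommendation --- the pool and device names are made up), the basic workflow does have few knobs: a pool is created with defaults and mounted automatically, and a scrub verifies every checksum on demand:

```shell
# Create a mirrored pool; it is mounted at /tank automatically,
# with no separate mkfs step or fstab entry required:
zpool create tank mirror /dev/sdb /dev/sdc
# Child file systems inherit the pool's settings:
zfs create tank/data
# Walk the whole pool and verify every block's checksum:
zpool scrub tank
zpool status tank
```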

>> And perhaps the next day after I switch to ZFS, a new
>> feature comes out which would require me to re-create the volumes and to
>> copy the data over yet again, at least if I wanted to use that feature.
> You're spreading misinformed FUD. There are no "features" that could
> be added that might require you to rebuild the pool. An existing pool
> can always be upgraded. There is no way to downgrade, though, so make
> sure you really want those extra features.

See http://open-zfs.org/wiki/Features:

"SA based xattrs

Improves performance of linux-style (short) xattrs [...]

Requires a disk format change and is off by default [...]

Note that SA based xattrs are no longer used on symlinks as of Aug 2013
until an issue is resolved."

What's the difference between "a disk format change" and "rebuilding the
pool"?  And how could you predict that nothing changes requiring a
rebuild or a format change when there are issues that apparently haven't
been resolved in almost a year now and features that haven't been
implemented yet?  You might enable a new feature and find that it causes
problems, but you can't downgrade ...
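As I understand it, features of this kind are per-pool flags that stay disabled until you explicitly upgrade, which is why existing pools keep working --- the risk is only in opting in (pool name "tank" is hypothetical):

```shell
# Show the state of each feature flag on the pool:
zpool get all tank | grep 'feature@'
# Enable all features the installed version supports --- one-way:
# older software can then no longer import the pool.
zpool upgrade tank
```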

>> It seems that ZFS isn't sufficiently mature yet to use it.  I haven't
>> learned much about it yet, but that's my impression so far.
> As I said above - you haven't done your research very thoroughly.

I haven't, yet all I've been reading so far makes me very careful.  When
you search for "zfs linux mature", you find more sources saying
something like "it is not really mature" and not many, if any, that
would say something like "of course you should use it, it works".
>> And how about ZFS with JBOD on a hardware RAID controller?
> That is the recommended way to use it (effectively do away with the
> RAID part of the controller).

That's something else I never tried.  What if I make a JBOD and then
connect the disks to "normal" on-board SATA ports?  Will they be
readable as if they had never been connected to a RAID controller?
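If the controller's JBOD mode really does pass the disks through unmodified, ZFS should find its own labels on the disks wherever they end up attached; that can at least be checked with an import scan ("tank" is a hypothetical pool name):

```shell
# With no arguments, scan all attached disks for importable pools:
zpool import
# Import the pool that was found:
zpool import tank
```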

>>> Since ZFS uses variable width stripes, every write is always a single
>>> operation.
>> Which may be completed or not?  And what about the on-disk caches and
>> power failures?
> That's what barriers and the sync settings on each file system are
> for. As with any FS, any commits since the last barrier call will be
> lost. Everything up to the last barrier call is guaranteed to be safe,
> unless your disk or controllers lie about having committed things. This
> is not ZFS specific and applies to any FS.

IIRC, when I had the WD20EARS in software RAID-5, I got messages about
barriers being disabled.  I tried to find out what that was supposed to
tell me, and it didn't seem to be too harmful, and there wasn't anything
I could do about it anyway.  What if I use them as JBOD with ZFS and get
such messages?

>> It seems that FIS would have to be supported by every HBA because it's
>> the second layer of the SATA protocol.  And I thought that NCQ is a
>> feature of the disk itself, which either supports it or not.  Why/how
>> would a HBA or PMP interfere with NCQ?
> Not all HBAs support NCQ, and not all support FIS. They are specific
> features that have to be implemented on the HBA and PMP and HDD.

So is the Wikipedia article about SATA wrong?  Or how does that work
when any of the involved devices does not support such apparently
substantial parts of the SATA protocol?
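Whether NCQ is actually negotiated end-to-end can at least be checked per device on Linux (the device name is an example):

```shell
# A queue depth greater than 1 means the kernel is using NCQ:
cat /sys/block/sda/device/queue_depth
# What the drive itself advertises:
hdparm -I /dev/sda | grep -i queue
```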

>>> If your HBA/PMP/HDD do support FIS+NCQ (the models I mentioned do),
>>> then the bandwidth is effectively multiplexed on demand. It works a
>>> bit like VLAN tagging. A command gets issued, but while that command
>>> is completing (~8ms on a 7200rpm disk) you can issue commands to other
>>> disks, so multiple commands on multiple disks can be completing at the
>>> same time. As each disk completes the command and returns data, the
>>> same happens in reverse.
>> And there aren't any (potential) problems with the disks?  Each disk
>> would have to happily wait around until it can communicate with the HBA
>> again.  SCSI disks were designed for that, but SATA disks?
> The PMP takes care of it. It works, and it works well. NCQ on most
> SATA SSDs works in reverse this way because most of the time the disk
> is faster than the SATA port.

You said before that the disks are slower than the port because they
spend so much time seeking.  Does the PMP have some cache built in so
that the disks don't need to wait?  Or are the disks designed to wait?

>> Well, yes, the disk has failed when it doesn't return data reliably, so
>> I don't consider that as a problem but as a desirable feature.
>> What does ZFS do?  Continue to use an unreliable disk?
> Until the OS kernel's controller driver decides the disk has stopped
> responding and kicked it out.

As far as I've seen, that doesn't happen.  Instead, the system goes
down, trying to access the unresponsive disk indefinitely.

> Hence why TLER is still a useful feature --- you don't want your
> application to end up being made to wait for potentially minutes when
> the data could be recovered and repaired in a few seconds if the disk
> would only give up and return an error in a timely manner.

So you would be running ZFS on unreliable disks, with the errors being
corrected and going unnoticed, until either the system goes down
(without TLER) or the errors become unrecoverable and are noticed only
when it's too late (with TLER).
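For what it's worth, on drives that support SCT Error Recovery Control, the TLER-style timeout can be inspected and set from the host with smartctl (values are in tenths of a second; the device name is an example):

```shell
# Show the current read/write error recovery limits:
smartctl -l scterc /dev/sda
# Cap both at 7 seconds (70 tenths), the classic TLER value:
smartctl -l scterc,70,70 /dev/sda
```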

>> Or how unreliable is a disk that spends significant amounts of time on
>> error correction?
> Exactly - 7 seconds is about 840 read attempts. If the sector read
> failed 840 times in a row, what are the chances that it will ever
> succeed?

Isn't the disk supposed to stop using the failed sector once it has been
discovered, meaning that the disk might still be usable?
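The 840 figure in the quote works out if one assumes one retry per platter revolution: at 7200 rpm a revolution takes 60/7200 s, about 8.33 ms, and 7 s divided by 8.33 ms is about 840:

```shell
# retries = 7 s / (60 s / 7200 rev) = 7 * 7200 / 60 = 840
echo $(( 7 * 7200 / 60 ))   # prints 840
```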

>>> HGST are one exception to the rule - I have a bunch of their 4TB
>>> drives, and they only make one 4TB model, which has TLER. Most other
>>> manufacturers make multiple variants of the same drive, and most are
>>> selectively feature-crippled.
>> You seem to like the HGST ones a lot.  They seem to cost more than the
>> WD reds.
> I prefer them for a very good reason:
> http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/

Those guys don't use ZFS.  They must have very good reasons not to.

> 20 years of personal experience also agrees with their findings.

My experience agrees with their findings, too, only it's not tied to a
particular brand or model other than that Seagate disks failed
remarkably often and that I should never have bought Maxtor.

HGST has been bought by WD, though.  We can only hope that they will
continue to make outstanding disks.

Knowledge is volatile and fluid.  Software is power.

Xen-users mailing list


