
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance

  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Thu, 03 Jul 2014 07:46:28 +0100
  • Delivery-date: Thu, 03 Jul 2014 06:47:17 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 07/02/2014 11:45 PM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 07/01/2014 05:24 PM, lee wrote:

ZFS has its advantages, and it would seem a bad idea to use it with

That isn't true. While it is better to use it with bare disks, using
it on top of RAID is still better than using something else because
you still at least get to know about errors that creep in, even if ZFS
can no longer fix them for you.
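The detect-versus-repair distinction can be sketched with a toy model (plain Python, nothing ZFS-specific; the block/checksum layout is invented for illustration):

```python
import hashlib

class Block:
    """A data block stored alongside its own checksum, ZFS-style."""
    def __init__(self, data: bytes):
        self.data = data
        self.checksum = hashlib.sha256(data).digest()

    def ok(self) -> bool:
        return hashlib.sha256(self.data).digest() == self.checksum

def read(copies):
    """Return good data, healing a bad copy from a redundant good one."""
    good = next((c for c in copies if c.ok()), None)
    if good is None:
        raise IOError("checksum error, no good copy: detected but unrepairable")
    for c in copies:
        if not c.ok():              # self-heal: rewrite bad copy from good one
            c.data = good.data
            c.checksum = good.checksum
    return good.data

# ZFS mirror on bare disks: two copies, one silently corrupted -> repaired
a, b = Block(b"payload"), Block(b"payload")
b.data = b"bitflip"
assert read([a, b]) == b"payload" and b.ok()

# ZFS on top of hardware RAID: ZFS sees one logical copy, so corruption
# is detected (checksum mismatch) but there is nothing to heal from
c = Block(b"payload")
c.data = b"bitflip"
try:
    read([c])
except IOError as e:
    print(e)
```

The point of the sketch: the checksum always catches the corruption, but repair requires a redundant copy that ZFS itself manages, which is exactly what a hardware RAID layer hides from it.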

That's why I'm saying that it seems a bad idea:  You have redundancy and
you can't fully use it because the redundancy is in the wrong place.  If
it were in the right place, the errors could be corrected.

It's tempting to try it out, and I really like the checksumming
it does, but it's also confusing: there's (at least) ZFS and OpenZFS,
and Debian requires you to use fuse if you want ZFS, adding more

You haven't done your research thoroughly enough.

No, I haven't looked into it thoroughly at all.

On Linux there is for all intents and purposes one implementation.

Where is this implementation?  Is it available by default?  I only saw
that there's a Debian package for ZFS which involves fuse.


I've been using ZoL since back when the only POSIX layer
implementation was from KQ Infotech, which was a rather early alpha
grade bodge, and I never saw any forward incompatibility, nor have I
ever lost any data to ZFS, which is more than I can say for most other
file systems.

A very long time ago, I lost data with xfs once.  It probably was my own
fault, using some mount parameters wrongly.  That taught me to be very
careful with file systems and to prefer file systems that are easy to
use, that don't have many or any parameters that need to be considered
and basically just do what they are supposed to right out of the box.

No file system is immune from user error. If you want optimal performance that's a whole different issue, and there are a lot of things you have to tweak on most of them to achieve that, especially on RAID.

Does ZFS do that?  Since it's about keeping the data safe, it might have
a good deal of protection against user errors.

I don't think it's possible to guard against user errors. If you're concerned about user errors, get someone else to manage your machines and not give you the root password.

And perhaps the next day after I switch to ZFS, a new
feature comes out which would require me to re-create the volumes and to
copy the data over yet again, at least if I wanted to use that feature.

You're spreading misinformed FUD. There are no "features" that could
be added that might require you to rebuild the pool. An existing pool
can always be upgraded. There is no way to downgrade, though, so make
sure you really want those extra features.

See http://open-zfs.org/wiki/Features:

"SA based xattrs

Improves performance of linux-style (short) xattrs [...]

Requires a disk format change and is off by default [...]

Note that SA based xattrs are no longer used on symlinks as of Aug 2013
until an issue is resolved."

What's the difference between "a disk format change" and "rebuilding the
pool"?  And how could you predict that nothing changes requiring a
rebuild or a format change when there are issues that apparently haven't
been resolved in almost a year now and features that haven't been
implemented yet?

You don't have to rebuild a pool. The existing pool is modified in place and that usually takes a few seconds. Typically the pool version headers get a bump, and from there on ZFS knows it can put additional metadata in place.

Something similar happens when you toggle deduplication on a pool: it puts the deduplication hash table headers in place. Even if you remove the volume that was deduplicated and don't have any deduplicated blocks left, the headers remain in place. But that doesn't break anything, and it doesn't require rebuilding the pool.
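The one-way nature of that upgrade can be sketched as a toy feature-flag model (illustrative only; real ZFS feature flags are per-feature GUIDs recorded in the pool labels, not a simple set):

```python
class Pool:
    """Toy model of ZFS feature flags: enabling is in-place and one-way."""
    def __init__(self):
        self.enabled = set()        # persisted in the pool's on-disk metadata

    def upgrade(self, features):
        # in-place metadata update: takes seconds, no rebuild, no data copy
        self.enabled |= set(features)

    def import_with(self, software_supports):
        # an older implementation refuses a pool with features it doesn't know
        unknown = self.enabled - set(software_supports)
        if unknown:
            raise RuntimeError(f"cannot import, unsupported features: {unknown}")
        return "imported"

pool = Pool()
pool.upgrade({"com.delphix:hole_birth"})    # enabled in place, pool intact
assert pool.import_with({"com.delphix:hole_birth"}) == "imported"

# "downgrade": software without the feature can no longer import the pool
try:
    pool.import_with(set())
except RuntimeError as e:
    print(e)
```

That asymmetry is the whole trade-off from the paragraphs above: upgrading never forces a rebuild, but once a feature is enabled there is no way back, so you should want the feature before turning it on.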

You might enable a new feature and find that it causes
problems, but you can't downgrade ...

You don't have to _use_ a feature just because it's available. And features that are that broken are rare, and non-critical.

It seems that ZFS isn't sufficiently mature yet to use it.  I haven't
learned much about it yet, but that's my impression so far.

As I said above - you haven't done your research very thoroughly.

I haven't, yet all I've been reading so far makes me very careful.  When
you search for "zfs linux mature", you find more sources saying
something like "it is not really mature" and not many, if any, that
would say something like "of course you should use it, it works

There's a lot of FUD out there, mostly coming from people who have neither tried it nor know what they are talking about. Whatever next? "It must be true because I read it on the internet"?

And how about ZFS with JBOD on a hardware RAID controller?

That is the recommended way to use it (effectively do away with the
RAID part of the controller).

That's something else I never tried.  What if I make a JBOD and then
connect the disks to "normal" on-board SATA ports?  Will they be
readable just like they were never connected to a RAID controller?

That depends on what the RAID controller does to the disks. It is another big unknown that is a reason why RAID controllers are best avoided. Somebody recently got help on the ZFS mailing list recovering from just such a situation, where their new RAID controller clobbered the front of their disks.

Since ZFS uses variable width stripes, every write is always a single

Which may be completed or not?  And what about the on-disk caches and
power failures?

That's what barriers and the sync settings on each file system are
for. As with any FS, any commits since the last barrier call will be
lost. Everything up to the last barrier call is guaranteed to be safe,
unless your disk or controller lies about having committed things. This
is not ZFS specific and applies to any FS.
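The barrier guarantee can be sketched with a toy write cache (a hypothetical model; real barriers are cache-flush/FUA commands the kernel issues to the disk):

```python
class Disk:
    """Toy model: writes sit in a volatile cache until a barrier flushes them."""
    def __init__(self):
        self.cache = []             # volatile write cache, lost on power cut
        self.platter = []           # durable media

    def write(self, data):
        self.cache.append(data)

    def barrier(self):
        # flush everything written so far; writes after this call
        # are not covered until the next barrier
        self.platter.extend(self.cache)
        self.cache.clear()

    def power_failure(self):
        self.cache.clear()          # unflushed, in-flight writes are gone

d = Disk()
d.write("A"); d.write("B")
d.barrier()                         # A and B are now guaranteed durable
d.write("C")                        # committed since the last barrier...
d.power_failure()                   # ...and therefore lost
assert d.platter == ["A", "B"]
```

A disk that "lies about having committed things" corresponds to a `barrier()` that returns before the data actually reaches the platter, which is exactly why no file system can save you from such hardware.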

IIRC, when I had the WD20EARS in software RAID-5, I got messages about
barriers being disabled.  I tried to find out what that was supposed to
tell me, and it didn't seem to be too harmful, and there wasn't anything
I could do about it anyway.  What if I use them as JBOD with ZFS and get
such messages?

No idea, I don't see any such messages. It's probably a feature of your RAID controller driver.

It seems that FIS would have to be supported by every HBA because it's
the second layer of the SATA protocol.  And I thought that NCQ is a
feature of the disk itself, which either supports it or not.  Why/how
would a HBA or PMP interfere with NCQ?

Not all HBAs support NCQ, and not all support FIS. They are specific
features that have to be implemented on the HBA and PMP and HDD.

So the wikipedia article about SATA is wrong?  Or how does that work
when any of the involved devices does not support some apparently
substantial parts of the SATA protocol?

You misunderstand. When I say "FIS" I am talking about FIS based switching, as opposed to command based switching. Perhaps a lack of clarity on my part, apologies for that.

If your HBA/PMP/HDD do support FIS+NCQ (the models I mentioned do),
then the bandwidth is effectively multiplexed on demand. It works a
bit like VLAN tagging. A command gets issued, but while that command
is completing (~8ms on a 7200rpm disk) you can issue commands to other
disks, so multiple commands on multiple disks can be completing at the
same time. As each disk completes the command and returns data, the
same happens in reverse.
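A back-of-the-envelope model of that multiplexing (the numbers are illustrative, reusing the ~8 ms per-command figure above):

```python
# With command-based switching the PMP effectively talks to one disk at
# a time, so commands to different disks serialize.  With FIS-based
# switching they overlap, much like tagged frames sharing one link.
disks = 4
service_ms = 8.0                 # roughly one revolution of a 7200 rpm disk

serialized = disks * service_ms  # command-based: each disk waits its turn
overlapped = service_ms          # FIS-based: all four seeks run concurrently

print(serialized, overlapped)    # 4 commands: 32 ms vs ~8 ms wall time
```

The real gain depends on queue depth and seek patterns, but the shape of the win is the same: seek latency on one disk no longer blocks the link for the others.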

And there aren't any (potential) problems with the disks?  Each disk
would have to happily wait around until it can communicate with the HBA
again.  SCSI disks were designed for that, but SATA disks?

The PMP takes care of it. It works, and it works well. NCQ on most
SATA SSDs works in reverse this way because most of the time the disk
is faster than the SATA port.

You said before that the disks are slower than the port because they
spend so much time seeking.  Does the PMP have some cache built in so
that the disks don't need to wait?  Or are the disks designed to wait?

I'm not sure what the implementation details are, but there is no difference between what happens with a PMP and without one - the SATA controller might not always be able to receive a request that has been queued, so the disk has to "wait", e.g. if there's an interrupt storm going on, or there is another bottleneck in the system (e.g. two 6 Gbit SATA ports behind a single PCIe lane).

Well, yes, the disk has failed when it doesn't return data reliably, so
I don't consider that as a problem but as a desirable feature.

What does ZFS do?  Continue to use an unreliable disk?

Until the OS kernel's controller driver decides the disk has stopped
responding and kicks it out.

As far as I've seen, that doesn't happen.  Instead, the system goes
down, trying to access the unresponsive disk indefinitely.

I see a disk get kicked out all the time. Most recent occurrence was 2 days ago.

Hence why TLER is still a useful feature: you don't want your
application to end up waiting for potentially minutes when the data
could be recovered and repaired in a few seconds, if only the disk
would give up and return an error in a timely manner.

So you would be running ZFS on unreliable disks, with the errors being
corrected and going unnoticed, until either, without TLER, the system
goes down or, with TLER, until the errors aren't recoverable anymore and
become noticeable only when it's too late.

"zpool status" shows you the errors on each disk in the pool. This should be monitored, along with regular SMART checks. Using ZFS doesn't mean you no longer have to monitor for hardware failure, any more than you can skip monitoring for a failed disk in a hardware RAID array.
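That monitoring can be as simple as scraping the per-device error counters out of the "zpool status" output (the sample below is a typical layout, trimmed for illustration; column order can vary between versions, so treat this as a sketch):

```python
SAMPLE = """\
  NAME        STATE     READ WRITE CKSUM
  tank        ONLINE       0     0     0
    mirror-0  ONLINE       0     0     0
      sda     ONLINE       0     0     3
      sdb     ONLINE       0     0     0
"""

def failing_devices(status: str):
    """Yield (device, read, write, cksum) for rows with a nonzero counter."""
    for line in status.splitlines()[1:]:        # skip the header row
        name, _state, r, w, c = line.split()
        r, w, c = int(r), int(w), int(c)
        if r or w or c:
            yield name, r, w, c

# sda has accumulated 3 checksum errors -> worth a SMART check
assert list(failing_devices(SAMPLE)) == [("sda", 0, 0, 3)]
```

In practice you would feed this the output of `zpool status` from a cron job and alert on any nonzero row, alongside the SMART checks mentioned above.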

Or how unreliable is a disk that spends significant amounts of time on
error correction?

Exactly - 7 seconds is about 840 read attempts. If the sector read
failed 840 times in a row, what are the chances that it will ever

Isn't the disk supposed not to use the failed sector once it has been
discovered, meaning that the disk might still be useable?

When a sector becomes unreadable, it is marked as "pending". Read attempts on it will return an error. The next write to it will cause it to get reallocated from the spare sectors the disk comes with. As far as I can tell, some disks try to re-use the sector when a write for it arrives, and check whether the data sticks within the ability of the sector's ECC to recover it. If it sticks, the sector is kept; if it doesn't, it is reallocated.
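That lifecycle can be sketched as a small state machine (my reading of typical firmware behaviour, not any vendor's documented algorithm); the 840 figure quoted above is just 7 s divided by one ~8.3 ms revolution of a 7200 rpm disk:

```python
REV_MS = 60_000 / 7200                   # one revolution: ~8.33 ms
assert round(7000 / REV_MS) == 840       # ~840 retry opportunities in 7 s

class Sector:
    """healthy -> pending (on read failure) -> kept or reallocated on write."""
    def __init__(self):
        self.state = "healthy"

    def read(self):
        if self.state == "pending":
            raise IOError("pending sector: reads return an error")

    def write(self, sticks: bool):
        if self.state == "pending":
            # some firmware re-tests the sector on write: if the data
            # sticks within the sector ECC's ability to recover it, the
            # sector is kept; otherwise a spare sector is mapped in
            self.state = "healthy" if sticks else "reallocated"

s = Sector()
s.state = "pending"                      # an unrecoverable read marked it
s.write(sticks=False)
assert s.state == "reallocated"          # spare sector mapped in
```

So yes, the disk remains usable after reallocation; the thing to watch is the reallocated-sector count in SMART, because a steadily growing count means the spares are being consumed.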

HGST are one exception to the rule - I have a bunch of their 4TB
drives, and they only make one 4TB model, which has TLER. Most other
manufacturers make multiple variants of the same drive, and most are
selectively feature-crippled.

You seem to like the HGST ones a lot.  They seem to cost more than the
WD reds.

I prefer them for a very good reason:

Those guys don't use ZFS.  They must have very good reasons not to.

I don't know what they use.

20 years of personal experience also agrees with their findings.

My experience agrees with their findings, too, only it's not tied to a
particular brand or model other than that Seagate disks failed
remarkably often and that I should never have bought Maxtor.

HGST has been bought by WD, though.  We can only hope that they will
continue to make outstanding disks.

We can but hope.


Xen-users mailing list


