
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Wed, 02 Jul 2014 07:46:50 +0100
  • Delivery-date: Wed, 02 Jul 2014 06:46:57 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 07/01/2014 05:24 PM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/29/2014 06:07 AM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/28/2014 12:25 PM, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

and ZFS doesn't increase the number of SAS/SATA ports you have.

No, but it does deprecate the RAID and caching parts of a controller,

Why does it deprecate them?

Because its RAID is far more advanced, and it makes far better use of
the caches built into the disks.

When solutions A and B for problem X are available, that doesn't mean
that either solution deprecates the other, be it "more advanced" or
not.

ZFS has its advantages, and it would seem a bad idea to use it with
RAID.

That isn't true. While it is better to use it with bare disks, using it on top of RAID is still better than using something else, because you at least get to know about errors that creep in, even if ZFS can no longer fix them for you. Fixing is better than merely knowing about errors, but even just knowing that there is an error is a valuable improvement on blissful ignorance.
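
As a minimal sketch of what that detection looks like in practice (the pool name "tank" below is only a placeholder), a scrub walks every block and verifies it against its checksum, and the status output reports what was found:

    # Read and verify every block in the pool against its checksum
    zpool scrub tank

    # Per-device read/write/checksum error counts, plus any files
    # that could not be repaired
    zpool status -v tank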

It's tempting to try it out, and I really like the checksumming
it does, and it's also confusing: There's (at least) ZFS and OpenZFS,
and Debian requires you to use fuse if you want ZFS, adding more
complexity.

You haven't done your research thoroughly enough.

OpenZFS is a collaborative project that works on consistency and code sharing between different open source implementations of ZFS, e.g. ZFS-on-Linux, FreeBSD and Illumos.

You only need fuse for fuse-zfs, which is now deprecated; you probably shouldn't be using it unless you are running on a 32-bit platform or are attempting to recover a damaged pool that won't import with other implementations.

On Linux there is for all intents and purposes one implementation.
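
If you want to confirm which implementation you are actually running, something along these lines will show whether ZFS is loaded as a native kernel module rather than via FUSE (the package query is a Debian-style example):

    # A native ZoL install exposes a zfs kernel module
    modinfo zfs | head -n 5
    lsmod | grep zfs

    # What the distribution has installed
    dpkg -l | grep -i zfs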

There's also uncertainty about changes currently being made
to ZFS, which makes me wonder if my data might become unreadable after a
software update or a software change when I install the disks in a
different computer --- I've read reports of that happening, though it
shouldn't.

If you've read about it, I'd like to know where. I've been using ZoL since back when the only POSIX layer implementation was the one from KQ Infotech, which was a rather early alpha-grade bodge, and I never saw any forward incompatibility, nor have I ever lost any data to ZFS, which is more than I can say for most other file systems.

There is no way the on-disk format should be different for any given pool version across ZFS implementations.

And perhaps the next day after I switch to ZFS, a new
feature comes out which would require me to re-create the volumes and to
copy the data over yet again, at least if I wanted to use that feature.

You're spreading misinformed FUD. There are no "features" that could be added that might require you to rebuild the pool. An existing pool can always be upgraded. There is no way to downgrade, though, so make sure you really want those extra features.
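
To illustrate (hedged on ZFS-on-Linux behaviour, with "tank" again a placeholder pool name), the upgrade path looks like this - and note that it is one-way:

    # List the pool versions / feature flags this implementation supports
    zpool upgrade -v

    # Show which feature flags are enabled or active on an existing pool
    zpool get all tank | grep feature@

    # One-way: enables all supported features; older software may then
    # refuse to import the pool
    zpool upgrade tank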

It seems that ZFS isn't sufficiently mature to use yet.  I haven't
learned much about it, but that's my impression so far.

As I said above - you haven't done your research very thoroughly.

And how about ZFS with JBOD on a hardware RAID controller?

That is the recommended way to use it (effectively doing away with the RAID part of the controller).
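
As a sketch (the by-id paths and the RAIDZ2 layout are only examples), you expose the disks individually and let ZFS own them:

    # Hand ZFS the whole disks, addressed by stable by-id names
    zpool create tank raidz2 \
        /dev/disk/by-id/ata-DISK1 \
        /dev/disk/by-id/ata-DISK2 \
        /dev/disk/by-id/ata-DISK3 \
        /dev/disk/by-id/ata-DISK4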

Since ZFS uses variable width stripes, every write is always a single
operation.

Which may be completed or not?  And what about the on-disk caches and
power failures?

That's what barriers and the sync settings on each file system are for. As with any FS, any writes issued since the last barrier call may be lost. Everything up to the last barrier call is guaranteed to be safe, unless your disks or controllers lie about having committed things. This is not ZFS-specific and applies to any FS.
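
On the ZFS side the relevant knob is the per-dataset sync property (the dataset name below is a placeholder):

    # How synchronous writes are currently honoured
    zfs get sync tank/data

    # standard = honour sync requests (default), always = treat every
    # write as synchronous, disabled = ignore sync requests (unsafe
    # across power loss)
    zfs set sync=standard tank/data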

Aren't they getting in each other's way, filling up the bandwidth of
the port?

If your HBA/PMP/HDD don't support FIS+NCQ, then yes.

It seems that FIS would have to be supported by every HBA because it's
the second layer of the SATA protocol.  And I thought that NCQ is a
feature of the disk itself, which either supports it or not.  Why/how
would a HBA or PMP interfere with NCQ?

Not all HBAs support NCQ, and not all support FIS-based switching. These are specific features that have to be implemented in the HBA, the PMP and the HDD.
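
A rough way to check what you actually have from Linux (device names are examples, and the exact dmesg wording varies by driver):

    # NCQ queue depth the kernel negotiated for a disk (1 means no NCQ)
    cat /sys/block/sda/device/queue_depth

    # Drive-side NCQ capability and queue depth
    hdparm -I /dev/sda | grep -i queue

    # AHCI controller capability flags logged at boot
    dmesg | grep -i 'ahci.*flags'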

If your HBA/PMP/HDD do support FIS+NCQ (the models I mentioned do),
then the bandwidth is effectively multiplexed on demand. It works a
bit like VLAN tagging. A command gets issued, but while that command
is completing (~8ms on a 7200rpm disk) you can issue commands to other
disks, so multiple commands on multiple disks can be completing at the
same time. As each disk completes the command and returns data, the
same happens in reverse.

And there aren't any (potential) problems with the disks?  Each disk
would have to happily wait around until it can communicate with the HBA
again.  SCSI disks were designed for that, but SATA disks?

The PMP takes care of it. It works, and it works well. NCQ on most SATA SSDs works the same way in reverse, because most of the time the disk is faster than the SATA port.

rest is WD SATAs --- and I'm starting to suspect that the RAID
controller in the server doesn't like the WD disks at all, which causes
the crashes.  Those disks weren't made at all for this application.

This is another problem with clever controllers, especially hardware
RAID. RAID controllers typically wait around 8-9 seconds for the disk
to return the data. If it doesn't, they kick the disk out of the
array.

Well, yes, the disk has failed when it doesn't return data reliably, so
I don't consider that a problem but a desirable feature.

What does ZFS do?  Continue to use an unreliable disk?

Until the OS kernel's controller driver decides the disk has stopped responding and kicks it out. That is why TLER is still a useful feature - you don't want your application to end up waiting for potentially minutes when the data could be recovered and repaired in a few seconds, if only the disk would give up and return an error in a timely manner.
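
On drives that expose SCT Error Recovery Control, the TLER-style timeout and the kernel's own give-up timer can be checked and set roughly like this (device name is an example; the smartctl values are in tenths of a second):

    # Read, then set, the drive's error recovery limit to 7.0s read/write
    smartctl -l scterc /dev/sda
    smartctl -l scterc,70,70 /dev/sda

    # The Linux block layer's own command timeout, in seconds
    cat /sys/block/sda/device/timeout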

Or how unreliable is a disk that spends significant amounts of time on
error correction?

Exactly - at 7200rpm the platter passes under the head about 120 times a second, so 7 seconds is roughly 840 read attempts. If the sector read failed 840 times in a row, what are the chances that it will ever succeed?

HGST are one exception to the rule - I have a bunch of their 4TB
drives, and they only make one 4TB model, which has TLER. Most other
manufacturers make multiple variants of the same drive, and most are
selectively feature-crippled.

You seem to like the HGST ones a lot.  They seem to cost more than the
WD reds.

I prefer them for a very good reason:
http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/

20 years of personal experience also agrees with their findings.

Gordan

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
