
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Wed, 09 Jul 2014 08:39:25 +0100
  • Delivery-date: Wed, 09 Jul 2014 07:40:07 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 2014-07-09 03:29, lee wrote:

When every file occupies at least 4k because that's the block size the
FS is using, you can waste a lot of space.

ZFS cannot use stripes smaller than (sector size) + (redundancy).

For example, if you use disks with 4KB sectors and you write a 10-byte
file on RAIDZ2 (n+2 redundancy, similar to RAID6), that will use 3
sectors (one for the data plus two for the n+2 redundancy), i.e. 12KB.
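If it helps, here is a rough sketch of that arithmetic in Python. This
is not the real ZFS allocator (which also pads allocations and handles
wide stripes differently); the 6-data-disk stripe width is just a
made-up example:

import math

def min_raidz_allocation(file_bytes, sector_bytes=4096, parity=2, data_disks=6):
    # Approximate space a small file occupies on a RAIDZ vdev.
    # parity=2 corresponds to RAIDZ2; data_disks (non-parity disks
    # per stripe) is a hypothetical example value.
    data_sectors = max(1, math.ceil(file_bytes / sector_bytes))
    stripes = math.ceil(data_sectors / data_disks)
    total_sectors = data_sectors + stripes * parity
    return total_sectors * sector_bytes

print(min_raidz_allocation(10))     # 10-byte file -> 12288 bytes (1 data + 2 parity sectors)
print(min_raidz_allocation(4096))   # 4KB file     -> 12288 bytes (1 data + 2 parity sectors)
print(min_raidz_allocation(40960))  # 40KB file    -> 57344 bytes (10 data + 4 parity sectors)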

Variable stripe width is there to improve write performance of partial
writes.

And the checksums go into the same sector? So for writing a file that's
4k, two sectors would be used, plus redundancy?

If that is so, wouldn't the capacity appear to be increased or to be
variable with ZFS, depending on file size?  I'm confused now ...

I don't know the details of the on-disk format - you'll
have to read the code for that.

The biggest advantage would be checksumming.  I'd be trading that
against ease of use and great complexity.

Not to mention resistance to learning something new.

Not to mention the risks involved ...

Perhaps our experiences differ - mine shows that lying and dying disks
pose a sufficiently high risk of data loss that a traditional RAID and
file system cannot be trusted with keeping the data safe.

That doesn't eliminate the risks.  Perhaps I've been lucky --- the more
I learn about it, the more I think I should do something.

The biggest problem is that a lot of the time you wouldn't even
be aware that there is a problem with traditional RAID. Same
as without ECC memory: it's not just that the errors are
uncorrectable, it's that you don't even know whether the
errors have occurred.

Same way you know with any disk failure - appropriate
monitoring. Surely that is obvious.

It's not obvious at all.  Do you replace a disk when ZFS has found 10
errors?

Do you replace a disk when SMART is reporting 10 reallocated sectors?

No, I'm not using SMART.

Then it seems you have far bigger configuration issues to address
before looking at ZFS. :)

You have to exercise some reasonable judgement there, and apply
monitoring, just like you would with any other disk/RAID.

It's simple with RAID because a disk either fails or it doesn't, and
it's usually just as simple with plain disks. Introducing another
indicator which may mean that a disk has "failed a little" doesn't
make things simpler.

But it does mean that you keep extra redundancy for as long as
possible. By keeping a disk with an increasing number of failed
sectors in the pool, while you are rebuilding onto a new drive
with the failing disk still attached, that disk is still
providing redundancy for all of its surviving sectors.

zfs send produces a data stream that can be applied to another pool
using zfs receive. You can pipe this over ssh or netcat to a different
machine, or you can pipe it to a different pool locally.
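For example, something along these lines (the dataset, snapshot and
host names below are made up; a plain shell pipeline of zfs send into
ssh running zfs receive does the same job):

import subprocess

SNAPSHOT = "tank/data@2014-07-09"   # example snapshot to replicate
REMOTE = "backuphost"               # example receiving machine
TARGET = "backup/data"              # example dataset on the receiving pool

# zfs send <snapshot> | ssh <host> zfs receive -F <dataset>
send = subprocess.Popen(["zfs", "send", SNAPSHOT], stdout=subprocess.PIPE)
recv = subprocess.Popen(["ssh", REMOTE, "zfs", "receive", "-F", TARGET],
                        stdin=send.stdout)
send.stdout.close()                 # so zfs send sees a broken pipe if ssh dies
if recv.wait() != 0 or send.wait() != 0:
    raise RuntimeError("zfs send/receive pipeline failed")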

So I'd be required to also use ZFS on the receiving side for it to make
sense.

Indeed.

That's kinda evil because if I run into problems with ZFS, it might be
good to have used a different file system for backups, and when I have
to restore from the backup, I might get corrupted data because it hasn't
been checksummed.

You can't have it both ways. Either you use ZFS on both ends and take
advantage of its features, such as zfs send, or you do things the hard,
slow and error-prone way with rsync.

They claim[1] that they are currently storing over 100 petabytes and
have restored 6.27 billion files. They are expecting to store another
500 petabytes at another datacenter. That means over a hundred detected
broken files, and if they meet their plan, at least 500, so they must
know.

Mentioning such a thing occurs could be considered bad for business.

There's no good way to hide it because people would notice when
restoring. But they say their software does checksumming; they just
don't say how.

There are many ways to do that. Tripwire does it. If you just wanted
a cross-check, you could have a process on a write-only system that
periodically runs and stores/compares the md5sum of each file, kept
in, say, its extended attributes, against the contents of the file.
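A minimal sketch of that cross-check in Python (the attribute name and
the directory are arbitrary examples; it needs a filesystem mounted
with xattr support on Linux):

import hashlib
import os

XATTR = "user.md5sum"   # arbitrary attribute name for this example

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest().encode()

def check_or_record(path):
    current = md5_of(path)
    try:
        stored = os.getxattr(path, XATTR)
    except OSError:                  # no checksum recorded yet - store one
        os.setxattr(path, XATTR, current)
        return True
    return stored == current         # False means the file no longer matches

for root, _dirs, files in os.walk("/srv/backup"):   # example tree to verify
    for name in files:
        path = os.path.join(root, name)
        if not check_or_record(path):
            print("checksum mismatch:", path)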

What is the actual rate of data corruption or loss prevented or
corrected by ZFS due to its checksumming in daily usage?

According to disk manufacturers' own specifications for their own
disks (i.e. assume it's worse), one unrecoverable error in 10^14 bits
read. This doesn't include complete disk failures.

That still doesn't answer the question ...

If you define what "daily usage" is in TB/day, you will be able to
work out how many errors per day you can expect from the numbers I
mentioned above.
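For instance (the 1 TB/day figure is just an example; substitute your
own):

TB_PER_DAY = 1.0              # example daily read volume
BITS_PER_TB = 8 * 10**12      # 1 TB = 10^12 bytes = 8 * 10^12 bits
URE_RATE = 1e-14              # one unrecoverable error per 10^14 bits read

errors_per_day = TB_PER_DAY * BITS_PER_TB * URE_RATE
print(errors_per_day)         # 0.08, i.e. roughly one expected error every ~12 days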

That would be theoretical numbers.  That so many errors /can/ occur
doesn't mean that they /do/ occur.

You are right, it doesn't. But my SMART attributes on disks seem to
broadly agree, in terms of relation between values such as total
LBAs read and reallocated sector counts (caveat - not all disks have
the total LBAs read value in SMART).

ZFS is not involved in determining
such numbers, either, and there may be more or fewer errors detected
by ZFS than the specification suggests. I'm asking what the rate
actually is.

As I said, from the SMART figures on my many disks the one-in-10^14
figure seems to be in the right ballpark, although some disks are
better than others.

I have disks for which the specification says that the MTBF is one
million hours (or maybe even two million).  That means I would see only
one single failure in over a hundred years, and if I had over a hundred
of such disks, I would only see one failure within a year.  My
experience indicates, and studies with large numbers of disks indicate,
that this is BS and that the actual failure rate is much different.
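(A quick sketch of the arithmetic behind that claim, for reference:)

HOURS_PER_YEAR = 24 * 365.25   # ~8766
MTBF_HOURS = 10**6             # manufacturer's claimed MTBF
N_DISKS = 100

print(MTBF_HOURS / HOURS_PER_YEAR)            # ~114 years between failures for one disk
print(N_DISKS * HOURS_PER_YEAR / MTBF_HOURS)  # ~0.88 expected failures/year across 100 disks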

Indeed, the MTBF figures seem to bear little resemblance to reality,
but the unrecoverable error rate figures do seem to tally with
empirical observations, in my experience.


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

