
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Sun, 06 Jul 2014 17:54:56 +0100
  • Delivery-date: Sun, 06 Jul 2014 16:55:30 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 07/06/2014 04:38 PM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 07/04/2014 06:11 PM, lee wrote:

Thanks!  Sooner or later I'll try it out.  How come there are no
packages in the Debian repos other than the fuse package?

Is this some kind of Debian/Ubuntu brain damage that demands that
everything be pre-chewed and served on a plate via the distro attached
repositories? That's a very solipsistic view.

If you think it's an indication of brain damage to prefer using software
that is included in the distribution you're using, and to wonder why a
particular piece of software isn't included, then it must be brain damage.

Why? My preferred distribution is Enterprise Linux (RedHat, CentOS, Scientific, or derivatives thereof). I maintain one such derivative myself (for ARM, because nobody else did). The EL package set is relatively limited, and even if you include well-known external repositories like EPEL and RPMforge, it is still easy to find relatively well-known packages that are either missing or only available in ancient versions.

IMO, the problem is a distribution teaching its users that what doesn't ship with the distribution might as well not exist. That kind of conditioning is what I am referring to.

Why exactly is that?  Are you modifying your storage system all the time
or making snapshots all the time?

Since snapshots in ZFS are "free" in terms of performance, they are
much more useful for everyday use. They also make incremental backups
easier, because you can use the send/receive commands to transfer only
the delta between two snapshots. Between that and the extra
integrity-preserving features, you end up reaching for backups much
less frequently.

So for example, before I start working on some source code ~/src/test.c,
I make a snapshot, and when I'm unhappy with the result, I revert to
what I made the snapshot of?  What about emails that have been received
in ~/Mail in the meantime?

Don't keep ~/Mail and src on the same volume.
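
A rough sketch of what I mean, with made-up pool/dataset names:

  # separate datasets for mail and source code
  zfs create -p tank/home/mail
  zfs create -p tank/home/src

  # snapshot only the source dataset before experimenting
  zfs snapshot tank/home/src@before-test

  # ... hack on ~/src/test.c ...

  # roll back src if unhappy; tank/home/mail is untouched
  zfs rollback tank/home/src@before-test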

Checksumming is sure good to have, being able to fully use the disk
caches is, too, as well as not wasting space through fixed block sizes.

Fixed block sizes don't waste space on traditional RAID. Variable
block sizes are a performance feature that allows ZFS to work around
the parity RAID problem where partial stripe writes can drop
throughput to 50% of that of a single disk.

When every file occupies at least 4k because that's the block size the
FS is using, you can waste a lot of space.

ZFS cannot use stripes smaller than one data sector plus the redundancy sectors.

i.e. if you use disks with 4KB sectors, and you are writing a 10 byte file on RAIDZ2 (n+2 redundancy, similar to RAID6), that will use 3 sectors (one for the data, plus two for n+2 redundancy), i.e. 12KB.

Variable stripe width is there to improve write performance of partial writes.

The biggest advantage would be checksumming.  I'd be trading that
against ease of use and great complexity.

Not to mention resistance to learning something new.

Not mentioning the risks involved ...

Perhaps our experiences differ - mine shows that lying and dying disks pose a sufficiently high risk of data loss that a traditional RAID and file system cannot be trusted with keeping the data safe.

So you can see how it is not
understandable to me what makes ZFS so great that I wouldn't be able to
do without it anymore.

Then don't use it.

Maybe, maybe not --- learning about it doesn't hurt.

Then you better stop coming up with reasons to not use it. :)

So you would be running ZFS on unreliable disks, with the errors being
corrected and going unnoticed, until either the system goes down
(without TLER), or the errors become unrecoverable and only get
noticed when it's too late (with TLER).

ZFS tells you it had problems ("zpool status"). ZFS can also check
the entire pool for defects ("zpool scrub", which you should run
periodically).
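
For example (the pool name "tank" is just illustrative):

  zpool status -v tank   # health and per-device error counters
  zpool scrub tank       # full verification of all data, e.g. weekly from cron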

You're silently losing more and more redundancy.  How do you know when
a disk needs to be replaced?

Same way you know with any disk failure - appropriate
monitoring. Surely that is obvious.

It's not obvious at all.  Do you replace a disk when ZFS has found 10
errors?

Do you replace a disk when SMART is reporting 10 reallocated sectors? Can you even get to information of that granularity with most hardware RAID controllers?

You have to exercise some reasonable judgement there, and apply monitoring, just like you would with any other disk/RAID.
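
As a sketch (device and pool names are examples):

  # SMART counters, including reallocated and pending sectors
  smartctl -A /dev/sda | egrep 'Reallocated_Sector|Current_Pending'

  # prints "all pools are healthy" when nothing needs attention
  zpool status -x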

Does ZFS maintain a list of bad sectors which are not to be used again?

By the fact that you are asking this question, I dare say you need to
go and read up more on how modern disks work. Modern disks manage
their defects themselves. When a sector fails and cannot be read, they
return an error on the read and mark the sector as pending. The next
time that sector is written, they write the data to one of the spare,
hidden sectors and remap the failed sector's LBA to the new one. There
has been no need for the file system to keep track of physical disk
defects in decades.

That's assuming that the disks reliably do what they are supposed to
do.  Can you guarantee that they always will?

Of course I can't - but I trust ZFS to mitigate the issue by providing several additional layers that increase the chances that the data will not get damaged.

zfs send produces a data stream that can be applied to another pool
using zfs receive. You can pipe this over ssh or netcat to a different
machine, or you can pipe it to a different pool locally.
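
Roughly like this (host and dataset names are made up for the example):

  # initial full copy to another machine
  zfs snapshot tank/data@monday
  zfs send tank/data@monday | ssh backuphost zfs receive backup/data

  # afterwards, send only the delta between two snapshots
  zfs snapshot tank/data@tuesday
  zfs send -i tank/data@monday tank/data@tuesday | \
      ssh backuphost zfs receive backup/data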

So I'd be required to also use ZFS on the receiving side for it to make
sense.

Indeed.

That's some laboratory experimenting with ZFS.  Backblaze uses ext4,
though ZFS would seem to be a very good choice for what they're doing.
How can they store so much data without checksumming, without using ECC
RAM and not experience a significant amount of data corruption?

You are asking the wrong question - how would they know if they are
experiencing data corruption? The vast majority of backups are
write-only. If 4KB of data (one sector) goes bad for every 10TB read,
and only 1% of the backups ever need to be retrieved, then each
petabyte stored results in only 10TB being read back - about one
detected broken file per petabyte of data stored.

They claim[1] that they are currently storing over 100 petabytes and
have restored 6.27 billion files.  They are expecting to store another
500 petabytes at another datacenter.  By that estimate, that's over a
hundred detected broken files already, and at least 500 more if they
meet their plan, so they must know.

Mentioning that such a thing occurs could be considered bad for business.

And I would guess that the proportion of files retrieved is far greater than
1%.  You get unlimited storage for $5/month and are able to retrieve a
particular single file without significant delays.  When you're using
their service, why would you even keep files you don't access frequently
on your own disks?

Because their software only lets you back up files you store on your own disk - last I checked, there are restrictions in place to prevent the system being abused as unlimited cloud storage rather than backup.

At some point, it's cheaper to have them in backups
and to just retrieve them when you need them.  You think you have use
for a NAS or something similar?  Why throw the money at it when you can
have something very similar for $5/month?

How many people have large amounts of data which must be available right
away and couldn't be stored remotely (letting security issues aside)?
Can you store the part of your data which you do not need to have
available right away for a total cost of only $5/month yourself while
that data is readily accessible at any time?

Considering that, the rate of data restored may well be 20%--50% or even
more.  And if there were even a single file they were unable to restore,
their service would have failed.

So how can they afford not to use ECC RAM and to use a file system that
allows for data corruption?


[1]: http://blog.backblaze.com/category/behind-backblaze/

See above - it can only be used with their closed-source backup software, and there are features that get in the way of using the system as plain offline storage, at least from what I remember from the last time I checked. There's also no Linux support, so I don't use it and cannot tell you any more details.

What is the actual rate of data corruption or loss prevented or
corrected by ZFS due to its checksumming in daily usage?

According to disk manufacturers' own specifications for their own
disks (i.e. assume it's worse), one unrecoverable error in 10^14 bits
read. This doesn't include complete disk failures.

That still doesn't answer the question ...

If you define what "daily usage" is in TB/day, you will be able to work out how many errors per day you can expect from the numbers I mentioned above.
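
To put a number on it: 10^14 bits is 12.5TB, so if your daily usage amounts to 12.5TB read per day, the manufacturers' own (probably optimistic) figure works out to roughly one unrecoverable read error per day; at 1.25TB/day, roughly one every ten days.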

Gordan

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

