
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Kuba <kuba.0000@xxxxx>
  • Date: Sat, 05 Jul 2014 12:33:42 +0200
  • Delivery-date: Sat, 05 Jul 2014 10:34:19 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 2014-07-05 04:57, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

On 2014-07-04 19:11, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

"Rollback" doesn't sound very destructive.

For me "rollback" always meant "revert to some previous state" and for
me it sounds very destructive - at least for the "current state" from
which you are reverting.

It still doesn't sound destructive.

How can a file system protect you from executing a destructive
operation?

It can try by warning you.

Does "rm" sound destructive or try to warn you? It just does what you
tell it to do.

It's not a file system and has options to warn you.  The options aren't
enabled by default because that wouldn't make much sense.  For a file system,
it would make sense to get a warning like "this will destroy your
current data" when you issue a command that would perform a rollback and
to have an option to disable the warning.

I believe one should know exactly what hitting [enter] is going to do when the line you're typing on starts with a #.

Snapshots protect you from most user errors. Off-site backups protect
you from su errors. To some extent.

Off-site would be good, but it's a hassle because I'd have to carry the
disks back and forth.

You can do that over the network.

Unfortunately, that is entirely not feasible.

And it's always pros vs cons. It's your data, your requirements, your
decisions and your responsibility.

Yes, and I can only do so much.

And how are snapshots better than copying the
data?

Snapshots are just snapshots; making them does not copy your data
(well, in fact, ZFS is a COW file system, so making a snapshot may
result in actually copying your data later on, if it's needed, but
it's not copying as in "making a backup"). Replicating a snapshot
results in the creation of another dataset identical to the original
snapshot. It's just one more way of making full or incremental
backups.
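A minimal sketch of what that looks like, assuming a made-up dataset called "tank/data":

  # zfs snapshot tank/data@2014-07-05                          (instant; nothing is copied)
  # zfs list -t snapshot                                        (the snapshot shows up here)
  # zfs send tank/data@2014-07-05 | zfs receive backup/data    (replication - this is where an actual copy is made)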

So it's making a backup and not making a backup?  What are snapshots
good for when I can't restore from them, i.e. use them for backups?

Snapshots are not backups. I believe that holds true for anything that lets you make a snapshot.

What if I need to access a file that's in the snapshot:  Do I
need to restore the snapshot first?

Usually you can "cd .zfs" directory, which contains subdirectories
named after your snapshots, and inside that directories you have
complete datasets just like the ones you took the snapshots of. No
rollback/restoring/mounting is necessary.
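For example, assuming a dataset mounted at /tank/data and a snapshot named "daily" (names are made up):

  # ls /tank/data/.zfs/snapshot/daily/
  # cp /tank/data/.zfs/snapshot/daily/report.ods /tank/data/

Restoring a single file is just copying it out of the read-only snapshot directory.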

And that also works when the file system the snapshot was created from
doesn't exist anymore, or when the disks with the FS the snapshot was
made from have become inaccessible, provided that the snapshot was made
to different disks?

Oversimplifying: yes.

any other FS. I have the same feeling about ZFS as Gordan - once you
start using it, you cannot imagine making do without it.

Why exactly is that?  Are you modifying your storage system all the time
or making snapshots all the time?

Yes, I take snapshots all the time. This way it's easy for me to
revert VMs to previous states, clone them, etc. Same goes with my
regular data. And I replicate them a lot.

Hm, what for?  The VMs I have are all different, so there's no point in
cloning them.  And why would I clone my data?  I don't even have the
disk capacity for that and am glad that I can make a backup.

I tend to clone "production" VMs before I start fiddling with them, so that I can test potentially dangerous ideas without any consequences. Clones are "free" - they only start using more space when you introduce some difference between the clone and the original dataset. You can always 'promote' them so they become independent from the original dataset (using more space as required). Cloning is just a tool that you might or might not find useful.
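Roughly, with made-up dataset names, that workflow looks like this:

  # zfs snapshot tank/vm/web@clean
  # zfs clone tank/vm/web@clean tank/vm/web-test    (instant, takes almost no extra space)
    ... experiment on web-test ...
  # zfs destroy tank/vm/web-test                    (throw the experiment away)

or, if the experiment turns out to be worth keeping:

  # zfs promote tank/vm/web-test                    (makes the clone independent of the origin snapshot)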

I'm not saying you will feel about ZFS as I do after you try it
out. It presents you a certain set of features, advantages and
disadvantages and it is up to you, and you only, to decide whether you
can benefit from it or not. All I'm saying is that I personally
believe ZFS is worth taking into consideration.

So far, the advantages I'm seeing that would be a benefit are
checksumming and using on-disk caches for writing.  The latter doesn't
seem to be overly relevant.  That means a lot of learning and
experimentation and uncertainties for that one benefit.

I suppose it's all relative. A couple of years ago I switched to FreeBSD (unknown to me before) for my storage VMs only because it had ZFS, which I had found to be the only solution to the problems I had at that time. That really meant a lot of learning, experimentation and uncertainties. It paid off for me. I'm not saying it will pay off for you. All I'm saying is 'look, here's this ZFS thing, there's a chance you might find it interesting'. By no means am I saying 'this is ZFS, it will solve all your problems and you have to use it'.

So you would be running ZFS on unreliable disks, with the errors being
corrected and going unnoticed, until either, without TLER, the system
goes down or, with TLER, the errors aren't recoverable anymore and
become noticeable only when it's too late.

ZFS tells you it had problems ("zpool status"). ZFS can also check the
entire pool for defects ("zpool scrub"; you should do that
periodically).
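For reference, with a made-up pool name:

  # zpool scrub tank        (reads every block in the pool and verifies it against its checksum)
  # zpool status -v tank    (shows per-device read/write/checksum error counters and lists
                             any files affected by unrecoverable errors)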

You're silently losing more and more redundancy.

I'm not sure what you mean by losing redundancy.

You don't know whether the data has been written correctly before you
read it.  The more errors there are, the more redundancy you lose,
because you have more data that can be read from only a part of the
disks.  If there is an error on another disk with that same data, you
don't know until you try to read it and perhaps find out that you can't.
How many errors it takes for that data depends on the level of
redundancy.

I don't understand your point here. Do you know that all your data has been written correctly with any other form of RAID, without reading it back?

How do you know when
a disk needs to be replaced?

ZFS tells you it had IO or checksum failures. It may also put your
pool into a degraded state (with one or more disks disconnected from
the pool) with reduced redundancy (just like a regular RAID would
do). SMART also tells you something wrong has happened (or is going
to, probably). And, additionally, when you replace a disk and resilver
(ZFS term for rebuilding) the pool, you know whether all your data was
read and restored without errors.
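A rough sketch, with made-up device names:

  # zpool replace tank da3 da7    (swap the failing disk da3 for the new disk da7; resilvering starts)
  # zpool status tank             (shows resilver progress; anything that could not be
                                   reconstructed from the remaining redundancy is reported here)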

And how do you know when to replace a disk?  When there's one error or
when there are 50 or 50000 or when the disk has been disconnected?

I believe it's up to you to interpret the data you're presented with and make the right decision. I really wish I could formulate a condition that evaluates to true or false telling me what I should do with a disk.

Does ZFS maintain a list of bad sectors which are not to be used again?

Don't know, but never heard of it. I always thought it's the storage
device's job. Does any file system do that?

I don't know.  It would make sense because there's no telling what the
disk is doing --- the disk might very well re-use a bad sector and find,
just at the time you want to read the data, that it's not readable
again.  The disk might continue to disagree with ZFS and insist on
re-using the sector.  Perhaps it figures that it can use the sector
again after enough attempts to recover from the error.

The error might not even be noticed with other file systems, other than
as a delay due to error correction maybe.  That other file system would
deliver corrupt data or correct data, there's no way to know.  Disks
aren't designed for ZFS in the first place.

It's also quite difficult to corrupt the file system
itself:
https://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing

It shows that there are more checksum errors after the errors were
supposedly corrected.

Not "supposedly". The increasing number only shows the count of
encountered checksum errors. If ZFS could not correct the error, it
would say so.

Ok, and why are there more errors after an error was corrected?  Is the
error being reset at some time or kept indefinitely?

First you import the pool, so ZFS reads only the file system's metadata, and some (7) of these reads fail because of wrong checksums. Then the user reads the file contents and ZFS discovers (56) more wrong checksums. That's why the number increases. The status message tells you what to do to clear the counters.
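For reference (pool name is made up):

  # zpool clear tank    (resets the error counters once you've dealt with the cause)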

Using ZFS does not mean you don't have to do backups. File system type
won't make a difference for a fire inside your enclosure:) But ZFS
makes it easy to create backups by replicating your pool or datasets
("zfs send" lets you create full or incremental backups) to another
set of disks or machine(s).

As another ZFS or as files or archives or as what?  I'm using rsync now,
and restoring a file is as simple as copying it from the backup.

Typically as another ZFS dataset. Replicating ZFS snapshots has one
big advantage for me (besides checksumming, so you know you've made
your backup correctly) - it's atomic, so it either happens or not. It
doesn't mean it's supposed to replace rsync, though. It depends on the
task at hand.

A dataset?  Does it transfer all the data or only what has changed (like
rsync does)?  The task at hand would be to make a backup of my data,
over network, from which it's easy to restore.

I'm afraid I won't be able to explain this in a better way than it is already explained in the docs.
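Just to sketch it (host and dataset names are made up, and @monday is assumed to have been replicated earlier): an incremental send transfers only the blocks that changed between the two snapshots, much like an incremental backup:

  # zfs snapshot tank/data@tuesday
  # zfs send -i tank/data@monday tank/data@tuesday | ssh backuphost zfs receive backup/data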

http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/

Those guys don't use ZFS.  They must have very good reasons not to.

They do:
http://www.youtube.com/watch?v=c5ASf53v4lI
http://zfsonlinux.org/docs/LUG11_ZFS_on_Linux_for_Lustre.pdf

And I believe they have lots of good reasons to do so :)

That's some laboratory experimenting with ZFS.  Backblaze uses ext4,
though ZFS would seem to be a very good choice for what they're doing.
How can they store so much data without checksumming, without using ECC
RAM and not experience a significant amount of data corruption?

That's what I found about Backblaze and ZFS (22-07-2011):

We are intrigued by it, as it would replace RAID & LVM as well. But
native ZFS is not available on Linux and we're not looking to switch
to OpenSolaris or FreeBSD, as our current system works great for
us. For someone starting from scratch, ZFS on one of these OSes might
work and we would be interested to know if someone tries it. We're
more likely to switch to btrfs in the future if anything.

Yes, I've read that same statement.  And their actual reasons not to use
ZFS now, three years later with their most recent storage server
upgrade, must be really good because they could cut their costs in half
with it, provided that they're serious about keeping your data safe.

http://www.smallnetbuilder.com/nas/nas-features/31541-how-to-build-a-cheap-petabyte-server-revisited

That's just two organizations with similarly sized storage and
different approaches. One uses standard solutions, the other one
ported ZFS to Linux, so they could use it.

I find it interesting that a company which is concerned more about its
costs than anything else doesn't go for a solution, now easily
available, that can cut their costs in half, and that an institution
which doesn't appear to be overly concerned with costs goes for that
very same solution even though it's not easily available at all.

It's up to you to define your goals, solutions and level of
assurance. My personal approach is "hope for the best, plan for the
worst".

The problem is that you can plan whatever you want and things turn out
otherwise regardless.  My grandma already knew that.

The corruption wouldn't go unnoticed because they won't be able to
decrypt the data. They'd have to store everything at least twice, and
if they could cut their costs in half or less by not having to do that
through simply using ZFS, why wouldn't they?

Data redundancy is not implied by ZFS itself. You either want
redundancy or not, ZFS is just one way of providing it.

They are using redundancy, and since the kind of redundancy they're
using cannot correct silent data corruption, they must store all their
data not only redundantly but at least twice --- provided that they're
serious about keeping your data safe.  If they don't do that, they must
be having quite a few cases in which they cannot decrypt the data or
deliver corrupted data, considering the amounts of data they're dealing
with.  They're not even using ECC RAM ...

What is the actual rate of data corruption or loss prevented or
corrected by ZFS due to its checksumming in daily usage?

I have experienced data corruption due to hardware failures in the
past.

Hardware failures like?

The typical ones: bad sectors, failed flash memory banks, failed RAM modules.


Once is often enough for me and it happened more than once. If I
hadn't done the checksumming myself, I probably wouldn't even have
known about it. Since I started using it, ZFS detected data corruption
several times for me (within a few years). But I don't own a data
center :) Actual error rates might depend on your workload, hardware,
probabilities and lots of other things. Here's something you might
find interesting:

Sure, the more data about failures detected by checksumming we would
collect, the more we might be able to make conclusions from it.  Since
we don't have much data, it's still interesting to know what failure
rates you have seen.  Is it more like 1 error in 50TB read or more like
1 error in 500TB or like 20 in 5TB?

I don't count them; I'd say 1 in 10TB. But that's not professional, research-grade statistical data, so you shouldn't base decisions on it.

That there's a statistical rate of failure doesn't mean that these
statistical failures are actually seen in daily applications.

http://www.zdnet.com/blog/storage/dram-error-rates-nightmare-on-dimm-street/638

Yes, I've seen that.  It's for RAM, not disk errors detected through ZFS
checksumming.

And RAM has nothing to do with the data on the disks.

Kuba

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

