
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Sat, 05 Jul 2014 12:23:23 +0100
  • Delivery-date: Sat, 05 Jul 2014 11:24:16 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 07/05/2014 03:57 AM, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

W dniu 2014-07-04 19:11, lee pisze:
Kuba <kuba.0000@xxxxx> writes:

"Rollback" doesn't sound very destructive.

For me "rollback" always meant "revert to some previous state" and for
me it sounds very destructive - at least for the "current state" from
which you are reverting.

It still doesn't sound destructive.

Then I dare say you don't understand what it actually means.

How can a file system protect you from executing a destructive
operation?

It can try by warning you.

Does "rm" sound destructive or try to warn you? It just does what you
tell it to do.

It's not a file system, and it has options to warn you.  The options
aren't enabled by default because that wouldn't make much sense.  For a
file system, it would make sense to get a warning like "this will
destroy your current data" when you issue a command that would perform a
rollback, and to have an option to disable the warning.

mkfs.ext4 doesn't ask you "are you sure?" when you tell it to create an FS on a block device that already contains an ext4 FS. Why would you expect anything else to?

If you need that feature you are using the wrong OS.

And how are snapshots better than copying the
data?

Snapshots are just snapshots, making them does not copy your data
(well, in fact, ZFS is a COW file system, so making a snapshot may
result in actually copying your data later on, if it's needed, but
it's not copying as in "making a backup"). Replicating a snapshot
results in creation of another dataset identical to the original
snapshot. It's just one more way of making full or incremental
backups.

So it's making a backup and not making a backup?  What are snapshots
good for when I can't restore from them, i. e. use them for backups?

You can restore from them. But a backup should mean a copy on a different machine, not subject to being lost if the entire machine is lost (e.g. a power surge that fries every component).

What if I need to access a file that's in the snapshot:  Do I
need to restore the snapshot first?

Usually you can "cd .zfs" directory, which contains subdirectories
named after your snapshots, and inside that directories you have
complete datasets just like the ones you took the snapshots of. No
rollback/restoring/mounting is necessary.
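As a sketch of what that looks like in practice (the dataset mount point and snapshot name here are illustrative, not from the thread):

```shell
# A dataset mounted at /tank/data with a snapshot named "monday"
# exposes that snapshot read-only under the hidden .zfs directory:
ls /tank/data/.zfs/snapshot/monday

# Restoring a single file is just a copy, no rollback needed:
cp /tank/data/.zfs/snapshot/monday/important.conf /tank/data/important.conf
```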

And that also works when the file system the snapshot was created from
doesn't exist anymore, or when the disks with the FS the snapshot was
made from have become inaccessible, provided that the snapshot was made
to different disks?

A snapshot cannot exist independently of the filesystem it is a snapshot of. Both live on the same ZFS pool (the equivalent of the RAID array).

any other FS. I have the same feeling about ZFS as Gordan - once you
start using it, you cannot imagine making do without it.

Why exactly is that?  Are you modifying your storage system all the time
or making snapshots all the time?

Yes, I take snapshots all the time. This way it's easy for me to
revert VMs to previous states, clone them, etc. Same goes with my
regular data. And I replicate them a lot.

Hm, what for?  The VMs I have are all different, so there's no point in
cloning them.  And why would I clone my data?  I don't even have the
disk capacity for that and am glad that I can make a backup.

I don't use cloning either, because it prevents deletion of the original instance being cloned. But snapshots are extremely useful. Take a snapshot, do the upgrade inside domU, and if it breaks things, shut down the domU, roll back the snapshot, and restart it. It'll be as if nothing happened.
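The upgrade workflow described above looks roughly like this (pool and domU names are hypothetical):

```shell
# Take a snapshot of the domU's dataset before the upgrade
zfs snapshot tank/vms/domu1@pre-upgrade

# ... perform the upgrade inside the domU ...

# If the upgrade breaks things: shut down the domU, then
zfs rollback tank/vms/domu1@pre-upgrade
# and restart the domU - it's as if the upgrade never happened
```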

I'm not saying you will feel about ZFS as I do after you try it
out. It presents you a certain set of features, advantages and
disadvantages and it is up to you, and you only, to decide whether you
can benefit from it or not. All I'm saying is that I personally
believe ZFS is worth taking into consideration.

So far, the advantages I'm seeing that would be a benefit are
checksumming and using on-disk caches for writing.  The latter doesn't
seem to be overly relevant.  That means a lot of learning and
experimentation and uncertainties for that one benefit.

Sounds like you are afraid of learning.
But if you don't use backups and snapshots already and don't intend to begin using them, then you are right, you are probably not going to see much benefit.

So you would be running ZFS on unreliable disks, with the errors being
corrected and going unnoticed, until either, without TLER, the system
goes down or, with TLER, until the errors aren't recoverable anymore and
become noticeable only when it's too late.

ZFS tells you it had problems ("zpool status"). ZFS can also check
entire pool for defects ("zpool scrub", you should do that
periodically).
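For reference, the two commands mentioned above (assuming a pool named "tank"):

```shell
zpool status -v tank   # show per-device read/write/checksum error counters
zpool scrub tank       # read and verify every block in the pool
zpool status tank      # check scrub progress and results afterwards
```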

You're silently losing more and more redundancy.

I'm not sure what you mean by losing redundancy.

You don't know whether the data has been written correctly before you
read it.

You never do anyway. Phantom writes happen. You could get Seagate drives that have the Write-Read-Verify feature, enable it at the cost of halving write speed, and live with the fact that your disk failure rate will be several times what you would see with other brands of disks. Or you could just live with the redundancy being provided by parity or mirror disks at a higher level.

The more errors there are, the more redundancy you lose
because you have more data that can be read from only a part of the
disks.

ZFS immediately repairs all errors it encounters, so this is not a valid point.

If there is an error on another disk with that same data, you
don't know until you try to read it and perhaps find out that you can't.
How many errors for that data it takes depends on the level of
redundancy.

Sure, but you are in the worse situation with traditional RAID.

How do you know when
a disk needs to be replaced?

ZFS tells you it had IO or checksum failures. It may also put your
pool into a degraded state (with one or more disks disconnected from
the pool) with reduced redundancy (just like a regular RAID would
do). SMART also tells you something wrong has happened (or is going
to, probably). And, additionally, when you replace a disk and resilver
(ZFS term for rebuilding) the pool, you know whether all your data was
read and restored without errors.

And how do you know when to replace a disk?  When there's one error or
when there are 50 or 50000 or when the disk has been disconnected?

In most cases, when the SMART on the disk reports the disk has failed or the disk stops responding.

Does ZFS maintain a list of bad sectors which are not to be used again?

Don't know, but never heard of it. I always thought it's the storage
device's job. Does any file system do that?

I don't know.  It would make sense because there's no telling what the
disk is doing --- the disk might very well re-use a bad sector and find
that just at the time you want to read the data, it's not readable
again.

Disks are _expected_ to deal with sector reallocations internally. If they don't, they are broken. Disk sectors are all addressed through a translation layer, and you don't have any way of telling when a sector has been moved (except maybe by doing performance timings on seeks between adjacent sectors), because the sector address is logical rather than physical. It's even less transparent with SSDs, which shift data around all the time to improve wear leveling.

The disk might continue to disagree with ZFS and insist on
re-using the sector.  Perhaps it figured that it can use the sector
eventually after so many tries to recover the error.

That's a failed disk. If the sector came up as readable only after extensive scrubbing (ignoring the fact that this doesn't actually happen), it should still get reallocated to a healthy sector.

Now, some disks _might_ re-write the same sector and check whether the data sticks to it and reuse it if the signal quality is above some threshold, but you'll have to disassemble your disk's firmware to know for sure.

The error might not even be noticed with other file systems, other than
as a delay due to error correction maybe.  That other file system would
deliver corrupt data or correct data, there's no way to know.  Disks
aren't designed for ZFS in the first place.

Other file systems will fare worse. ext4 doesn't automatically handle bad sectors for you, either. Disks aren't made for ZFS - ZFS was made specifically to deal with the fact that disks are crap.

It's also quite difficult to corrupt the file system
itself:
https://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing

It shows that there are more checksum errors after the errors were
supposedly corrected.

Not "supposedly". The increasing number only shows the count of
encountered checksum errors. If ZFS could not correct the error, it
would say so.

Ok, and why are there more errors after an error was corrected?  Is the
error being reset at some time or kept indefinitely?

You didn't understand the test. Some errors were picked up at import stage, but all that gets checked at import stage is the pool metadata, not the entire disk's contents. The find command went and read all the data on the FS, but if you had snapshots, some errors might still lurk in them that don't get found by checking the data in the "head" instance. For a full pool check you use the "zpool scrub" command.

Using ZFS does not mean you don't have to do backups. File system type
won't make a difference for a fire inside your enclosure:) But ZFS
makes it easy to create backups by replicating your pool or datasets
("zfs send" lets you create full or incremental backups) to another
set of disks or machine(s).

As another ZFS or as files or archives or as what?  I'm using rsync now,
and restoring a file is as simple as copying it from the backup.

Typically as another ZFS dataset. Replicating ZFS snapshots has one
big advantage for me (besides checksumming, so you know you've made
your backup correctly) - it's atomic, so it either happens or not. It
doesn't mean it's supposed to replace rsync, though. It depends on the
task at hand.

A dataset?  Does it transfer all the data or only what has changed (like
rsync does)?  The task at hand would be to make a backup of my data,
over network, from which it's easy to restore.

It is generally more efficient than rsync. rsync has to go and check metadata of every file at source and destination, even when transferring incrementally. If you use incremental send between snapshots, the amount of disk I/O ZFS has to do is much lower, and since it doesn't need to check metadata on each side, there is a lot less network I/O, too.
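An incremental send between two snapshots, as described, looks roughly like this (pool, dataset, and host names are illustrative):

```shell
# Take today's snapshot, then send only the delta since yesterday's
# snapshot to a receiving machine over ssh:
zfs snapshot tank/data@today
zfs send -i tank/data@yesterday tank/data@today | \
    ssh backuphost zfs receive backup/data
```

ZFS already knows which blocks changed between the two snapshots, which is why it avoids the per-file metadata walk that rsync has to do on both sides.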

http://www.smallnetbuilder.com/nas/nas-features/31541-how-to-build-a-cheap-petabyte-server-revisited

That's just two organizations with similarly sized storage and
different approaches. One uses standard solutions, the other one
ported ZFS to Linux, so they could use it.

I find it interesting that a company which is concerned more about its
costs than anything else doesn't go for a solution, now easily
available, that can cut their costs in half and that an institution
which doesn't appear to be overly concerned with costs goes for that
very same solution despite it's not easily available at all.

That doesn't mean anything at all. I know of many companies where they use rented servers rather than their own because accountants prefer opex to capex, even if over 3+ years the reduction in total cost would be huge.

I also know of many companies who use virtual cloud infrastructure, even though for a substantial part of their workload the performance is half of what they would get on bare metal, which in turn makes owned bare metal servers cheaper.

Solutions that aren't optimal are used all the time for all kinds of spurious and non-technical reasons.

It's up to you to define your goals, solutions and level of
assurance. My personal approach is "hope for the best, plan for the
worst".

The problem is that you can plan whatever you want and things turn out
otherwise regardless.  My grandma already knew that.

Expect the unexpected and be prepared. Every boy scout already knows that.

Just because something might go wrong regardless of how well prepared you are doesn't justify not being prepared at all.

What is the actual rate of data corruption or loss prevented or
corrected by ZFS due to its checksumming in daily usage?

I have experienced data corruption due to hardware failures in the
past.

Hardware failures like?

I am quite certain I have seen in-RAM data corruption before, which, when it occurs in data queued to be committed to disk, will cause on-disk data corruption (not detectable once it's on disk; your application will just get corrupted data back, since the corrupted data is correctly stored as corrupted data on disk).

I have also seen files get corrupted on disk due to latent disk errors, through traditional RAID. Nothing logs an error, there is no change in the file, but the application crashes. When I located all the files involved in the operation and pulled the backups from months back, I found that 512 bytes in the file had changed between two backups with no obvious metadata changes (modification time). This is a fairly typical example of what happens when a disk write goes astray.

The opposite problem is a phantom write when the write doesn't make it to the disk - head floats too far from the platter and the write doesn't stick. This, too, happens a lot more than most people realize.

Once is often enough for me and it happened more than once. If I
hadn't done the checksumming myself, I probably wouldn't even have
known about it. Since I started using it, ZFS detected data corruption
several times for me (within a few years). But I don't own a data
center :) Actual error rates might depend on your workload, hardware,
probabilities and lots of other things. Here's something you might
find interesting:

Sure, the more data about failures detected by checksumming we would
collect, the more we might be able to make conclusions from it.  Since
we don't have much data, it's still interesting to know what failure
rates you have seen.  Is it more like 1 error in 50TB read or more like
1 error in 500TB or like 20 in 5TB?

According to manufacturers, one unrecoverable error every 10^14 bits read. That equates to roughly one unrecoverable sector every ~11TiB read. This is the statistical average. On some models it'll be worse.
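The arithmetic behind that figure, as a back-of-the-envelope check:

```python
# Manufacturer spec: one unrecoverable read error per 10^14 bits read
bits_per_error = 10**14
bytes_per_error = bits_per_error / 8      # 1.25e13 bytes
tib_per_error = bytes_per_error / 2**40   # convert bytes to TiB

print(round(tib_per_error, 1))  # ~11.4 TiB between unrecoverable errors
```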

The following articles provide some good info:

http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf

http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf



_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 
