
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Mon, 07 Jul 2014 10:17:22 +0100
  • Delivery-date: Mon, 07 Jul 2014 09:18:13 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 07/06/2014 02:42 PM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 07/05/2014 03:57 AM, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

W dniu 2014-07-04 19:11, lee pisze:
Kuba <kuba.0000@xxxxx> writes:

"Rollback" doesn't sound very destructive.

For me "rollback" always meant "revert to some previous state" and for
me it sounds very destructive - at least for the "current state" from
which you are reverting.

It still doesn't sound destructive.

Then I dare say you don't understand what it actually means.

I know what it means, and it doesn't sound very destructive.

Rolling back your file system state to a previous point in time doesn't sound destructive to you? It doesn't convey the meaning that you will lose changes to the file system made since the point you are rolling back to?

mkfs.ext4 doesn't ask you "are you sure" before you tell it to create
the FS on a block device that already contains an ext4 FS. Why would
you expect anything else to?

Because ZFS makes a point of keeping data safe.  And there are some
warnings you get from xfs.

Not from user error. Backups keep data safe from user error.

So it's making a backup and not making a backup?  What are snapshots
good for when I can't restore from them, i.e. use them for backups?

You can restore from them. But a backup should mean a copy on a
different machine, not subject to being lost if the entire machine is
lost (e.g. a power surge that fries every component).

You can't make snapshots to different machines?

You can send snapshots, once you have taken them, to a different machine.

A snapshot cannot exist independently of the filesystem it is a snapshot
of. Both live on the same ZFS pool (equivalent to the RAID array).
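
E.g. something along these lines (pool, dataset and host names made up
for illustration):

    # take a snapshot, then replicate it to another machine
    zfs snapshot tank/data@monday
    zfs send tank/data@monday | ssh backuphost zfs recv backup/data

    # later runs only need to send the delta between two snapshots
    zfs snapshot tank/data@tuesday
    zfs send -i @monday tank/data@tuesday | ssh backuphost zfs recv backup/data

The incremental form is what makes this workable as a regular
off-machine backup.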

So they are like an incremental backup with very limited use.

No more limited use than any backup you keep on the same disks as the working copy of the data. But it is certainly more convenient than alternatives within those constraints.

So far, the advantages I'm seeing that would be a benefit are
checksumming and using on-disk caches for writing.  The latter doesn't
seem to be overly relevant.  That means a lot of learning and
experimentation and uncertainties for that one benefit.

Sounds like you are afraid of learning.
But if you don't use backups and snapshots already and don't intend to
begin using them, then you are right, you are probably not going to
see much benefit.

It's not a question of being afraid of learning but a question of
risking losing my data.  To make snapshots of VMs, I'd have to make
backups of them and to somehow recreate them with ZFS, and the swap
partitions they have might be a problem.

Not really - snapshots will give you crash-level consistency. You cannot roll a snapshot back underneath a running machine any more than you can plug a different rootfs disk into a running machine and expect it to keep running normally. If you need to roll back, you shut down the VM, roll back the volume, then restart the VM. It'll do its usual journal replay on an unclean shutdown and come up in the state it was in when you took the snapshot of the FS.
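
Something along these lines (domU name, config path and zvol made up
for illustration):

    xl shutdown vm1                     # cleanly stop the domU
    zfs rollback tank/vm1@pre-upgrade   # revert the zvol to the snapshot
    xl create /etc/xen/vm1.cfg          # boot it from the rolled-back disk

Note that zfs rollback only goes back to the most recent snapshot
unless you pass -r to destroy the newer ones.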

But then, dom0 and the VMs are
on a RAID-1, so I'd have to make backups of everything, change to JBOD,
figure out how to boot from ZFS and how to restore from the backup.  Any
idea how to do that?  Does ZFS provide swap partitions?  If not, I'd
have to put them on RAID devices, but I wouldn't have any.

Swap for dom0 or for domUs?

For dom0, as I said before, I use RAID1 for /boot and the rootfs. I typically put this on a RAID1 disk set, where dom0's swap could also live (when I absolutely need swap, I use a relatively small zram, because most of the time swapping is a great way to cripple your machine into a completely unusable state that is often worse than a process OOM-ing and dying).

For domU, you put it on whatever volume the rest of the domU filesystems are on.
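
ZFS doesn't have swap partitions as such, but you can carve a zvol out
of the pool and swap onto that. A sketch (pool name made up); be warned
that swap on a zvol has had deadlock problems under memory pressure on
ZFS-on-Linux, so test it carefully:

    # a 4G volume with a block size matching the system page size
    zfs create -V 4G -b $(getconf PAGESIZE) tank/swap
    mkswap /dev/zvol/tank/swap
    swapon /dev/zvol/tank/swap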

While the server is down, I don't have internet access.  You recommended
against booting from ZFS.

That was my personal recommendation. Many people use ZFS for root as well, but it requires using a custom patched grub and for reasons I explained before, I don't find root on ZFS to be of sufficient benefit to make it worth the extra effort.

The more errors there are, the more redundancy you lose
because you have more data that can be read from only a part of the
disks.

ZFS immediately repairs all errors it encounters, so this is not a
valid point.

It may not encounter all the errors there are (until it's too late to
repair them), so the point remains valid.

No it doesn't, because the point is relative to other available solutions, all of which fare far worse under the circumstances discussed. A hardware RAID controller will typically kick out disks based on relatively low error thresholds. ZFS will try to hold onto disks as long as they are responsive to the kernel (within SCSI command timeouts), which means that it will try to maintain redundancy much better, and will keep fixing all the errors it encounters in the meantime.

And how do you know when to replace a disk?  When there's one error or
when there are 50 or 50000 or when the disk has been disconnected?

In most cases, when the SMART on the disk reports the disk has failed
or the disk stops responding.

I don't believe those SMART numbers.

I believe them on Seagates and HGSTs. I take them with a healthy dose of distrust on WDs and Samsungs.

When the disk has failed or
doesn't respond, it's obvious that it needs to be replaced.  Before that
happens, you might see some number of errors ZFS has detected.  You just
ignore those?

Depends on how many and how often. If half a dozen errors show up on one weekly scrub and no more the next, it's probably OK. If it comes up with a few hundred on a scrub, I re-run the scrub, and it comes up with a few hundred more, then there's a good chance the disk is failing. SMART info and any bus resets logged against that disk in syslog may also provide more information (e.g. it could be a duff SATA cable).
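
For example (pool and device names made up):

    zpool status -v tank        # per-device READ/WRITE/CKSUM error counts
    smartctl -a /dev/sdb        # SMART attributes and the drive's error log
    grep -i sdb /var/log/syslog # any bus resets or timeouts against the disk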

Does ZFS maintain a list of bad sectors which are not to be used again?

Don't know, but never heard of it. I always thought it's the storage
device's job. Does any file system do that?

I don't know.  It would make sense because there's no telling what the
disk is doing --- the disk might very well re-use a bad sector and find
that just at the time you want to read the data, it's not readable
again.

Disks are _expected_ to deal with sector reallocations internally. If
they don't, they are broken. Disk sectors are all addressed through a
translation layer, and you don't have any way of telling when a sector
has been moved (except maybe by doing performance timings on seeks
between adjacent sectors), because the sector address is logical
rather than physical. It's even less transparent with SSDs which shift
data around all the time to improve wear leveling.
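
The closest you get to visibility from the outside is the drive's own
SMART counters, e.g. (typical ATA attribute names):

    smartctl -A /dev/sda | egrep -i 'realloc|pending|uncorrect'
    # Reallocated_Sector_Ct, Current_Pending_Sector,
    # Offline_Uncorrectable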

So you never know what the disk is doing, and there's nothing to prevent
silent loss of redundancy, other than scrubs.

Scrubs and normal use.

How often does your RAID controller scrub the array to check for errors? If it finds that in a particular RAID5 stripe the data doesn't match the parity, but none of the disks return an error, does it trust that the data is correct or the parity is correct? If parity, which combination of data blocks does it assume are correct, and which block needs to be repaired? ZFS can recover from this even with n+1 redundancy because each data stripe has a checksum independent of the parity, so it is possible to establish which combination of surviving data+parity blocks is the correct one, and which blocks need to be re-built.

https://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing
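
You can reproduce the gist of that demonstration on a throwaway
file-backed pool - destructive only to the scratch files, and all
paths here are made up:

    truncate -s 256M /tmp/d1 /tmp/d2
    zpool create testpool mirror /tmp/d1 /tmp/d2
    cp -r /some/test/data /testpool/    # any expendable test data

    # deliberately corrupt one side of the mirror,
    # skipping past the vdev labels at the front
    dd if=/dev/urandom of=/tmp/d1 bs=1M seek=16 count=64 conv=notrunc

    zpool scrub testpool      # ZFS spots the checksum mismatches and
    zpool status -v testpool  # repairs them from the intact mirror
    zpool destroy testpool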

It shows that there are more checksum errors after the errors were
supposedly corrected.

Not "supposedly". The increasing number only shows the count of
encountered checksum errors. If ZFS could not correct the error, it
would say so.

Ok, and why are there more errors after an error was corrected?  Is the
error being reset at some time or kept indefinitely?

You didn't understand the test. Some errors were picked up at import
stage, but all that gets checked at import stage is the pool metadata,
not the entire disk's contents. The find command went and read all the
data on the FS, but if you had snapshots, some errors might still be
in them that don't get found by checking the data in the "head"
instance. For a full pool check you use the zpool scrub command.
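
i.e. something like:

    zpool scrub tank        # read and verify every block, repairing as it goes
    zpool status -v tank    # scrub results; the error counters accumulate
    zpool clear tank        # reset the counters once you've investigated

Which also answers the earlier question: the per-device counts are
runtime counters, so they accumulate until you reset them with zpool
clear (or until the pool is re-imported, e.g. on a reboot).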

The test didn't explain this, so how are people who don't know what
ZFS does supposed to understand it?

Basic common sense based on what the commands do.

You can see that the data can still
be read and that the number of errors has gone up.  That the number of
errors has increased contradicts that the errors have been fixed.

Only if you have no clue how file systems, RAID, and disk accesses work. In which case you should be using an OS designed for people with that level of interest in understanding.

http://www.smallnetbuilder.com/nas/nas-features/31541-how-to-build-a-cheap-petabyte-server-revisited

That's just two organizations with similarly sized storage and
different approaches. One uses standard solutions, the other one
ported ZFS to Linux, so they could use it.

I find it interesting that a company which is concerned more about its
costs than anything else doesn't go for a solution, now easily
available, that can cut their costs in half and that an institution
which doesn't appear to be overly concerned with costs goes for that
very same solution despite it not being easily available at all.

That doesn't mean anything at all. I know of many companies where they
use rented servers rather than their own because accountants prefer
opex to capex, even if over 3+ years the reduction in total cost would
be huge.

I also know of many companies who use virtual cloud infrastructure,
even though for a substantial part of their workload the performance
is half of what they would get on bare metal, which in turn makes
owned bare metal servers cheaper.

Solutions that aren't optimal are used all the time for all kinds of
spurious and non-technical reasons.

I don't find that convincing.  Companies aren't willing to pay their
employees salaries that would even cover the costs of living, and they
are always trying to save money in any way they can.

Which is why they are getting the quality of the solutions described.

When you look at
the lengths Backblaze claims to have gone to in order to keep costs
low, it is entirely inconceivable that they would skip out on something
that would save them half their costs for spurious or non-technical
reasons.

You'd think so. OTOH I regularly do consultancy for clients that run systems that are similarly "inconceivable" (in the most generic sense possible, not limited to any specific file system or application). People who know what they are doing are sufficiently few and sufficiently expensive that they are seldom there when the system is first being designed.

It's up to you to define your goals, solutions and level of
assurance. My personal approach is "hope for the best, plan for the
worst".

The problem is that you can plan whatever you want and things turn out
otherwise regardless.  My grandma already knew that.

Expect the unexpected and be prepared. Every boy scout already knows that.

When they grow up, they find out that it doesn't work.

So your view is to not bother taking precautions? Do you wear a seatbelt when driving? And if so, is that because you plan to crash?

Just because something might go wrong regardless of how well prepared
you are doesn't justify not being prepared at all.

There are lots of reasons for not being prepared for everything, and
when things go wrong nonetheless, having been prepared for everything
can be difficult to justify.

Depends on what's at stake. If you know what you are doing the overheads of ensuring timely recoverability are not particularly significant.

What is the actual rate of data corruption or loss prevented or
corrected by ZFS due to its checksumming in daily usage?

I have experienced data corruption due to hardware failures in the
past.

Hardware failures like?

I am quite certain I have seen in-RAM data corruption before, that
when it occurs in the commit charge, will cause on-disk data
corruption (not detectable once it's on disk, your application will
just get corrupted data back since the corrupted data is correctly
stored as corrupted data on disk).

I have also seen files get corrupted on disk due to latent disk
errors, through traditional RAID. Nothing logs an error, there is no
change in the file, but the application crashes. When I located all
the files involved in the operation and pulled backups from months
back, I found that 512 bytes in the file had changed between
two backups with no obvious metadata changes (modification time). This
is a fairly typical example of what happens when a disk write goes
astray.

The opposite problem is a phantom write when the write doesn't make it
to the disk - head floats too far from the platter and the write
doesn't stick. This, too, happens a lot more than most people realize.

That it /can/ happen is one thing, how often it /does/ happen is
another.  Without knowing the actual rate, it's difficult to judge how
big the benefit of checksumming is.

Once is often enough for me and it happened more than once. If I
hadn't done the checksumming myself, I probably wouldn't even have
known about it. Since I started using it, ZFS detected data corruption
several times for me (within a few years). But I don't own a data
center :) Actual error rates might depend on your workload, hardware,
probabilities and lots of other things. Here's something you might
find interesting:

Sure, the more data about failures detected by checksumming we
collect, the better the conclusions we can draw from it.  Since
we don't have much data, it's still interesting to know what failure
rates you have seen.  Is it more like 1 error in 50TB read or more like
1 error in 500TB or like 20 in 5TB?

According to manufacturers, one unrecoverable error every 10^14 bits
read. That equates to one unrecoverable sector every 11TB. This is the
statistical average. On some models it'll be worse.
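
To spell the arithmetic out: 10^14 bits / 8 = 1.25 x 10^13 bytes,
which is about 12.5TB (roughly 11TiB), hence the figure above.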

The following articles provide some good info:

http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf

http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf

They don't answer the question, either.

So you didn't read the articles, then. Graph (b) in Figure 3 of the second article shows the number of latent sector errors per GB over 18 months of use, by disk model. So depending on your disk you could be getting a silent disk error as often as once per 100GB. Unrecoverable sector errors (i.e. non-latent disk errors) are on top of that.

How much data can you, in daily
usage, read/write from/to a ZFS file system with how many errors
detected and corrected only due to the checksumming ZFS does?

See above. Depending on disk make/model, potentially as high as one per 100GB on some disk models.

Gordan


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

