
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Mon, 07 Jul 2014 10:17:22 +0100
  • Delivery-date: Mon, 07 Jul 2014 09:18:13 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 07/06/2014 02:42 PM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 07/05/2014 03:57 AM, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

W dniu 2014-07-04 19:11, lee pisze:
Kuba <kuba.0000@xxxxx> writes:

"Rollback" doesn't sound very destructive.

For me "rollback" always meant "revert to some previous state" and for
me it sounds very destructive - at least for the "current state" from
which you are reverting.

It still doesn't sound destructive.

Then I dare say you don't understand what it actually means.

I know what it means, and it doesn't sound very destructive.

Rolling back your file system state to a previous point in time doesn't sound destructive to you? It doesn't convey the meaning that you will lose changes to the file system made since the point you are rolling back to?

mkfs.ext4 doesn't ask you "are you sure" before you tell it to create
the FS on a block device that already contains an ext4 FS. Why would
you expect anything else to?

Because ZFS makes a point of keeping data safe.  And there are some
warnings you get from xfs.

Not from user error. Backups keep data safe from user error.

So it's making a backup and not making a backup?  What are snapshots
good for when I can't restore from them, i.e. use them for backups?

You can restore from them. But a backup should mean a copy on a
different machine, not subject to being lost if the entire machine is
lost (e.g. a power surge that fries every component).

You can't make snapshots to different machines?

You can send snapshots, once you have taken them, to a different machine.

A snapshot cannot exist independently of the filesystem it is a snapshot
of. Both live on the same ZFS pool (equivalent to the RAID array).
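
E.g. something along these lines (pool, dataset and host names made up
for illustration):

    # take a snapshot, then replicate it to another machine
    zfs snapshot tank/data@monday
    zfs send tank/data@monday | ssh backuphost zfs recv backup/data

    # later runs only need to send the delta between two snapshots
    zfs snapshot tank/data@tuesday
    zfs send -i @monday tank/data@tuesday | ssh backuphost zfs recv backup/data

The incremental form is what makes this workable as a regular
off-machine backup.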

So they are like an incremental backup with very limited use.

No more limited use than any backup you keep on the same disks as the working copy of the data. But it is certainly more convenient than alternatives within those constraints.

So far, the advantages I'm seeing that would be a benefit are
checksumming and using on-disk caches for writing.  The latter doesn't
seem to be overly relevant.  That means a lot of learning and
experimentation and uncertainties for that one benefit.

Sounds like you are afraid of learning.
But if you don't use backups and snapshots already and don't intend to
begin using them, then you are right, you are probably not going to
see much benefit.

It's not a question of being afraid of learning but a question of
risking losing my data.  To make snapshots of VMs, I'd have to make
backups of them and to somehow recreate them with ZFS, and the swap
partitions they have might be a problem.

Not really - snapshots will give you crash-level consistency. You cannot roll a snapshot back underneath a running machine any more than you can plug a different rootfs disk into a running machine and expect it to keep running normally. If you need to roll back, you shut down the VM, roll back the volume, then restart the VM. It'll do its usual journal replay on an unclean shutdown and come up in the state it was in when you took the snapshot of the FS.
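
Something along these lines (domU name, config path and zvol made up
for illustration):

    xl shutdown vm1                     # cleanly stop the domU
    zfs rollback tank/vm1@pre-upgrade   # revert the zvol to the snapshot
    xl create /etc/xen/vm1.cfg          # boot it from the rolled-back disk

Note that zfs rollback only goes back to the most recent snapshot
unless you pass -r to destroy the newer ones.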

But then, dom0 and the VMs are
on a RAID-1, so I'd have to make backups of everything, change to JBOD,
figure out how to boot from ZFS and how to restore from the backup.  Any
idea how to do that?  Does ZFS provide swap partitions?  If not, I'd
have to put them on RAID devices, but I wouldn't have any.

Swap for dom0 or for domUs?

For dom0, as I said before, I use RAID1 for /boot and the rootfs. I typically put this on a RAID1 disk set, where dom0's swap could also live (when I absolutely need swap, I use a relatively small zram, because most of the time swapping is a great way to cripple your machine into a completely unusable state that is often worse than a process OOM-ing and dying).

For domU, you put it on whatever volume the rest of the domU filesystems are on.
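
ZFS doesn't have swap partitions as such, but you can carve a zvol out
of the pool and swap onto that. A sketch (pool name made up); be warned
that swap on a zvol has had deadlock problems under memory pressure on
ZFS-on-Linux, so test it carefully:

    # a 4G volume with a block size matching the system page size
    zfs create -V 4G -b $(getconf PAGESIZE) tank/swap
    mkswap /dev/zvol/tank/swap
    swapon /dev/zvol/tank/swap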

While the server is down, I don't have internet access.  You recommended
against booting from ZFS.

That was my personal recommendation. Many people use ZFS for root as well, but it requires using a custom patched grub and for reasons I explained before, I don't find root on ZFS to be of sufficient benefit to make it worth the extra effort.

The more errors there are, the more redundancy you lose
because you have more data that can be read from only a part of the
disks.

ZFS immediately repairs all errors it encounters, so this is not a
valid point.

It may not encounter all the errors there are (until it's too late to
repair them), so the point remains valid.

No it doesn't, because the point is relative to other available solutions, all of which fare far worse under the circumstances discussed. A hardware RAID controller will typically kick out disks based on relatively low error thresholds. ZFS will try to hold onto disks as long as they are responsive to the kernel (within SCSI command timeouts), which means that it will try to maintain redundancy much better, and will keep fixing all the errors it encounters in the meantime.

And how do you know when to replace a disk?  When there's one error or
when there are 50 or 50000 or when the disk has been disconnected?

In most cases, when the SMART on the disk reports the disk has failed
or the disk stops responding.

I don't believe those SMART numbers.

I believe them on Seagates and HGSTs. I take them with a healthy dose of distrust on WDs and Samsungs.

When the disk has failed or
doesn't respond, it's obvious that it needs to be replaced.  Before that
happens, you might see some number of errors ZFS has detected.  You just
ignore those?

Depends on how many and how often. If half a dozen errors show up on one weekly scrub and no more the next, it's probably OK. If it comes up with a few hundred on a scrub, I re-run the scrub, and it comes up with a few hundred more, then there's a good chance the disk is failing. SMART info and any bus resets logged against that disk in syslog may also provide more information (e.g. it could be a duff SATA cable).
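
For example (pool and device names made up):

    zpool status -v tank        # per-device READ/WRITE/CKSUM error counts
    smartctl -a /dev/sdb        # SMART attributes and the drive's error log
    grep -i sdb /var/log/syslog # any bus resets or timeouts against the disk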

Does ZFS maintain a list of bad sectors which are not to be used again?

Don't know, but never heard of it. I always thought it's the storage
device's job. Does any file system do that?

I don't know.  It would make sense because there's no telling what the
disk is doing --- the disk might very well re-use a bad sector and find
that just at the time you want to read the data, it's not readable
again.

Disks are _expected_ to deal with sector reallocations internally. If
they don't, they are broken. Disk sectors are all addressed through a
translation layer, and you don't have any way of telling when a sector
has been moved (except maybe by doing performance timings on seeks
between adjacent sectors), because the sector address is logical
rather than physical. It's even less transparent with SSDs which shift
data around all the time to improve wear leveling.
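
The closest you get to visibility from the outside is the drive's own
SMART counters, e.g. (typical ATA attribute names):

    smartctl -A /dev/sda | egrep -i 'realloc|pending|uncorrect'
    # Reallocated_Sector_Ct, Current_Pending_Sector,
    # Offline_Uncorrectable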

So you never know what the disk is doing, and there's nothing to prevent
silent loss of redundancy, other than scrubs.

Scrubs and normal use.

How often does your RAID controller scrub the array to check for errors? If it finds that in a particular RAID5 stripe the data doesn't match the parity, but none of the disks return an error, does it trust that the data is correct or the parity is correct? If parity, which combination of data blocks does it assume are correct, and which block needs to be repaired? ZFS can recover from this even with n+1 redundancy because each data stripe has a checksum independent of the parity, so it is possible to establish which combination of surviving data+parity blocks is the correct one, and which blocks need to be re-built.

https://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing
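
You can reproduce the gist of that demonstration on a throwaway
file-backed pool - destructive only to the scratch files, and all
paths here are made up:

    truncate -s 256M /tmp/d1 /tmp/d2
    zpool create testpool mirror /tmp/d1 /tmp/d2
    cp -r /some/test/data /testpool/    # any expendable test data

    # deliberately corrupt one side of the mirror,
    # skipping past the vdev labels at the front
    dd if=/dev/urandom of=/tmp/d1 bs=1M seek=16 count=64 conv=notrunc

    zpool scrub testpool      # ZFS spots the checksum mismatches and
    zpool status -v testpool  # repairs them from the intact mirror
    zpool destroy testpool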

It shows that there are more checksum errors after the errors were
supposedly corrected.

Not "supposedly". The increasing number only shows the count of
encountered checksum errors. If ZFS could not correct the error, it
would say so.

Ok, and why are there more errors after an error was corrected?  Is the
error being reset at some time or kept indefinitely?

You didn't understand the test. Some errors were picked up at import
stage, but all that gets checked at import stage is the pool metadata,
not the entire disk's contents. The find command went and read all the
data on the FS, but if you had snapshots, some errors might still be
in them that don't get found by checking the data in the "head"
instance. For a full pool check you use the zpool scrub command.
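
i.e. something like:

    zpool scrub tank        # read and verify every block, repairing as it goes
    zpool status -v tank    # scrub results; the error counters accumulate
    zpool clear tank        # reset the counters once you've investigated

Which also answers the earlier question: the per-device counts are
runtime counters, so they accumulate until you reset them with zpool
clear (or until the pool is re-imported, e.g. on a reboot).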

The test didn't explain this, so how are people who don't know what
ZFS does supposed to understand it?

Basic common sense based on what the commands do.

You can see that the data can still
be read and that the number of errors has gone up.  That the number of
errors has increased contradicts that the errors have been fixed.

Only if you have no clue how file systems, RAID, and disk accesses work. In which case you should be using an OS designed for people with that level of interest in understanding.

http://www.smallnetbuilder.com/nas/nas-features/31541-how-to-build-a-cheap-petabyte-server-revisited

That's just two organizations with similarly sized storage and
different approaches. One uses standard solutions, the other one
ported ZFS to Linux, so they could use it.

I find it interesting that a company which is concerned more about its
costs than anything else doesn't go for a solution, now easily
available, that can cut their costs in half and that an institution
which doesn't appear to be overly concerned with costs goes for that
very same solution despite it not being easily available at all.

That doesn't mean anything at all. I know of many companies where they
use rented servers rather than their own because accountants prefer
opex to capex, even if over 3+ years the reduction in total cost would
be huge.

I also know of many companies who use virtual cloud infrastructure,
even though for a substantial part of their workload the performance
is half of what they would get on bare metal, which in turn makes
owned bare metal servers cheaper.

Solutions that aren't optimal are used all the time for all kinds of
spurious and non-technical reasons.

I don't find that convincing.  Companies aren't willing to pay their
employees salaries that would even cover the costs of living, and they
are always trying to save money in any way they can.

Which is why they are getting the quality of the solutions described.

When you look at
the lengths Backblaze claims to have gone to in order to keep costs
low, it is entirely inconceivable that they would skip out on something
that would save them half their costs for spurious or non-technical
reasons.

You'd think so. OTOH I regularly do consultancy for clients that run systems that are similarly "inconceivable" (in the most generic sense possible, not limited to any specific file system or application). People who know what they are doing are sufficiently few and sufficiently expensive that they are seldom there when the system is first being designed.

It's up to you to define your goals, solutions and level of
assurance. My personal approach is "hope for the best, plan for the
worst".

The problem is that you can plan whatever you want and things turn out
otherwise regardless.  My grandma already knew that.

Expect the unexpected and be prepared. Every boy scout already knows that.

When they grow up, they find out that it doesn't work.

So your view is to not bother taking precautions? Do you wear a seatbelt when driving? And if so, is that because you plan to crash?

Just because something might go wrong regardless of how well prepared
you are doesn't justify not being prepared at all.

There are lots of reasons for not being prepared for everything, and
when things go wrong nonetheless, having been prepared for everything
can be difficult to justify.

Depends on what's at stake. If you know what you are doing the overheads of ensuring timely recoverability are not particularly significant.

What is the actual rate of data corruption or loss prevented or
corrected by ZFS due to its checksumming in daily usage?

I have experienced data corruption due to hardware failures in the
past.

Hardware failures like?

I am quite certain I have seen in-RAM data corruption before, that
when it occurs in the commit charge, will cause on-disk data
corruption (not detectable once it's on disk, your application will
just get corrupted data back since the corrupted data is correctly
stored as corrupted data on disk).

I have also seen files get corrupted on disk due to latent disk
errors, through traditional RAID. Nothing logs an error, there is no
change in the file, but the application crashes. When I located all
the files involved in the operation and pulled backups from months
back, I found that 512 bytes in the file had changed between
two backups with no obvious metadata changes (modification time). This
is a fairly typical example of what happens when a disk write goes
astray.

The opposite problem is a phantom write when the write doesn't make it
to the disk - head floats too far from the platter and the write
doesn't stick. This, too, happens a lot more than most people realize.

That it /can/ happen is one thing, how often it /does/ happen is
another.  Without knowing the actual rate, it's difficult to judge how
big the benefit of checksumming is.

Once is often enough for me and it happened more than once. If I
hadn't done the checksumming myself, I probably wouldn't even have
known about it. Since I started using it, ZFS detected data corruption
several times for me (within a few years). But I don't own a data
center :) Actual error rates might depend on your workload, hardware,
probabilities and lots of other things. Here's something you might
find interesting:

Sure, the more data about failures detected by checksumming we
collect, the better the conclusions we can draw from it.  Since
we don't have much data, it's still interesting to know what failure
rates you have seen.  Is it more like 1 error in 50TB read or more like
1 error in 500TB or like 20 in 5TB?

According to manufacturers, one unrecoverable error every 10^14 bits
read. That equates to one unrecoverable sector every 11TB. This is the
statistical average. On some models it'll be worse.
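
To spell the arithmetic out: 10^14 bits / 8 = 1.25 x 10^13 bytes,
which is about 12.5TB (roughly 11TiB), hence the figure above.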

The following articles provide some good info:

http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf

http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf

They don't answer the question, either.

So you didn't read the articles, then. Graph (b) in Figure 3 of the second article shows the number of latent sector errors per GB over 18 months of use, by disk model. So depending on your disk you could be getting a silent disk error as often as once per 100GB. Unrecoverable sector errors (i.e. non-latent disk errors) are on top of that.

How much data can you, in daily
usage, read/write from/to a ZFS file system with how many errors
detected and corrected only due to the checksumming ZFS does?

See above. Depending on disk make/model, potentially as high as one per 100GB on some disk models.

Gordan


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

