
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance

  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Wed, 09 Jul 2014 10:02:44 +0100
  • Delivery-date: Wed, 09 Jul 2014 09:02:55 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 2014-07-09 00:45, lee wrote:

I was thinking of errors detected by ZFS, though.  What if you see a
few: Do you replace the disk, or do you wait until there are so many?

Depends on the rate at which they are showing up. If every week the
same disk throws a few errors, then yes, it is a good candidate for
replacing. But usually there are other indications in SMART and
syslog, e.g. command timeouts, bus resets, and similar.
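One way to keep an eye on the rate is to record the per-device error counters from `zpool status` at regular intervals and compare them. The following is only a rough sketch: the sample text and the function names are made up for illustration, and it assumes the counters are printed as plain integers (it skips rows where they are abbreviated or missing).

```python
# Hypothetical sample of `zpool status` output, for illustration only.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     3
	    sdb     ONLINE       0     0     0

errors: No known data errors
"""


def parse_zpool_errors(status_text):
    """Return {name: (read, write, cksum)} from `zpool status` text."""
    counts = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_config:
            # The device table starts after the NAME/STATE header row.
            if fields[:2] == ["NAME", "STATE"]:
                in_config = True
            continue
        if not fields:
            break  # a blank line ends the device table
        # Only accept rows whose three counters are plain integers.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counts[fields[0]] = tuple(int(f) for f in fields[2:5])
    return counts


def devices_with_errors(counts):
    """Names of entries with at least one non-zero counter."""
    return sorted(name for name, errs in counts.items() if any(errs))


if __name__ == "__main__":
    print(devices_with_errors(parse_zpool_errors(SAMPLE)))
```

Running this weekly and keeping the results would show whether the same disk keeps accumulating errors, which is the pattern that makes it a candidate for replacement.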

Hm, interesting ... Would you say that there is a correlation between
timeouts/bus resets and errors detected by ZFS?  Like no significant
numbers of ZFS-detected errors showing up before the timeouts/resets

It depends on how the disk is failing. A lot of the time bus
timeouts will result in checksum errors in ZFS because it was unable
to retrieve the data off the disk, but that could be caused by
a number of failures, some more critical than others. It could
be something as trivial as a marginal SATA cable, or it could
be the disk failing to read back a sector. Or it could be a disk
becoming completely unresponsive (in which case the kernel will
disconnect it and ZFS will show it as failed and put the vdev
into degraded state).

or does ZFS keep a list of sectors not to use anymore?

As I said before, disks handle their own defects, and have done for
the past decade or two. File systems have long had no place in keeping
track of duff sectors on disks.

So ZFS may silently lose redundancy (and in a bad case data), depending
on what the disks do. And there isn't any way around that, other than
increasing redundancy.

How do you define "silently"?

"Silently" as in "not noticed" because ZFS doesn't detect the errors
before attempting to read. When a disk behaves badly, ZFS would have to
assume that data has been written correctly while it hasn't.  For that
data, there is no redundancy because it has been "silently" lost (or
never existed).

In that case, yes, without reading the data back, you can never
be completely sure. If this is important to you, you will need to
buy disks with the Write-Read-Verify feature.

In practical terms, however, there is no way (nor reason) to
distinguish between sectors that were written wrong (or not at all)
and those that got corrupted after being written. The only
metric that matters is whether the data is there when you want to
access it.
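To illustrate the point that the check happens at access time, here is a minimal application-level sketch. It is not the drive's Write-Read-Verify feature and not how ZFS works internally: it just stores a SHA-256 digest at write time and compares it on a later read. The function names are hypothetical. Note that a read straight after a write is usually served from the page cache, so this only says something about the disk when the read happens later.

```python
import hashlib
import os


def write_with_digest(path, data):
    """Write data to path, flush it to the device, return its SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digest


def data_is_intact(path, expected_digest):
    """Return True only if the file can be read and still matches the digest."""
    h = hashlib.sha256()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    except OSError:
        # Unreadable data is treated the same as corrupted data.
        return False
    return h.hexdigest() == expected_digest
```

The check returns the same answer whether the data was written wrong, never written, or corrupted afterwards, which is why the distinction makes no practical difference.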
