
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Wed, 09 Jul 2014 08:52:49 +0100
  • Delivery-date: Wed, 09 Jul 2014 07:53:08 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 2014-07-09 02:13, lee wrote:

But then, dom0 and the VMs are on a RAID-1, so I'd have to make backups
of everything, change to JBOD, figure out how to boot from ZFS and how
to restore from the backup. Any idea how to do that?  Does ZFS provide
swap partitions?  If not, I'd have to put them on RAID devices, but I
wouldn't have any.

Swap for dom0 or for domUs?

All have swap partitions.

For dom0, as I said before, I use RAID1 for the /boot and rootfs.

And leave the rest of the disks unused?

Do whatever you want with the remaining space. ZFS can use both
partitions and block devices.
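
Something like this, for example - pool and device names are just
placeholders, mirroring one whole disk with a leftover partition:

import subprocess

# Build a mirrored pool from one whole disk and one spare partition.
# "tank", /dev/sdb and /dev/sdc3 stand in for your own layout.
subprocess.run(
    ["zpool", "create", "tank", "mirror", "/dev/sdb", "/dev/sdc3"],
    check=True,
)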

For domU, you put it on whatever volume the rest of the domU
filesystems are on.

Without swap partitions?

No, partition the domU virtual disk inside the domU in any
way you like, including swap partitions.
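
ZFS doesn't have swap "partitions" as such, but a zvol gives you a block
device that can be formatted and enabled as swap. A minimal sketch,
assuming a pool named "tank" and a 4G volume (both placeholders):

import subprocess

# Create a 4 GiB zvol and turn it into swap. On Linux the zvol appears
# under /dev/zvol/<pool>/<dataset>.
subprocess.run(["zfs", "create", "-V", "4G", "tank/swap"], check=True)
subprocess.run(["mkswap", "/dev/zvol/tank/swap"], check=True)
subprocess.run(["swapon", "/dev/zvol/tank/swap"], check=True)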

A hardware RAID controller will typically kick out disks based on
relatively low error thresholds. ZFS will try to hold onto disks as
long as they are responsive to the kernel (within SCSI command
timeouts), which means that it will try to maintain redundancy much
better, and will keep fixing all the errors it encounters in the
meantime.

Which is better?  In both cases, another disk could fail shortly after
the first one has.

Failure has degrees. Having two partially failing disks (failed sectors)
in an n+1 redundancy array may still yield a complete copy of the data.
ZFS will keep those disks for as long as they are responsive, while
rebuilding data onto a new disk, and pick whatever data is healthy on
each of the old disks.

How often does your RAID controller scrub the array to check for
errors? If it finds that in a particular RAID5 stripe the data doesn't
match the parity, but none of the disks return an error, does it trust
the data or the parity? If the parity, which combination of data blocks
does it assume is correct, and which block needs to be repaired? ZFS can
recover from this even with n+1 redundancy because each data stripe has
a checksum independent of the parity, so it is possible to establish
which combination of surviving data+parity blocks is the correct one,
and which blocks need to be rebuilt.
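
As a conceptual sketch of that last point (this is not ZFS code, just
the idea, with made-up block contents): with a checksum kept per block
independently of the parity, the one silently corrupted block in an n+1
stripe can be picked out and rebuilt from the survivors:

import hashlib
from functools import reduce

def xor(blocks):
    # XOR parity across equally sized blocks.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def checksum(block):
    return hashlib.sha256(block).digest()

# A stripe of three data blocks plus XOR parity; checksums kept separately.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor(data)
sums = [checksum(b) for b in data]

# Simulate silent corruption: the disk returns bad data with no error.
data[1] = b"BxBB"

# The checksums identify the bad block; parity plus the good blocks
# reconstruct it, and the rebuilt block verifies against its checksum.
bad = next(i for i, b in enumerate(data) if checksum(b) != sums[i])
rebuilt = xor([b for i, b in enumerate(data) if i != bad] + [parity])
assert checksum(rebuilt) == sums[bad]
print("block %d repaired: %r" % (bad, rebuilt))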

Interesting question --- are you saying the hardware RAID controller has
no way of knowing which data is good, because it uses parity information
merely to reconstruct data when part of that data is no longer available,
while ZFS uses checksums on each part of the data, which not only allow
it to reconstruct the data when part of it is unavailable, but also let
it know which part of the data is good, because it assumes that data
whose checksums match is good?

Yes, that is exactly what I'm saying.

https://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing

You can see that the data can still be read and that the number of
errors has gone up. That the number of errors has increased seems to
contradict the claim that the errors have been fixed.

Only if you have no clue how file systems, RAID, and disk accesses
work. In which case you should be using an OS designed for people with
that level of interest in understanding.

That's what I said: when you don't know ZFS, you see the contradiction.
Common sense makes you at least suspicious when you are supposed to
assume that an error has been fixed and yet see more errors showing up.

The entire premise is wrong - you cannot meaningfully gain information
from a test without understanding the test.

When you look at the lengths Backblaze claims to have gone to in order
to keep costs low, it is entirely inconceivable that they would skip
something that would save them half their costs for spurious or
non-technical reasons.

You'd think so.

I got an email from them, and they're saying they are considering using
ZFS and that the software they're using does checksumming.  How exactly
it does it wasn't said.  They also said their encryption software is
closed source and that it's up to you to trust them or not.

So we can only guess who has access to all the data they store.

Or you could encrypt your data before backing it up. If you use
something like encfs, you can back up the underlying encrypted data
rather than the unencrypted data. That way it doesn't matter how they
store it.
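
A minimal sketch of that workflow, with made-up paths and backup target;
you keep working through the plaintext mount as usual (e.g.
"encfs ~/.encrypted ~/private") but only ever ship the ciphertext side:

import subprocess

CIPHERTEXT = "/home/user/.encrypted"        # encfs backing store
TARGET = "backup-host:/backups/encrypted/"  # wherever the backups go

# rsync the encrypted directory tree; the provider never sees plaintext.
subprocess.run(
    ["rsync", "-a", "--delete", CIPHERTEXT + "/", TARGET],
    check=True,
)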

What is the actual rate of data corruption or loss prevented or
corrected by ZFS due to its checksumming in daily usage?

The following articles provide some good info:

http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf

http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf

They don't answer the question, either.

So you didn't read the articles, then.

I looked at them.

Graph (b) in Figure 3 of the second article shows the number of latent
sector errors per GB over 18 months of use, by disk model. So depending
on your disk, you could be getting a silent disk error as often as once
per 100GB. Unrecoverable sector errors (i.e. non-latent disk errors) are
on top of that.
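
As a rough back-of-the-envelope reading of that figure (an assumption,
not a measurement): at the worst-case rate of one latent error per 100GB
over 18 months, even a small pool adds up:

# Hypothetical pool: 4 x 2TB disks at the paper's worst-case error rate.
errors_per_gb = 1.0 / 100
pool_gb = 4 * 2000
print("expected latent errors over 18 months: ~%.0f" % (pool_gb * errors_per_gb))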

It doesn't answer the question.

In daily usage, how much data can you read from or write to a ZFS file
system, and how many errors get detected and corrected solely due to the
checksumming ZFS does?

See above. Depending on the disk make/model, potentially as high as one
error per 100GB.

Potentially, theoretically, no ZFS involved ...

You are using ZFS, so do you see this one error per 100GB?  Or what do
you see?

It's not something I log/graph. Maybe I should add it to my zabbix
setup...
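
A minimal sketch of what that might look like (not an existing zabbix
item): sum the CKSUM column of "zpool status" so the number of checksum
errors ZFS has caught can be graphed. The output format varies between
ZFS versions, so treat the parsing as a rough starting point:

import subprocess

def total_cksum_errors():
    out = subprocess.check_output(["zpool", "status"]).decode()
    total = 0
    for line in out.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM
        # (pool/vdev summary rows get counted too; fine for trending)
        if len(fields) == 5 and fields[2].isdigit() and fields[4].isdigit():
            total += int(fields[4])
    return total

if __name__ == "__main__":
    # e.g. wired up as a zabbix UserParameter that returns one integer
    print(total_cksum_errors())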


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

