
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance



Gordan Bobic <gordan@xxxxxxxxxx> writes:

> On 07/05/2014 03:57 AM, lee wrote:
>> Kuba <kuba.0000@xxxxx> writes:
>>
>>> W dniu 2014-07-04 19:11, lee pisze:
>>>> Kuba <kuba.0000@xxxxx> writes:
>>>>
>>>> "Rollback" doesn't sound very destructive.
>>>
>>> For me "rollback" always meant "revert to some previous state" and for
>>> me it sounds very destructive - at least for the "current state" from
>>> which you are reverting.
>>
>> It still doesn't sound destructive.
>
> Then I dare say you don't understand what it actually means.

I know what it means, and it doesn't sound very destructive.
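
For reference, this is roughly what the operation under discussion looks
like (a sketch; the pool and dataset names are made up):

    # take a snapshot of the current state of the dataset
    zfs snapshot tank/data@before

    # ... files under /tank/data get created or modified ...

    # revert the dataset to the snapshot; everything written since
    # "before" is discarded, not merged
    zfs rollback tank/data@before

(If more recent snapshots exist, zfs rollback refuses to run unless they
are destroyed along with the rollback via -r.)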

> mkfs.ext4 doesn't ask you "are you sure" before you tell it to create
> the FS on a block device that already contains an ext4 FS. Why would
> you expect anything else to?

Because ZFS makes a point of keeping data safe.  And there are some
warnings you get from xfs.
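
For example (a hedged sketch; the device name is made up), mkfs.xfs
refuses to overwrite a device on which it detects an existing filesystem
unless it is forced:

    # refuses if /dev/sdX1 already contains a recognized filesystem
    mkfs.xfs /dev/sdX1

    # -f is required to overwrite it anyway
    mkfs.xfs -f /dev/sdX1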

>> So it's making a backup and not making a backup?  What are snapshots
>> good for when I can't restore from them, i. e. use them for backups?
>
> You can restore from them. But a backup should mean a copy on a
> different machine, not subject to being lost if the entire machine is
> lost (e.g. a power surge that fries every component).

You can't make snapshots to different machines?

> Snapshot cannot exist independently of the filesystem it is a snapshot
> of. All live on the same ZFS pool (equivalent to the RAID array).

So they are like an incremental backup with very limited use.
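
If I understand the mechanism correctly (a sketch; the pool, dataset and
host names are made up), the snapshot itself stays on the pool, and it
only becomes an independent backup once it is sent to another machine:

    # take a snapshot of the dataset holding the VMs
    zfs snapshot tank/vms@2014-07-05

    # stream it to a second machine and store it in a pool there
    zfs send tank/vms@2014-07-05 | ssh backuphost zfs receive backuppool/vms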

>> So far, the advantages I'm seeing that would be a benefit are
>> checksumming and using on-disk caches for writing.  The latter doesn't
>> seem to be overly relevant.  That means a lot of learning and
>> experimentation and uncertainties for that one benefit.
>
> Sounds like you are afraid of learning.
> But if you don't use backups and snapshots already and don't intend to
> begin using them, then you are right, you are probably not going to
> see much benefit.

It's not a question of being afraid of learning but a question of
risking the loss of my data.  To make snapshots of the VMs, I'd have to
back them up and somehow recreate them with ZFS, and the swap partitions
they have might be a problem.  But then, dom0 and the VMs are on a
RAID-1, so I'd have to back up everything, change to JBOD, figure out
how to boot from ZFS and how to restore from the backup.  Any idea how
to do that?  Does ZFS provide swap partitions?  If not, I'd have to put
them on RAID devices, but I wouldn't have any.
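
From what I gather (a hedged sketch; the names are made up, and swap on
ZFS volumes reportedly comes with caveats of its own), ZFS can expose a
block device, a so-called zvol, which can then be used as swap in place
of a dedicated partition:

    # create an 8 GB volume on the pool
    zfs create -V 8G rpool/swap

    # format and enable it as swap space
    mkswap /dev/zvol/rpool/swap
    swapon /dev/zvol/rpool/swap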

While the server is down, I don't have internet access.  You recommended
against booting from ZFS.

>> The more errors there are, the more redundancy you lose
>> because you have more data that can be read from only a part of the
>> disks.
>
> ZFS immediately repairs all errors it encounters, so this is not a
> valid point.

It may not encounter all the errors there are (until it's too late to
repair them), so the point remains valid.

>> And how do you know when to replace a disk?  When there's one error or
>> when there are 50 or 50000 or when the disk has been disconnected?
>
> In most cases, when the SMART on the disk reports the disk has failed
> or the disk stops responding.

I don't believe those SMART numbers.  When the disk has failed or
doesn't respond, it's obvious that it needs to be replaced.  Before that
happens, you might see some number of errors ZFS has detected.  You just
ignore those?
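
For what it's worth, these would be the numbers to look at before the
disk fails outright (a sketch; the device name is made up):

    # the drive's own overall health verdict
    smartctl -H /dev/sdX

    # raw attributes; reallocated and pending sector counts are the
    # ones usually watched as early warning signs
    smartctl -A /dev/sdX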

>>>> Does ZFS maintain a list of bad sectors which are not to be used again?
>>>
>>> Don't know, but never heard of it. I always thought it's the storage
>>> device's job. Does any file system do that?
>>
>> I don't know.  It would make sense because there's no telling what the
>> disk is doing --- the disk might very well re-use a bad sector and find
>> that just at the time you want to read the data, it's not readable
>> again.
>
> Disks are _expected_ to deal with sector reallocations internally. If
> they don't, they are broken. Disk sectors are all addressed through a
> translation layer, and you don't have any way of telling when a sector
> has been moved (except maybe by doing performance timings on seeks
> between adjacent sectors), because the sector address is logical
> rather than physical. It's even less transparent with SSDs which shift
> data around all the time to improve wear leveling.

So you never know what the disk is doing, and there's nothing to prevent
silent loss of redundancy, other than scrubs.
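
Which would make regular scrubs the one part of this that is actually
under your control; something along these lines (the pool name is made
up), run periodically from cron:

    # read and verify every block in the pool against its checksum,
    # repairing from redundancy where possible
    zpool scrub tank

    # check progress and the per-device error counters afterwards
    zpool status tank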

>>>>> https://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing
>>>>
>>>> It shows that there are more checksum errors after the errors were
>>>> supposedly corrected.
>>>
>>> Not "supposedly". The increasing number only shows the count of
>>> encountered checksum errors. If ZFS could not correct the error, it
>>> would say so.
>>
>> Ok, and why are there more errors after an error was corrected?  Is the
>> error being reset at some time or kept indefinitely?
>
> You didn't understand the test. Some errors were picked up at import
> stage, but all that gets checked at import stage is the pool metadata,
> not the entire disk's contents. The find command went and read all the
> data on the FS, but if you had snapshots, some errors might still be
> in them that don't get found by checking the data in the "head"
> instance. For a full pool check you use the zfs scrub command.

The test didn't explain this, so how are people who don't know what ZFS
does supposed to understand it?  You can see that the data can still be
read and that the number of errors has gone up.  That the number of
errors has increased contradicts the claim that the errors have been
fixed.
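
As far as I can tell (a sketch; the pool name is made up), the counters
shown by zpool status are cumulative until they are explicitly reset,
which would explain why they keep growing as more bad blocks are read:

    # READ/WRITE/CKSUM counters accumulate since the last reset
    zpool status tank

    # reset the per-device error counters once the cause is dealt with
    zpool clear tank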

> It is generally more efficient than rsync. rsync has to go and check
> metadata of every file at source and destination, even when
> transferring incrementally. If you use incremental send between
> snapshots, the amount of disk I/O ZFS has to do is much lower, and
> since it doesn't need to check metadata on each side, there is a lot
> less network I/O, too.

So I wouldn't need rsync anymore --- that would be a really useful
feature then.
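
If that is how it works, an incremental transfer would presumably look
something like this (the pool, dataset and host names are made up): only
the blocks that changed between the two snapshots get sent, without
walking the file tree the way rsync does:

    # take a new snapshot on the sending side
    zfs snapshot tank/vms@today

    # send only the difference between the two snapshots to a machine
    # that already holds the "yesterday" snapshot
    zfs send -i tank/vms@yesterday tank/vms@today \
        | ssh backuphost zfs receive backuppool/vms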

>>> http://www.smallnetbuilder.com/nas/nas-features/31541-how-to-build-a-cheap-petabyte-server-revisited
>>>
>>> That's just two organizations with similarly sized storage and
>>> different approaches. One uses standard solutions, the other one
>>> ported ZFS to Linux, so they could use it.
>>
>> I find it interesting that a company which is concerned more about its
>> costs than anything else doesn't go for a solution, now easily
>> available, that can cut their costs in half and that an institution
>> which doesn't appear to be overly concerned with costs goes for that
>> very same solution even though it's not easily available at all.
>
> That doesn't mean anything at all. I know of many companies that use
> rented servers rather than their own because accountants prefer
> opex to capex, even if over 3+ years the reduction in total cost would
> be huge.
>
> I also know of many companies that use virtual cloud infrastructure,
> even though the performance hit for the substantial part of their
> workload is half of what they would get on bare metal servers, which
> in turn makes owned bare metal servers cheaper.
>
> Solutions that aren't optimal are used all the time for all kinds of
> spurious and non-technical reasons.

I don't find that convincing.  Companies aren't willing to pay their
employees salaries that would even cover the cost of living, and they
are always trying to save money in any way they can.  When you look at
the lengths Backblaze claims to have gone to in order to keep costs low,
it is entirely inconceivable that they would pass on something that
would save them half their costs for spurious or non-technical reasons.

>>> It's up to you to define your goals, solutions and level of
>>> assurance. My personal approach is "hope for the best, plan for the
>>> worst".
>>
>> The problem is that you can plan whatever you want and things turn out
>> otherwise regardless.  My grandma already knew that.
>
> Expect the unexpected and be prepared. Every boy scout already knows that.

When they grow up, they find out that it doesn't work.

> Just because something might go wrong regardless of how well prepared
> you are doesn't justify not being prepared at all.

There are lots of reasons for not being prepared for everything, and
when things go wrong despite all the preparation, the effort of being
prepared can be difficult to justify.

>>>> What is the actual rate of data corruption or loss prevented or
>>>> corrected by ZFS due to its checksumming in daily usage?
>>>
>>> I have experienced data corruption due to hardware failures in the
>>> past.
>>
>> Hardware failures like?
>
> I am quite certain I have seen in-RAM data corruption before, which,
> when it occurs in the commit charge, will cause on-disk data
> corruption (not detectable once it's on disk; your application will
> just get corrupted data back, since the corrupted data is correctly
> stored as corrupted data on disk).
>
> I have also seen files get corrupted on disk due to latent disk
> errors, through traditional RAID. Nothing logs an error, there is no
> change in the file, but the application crashes. When I located all
> the files involved in the operation and pulled the backups from
> months back, I found that 512 bytes in the file had changed between
> two backups with no obvious metadata changes (modification time). This
> is a fairly typical example of what happens when a disk write goes
> astray.
>
> The opposite problem is a phantom write, where the write doesn't make
> it to the disk - the head floats too far from the platter and the write
> doesn't stick. This, too, happens a lot more than most people realize.

That it /can/ happen is one thing, how often it /does/ happen is
another.  Without knowing the actual rate, it's difficult to judge how
big the benefit of checksumming is.

>>> Once is often enough for me and it happened more than once. If I
>>> hadn't done the checksumming myself, I probably wouldn't even have
>>> known about it. Since I started using it, ZFS detected data corruption
>>> several times for me (within a few years). But I don't own a data
>>> center :) Actual error rates might depend on your workload, hardware,
>>> probabilities and lots of other things. Here's something you might
>>> find interesting:
>>
>> Sure, the more data about failures detected by checksumming we would
>> collect, the more we might be able to make conclusions from it.  Since
>> we don't have much data, it's still interesting to know what failure
>> rates you have seen.  Is it more like 1 error in 50TB read or more like
>> 1 error in 500TB or like 20 in 5TB?
>
> According to manufacturers, one unrecoverable error every 10^14 bits
> read. That equates to one unrecoverable sector every 11TB. This is the
> statistical average. On some models it'll be worse.
>
> The following articles provide some good info:
>
> http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf
>
> http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf

They don't answer the question, either.  In daily usage, how much data
can you read from or write to a ZFS file system, and how many errors get
detected and corrected solely because of the checksumming ZFS does?
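
The only number we do have is the manufacturers' figure quoted above;
taken at face value (and assuming the spec is honest), one unrecoverable
read per 10^14 bits works out to:

    # volume of data read per unrecoverable error, in decimal TB and in TiB
    echo '10^14 / 8 / 10^12' | bc -l    # 12.5 TB
    echo '10^14 / 8 / 2^40'  | bc -l    # about 11.4 TiB, the "11TB" quoted above

But that is a spec-sheet figure for unrecoverable reads the disk itself
notices and reports, not a measured rate of silent corruption, so it
still doesn't answer the question.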

For all we know, that may be zero errors.  Or it could be so many that
everyone not using a file system that does checksumming wants to switch
to one that does immediately.


-- 
Knowledge is volatile and fluid.  Software is power.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 

