Re: [Xen-users] Cheap IOMMU hardware and ECC support importance
On 2014-07-09 03:29, lee wrote:

>>> When every file occupies at least 4k because that's the block size
>>> the FS is using, you can waste a lot of space.
>>
>> ZFS cannot use stripes smaller than (sector size) + (redundancy).
>> I.e. if you use disks with 4KB sectors and you are writing a 10 byte
>> file on RAIDZ2 (n+2 redundancy, similar to RAID6), that will use 3
>> sectors (one for the data, plus two for the n+2 redundancy), i.e.
>> 12KB. Variable stripe width is there to improve write performance of
>> partial writes.
>
> And the checksums go into the same sector? So for writing a file
> that's 4k, two sectors would be used, plus redundancy? If that is so,
> wouldn't the capacity appear to be increased or to be variable with
> ZFS, depending on file size? I'm confused now ...

I don't know the details of the on-disk format - you'll have to read
the code for that.

>>>>> The biggest advantage would be checksumming. I'd be trading that
>>>>> against ease of use and great complexity.
>>>>
>>>> Not to mention resistance to learning something new.
>>>
>>> Not mentioning the risks involved ...
>>
>> Perhaps our experiences differ - mine shows that lying and dying
>> disks pose a sufficiently high risk of data loss that a traditional
>> RAID and file system cannot be trusted with keeping the data safe.
>
> That doesn't eliminate the risks. Perhaps I've been lucky - the more
> I learn about it, the more I think I should do something.

The biggest problem is that a lot of the time you wouldn't even be
aware that there is a problem with traditional RAID. Same as without
ECC memory: it's not just that the errors are uncorrectable, it's the
fact that you don't even know whether the errors have occurred.

>>>> Same way you know with any disk failure - appropriate monitoring.
>>>> Surely that is obvious.
>>>
>>> It's not obvious at all. Do you replace a disk when ZFS has found
>>> 10 errors?
>>
>> Do you replace a disk when SMART is reporting 10 reallocated sectors?
>
> No, I'm not using SMART.

Then it seems you have far bigger configuration issues to address
before looking at ZFS. :)

>>>> You have to exercise some reasonable judgement there, and apply
>>>> monitoring, just like you would with any other disk/RAID.
>>>
>>> It's simple with RAID because the disk either fails or not.
>>
>> It's usually simple with disks because it either fails or not.
>
> Introducing another indicator which may mean that a disk has "failed
> a little" doesn't make things simpler.

But it does mean that you keep extra redundancy for as long as
possible. If you keep a disk with an increasing number of failed
sectors in the pool, then while you are rebuilding onto a new drive
with the failing disk still attached, that disk is still providing
redundancy for all of its surviving sectors.

>>>> zfs send produces a data stream that can be applied to another
>>>> pool using zfs receive. You can pipe this over ssh or netcat to a
>>>> different machine, or you can pipe it to a different pool locally.
>>>
>>> So I'd be required to also use ZFS on the receiving side for it to
>>> make sense.
>>
>> Indeed.
>
> That's kinda evil because if I run into problems with ZFS, it might
> be good to have used a different file system for backups, and when I
> have to restore from the backup, I might get corrupted data because
> it hasn't been checksummed.

You can't have it both ways. Either you use zfs on both ends and get
the advantage of its features, such as zfs send, or you do things the
hard, slow and error-prone way with rsync.
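For illustration, this is roughly what the zfs send route looks like
in practice - a sketch only, with made-up pool, dataset and host
names, so adjust them to your own setup:

    # take a snapshot and send the full stream to another machine
    zfs snapshot tank/data@backup-2014-07-09
    zfs send tank/data@backup-2014-07-09 \
        | ssh backuphost zfs receive backup/data

    # later runs only send the blocks changed between two snapshots
    zfs snapshot tank/data@backup-2014-07-10
    zfs send -i tank/data@backup-2014-07-09 tank/data@backup-2014-07-10 \
        | ssh backuphost zfs receive backup/data

The incremental send is the part rsync can't match: ZFS already knows
which blocks changed between the two snapshots, so it never has to
walk and compare the whole file tree on either side.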
>>>>> They claim[1] that they are currently storing over 100 petabytes
>>>>> and have restored 6.27 billion files. They are expecting to store
>>>>> another 500 petabytes at another datacenter.
>>>>
>>>> That's over a hundred broken files already, and if they meet their
>>>> plan, at least 500 more detected broken files, so they must know.
>>>
>>> Mentioning that such a thing occurs could be considered bad for
>>> business.
>>
>> There's no good way to hide it because people would notice when
>> restoring.
>
> But they say their software does checksumming, just not how.

There are many ways to do that. Tripwire does it. If you just wanted a
cross-check, you could have a process on a write-only system that
periodically runs, stores the md5sum of each file in, say, its
extended attributes, and compares it against the current contents of
the file.

>>>>> What is the actual rate of data corruption or loss prevented or
>>>>> corrected by ZFS due to its checksumming in daily usage?
>>>>
>>>> According to disk manufacturers' own specifications for their own
>>>> disks (i.e. assume it's worse), one unrecoverable error in 10^14
>>>> bits read. This doesn't include complete disk failures.
>>>
>>> That still doesn't answer the question ...
>>
>> If you define what "daily usage" is in TB/day, you will be able to
>> work out how many errors per day you can expect from the numbers I
>> mentioned above.
>
> That would be theoretical numbers. That so many errors /can/ occur
> doesn't mean that they /do/ occur.

You are right, it doesn't. But the SMART attributes on my disks seem
to broadly agree, in terms of the relation between values such as
total LBAs read and reallocated sector counts (caveat: not all disks
report the total LBAs read value in SMART).

> ZFS is not involved in determining such numbers, either, and there
> may be more or fewer errors ZFS detects than what the specification
> says. I'm asking what the rate actually is.

As I said, from the SMART figures on my many disks, the 10^-14 rate
seems to be in the right ballpark, although some disks are better than
others.

> I have disks for which the specification says that the MTBF is one
> million hours (or maybe even two million). That means I would see
> only one single failure in over a hundred years, and if I had over a
> hundred of such disks, I would only see one failure within a year. My
> experience indicates, and studies with large numbers of disks
> indicate, that this is BS and that the actual failure rate is much
> different.

Indeed, the MTBF figures seem to bear little resemblance to reality,
but the unrecoverable error rate figures do seem to tally up with
empirical observations, in my experience.
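To put a rough number on that spec figure - a back-of-the-envelope
sketch, assuming the 1-in-10^14 rate quoted above and a purely
hypothetical workload of 1TB of reads per day:

    # 1TB read = 8 * 10^12 bits; spec rate = 1 error per 10^14 bits
    echo 'scale=4; 8 * 10^12 / 10^14' | bc
    # -> .0800 expected unrecoverable read errors per day,
    #    i.e. roughly one every 12-13 days, or about 30 a year

That is the sort of "errors per day" estimate I was referring to, and
it is before you count outright disk failures.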
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users