Re: [Xen-users] Cheap IOMMU hardware and ECC support importance
On 07/02/2014 11:45 PM, lee wrote:
> Gordan Bobic <gordan@xxxxxxxxxx> writes:
>> On 07/01/2014 05:24 PM, lee wrote:
>>> ZFS has its advantages, and it would seem a bad idea to use it with
>>> RAID.
>>
>> That isn't true. While it is better to use it with bare disks, using
>> it on top of RAID is still better than using something else, because
>> you still at least get to know about errors that creep in, even if
>> ZFS can no longer fix them for you.
>
> That's why I'm saying that it seems a bad idea: you have redundancy,
> and you can't fully use it, because the redundancy is in the wrong
> place. If it were in the right place, the errors could be corrected.
>
> It's tempting to try it out, and I really like the checksumming it
> does, but it's also confusing: there's (at least) ZFS and OpenZFS, and
> Debian requires you to use FUSE if you want ZFS, adding more
> complexity.
>
>> You haven't done your research thoroughly enough.
>
> No, I haven't looked into it thoroughly at all.
>
>> On Linux there is, for all intents and purposes, one implementation.
>
> Where is this implementation? Is it available by default? I only saw
> that there's a Debian package for ZFS which involves FUSE.

http://lmgtfy.com/?q=zfs+linux&l=1

>> I've been using ZoL since back when the only POSIX layer
>> implementation was from KQ Infotech, which was a rather early alpha
>> grade bodge, and I never saw any forward incompatibility, nor have I
>> ever lost any data to ZFS, which is more than I can say for most
>> other file systems.
>
> A very long time ago, I lost data with xfs once. It probably was my
> own fault, using some mount parameters wrongly. That taught me to be
> very careful with file systems and to prefer ones that are easy to
> use, that don't have many (or any) parameters that need to be
> considered, and that basically just do what they are supposed to right
> out of the box.

No file system is immune from user error. If you want optimal
performance, that's a whole different issue, and there are a lot of
things you have to tweak on most of them to achieve that, especially on
RAID.
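To make the bare-disks-vs-RAID point above concrete, here is a minimal
sketch; the pool names and device names are hypothetical, so substitute
your own:

```shell
# Recommended layout: give ZFS the bare disks, so the redundancy lives
# inside the pool and a block that fails its checksum can be rewritten
# from a good copy or from parity.
zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd

# ZFS on top of a hardware RAID LUN: a single vdev with no ZFS-level
# redundancy. Checksum errors are still detected, but cannot be
# repaired (short of setting copies=2 on a dataset, at a capacity cost).
zpool create tank2 /dev/sdx
```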
> Does ZFS do that? Since it's about keeping the data safe, it might
> have a good deal of protection against user errors.

I don't think it's possible to guard against user errors. If you're
concerned about user errors, get someone else to manage your machines
and not give you the root password.

>>> And perhaps the next day after I switch to ZFS, a new feature comes
>>> out which would require me to re-create the volumes and to copy the
>>> data over yet again, at least if I wanted to use that feature.
>>
>> You're spreading misinformed FUD. There are no "features" that could
>> be added that might require you to rebuild the pool. An existing pool
>> can always be upgraded. There is no way to downgrade, though, so make
>> sure you really want those extra features.
>
> See http://open-zfs.org/wiki/Features: "SA based xattrs Improves
> performance of linux-style (short) xattrs [...] Requires a disk format
> change and is off by default [...] Note that SA based xattrs are no
> longer used on symlinks as of Aug 2013 until an issue is resolved."
>
> What's the difference between "a disk format change" and "rebuilding
> the pool"? And how could you predict that nothing changes requiring a
> rebuild or a format change, when there are issues that apparently
> haven't been resolved in almost a year now, and features that haven't
> been implemented yet?

You don't have to rebuild a pool. The existing pool is modified in
place, and that usually takes a few seconds. Typically the pool version
headers get a bump, and from there on ZFS knows it can put additional
metadata in place. Something similar happens when you toggle
deduplication on a pool: it puts the deduplication hash table headers
in place. Even if you remove the volume that has been deduplicated and
don't have any deduplicated blocks afterwards, the headers will remain
in place. But that doesn't break anything, and it doesn't require
rebuilding the pool.

> You might enable a new feature and find that it causes problems, but
> you can't downgrade ...
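For reference, the in-place upgrade described above looks like this in
practice (the pool name `tank` is hypothetical):

```shell
# List the feature flags this ZFS build supports, with descriptions.
zpool upgrade -v

# Show which features are enabled or active on an existing pool.
zpool get all tank | grep feature@

# Enable all supported features in place -- takes seconds, no rebuild.
# Note this is one-way: an upgraded pool cannot be imported by older
# implementations that lack the new features.
zpool upgrade tank
```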
> You could have to _use_ a feature that causes problems just because
> it's available.

And features that are broken are rare, and non-critical.

>>> It seems that ZFS isn't sufficiently mature yet to use it. I haven't
>>> learned much about it yet, but that's my impression so far.
>>
>> As I said above - you haven't done your research very thoroughly.
>
> I haven't, yet all I've been reading so far makes me very careful.
> When you search for "zfs linux mature", you find more sources saying
> something like "it is not really mature" and not many, if any, that
> would say something like "of course you should use it, it works
> perfectly".

There's a lot of FUD out there, mostly coming from people who have
neither tried it nor know what they are talking about. Whatever next?
"It must be true because I read it on the internet"?

>>> And how about ZFS with JBOD on a hardware RAID controller?
>>
>> That is the recommended way to use it (effectively do away with the
>> RAID part of the controller).
>
> That's something else I never tried. What if I make a JBOD and then
> connect the disks to "normal" on-board SATA ports? Will they be
> readable just as if they had never been connected to a RAID
> controller?

That depends on what the RAID controller does to the disks. It is
another big unknown, and a reason why RAID controllers are best
avoided. Somebody recently got help on the ZFS mailing list recovering
from just such a situation, where their new RAID controller clobbered
the front of their disks.

>> Since ZFS uses variable width stripes, every write is always a single
>> operation.
>
> Which may be completed or not? And what about the on-disk caches and
> power failures?

That's what barriers and the sync settings on each file system are for.
As with any FS, any commits since the last barrier call will be lost.
Everything up to the last barrier call is guaranteed to be safe, unless
your disks or controllers lie about having committed things.
This is not ZFS specific and applies to any FS.

> IIRC, when I had the WD20EARS in software RAID-5, I got messages about
> barriers being disabled. I tried to find out what that was supposed to
> tell me, and it didn't seem to be too harmful, and there wasn't
> anything I could do about it anyway. What if I use them as JBOD with
> ZFS and get such messages?

No idea, I don't see any such messages. It's probably a feature of your
RAID controller driver.

>>> It seems that FIS would have to be supported by every HBA because
>>> it's the second layer of the SATA protocol. And I thought that NCQ
>>> is a feature of the disk itself, which either supports it or not.
>>> Why/how would a HBA or PMP interfere with NCQ?
>>
>> Not all HBAs support NCQ, and not all support FIS. They are specific
>> features that have to be implemented on the HBA and PMP and HDD.
>
> So is the Wikipedia article about SATA wrong? Or how does that work
> when any of the involved devices does not support some apparently
> substantial parts of the SATA protocol?

You misunderstand. When I say "FIS" I am talking about FIS based
switching, as opposed to command based switching. Perhaps a lack of
clarity on my part; apologies for that. If your HBA/PMP/HDD do support
FIS+NCQ (the models I mentioned do), then the bandwidth is effectively
multiplexed on demand. It works a bit like VLAN tagging. A command gets
issued, but while that command is completing (~8ms on a 7200rpm disk)
you can issue commands to other disks, so multiple commands on multiple
disks can be completing at the same time. As each disk completes its
command and returns data, the same happens in reverse.

> And there aren't any (potential) problems with the disks? Each disk
> would have to happily wait around until it can communicate with the
> HBA again. SCSI disks were designed for that, but SATA disks?

The PMP takes care of it. It works, and it works well.
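If you want to check whether a particular drive and the kernel actually
negotiated NCQ, a quick sketch (device name hypothetical, run as root):

```shell
# The drive's identify data lists NCQ support and the advertised queue
# depth under "Commands/features".
hdparm -I /dev/sda | grep -i -A1 queue

# The queue depth the kernel is actually using for the device:
cat /sys/block/sda/device/queue_depth
```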
>> NCQ on most SATA SSDs works in reverse this way, because most of the
>> time the disk is faster than the SATA port.
>
> You said before that the disks are slower than the port because they
> spend so much time seeking. Does the PMP have some cache built in so
> that the disks don't need to wait? Or are the disks designed to wait?

I'm not sure what the implementation details are, but there is no
difference between what happens with a PMP and without a PMP - the SATA
controller might not always be able to receive a request that had been
queued, so the disk has to "wait", e.g. if there's an interrupt storm
going on, or there is another bottleneck on the system (e.g. two 6Gbit
SATA ports behind a single PCIe lane).

>>> Well, yes, the disk has failed when it doesn't return data reliably,
>>> so I don't consider that a problem but a desirable feature. What
>>> does ZFS do? Continue to use an unreliable disk?
>>
>> Until the OS kernel's controller driver decides the disk has stopped
>> responding and kicks it out.
>
> As far as I've seen, that doesn't happen. Instead, the system goes
> down, trying to access the unresponsive disk indefinitely.

I see a disk get kicked out all the time. Most recent occurrence was
two days ago.

>> Hence why TLER is still a useful feature - you don't want your
>> application to end up being made to wait for potentially minutes,
>> when the data could be recovered and repaired in a few seconds if the
>> disk would only give up and return an error in a timely manner.
>
> So you would be running ZFS on unreliable disks, with the errors being
> corrected and going unnoticed, until either, without TLER, the system
> goes down or, with TLER, the errors aren't recoverable anymore and
> become noticeable only when it's too late.

"zpool status" shows you the errors on each disk in the pool. This
should be monitored, along with regular SMART checks. Using ZFS doesn't
mean you no longer have to monitor for hardware failure, any more than
you could stop monitoring for the failure of a disk in a hardware RAID
array.
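A minimal sketch of that monitoring, plus querying the error-recovery
(TLER/ERC) timer discussed above; pool and device names are
hypothetical, and not all drives allow the timer to be changed:

```shell
# Prints nothing alarming if all pools are healthy, otherwise shows the
# status of any pool with errors -- suitable for a cron job.
zpool status -x

# Per-device READ/WRITE/CKSUM error counters for one pool:
zpool status tank

# SMART health check, and the drive's current error recovery timer:
smartctl -H /dev/sda
smartctl -l scterc /dev/sda

# On drives that support SCT ERC, cap recovery at 7 seconds for reads
# and writes (the values are in units of 100 ms):
smartctl -l scterc,70,70 /dev/sda
```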
> Or how unreliable is a disk that spends significant amounts of time on
> error correction?

Exactly - 7 seconds is about 840 read attempts (one attempt per ~8.3ms
revolution of a 7200rpm disk). If the sector read failed 840 times in a
row, what are the chances that it will ever succeed?

> Isn't the disk supposed not to use the failed sector once it has been
> discovered, meaning that the disk might still be usable?

When a sector becomes unreadable, it is marked as "pending". Read
attempts on it will return an error. The next write to it will cause it
to get reallocated from the spare sectors the disk comes with. As far
as I can tell, some disks try to re-use the sector when a write for it
arrives, and see if the data sticks to the sector within the ability of
the sector's ECC to recover. If it sticks, it's kept; if it doesn't,
it's reallocated.

>>>> HGST are one exception to the rule - I have a bunch of their 4TB
>>>> drives, and they only make one 4TB model, which has TLER. Most
>>>> other manufacturers make multiple variants of the same drive, and
>>>> most are selectively feature-crippled.
>>>
>>> You seem to like the HGST ones a lot. They seem to cost more than
>>> the WD Reds.
>>
>> I prefer them for a very good reason:
>> http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/
>
> Those guys don't use ZFS. They must have very good reasons not to.

I don't know what they use.

>> 20 years of personal experience also agrees with their findings.
>
> My experience agrees with their findings, too, only it's not tied to a
> particular brand or model, other than that Seagate disks failed
> remarkably often and that I should never have bought Maxtor. HGST has
> been bought by WD, though. We can only hope that they will continue to
> make outstanding disks.

We can but hope.

Gordan

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users