Re: [Xen-users] Cheap IOMMU hardware and ECC support importance
On 07/01/2014 08:23 PM, lee wrote:
> Gordan Bobic <gordan@xxxxxxxxxx> writes:
>> On 06/29/2014 08:12 AM, lee wrote:
>>> Gordan Bobic <gordan@xxxxxxxxxx> writes:
>>>> On 06/28/2014 08:45 AM, lee wrote:
>>>>> The hardware RAID controller gives me 10fps more with my favourite game I'm playing, compared to software RAID. Since fps rates can be rather low (because I'm CPU limited), that means a significant difference.
>>>> If your game is grinding on disk I/O during play, all is lost anyway. If your CPU and RAM are _that_ constrained, there is probably a better way to spend whatever you might pay for a new caching RAID controller these days.
>>> Only I didn't buy the controller new, and I bought it to have a decent number of ports.
>> Fair re: number of ports, but 2nd hand CPUs don't go for that much on ebay, either, if you are concerned about the CPU hit.
> What would you put into an AM3 socket which is so much faster than a Phenom 965 to be worthwhile and doesn't have an enormous power consumption? I'd have to switch to Intel --- and that would bring about UEFI which, besides the tremendous hassle, comes with serious security risks. And after reading [1], what could you still buy?
>
> [1]: https://www.fsf.org/blogs/community/active-management-technology

UEFI comes with extra security risks? Do tell. If anything it is more secure because it requires all firmware to be cryptographically signed, so you cannot have something compromise a firmware without it getting detected.

>>> It's not disk I/O or a lack of RAM that limits the fps rates, it's actually the CPU (or the whole combination of CPU, board and RAM) not being able to feed the graphics card fast enough --- or the graphics card being too fast for the CPU, if you want to see it that way. To get a significantly faster system, I'd have to spend ten times or more than what I paid for the controller.
>> Fair enough - you must have got the controller _really_ cheap. Expect to spend a small fortune on a battery replacement when that fails, though. They typically only last a couple of years.
> Yes, it was a good deal --- you can still get them off ebay for about that price. If the BBU fails, it might be advisable to buy another controller with BBU because it doesn't cost much more than the BBU, and I'd have a spare controller. Or I could retire the controller and perhaps use ZFS or mdraid or, since I have the server now, just get a pair of SSDs. My intention is to reduce the number of disks anyway. The fewer disks you have, the fewer can fail.

Or you could use them for extra redundancy.

>>> The CPU alone would cost more. I didn't expect any change in fps rates and got the improvement as a surprising side effect.
>> That will only apply when you are doing disk I/O at the same time, surely. If you aren't doing disk I/O, then your CPU isn't doing checksumming.
> Well, I was really surprised about the improvement in fps and have a hard time believing it myself. There's no good explanation for it other than the theory that the system somehow gets plugged up when having to deal with so many disks in software RAID, and that would seem to only apply when so much disk I/O is going on. "So much" would probably mean "only very little".

Possibly a reduced number of interrupts hitting the CPU.

>> As I said above - clearly you got the RAID controller very cheap. Within that cost envelope it may well have been a reasonable solution - but that doesn't take away the points about data safety and disk feature requirements for use with hardware RAID (TLER).
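As an aside on the TLER point: whether a given drive honours a bounded error-recovery time can be queried (and often set) through its SCT Error Recovery Control timers with smartmontools. A minimal sketch follows; the device path is only an example, the "not supported" check is a rough heuristic on smartctl's output, and 70 (7.0 seconds) is merely the conventional value for drives sitting behind RAID:

    import subprocess
    import sys

    DEV = "/dev/sda"   # example device; substitute the drive you want to check

    # Query the SCT Error Recovery Control timers (values are in tenths of a second).
    query = subprocess.run(["smartctl", "-l", "scterc", DEV],
                           capture_output=True, text=True)
    print(query.stdout)

    # Crude heuristic: smartctl reports "not supported" when the drive has no SCT ERC.
    if "not supported" in query.stdout.lower():
        sys.exit(DEV + ": no SCT ERC - risky behind a hardware RAID controller")

    # Optionally cap recovery at 7 seconds for reads and writes (70 tenths of a second).
    subprocess.run(["smartctl", "-l", "scterc,70,70", DEV], check=False)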
> True --- luckily, the WD20EARS hold up remarkably well. One out of three failed, after about three years, probably because it would take too long to recover an error. I replaced it and haven't checked it any further yet; it might still be usable.

Isn't that the green model that spins down so aggressively that it causes premature spindle and actuator failure? I have a duff one on the shelf that randomly loses different sectors on every scan. But because WD firmwares lie about reallocations (or worse, the drive just keeps reusing bad sectors as long as it thinks the data has stuck), it never actually reads as failed in SMART. I have observed this on WD disks going back several generations, which is why I avoid them wherever possible. Samsungs seem to do something similar. The only "honest" disks seem to be HGST and Seagate, but experience with Seagate over the last 6 years shows them to be far too unreliable for my liking. As to whether I prefer honest but unreliable vs. dishonest but more reliable disks - it's a tough call, but luckily I don't have to make that decision at the moment.

>> Disks with crippled firmware will deliberately not respond to resets while trying to do sector recovery, again, purely to prevent working around the fact that the disk doesn't report TLER. When this happens, one of two things will occur:
>>
>> 1) The controller will kick out the disk and carry on without it, thus losing the redundancy.
> That's probably what happened with the failed disk.
>> 2) The controller might be clever enough to not kick out the disk that doesn't support TLER (theoretical - you are at the mercy of the closed-spec RAID card firmware that may or may not do something sensible), but the only other thing it can do is wait for the disk to return. But until the disk returns, the card will block further I/O.
>>
>> So in reality you have a choice between losing redundancy on the first pending sector you encounter and the machine becoming unresponsive for prolonged periods whenever you encounter a pending sector. With software RAID you can at least choose between the two (by selecting the disk command timeout appropriately) and manage the situation (e.g. by setting up a process to monitor for disks that have been kicked out of the array and automatically re-adding them, if they become responsive again, to restore the redundancy).
> Is it a good idea to continue using a disk that spends quite a while on error recovery?

Sector failures shouldn't be that common. If the disk keeps losing sectors that regularly it will eventually run out of them (unless it's a WD), and SMART will eventually just complain about the disk having failed. And you will probably notice the errors in the log (unless you use a RAID card that has no logging facilities) and degraded performance long before the disk deems itself to have failed.

> And have you seen what happens when a SATA disk becomes unresponsive? The kernel will, apparently indefinitely, keep trying to reset the SATA link and flood you with messages about it to the point where your only choice is to try to shut the system down.

Not indefinitely - it will try to reset it a few times, and if that fails it will just stop trying to talk to it. I had a Seagate become unresponsive on one of my servers yesterday, and ZFS dealt with it by kicking the disk out of the pool without any external intervention. Hot unplug-replug of the disk woke it up. I probably should replace that disk, but I have enough redundancy and mirrored copies of that server to not have to worry about that sort of thing too much.
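To make the earlier suggestion concrete - a process that watches for members kicked out of a Linux software RAID array and re-adds them once the disk answers again - a minimal sketch could look like the following. The array name and polling interval are examples, the liveness test is deliberately crude, and a real deployment would add logging and alerting. The per-disk command timeout mentioned above is the value (in seconds) in /sys/block/<disk>/device/timeout.

    import re
    import subprocess
    import time

    ARRAY = "/dev/md0"      # example array
    INTERVAL = 60           # seconds between checks

    def faulty_members(array):
        """Return member devices that mdadm --detail currently reports as faulty."""
        out = subprocess.run(["mdadm", "--detail", array],
                             capture_output=True, text=True)
        return [line.split()[-1] for line in out.stdout.splitlines()
                if "faulty" in line and "/dev/" in line]

    def disk_responds(member):
        """Crude liveness check: try a small read from the underlying disk."""
        disk = re.sub(r"\d+$", "", member)      # /dev/sdd1 -> /dev/sdd
        try:
            with open(disk, "rb") as f:
                f.read(4096)
            return True
        except OSError:
            return False

    while True:
        for member in faulty_members(ARRAY):
            if disk_responds(member):
                # Drop the faulty slot, then ask md to re-add the member so it
                # resyncs (a write-intent bitmap makes the resync cheap).
                subprocess.run(["mdadm", ARRAY, "--remove", member], check=False)
                subprocess.run(["mdadm", ARRAY, "--re-add", member], check=False)
        time.sleep(INTERVAL)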
> I've had that happen a while ago with a very old disk. I wonder why the kernel behaves that badly. The disk that failed wasn't relevant at all for keeping the system running. I'd rather have the kernel try maybe ten times, then give up, and give you an option to make it try again. I wonder what happens when that happens with a disk on a PMP. Will the whole set on the PMP become inaccessible because the port gets blocked?

At least my PMPs are reasonably well behaved. You lose a disk, but you don't lose the whole set.

>>> How come you use such a card? Couldn't you use the on-board SATA ports and connect a multiplier to them?
>> Most south-bridge based SATA controllers don't support FIS, and
> My understanding of [2] is that the SATA protocol doesn't work at all without FIS. All SATA controllers would have to support it then.
>
> [2]: https://en.wikipedia.org/wiki/Frame_Information_Structure#FIS

FIS _switching_ definitely doesn't work on all SATA controllers. Intel ICH doesn't support it, and many others don't, either.

>> performance takes a massive nosedive when you lose command interleaving between the disks.
> Huh? You mean if a SATA controller doesn't support FIS, it would leave you unable to use all but one disk behind a PMP connected to it, because switching between the disks won't be possible?

No, I mean it will issue a command and have to wait for it to complete. Then switch to a different disk, issue a command and wait for it. As opposed to issuing a command, switching to a different disk and issuing a command, then waiting for both of them to return.

>>> I have a theory that when you have a software RAID-5 with three disks and another RAID-1 with two disks, you have to move so much data around that it plugs up the system, causing slowdowns.
>> Yes, you have to send twice the data down the PCIe bus, but in practice this creates a negligible overhead unless your machine is massively short of CPU and PCIe bandwidth.
> With five disks in total in different arrays you'd have to send five times the data, wouldn't you?

In RAID5? No, only the redundancy data extra. Mirroring is the worst case, and that only doubles the I/O.
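To put rough numbers on that: for software RAID the extra data the host pushes over the bus is only the redundancy. A back-of-the-envelope sketch with illustrative figures of my own, valid for the best case of large full-stripe writes (small random writes to RAID5 additionally trigger read-modify-write, so their overhead is higher):

    GIB = 1024 ** 3

    def raid5_bus_bytes(user_bytes, n_disks):
        # N-disk RAID5, full-stripe writes: every (n_disks - 1) data strips
        # are accompanied by one parity strip computed on the host.
        return user_bytes * n_disks / (n_disks - 1)

    def mirror_bus_bytes(user_bytes, copies=2):
        # RAID1: every byte is sent once per copy.
        return user_bytes * copies

    print(raid5_bus_bytes(1 * GIB, 3) / GIB)   # 3-disk RAID5 -> 1.5x the user data
    print(mirror_bus_bytes(1 * GIB) / GIB)     # 2-way mirror  -> 2.0x, the worst case

So a 3-disk RAID5 costs about 1.5x the user data in bus traffic and a 2-way mirror 2x - nowhere near five times.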
>>>> and even on RAID6 I don't trust the closed, opaque, undocumented implementation that might be in the firmware, is
>>> It's a big disadvantage of hardware RAID that you can't read the data when the controller has failed, unless you have another, compatible controller at hand. Did you check the sources of ZFS so that you can trust it?
>> I have looked at the ZoL and zfs-fuse sources in passing when looking at various patches I wanted to apply (and writing a small one of my own for zfs-fuse, to always force 4KB ashift), but I wouldn't say that I have looked through the source enough to trust it based purely on my reading of the code. But I suspect that orders of magnitude more people have looked at the ZFS source than have looked at the RAID controller firmware source.
> It's entirely possible that, in total, 5000 people have looked at the ZFS sources, and that 4995 of them have looked at the same 5% of it, with the exception of the 5 ZFS developers (or however many there are) who have read it all. It's a bitchy argument --- but it's possible :)

Which is still probably more than the number of people that looked at the firmware source. :)

>> For a typical example of the sort of errors I'm talking about, say you have a hardware RAID5 array and a ZFS RAIDZ1 pool. Take one disk off each controller, write some random data over parts of it (let's be kind, don't overwrite the RAID card's headers, which a phantom write could theoretically do). Now put the disks back into their pools, and read the files whose data you just overwrote. ZFS will spot the error, restore the data and hand you a good copy back. The RAID controller will most likely give you back duff data without even noticing something is wrong with it.
> It would start a rebuild once you plug the disk back in --- provided that you hotplugged it or had the array up while the disk was removed. Otherwise, we'd have to try out what would happen, which can be different from controller to controller.

I was talking about swapping disks while the machine was shut down. The RAID controller would never know the disk was even unplugged. The aim was to simulate silent data corruption.

>> You could run ZFS on top of the RAID logical volume, but because ZFS would have no visibility of the raw disks and redundancy underneath, there is nothing it can do to scrub out the bad data. But even in that stack, ZFS would still at least notice the data is corrupted and refuse to return it to the application.
> Only if you do not happen to write correct checksums for the data that was overwritten --- though it seems unlikely enough to happen.

That, too, would get detected, because the checksum wouldn't match the data.

> What does ZFS do when it finds block A and block B of data D, both blocks with a correct checksum but containing different data? Will it let you decide which block to use, or refuse to deliver the data, or not notice and return some data instead of D which is composed of whatever is in the blocks A and B?

I'm not sure. You would have to really hand-craft that kind of data corruption. A scrub would find the discrepancy, but I'm not sure what it would do about it. One block would probably win since both are "valid". What would a hardware RAID controller do? :)

>> no longer fit for purpose with disks of the kind of size that ship today.
> How would that depend on the capacity of the disks? More data --> more potential for errors --> more security required?

The data error rates have been stagnant at one unrecoverable sector per 10^14 bits read. That's one bad sector on about 10TB of data. If you are using 4TB drives in a 3-disk RAID5 and you lose a disk, you have to read back 8TB of data to rebuild the parity onto the replaced disk. If you are statistically going to get one bad block every 10TB of reads, it means you have an 80% chance of losing some data during rebuilding that array. Additionally, rebuilding times have been going up with disk size, which increases both the time of degraded performance during the rebuild and the probability of failure of another disk during the rebuild.

> It makes me wonder why such errors haven't become a widespread serious problem yet --- not everyone is using ZFS.

They have become a widespread serious problem. People just don't notice the data corruption often.
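For anyone who wants to check that arithmetic: assuming the commonly quoted consumer-drive spec of one unrecoverable read error per 10^14 bits, the numbers for an 8TB rebuild read come out as follows. The 80% above corresponds to the rougher one-error-per-10TB linear estimate; a simple exponential model gives a lower, but still uncomfortably high, figure.

    URE_PER_BIT = 1e-14                 # commonly quoted consumer-drive spec
    REBUILD_BYTES = 8 * 10**12          # 8TB read back from the surviving disks
    bits = REBUILD_BYTES * 8

    expected_errors = bits * URE_PER_BIT                   # ~0.64 expected UREs
    p_at_least_one = 1 - (1 - URE_PER_BIT) ** bits         # ~0.47 (exponential model)
    rough_linear = REBUILD_BYTES / 10**13                  # ~0.8, the "80%" above

    print(expected_errors, p_at_least_one, rough_linear)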
>>>>> So with VMware, you'd have to get certified hardware.
>>>> You wouldn't _have_ to get certified hardware. It just means that if you find that there is a total of one motherboard that fits your requirements and it's not on the certified list, you can plausibly take your chances with it even if it doesn't work out of the box. I did that with the SR-2 and got it working eventually in a way that would never have been possible with ESX.
>>> Are you saying that for your requirements you couldn't use VMware, which makes it irrelevant whether the hardware is certified for it or not?
>> Both. What I wanted to do couldn't be done on ESX within the same resource constraints. I wanted to use a deduplicated ZFS pool for my VM images, and there is no ESX port of ZFS. Additionally, there would have been no way for me to work around my hardware bugs with ESX, because I couldn't write patches to work around the problem.
> The idea is that you wouldn't have had to work around hardware bugs if you had been using certified hardware.

Sure, but there was no certified hardware at the time that met all of my requirements. So I went with the best compromise I could: hardware with the features I needed and an open source hypervisor that I might have a remote chance of patching if I had to work around hardware bugs.

>>> Why would they think that virtualization benefits things that require high performance?
>> I don't. But many people think it makes negligible difference because they never did their testing properly.
> How much difference does it make, actually? I mean just the virtualization, without a number of other VMs doing something. Like you could plan on particular hardware for a particular workload and get performance X with it. Let's say you plan 4 CPUs/8GB RAM and buy hardware that has 2x4 CPUs/64GB RAM. Now you somehow limit the software to use only 4 CPUs/8GB of that hardware and measure the performance --- perhaps physically take out one CPU and install only 8GB. Then you plug in the second CPU and all the RAM, but use a VM, give it 4 CPUs/8GB and measure the performance again. How much less performance will you get?

The difference is quite substantial. Here are some tests I did a couple of years ago:

http://www.altechnative.net/2012/08/04/virtual-performance-part-1-vmware/

I did some testing (ESX) with MySQL 3 years ago for a customer and the results were approximately a 40% degradation at saturation point, clock for clock. I also did testing (also ESX) with MySQL for a different client last week, as they are planning to move to "the cloud", and the performance drop, clock for clock, was about 35%. It is definitely not negligible.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users