
Re: [Xen-users] Home Xen hypervisor for master's project



Mechanical and power supply issues aside, ECC errors are the most common reason for component replacement in the field. Memory does fail. Furthermore, cosmic radiation, as far-fetched as it sounds, is a real problem, and a single bit flip on a non-ECC system can trigger a system panic (if you're lucky, only the VM will fail; if you're unlucky, the whole system will reboot). Analyzing crash dumps to find the offending module is rather unpleasant, to say the least (depending on how the OS handles memory access), and you won't know whether the problem was faulty hardware or a random occurrence.

I've seen systems where one of the memory module pins (lines) was stuck at a constant zero or one due to vibration, resulting in constant error messages. With ECC, which reported the exact bit, it was trivial to diagnose and resolve. With standard memory, the server would have panicked every time it was booted, and the frustration would have been unbelievable.

ECC does NOT enhance performance. If anything, you'll get lower performance, as I've seen no ECC modules that go beyond the JEDEC-specified DDR standards (for frequency and latency). For example, the fastest ECC DDR3 memory is 1333 MHz CL9. You can get 1866 MHz CL7 and something like 2400 MHz CL9, but only as non-ECC, so when it comes to raw performance, ECC limits it (leaving aside the obvious delays caused by outages).
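To put rough numbers on that gap, here is a small sketch that converts the module ratings into first-word CAS latency. It assumes the "MHz" figures above are effective data rates (two transfers per clock), as DDR3 modules are usually labelled; the function name and the module list are only for illustration.

```python
def cas_latency_ns(data_rate_mts: float, cl: int) -> float:
    """Approximate first-word CAS latency in nanoseconds.

    DDR transfers data twice per clock, so one clock cycle lasts
    2000 / data_rate nanoseconds (data rate given in MT/s).
    """
    return cl * 2000.0 / data_rate_mts


# Module ratings taken from the paragraph above (assumed to be data rates).
modules = [
    ("ECC     DDR3-1333 CL9", 1333, 9),
    ("non-ECC DDR3-1866 CL7", 1866, 7),
    ("non-ECC DDR3-2400 CL9", 2400, 9),
]

for name, rate, cl in modules:
    print(f"{name}: {cas_latency_ns(rate, cl):.2f} ns")
```

Under that assumption the ECC module comes out at roughly 13.5 ns, against about 7.5 ns for either of the faster non-ECC parts. This covers CAS latency only; bandwidth and the other timings are not modelled.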

One thing you should bear in mind, and check before springing for expensive equipment, is how memory errors are reported on the system you are building. FWIW, I know that ECC errors with AMD under Linux are very readable (when building the kernel, you have the option to choose human-readable ECC error reporting, which is available for AMD only). Syncfloods on HT are also diagnosable. I don't know how it looks with Intel under Linux, although I can confirm that memory errors are, for the most part, easily diagnosable under Solaris, both with Intel and with AMD.
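One way to check this on a Linux box is to look at the counters the kernel's EDAC subsystem exposes in sysfs. The sketch below assumes an EDAC driver for your memory controller is loaded and that the counters live under /sys/devices/system/edac/mc; the function name is made up for this example, and on a machine without EDAC support it simply finds nothing.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs.

    ce_count = corrected errors, ue_count = uncorrected errors.
    A value of None means the file was missing or unreadable.
    """
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


found = read_edac_counts()
if not found:
    print("No EDAC memory controllers found")
for controller, entry in found.items():
    print(f"{controller}: corrected={entry['ce_count']} "
          f"uncorrected={entry['ue_count']}")
```

If this reports no controllers on hardware that is supposed to have ECC, that is worth investigating before you rely on it: the BIOS may have ECC disabled, or there may be no driver for that chipset.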

Just so you know, if you go with AMD, you don't have to get an Opteron to get ECC. All 890FX system boards support ECC (some mainstream boards based on other chipsets might not allow it in the BIOS), and nearly all support IOMMU. With Intel, you have to get a Xeon, paying more in the process. I do use an AMD-based PC at home, but I'm not an AMD employee. If I were going with brand loyalty based on the companies I work for, I'd be recommending Intel.

Marek

On 20-02-2011 at 09:17:56, James Harper <james.harper@xxxxxxxxxxxxxxxx> wrote:


Hi Joseph, guys,

I must say that ECC is kinda marketing bull. I have used ECC-enabled
servers and regular servers built from consumer parts. I've noticed no
difference whatsoever.


You do understand what ECC is, right? It's not a performance thing, it's an error detection/recovery mechanism.

When everything is working properly you won't notice a thing. When you get a single bit memory error you shouldn't notice any problems apart from a message about a faulty memory module which you can then replace. When you get a double bit memory error you'll know that you've had a memory error instead of getting a random crash or data corruption problem with no idea of the cause.
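The single-bit versus double-bit behaviour described above can be shown with a toy SECDED (single error correct, double error detect) code. This is only a sketch on 4 data bits using an extended Hamming(8,4) code; real ECC DIMMs typically protect 64 data bits with 8 check bits and the memory controller does the work in hardware. The function names are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4,5,6,7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); status is ok, corrected or uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single error. The syndrome is
        # the flipped position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

The single flip is repaired and reported, which corresponds to the "replace this module" message. The double flip cannot be repaired, but it is at least detected rather than silently handed to the OS as good data.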

I've seen ECC catch memory errors a few times, so people aren't just making this stuff up.

James



--
I use the Opera Mail client: http://www.opera.com/mail/

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users


 

