Please note that the main purpose of ECC is not to reduce the RAM error rate and make memory look more reliable, but to let the system stop a process when an unrecoverable memory error occurs instead of propagating it and producing unpredictable outcomes. The change in the effect of a failure is what matters most, not its probability. Without ECC, there is often no clear way to tell whether the result of a computation is valid or garbage that should be discarded.
(Of course, in extreme scenarios, like at Google scale, even ECC can fail to fail-stop due to multi-bit errors, but in almost all non-pathological scenarios, SECDED[1] is enough to catch the erroneous cases.)
Exactly, you want to know when the error is due to memory.
Intel deciding that consumers (including those buying Haswell-E CPUs) do not need ECC really irks me. Textbook market segmentation from a near-monopoly.
Currently you cannot have your cake and eat it too:
You cannot have the best single-threaded performance (offered by overclocking the Haswell-E series or the Skylake 6700K) and also have ECC.
So if you are building the ultimate workstation, you have a hard choice: do you go with the X99 chipset (no ECC, but you can overclock), or do you go with server motherboards on the C610 chipset, which are quite limited as far as consumer interests go?
The Intel mobile Xeons are interesting, as they now provide a way to get ECC in a laptop.
Textbook market segmentation? Xeons have ECC, a larger thermal envelope, and some additional testing. Sure, they are identical silicon.
Generally, if you are willing to give up a single clock bin in exchange for ECC, you end up with a cheaper (and cooler) system that's more reliable. And if you want the cheapest 4c/8t CPU, it's a Xeon, NOT an i7.
I don't feel particularly artificially segmented. Additionally, the high-end desktop motherboards tend to be more expensive than the server boards: I often find a nice server board at $180, while the nice desktop boards are often another $100. Sure, they are marketed to gamers, but I really just want reliable power and cooling, and it's not clear which of the cheaper desktop boards will really last 24/7 for 5 years.
Today I'd buy the E3-1270 for $339 over the $350 i7-6700K. Keep in mind the K chips carry a premium AND they don't come with a fan like the non-K chips do. Sure, it's 3.6-4.0 GHz instead of 4.0-4.2 GHz, but that's not a particularly noticeable difference, especially since both thermally throttle as needed.
I think ECC is well justified because it doesn't just detect DIMM errors, but also motherboard errors, CPU errors, and socket (DIMM or CPU) errors. If a node randomly crashes or hangs, it's very hard to track down why... unless you have ECC, which will often help you pin it down. I'd much rather see something strange show up in mcelog than wait for a hang, or worse, a corruption.
Most of my "ECC" errors have actually been motherboard, socket, or (in AMD's case) CPU problems. When I look at larger samples, some DIMMs are WAY less reliable than others, strongly implying it's not high-energy particles but something out of spec.
If it weren't a market segmentation strategy, just like limiting RAM capacity, then there'd be equivalent 'server' chips for most current 'desktop' feature sets and vice versa. However, that is clearly not the case, both in my own shopping experience and in the experience of Jeff Atwood (this is in fact something he complains about in this very article).
ECC would require running and connecting a few more traces, but that would /surely/ be offset by not having to create as many layouts or source/stock as many parts. In the past, AMD had a competing selling point in that /all/ of their CPUs supported ECC RAM. Today that is not the case, as they too have mirrored (colluded with?) Intel's market segmentation strategy.
You have to give props to ASRock, as they frequently offer unique motherboard features (like that mini-ITX X99 board, etc.).
However, it still does not change the fact that the Xeon v3 models are slower single-threaded (topping out at 3.9 GHz Turbo) than Haswell-E, so it's the same story on a different chipset: either you get fast single-threaded performance with Haswell-E, or you get a Xeon with ECC.
(1) IIRC, some operating systems, on seeing some ECC errors (maybe just the uncorrectable ones, or maybe also the correctable ones), would mark the associated memory, or block of memory, as faulty, possibly stop the application program using that memory, and continue on. Is this done with current operating systems?
(2) What would Windows Server do with a thread, process, address space, or whatever the heck that encountered a memory error detected by ECC, especially an uncorrectable one? I'm eager to know, since I'm eager to build a server with ECC memory and run Windows Server in production.
(1) I have never heard of this behavior. Correctable errors will be reported via MCA on Intel, and uncorrectable ones will reset the system (and probably be logged in some firmware log).
(2) So as far as I know, the normal consequence of a detected multi-bit error is a system reset.
Yeah, agreed. I found myself wondering why he kept referring to "reliability". Perhaps he defines it to include data integrity, but it didn't come across that way to me. I kept thinking to myself "nobody I know uses ECC just to increase reliability".
I believe that Google, from the beginning, implemented their own checksums in software to regularly verify the data processed or communicated on their non-ECC computers. And I doubt the open-source software the article's author uses does the same.
I know that Jeff is a demigod to some people, but I interpret this article as: "As a software guy, I don't really understand why I need this fancy hardware, so this can't be important". IMO he's wrong.
The margins between working and non-working DRAM these days are extremely small. E.g. Rowhammer demonstrated that even user-space programs could readily obliterate main memory, without even trying very hard to do so.[1]
But, maybe in this case he's right. It's not like "open source Internet forum software" is anything that's mission critical. If there's an occasional garble in a character or two, will the latte-swilling hipsters even notice? :-)
Just like the original Google servers he points to. Who cares if they occasionally screwed up in reporting search results, because they didn't have ECC memory. Overall the experience was still 100x better than using something like Altavista.
What Jeff is trying to say is: if ECC is so desperately needed to prevent memory errors that are supposedly happening all the time, why isn't ECC in every computer everywhere?
The average consumer knows that more "jigabits" are better and more "jigahertz" is better (see Intel NetBurst for how badly that can go wrong).
See a link elsewhere in this thread: someone posted a memory-error presentation that talked about FIT, failures in time. But the average consumer doesn't know what that is.
Hence we get a race to the bottom. PC assemblers are willing to sell their mothers into slavery if it can save them $0.05 in build cost. ECC doesn't fit into that narrative.
BTW ECC is "in every computer" nowadays. As yet another poster mentioned, Intel CPUs use ECC internally to protect their caches.
There are at least two broad classes of error correction and detection: at-rest and in-flight.
Each storage-hierarchy component (RAM, SSD, CPU caches, etc.) and interconnect (chip-to-chip, add-on card, cable to another box) needs to be assessed for the risk of non-detection and data loss, based on the consequences for the intended use.
For example, billing database servers for a successful company probably should use RAID array/SAN/NAS (say RAID6 or ZFS with RAIDZ3) and Chipkill ECC memory on an enterprise-class box with decent vendor support.
CDN boxes for serving free, static content can be almost anything.
Larger shops have the economies of scale to ask OEMs and ODMs to build custom boxes that are more optimized than COTS gear from Dell, HP, or CDW.
When Jeff's venture takes off, they might explore gear customized for running Ruby, and/or partner with 37signals and the like to have OEM/ODM folks develop better-performing gear and open source it like Facebook has.
Cost is king in a commodity market that IBM et al. left for more profitable waters.
Dell was pretty good at shaving pennies and providing WalMart-ized desktops and servers.
I think the offerings need to be optimized, with features cut down to just what's necessary based on actual, intended uses, rather than guessing, throwing every possible feature into a retail desktop, or offering a blizzard of different, poorly explained SKUs (what's the difference between A78Z-VX and A78C-VX+?).
"Non-parity" RAM probably started becoming common around the 1993-1995 period when DRAM demand was increasing and prices was not falling much. For example, 4Mbit DRAM was costing more than $10 per chip during this period. Nowadays Intel uses it for market segmentation.
Just to be clear, SECDED ECC doesn't protect you against Rowhammer and similar memory-disturbance attacks.
DDR4 implemented some mitigations against such attacks, as well as some additional soft-ECC mechanisms, but since these types of attacks are fairly new, it's not yet clear how effective they are.
We once had a new server with all-new hardware that had weird problems and kept crashing mysteriously. Memory tests showed no errors, so we were all tearing our hair out. We took the server offline and set it to test continuously: still no errors. Only after running Memtest86 on nothing but test #4 for about a day did a few memory errors finally show up. We replaced the memory, the problem was gone, and the server started working.
Memory errors are especially insidious compared to how common they are. ECC is worth it.
I remember circa 1999 having a database server which had a stuck bit in memory. The bit happened to be placed in the page cache, so it subtly corrupted disk writes resulting in the database throwing checksum errors. It took an insane amount of time to even diagnose where the problem could be. We of course thought it was the disks themselves and tried many variations of disks and external RAID cards. Finally, one run of memtest86 found the real problem, and I threw away the memory and motherboard and replaced it with one capable of ECC RAM.
I forget now why we even thought to build a server without ECC RAM, but I sure learned my lesson after that.
I tried to catch soft errors for about a year on a couple of Linux boxes I had. They were both desktop form factor machines, one being used as a home server and one as a desktop at work.
I had a background process [1] on each that simply allocated a 128 MB buffer, filled it with a known data pattern, and then went into an infinite loop: sleep a while, wake up and check the integrity of the buffer, and, if any of the data had changed, log the change and restore the pattern.
Based on the error rates I'd seen published, I expected to catch a few errors. For example, using the rate that Tomte's comment [2] cites I think I'd expect about 6 errors a year.
I never caught an error.
I also have two desktops with ECC (a 2008 Mac Pro and a 2009 Mac Pro). I've used the 2008 Mac Pro every working day since I bought it in 2008, and the 2009 Mac Pro every day since I bought it in 2009. Neither of them has ever reported correcting an error.
I have no idea why I have not been able to see an error.
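For anyone who wants to repeat the experiment, here is a rough Python sketch of the kind of scrubber described above (my reconstruction, not the original program from [1]; buffer size, pattern, and check interval are arbitrary choices):

    import time

    SIZE = 128 * 1024 * 1024       # 128 MB test buffer
    PATTERN = 0xA5                 # known data pattern
    INTERVAL = 3600                # sleep an hour between checks

    buf = bytearray([PATTERN]) * SIZE

    while True:
        time.sleep(INTERVAL)
        if buf.count(PATTERN) == SIZE:
            continue                     # fast path: nothing changed
        # Slow path (pure Python, so not quick, but it only runs on a mismatch):
        for i, b in enumerate(buf):
            if b != PATTERN:
                print("flip at offset %d: expected 0x%02x, got 0x%02x" % (i, PATTERN, b))
                buf[i] = PATTERN         # restore the pattern and keep watching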
As soon as you have a power fluctuation, an air-conditioning malfunction, or a few dirt-caused short circuits, you'll get enough errors to converge on the published average.
Just wait, and relax. You'll get there eventually.
IEC 61508 documents an estimate of 700 to 1200 FIT/Mbit (FIT = "failures in time", i.e. failures per 10^9 device-hours of operation) and gives the following sources:
a) J-L. Autran, P. Roche, C. Sudre et al., "Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130 nm SRAM," IEEE Transactions on Nuclear Science, vol. 54, no. 4, pp. 1002-1009, Aug. 2007.
b) R. C. Baumann, "Radiation-Induced Soft Errors in Advanced Semiconductor Technologies," IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, Sep. 2005.
c) R. Mastipuram and E. C. Wee, "Soft errors' impact on system reliability," Cypress Semiconductor, 2004.
d) C. Costantinescu, "Trends and Challenges in VLSI Circuit Reliability," Intel, IEEE Computer Society, 2003.
e) P. E. Dodd and L. W. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583-602, Jun. 2003.
f) F. W. Sexton, "Destructive single-event effects in semiconductor devices and ICs," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603-621, Jun. 2003.
g) R. Ronen and A. Mendelson, "Coming Challenges in Microarchitecture and Architecture," Proceedings of the IEEE, vol. 89, no. 3, pp. 325-340, Mar. 2001.
h) A. Johnston, "Scaling and Technology Issues for Soft Error Rates," 4th Annual Research Conference on Reliability, Stanford University, Oct. 2000.
i) International Technology Roadmap for Semiconductors (ITRS), several papers.
If that's correct, the math is simple: you have bit flips in your PC about once a day.
It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
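A back-of-envelope check of that "about once a day" figure, assuming 8 GB of non-ECC RAM (my assumption, not the parent's) and roughly the midpoint of the quoted range:

    fit_per_mbit = 1000                      # midpoint-ish of the 700-1200 FIT/Mbit range
    mbit = 8 * 1024 * 8                      # assumed 8 GB of RAM, expressed in Mbit
    errors_per_hour = mbit * fit_per_mbit / 1e9
    print(errors_per_hour * 24)              # ~1.6 expected bit flips per day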
> It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
Also, most modern processors use ECC for their caches (even when main memory is non-ECC), and the caches serve the vast majority of memory requests, so it is unlikely that intermediate values in a tight computation are affected by non-ECC RAM. That adds to the "silentness" of bit flips in consumer systems.
These things do happen with a reasonable amount of frequency. I used to work at a division of a major memory manufacturer that dealt with writing tests to find DIMMs that exhibited these sorts of failures; the semiconductor industry calls them "variable retention time" failures. (Aside: numerous PhDs in the field of semiconductor physics have built prosperous careers trying to understand why these soft failures happen. Short answer: we have some theories, but we don't really know.) It was provably worth millions of dollars to be able to screen for this sort of phenomenon, because a Google or an Apple or an IBM would return a whole manufacturing lot of your bleeding-edge, high-margin DIMMs if they found one bit error in one chip of one lot. Each lot was shipping for millions and millions of dollars.
Anyone who has managed even a modest number of servers with ECC RAM for a reasonable amount of time has surely seen ECC events in their hardware logs. Most of these are one-time errors that never happen again on the same server, ever.
Without ECC, these errors would have unknown consequences. They could happen in some unused region of memory, or they could happen in a dirty page in the filesystem cache. It's not fun to discover that your filesystem has been silently corrupted an unknown time after the fact.
Maybe Google doesn't need ECC. Their data is duplicated across several machines and it's extremely unlikely that a few corrupt servers would lead to any data loss.
However, on a smaller scale (and just like RAID) it's cheaper to have ECC than add more servers for extra redundancy.
What he's saying is essentially: "The code I write / the platform I choose scales poorly over multiple cores. Therefore I decide to blame the hardware and skip features that are good for me."
People need to adapt to a world where we get more cores instead of faster execution per core. You can't compare the late-90s growth in per-core execution speed with the situation we have today.
Write software for an environment where the number of cores scale, instead of an environment where the execution speed of a single core is more important.
> What he's saying is essentially: "The code I write / the platform I choose scales poorly over multiple cores. Therefore I decide to blame the hardware and skip features that are good for me."
Is that so bad? He's writing and hosting the code, and he's paying the bill to do it. Seems to me he should be able to pick how to do it.
This cannot possibly be right. There was a DC21 talk about DNS request misfires due to bit flips in non-ECC DRAM, and the researcher was able to collect a surprisingly large number of such requests on that basis.
Importantly, those DNS packets pass through a number of systems that are not clients or servers: Wi-Fi, microwave antennas, undersea cables, consumer routers, unpowered hubs, you name it. It's hard to know whether these bit flips actually come from cosmic rays, EM interference, or rare decompression bugs.
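For context, such requests are typically harvested by registering "bitsquat" domains: names that are exactly one bit flip away from a popular hostname. A rough Python sketch of how those candidates can be enumerated (my illustration, not the talk's actual tooling):

    import string

    VALID = set(string.ascii_lowercase + string.digits + "-")

    def bitsquats(name):
        """Yield names that differ from `name` by exactly one flipped bit."""
        for i, ch in enumerate(name):
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit)).lower()
                if flipped != ch and flipped in VALID:
                    yield name[:i] + flipped + name[i + 1:]

    print(sorted(set(bitsquats("example"))))   # e.g. "axample", "dxample", ...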
If soft errors are rare, parity checking without correction might be more useful. It's better to have a server fail hard than produce errors. In a "cloud" service, the systems are already in place to handle a hard failure and move the load to another machine. Unambiguous hardware-failure detection is exactly what you want.
In practice, you basically get one-bit error correction "for free" once you have enough redundancy to detect two-bit soft errors. Simple parity can only detect a single flipped bit, so if you want to catch two-bit errors, you might as well correct the one-bit errors you find along the way at no extra cost.
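To make that concrete, here is a toy SECDED code in Python: an extended Hamming(8,4) over a 4-bit value (real DIMMs use a (72,64) code over 64-bit words, but the mechanics are the same). A single flipped bit produces a non-zero syndrome plus odd overall parity and gets corrected; two flipped bits produce a non-zero syndrome with even overall parity and are only detected:

    def encode(nibble):
        """Return an 8-bit codeword [p0, p1, p2, d1, p4, d2, d3, d4]."""
        d = [(nibble >> i) & 1 for i in range(4)]       # data bits d1..d4
        p1 = d[0] ^ d[1] ^ d[3]                         # covers positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]                         # covers positions 2,3,6,7
        p4 = d[1] ^ d[2] ^ d[3]                         # covers positions 4,5,6,7
        code = [p1, p2, d[0], p4, d[1], d[2], d[3]]     # Hamming(7,4), positions 1..7
        p0 = 0
        for b in code:                                  # overall parity bit
            p0 ^= b
        return [p0] + code

    def decode(code):
        """Return (nibble, status); status is 'ok', 'corrected' or 'double-error'."""
        c = list(code)
        s = 0                                           # Hamming syndrome
        for p in (1, 2, 4):
            parity = 0
            for pos in range(1, 8):
                if pos & p:
                    parity ^= c[pos]
            if parity:
                s |= p
        overall = 0
        for b in c:
            overall ^= b
        if s == 0 and overall == 0:
            status = 'ok'
        elif overall == 1:                              # odd parity: single-bit error
            c[s if s else 0] ^= 1                       # s == 0 means p0 itself flipped
            status = 'corrected'
        else:                                           # even parity, non-zero syndrome
            return None, 'double-error'
        d = [c[3], c[5], c[6], c[7]]
        return sum(bit << i for i, bit in enumerate(d)), status

    codeword = encode(0b1011)
    flipped = list(codeword)
    flipped[5] ^= 1
    print(decode(flipped))      # (11, 'corrected')
    flipped[2] ^= 1
    print(decode(flipped))      # (None, 'double-error')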
I don't think data corruption was a huge issue for Google back then (really early on). Corrupt data? Big whoop. Re-index the internet in another X hours and it's gone. I doubt they had much persistent storage, since most of their data was transient and, well, the Internet.
Also, I still see "fire hazard" when I look at the early Google racks. No idea how Equinix let them get away with it. Too much ivory tower going on there. Not enough "you know we're liable if we burn down the colo with that crap, right?"
There is no extra chance of a short circuit before the power supply. After the power supply, the power is limited, either by explicit current limiting or simply because they are switching power supplies, where transformer saturation limits the power.
So you could have a PCB fire, but PCBs are made to be flame retardant. You could have a wire insulation fire, but the amount of material would be so low that it wouldn't be able to start a fire anywhere else.
So I am basically saying there isn't really anything there that could sustain a fire and that there isn't a lot of energy to start ignition in the first place.
Fun fact: I had my desktop get unreliable for a few weeks until I finally thought it might be just dust build up in the case. Opened it up and found a huge scorch mark on the motherboard where a capacitor had clearly burned up.
Cardboard breaks down over time. It turns into particulate matter that goes airborne into really hot server intakes and comes out tiny little burning embers.
If it didn't burn down Google's stuff, it could have burned down other people's gear. I have decades of experience here; I'm not an ivory-tower nerd. Any datacenter/colo provider worth their salt will jump on you immediately for having cardboard in your environment. DRT makes you unbox everything outside the various colos and won't even let cardboard enter.
The auto-ignition temperature of paper is over 200°C. The maximum junction temperature of most electronics is somewhere around 100°C. This literally could not have happened unless the equipment was already on fire.
I'll leave the idea that cardboard breaks down fast enough to be noticed over the life of a server to someone more knowledgeable. I note that there was no mention of cardboard in the article.
The Xeon E3-1270 v5 goes from 3.6 to 4.0 GHz and only costs 10% more than the i7-6700 (3.4-4.0 GHz)
Also, the Xeon E3-1230 v5 goes from 3.4 to 3.8 GHz (same base clock) and costs less than the Core i7-6700.
In general, you should never buy non-Xeon CPUs if you have the choice, both for desktops and for servers, since ECC memory is essential if you don't want a significant chance of having to replace your RAM after discovering mysterious problems with your system.
Where do you create that checksum? If it's on a computer without ECC, you will just checksum the data including the error, then write that data, error included, to disk.
What happens to the data after you have read it into memory and successfully verified the checksum? You probably process it in memory, and afterwards you have no idea whether the changes are due to your code or due to errors.
Of course, you can now propose to also checksum and verify the data while it is in memory. Which is basically what ECC does: in hardware, for cheap, and requiring no CPU cycles.
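To make that window concrete, a minimal Python sketch (hypothetical names; any real checksum behaves the same way): the checksum only vouches for the bytes as they were at the moment it was computed, so a flip before it is computed or after it is verified slips through unnoticed.

    import zlib

    def checksum(data):
        return zlib.crc32(data)

    def verify(data, expected):
        return zlib.crc32(data) == expected

    payload = bytearray(b"important record")
    # A bit flip *here*, before the checksum is computed, gets baked in.
    crc = checksum(payload)

    # ... write to disk, read back, etc. ...
    assert verify(payload, crc)     # integrity of the stored copy: fine

    # A bit flip *here*, after verification, is invisible to the checksum.
    payload[0] ^= 0x04              # simulate a single-bit upset in RAM
    result = bytes(payload)         # downstream code happily consumes garbage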
[1]: http://cr.yp.to/hardware/ecc.html