Please note that the main purpose of ECC is not to reduce the RAM error rate and make memory look more reliable, but to let the system stop a process when an unrecoverable memory error occurs instead of propagating it and producing unpredictable outcomes. The change in the effect of a failure is what matters most, not its probability. Without ECC, there is often no clear way to tell whether the result of a computation is valid or garbage that should be discarded.
(Of course, in extreme scenarios, like at Google scale, even ECC can fail to fail-stop due to multi-bit errors, but in almost all non-pathological scenarios, SECDED[1] is enough to catch the erroneous cases.)
Exactly, you want to know when the error is due to memory.
Intel deciding that consumers (including those buying Haswell-E CPUs) do not need ECC really irks me. Textbook market segmentation from a near-monopoly.
Currently you cannot have your cake and eat it too:
You cannot have the best single-threaded performance (offered by overclocking the Haswell-E series or the Skylake 6700K) and also have ECC.
So if you are building the ultimate workstation, you have a hard choice: do you go with the X99 chipset (no ECC, but you can overclock), or do you go with server motherboards on the C610 chipset, which are quite limited as far as consumer interests go?
The Intel mobile Xeons are interesting, as they now provide a way to get ECC in a laptop.
Textbook market segmentation? Xeons have ECC, a larger thermal envelope, and some additional testing. Sure, they are identical silicon.
Generally, if you are willing to give up a single clock bin in exchange for ECC, you end up with a cheaper (and cooler) system that's more reliable. And if you want the cheapest 4c/8t CPU, it's a Xeon, NOT an i7.
I don't feel particularly artificially segmented. Additionally, the high-end desktop motherboards tend to be more expensive than the server boards: I often find a nice server board at $180, while the nice desktop boards are often another $100. Sure, they are marketed to gamers, but I really just want reliable power and cooling, and it's not clear which of the cheaper desktop boards will really last 24/7 for 5 years.
Today I'd buy the E3-1270 for $339 over the $350 i7-6700K. Keep in mind the K chips carry a premium AND they don't come with a fan like the non-K chips do. Sure, it's 3.6-4.0 GHz instead of 4.0-4.2 GHz, but that's not a particularly noticeable difference, especially since both thermally throttle as needed.
I think ECC is well justified because it doesn't just detect DIMM errors, but also motherboard errors, CPU errors, and socket (DIMM or CPU) errors. If a node randomly crashes or hangs, it's very hard to track down why... unless you have ECC, which will often help you pin it down. I'd much rather see something strange show up in mcelog than wait for a hang, or worse, a corruption.
Most of my "ECC" errors have actually been motherboard, socket, or (in AMD's case) CPU problems. When I look at larger samples, some DIMMs are WAY less reliable than others, strongly implying it's not high-energy particles but something out of spec.
If it weren't a market segmentation strategy, just like limiting RAM capacity, then there'd be equivalent 'server' chips for most current 'desktop' feature sets and vice versa. However, that is clearly not the case, both in my own shopping experience and in the experience of Jeff Atwood (this is in fact something he complains about in this very article).
ECC would require running and connecting a few more traces, but that would /surely/ be offset by not having to create as many layouts or source/stock as many parts. In the past, AMD had a competing selling point in that /all/ of their CPUs supported ECC RAM. Today that is not the case, as they too have mirrored (colluded with?) Intel's market segmentation strategy.
You have to give props to ASRock, as they frequently offer unique motherboard features (like that mini-ITX X99 board, etc.).
However, it still does not change the fact that the Xeon v3 models are slower single-threaded (topping out at 3.9 GHz Turbo) than Haswell-E, so it's the same story on a different chipset: either you get fast single-threaded performance with Haswell-E, or you get a Xeon with ECC.
(1) IIRC, some operating systems, on seeing some ECC errors (maybe just the uncorrectable ones, or maybe also the correctable ones), would mark the associated memory, or block of memory, as faulty, possibly stop the application program using that memory, and continue on. Is this done with current operating systems?
(2) What would Windows Server do with a thread, process, address space, or whatever the heck that encountered a memory error detected by ECC, especially an uncorrectable one? I'm eager to know, since I'm eager to build a server with ECC memory and run Windows Server in production.
(1) I have never heard of this behavior. Correctable errors will be reported via MCA on Intel, and uncorrectable ones will reset the system (and probably be logged in some firmware log).
(2) So as far as I know, the normal consequence of a detected multi-bit error is a system reset.
Yeah, agreed. I found myself wondering why he kept referring to "reliability". Perhaps he defines it to include data integrity, but it didn't come across that way to me. I kept thinking to myself "nobody I know uses ECC just to increase reliability".
I believe that Google, from the beginning, implemented their own checksums in software to regularly verify the data processed or communicated on their non-ECC computers. And I doubt the open-source software the article's author uses does the same.
I know that Jeff is a demigod to some people, but I interpret this article as: "As a software guy, I don't really understand why I need this fancy hardware, so this can't be important". IMO he's wrong.
The margins between working and non-working DRAM these days are extremely small. E.g. Rowhammer demonstrated that even user-space programs could readily obliterate main memory, without even trying very hard to do so.[1]
But, maybe in this case he's right. It's not like "open source Internet forum software" is anything that's mission critical. If there's an occasional garble in a character or two, will the latte-swilling hipsters even notice? :-)
Just like the original Google servers he points to. Who cares if they occasionally screwed up in reporting search results, because they didn't have ECC memory. Overall the experience was still 100x better than using something like Altavista.
What Jeff is trying to say is: if ECC is so desperately needed to prevent memory errors that are supposedly happening all the time, why isn't ECC in every computer everywhere?
The average consumer knows that more "jigabits" are better and more "jigahertz" is better (see Intel NetBurst for how badly that can go wrong).
See a link elsewhere in this thread: someone posted a memory-error presentation that talked about FIT, failures in time. But the average consumer doesn't know what that is.
Hence we get a race to the bottom. PC assemblers are willing to sell their mothers into slavery if it can save them $0.05 in build cost. ECC doesn't fit into that narrative.
BTW ECC is "in every computer" nowadays. As yet another poster mentioned, Intel CPUs use ECC internally to protect their caches.
There are at least two broad classes of error correction and detection: at-rest and in-flight.
Each storage-hierarchy component (RAM, SSD, CPU caches, etc.) and interconnect (chip-to-chip, add-on card, cable to another box) needs to be assessed for the risk of non-detection and data loss, based on the consequences for the intended use.
For example, billing database servers for a successful company probably should use RAID array/SAN/NAS (say RAID6 or ZFS with RAIDZ3) and Chipkill ECC memory on an enterprise-class box with decent vendor support.
CDN boxes for serving free, static content can be almost anything.
Larger shops have the economies of scale to ask OEMs and ODMs to build custom boxes that are more optimized than COTS gear from Dell, HP, or CDW.
When Jeff's venture takes off, they might explore gear customized for running Ruby, and/or partner with 37signals and the like to have OEM/ODM folks develop better-performing gear and open source it like Facebook has.
Cost is king in a commodity market that IBM et al. left for more profitable waters.
Dell was pretty good at shaving pennies and providing WalMart-ized desktops and servers.
I think the offerings need to be optimized, with features cut down to just what's necessary based on actual, intended uses, rather than guessing, throwing every possible feature into a retail desktop, or offering a blizzard of different, poorly explained SKUs (what's the difference between A78Z-VX and A78C-VX+?).
"Non-parity" RAM probably started becoming common around the 1993-1995 period when DRAM demand was increasing and prices was not falling much. For example, 4Mbit DRAM was costing more than $10 per chip during this period. Nowadays Intel uses it for market segmentation.
Just to be clear, SECDED ECC doesn't protect you against Rowhammer and similar memory-disturbance attacks.
DDR4 implemented some mitigations against such attacks, as well as some additional soft-ECC mechanisms, but since these types of attacks are fairly new, it's not yet clear how effective they are.
We once had a new server with all-new hardware that had weird problems and kept crashing mysteriously. Memory tests showed no errors, so we were all tearing our hair out. We took the server offline and set it to test continuously: still no errors. Only after running Memtest86 on nothing but test #4 for about a day did a few memory errors finally show up. We replaced the memory, the problem was gone, and the server started working.
Memory errors are especially insidious compared to how common they are. ECC is worth it.
I remember circa 1999 having a database server which had a stuck bit in memory. The bit happened to be placed in the page cache, so it subtly corrupted disk writes resulting in the database throwing checksum errors. It took an insane amount of time to even diagnose where the problem could be. We of course thought it was the disks themselves and tried many variations of disks and external RAID cards. Finally, one run of memtest86 found the real problem, and I threw away the memory and motherboard and replaced it with one capable of ECC RAM.
I forget now why we even thought to build a server without ECC RAM, but I sure learned my lesson after that.
I tried to catch soft errors for about a year on a couple of Linux boxes I had. They were both desktop form factor machines, one being used as a home server and one as a desktop at work.
I had a background process [1] on each that simply allocated a 128 MB buffer, filled it with a known data pattern, and then went into an infinite loop: sleep a while, wake up and check the integrity of the buffer, and, if any of the data had changed, log the change and restore the pattern.
Based on the error rates I'd seen published, I expected to catch a few errors. For example, using the rate that Tomte's comment [2] cites I think I'd expect about 6 errors a year.
I never caught an error.
I also have two desktops with ECC (a 2008 Mac Pro and a 2009 Mac Pro). I've used the 2008 Mac Pro every working day since I bought it in 2008, and the 2009 Mac Pro every day since I bought it in 2009. Neither of them has ever reported correcting an error.
I have no idea why I have not been able to see an error.
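For anyone who wants to repeat the experiment, here is a rough Python sketch of the kind of scrubber described above (my reconstruction, not the original program from [1]; buffer size, pattern, and check interval are arbitrary choices):

    import time

    SIZE = 128 * 1024 * 1024       # 128 MB test buffer
    PATTERN = 0xA5                 # known data pattern
    INTERVAL = 3600                # sleep an hour between checks

    buf = bytearray([PATTERN]) * SIZE

    while True:
        time.sleep(INTERVAL)
        if buf.count(PATTERN) == SIZE:
            continue                     # fast path: nothing changed
        # Slow path (pure Python, so not quick, but it only runs on a mismatch):
        for i, b in enumerate(buf):
            if b != PATTERN:
                print("flip at offset %d: expected 0x%02x, got 0x%02x" % (i, PATTERN, b))
                buf[i] = PATTERN         # restore the pattern and keep watching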
As soon as you have a power fluctuation, an air-conditioning malfunction, or a few dirt-caused short circuits, you'll get enough errors to converge on the published average.
Just wait, and relax. You'll get there eventually.
IEC 61508 documents an estimate of 700 to 1200 FIT/Mbit (FIT = "failures in time", i.e. failures per 10^9 device-hours of operation) and gives the following sources:
a) J-L. Autran, P. Roche, C. Sudre et al., "Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130 nm SRAM," IEEE Transactions on Nuclear Science, vol. 54, no. 4, pp. 1002-1009, Aug. 2007.
b) R. C. Baumann, "Radiation-Induced Soft Errors in Advanced Semiconductor Technologies," IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, Sep. 2005.
c) R. Mastipuram and E. C. Wee, "Soft errors' impact on system reliability," Cypress Semiconductor, 2004.
d) C. Costantinescu, "Trends and Challenges in VLSI Circuit Reliability," Intel, IEEE Computer Society, 2003.
e) P. E. Dodd and L. W. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583-602, Jun. 2003.
f) F. W. Sexton, "Destructive single-event effects in semiconductor devices and ICs," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603-621, Jun. 2003.
g) R. Ronen and A. Mendelson, "Coming Challenges in Microarchitecture and Architecture," Proceedings of the IEEE, vol. 89, no. 3, pp. 325-340, Mar. 2001.
h) A. Johnston, "Scaling and Technology Issues for Soft Error Rates," 4th Annual Research Conference on Reliability, Stanford University, Oct. 2000.
i) International Technology Roadmap for Semiconductors (ITRS), several papers.
If that's correct, the math is simple: you have bit flips in your PC about once a day.
It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
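A back-of-envelope check of that "about once a day" figure, assuming 8 GB of non-ECC RAM (my assumption, not the parent's) and roughly the midpoint of the quoted range:

    fit_per_mbit = 1000                      # midpoint-ish of the 700-1200 FIT/Mbit range
    mbit = 8 * 1024 * 8                      # assumed 8 GB of RAM, expressed in Mbit
    errors_per_hour = mbit * fit_per_mbit / 1e9
    print(errors_per_hour * 24)              # ~1.6 expected bit flips per day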
> It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
Also, most modern processors use ECC for their caches (even when main memory is non-ECC), and the caches serve the vast majority of memory requests, so it is unlikely that intermediate values in a tight computation are affected by non-ECC RAM. That adds to the "silentness" of bit flips in consumer systems.
These things do happen with a reasonable amount of frequency. I used to work at a division of a major memory manufacturer that dealt with writing tests to find DIMMs that exhibited these sorts of failures; the semiconductor industry calls them "variable retention time" failures. (Aside: numerous PhDs in the field of semiconductor physics have built prosperous careers trying to understand why these soft failures happen. Short answer: we have some theories, but we don't really know.) It was provably worth millions of dollars to be able to screen for this sort of phenomenon, because a Google or an Apple or an IBM would return a whole manufacturing lot of your bleeding-edge, high-margin DIMMs if they found one bit error in one chip of one lot. Each lot was shipping for millions and millions of dollars.
Anyone who has managed even a modest number of servers with ECC RAM for a reasonable amount of time has surely seen ECC events in their hardware logs. Most of these are one-time errors that never happen again on the same server, ever.
Without ECC, these errors would have unknown consequences. They could happen in some unused region of memory, or they could happen in a dirty page in the filesystem cache. It's not fun to discover that your filesystem has been silently corrupted an unknown time after the fact.
Maybe Google doesn't need ECC. Their data is duplicated across several machines and it's extremely unlikely that a few corrupt servers would lead to any data loss.
However, on a smaller scale (and just like RAID) it's cheaper to have ECC than add more servers for extra redundancy.
What he's saying is essentially: "The code I write / the platform I choose scales poorly over multiple cores. Therefore I decide to blame the hardware and skip features that are good for me."
People need to adapt to a world where we get more cores instead of faster execution per core. You can't compare the late-90s growth in per-core execution speed with the situation we have today.
Write software for an environment where the number of cores scale, instead of an environment where the execution speed of a single core is more important.
> What he's saying is essentially: "The code I write / the platform I choose scales poorly over multiple cores. Therefore I decide to blame the hardware and skip features that are good for me."
Is that so bad? He's writing and hosting the code, and he's paying the bill to do it. Seems to me he should be able to pick how to do it.
This cannot possibly be right. There was a DC21 talk about DNS request misfires due to bit flips in non-ECC DRAM, and the researcher was able to collect a surprisingly large number of such requests on that basis.
Importantly, those DNS packets pass through a number of systems that are not clients or servers: Wi-Fi, microwave antennas, undersea cables, consumer routers, unpowered hubs, you name it. It's hard to know whether these bit flips actually come from cosmic rays, EM interference, or rare decompression bugs.
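For context, such requests are typically harvested by registering "bitsquat" domains: names that are exactly one bit flip away from a popular hostname. A rough Python sketch of how those candidates can be enumerated (my illustration, not the talk's actual tooling):

    import string

    VALID = set(string.ascii_lowercase + string.digits + "-")

    def bitsquats(name):
        """Yield names that differ from `name` by exactly one flipped bit."""
        for i, ch in enumerate(name):
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit)).lower()
                if flipped != ch and flipped in VALID:
                    yield name[:i] + flipped + name[i + 1:]

    print(sorted(set(bitsquats("example"))))   # e.g. "axample", "dxample", ...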
If soft errors are rare, parity checking without correction might be more useful. It's better to have a server fail hard than produce errors. In a "cloud" service, the systems are already in place to handle a hard failure and move the load to another machine. Unambiguous hardware-failure detection is exactly what you want.
In practice, you basically get one-bit error correction "for free" once you have enough redundancy to detect two-bit soft errors. Simple parity can only detect a single flipped bit, so if you want to catch two-bit errors, you might as well correct the one-bit errors you find along the way at no extra cost.
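To make that concrete, here is a toy SECDED code in Python: an extended Hamming(8,4) over a 4-bit value (real DIMMs use a (72,64) code over 64-bit words, but the mechanics are the same). A single flipped bit produces a non-zero syndrome plus odd overall parity and gets corrected; two flipped bits produce a non-zero syndrome with even overall parity and are only detected:

    def encode(nibble):
        """Return an 8-bit codeword [p0, p1, p2, d1, p4, d2, d3, d4]."""
        d = [(nibble >> i) & 1 for i in range(4)]       # data bits d1..d4
        p1 = d[0] ^ d[1] ^ d[3]                         # covers positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]                         # covers positions 2,3,6,7
        p4 = d[1] ^ d[2] ^ d[3]                         # covers positions 4,5,6,7
        code = [p1, p2, d[0], p4, d[1], d[2], d[3]]     # Hamming(7,4), positions 1..7
        p0 = 0
        for b in code:                                  # overall parity bit
            p0 ^= b
        return [p0] + code

    def decode(code):
        """Return (nibble, status); status is 'ok', 'corrected' or 'double-error'."""
        c = list(code)
        s = 0                                           # Hamming syndrome
        for p in (1, 2, 4):
            parity = 0
            for pos in range(1, 8):
                if pos & p:
                    parity ^= c[pos]
            if parity:
                s |= p
        overall = 0
        for b in c:
            overall ^= b
        if s == 0 and overall == 0:
            status = 'ok'
        elif overall == 1:                              # odd parity: single-bit error
            c[s if s else 0] ^= 1                       # s == 0 means p0 itself flipped
            status = 'corrected'
        else:                                           # even parity, non-zero syndrome
            return None, 'double-error'
        d = [c[3], c[5], c[6], c[7]]
        return sum(bit << i for i, bit in enumerate(d)), status

    codeword = encode(0b1011)
    flipped = list(codeword)
    flipped[5] ^= 1
    print(decode(flipped))      # (11, 'corrected')
    flipped[2] ^= 1
    print(decode(flipped))      # (None, 'double-error')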
I don't think data corruption was a huge issue for Google back then (really early on). Corrupt data? Big whoop. Re-index the internet in another X hours and it's gone. I doubt they had much persistent storage, since most of their data was transient and, well, the Internet.
Also, I still see "fire hazard" when I look at the early Google racks. No idea how Equinix let them get away with it. Too much ivory tower going on there. Not enough "you know we're liable if we burn down the colo with that crap, right?"
There is no extra chance of a short circuit before the power supply. After the power supply, the power is limited, either by explicit current limiting or simply because they are switching power supplies, where transformer saturation limits the power.
So you could have a PCB fire, but PCBs are made to be flame retardant. You could have a wire insulation fire, but the amount of material would be so low that it wouldn't be able to start a fire anywhere else.
So I am basically saying there isn't really anything there that could sustain a fire and that there isn't a lot of energy to start ignition in the first place.
Fun fact: I had my desktop get unreliable for a few weeks until I finally thought it might be just dust build up in the case. Opened it up and found a huge scorch mark on the motherboard where a capacitor had clearly burned up.
Cardboard breaks down over time. It turns into particulate matter that goes airborne into really hot server intakes and comes out tiny little burning embers.
If it didn't burn down Google's stuff, it could have burned down other people's gear. I have decades of experience here; I'm not an ivory-tower nerd. Any datacenter/colo provider worth their salt will jump on you immediately for having cardboard in your environment. DRT makes you unbox everything outside the various colos and won't even let cardboard enter.
The auto-ignition temperature of paper is over 200°C. The maximum junction temperature of most electronics is somewhere around 100°C. This literally could not have happened unless the equipment was already on fire.
I'll leave the idea that cardboard breaks down fast enough to be noticed over the life of a server to someone more knowledgeable. I note that there was no mention of cardboard in the article.
The Xeon E3-1270 v5 goes from 3.6 to 4.0 GHz and only costs 10% more than the i7-6700 (3.4-4.0 GHz)
Also, the Xeon E3-1230 v5 goes from 3.4 to 3.8 GHz (same base clock) and costs less than the Core i7-6700.
In general, you should never buy non-Xeon CPUs if you have the choice, both for desktops and for servers, since ECC memory is essential if you don't want a significant chance of having to replace your RAM after discovering mysterious problems with your system.
Where do you create that checksum? If it's on a computer without ECC, you will just checksum the data including the error, then write that data, error included, to disk.
What happens to the data after you have read it into memory and successfully verified the checksum? You probably process it in memory, and afterwards you have no idea whether the changes are due to your code or due to errors.
Of course, you can now propose to also checksum and verify the data while it is in memory. Which is basically what ECC does: in hardware, for cheap, and requiring no CPU cycles.
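To make that window concrete, a minimal Python sketch (hypothetical names; any real checksum behaves the same way): the checksum only vouches for the bytes as they were at the moment it was computed, so a flip before it is computed or after it is verified slips through unnoticed.

    import zlib

    def checksum(data):
        return zlib.crc32(data)

    def verify(data, expected):
        return zlib.crc32(data) == expected

    payload = bytearray(b"important record")
    # A bit flip *here*, before the checksum is computed, gets baked in.
    crc = checksum(payload)

    # ... write to disk, read back, etc. ...
    assert verify(payload, crc)     # integrity of the stored copy: fine

    # A bit flip *here*, after verification, is invisible to the checksum.
    payload[0] ^= 0x04              # simulate a single-bit upset in RAM
    result = bytes(payload)         # downstream code happily consumes garbage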
[1]: http://cr.yp.to/hardware/ecc.html