One thing I'd like to understand better about DDR5 is how well the built-in ECC ...

wmf · on Oct 6, 2020

I expect Intel to still cripple some aspect of DDR5 ECC on consumer chips; maybe it will correct errors but the memory controller won't report them. Or maybe it's possible to disable ECC even though it's already implemented.

I also expect servers to use two levels of ECC to provide chipkill and also to keep server RAM more expensive than consumer.

DCKing · on Oct 6, 2020

Ryan Smith in the link above seems to suggest error correction will be done transparently anyway, so it seems like it won't be reported to the OS. So it doesn't look like Intel could cripple it even if they wanted to.

throw0101a · on Oct 6, 2020

> Ryan Smith in the link above seems to suggest error correction will be done transparently anyway, and it won't be reported to the OS.

I once heard a rant from someone on how not reporting this to the OS is really bad for diagnosing issues, even soft errors that are auto-healed. (It could have been from Bryan Cantrill, but couldn't say for sure.)

DCKing · on Oct 6, 2020

I do think that there will still be some people interested in ECC detection and reporting, but that's not really why I'm interested in ECC memory. The dividing line "enterprise = error correction with monitoring / non-enterprise = silent error correction" is much more sensible than "non-enterprise = no error correction at all, good luck" IMO.

Isn't the primary reason people look into diagnostics that it's very hard to determine whether ECC is working in the first place, because it depends on the particular hardware setup? If the spec states that all DDR5 is supposed to have internal error correction anyway, then I'm happy to take for granted that error correction is working until I read about the scandals of non-spec cheap DDR5 :)

throw0101a · on Oct 6, 2020

Yes, I do not run things at a scale that would need that, but I would appreciate at least a toggle to have it available if needed: default=quiet(er) would be fine for most cases.

BoppreH · on Oct 6, 2020

Zebras All the Way Down - Bryan Cantrill, Uptime 2017

http://www.youtube.com/watch?v=fE2KDzZaxvE&t=35m40s

throw0101a · on Oct 6, 2020

Yup. There's no rant like a Bryan Cantrill rant. :)

NelsonMinar · on Oct 6, 2020

One of the great ironies of modern computers is we dispensed with ECC just as we started ballooning out the size of RAM and shrinking the transistors so that single bit errors were more likely. I'd be very grateful for system ECC.

klodolph · on Oct 6, 2020

I wouldn’t be surprised if at some level physics has forced the manufacturers’ hand: previously low error rates are now unacceptable when you multiply them by 64GB.

lucb1e · on Oct 6, 2020

I actually tried really hard to get RAM to corrupt a bit for a school project and didn't manage a single bit flip.

How often have you actually heard of data corruption due to non-ECC memory? Either yourself, any degree of 'friend of a friend', or perhaps a study that looked into the matter with more success than I had. I don't mean a newspaper story because exceptional cases are reported because they're rare exceptions rather than common enough that we'd be likely to come across it in our lifetimes.

deegles · on Oct 6, 2020

Bunch of links to papers here: https://stackoverflow.com/questions/2580933/cosmic-rays-what...

e.g. "2009 Google's paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM after my calculations. Paper says the same: "mean correctable error rates of 2000–6000 per GB per year". "
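As a sanity check on the arithmetic quoted above, here is a minimal sketch that converts both figures to errors per hour for 8GB. It assumes 1 GB = 1024 MB = 8192 Mbit and a 365-day year; the function names are made up for illustration, and the inputs are only the numbers quoted in the comment, not values re-derived from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming a non-leap year


def errors_per_hour_from_fit(fit_per_mbit: float, gigabytes: float) -> float:
    """FIT = failures per 10^9 hours; the quoted figure is per Mbit."""
    mbits = gigabytes * 1024 * 8  # assumption: 1 GB = 1024 MB = 8192 Mbit
    return fit_per_mbit * mbits / 1e9


def errors_per_hour_from_yearly(per_gb_per_year: float, gigabytes: float) -> float:
    """Convert 'correctable errors per GB per year' to errors per hour."""
    return per_gb_per_year * gigabytes / HOURS_PER_YEAR


if __name__ == "__main__":
    for fit in (25_000, 75_000):
        print(f"{fit} FIT/Mbit -> {errors_per_hour_from_fit(fit, 8):.2f} errors/hour")
    for rate in (2_000, 6_000):
        print(f"{rate}/GB/year -> {errors_per_hour_from_yearly(rate, 8):.2f} errors/hour")
```

Both routes land at roughly 1.6 to 5.5 errors per hour for 8GB, so the two quoted figures agree with each other. They are fleet-wide means, though, and say nothing about how the errors are distributed across individual machines.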

lucb1e · on Oct 7, 2020

> 1 - 5 bit errors per hour for 8GB of RAM after my calculations

That is way off from what I'm seeing. When launching Factorio I use 90% of my 8GB RAM and never once have I noticed data corruption, and I could tell you how many hours I've played but that would be embarrassing.

The test I did in school with heated-up RAM (the internet said that's when flips should occur more often) also wrote many many gigabytes without a single failure.

Not sure what hardware or temperatures that source is running but it's not DDR3/DDR4 at heats below hairdryer melting temperature because that's where I had to stop the experiment with zero failures.
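For anyone wanting to repeat that kind of experiment, a pattern soak test might look roughly like the sketch below. This is not the code used in the school project (which isn't shown here); the function names and the 0xAA pattern are assumptions. It also has real limits: it runs in user space, so the OS decides which physical pages are used, and reads may be served from CPU caches rather than DRAM.

```python
import time


def fill(size: int, pattern: int = 0xAA) -> bytearray:
    """Allocate `size` bytes, every byte set to `pattern`."""
    return bytearray([pattern]) * size


def count_flipped_bits(buf: bytearray, pattern: int = 0xAA) -> int:
    """Count bits in `buf` that no longer match `pattern`."""
    if buf == bytes([pattern]) * len(buf):  # fast path: nothing changed
        return 0
    return sum(bin(b ^ pattern).count("1") for b in buf)


def soak(size: int, seconds: float, pattern: int = 0xAA) -> int:
    """Write the pattern, wait, then report how many bits changed."""
    buf = fill(size, pattern)
    time.sleep(seconds)
    return count_flipped_bits(buf, pattern)


if __name__ == "__main__":
    # Small, short run just to show the call; a real test would use far
    # more memory and far more time.
    print("flipped bits:", soak(size=1 << 20, seconds=0.1))
```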

anfilt · on Oct 8, 2020

I would have to find the paper again, but CPU caches can mask the error rate. The cached values can also overwrite any corruption with correct values. This has the interesting side effect of protecting commonly accessed data structures and function pointers from causing outright crashes. The same applies to commonly used values in a computation.

Unless you get a bit flip in a data structure pointer or function pointer, it just adds an error to the computation but does not crash outright.

Also, we are talking about only a handful of errors out of billions of calculations.

Edit: Also, at the other end of the spectrum, swap space may keep very rarely accessed data from corruption.

Yeroc · on Oct 6, 2020

And there are ways to manipulate RAM access patterns to induce errors as described in the initial Rowhammer attack paper plus later RAMBleed papers. Hopefully this newer DDR version is designed to be resistant to this type of attack.

DCKing · on Oct 6, 2020

I've seen bit flips reported in edac utils on a system with ECC memory. People routinely try to induce memory errors by overclocking their memory (to verify ECC is working). The triggering of bit flips is the very foundation of the Rowhammer attack (yes, I know Rowhammer can circumvent ECC with advanced techniques). Error correcting codes are used in networking environments, CPU caches, hard drives, anywhere but main memory.

Not sure why memory bit flips have the reputation of being such an edge case. It could be that it was an edge case 20 years ago, but it clearly isn't anymore. Computer memory has changed too.

lucb1e · on Oct 7, 2020

If ECC is supposed to be a security measure then I can see the point, but aside from intentional flipping (by an attacker), a blanket statement like "it clearly isn't [an edge case] anymore" doesn't strike me as true. For it being a security measure, though, shouldn't it compute a much stronger checksum than one or two bits like ECC usually does?

ficklepickle · on Oct 7, 2020

I've experienced errors that likely propagated through memory errors.

I have a ZFS/NFS server on a 2012 i7 with 4gb RAM. I use it primarily to store various torrents (up to ~250GB each).

I have had my torrent client find single-chunk errors in a couple of torrents I was seeding (twice over a few TB worth of torrents). I recall reading ZFS filesystem over NFS is particularly prone to this. I did some worried searching and remember finding it was likely caused by memory errors being persisted to disk, but I don't have any links handy anymore.

I likely would not have noticed the corruption if the torrent client hadn't alerted me.

jtl999 · on Oct 7, 2020

I was able to induce single- and multi-bit errors in DDR4 RAM by increasing the clock frequency, reducing timings, and undervolting.

shaklee3 · on Oct 7, 2020

It's happened to me on a GPU. I had ecc disabled for a speed increase, and got a reproducible bit error.

paulie_a · on Oct 7, 2020

There was a great defcon talk about this. Basically, using unicode and registering domain names with a bit flip produced real hits, like email meant for Microsoft and some other major companies.

I think the title was dns squatting but can't find it at the moment
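To make "domain names with a bit flip" concrete, here is a rough sketch that lists every name one flipped bit away from a given label while still being a valid hostname label. The function name and the `example` label are placeholders, and this is not taken from the talk itself.

```python
from __future__ import annotations

import string

# Characters allowed in a hostname label (case-insensitive).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(label: str) -> list[str]:
    """Labels that differ from `label` by exactly one flipped bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            # Skip invalid characters and pure case changes (DNS ignores case).
            if flipped not in ALLOWED or flipped == ch:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate)
    return sorted(variants)


if __name__ == "__main__":
    for name in bitsquat_variants("example"):
        print(name + ".com")
```

Each character yields only a handful of valid neighbours, so the list for a typical domain is short enough to register in bulk, which is what makes the idea practical.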

norenh · on Oct 7, 2020

Bitsquatting? https://www.youtube.com/watch?v=9WcHsT97suU

paulie_a · on Oct 7, 2020

That's the one, thanks for finding it. It really is fascinating

lucb1e · on Oct 7, 2020

That was actually the inspiration for my experiment with this in school, and I also set up one or two domains to catch bit flips but never got any hits. It's a complete myth as far as I have been able to tell (research tells me otherwise by using huge setups, and there are a few commenters here that seem to have first-hand experience, some seeming more trustworthy than others, but clearly not a majority of people). I get that it's (obviously) more than a myth, but I'm not sure it deserves the goo-goo eyes that it seems to trigger with many engineers, either. It's a neat feature slightly above the gimmick status but gets way more attention.

pixl97 · on Oct 6, 2020

I've seen a ton of it in the field. In general the ram stability is so bad that large operating systems fail to boot from corruption, so most of the time it doesn't get to the data corruption phase.

fomine3 · on Oct 7, 2020

I read this: https://news.ycombinator.com/item?id=24656541

im3w1l · on Oct 6, 2020

Did you try overclocking your system?

lucb1e · on Oct 7, 2020

I think the motherboard didn't allow that, but it has been a few years and I don't recall the exact details. Thanks for the pointer nevertheless.

paulmd · on Oct 7, 2020

hit the RAM with a hair dryer and you'll get some pretty fast

kasabali · on Oct 6, 2020

From what I've read, that's not the enterprise ECC; it's more akin to the ECC bits used in flash memory. It'll allow manufacturers to play fast and loose with memory.

DCKing · on Oct 6, 2020

I guess what I don't understand then is what big advantage "enterprise ECC" has left over this DDR5 "non-enterprise ECC". (Seriously, why is ECC "enterprise"? Everybody wins with memory error correction.) If regular DDR5 can correct single bitflips, it is on par in correction capabilities with "enterprise ECC" DDR4.

Maybe this won't allow for the detection of multiple flips, and maybe won't even report single bit flips to the OS (it'll just fix them silently). I suppose there's no big need to support detection and reporting for the vast majority of use cases. Ryan Smith at Anandtech in the link above says as much: "Between the number of bits per chip getting quite high, and newer nodes getting successively harder to develop, the odds of a single-bit error is getting uncomfortably high. So on-die ECC is meant to counter that, by transparently dealing with single-bit errors."

But for my purposes, if just the correction capabilities are on par with DDR4 ECC I'd be absolutely fine with that. And I guess that goes for many people. Even while using ECC memory now at home, I'm not monitoring the correction statistics and I'm guessing few people do in general. It might as well be silent today if you ask me.

Freaky · on Oct 6, 2020

> I guess what I don't understand then is what big advantage "enterprise ECC" has left over this DDR5 "non-enterprise ECC"

ECC should be end-to-end, so it detects and (hopefully) corrects errors anywhere along the path, not just within a chip.

Step 1 of handling a lot of ECC correction events is to reseat the DIMM, because often it's just an issue with the connection, not actually a memory defect.

And you may not care too much about reports of correction events, but you definitely want to see correction failures reported - the point is, after all, to avoid corruption.

wmf · on Oct 6, 2020

Most servers have chipkill ECC that can survive an entire 4-bit chip going bad so that's more powerful than classic SECDED. I don't know how often chipkill kicks in though.

ip26 · on Oct 6, 2020

I don't know what "enterprise ECC" means, but there are certainly grades of protection, from single bit error detection (parity) thru triple error correct quadruple error detect, and inline vs non-inline correction schemes. (for the latter, the machine has to stop, go back, fix the error, & resume, potentially at significant performance cost)
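To illustrate the SECDED grade mentioned in this subthread (single error correct, double error detect), here is a toy sketch using an extended Hamming (8,4) code on a 4-bit value. Real ECC DIMMs apply the same idea to 64 data bits with 8 check bits; the bit layout and function names below are choices made for this example, not any particular memory controller's scheme.

```python
from __future__ import annotations

from typing import Optional, Tuple


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity,
    indexes 1..7 are the classic Hamming(7,4) positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4,5,6,7
    code[0] = sum(code[1:]) % 2            # overall parity over 1..7
    return code


def decode(code: list[int]) -> Tuple[str, Optional[int]]:
    """Return (status, value); value is None when uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos        # XOR of set positions; 0 for a valid word
    parity_error = sum(code) % 2   # 1 means an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and parity_error == 0:
        status = "ok"
    elif parity_error == 1:
        # Single flip: the syndrome is its position (0 = the parity bit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even parity: two bits flipped.
        return "uncorrectable", None
    value = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, value


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                   # one flipped bit
    print(decode(word))            # ('corrected', 11)
    word[2] ^= 1                   # a second flipped bit
    print(decode(word))            # ('uncorrectable', None)
```

The second case is the one that matters for reporting: a double flip cannot be repaired, only flagged, so something has to surface that flag for the detection to be useful.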

fomine3 · on Oct 7, 2020

On SSDs, we never receive a notification for each corrected error, maybe because there would be too many of them every day. I expect DRAM may become that kind of device too.

mc32 · on Oct 6, 2020

So in other words, error prone memory modules don’t have to be tossed and can be sold because they will be self correcting...?

kasabali · on Oct 6, 2020

That's what I understand. Frequencies will be higher and the manufacturing process denser later in the DDR5 lifecycle, so they must be expecting more errors.

nine_k · on Oct 6, 2020

A number of Celerons 3XXX and 4XXX support ECC. They are intended for storage systems, like a NAS, or industrial electronics.

DCKing · on Oct 6, 2020

If you could point me to a <20W machine with ECC memory that I can buy online like a mere mortal human person, then I'd be very interested!

The only thing that comes to mind in that category for me is the PC Engines APU2. The APU2 is a very neat piece of kit, don't get me wrong, but it being the only option is not great either.

osamagirl69 · on Oct 7, 2020

Intel released the Xeon D series to fill this niche https://ark.intel.com/content/www/us/en/ark/products/series/...

The D-1529 can be TDP limited to 20W (at a whopping 1.3GHz), but if your goal is idle power consumption under 20W you are better off with one of the 35W TDP D-1602, since it is only a dual core and has the lowest standby consumption. The D-1518 is also a popular choice, and basically the same as the D-1529 but with a 35W TDP limit so it can turbo to 2.2GHz.

You can get a barebones motherboard (even mini-ITX is available for making small machines) for about $500 or a complete computer for a bit under $1000, e.g. https://www.newegg.com/supermicro-sys-e300-8d-intel-xeon-pro...

Note: the Xeon D series is not socketed, so the motherboard price includes the CPU. Make sure to get a board that takes full-size DIMMs, since there is not a lot of ECC laptop memory on the market.

DCKing · on Oct 7, 2020

I'm aware of the Xeon D line, but sadly that was also squarely aimed at the enterprise, albeit maybe a smaller-scale enterprise. It's not something you can easily go out and buy any more, and when you do it's very expensive, still PC sized, and not any more power efficient at idle or configured TDP when compared to a new Intel workstation or new AMD setup.

nine_k · on Oct 7, 2020

EBay offers a number of Xeon D boards at reasonable prices, many with extensive passive cooling. They have a lot of life left in them, to my mind, with good enough power efficiency for a box with normally very low CPU load. No moving parts, nothing to break. Yes, they are not speed demons, but low TDP presumes that anyway. ECC RAM support is there, though.

ryzenecc10 · on Oct 7, 2020

I have one of these fanless Ryzen V1605B iBOX-V1000M bricks with 32GB ECC running Ubuntu 20.04:

https://mitxpc.com/products/ibox-v1000

I bought it from that page. I wanted fanless silence and I wanted ECC and it took me a long time to find this machine but found it I did and it works. Linux reports seeing the ECC is indeed enabled (edac-utils?). Measured with a Kill-a-Watt meter it does consume more than 20W (maybe 30ish IIRC?). I actually disabled some Linux power management though to get bluetooth mice and keyboards to not lag after being idle for 5 seconds so my system might draw more power than others. Drives a 4K display buttery smooth, can watch all the YouTube and Netflix you want on Linux.
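For checking whether Linux is actually seeing ECC events, the kernel's EDAC subsystem exposes per-memory-controller counters in sysfs, which is also where tools like edac-utils get their numbers. A minimal sketch that reads them, assuming the usual `/sys/devices/system/edac/mc` layout with `ce_count` (corrected) and `ue_count` (uncorrected) files; the function name is made up for this example:

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    An empty dict means no EDAC memory controllers were found, e.g. no
    ECC, no EDAC driver loaded, or not running on Linux.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers reported")
    for mc, entry in found.items():
        print(mc, entry)
```

These counters reflect what the memory controller reports. If DDR5 on-die correction stays silent, as the thread suggests it might, those corrections would presumably not show up here.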