I actually tried really hard to get RAM to corrupt a bit for a school project an...

deegles · on Oct 6, 2020

Bunch of links to papers here: https://stackoverflow.com/questions/2580933/cosmic-rays-what...

e.g. "2009 Google's paper 'DRAM Errors in the Wild: A Large-Scale Field Study' says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM after my calculations. Paper says the same: 'mean correctable error rates of 2000–6000 per GB per year'."
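The arithmetic in that quote can be checked with a few lines. This is only a rough sketch: it assumes binary units (1 GB = 8192 Mbit) and a 365-day year, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the quoted figures.
# Assumptions: 1 GB = 1024 * 8 Mbit, 1 year = 8760 hours.
HOURS_PER_YEAR = 24 * 365
MBIT_PER_GB = 1024 * 8


def errors_per_hour_from_fit(fit_per_mbit, ram_gb):
    """FIT means failures per 1e9 hours; here it is given per Mbit."""
    return fit_per_mbit * MBIT_PER_GB * ram_gb / 1e9


def errors_per_hour_from_yearly(errors_per_gb_year, ram_gb):
    """Convert a per-GB-per-year error rate to errors per hour."""
    return errors_per_gb_year * ram_gb / HOURS_PER_YEAR


for fit in (25_000, 75_000):
    print(f"{fit} FIT/Mbit -> {errors_per_hour_from_fit(fit, 8):.2f} errors/hour")
for rate in (2_000, 6_000):
    print(f"{rate}/GB/year -> {errors_per_hour_from_yearly(rate, 8):.2f} errors/hour")
```

For 8 GB this comes out at roughly 1.6 to 4.9 errors per hour from the FIT figures and roughly 1.8 to 5.5 from the per-GB-per-year figures, so the two numbers in the quote agree with each other.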

lucb1e · on Oct 7, 2020

> 1 - 5 bit errors per hour for 8GB of RAM after my calculations

That is way off from what I'm seeing. When launching Factorio I use 90% of my 8GB RAM and never once have I noticed data corruption, and I could tell you how many hours I've played but that would be embarrassing.

The test I did in school with heated-up RAM (the internet said that's when flips should occur more often) also wrote many many gigabytes without a single failure.

Not sure what hardware or temperatures that source is running but it's not DDR3/DDR4 at heats below hairdryer melting temperature because that's where I had to stop the experiment with zero failures.
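A write-then-verify test of this kind might look roughly like the sketch below. It is not the code from the school project (that was not posted); `soak_test`, `count_flipped_bits` and the 0xAA pattern are all assumptions for illustration. A user-space program like this has no control over which physical memory it gets, and freshly written data may be read back from CPU cache rather than DRAM.

```python
import time


def count_flipped_bits(buf: bytearray, pattern: int) -> int:
    """Count bits in buf that differ from the expected repeated pattern byte."""
    flipped = 0
    for byte in buf:
        if byte != pattern:
            flipped += bin(byte ^ pattern).count("1")
    return flipped


def soak_test(size_bytes: int, hold_seconds: float, pattern: int = 0xAA) -> int:
    """Fill a buffer with a known pattern, wait, then count flipped bits."""
    buf = bytearray([pattern]) * size_bytes
    time.sleep(hold_seconds)  # give the memory time to sit (e.g. while heated)
    return count_flipped_bits(buf, pattern)


if __name__ == "__main__":
    # 64 MiB held for 10 seconds; both values are arbitrary.
    print("flipped bits:", soak_test(64 * 1024 * 1024, 10.0))
```

The pure-Python verify loop is slow for multi-gigabyte buffers; it is meant to show the idea, not to be a serious memory tester.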

anfilt · on Oct 8, 2020

I would have to find the paper again, but CPU caches can mask the error rate. The cached values can also overwrite any corruption with correct values. This has the interesting side effect of protecting commonly accessed data structures and function pointers from causing outright crashes. The same applies to commonly used values in a computation.

Unless you get a bit flip in a data structure pointer or function pointer, it just adds an error to the computation but does not cause an outright crash.

Also, we are talking about only a handful of errors out of billions of calculations.

Edit: Also, at the other end of the spectrum, swap space may keep very rarely accessed data from corruption.

Yeroc · on Oct 6, 2020

And there are ways to manipulate RAM access patterns to induce errors as described in the initial Rowhammer attack paper plus later RAMBleed papers. Hopefully this newer DDR version is designed to be resistant to this type of attack.

DCKing · on Oct 6, 2020

I've seen bit flips reported in edac utils on a system with ECC memory. People routinely try to induce memory errors by overclocking their memory (to verify ECC is working). The triggering of bit flips is the very foundation of the Rowhammer attack (yes, I know Rowhammer can circumvent ECC with advanced techniques). Error correcting codes are used in networking environments, CPU caches, hard drives, anywhere but main memory.

Not sure why memory bit flips have the reputation of being such an edge case. It could be that it was an edge case 20 years ago, but it clearly isn't anymore. Computer memory has changed too.

lucb1e · on Oct 7, 2020

If ECC is supposed to be a security measure then I can see the point, but aside from intentional flipping (by an attacker), a blanket statement like "it clearly isn't [an edge case] anymore" doesn't strike me as true. For it being a security measure, though, shouldn't it compute a much stronger checksum than one or two bits like ECC usually does?
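On the "one or two bits" point: ECC memory commonly uses a SECDED code, which corrects one flipped bit and detects two per protected word. The toy Hamming(7,4) sketch below is not the wider code real DIMMs use and leaves out the extra parity bit needed for double-error detection; it only shows how single-bit correction works and why two flips in one word defeat it. The function names are made up for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)

one_flip = stored.copy()
one_flip[4] ^= 1
print(hamming74_decode(one_flip) == word)   # single flip is corrected

two_flips = stored.copy()
two_flips[0] ^= 1
two_flips[5] ^= 1
print(hamming74_decode(two_flips) == word)  # double flip is miscorrected
```

A code like this is built for random, isolated flips. It is not a cryptographic checksum, which is why deliberate multi-bit flips are a different problem.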

ficklepickle · on Oct 7, 2020

I've experienced data corruption that was likely caused by memory errors.

I have a ZFS/NFS server on a 2012 i7 with 4 GB of RAM. I use it primarily to store various torrents (up to ~250GB each).

I have had my torrent client find single-chunk errors in a couple of torrents I was seeding (twice over a few TB worth of torrents). I recall reading that ZFS over NFS is particularly prone to this. I did some worried searching and remember finding it was likely caused by memory errors being persisted to disk, but I don't have any links handy anymore.

I likely would not have noticed the corruption if the torrent client hadn't alerted me.

jtl999 · on Oct 7, 2020

I was able to induce single- and multi-bit errors in DDR4 RAM by increasing the clock frequency, reducing timings, and undervolting.

shaklee3 · on Oct 7, 2020

It's happened to me on a GPU. I had ECC disabled for a speed increase, and got a reproducible bit error.

paulie_a · on Oct 7, 2020

There was a great DEF CON talk about this. Basically, using Unicode and registering domain names that differ from real ones by a bit flip got real results, like email meant for Microsoft and some other major companies.

I think the title was "DNS squatting" but I can't find it at the moment.

norenh · on Oct 7, 2020

Bitsquatting? https://www.youtube.com/watch?v=9WcHsT97suU
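To make the idea concrete: a bitsquat domain is one that differs from a real domain by a single flipped bit in one character. Here is a rough sketch of how candidates could be enumerated. `bitsquat_variants` is a made-up helper, it assumes a lowercase ASCII domain, and it does not check finer rules such as labels not starting or ending with a hyphen.

```python
import string

# Characters allowed inside a hostname label (lowercase form).
VALID_CHARS = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(domain: str) -> list[str]:
    """Return domains that differ from `domain` by exactly one flipped bit."""
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # keep the label separators where they are
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            # Uppercase results are skipped: DNS is case-insensitive,
            # so they would name the same domain.
            if flipped in VALID_CHARS:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(variants)


if __name__ == "__main__":
    for name in bitsquat_variants("example.com"):
        print(name)
```

This flips bits in the TLD as well, so some of the output will not be registrable names. Filtering those out is left to whoever wants to try it.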

paulie_a · on Oct 7, 2020

That's the one, thanks for finding it. It really is fascinating

lucb1e · on Oct 7, 2020

That was actually the inspiration for my experiment with this in school, and I also set up one or two domains to catch bit flips but never got any hits. It's a complete myth as far as I have been able to tell (research tells me otherwise by using huge setups, and there are a few commenters here that seem to have first-hand experience, some seeming more trustworthy than others, but clearly not a majority of people). I get that it's (obviously) more than a myth, but I'm not sure it deserves the goo-goo eyes that it seems to trigger with many engineers, either. It's a neat feature slightly above the gimmick status but gets way more attention.

pixl97 · on Oct 6, 2020

I've seen a ton of it in the field. In general the RAM stability is so bad that large operating systems fail to boot because of corruption, so most of the time it doesn't get to the data corruption phase.

fomine3 · on Oct 7, 2020

I read this: https://news.ycombinator.com/item?id=24656541

im3w1l · on Oct 6, 2020

Did you try overclocking your system?

lucb1e · on Oct 7, 2020

I think the motherboard didn't allow that, but it has been a few years and I don't recall the exact details. Thanks for the pointer nevertheless.

paulmd · on Oct 7, 2020

Hit the RAM with a hair dryer and you'll get some pretty fast.