Invalidating the caches is kind of a cringe-inducing approach to this (actual) problem. Especially in HPC, radiation-related single event upsets have become a real problem. If you do the math, all the silicon area devoted to memory (DRAM, caches, registers) adds up, and what you've got is essentially a particle detector.
Compared to the effective volume of a purpose-designed one (ATLAS, CMS, Super-Kamiokande, etc.) it is rather small, but a particle detector nevertheless.
A couple of months or years ago there was an article (also linked here on HN, IIRC) that did a few back-of-the-envelope calculations of expected event rates. IIRC it was something on the order of 1 event per day per 10^12 transistors. (EDIT: not the one I thought of, but it blows the same horn: http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf )
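As a rough sanity check of that figure, here's a sketch in Python; the 1-per-day-per-10^12 rate is just the number quoted above, not a datasheet value:

    # Back-of-the-envelope SEU estimate, using the ~1 upset/day per 1e12
    # transistors figure quoted above (a rough ballpark, not a datasheet value).
    RATE_PER_TRANSISTOR_PER_DAY = 1 / 1e12

    def upsets_per_day(num_transistors):
        """Expected single-event upsets per day for a given transistor count."""
        return num_transistors * RATE_PER_TRANSISTOR_PER_DAY

    # Example: ~1 TB of DRAM is roughly 8e12 bits, i.e. ~8e12 storage transistors
    # (ignoring peripheral logic), so on the order of 8 upsets per day.
    print(upsets_per_day(8e12))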
Radiation-hardened software has also been researched (and still is). Essentially the idea is to have not only redundant, error-correcting memory, but also redundant, error-correcting computation. NASA has some publications on that, e.g. https://ti.arc.nasa.gov/m/pub-archive/1075h/1075%20(Mehlitz)...
An article I read a while ago addressed an interesting correlation between transistor process size, the physical size of the DRAM module, and the expected failure rates.
"As transistor sizes have shrunk, they have required less and less electrical charge to represent a logical bit. So the likelihood that one bit will "flip" from 0 to 1 (or 1 to 0) when struck by an energetic particle has been increasing. This has been partially offset by the fact that as the transistors have gotten smaller they have become smaller targets so the rate at which they are struck has decreased.
More significantly, the current generation of 16-nanometer circuits have a 3D architecture that replaced the previous 2D architecture and has proven to be significantly less susceptible to SEUs. Although this improvement has been offset by the increase in the number of transistors in each chip, the failure rate at the chip level has also dropped slightly. However, the increase in the total number of transistors being used in new electronic systems has meant that the SEU failure rate at the device level has continued to rise."
It depends on what you mean by "most". Even shielding 95% may not be sufficient, if the remaining 5% is too strong. It is a tradeoff - the thicker the plating, the better protection, but beyond some limit the plating gets too heavy, which is very costly especially when it is to be put in orbit.
And there are vendors who put ECC on the data paths but not on the cache itself, on the assumption that you check data as it goes in, it's a cache so it's short-lived, and your window is small enough to meet your reliability goals.
That goes out the window when you sleep, though, because you don't know how long you've been waiting to restart. And P(bitflip) is a function of time. If you sleep too long you are non-spec-compliant for silent data corruption, and since there isn't a way to know how long you slept, the only "safe" option is to invalidate the cache and reload it.
Sad but an understandable approach. The downside is your wake from sleep is slower by the amount of time it takes to warm up the cache.
NASA calls out preparing for this specifically in their software engineering requirements. My understanding is that this was added specifically to address bit flips from radiation effects.
You may be interested to learn of arXiv:1510.07655, "Detecting particles with cell phones: the Distributed Electronic Cosmic-ray Observatory" by Vandenbroucke et al.
I don't know if any novel results have come out of this kind of thing.
I (thought I) signed up to be informed of beta releases, but never heard anything. I just checked their website[0] and it mentions a beta app, but that seems to just go to a signup page.
Most bit flip events are apparently due to alpha particles from radioactive decay in the package (source: I worked for a company which used 'low alpha compound' in some of our ICs to put off having to implement HW error correction), which would be a big confounding signal. I imagine physics experiments work hard to avoid this, or at least correct for it.
Probably not, due to the low fidelity of any data from this. "Real" particle detectors can trace the decay chain, momentum changes, precise energy levels, etc. A processor cache flip can only say "a particle with energy >= X was here", which isn't super useful given the relative frequency of occurrence.
Why is this in such an inconvenient form? If I have X GB of RAM, how many unavoidable memtest errors should I expect per hour of testing? It seems like that could be used to tell us the minimum amount of time to run the test.
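Taking the ~1 upset per day per 10^12 transistors figure from upthread as a (very rough) assumption, and treating one DRAM bit as one transistor, the conversion is a one-liner:

    # Hypothetical conversion: expected upsets per hour for a given amount of RAM,
    # assuming ~1 upset/day per 1e12 bits (real rates vary with altitude, process, etc.).
    def expected_errors_per_hour(ram_gb, rate_per_bit_per_day=1e-12):
        bits = ram_gb * 8e9          # 1 GB is ~8e9 bits
        return bits * rate_per_bit_per_day / 24

    print(expected_errors_per_hour(16))   # ~0.005/hour, i.e. roughly one every 8 days

At that rate a few hours of memtest should see essentially zero radiation-induced errors, so anything it does find is far more likely to be a bad module.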
I used to work in DRAM manufacturing. I believe 100% in bitflips -- we saw them in testing all the time. I also believe the rate at which bitflips are caused by gamma rays is absolutely dwarfed by the rate at which bitflips are caused by manufacturing defects. It's really easy to manufacture a weak cell when you're making billions of tiny capacitors at the same time. SRAM might also see lots of bitflips, but that seems even more likely than DRAM to be driven by manufacturing defects, given the greater complexity of the SRAM cell.
>the rate at which bitflips are caused by gamma rays is absolutely dwarfed by the rate at which bitflips are caused by manufacturing defects.
Back in the USSR/Russia, gamma rays and aliens were a non-issue compared to the extremely low reliability of USSR/Russian hardware. The military hardware back then (and some telco hardware built in Russia in the 199x era, some even for export to Western countries!) was built as triplicated systems, i.e. primitive quorum/consensus computing.
Similarly, as far as I heard back then, due to the low reliability of Itanium, especially in the beginning, the Itanium-based Tandems were also available as "Tridems".
I maintain a tool that reports errors on Linux systems with ECC memory, and also a website describing the tool (http://mcelog.org). I wrote about patterns in the access logs correlated to time some time ago in my blog:
> I wonder if it’s possible to detect solar flares in these logs. Need to find a good data source for them. Are there any other events generating lots of radiation that may affect servers?
Cosmic ray showers [1] potentially. My prediction is that there should be "clusters" of bit errors every few minutes or so (given a sufficient number of servers...).
I was talking to a sysadmin at a local company running a few thousand servers about getting their ECC error logs in order to look for these, but scraping them apparently wasn't trivially manageable for them.
> Bit flips in network surely is many orders of magnitude worse
Yes, but because it's so prevalent that people expect it, they've added checksums on multiple levels, so the network actually performs better. In this particular instance (DNS queries), it's very unlikely that the data was corrupted in transit: "We believe that UDP checksums are effective at preventing 'bitsquat' attacks and other types of errors that occur after a DNS query leaves a DNS resolver and enters the network."
That doesn't take into consideration flaky home routers (note that they do recalculate the checksum). So the checksum might be computed over already-corrupted data.
On a bad connection it will also be quite common for the checksum to happen to still be valid. TCP/UDP checksums are quite weak.
One thing to keep in mind is that the network protocols usually use quite weak and similar checksums. To the extent that there are layer combinations that reliably produce errors that are not detectable by TCP checksum.
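For a feel of how weak they are, here is a sketch of the core TCP/UDP checksum algorithm (the RFC 1071 ones'-complement sum, ignoring the pseudo-header): because it is just a sum, swapping two 16-bit words produces the identical checksum, so that whole class of corruption is invisible.

    # Internet (TCP/UDP) checksum: 16-bit ones'-complement sum of the data.
    def internet_checksum(data):
        if len(data) % 2:
            data += b"\x00"                    # pad odd-length data
        total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
        while total >> 16:                     # fold carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    pkt1 = b"\x12\x34\xab\xcd"
    pkt2 = b"\xab\xcd\x12\x34"                 # same 16-bit words, reordered
    print(hex(internet_checksum(pkt1)), hex(internet_checksum(pkt2)))  # identical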
Has anyone published results on the impact of large scale TLS deployment on reliability? I’ve seen some older papers on errors which TCP missed but it seems like making a hard fail would be a nice win as the crypto overhead keeps getting cheaper.
I saw them at multiple places in the stack. Usually the Host header showed my domain, which rules out the network. Rarely the Host header would be the original domain, though that doesn't narrow it down to the network.
The end of GP's article points to some work by a Verisign engineer that put the network probability very low. So my interpretation was that these bit flips were happening on the device, before the URL was handed to the network stack.
So Ethernet is protected by a CRC, which is quite good, but it is stripped at both ends and I would not trust the routing chips at all. Some core routers used low-reliability FPGAs, for example.
If you require data integrity you cannot rely on Ethernet's FCS. Personally I trust the core router far more than I trust the cheaper stuff around it, particularly the stuff my ISP gave me for 'free'.
The YouTube video[1] of his DEF CON talk is super interesting, although some of his concerns about serving malicious scripts from bitsquatted domains are thankfully mitigated by the widespread adoption of TLS.
> Bitsquat traffic represents a slice of normal traffic
Wow. That's pretty amazing. With enough analysis, you could (possibly) recover the traffic pattern for major websites. That alone seems to have a lot of interesting implications.
1. Not all single-char typos are bitflips, and when you test it statistically you get more traffic for bitflips than for typos. It's a bit more complicated than that: some typos are really common because the keys are adjacent.
Source: I have some domains that are bitsquats of some high traffic domains. I get access token headers when I do a raw dump of the traffic. (I turned off the servers after satisfying my curiosity.)
Want to stop randos from getting some tiny portion of your traffic's access tokens? Use client side keys and send signed / encrypted access tokens or even full requests.
Yeah I know. It's so far down the list it doesn't matter. But every combination of github.com, microsoft.com, etc is bitsquatted, so this attack can be considered ongoing. This is doubly true if you're on a domain like .co.uk which is a stupid subdomain of .uk and now that .uk addresses are available someone can buy ko.uk and bitsquat the entire fucking country.
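For anyone curious what that looks like in practice, here's a sketch (not the tooling from the talk) that enumerates the single-bit-flip neighbours of a domain, keeping only results that are still valid hostname characters:

    import string

    VALID = set(string.ascii_lowercase + string.digits + "-")

    def bitsquats(domain):
        """All domains that differ from `domain` by exactly one flipped bit."""
        out = set()
        for i, ch in enumerate(domain):
            if ch == ".":
                continue                        # leave the label separator alone
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit)).lower()
                if flipped != ch and flipped in VALID:
                    out.add(domain[:i] + flipped + domain[i + 1:])
        return out

    print(sorted(bitsquats("github.com")))  # includes e.g. 'githuc.com', 'fithub.com', 'withub.com'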
An interesting control would have been to flip a single bit but also flip a single char so that it flips several bits, and compare the results.
edit: Reading the linked article, they describe one of their validation methods:
> The Host header contains the domain the HTTP client resolved to connect to the bitsquat server. If the Host header matches the original domain, the corruption occurred on the red path (DNS path). If the Host header matches a bitsquat domain, the corruption occurred on the blue path (content path).
While I like the sibling comment's network-errors answer better than mine, I didn't see any reasoning other than an unfounded statement[1] in the article, or on a quick skim for "typo" in the referenced whitepaper:
[1] "These requests were not typos or other manually entered URLs"
"All of these requests used only four domains in
the HTTP Host header, as shown in Table 4.
Three out of four domains contain more than one bit error, ruling out a simple
mistype of fbbdn.net for fbcdn.net. "
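The bit arithmetic behind that argument is easy to check:

    # Count how many bits differ between two equal-length domain strings.
    def bit_distance(a, b):
        return sum(bin(ord(x) ^ ord(y)).count("1") for x, y in zip(a, b))

    print(bit_distance("fbcdn.net", "fbbdn.net"))   # 1 -> a single flipped bit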
None of them were typed in. I was bitsquatting a CDN that no one would type into the browser bar. Everything was some resource referenced from a page. If you checked the original page, the correct URL was present in that page.
Embedded people deal with this all the time. One class of solutions involves a checker task running continuously, which verifies the integrity of the data structures, kind of like a poor man's ECC. Really important code generally does everything three times, so there's a tie breaker in case there's a temporary fault in code or memory. I've seen this done with macros in ways that result in pretty wild code, like running a computation three times, storing each of those results three times, and then comparing the resulting nine outputs three times. That was in a diving related application, so it's not crazy to do all that work over and over since it had to be right.
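A minimal sketch of the voting half of that idea (the storage triplication and the macro machinery omitted; calculate_depth is just a made-up example name):

    # Run the computation three times and majority-vote the results, so a
    # single transient fault in one run gets outvoted.
    def vote(a, b, c):
        if a == b or a == c:
            return a
        if b == c:
            return b
        raise RuntimeError("no two results agree")   # unrecoverable triple fault

    def tmr(compute, *args):
        return vote(compute(*args), compute(*args), compute(*args))

    # Usage (hypothetical): depth = tmr(calculate_depth, pressure_sample)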
Complex embedded systems like your cellphone's baseband processor usually just give up at some point and suicide a task or even the whole OS if they detect a problem. For a while I had a Qualcomm debugger attached to the internal cell modem I had in a netbook I was working on, and the baseband crashed all the time due to hardware faults. I thought I had a bad chip for a while until I realized it never happened when I left it in an underground parking lot.
This. Cache is the least of their worries in aerospace. It's common to see satellite ICs dosed up so high on the ol' Gamma that the silicon MOSFET junctions themselves start disintegrating.
A bunch of years ago Cisco had an issue with some RAM in a new switch model, I think it was in the 65xx. They were crashing randomly, but only in certain places in the world. Cisco spent tons of money on this and had no idea. They brought in a physics professor. The devices with the most issues were located in countries up near the Arctic Circle. Cosmic rays caused bit flips in this particular set of RAM due to something in its design. Sorry for the light details, it's been years.
I also worked at a switch manufacturer. We had some ASICs from one of the big companies. We had crashes that we could not explain at all. We knew it was not us. We proved that bits were flipping in the switch ASIC. Turns out they had forgotten to spec low-alpha solder. Alpha particles will not go through your skin, but when the solder is layered right onto the chip... oops.
I used to work for two switch chip companies also. I know both of them have SW/logic in the switch chips to guard against random errors in SRAM from alpha particles. I have seen detailed test reports where they ran the system in a nuclear lab and graphed the radiation level vs. the impact on the system error/recoverable error rate.
I know big customers (such as Cisco) can ask for such test reports, and they do get them.
It would be a very interesting experiment to position some modern/old (7, 14, 28 nm chip) electronics (cell phones, Raspberry Pi + solar panels) at various distances near a nuclear accident site, monitor them remotely, and classify and log the failure rate over the months/years.
We would take the switches to the labs and have them "shoot" various types of particles and other radiation sources at them to check resistance in both HW and SW. Depending on what you did, it sometimes takes months to get the device back as it has to "cool down".
Way back in the late 90s IBM had a problem with alpha-source-contaminated plastics in their SRAM chips. Those chips were used as caches in Sun SPARC processor modules. IBM told some customers, but not Sun. This caused random bitflips in the processor cache, leading to assorted failures and crashes in what was supposed to be reliable UNIX servers.
So ... here's what I'm thinking, as a complete layman with respect to how radiation affects memory devices. RAM is DRAM, i.e. dynamic RAM. It has to get automatically refreshed relatively frequently.
So, maybe (again, me being a layman) what happens is that usually a gamma ray hits a DRAM cell but doesn't impart enough energy to cause a flip. A millisecond later the cell gets refreshed, erasing what little influence the gamma ray had. No harm done. A flip would only occur if enough particles hit the cell within the refresh time frame. That's of course possible, but more rare.
Contrast this with processor cache. On-die cache is most likely SRAM, Static RAM. It doesn't get refreshed. So the slight voltage errors caused by gamma rays can slowly build up over time.
Perhaps this normally isn't an issue, because even though the cache is SRAM and doesn't get refreshed automatically, it'll get "refreshed" by virtue of being cache. i.e. as long as the processor is busy the cache is constantly getting re-written with new cache lines.
But that won't hold true when the processor is asleep. The cache will be sitting idle, making itself susceptible to accumulated charges. Thus the likelihood of a gamma flip is greatly increased.
All of that crude logic aside there's one caveat:
> The workaround was removed once the problem was fixed in microcode or in a later processor stepping.
So ... either everything I said is a load of bollocks and actually this was a processor bug that some CPU engineer mistook as gamma flips, or maybe my theory is correct and they changed the CPU to occasionally wake up and "refresh" its cache automatically.
> Contrast this with processor cache. On-die cache is most likely SRAM, Static RAM. It doesn't get refreshed. So the slight voltage errors caused by gamma rays can slowly build up over time.
Static RAM is basically a flip-flop. It's a bistable circuit that's actively held in a stable state. Single-event upsets work by, essentially, putting the energy into the circuit required to make it transition into the opposite state, i.e. basically the same way the SRAM cell is written to.
If it's 3.3V logic, and the voltage dips for some reason to 3.0V, it will immediately rise back up to 3.3V. A DRAM cell would stay at 3.0V until the next refresh cycle. That's what I mean by a layman's "continuously refreshed". It has constant input power keeping the voltage at the same level.
Yes, it is. That's what makes it fundamentally different from DRAM, which isn't being continuously refreshed, which is why the memory controller has to manually refresh DRAM at frequent intervals. DRAM has much simpler cells, at the cost of more complex control logic.
I recommend you just take a look at Wikipedia or something for an explanation of SRAM. Each bit is typically implemented using six transistors, four of which form a loop of two inverters.
They are continuously powered, which causes the continuous refresh the parent was talking about.
There is no continuous "refresh". The circuit is bistable which means that the system has two states in which, once reached, it will remain until some energy is expended to change that.
Imagine it like two valleys with a hill between. Rolling a ball from the OFF valley to the ON valley requires some energy. If it's not enough the ball rolls back into the valley it's currently in.
The process is entirely analog, i.e. there is no refresh circuit that looks at the voltage and says "that's almost an ON, better freshen up the voltage". The output of the circuit is digital. (You can play with the R/S latches of most SRAM on an oscilloscope and it's quite fun; the output of non-integrated SRAM will react in an analog fashion. If it's integrated, i.e. has a controller, this sadly isn't possible.)
Until you cross the threshold, the circuit will simply slide back to the original position, once you cross it, it'll slide into the new position without any additional effort.
That seems awfully nitpicky in this context. Look at russdill's comment, then look at the throwaway, which is reasonable to interpret as stating that DRAM is being continuously refreshed - which it isn't, it happens at discrete intervals.
Anyway, what you're writing isn't wrong but I'd say misses the context of the conversation a bit :)
I'm not mansplaining, I don't even know who you are. You don't know who I am. I'm attempting to make the comment digestible for a broader audience and not only for you. You're not alone on this website. I'm sorry for doing that then.
The explanation you were replying to was more digestible than yours.
Your definition of "refresh" is unhelpfully specific and not particularly correct. The circuit that "looks at a voltage and freshens it up", also known as an amplifier, is just a transistor or pair of transistors.
> The explanation you were replying to was more digestible than yours.
The analogy presented is fairly accurate and easy to understand. GGGGGP talks about loops of inverters and continuous refreshing, the latter of which is invented terminology.
> The circuit is bistable which means that the system has two states in which, once reached, it will remain until some energy is expended to change that.
> Imagine it like two valleys with a hill between. Rolling a ball from the OFF valley to the ON valley requires some energy. If it's not enough the ball rolls back into the valley it's currently in.
The analogy is an okay description of part of what happens, but not why. It also gives a pretty misleading idea of what happens to the voltages. It would be much better if it was combined with a simple explanation of how the inverters actually behave, which would only take a few words. This is why the previous post mentioned them and said to check the wikipedia page.
An amplifier doesn't look at a voltage and "freshen it up"; in an SRAM cell the amplifier is running in clipping and, unlike a DRAM cell, requires no clocking, only supply voltage. A pair of transistors does not behave like that, since they operate on current. You need two inverting amplifiers, which is more complex and isn't simply "looking at a voltage to freshen it up".
If my attempt to make everything more digestible failed in your opinion, then I guess that is it, it's your opinion on the matter.
I described no DRAM refreshing circuit, I described an SRAM cell. I don't believe that definition of refresh is suitable for the discussion, nor is it fitting when talking about SRAM. I also never said it wasn't continuous, i.e. linear; I did disagree with the wording that made it sound like there is a refresh process that happens repeatedly.
Oh. I thought "The process is entirely analog, ie, there is no refresh circuit that looks at the voltage and says "that's almost a ON, better fresh up the voltage"." was talking about DRAM as a comparison. Sorry about that.
> I did disagree with the wording that made it sound like there is a refresh process that happens repeatedly.
No, SRAM doesn't work that way. DRAM needs refreshing because it's made out of capacitors; left alone, their charge slowly leaks away and nothing restores it. Whereas SRAM is made of flip-flops; if the charge in an SRAM cell drifts towards an intermediate state between 0 and 1, it will naturally move back to 0 or 1 (whichever is closer) with no refreshing needed, by virtue of its structure.
I found this code as an intern at Microsoft, while the manufacturer is hidden in the post, I'll give you a clue - the company starts with "I" and ends with "ntel"
You do know that the 64-bit version of Windows XP was made for AMD processors before Intel had any.
We also had Transmeta. I bet all of them have some kinks just as different models from the same vendor have different bugs etc. that require special handling.
Meltdown/Spectre is just a recent and very visible artifact that would demonstrate this.
If you have a large enough fleet, and log your ECC errors, you have actually built a not-very-sensitive and very expensive scientific instrument- a cosmic ray detector. Physics is awesome.
To answer the question in the OP: yes, the processor cache might be more susceptible than RAM, if the RAM is ECC.
I've heard many stories about bit-flips causing serious problems at higher-elevation sites. Apparently a major installation at either NCAR or UCAR was delayed by a month fighting such problems. While I haven't actually confirmed any of these stories first hand, I've heard enough to believe that a little paranoia is justified.
It would definitely be interesting to know more of the context around this code in Windows - when it existed, and which processor models would enable it. I'm pretty sure some of the x86-compatible processors made by several vendors over a couple of decades weren't ECC all the way through.
That was my thought process as well: an early silicon spin / chip might not have ECC on the cache because it changes state frequently and the circuit budget wasn't worth spending on a soft error, while the external RAM might be ECC because longer-lived state wouldn't get that kind of implicit refresh/detection of errors.
I agree. These days I'm unaware of any "serious" CPU where the data isn't protected by at least a parity bit on chip.
"Many processors use error correction codes in the on-chip cache, including the Intel Itanium and Xeon[28] processors, the AMD Athlon, Opteron, all Zen-[29] and Zen+-based[30] processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264.[23][31]"
https://en.wikipedia.org/wiki/ECC_memory#Cache
I prefer to always use ECC, plus layers of defenses where persistence of data is in question. Jeff Atwood's post does point to other options, assuming that validation is baked into the storage process and distributed in such a way as to identify and correct errors across a distributed infrastructure rather than a single system; the distributed nature means it could be more resilient, and validating data at rest / comparing results is arguably a higher level of integrity than ECC alone can provide. https://blog.codinghorror.com/to-ecc-or-not-to-ecc/
Bit flips are real.
I used to see them on my (admittedly low end) webserver.
E.g. there were occasional errors like "myOfunction not found".
A quick glance at an ASCII table shows that the original function name "my_function" is indeed one bit flip away ('O' = 0x4F vs '_' = 0x5F).
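The arithmetic checks out:

    print(f"{ord('_'):08b}")           # 01011111  ('_' = 0x5F)
    print(f"{ord('O'):08b}")           # 01001111  ('O' = 0x4F)
    print(bin(ord('_') ^ ord('O')))    # 0b10000 -> exactly one bit differs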
If you were seeing errors like that with any regularity there's almost certainly a hardware issue on your server. Cosmic ray bitflips are really rare, and tend to be basically invisible until you're monitoring an entire datacenter.
When I was fresh out of college, I worked as a contractor for a prominent agricultural equipment manufacturer. I was responsible for building out the touch-screen interface for the radio (a Qt app). I was told by an engineer who worked for the equipment manufacturer that my application wasn't good enough because it needed to be able to operate correctly in the face of arbitrary bit flips "from lightning strikes". I kindly asked her to show me the requirements, which was sufficient to get her to relent, but that was still the wackiest request I've ever received.
Asking to see the requirement is a great way to push back on feature creep. There's a lot of cargo-culting that goes on around "protection against bit flips". You sometimes have to go a step further and ask what error rate you are required to be below. Once you have that number, you can start asking what your current error rate is without mitigations, and how much a given mitigation will reduce your error rate.
My favorite entry in that problem space is metastability[1].
Do you interface two different clock domains (which is basically most things)? Guess what, all of your computing is built on the "chance" that bits won't flip.
Granted, statistics make this pretty solid but kinda blew my mind when I first stumbled across it.
Yup, a large portion of hardware design is based on getting below a required maximum failure rate. For metastability, you just keep adding more flip-flops. BTW, the cache invalidation request may be due to this: they figured they could more easily hit their time-between-failures target if they could discount time spent in S1.
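For anyone who hasn't seen it, the usual back-of-the-envelope metastability model (constants below are made up for illustration, not from any datasheet) shows why "just add another flip-flop" works: MTBF = exp(t_resolve / tau) / (T0 * f_clk * f_data), and each extra synchronizer stage adds roughly one clock period of resolution time, multiplying the MTBF by exp(T_clk / tau).

    import math

    def mtbf_seconds(t_resolve, tau, t0, f_clk, f_data):
        """Mean time between synchronizer failures for the standard model."""
        return math.exp(t_resolve / tau) / (t0 * f_clk * f_data)

    tau, t0 = 1e-9, 1e-9            # illustrative device constants
    f_clk, f_data = 100e6, 10e6     # 100 MHz clock, 10 MHz async data rate
    for stages in (1, 2, 3):
        t_resolve = stages / f_clk  # ~one clock period of settling per stage
        print(stages, mtbf_seconds(t_resolve, tau, t0, f_clk, f_data))
    # 1 stage: ~0.02 s, 2 stages: ~8 minutes, 3 stages: ~4 months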
I once saw a postmortem where a server process mysteriously tried to delete an entire table (fortunately no actual data was lost). After much confusion, the conclusion was that a cosmic ray had flipped a single bit in a register, making it point 8 bytes past the correct address in a C++ virtual function table. As a result, instead of calling UpdateRow(), the process executed DeleteTable().
Of course cosmic rays don't exactly leave a trace, so we will never know.
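A toy illustration of that failure mode (names are made up to mirror the story, obviously not the actual code): a vtable is just an array of function pointers spaced 8 bytes apart on a 64-bit system, so one flipped bit in an offset (8 == 1 << 3) selects the neighbouring slot.

    def update_row():   return "UpdateRow called"
    def delete_table(): return "DeleteTable called"

    # Hypothetical vtable slot offsets, 8 bytes apart as on a 64-bit system.
    vtable = {0x10: update_row, 0x18: delete_table}

    offset = 0x10
    corrupted = offset ^ 0x08          # a single flipped bit in the offset
    print(vtable[offset]())            # UpdateRow called
    print(vtable[corrupted]())         # DeleteTable called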
I've heard stories from the supercomputing folks about trying to put their machine rooms underneath parking structures, to get the added protection of layers and layers of concrete overhead.
Apparently 95% of "bit flipping cosmic rays" are neutrons:
> At the Earth's surface approximately 95% of the particles capable of causing soft errors are energetic neutrons with the remainder composed of protons and pions.
Sounds about right. I've been part of a radiation test giving an SoC gamma exposure for total ionizing dose measurement. I didn't see a single upset the whole test.
I don't think the programmer knew one way or the other; they were just doing what they were told by the manufacturer, and "gamma ray" is probably easier to explain to anyone after the 1950s/60s than "muon".
I mean after all, did any superhero get their powers from muons?
May is most noted for having identified the cause of the "alpha particle problem", which was affecting the reliability of integrated circuits as device features reached a critical size where a single alpha particle could change the state of a stored value