This reminds me of a post a few months back about Voyager 2, where NASA traced the issue back to a single bit flip (and fixed it!)
Engineers successfully reset a computer onboard Voyager 2 that caused an unexpected data pattern shift, and the spacecraft resumed sending properly formatted science data back to Earth on Sunday, May 23. Mission managers at NASA's Jet Propulsion Laboratory in Pasadena, Calif., had been operating the spacecraft in engineering mode since May 6. They took this action as they traced the source of the pattern shift to the flip of a single bit in the flight data system computer that packages data to transmit back to Earth. In the next week, engineers will be checking the science data with Voyager team scientists to make sure instruments onboard the spacecraft are processing data correctly.
http://news.ycombinator.com/item?id=1459328
I'm craving more info. This is such an interesting dilemma to be in. It boggles and then blows my mind to think about how hard it is to design something well enough that it keeps working for decades, a zillion miles away, without anyone ever being able to physically touch it again.
Are the A and B computers identical? From the last sentence, it would appear that way (B will become primary and if A can be repaired it will be the new backup, implying they are interchangeable). Why does it take so long to switch to the backup? Was the backup serving another purpose and now it needs to be retrofitted to take the place of A? How is this process done?
The problem they have is that high-energy particles can hit random bits in memory and change them. The electronics carry a large amount of radiation shielding, but given enough time some particles get through anyway. That's why there's a second computer that can "take command" in the case of any failure.
The second computer is identical, so the same programs can be used (obviously). The thing about Curiosity is that it left Earth with a very minimal program, and more bits were sent to it during the journey. When it landed, it used a different program than it does now. Remote programming allows it to carry out a large number of tasks with a minimal amount of hardware.
The last thing is that everything is double- and triple-checked. If something goes wrong out there, it goes very wrong, so they ensure that everything is working fine. While the B computer is being used, they'll probably do a full bit-by-bit wipe of the A computer, then load the software back onto it.
I'd imagine that most of what you'd automate on Earth is done manually and double-checked, just in case. Add to that a round-trip signal delay that can run tens of minutes, and it makes sense that something you'd run a bash script to do here winds up taking a week on Mars.
They must use the equivalent of ECC for flash memory, I would hope. So the theory is that cosmic rays corrupted multiple bits of radiation-hardened, ECC flash memory? Should they have expected that? Is reverting to a backup computer the best option? How many bits were flipped? Couldn't they have used a flash memory controller with the ability to correct more than 1 bit per word?
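For anyone curious what single-bit correction actually looks like, here is a toy Hamming(7,4) encoder/decoder in C. Real flash and DRAM controllers use much wider SECDED codes (e.g., 8 check bits protecting 64 data bits), but the mechanism is the same: recomputing parity yields a syndrome that points directly at the flipped bit. Purely an illustrative sketch, not anything NASA actually flies.

    #include <stdint.h>
    #include <stdio.h>

    /* Encode 4 data bits into a 7-bit Hamming(7,4) codeword.
       Bit layout (1-indexed positions): p1 p2 d1 p3 d2 d3 d4 */
    static uint8_t hamming74_encode(uint8_t nibble)
    {
        uint8_t d1 = nibble & 1, d2 = (nibble >> 1) & 1,
                d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
        uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
        uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
        uint8_t p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
        return (uint8_t)(p1 | p2 << 1 | d1 << 2 | p3 << 3 |
                         d2 << 4 | d3 << 5 | d4 << 6);
    }

    /* Decode a codeword, correcting any single flipped bit. */
    static uint8_t hamming74_decode(uint8_t cw)
    {
        uint8_t b[8] = {0};                     /* b[1..7] = codeword bits */
        for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;
        uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
        uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
        uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
        int syndrome = s1 | s2 << 1 | s3 << 2;  /* = error position, 0 if clean */
        if (syndrome) b[syndrome] ^= 1;         /* undo the single-bit flip */
        return (uint8_t)(b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3);
    }

    int main(void)
    {
        uint8_t cw = hamming74_encode(0xB);
        cw ^= 1u << 4;                          /* simulate a cosmic-ray bit flip */
        printf("recovered: 0x%X\n", hamming74_decode(cw));  /* prints 0xB */
        return 0;
    }

Two simultaneous flips in the same word defeat it, which is exactly why the multi-bit-upset question matters.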
Do space missions also have problems with bit flips in cpu cache or cpu registers? How do they deal with that?
[in 2.5Gbit of ram,] The maximum hourly error report from Cassini–Huygens in the first month in space was 3072 single-bit errors [in DRAM] per day during a weak solar flare. If the flight recorders had been designed with EDAC words assembled from widely-separated bits, the number of (uncorrectable) multiple-bit errors should average less than one per year. [1]
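The "widely-separated bits" trick mentioned there is bit interleaving: lay the memory out so that physically adjacent cells belong to different EDAC words, and a single particle strike that corrupts a run of neighboring cells produces one correctable single-bit error in each of several words instead of one uncorrectable multi-bit error in a single word. A toy sketch in C (my own illustration of the idea, not Cassini's actual layout):

    #include <stdint.h>

    /* Interleave eight 8-bit EDAC codewords so each physical byte (standing
       in for a run of adjacent memory cells) holds one bit from each of the
       eight different words. This is just a bit-matrix transpose. */
    static void interleave8(const uint8_t words[8], uint8_t phys[8])
    {
        for (int i = 0; i < 8; i++) {
            uint8_t row = 0;
            for (int j = 0; j < 8; j++)
                row |= (uint8_t)(((words[j] >> i) & 1) << j);
            phys[i] = row;               /* phys[i] = bit i of every word */
        }
    }

    static void deinterleave8(const uint8_t phys[8], uint8_t words[8])
    {
        for (int j = 0; j < 8; j++) {
            uint8_t w = 0;
            for (int i = 0; i < 8; i++)
                w |= (uint8_t)(((phys[i] >> j) & 1) << i);
            words[j] = w;
        }
    }

A strike that wipes out phys[3] entirely flips exactly one bit in each of the eight words, and a SECDED code can then repair every word.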
If flash memory reliability is one uncorrectable error every X years where X is less than hundreds or thousands (under expected environmental conditions), that doesn't seem like a comforting level of reliability if a failure means several days or weeks of using a backup computer.
Disclaimer: I'm a SIL-4 rated programmer and have worked in safety-critical systems for decades.
Yes, it's true: radiation can destroy the benefit of having an ECC controller in your design. There are very few hardware methods, short of encasing the entire device in several tonnes of lead, that will prevent this from happening when you're out there beyond the atmosphere.
So the solution is, typically, a combination of hardened CPUs (with as much shielding as the weight budget allows) plus SOFTWARE to detect the error and react accordingly.
Memory corruption is something that a SIL-4 or space-rated software system HAS to check for, actively and continuously. It's quite possible to use a number of techniques to cover the cases as much as possible. For example, you can have a process that checks the text segment of each running process against a known-valid CRC. You can use 2-out-of-3 style voting systems so that redundant decision-making can detect problems, and so on.
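To make that concrete, here is roughly what those two techniques look like in C. The linker symbols and the golden CRC are hypothetical stand-ins; a real system wires this into its build process and task scheduler:

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise (slow but tiny) CRC-32, reflected, polynomial 0xEDB88320. */
    static uint32_t crc32_region(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (n--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
        }
        return ~crc;
    }

    /* Periodic scrubber: compare the live text segment against a CRC
       recorded at build/load time. Symbols below are hypothetical. */
    extern const uint8_t  __text_start[], __text_end[];  /* linker script */
    extern const uint32_t text_crc_golden;               /* known-good CRC */

    int text_segment_ok(void)
    {
        size_t len = (size_t)(__text_end - __text_start);
        return crc32_region(__text_start, len) == text_crc_golden;
    }

    /* 2-out-of-3 bitwise majority vote over triplicated state: any single
       corrupted copy is outvoted by the other two. */
    static uint32_t vote2oo3(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

The scrubber typically runs as a low-priority background task; on a CRC mismatch you don't try to patch the code in place, you fail over or reload from a known-good image.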
From years of debugging, I know I am unconsciously biased towards the view that a bug in my code or hardware is the result of a very low-probability event, and strangely biased away from the view that a medium or high-probability series of events occurred for which I didn't perfectly plan.
It always turns out to be the latter.
Therefore, I will predict that this event is NOT a low-probability double-bit error brought on by stray radiation that bypassed all safeguards. (Unless those NASA guys are superhuman designers, which I guess could be a valid hypothesis.)
Google's DRAM study found that roughly 8% of DIMMs see at least one error per year. Cosmic-ray interference is common enough to require deliberate software rejection modes when doing something like Raman spectroscopy, so it's not as if the probability of getting interference in memory chips is low.
It's just that, normally, the probability you've made a glaring bug while coding is way higher (and punching the power button is much easier than dissecting kernel state to find out whether it's really a cosmic ray).
The thing is that a cosmic particle event such as this is NOT an especially low-probability one when you're on Mars. Similar things have happened with Spirit and Opportunity as well, even earlier in their respective missions. The engineers expect things like this to happen. I don't think they have ever been wrong when attributing a glitch to a radiation event.
Configuring the B-side computer to take control of the rover may take ... several days, maybe a week
I wish they had provided more information about what needs to be done. I would think the backup computer would be sort of a "warm standby" that could be switched over pretty quickly; from this, it sounds like they need to upload data or software first.
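For what it's worth, the textbook "warm standby" looks something like the sketch below: the primary refreshes a heartbeat, and the standby promotes itself when the heartbeat stalls. Purely illustrative; MSL's actual A/B-side arbitration is surely more involved, and the shared counter stands in for whatever cross-strapped mailbox or register the two sides would really use.

    #include <stdbool.h>
    #include <stdint.h>

    #define HEARTBEAT_TIMEOUT_TICKS 3

    static volatile uint32_t heartbeat;   /* written by primary, read by standby */

    /* Called once per cycle on the primary (A-side). */
    void primary_tick(void)
    {
        heartbeat++;
    }

    /* Polled once per cycle on the standby (B-side). */
    bool standby_should_take_over(void)
    {
        static uint32_t last_seen, stale_ticks;
        if (heartbeat == last_seen) {
            if (++stale_ticks >= HEARTBEAT_TIMEOUT_TICKS)
                return true;               /* primary presumed dead */
        } else {
            last_seen   = heartbeat;
            stale_ticks = 0;
        }
        return false;
    }

The catch, and probably part of the week-long delay, is everything this sketch leaves out: the standby still has to be loaded with current software and state before taking over is safe.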
According to NASA, "Curiosity is now operating on its B-side, as it did during part of the flight from Earth to Mars."[0]
Presumably it retained some recovery programming from the flight (as a backup), rather than an exact replica of the software the A-side had. That said, they probably keep debugging/test software on whichever computer is serving as the backup, in order to diagnose and fix the other one in the event of a problem.
I work at NASA/JPL (but I'm not part of the MSL team, or any flight software team). We do learn a good bit about flight practices, though. Here's what probably happened:
Someone spotted the problem as a behavioral anomaly or weird status bits. It got reviewed and escalated a few times until the important people knew. For class-A missions (I think lower classes too), we perform fault analysis on each component (and on the system as a whole), so that an observed behavior can be matched to the actual underlying issue through something like a flowchart. I don't know for sure, but that was probably a bit harder for a bit flip, though I wouldn't be surprised if they had provisions for that too. However, these high-profile projects are extremely risk-averse, so they'll review any non-standard commands heavily before transmitting them. Teams (plural) probably reviewed this failure before confirming the initial conclusion.
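As a sketch of what that "flowchart" might reduce to in software: a fault dictionary mapping telemetry symptoms to a probable cause and a recommended response. Every name and entry below is invented for illustration; I have no knowledge of MSL's actual fault tables.

    #include <stddef.h>

    struct fault_entry {
        unsigned    symptom_mask;      /* status bits that must all be set */
        const char *probable_cause;
        const char *recommended_action;
    };

    #define SYM_EDAC_UNCORRECTABLE (1u << 0)
    #define SYM_FILE_TABLE_CORRUPT (1u << 1)
    #define SYM_MISSED_SLEEP       (1u << 2)

    static const struct fault_entry fault_dict[] = {
        { SYM_EDAC_UNCORRECTABLE | SYM_FILE_TABLE_CORRUPT,
          "multi-bit upset in flash file system",
          "swap to B-side computer; quarantine affected flash banks" },
        { SYM_MISSED_SLEEP,
          "stuck task preventing scheduled shutdown",
          "review task telemetry; consider warm reset" },
    };

    /* Return the first dictionary entry whose symptoms are all present,
       or NULL to flag the pattern for manual analysis and escalation. */
    static const struct fault_entry *diagnose(unsigned symptoms)
    {
        size_t n = sizeof fault_dict / sizeof fault_dict[0];
        for (size_t i = 0; i < n; i++)
            if ((symptoms & fault_dict[i].symptom_mask)
                    == fault_dict[i].symptom_mask)
                return &fault_dict[i];
        return NULL;
    }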
The week accounts for this analysis and review process.
(my opinion is my own, and doesn't represent NASA or JPL)