This reminds me of a post a few months back about Voyager 2, where NASA traced the issue back to a single bit flip (and fixed it!)
Engineers successfully reset a computer onboard Voyager 2 that caused an unexpected data pattern shift, and the spacecraft resumed sending properly formatted science data back to Earth on Sunday, May 23. Mission managers at NASA's Jet Propulsion Laboratory in Pasadena, Calif., had been operating the spacecraft in engineering mode since May 6. They took this action as they traced the source of the pattern shift to the flip of a single bit in the flight data system computer that packages data to transmit back to Earth. In the next week, engineers will be checking the science data with Voyager team scientists to make sure instruments onboard the spacecraft are processing data correctly.
http://news.ycombinator.com/item?id=1459328
I'm craving more info. This is such an interesting dilemma to be in. It boggles and then blows my mind to think about how hard it is to design something well enough that it keeps working for decades, a zillion miles away, without anyone ever being able to physically touch it again.
Are the A and B computers identical? From the last sentence, it would appear that way (B will become primary and if A can be repaired it will be the new backup, implying they are interchangeable). Why does it take so long to switch to the backup? Was the backup serving another purpose and now it needs to be retrofitted to take the place of A? How is this process done?
The problem they have is that high-energy particles can hit random bits in memory and change them. The electronics carry a large amount of radiation shielding, but given enough time some particles get through anyway. That's why there's a second computer that can "take command" in the case of any failure.
The second computer is identical, so the same programs can be used (obviously). The thing about Curiosity is that it left Earth with a very minimal program, and more bits were sent to it during the journey. When it landed, it used a different program than it does now. Remote programming allows it to carry out a large number of tasks with a minimal amount of hardware.
The last thing is that everything is double- and triple-checked. If something goes wrong out there, it goes very wrong, so they ensure that everything is working fine. While the B computer is being used, they'll probably do a full bit-by-bit wipe of the A computer, then load the software back onto it.
I'd imagine that most of what you'd automate on Earth is done manually and double-checked, just in case. Add to that a round-trip signal delay that can run tens of minutes, and it makes sense that something you'd run a bash script to do here winds up taking a week on Mars.
They must use the equivalent of ECC for flash memory, I would hope. So the theory is that cosmic rays corrupted multiple bits of radiation-hardened, ECC flash memory? Should they have expected that? Is reverting to a backup computer the best option? How many bits were flipped? Couldn't they have used a flash memory controller with the ability to correct more than 1 bit per word?
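For anyone curious what single-bit correction actually looks like, here is a toy Hamming(7,4) encoder/decoder in C. Real flash and DRAM controllers use much wider SECDED codes (e.g., 8 check bits protecting 64 data bits), but the mechanism is the same: recomputing parity yields a syndrome that points directly at the flipped bit. Purely an illustrative sketch, not anything NASA actually flies.

    #include <stdint.h>
    #include <stdio.h>

    /* Encode 4 data bits into a 7-bit Hamming(7,4) codeword.
       Bit layout (1-indexed positions): p1 p2 d1 p3 d2 d3 d4 */
    static uint8_t hamming74_encode(uint8_t nibble)
    {
        uint8_t d1 = nibble & 1, d2 = (nibble >> 1) & 1,
                d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
        uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
        uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
        uint8_t p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
        return (uint8_t)(p1 | p2 << 1 | d1 << 2 | p3 << 3 |
                         d2 << 4 | d3 << 5 | d4 << 6);
    }

    /* Decode a codeword, correcting any single flipped bit. */
    static uint8_t hamming74_decode(uint8_t cw)
    {
        uint8_t b[8] = {0};                     /* b[1..7] = codeword bits */
        for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;
        uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
        uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
        uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
        int syndrome = s1 | s2 << 1 | s3 << 2;  /* = error position, 0 if clean */
        if (syndrome) b[syndrome] ^= 1;         /* undo the single-bit flip */
        return (uint8_t)(b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3);
    }

    int main(void)
    {
        uint8_t cw = hamming74_encode(0xB);
        cw ^= 1u << 4;                          /* simulate a cosmic-ray bit flip */
        printf("recovered: 0x%X\n", hamming74_decode(cw));  /* prints 0xB */
        return 0;
    }

Two simultaneous flips in the same word defeat it, which is exactly why the multi-bit-upset question matters.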
Do space missions also have problems with bit flips in cpu cache or cpu registers? How do they deal with that?
[in 2.5Gbit of ram,] The maximum hourly error report from Cassini–Huygens in the first month in space was 3072 single-bit errors [in DRAM] per day during a weak solar flare. If the flight recorders had been designed with EDAC words assembled from widely-separated bits, the number of (uncorrectable) multiple-bit errors should average less than one per year. [1]
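The "widely-separated bits" trick mentioned there is bit interleaving: lay the memory out so that physically adjacent cells belong to different EDAC words, and a single particle strike that corrupts a run of neighboring cells produces one correctable single-bit error in each of several words instead of one uncorrectable multi-bit error in a single word. A toy sketch in C (my own illustration of the idea, not Cassini's actual layout):

    #include <stdint.h>

    /* Interleave eight 8-bit EDAC codewords so each physical byte (standing
       in for a run of adjacent memory cells) holds one bit from each of the
       eight different words. This is just a bit-matrix transpose. */
    static void interleave8(const uint8_t words[8], uint8_t phys[8])
    {
        for (int i = 0; i < 8; i++) {
            uint8_t row = 0;
            for (int j = 0; j < 8; j++)
                row |= (uint8_t)(((words[j] >> i) & 1) << j);
            phys[i] = row;               /* phys[i] = bit i of every word */
        }
    }

    static void deinterleave8(const uint8_t phys[8], uint8_t words[8])
    {
        for (int j = 0; j < 8; j++) {
            uint8_t w = 0;
            for (int i = 0; i < 8; i++)
                w |= (uint8_t)(((phys[i] >> j) & 1) << i);
            words[j] = w;
        }
    }

A strike that wipes out phys[3] entirely flips exactly one bit in each of the eight words, and a SECDED code can then repair every word.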
If flash memory reliability is one uncorrectable error every X years where X is less than hundreds or thousands (under expected environmental conditions), that doesn't seem like a comforting level of reliability if a failure means several days or weeks of using a backup computer.
Disclaimer: I'm a SIL-4 rated programmer and have worked in safety-critical systems for decades.
Yes, it's true: radiation can destroy the benefit of having an ECC controller in your design. There are very few hardware methods, short of encasing the entire device in several tonnes of lead, that will prevent this from happening when you're out there beyond the atmosphere.
So the solution is, typically, a combination of hardened CPUs (with as much shielding as the weight budget allows) plus SOFTWARE to detect the error and react accordingly.
Memory corruption is something that a SIL-4 or space-rated software system HAS to check for, actively and continuously. It's quite possible to use a number of techniques to cover the cases as much as possible. For example, you can have a process that checks the text segment of each running process against a known-valid CRC. You can use 2-out-of-3 style voting systems so that redundant decision-making can detect problems, and so on.
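To make that concrete, here is roughly what those two techniques look like in C. The linker symbols and the golden CRC are hypothetical stand-ins; a real system wires this into its build process and task scheduler:

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise (slow but tiny) CRC-32, reflected, polynomial 0xEDB88320. */
    static uint32_t crc32_region(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (n--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
        }
        return ~crc;
    }

    /* Periodic scrubber: compare the live text segment against a CRC
       recorded at build/load time. Symbols below are hypothetical. */
    extern const uint8_t  __text_start[], __text_end[];  /* linker script */
    extern const uint32_t text_crc_golden;               /* known-good CRC */

    int text_segment_ok(void)
    {
        size_t len = (size_t)(__text_end - __text_start);
        return crc32_region(__text_start, len) == text_crc_golden;
    }

    /* 2-out-of-3 bitwise majority vote over triplicated state: any single
       corrupted copy is outvoted by the other two. */
    static uint32_t vote2oo3(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

The scrubber typically runs as a low-priority background task; on a CRC mismatch you don't try to patch the code in place, you fail over or reload from a known-good image.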
From years of debugging, I know I am unconsciously biased towards the view that a bug in my code or hardware is the result of a very low-probability event, and strangely biased away from the view that a medium or high-probability series of events occurred for which I didn't perfectly plan.
It always turns out to be the latter.
Therefore, I will predict that this event is NOT a low-probability double-bit error brought on by stray radiation that bypassed all safeguards. (Unless those NASA guys are superhuman designers, which I guess could be a valid hypothesis.)
Google's DRAM study found that roughly 8% of DIMMs see at least one error per year. Cosmic-ray interference is common enough to require deliberate software rejection modes when doing something like Raman spectroscopy, so it's not as if the probability of getting interference in memory chips is low.
It's just that, normally, the probability you've made a glaring bug while coding is way higher (and punching the power button is much easier than dissecting kernel state to find out whether it's really a cosmic ray).
The thing is that a cosmic particle event such as this is NOT an especially low-probability one when you're on Mars. Similar things have happened with Spirit and Opportunity as well, even earlier in their respective missions. The engineers expect things like this to happen. I don't think they have ever been wrong when attributing a glitch to a radiation event.
Configuring the B-side computer to take control of the rover may take ... several days, maybe a week
I wish they had provided more information about what needs to be done. I would think the backup computer would be sort of a "warm standby" that could be switched over pretty quickly; from this, it sounds like they need to upload data or software first.
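For what it's worth, the textbook "warm standby" looks something like the sketch below: the primary refreshes a heartbeat, and the standby promotes itself when the heartbeat stalls. Purely illustrative; MSL's actual A/B-side arbitration is surely more involved, and the shared counter stands in for whatever cross-strapped mailbox or register the two sides would really use.

    #include <stdbool.h>
    #include <stdint.h>

    #define HEARTBEAT_TIMEOUT_TICKS 3

    static volatile uint32_t heartbeat;   /* written by primary, read by standby */

    /* Called once per cycle on the primary (A-side). */
    void primary_tick(void)
    {
        heartbeat++;
    }

    /* Polled once per cycle on the standby (B-side). */
    bool standby_should_take_over(void)
    {
        static uint32_t last_seen, stale_ticks;
        if (heartbeat == last_seen) {
            if (++stale_ticks >= HEARTBEAT_TIMEOUT_TICKS)
                return true;               /* primary presumed dead */
        } else {
            last_seen   = heartbeat;
            stale_ticks = 0;
        }
        return false;
    }

The catch, and probably part of the week-long delay, is everything this sketch leaves out: the standby still has to be loaded with current software and state before taking over is safe.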
According to NASA, "Curiosity is now operating on its B-side, as it did during part of the flight from Earth to Mars."[0]
Presumably it retained some recovery programming from the flight (as a backup), rather than an exact replica of the software the A-side had. That said, they probably keep debugging/test software on whichever computer is serving as the backup, in order to diagnose and fix the other one in the event of a problem.
I work at NASA/JPL (but I'm not part of the MSL team, or any flight software team). We do learn a good bit about flight practices, though. Here's what probably happened:
Someone spotted the problem as a behavioral anomaly or weird status bits. It got reviewed and escalated a few times until the important people knew. For class-A missions (I think lower classes too), we perform fault analysis on each component (and on the system as a whole), so that an observed behavior can be matched to the actual underlying issue through something like a flowchart. I don't know for sure, but that was probably a bit harder for a bit flip, though I wouldn't be surprised if they had provisions for that too. However, these high-profile projects are extremely risk-averse, so they'll review any non-standard commands heavily before transmitting them. Teams (plural) probably reviewed this failure before confirming the initial conclusion.
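As a sketch of what that "flowchart" might reduce to in software: a fault dictionary mapping telemetry symptoms to a probable cause and a recommended response. Every name and entry below is invented for illustration; I have no knowledge of MSL's actual fault tables.

    #include <stddef.h>

    struct fault_entry {
        unsigned    symptom_mask;      /* status bits that must all be set */
        const char *probable_cause;
        const char *recommended_action;
    };

    #define SYM_EDAC_UNCORRECTABLE (1u << 0)
    #define SYM_FILE_TABLE_CORRUPT (1u << 1)
    #define SYM_MISSED_SLEEP       (1u << 2)

    static const struct fault_entry fault_dict[] = {
        { SYM_EDAC_UNCORRECTABLE | SYM_FILE_TABLE_CORRUPT,
          "multi-bit upset in flash file system",
          "swap to B-side computer; quarantine affected flash banks" },
        { SYM_MISSED_SLEEP,
          "stuck task preventing scheduled shutdown",
          "review task telemetry; consider warm reset" },
    };

    /* Return the first dictionary entry whose symptoms are all present,
       or NULL to flag the pattern for manual analysis and escalation. */
    static const struct fault_entry *diagnose(unsigned symptoms)
    {
        size_t n = sizeof fault_dict / sizeof fault_dict[0];
        for (size_t i = 0; i < n; i++)
            if ((symptoms & fault_dict[i].symptom_mask)
                    == fault_dict[i].symptom_mask)
                return &fault_dict[i];
        return NULL;
    }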
The week accounts for this analysis and review process.
(my opinion is my own, and doesn't represent NASA or JPL)