> and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways.
This is so important. Imagine if it had crashed and damaged its communications; that would have been the end of it. This extra engineering effort really paid off and gave Ingenuity a second life, and gave NASA a really good opportunity to learn important lessons and do further debugging.
Also, the article at mars.nasa.gov is a delight to read, with all the information one would want to know right there. Not to mention that there is even a video for us to see.
Note that this robustness seems to refer to just the physical build, not that the code had a glitch yet somehow saved the day.
> Flight Six ended with Ingenuity safely on the ground because a number of subsystems – the rotor system, the actuators, and the power system – responded to increased demands to keep the helicopter flying. In a very real sense, Ingenuity muscled through the situation, and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways.
The code being robust and self-correcting, rather than the increased demands coincidentally being within tolerance levels, would have been more interesting or laudable.
Afaict, JPL and NASA (and subcontractors) have a very strong historical knowledge base of building maximally-robust systems with limited redundancy hardware (in the sense of not having an unlimited mass budget for more hardware).
But then again, they've been working with semi-autonomous systems about as long as anybody.
Yes, but in this case that was just good luck. As the article mentions, the low-altitude behavior is because of concerns regarding dust. It was not intended to provide robustness in the face of glitches in the optical navigation system.
Dust concerns were one reason, but avoiding glitches and ensuring the smoothness of position data is another.
> Then, once the vehicle estimates that the legs are within 1 meter of the ground, the algorithms stop using the navigation camera and altimeter for estimation, relying on the IMU in the same way as on takeoff. As with takeoff, this avoids dust obscuration, but it also serves another purpose -- by relying only on the IMU, we expect to have a very smooth and continuous estimate of our vertical velocity, which is important in order to avoid detecting touchdown prematurely.
I wonder if, had Ingenuity disabled the visual system upon noticing the timestamp conflicts, relying on the IMU alone would have made the whole situation safer.
I think it's interesting and laudable that the hardware coped. But yeah, on the face of it this sounds very much like brittle software. For 46 seconds the craft's performance suffered from a single frame being dropped. It sounds like either the calculations were smeared over way too much time, or (more likely) the code made too many assumptions that relied on every piece of the pipeline properly behaving, not only in the moment but also over the entire history of the flight.
I'm speaking from extreme ignorance obviously. But this reminds me of a million code reviews I've done where I've asked developers to make fewer assumptions about the state of the system. Often the response is something like "but how could that ever happen?" And my response is always "I have no idea, but shit happens."
I would love to see a postmortem that discussed the specifics of what went wrong in the software, and whether they can attribute the lack of robustness to system design flaws.
> This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps.
It was not the single lost frame itself that caused this issue. The issue was that the glitch corrupted the timestamp accuracy of the frames that followed. I'd guess the dropped frame was just a symptom, maybe of an initial memory corruption or something like that.
In other words, it was not the dropped frame, and the missing piece of information it represented, that had the negative effects.
> The code being robust and self-correcting, rather than the increased demands coincidentally being within tolerance levels, would have been more interesting or laudable.
This is the good thing about the overall outcome of this incident: They now have a chance to learn from this.
It would be interesting to know from NASA whether, had Ingenuity crashed and lost all comms, they would still have been able to determine the cause of the issue. I mean the wrong timestamps of the images, not whatever caused them.
It would also be interesting to know what caused them, if it was a bug in the software, or a particle flipping a bit in the timestamping-code.
I had the same response while reading it, too. Like, wow, this is telling me all the technical things I wanted to know about the situation. This is one reason I pay for IEEE, they do a great job relaying the facts in summary articles from the main source.
This is the kind of failure mode that I imagine (now with hindsight) should be mitigated by (re)designing the system in such a way that the timestamp desync would always be detected because each part of the system should have its own internal model of the approximate times that it should be receiving from other systems and never blindly trust them.
I imagine it is sort of like the organisms early in evolution that simply believed their sensory system when it had in fact been hijacked by something that wanted to eat them. There aren't many sensory systems with those issues these days because anything that doesn't have an internal model that it can use for comparison winds up being a snack or crashing on an alien planet.
We'll have to wait and see if they publish a public post-mortem, but this is a textbook real-time scheduling problem and it surprises me. Scheduling theory teaches to abandon processing (a frame/image in this case) if the result is late, both to avoid doing unnecessary work and to recover from sync/latency issues.
From the minimal descriptions I've seen so far, it seems like they were using producer/consumer work queues for the image processing without further synchronization.
Is it "just a scheduling problem"? For example if the system is designed to expect every frame being delivered and the algorithms can't handle "incomplete" data (a frame drop), then you have to redesign your algorithms. From the descriptions of the problem it sounded to me like a similar situation - it expected the N+1-th frame to actually be the N-th frame and treated it as such.
Say we assume this is a complete real-time system. How would a result be late? It seems like a failure mode that doesn't exist in that domain (given that all guarantees are verified).
Now, I get that this is probably not a real-time system.
Yes, it seems the control system isn't real-time anyway, so this discussion is moot. But to answer your question:
A real-time system can still have inputs that cannot be guaranteed, and internal processes can have an upper and lower bound for their execution time. So a result would be late if either:

1. $T_{arrival} + t_{min} > T_{deadline}$, i.e. even the fastest possible processing of the input could not finish by the deadline, or

2. processing that was admitted in time is still running at $T_{deadline}$

(where $T$ is a timestamp given by some time source and $t$ is process duration). Condition 1 can be determined on arrival of the data, and the input can be discarded immediately. Condition 2 cannot be determined beforehand, but if the processing isn't finished at $T_{deadline}$, the process/thread could be killed without waiting for it to complete.
Of course, this requires that an accurate deadline can be determined for each input packet. The textbook use case is for rendering live video streams, where stream latency is more important than rendering each frame accurately. This flight control system is a similar use case, since the utility of the camera feed for determining location or drift rapidly declines as the picture ages.
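A conservative version of that on-arrival admission test, sketched with an invented worst-case processing time (nothing here is from the actual flight software):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical worst-case processing time, in microseconds. */
#define T_PROC_MAX_US 20000

/* Decided on arrival: admit an input only if even the worst-case
   processing time still meets the deadline; otherwise discard it
   immediately instead of producing a late result. */
bool admit_input(uint64_t t_arrival_us, uint64_t t_deadline_us)
{
    return t_arrival_us + T_PROC_MAX_US <= t_deadline_us;
}
```

Condition 2 would still need a watchdog that kills the worker at the deadline; this only covers the on-arrival check.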
But if the timestamp-keeping isn't accurate as it seems in this case, it really doesn't matter if the system was real-time or not.
There are two copies of memory and a RAD-hardened FPGA that compares them and looks for bit flips. If it detects one, it restarts, and so quickly that it can do this mid-flight with no issues.
It doesn't a priori preclude an RTOS, but multitasking OSes of Unix heritage come with so much 'free stuff' (networking, instrumentation, high-quality compilers, experience, etc.) that they are often the choice. For reasons I don't fully understand, and I'm in 'the business', true real-time embedded systems tend to have relatively simple TCP/UDP/IP stacks and simple HTTPS, and by simple I mean 'less robust than those on Unix'. I think a fair amount of that is due to power budgets: embedded systems tend to be small or tiny. However, in this case, with that super-beefy motor system driving the blades at many, many watts, there would seem to be little point in saving microwatts in the MPU. So one could run an RTOS on a smartphone chip, but such a chip is already 'fast enough' for human interaction, and the extra work of getting finite, well-defined real-time response doesn't really make much difference in smartphone apps if the chip is 'fast enough'.
For RTOS such a chip is always fast enough. The problem is IO latency and software.
An RTOS is typically a much slower and much smaller OS. FreeRTOS, for example, consists of three small source files, and the control loop might run at 1 kHz to 10 kHz (for extremely high-dynamic systems). Compare that to the Snapdragon's 2 GHz clock: a factor of roughly 1e6.
HTTPS in real time is obviously a joke. There exist proper networking protocols for hard real-time, not randomized and spiky Ethernet-based protocols. Telcos use such proper real-time slicing protocols.
When the conditions under which the code was verified are violated? Like, the camera chip was supposed to return pictures, the CPU was supposed to increment the PC reliably, the board was supposed to not experience more than 5000 g, etc.? I'm not an expert though.
It seems to me that the problem was too much scheduling.
If it really were just producer/consumer work queues, one frame would have been lost, the program would have compared images two frames apart and seen the craft moving twice as fast as it was supposed to; this would have caused a minor excursion, but the next frame would have been right and the system would very quickly have settled back down to proper flight.
The problem is that it wasn't simply comparing frames, but it had an internal clock the frames were being compared against.
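To make the "twice as fast" effect concrete (all numbers invented): optical-flow velocity is just displacement over assumed elapsed time, so pairing images that are really 2/30 s apart while believing they are 1/30 s apart doubles the estimate.

```c
#include <assert.h>
#include <math.h>

/* Velocity from feature displacement between two frames: a wrong
   dt corrupts the estimate in direct proportion. */
double flow_velocity(double displacement_m, double dt_s)
{
    return displacement_m / dt_s;
}
```

With 0.266 m of real motion over 2/30 s the true speed is about 4 m/s, but dividing by an assumed 1/30 s yields about 8 m/s.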
I struggle to think that JPL didn’t think of this already.
That in the distance/time code they didn't take timestamps of each successful frame, instead of just assuming it's a 30 Hz camera, so surely each frame will arrive at 30 Hz!
It seems to me, as someone who does similar embedded work, that if I would account for a missing frame failure, they surely would. I am a professional in embedded timing systems - I am guessing this isn’t the full story or it’s been paraphrased for normal people and something important was lost.
EDIT: Better info below. The photo was lost but its timestamp was not. They apparently didn't tightly couple the image and its metadata together (hashes... what are they for!?), so there was some sort of mix-up there. Knew there was more to the story.
I wonder if hashes are discouraged at JPL due to a historical focus on "correct, and provably correct" code. Hash collisions happen.
Yes, you can make collisions arbitrarily unlikely, and there are so many situations where they come in useful, but I can imagine an engineering culture where they are just never an easy tool to reach for.
I mean, you could take anything as an axiom so there are no real requirements for it.
I would imagine you would want to demonstrate that P(collision | domain + operating lifetime) << acceptable level of risk, and after you'd done that, take the uniqueness of hashes as an axiom for the proof system?
That’s just speculation on my part, since I haven’t done much with formal proofs of working code
No you just special case the function call and tell the proof analyzer "this never happens". Strictly speaking, a lie, but probably a sophisticated prover could even be rigged to keep track of those conditions (not in x years, e.g.) if you so chose
This is actually a compulsory requirement for safe communication protocols in railway systems according to EN 50159. The correct way to manage it is to reject the outdated samples. Ultimately, this should result in a transition to a fallback mode (in this case, emergency landing mode).
This is actually all very complicated, because if you're doing e.g. velocity estimation via smoothed differences, missing a point means you've (effectively) inserted a spike-like pseudo-datapoint into your Kalman filter (or whatever). It will not fail completely, but it will be incorrect until the 'bad' datapoint flows completely through the filter.
... Indeed. Therefore we add sensor diversity to balance the failure modes and increase availability. Having no sample means that you increase your confidence interval at warp speed to account for all possible scenarios. Very quickly your estimate becomes : « great, I am traveling at 35kph +/- 100kph. Wait. What? »
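A toy scalar filter shows the mechanism: on a missing sample you run only the predict step, so the variance grows and no phantom measurement enters the estimate. The constants here are made up, not the actual estimator.

```c
#include <assert.h>
#include <math.h>

/* Scalar random-walk Kalman filter. Skipping kf_update on a
   missing sample inflates p (uncertainty) instead of feeding the
   filter a fabricated datapoint. */
typedef struct { double x, p; } kf_t;

void kf_predict(kf_t *kf, double q) { kf->p += q; }

void kf_update(kf_t *kf, double z, double r)
{
    double k = kf->p / (kf->p + r);   /* Kalman gain */
    kf->x += k * (z - kf->x);
    kf->p *= 1.0 - k;
}
```

Each dropped sample means one more predict-only cycle, which is exactly the "confidence interval growing at warp speed" effect described above.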
I think it can be simpler than that. If a piece of code is going to do a velocity calculation from motion estimation between two frames it should expect to get new data every 1 / FPS seconds. A local clock could help validate that. If data comes too fast or too slow within some tolerance then the system could signal there is an issue. If the images have time stamps it could check those too.
Now you’d need to hope the inertial guidance is good enough at those altitudes. The article says it doesn’t use image based calculations for landing due to the possibility of kicking up dust.
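The local-clock validation described above might look something like this sketch (the period and tolerance are invented, not flight values):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAME_PERIOD_US 33333  /* 30 Hz nominal */
#define TOLERANCE_US     5000  /* hypothetical slack */

/* Signal an issue if a frame arrives too fast or too slow
   relative to the expected 1/FPS spacing on the local clock. */
bool frame_interval_ok(uint64_t prev_us, uint64_t now_us)
{
    int64_t err = (int64_t)(now_us - prev_us) - FRAME_PERIOD_US;
    return err > -TOLERANCE_US && err < TOLERANCE_US;
}
```

A dropped frame shows up as roughly double the period and trips the check.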
> "...This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps..."
It seems that the timestamping is detached from the frame capture. That is, as if captures happen in a burst, and then the burst's frames get assigned sequential timestamps. Thus if a burst has a glitched/skipped frame, part of the series and all subsequent bursts become mis-attributed.
I understand that making the timestamping atomic with the frame capture would likely slow down the burst. So there has to be some expected internal timeline against which to validate the timeline that results from bursting.
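One cheap way to catch that mis-attribution without making timestamping atomic, sketched with invented numbers: if the capture hardware numbers every exposure, the sequence gap and the timestamp gap must agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAME_PERIOD_US 33333  /* 30 Hz nominal */

typedef struct { uint32_t seq; uint64_t stamp_us; } frame_meta_t;

/* A skipped frame appears as a sequence gap; the elapsed time
   must match that gap, or the frame/timestamp pairing slipped. */
bool pairing_consistent(frame_meta_t prev, frame_meta_t cur)
{
    uint64_t dt = cur.stamp_us - prev.stamp_us;
    uint64_t expect = (uint64_t)(cur.seq - prev.seq) * FRAME_PERIOD_US;
    uint64_t slack = FRAME_PERIOD_US / 2;
    return dt + slack > expect && dt < expect + slack;
}
```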
I don't have anything specific in mind that covers this (I'll look around), but maybe take a look at the work by Nicholas Strausfeld for some actual data on early nervous systems [0], and Leigh van Valen for the general theoretical framework [1].
I feel this bug in my bones. I work a lot on WAMI/aerial imagery. I've dealt with this exact problem of timestamp-off-by-one-image. I have 9 cameras with different exposure times all triggered by the same hardware pulse. Fortunately my system is data-collect only so the worst outcome is some wrong filenames.
I learned only way too late in the project that most GigE Vision cameras emit timestamps that can be synced using NTP and/or PTP. The camera specs/manual said nothing about this; I only found it by accident, looking through the GenICam XML API.
Moral of the story: ensure your cameras emit timestamps on device. Don't rely on system time.
Unless you need to sync times, such as with connected low-power devices that need to wake up by themselves (like Mars devices running for more than a few years). You cannot wake them up via a ping. Device time is then unreliable, so you need to sync with network time and then propagate the time corrections to your measurements, properly.
Google/Amazon have a similar problem with eventually consistent databases, but low-power (NB-IoT) meshes are harder, as synced wakeups are critical. Don't rely on device time.
Look at the quality of those images of the ground in that video. Those blades are whirring around at super-high rates to provide lift in the Martian atmosphere and taking 30 pictures per second, but in each exposure you can clearly see the shadow of the blades with very little blurring. And with ambient solar brightness quite a bit lower than Earth's "full sunlight", too. Those are quality cameras.
Only a factor of 2.3 lower. Sunlight is extremely bright: full sunlight on Mars is much brighter than a cloudy day on Earth, and an order of magnitude brighter than, say, an office or bedroom on Earth.
The first image, from the Perseverance rover's MastCam-Z, is at least as much "true color" as any ordinary camera. (That is, the precise color might be slightly off due to differences between the camera's sensitivity and your monitor's spectrum, but it should be pretty close.)
The video from Ingenuity itself is taken by a black-and-white camera and has no color information.
When you're sending stuff to other planets the cost of what you're sending is usually a trivial part of the cost of the mission. So long as it's off the shelf and not heavier you send the best there is.
"The modern concept of the vacuum of space, confirmed every day by experiment, is a relativistic ether. But we do not call it this because it is taboo."
I am aware of the theory of quantum vacuum, but I generally take Aether to mean the old proposal as the medium light waves travel through. It has been refuted.
Don't throw your laptop out yet. Flash it directly. Well, it depends on the laptop/chip of course, but do your research: https://youtu.be/Gdehz26lYWM?t=173
My favorite is we got details. Legitimate details that give us a pretty clear picture where it went wrong and how it can be corrected. No calling it "software glitch" and stopping there. Those kinds of articles frustrate me to no end.
As a very small portion of the HN crowd, I believe we are quite pleased.
The most incredible thing for me about this glitch is that as far as I know it's the first serious error in the entire mission. The live coverage of the landing, and the combined video that followed, was utterly astounding and humbling. Dare mighty things, indeed!
The communications from NASA's Mars teams have been incredibly thorough and transparent. That they share all of the raw images soon after they are received from Mars gives an incredible feeling of inclusion.
"For the majority of time airborne, the downward-looking navcams takes 30 pictures a second of the Martian surface and immediately feeds them into the helicopter’s navigation system. Each time an image arrives, the navigation system’s algorithm performs a series of actions: First, it examines the timestamp that it receives together with the image in order to determine when the image was taken. Then, the algorithm makes a prediction about what the camera should have been seeing at that particular point in time, in terms of surface features that it can recognize from previous images taken moments before (typically due to color variations and protuberances like rocks and sand ripples). Finally, the algorithm looks at where those features actually appear in the image. The navigation algorithm uses the difference between the predicted and actual locations of these features to correct its estimates of position, velocity, and attitude.
Approximately 54 seconds into the flight, a glitch occurred in the pipeline of images being delivered by the navigation camera. This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps. From this point on, each time the navigation algorithm performed a correction based on a navigation image, it was operating on the basis of incorrect information about when the image was taken. The resulting inconsistencies significantly degraded the information used to fly the helicopter, leading to estimates being constantly “corrected” to account for phantom errors. Large oscillations ensued."
As for why this didn't crash the helicopter, the hardware managed to keep up with the executed corrections and oscillations:
> Flight Six ended with Ingenuity safely on the ground because a number of subsystems – the rotor system, the actuators, and the power system – responded to increased demands to keep the helicopter flying. In a very real sense, Ingenuity muscled through the situation, and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways.
Here's a cool relevant podcast (from before flight 6 but gives some nice insights into what it takes to fly that Helicopter on Mars): [0]
Description:
Tim Canham, Mars Helicopter Operations Lead at NASA’s JPL joins us again to share technical details you've never heard about the Ingenuity Linux Copter on Mars. And the challenges they had to work around to achieve their five successful flights.
Let's put a pin in your prediction; it's definitely an interesting possibility!
54 seconds of 29.97 frames: 1618.38 frames
54 seconds of 30.00 frames: 1620.00 frames
Toss in a bit of rounding errors and/or tolerances and it's definitely suspiciously close to where a 1 frame error appears between those two.
I'd suggest that the evidence against it would be that while this may be the longest flight on Mars, it seems certain that they'd have taken longer flights on Earth than this. But still, the math nearly checks out.
Seems to be the main question with no clear answer yet. One of the purposes of Ingenuity was to demonstrate non-radhard electronics working on Mars, and a frame-dropping glitch could have happened anywhere from software to hardware.
Whatever the issue, presumably it did not arise during testing. That seems to point toward something that would only happen on Mars... so, yeah, maybe radiation?
It's incredible that the thing works at all, but the resilience they seem to have accidentally built in (switching away from visual navigation at landing), and its surviving such a wild flight envelope, is just remarkable.
I bought a cheap drone on a common importing website and the thing likes to get lost even though it has GPS and is in the perfect clear. It flew into my truck, uncommanded, at full speed. On Earth. With GPS. So for this thing on another planet to do what it just did? Wow.
The article links to NASA, but here it is anyway. Let me skim it; I might add some additional information here. But go and check out the first picture there yourself: it is amazing. By the looks of it, it could just as well be an aerial photograph of a desert area on Earth.
OK, so from the NASA article, it seems that they use an inertial measurement unit [0] (think an accelerometer/rotation sensor similar to what your smartphone uses) that reads out heading and acceleration 500 times per second and integrates it to get the helicopter's position. This is "dead reckoning" [1], which unfortunately suffers from accumulating errors. To work around these accumulating errors, they correct their prediction against camera data thirty times per second by predicting how features visible on the surface should have moved in that time (typically recognizable from color variations and protuberances like rocks and sand ripples). So, sensor fusion [2], in a sense.

The images are delivered with timestamps (I would have been surprised if they were not, to be honest), but one of the images went missing and, for some reason, its timestamp was not. (Variable not cleared? Are they maybe using parallel queues to match images to timestamps? That would be very weird, imo, but I am no helicopter scientist.) Either way, the following timestamps were now attributed to the wrong images. I really wonder how that happened and hope they expand on it later.

As the system predicts how far a feature in an image should have moved between frames, things were now considerably off (example: from the IMU prediction, a feature should have moved 1 unit per second, but with a missing frame and the previous timestamp, it now looks as if it moved 2 units in a second), and the state gets updated with this wrong information.
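A toy 1-D version of that IMU-plus-camera scheme (purely illustrative; the real system is a full state estimator, and the gain here is invented):

```c
#include <assert.h>
#include <math.h>

typedef struct { double pos, vel; } nav_state_t;

/* Dead reckoning: integrate acceleration into velocity, then
   velocity into position; errors accumulate over time. */
void imu_step(nav_state_t *s, double accel, double dt)
{
    s->vel += accel * dt;
    s->pos += s->vel * dt;
}

/* Periodic vision fix: nudge the estimate toward the
   camera-derived position to bound the accumulated drift. */
void vision_correct(nav_state_t *s, double pos_meas, double gain)
{
    s->pos += gain * (pos_meas - s->pos);
}
```

The failure mode in the article corresponds to calling vision_correct with a position derived from the wrong timestamp: the "fix" then injects error instead of removing it.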
Here is an interesting bit about safety margins from this article by the way:
> Despite encountering this anomaly, Ingenuity was able to maintain flight and land safely [...]. One reason it was able to do so is the considerable effort that has gone into ensuring that the helicopter’s flight control system has ample “stability margin”: We designed Ingenuity to tolerate significant errors without becoming unstable, including errors in timing. This built-in margin was not fully needed in Ingenuity’s previous flights, because the vehicle’s behavior was in-family with our expectations, but this margin came to the rescue in Flight Six.
None of this explains why images and timestamps are considered in an absolute sense. If they were only considered relative to the previous one, then after surviving the 1/30 s glitch (which, at the mentioned speed of 4 m/s, would be 13 cm or roughly 5" of distance covered: 11% of Ingenuity's rotor span of 1.2 m, or a maximum tilt of under 15 degrees at a height of 50 cm, though the base height used by the algorithm probably ignores the legs and just takes the body-to-rotor height), it should go back to acting normally.
Even in absolute terms, once you are 110 seconds into the flight, we are talking about an offset of (1/30) s over 110 s, or 0.03%. If the control algorithm can get confused by an error of this magnitude, it's not a very good one.
But honestly, I am sure the algorithm is much better than that, and I believe the explanation is "watered down" for the general populace, and a lot more is in play that's not being shared.
It sounds like the alignment error persisted for the remainder of the flight. Depending on the direction of the slip, the visual system would either record motion that wasn’t yet sensed by the IMU or it would record no motion when there was some recorded by the IMU. In both cases the control loops would likely try to correct for the discrepancy. The large control inputs sound a bit like ’integral windup’ in a PID loop where persistent error between desired and observed states are essentially summed over time resulting in larger control input (in the case of a drone this would be helpful for overcoming wind).
Unless they box that in with limits you could easily get to a point where it starts flying like there’s a chimp on the yoke and either crash or saturate the navigation system and have it fall back/fail entirely.
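For reference, the "boxing in with limits" looks like a clamped integrator in a PI loop; the gains and limits below are made up:

```c
#include <assert.h>
#include <math.h>

typedef struct { double kp, ki, integ, integ_max; } pi_t;

/* PI step with anti-windup: the integrator is clamped so a
   persistent error cannot wind it up without bound. */
double pi_step(pi_t *c, double err, double dt)
{
    c->integ += err * dt;
    if (c->integ >  c->integ_max) c->integ =  c->integ_max;
    if (c->integ < -c->integ_max) c->integ = -c->integ_max;
    return c->kp * err + c->ki * c->integ;
}
```

Without the clamp, a persistent phantom error like the timestamp slip would keep growing the integral term, and with it the control inputs.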
The control systems behavior is less interesting to me (yes it could be windup or many other reasons for oscillation) than why the (u?)kf behaved this way. The single frame could have caused a time slip for the rest of the flight but usually there’s something that syncs timestamps before the filter acts on it. I hope we can get a technical postmortem.
A small error in position would lead to a larger error in the speed estimate. To correct the speed, you have to pitch the helicopter. But the speed estimate keeps being wrong, leading to oscillation.
Because of using multiple, independent sensors (camera in the air, IMU near the ground) a disaster was averted. This is an important lesson: do not rely totally on one sensing modality. ( I wish Elon would learn this lesson: https://jalopnik.com/teslas-removing-radar-for-semi-automate... )
I wonder why not have a real-time clock capture timestamps for different events of the same frame (one when the trigger for the frame fires, one when it is written to memory, etc.) and do a comparative analysis. Discard the frame when there is a discrepancy, restart the capture from the beginning, and go into a default hover state when such a fault is found.
Maybe it was just "impossible". Hard real-time systems are not supposed to lose frames, sometimes provably so.
You have to make some assumptions. For example, that pure functions will always return the same result with the same arguments. In fact, in real life, it is not always true, especially in space where a cosmic ray can flip a bit. You may try do be defensive to account for an unreliable hardware, but you have to draw the line at some point.
Yes, it should have been tested, but it doesn't surprise me that it wasn't, even for a company with as good a track record as JPL.
Corner cases are hard. Here's a little C program; try to figure out what it prints before you compile & run it. It computes a step size between two numbers, given a number of steps:

  #include <stdio.h>

  int f(unsigned n_steps, int from_val, int to_val) {
      int step_size = (from_val - to_val) / n_steps;
      return step_size;
  }

  int main(void) {
      printf("%d\n", f(10, 0, 100));
      return 0;
  }
Yeah, they are, but your example is not good enough; it is like an example from a textbook on the traps of C.
Corner cases are hard when you are trying to do something new, because if you had done it before, you'd know most of them. Or if someone else did it before and wrote a textbook. =)
I wonder what the shutter speed of the nav camera used in that GIF was. They said the frames themselves are captured at 30 Hz, but for the shadow of the rotors to be frozen without blurring, the shutter must be much shorter. Does anyone here know?
It looks like the real issue is "we wrote all the camera code ourselves rather than using battle-tested libraries".
The android camera API for example will tag every frame with the timestamps and exposure settings it was captured with. Unit and integration tests can simulate dropped frames and verify the algorithm still runs as it should.
I would guess they tried to go for direct camera hardware access and ended up writing their own logic to schedule frames, and it contained a bug.
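The kind of unit test described above doesn't need real hardware; a simulated dropped frame is just a discontinuity in the sequence numbers (this helper is hypothetical, not anything from the actual codebase):

```c
#include <assert.h>
#include <stdint.h>

/* Count discontinuities in a stream of frame sequence numbers;
   a test can inject gaps and assert the pipeline notices them. */
int count_gaps(const uint32_t *seq, int n)
{
    int gaps = 0;
    for (int i = 1; i < n; i++)
        if (seq[i] != seq[i - 1] + 1)
            gaps++;
    return gaps;
}
```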
I mean they weren't "forced" to use anything. They chose the chip. If they had different requirements they could have chosen a different chip.
Most likely they just used the specific kernel version with closed binary blobs from Qualcomm for a typical embedded Linux, not the full Android system.
>And where, pray tell, would you find battle tested libraries for flying a drone on Mars?
this isn't magic. Image processing isn't a wholly unknown field when under the influence of Martian gravity.
It's a staggering achievement, and a testament to humankind's ability to wrangle technology, but -- to be frank here -- visual odometry is technology from 1960s era spy planes, not beyond-human voodoo.
Sometimes errors are stupid ones, even if the stupidity is causing problems a zillion miles away on distant unknown frontiers being first explored by humankind.
tl;dr : This visual odometry problem isn't a space problem, it's a computer science problem -- one that has been encountered by numerous people even here on earth.
The repercussions of such a simple non-space-problem might very well turn into space problems when the craft crashes, however -- and that's a damn shame given that this particular set of problems has been encountered-and-fixed by numerous earthlings over and over again.
super tl;dr : as an only half-informed-participant in this discussion I bet this problem is NIH-syndrome-related, a problem NASA has encountered a lot in the past, a problem that's entirely managerial and very hard to eradicate.
It does sound like something you’d expect unit and integration tests to catch, but I am not sure that necessitates the use of off the shelf libraries?
In the context of a space helicopter, I'd guess that deep understanding and ownership of each software component is super important when you're likely to have to support and debug it from millions of miles away. At this scale, and in this context, doesn't "rolling your own" critical components, in some cases, make sense?