
This is the kind of failure mode that I imagine (now with hindsight) should be mitigated by (re)designing the system so that the timestamp desync would always be detected: each part of the system should have its own internal model of the approximate times it should be receiving from other parts, and it should never blindly trust them.

I imagine it is sort of like the organisms early in evolution that simply believed their sensory system when it had in fact been hijacked by something that wanted to eat them. There aren't many sensory systems with those issues these days because anything that doesn't have an internal model that it can use for comparison winds up being a snack or crashing on an alien planet.




We'll have to wait and see if they publish a public post-mortem, but this is a textbook real-time scheduling problem, and it surprises me. Scheduling theory teaches you to abandon processing (a frame/image in this case) if the result will be late, both to avoid doing unnecessary work and to recover from sync/latency issues.

From the minimal descriptions I've seen so far, it seems like they were using producer/consumer work queues for the image processing without further synchronization.


Is it "just a scheduling problem"? For example if the system is designed to expect every frame being delivered and the algorithms can't handle "incomplete" data (a frame drop), then you have to redesign your algorithms. From the descriptions of the problem it sounded to me like a similar situation - it expected the N+1-th frame to actually be the N-th frame and treated it as such.


Say we assume this is a proper real-time system. How would a result be late? That seems like a failure mode that doesn't exist in that domain (given that all guarantees are verified).

Now, I get that this is probably not a real-time system.


Yes, it seems the control system isn't real-time anyway, so this discussion is moot. But to answer your question:

A real-time system can still have inputs that cannot be guaranteed, and internal processes can have an upper and a lower bound on their execution time. So a result would be late if either:

  1) $T_{input} + t_{process,min} > T_{deadline}$
  2) $T_{input} + t_{process,actual} > T_{deadline}$
(where $T$ is a timestamp given by some time source and $t$ is process duration). Condition 1 can be determined on arrival of the data, and the input can be discarded immediately. Condition 2 cannot be determined beforehand, but if the processing isn't finished at $T_{deadline}$, the process/thread could be killed without waiting for it to complete.

Of course, this requires that an accurate deadline can be determined for each input packet. The textbook use case is for rendering live video streams, where stream latency is more important than rendering each frame accurately. This flight control system is a similar use case, since the utility of the camera feed for determining location or drift rapidly declines as the picture ages.
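
As a minimal sketch of those two checks (all names and the microsecond clock are my own assumptions, not anything from the actual flight software):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical sketch of the two lateness conditions above. */
  typedef struct {
      uint64_t t_input_us;    /* timestamp assigned when the data arrived        */
      uint64_t t_deadline_us; /* latest time at which the result is still useful */
  } frame_job_t;

  /* Condition 1: even the best-case processing time would miss the deadline,
     so the input can be discarded on arrival. */
  static bool too_late_on_arrival(const frame_job_t *job, uint64_t t_process_min_us)
  {
      return job->t_input_us + t_process_min_us > job->t_deadline_us;
  }

  /* Condition 2: only detectable while running; poll the clock and abandon
     the work once the deadline has passed. */
  static bool deadline_passed(const frame_job_t *job, uint64_t now_us)
  {
      return now_us > job->t_deadline_us;
  }

  int main(void)
  {
      frame_job_t job = { .t_input_us = 1000, .t_deadline_us = 34333 }; /* ~1/30 s budget */
      printf("discard on arrival: %d\n", too_late_on_arrival(&job, 40000)); /* prints 1 */
      printf("deadline passed:    %d\n", deadline_passed(&job, 35000));     /* prints 1 */
      return 0;
  }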

But if the timestamp-keeping isn't accurate, as seems to be the case here, it really doesn't matter whether the system was real-time or not.


It's the first time they are using commodity hardware on Mars.

I believe the CPU is the same one that has been used in a few smartphones.

So it’s not a real-time system.

https://9to5google.com/2021/04/19/nasa-ingenuity-qualcomm-sn...


There are two copies of memory and a rad-hardened FPGA that compares them and looks for bitflips. If it detects one, it restarts, and it does so quickly enough that it can do this mid-flight with no issues.


There is no single CPU; it's a multi-tiered control system, with at least two of the tiers built from automotive/military-grade components for real-time systems.

See https://news.ycombinator.com/item?id=26907669 for more details, for example.


Why would a smartphone CPU preclude the use of an RTOS?


It doesn't a priori preclude an RTOS, but multitasking OSs of Unix heritage come with so much 'free stuff' (networking, instrumentation, high-quality compilers, experience, etc.) that they are often the choice. For reasons I don't fully understand, and I'm in 'the business', true real-time embedded systems tend to have relatively simple TCP/UDP/IP stacks and simple HTTPS, and by simple I mean 'less robust than those on Unix'. I think a fair amount of that is due to power budgets: embedded systems tend to be small or tiny. However, in this case, with that super beefy motor system driving the blades and drawing many, many watts, there would seem to be little point in saving microwatts in the MPU. So one could do an RTOS on a smartphone chip; but a smartphone chip is already 'fast enough' for human interactions, so the extra work of getting finite, well-defined real-time response doesn't make much difference in smartphone apps.


For an RTOS such a chip is always fast enough. The problem is I/O latency and software.

An RTOS is typically a much slower and much smaller OS. FreeRTOS, for example, consists of 3 small source files, and the control loop might run at 1 kHz to 10 kHz (for extremely high-dynamic systems). Compare that to the Snapdragon's 2 GHz clock: a factor of roughly 1e6.

HTTPS in real-time is obviously a joke. There exist proper networking protocols for hard real-time, as opposed to randomized and spiky Ethernet-based protocols. Telcos use such proper real-time slicing protocols.


Can’t you have a real-time kernel?


When the conditions under which the code was verified are violated? Like, the camera chip was supposed to return pictures, the CPU was supposed to increment the PC reliably, the board was supposed to not experience more than 5000 g, etc.? I'm not an expert though.


It seems to me that the problem was too much scheduling.

If it really were just producer/consumer workflows, one frame would have been lost, the program would have compared images two frames apart and seen the craft moving twice as fast as it was supposed to, and that would have caused a minor excursion; but then the next frame would be right and it would very quickly have settled back down to proper flight.

The problem is that it wasn't simply comparing frames, but it had an internal clock the frames were being compared against.
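
A toy illustration of that 'twice as fast' effect, assuming a 30 Hz camera and naive differencing between the last two received frames (the numbers are made up):

  #include <stdio.h>

  /* Toy illustration: if velocity is computed as displacement between the last
     two *received* frames divided by an assumed fixed frame period, a dropped
     frame makes the craft appear to move twice as fast for one update. */
  int main(void)
  {
      const double frame_period_s = 1.0 / 30.0;  /* assumed 30 Hz camera */
      const double true_velocity  = 1.5;         /* m/s, hypothetical    */

      /* Displacement actually covered between frames N-1 and N+1 when
         frame N was dropped: two frame periods' worth of motion. */
      double displacement = true_velocity * 2.0 * frame_period_s;

      /* Naive estimate that still assumes one frame period elapsed. */
      double estimated_velocity = displacement / frame_period_s;

      printf("true velocity:      %.2f m/s\n", true_velocity);
      printf("estimated velocity: %.2f m/s\n", estimated_velocity); /* 3.00, i.e. doubled */
      return 0;
  }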


> this is a textbook real-time scheduling problem

Drive-by question: do you have recommendations on good literature for real-time systems?


I find it hard to believe that JPL didn't think of this already.

That in the distance/time code they didn't take timestamps of each successful frame, rather than just assuming it's a 30 Hz camera so surely every frame will arrive at 30 Hz!

It seems to me, as someone who does similar embedded work, that if I would account for a missing-frame failure, they surely would. I am a professional in embedded timing systems, so I am guessing this isn't the full story, or it's been paraphrased for normal people and something important was lost.

EDIT: Better info below. The photo was lost but its timestamp was not. They apparently didn't tightly couple the image and its metadata together (hashes... what are they for!?), so there was some sort of mix-up there. Knew there was more to the story.


It's not actually a time "stamp" if you don't keep it on the data, or at least tightly associated with it...

A struct or pair of {Image, Timestamp} would be the simplest & most robust solution.
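
A minimal sketch of that pairing (field names and sizes are hypothetical); once the timestamp travels through every queue inside the same struct as the pixels, it can't drift out of sync with the frame it describes:

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical sketch: keep the frame and its capture time in one struct
     so they move through every queue together. */
  typedef struct {
      uint64_t capture_time_us;   /* taken at (or as close as possible to) exposure */
      uint32_t sequence_number;   /* makes dropped frames explicit                  */
      size_t   length;            /* number of valid bytes in data[]                */
      uint8_t  data[640 * 480];   /* made-up grayscale image buffer                 */
  } stamped_image_t;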


I wonder if hashes are discouraged at JPL due to a historical focus on "correct, and provably correct" code. Hash collisions happen.

Yes, you can make collisions arbitrarily unlikely, and there are so many situations where they come in useful, but I can imagine an engineering culture where they are just never an easy tool to reach for.


> arbitrarily unlikely

IDK, I would call more possibilities than grains of sand in the universe a little more than “arbitrarily unlikely”.


You can axiomatize uniqueness of hashes in proof systems.


This would require a hash input (or combination of inputs) to be unique over a limited / expected domain, right?


I mean, you could take anything as an axiom so there are no real requirements for it.

I would imagine you would want to demonstrate that P(collision | domain + operating lifetime) << acceptable level of risk, and after you'd done that, take the uniqueness of hashes as an axiom for the proof system?

That’s just speculation on my part, since I haven’t done much with formal proofs of working code


No, you just special-case the function call and tell the proof analyzer "this never happens". Strictly speaking that's a lie, but a sophisticated prover could probably even be rigged to keep track of those conditions (e.g. "not within x years") if you so chose.
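
A tiny sketch of what that could look like in a proof assistant (Lean 4 syntax; all names are hypothetical), simply postulating that collisions never happen:

  -- Hypothetical sketch: postulate an image type and a hash function,
  -- then axiomatize injectivity, i.e. "collisions never happen".
  axiom Image : Type
  axiom hash : Image → Nat
  axiom hash_injective : ∀ a b : Image, hash a = hash b → a = b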


Exactly my thoughts. This is something I would expect covered by the testing process. I wonder what the Apollo engineers would say about this.


This is actually a compulsory requirement for safety-related communication protocols in railway systems according to EN 50159. The correct way to manage this is to reject the outdated samples. Ultimately, this should result in a transition to a fallback mode (in this case, emergency landing mode).


This is actually all very complicated, because if you're doing e.g. velocity estimation via smoothed differences, missing a point means you've (effectively) inserted a spike-like pseudo-datapoint into your Kalman filter (or whatever); it won't totally crap out, but it will be incorrect until the 'bad' datapoint flows completely through the filter.


... Indeed. That's why we add sensor diversity, to balance the failure modes and increase availability. Having no sample means you increase your confidence interval at warp speed to account for all possible scenarios. Very quickly your estimate becomes: « great, I am traveling at 35 kph +/- 100 kph. Wait. What? »


That sounds like a nearly impossible feature to implement. Simulating a virtual copy of a component whose purpose is to deliver novel data, autonomously?


I think it can be simpler than that. If a piece of code is going to do a velocity calculation from motion estimation between two frames, it should expect to get new data every 1/FPS seconds. A local clock could help validate that. If data comes too fast or too slow, beyond some tolerance, the system could signal that there is an issue. If the images have timestamps, it could check those too.
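
A rough sketch of that check (names and the 10% tolerance are assumptions on my part, assuming a monotonic local clock):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Rough sketch: validate that frames arrive roughly every 1/FPS seconds
     according to a local monotonic clock, and that their own timestamps
     agree with that spacing. */

  #define NOMINAL_PERIOD_US 33333ULL                 /* 30 fps */
  #define TOLERANCE_US      (NOMINAL_PERIOD_US / 10) /* assumed 10% tolerance */

  static uint64_t absdiff_u64(uint64_t a, uint64_t b)
  {
      return a > b ? a - b : b - a;
  }

  /* Returns false if the inter-arrival time or the frames' own timestamp
     delta deviates from the nominal frame period by more than the tolerance. */
  static bool frame_timing_ok(uint64_t prev_arrival_us, uint64_t curr_arrival_us,
                              uint64_t prev_stamp_us,   uint64_t curr_stamp_us)
  {
      uint64_t arrival_dt = curr_arrival_us - prev_arrival_us;
      uint64_t stamp_dt   = curr_stamp_us   - prev_stamp_us;

      if (absdiff_u64(arrival_dt, NOMINAL_PERIOD_US) > TOLERANCE_US)
          return false;   /* frame came too early or too late */
      if (absdiff_u64(stamp_dt, NOMINAL_PERIOD_US) > TOLERANCE_US)
          return false;   /* embedded timestamps look inconsistent */
      return true;
  }

  int main(void)
  {
      /* A dropped frame: the next one arrives about two periods later. */
      bool ok = frame_timing_ok(0, 2 * NOMINAL_PERIOD_US, 0, NOMINAL_PERIOD_US);
      printf("timing ok: %s\n", ok ? "yes" : "no");   /* prints "no" */
      return 0;
  }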

Now you'd need to hope the inertial guidance is good enough at those altitudes. The article says it doesn't use image-based calculations for landing due to the possibility of kicking up dust.


> "...This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps..."

It seems that the timestamping is detached from frame capture. That is, as if a burst happens and then the burst's frames get assigned sequential timestamps. Thus, if a burst had a glitched/skipped frame, part of the series and the subsequent bursts become mis-attributed.

I understand that making the timestamping atomic with frame capture is likely to slow down the burst. So there has to be some expected internal timeline against which to validate the timeline that results from bursting.
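
A toy sketch of that mis-attribution (hypothetical, assuming timestamps are assigned sequentially after delivery rather than at capture):

  #include <stdio.h>

  /* Toy sketch: if timestamps are assigned to frames by their position in
     the delivered sequence rather than at capture time, one dropped frame
     shifts every later frame's timestamp by a full frame period. */
  int main(void)
  {
      const double period_s = 1.0 / 30.0;   /* assumed 30 Hz burst */
      const int captured = 6;
      int delivered_index = 0;

      for (int i = 0; i < captured; i++) {
          if (i == 2)                        /* frame 2 is lost in transit */
              continue;
          double true_time     = i * period_s;
          double assigned_time = delivered_index * period_s; /* sequential assignment */
          printf("frame %d: true %.4f s, assigned %.4f s, error %.4f s\n",
                 i, true_time, assigned_time, true_time - assigned_time);
          delivered_index++;
      }
      return 0;
  }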


> I imagine it is sort of like the organisms ...

Can you describe this further, or link to further reading?


I don't have anything specific in mind that covers this (I'll look around), but maybe take a look at the work by Nicholas Strausfeld for some actual data on early nervous systems [0], and Leigh van Valen for the general theoretical framework [1].

0. https://scholar.google.com/scholar?q=Nicholas+Strausfeld

1. https://www.mn.uio.no/cees/english/services/van-valen/evolut...



