This is the kind of failure mode that I imagine (now with hindsight) should be mitigated by (re)designing the system so that a timestamp desync is always detected: each part of the system should have its own internal model of the approximate times it expects to receive from other systems, and should never blindly trust them.
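In code, that could be as simple as each consumer checking an incoming timestamp against its own expectation before using it. A rough sketch, with made-up names and a hypothetical ~30 Hz camera, not anything from the actual flight software:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message from the camera pipeline (names are illustrative). */
typedef struct {
    uint64_t timestamp_us;  /* sender-supplied capture time */
    /* ... image payload ... */
} FrameMsg;

/* The consumer's own model of when frames should arrive: last accepted
 * capture time plus the nominal frame period, with some tolerance. */
typedef struct {
    uint64_t last_accepted_us;
    uint64_t nominal_period_us;  /* e.g. 33333 for a ~30 Hz camera */
    uint64_t tolerance_us;
} FrameTimeModel;

/* Reject timestamps that disagree with the internal expectation instead of
 * blindly trusting the sender. */
bool timestamp_plausible(const FrameTimeModel *m, const FrameMsg *msg)
{
    uint64_t expected = m->last_accepted_us + m->nominal_period_us;
    return msg->timestamp_us >= expected - m->tolerance_us &&
           msg->timestamp_us <= expected + m->tolerance_us;
}
```

A frame whose timestamp falls outside that window would be dropped or at least flagged, rather than fed into the state estimator as if it were on time.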
I imagine it is sort of like the organisms early in evolution that simply believed their sensory system when it had in fact been hijacked by something that wanted to eat them. There aren't many sensory systems with those issues these days because anything that doesn't have an internal model that it can use for comparison winds up being a snack or crashing on an alien planet.
We'll have to wait and see whether they publish a public post-mortem, but this is a textbook real-time scheduling problem, and that surprises me. Scheduling theory teaches you to abandon processing (of a frame/image in this case) if the result will be late, both to avoid doing unnecessary work and to recover from sync/latency issues.
From the minimal descriptions I've seen so far, it seems like they were using producer/consumer work queues for the image processing without further synchronization.
Is it "just a scheduling problem"? For example if the system is designed to expect every frame being delivered and the algorithms can't handle "incomplete" data (a frame drop), then you have to redesign your algorithms. From the descriptions of the problem it sounded to me like a similar situation - it expected the N+1-th frame to actually be the N-th frame and treated it as such.
Say we assume this is a proper real-time system. How would a result be late? That seems like a failure mode that doesn't exist in that domain (given that all guarantees are verified).
Now, I get that this is probably not a real-time system.
Yes, it seems the control system isn't real-time anyway, so this discussion is moot. But to answer your question:
A real-time system can still have inputs that cannot be guaranteed, and internal processes can have an upper and lower bound ($t_{min}$, $t_{max}$) on their execution time. So a result would be late if either:

1. $T_{arrival} + t_{min} > T_{deadline}$ (even the fastest possible processing would miss the deadline), or
2. $T_{arrival} + t > T_{deadline}$ for the actual processing duration $t$, with $t_{min} \le t \le t_{max}$

(where $T$ is a timestamp given by some time source and $t$ is a process duration). Condition 1 can be determined on arrival of the data, and the input can be discarded immediately. Condition 2 cannot be determined beforehand, but if the processing isn't finished at $T_{deadline}$, the process/thread could be killed without waiting for it to complete.
Of course, this requires that an accurate deadline can be determined for each input packet. The textbook use case is for rendering live video streams, where stream latency is more important than rendering each frame accurately. This flight control system is a similar use case, since the utility of the camera feed for determining location or drift rapidly declines as the picture ages.
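As a rough sketch of how those two checks might look in code (hypothetical names and units, not the actual flight software):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t arrival_us;    /* T_arrival: when the input was received */
    uint64_t deadline_us;   /* T_deadline: when the result must be ready */
    /* ... payload ... */
} Input;

/* Condition 1: even best-case processing (t_min) cannot meet the deadline,
 * so the input can be discarded as soon as it arrives. */
bool too_late_on_arrival(const Input *in, uint64_t t_min_us)
{
    return in->arrival_us + t_min_us > in->deadline_us;
}

/* Condition 2: checked while processing runs; if the deadline passes before
 * the result is ready, the work is abandoned rather than delivered late. */
bool deadline_missed(const Input *in, uint64_t now_us)
{
    return now_us > in->deadline_us;
}
```

The caller would drop inputs that fail the first check on arrival, and periodically (or via a watchdog) apply the second check to abort work that has overrun its deadline.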
But if the timestamp-keeping isn't accurate, as seems to be the case here, it really doesn't matter whether the system was real-time or not.
There are two copies of memory and a rad-hardened FPGA that compares them and looks for bit flips. If it detects one, it restarts, quickly enough that it can do this mid-flight with no issues.
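As a software-level caricature of that scheme (the real comparison is done in hardware by the FPGA; names here are made up):

```c
#include <stdint.h>
#include <string.h>

/* Two mirrored copies of critical state; a mismatch indicates a bit flip. */
static uint8_t state_a[1024];
static uint8_t state_b[1024];

/* Placeholder for whatever "restart" means in the real system. */
extern void request_reset(void);

void check_mirrors(void)
{
    if (memcmp(state_a, state_b, sizeof state_a) != 0) {
        /* Copies disagree: don't try to guess which one is right,
         * just restart from a known-good state. */
        request_reset();
    }
}
```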
It doesn't a priori preclude an RTOS, but multitasking OSes of Unix heritage come with so much 'free stuff' (networking, instrumentation, high-quality compilers, experience, etc.) that they are often the choice. For reasons I don't fully understand, and I'm in 'the business', true real-time embedded systems tend to have relatively simple TCP/UDP/IP stacks and simple HTTPS, and by simple I mean 'less robust than those on Unix'. I think a fair amount of that is due to power budgets: embedded systems tend to be small or tiny. However, in this case, with that beefy motor system driving the blades and drawing many, many watts, there would seem to be little point in saving microwatts in the MPU. So one could run an RTOS on a smartphone chip, but a smartphone chip is already 'fast enough' for human interaction, so the extra work of getting a finite and well-defined real-time response doesn't make much difference in smartphone apps, as long as the chip is 'fast enough'.
For an RTOS, such a chip is always fast enough. The problem is I/O latency and software.
An RTOS is typically a much slower and much smaller OS. FreeRTOS, for example, consists of 3 small source files, and the control loop might run at 1 kHz to 10 kHz (for extremely high-dynamic systems). Compare that to the Snapdragon's 2 GHz clock: a factor of roughly 1e6.
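For illustration, a fixed-rate control task in FreeRTOS typically looks something like this (a minimal sketch, assuming configTICK_RATE_HZ is at least 1000 so a 1 ms period is representable in ticks):

```c
#include "FreeRTOS.h"
#include "task.h"

/* Runs the control law at a fixed 1 kHz rate. */
static void control_loop_task(void *pvParameters)
{
    (void)pvParameters;
    const TickType_t period = pdMS_TO_TICKS(1);   /* 1 ms period */
    TickType_t last_wake = xTaskGetTickCount();

    for (;;) {
        /* Sleeps until exactly one period after the previous wake-up,
         * which keeps the loop rate fixed even if the body's runtime varies. */
        vTaskDelayUntil(&last_wake, period);

        /* read sensors, run the control law, command actuators */
    }
}
```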
HTTPS in real time is obviously a joke. There exist proper networking protocols for hard real-time, not randomized and spiky Ethernet-based protocols. Telcos use such proper real-time slicing protocols.
When the conditions under which the code was verified are violated? E.g. the camera chip was supposed to return pictures, the CPU was supposed to increment the PC reliably, the board was supposed to not experience more than 5000 g, etc. I'm not an expert though.
It seems to me that the problem was too much scheduling.
If it really were just producer/consumer workflows, one frame would have been lost, and the program would have compared images two frames apart and seen the craft moving twice as fast as expected. That would have caused a minor excursion, but the next frame would have been right, and the system would quickly have settled back down to proper flight.
The problem is that it wasn't simply comparing frames, but it had an internal clock the frames were being compared against.
I struggle to think that JPL didn’t think of this already.
That in the distance/time code they didn't take a timestamp for each successful frame, instead of just assuming it's a 30 Hz camera so surely every frame arrives at 30 Hz!
It seems to me, as someone who does similar embedded work, that if I would account for a missing-frame failure, they surely would. I am a professional in embedded timing systems, and I'm guessing this isn't the full story, or it's been paraphrased for a general audience and something important was lost.
EDIT: Better info below. The photo was lost, but its timestamp was not. They apparently didn't tightly couple the image and its metadata together (hashes... what are they for!?), so there was some sort of mix-up there. I knew there was more to the story.
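For what it's worth, the coupling being suggested can be as simple as storing a digest of the image bytes inside the metadata record, so pairing metadata with the wrong image becomes detectable. A minimal sketch with made-up structures (using FNV-1a purely for brevity; nothing here reflects how JPL actually stores frames):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over the image bytes. */
static uint64_t fnv1a64(const uint8_t *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Hypothetical metadata record: the digest ties it to one specific image. */
typedef struct {
    uint64_t timestamp_us;
    uint64_t image_digest;
} FrameMeta;

/* Returns false if this metadata was paired with the wrong image. */
bool meta_matches_image(const FrameMeta *meta,
                        const uint8_t *image, size_t image_len)
{
    return meta->image_digest == fnv1a64(image, image_len);
}
```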
I wonder if hashes are discouraged at JPL due to a historical focus on "correct, and provably correct" code. Hash collisions happen.
Yes, you can make collisions arbitrarily unlikely, and there are so many situations where they come in useful, but I can imagine an engineering culture where they are just never an easy tool to reach for.
I mean, you could take anything as an axiom, so there are no real requirements for it.
I would imagine you would want to demonstrate that P(collision | domain + operating lifetime) << acceptable level of risk, and after you'd done that, take the uniqueness of hashes as an axiom for the proof system?
That’s just speculation on my part, since I haven’t done much with formal proofs of working code
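To put a rough number on it (my own back-of-the-envelope, nothing from JPL): by the birthday bound, a $b$-bit hash over $n$ items collides with probability roughly $P \approx \frac{n(n-1)}{2^{b+1}}$, so even $n = 10^9$ images under a 128-bit hash gives about $10^{18}/2^{129} \approx 1.5 \times 10^{-21}$, which is the kind of bound you could then justify treating as an axiom.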
No, you just special-case the function call and tell the proof analyzer "this never happens". Strictly speaking that's a lie, but a sophisticated prover could probably even be rigged to keep track of those conditions (e.g. "not within x years") if you so chose.
This is actually a compulsory requirement for safe communication protocols in railway systems according to EN 50159. The correct way to manage this is to reject the outdated samples. Ultimately, this should result in a transition to a fallback mode (in this case, emergency landing mode).
This is actually all very complicated, because if you're doing e.g. velocity estimation via smoothed differences, missing a point means you've (effectively) inserted a spike-like pseudo-datapoint into your Kalman filter (or whatever). It won't totally crap out, but it will be incorrect until the 'bad' datapoint flows completely through the filter.
... Indeed. That's why we add sensor diversity, to balance the failure modes and increase availability. Having no sample means your confidence interval grows at warp speed to account for all possible scenarios. Very quickly your estimate becomes: "great, I am traveling at 35 kph +/- 100 kph. Wait. What?"
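To illustrate with a toy example (a 1-D constant-velocity filter made up for this comment, not the actual estimator): when a sample is missing you run only the predict step and let the covariance, i.e. the uncertainty, grow; feeding a stale or mis-timestamped sample in as if it were fresh is what injects the spike.

```c
#include <stdio.h>

/* Toy 1-D constant-velocity Kalman filter: state is [position, velocity]. */
typedef struct {
    double x[2];     /* state estimate */
    double P[2][2];  /* estimate covariance */
} KF;

static void kf_predict(KF *kf, double dt, double q)
{
    /* x = F x, with F = [[1, dt], [0, 1]] */
    kf->x[0] += dt * kf->x[1];

    /* P = F P F^T + Q (Q = q * I for brevity) */
    double p00 = kf->P[0][0], p01 = kf->P[0][1];
    double p10 = kf->P[1][0], p11 = kf->P[1][1];
    kf->P[0][0] = p00 + dt * (p10 + p01) + dt * dt * p11 + q;
    kf->P[0][1] = p01 + dt * p11;
    kf->P[1][0] = p10 + dt * p11;
    kf->P[1][1] = p11 + q;
}

static void kf_update_position(KF *kf, double z, double r)
{
    /* Measurement of position only: H = [1, 0]. */
    double y = z - kf->x[0];
    double s = kf->P[0][0] + r;
    double k0 = kf->P[0][0] / s;
    double k1 = kf->P[1][0] / s;

    kf->x[0] += k0 * y;
    kf->x[1] += k1 * y;

    /* P = (I - K H) P */
    double p00 = kf->P[0][0], p01 = kf->P[0][1];
    kf->P[0][0] = (1.0 - k0) * p00;
    kf->P[0][1] = (1.0 - k0) * p01;
    kf->P[1][0] -= k1 * p00;
    kf->P[1][1] -= k1 * p01;
}

int main(void)
{
    KF kf = { .x = {0.0, 1.0}, .P = {{1.0, 0.0}, {0.0, 1.0}} };

    /* Normal step: predict, then update with the new measurement. */
    kf_predict(&kf, 1.0 / 30.0, 0.01);
    kf_update_position(&kf, 0.034, 0.05);

    /* Missing sample: predict only -- the covariance grows, which is the
     * honest thing to do; pretending the next sample is on time is not. */
    kf_predict(&kf, 1.0 / 30.0, 0.01);

    printf("pos=%f vel=%f var(pos)=%f\n", kf.x[0], kf.x[1], kf.P[0][0]);
    return 0;
}
```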
I think it can be simpler than that. If a piece of code is going to do a velocity calculation from motion estimation between two frames, it should expect to get new data every 1/FPS seconds. A local clock could help validate that. If data comes too fast or too slow, outside some tolerance, the system could signal that there is an issue. If the images have timestamps, it could check those too.
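Something along those lines (a rough sketch with made-up names and tolerances; whether the actual code does anything like this isn't described):

```c
#include <stdbool.h>
#include <stdint.h>

#define NOMINAL_PERIOD_US 33333u   /* ~30 FPS */
#define TOLERANCE_US       3000u   /* allowed jitter */

typedef struct {
    uint64_t local_arrival_us;  /* from our own clock */
    uint64_t image_stamp_us;    /* timestamp carried by the image, if any */
} Frame;

/* Returns false if the gap to the previous frame is implausible, i.e. the
 * data is coming too fast or too slow (e.g. a frame was dropped). */
static bool interval_ok(uint64_t prev_us, uint64_t curr_us)
{
    int64_t delta = (int64_t)(curr_us - prev_us) - (int64_t)NOMINAL_PERIOD_US;
    return delta >= -(int64_t)TOLERANCE_US && delta <= (int64_t)TOLERANCE_US;
}

/* Cross-check both the local clock and the image's own timestamp; either
 * one disagreeing is enough to flag the frame instead of trusting it. */
bool frame_timing_ok(const Frame *prev, const Frame *curr)
{
    return interval_ok(prev->local_arrival_us, curr->local_arrival_us)
        && interval_ok(prev->image_stamp_us, curr->image_stamp_us);
}
```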
Now you’d need to hope the inertial guidance is good enough at those altitudes. The article says it doesn’t use image based calculations for landing due to the possibility of kicking up dust.
> "...This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps..."
It seems that the timestamping is detached from frame capture. That is, as if a burst happens and the burst's frames then get assigned sequential timestamps. Thus if a burst had a glitched/skipped frame, part of that series and the subsequent bursts become mis-attributed.
I understand that making the timestamping atomic with frame capture is likely to slow down the burst. So there has to be some expected internal timeline against which to validate the timeline that results from bursting.
I don't have anything specific in mind that covers this (I'll look around), but maybe take a look at the work by Nicholas Strausfeld for some actual data on early nervous systems [0], and Leigh van Valen for the general theoretical framework [1].