> and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways.
This is so important. Imagine if it had crashed and damaged its communications; that would have been the end of it. This extra engineering effort really paid off and gave Ingenuity a second life, and gave NASA a really good opportunity to learn important lessons and do further debugging.
Also, the article at mars.nasa.gov is a delight to read, with all the information one would want to know right there. Not to mention that there is even a video for us to see.
Note that this robustness seems to refer to just the physical build, not that the code had a glitch yet somehow saved the day.
> Flight Six ended with Ingenuity safely on the ground because a number of subsystems – the rotor system, the actuators, and the power system – responded to increased demands to keep the helicopter flying. In a very real sense, Ingenuity muscled through the situation, and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways.
The code being robust and self-correcting, rather than the increased demands coincidentally being within tolerance levels, would have been more interesting or laudable.
Afaict, JPL and NASA (and subcontractors) have a very strong historical knowledge base of building maximally-robust systems with limited redundancy hardware (in the sense of not having an unlimited mass budget for more hardware).
But then again, they've been working with semi-autonomous systems about as long as anybody.
Yes, but in this case that was just good luck. As the article mentions, the low-altitude behavior is because of concerns regarding dust. It was not intended to provide robustness in the face of glitches in the optical navigation system.
Dust concerns were one reason, but avoiding glitches and ensuring the smoothness of position data is another.
> Then, once the vehicle estimates that the legs are within 1 meter of the ground, the algorithms stop using the navigation camera and altimeter for estimation, relying on the IMU in the same way as on takeoff. As with takeoff, this avoids dust obscuration, but it also serves another purpose -- by relying only on the IMU, we expect to have a very smooth and continuous estimate of our vertical velocity, which is important in order to avoid detecting touchdown prematurely.
I wonder if, had Ingenuity disabled the visual system upon noticing the timestamp conflicts, relying on the IMU alone would have made the whole situation safer.
I think it's interesting and laudable that the hardware coped. But yeah, on the face of it this sounds very much like brittle software. For 46 seconds the craft's performance suffered from a single frame being dropped. It sounds like either the calculations were smeared over way too much time, or (more likely) the code made too many assumptions that relied on every piece of the pipeline properly behaving, not only in the moment but also over the entire history of the flight.
I'm speaking from extreme ignorance obviously. But this reminds me of a million code reviews I've done where I've asked developers to make fewer assumptions about the state of the system. Often the response is something like "but how could that ever happen?" And my response is always "I have no idea, but shit happens."
I would love to see a postmortem that discussed the specifics of what went wrong in the software, and whether they can attribute the lack of robustness to system design flaws.
> This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps.
It was not the single lost frame itself that caused this issue. The issue was that the glitch corrupted the timestamp accuracy of the frames that followed. I'd guess the dropped frame was just a symptom, maybe of an initial memory corruption or something like that.
In other words, it was not the dropped frame, and the missing piece of information it represented, that had the negative effects.
> The code being robust and self-correcting, rather than the increased demands coincidentally being within tolerance levels, would have been more interesting or laudable.
This is the good thing about the overall outcome of this incident: They now have a chance to learn from this.
It would be interesting to know from NASA whether, had Ingenuity crashed and lost all comms, they would still have been able to determine the cause of the issue. I mean the wrong timestamps of the images, not whatever caused them.
It would also be interesting to know what caused them, if it was a bug in the software, or a particle flipping a bit in the timestamping-code.
I had the same response while reading it, too. Like, wow, this is telling me all the technical things I wanted to know about the situation. This is one reason I pay for IEEE, they do a great job relaying the facts in summary articles from the main source.
This is the kind of failure mode that I imagine (now with hindsight) should be mitigated by (re)designing the system in such a way that the timestamp desync would always be detected because each part of the system should have its own internal model of the approximate times that it should be receiving from other systems and never blindly trust them.
I imagine it is sort of like the organisms early in evolution that simply believed their sensory system when it had in fact been hijacked by something that wanted to eat them. There aren't many sensory systems with those issues these days because anything that doesn't have an internal model that it can use for comparison winds up being a snack or crashing on an alien planet.
We'll have to wait and see if they publish a public post-mortem, but this is a textbook real-time scheduling problem and it surprises me. Scheduling theory teaches to abandon processing (a frame/image in this case) if the result is late, both to avoid doing unnecessary work and to recover from sync/latency issues.
From the minimal descriptions I've seen so far, it seems like they were using producer/consumer work queues for the image processing without further synchronization.
Is it "just a scheduling problem"? For example if the system is designed to expect every frame being delivered and the algorithms can't handle "incomplete" data (a frame drop), then you have to redesign your algorithms. From the descriptions of the problem it sounded to me like a similar situation - it expected the N+1-th frame to actually be the N-th frame and treated it as such.
Say we assume this is a complete real-time system. How would a result be late? It seems like a failure mode that doesn't exist in that domain (given that all guarantees are verified).
Now, I get that this is probably not a real-time system.
Yes, it seems the control system isn't real-time anyway, so this discussion is moot. But to answer your question:
A real-time system can still have inputs that cannot be guaranteed, and internal processes can have an upper and lower bound for their execution time. So a result would be late if either:

1. $T_{arrival} + t_{min} > T_{deadline}$, i.e. even the fastest possible processing of the input could not finish by the deadline, or

2. processing that was admitted in time is still running at $T_{deadline}$

(where $T$ is a timestamp given by some time source and $t$ is process duration). Condition 1 can be determined on arrival of the data, and the input can be discarded immediately. Condition 2 cannot be determined beforehand, but if the processing isn't finished at $T_{deadline}$, the process/thread could be killed without waiting for it to complete.
Of course, this requires that an accurate deadline can be determined for each input packet. The textbook use case is for rendering live video streams, where stream latency is more important than rendering each frame accurately. This flight control system is a similar use case, since the utility of the camera feed for determining location or drift rapidly declines as the picture ages.
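A conservative version of that on-arrival admission test, sketched with an invented worst-case processing time (nothing here is from the actual flight software):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical worst-case processing time, in microseconds. */
#define T_PROC_MAX_US 20000

/* Decided on arrival: admit an input only if even the worst-case
   processing time still meets the deadline; otherwise discard it
   immediately instead of producing a late result. */
bool admit_input(uint64_t t_arrival_us, uint64_t t_deadline_us)
{
    return t_arrival_us + T_PROC_MAX_US <= t_deadline_us;
}
```

Condition 2 would still need a watchdog that kills the worker at the deadline; this only covers the on-arrival check.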
But if the timestamp-keeping isn't accurate as it seems in this case, it really doesn't matter if the system was real-time or not.
There are two copies of memory and a RAD-hardened FPGA that compares them and looks for bit flips. If it detects one, it restarts, and so quickly that it can do this mid-flight with no issues.
It doesn't a priori preclude an RTOS, but multitasking OSes of Unix heritage come with so much 'free stuff' (networking, instrumentation, high-quality compilers, experience, etc.) that they are often the choice. For reasons I don't fully understand, and I'm in 'the business', true real-time embedded systems tend to have relatively simple TCP/UDP/IP stacks and simple HTTPS, and by simple I mean 'less robust than those on Unix'. I think a fair amount of that is due to power budgets: embedded systems tend to be small or tiny. However, in this case, with that super-beefy motor system driving the blades at many, many watts, there would seem to be little point in saving microwatts in the MPU. So one could run an RTOS on a smartphone chip, but such a chip is already 'fast enough' for human interaction, and the extra work of getting finite, well-defined real-time response doesn't really make much difference in smartphone apps if the chip is 'fast enough'.
For RTOS such a chip is always fast enough. The problem is IO latency and software.
An RTOS is typically a much slower and much smaller OS. FreeRTOS, for example, consists of three small source files, and the control loop might run at 1 kHz to 10 kHz (for extremely high-dynamic systems). Compare that to the Snapdragon's 2 GHz clock: a factor of roughly 1e6.
HTTPS in real time is obviously a joke. There exist proper networking protocols for hard real-time, not randomized and spiky Ethernet-based protocols. Telcos use such proper real-time slicing protocols.
When the conditions under which the code was verified are violated? Like, the camera chip was supposed to return pictures, the CPU was supposed to increment the PC reliably, the board was supposed to not experience more than 5000 g, etc.? I'm not an expert though.
It seems to me that the problem was too much scheduling.
If it really were just producer/consumer work queues, one frame would have been lost, the program would have compared images two frames apart and seen the craft moving twice as fast as it was supposed to; this would have caused a minor excursion, but the next frame would have been right and the system would very quickly have settled back down to proper flight.
The problem is that it wasn't simply comparing frames, but it had an internal clock the frames were being compared against.
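To make the "twice as fast" effect concrete (all numbers invented): optical-flow velocity is just displacement over assumed elapsed time, so pairing images that are really 2/30 s apart while believing they are 1/30 s apart doubles the estimate.

```c
#include <assert.h>
#include <math.h>

/* Velocity from feature displacement between two frames: a wrong
   dt corrupts the estimate in direct proportion. */
double flow_velocity(double displacement_m, double dt_s)
{
    return displacement_m / dt_s;
}
```

With 0.266 m of real motion over 2/30 s the true speed is about 4 m/s, but dividing by an assumed 1/30 s yields about 8 m/s.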
I struggle to think that JPL didn’t think of this already.
That in the distance/time code they didn't take timestamps of each successful frame, instead of just assuming it's a 30 Hz camera, so surely each frame will arrive at 30 Hz!
It seems to me, as someone who does similar embedded work, that if I would account for a missing frame failure, they surely would. I am a professional in embedded timing systems - I am guessing this isn’t the full story or it’s been paraphrased for normal people and something important was lost.
EDIT: Better info below. The photo was lost but its timestamp was not. They apparently didn't tightly couple the image and its metadata together (hashes... what are they for!?), so there was some sort of mix-up there. Knew there was more to the story.
I wonder if hashes are discouraged at JPL due to a historical focus on "correct, and provably correct" code. Hash collisions happen.
Yes, you can make collisions arbitrarily unlikely, and there are so many situations where they come in useful, but I can imagine an engineering culture where they are just never an easy tool to reach for.
I mean, you could take anything as an axiom so there are no real requirements for it.
I would imagine you would want to demonstrate that P(collision | domain + operating lifetime) << acceptable level of risk, and after you'd done that, take the uniqueness of hashes as an axiom for the proof system?
That’s just speculation on my part, since I haven’t done much with formal proofs of working code
No you just special case the function call and tell the proof analyzer "this never happens". Strictly speaking, a lie, but probably a sophisticated prover could even be rigged to keep track of those conditions (not in x years, e.g.) if you so chose
This is actually a compulsory requirement for safe communication protocols in railway systems according to EN 50159. The correct way to manage it is to reject the outdated samples. Ultimately, this should result in a transition to a fallback mode (in this case, emergency landing mode).
This is actually all very complicated, because if you're doing e.g. velocity estimation via smoothed differences, missing a point means you've (effectively) inserted a spike-like pseudo-datapoint into your Kalman filter (or whatever). It will not fail completely, but it will be incorrect until the 'bad' datapoint flows completely through the filter.
... Indeed. Therefore we add sensor diversity to balance the failure modes and increase availability. Having no sample means that you increase your confidence interval at warp speed to account for all possible scenarios. Very quickly your estimate becomes : « great, I am traveling at 35kph +/- 100kph. Wait. What? »
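A toy scalar filter shows the mechanism: on a missing sample you run only the predict step, so the variance grows and no phantom measurement enters the estimate. The constants here are made up, not the actual estimator.

```c
#include <assert.h>
#include <math.h>

/* Scalar random-walk Kalman filter. Skipping kf_update on a
   missing sample inflates p (uncertainty) instead of feeding the
   filter a fabricated datapoint. */
typedef struct { double x, p; } kf_t;

void kf_predict(kf_t *kf, double q) { kf->p += q; }

void kf_update(kf_t *kf, double z, double r)
{
    double k = kf->p / (kf->p + r);   /* Kalman gain */
    kf->x += k * (z - kf->x);
    kf->p *= 1.0 - k;
}
```

Each dropped sample means one more predict-only cycle, which is exactly the "confidence interval growing at warp speed" effect described above.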
I think it can be simpler than that. If a piece of code is going to do a velocity calculation from motion estimation between two frames it should expect to get new data every 1 / FPS seconds. A local clock could help validate that. If data comes too fast or too slow within some tolerance then the system could signal there is an issue. If the images have time stamps it could check those too.
Now you’d need to hope the inertial guidance is good enough at those altitudes. The article says it doesn’t use image based calculations for landing due to the possibility of kicking up dust.
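The local-clock validation described above might look something like this sketch (the period and tolerance are invented, not flight values):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAME_PERIOD_US 33333  /* 30 Hz nominal */
#define TOLERANCE_US     5000  /* hypothetical slack */

/* Signal an issue if a frame arrives too fast or too slow
   relative to the expected 1/FPS spacing on the local clock. */
bool frame_interval_ok(uint64_t prev_us, uint64_t now_us)
{
    int64_t err = (int64_t)(now_us - prev_us) - FRAME_PERIOD_US;
    return err > -TOLERANCE_US && err < TOLERANCE_US;
}
```

A dropped frame shows up as roughly double the period and trips the check.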
> "...This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps..."
It seems that the timestamping is detached from the frame capture. That is, as if captures happen in a burst, and then the burst's frames get assigned sequential timestamps. Thus if a burst has a glitched/skipped frame, part of the series and all subsequent bursts become mis-attributed.
I understand that making the timestamping atomic with the frame capture would likely slow down the burst. So there has to be some expected internal timeline against which to validate the timeline that results from bursting.
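One cheap way to catch that mis-attribution without making timestamping atomic, sketched with invented numbers: if the capture hardware numbers every exposure, the sequence gap and the timestamp gap must agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAME_PERIOD_US 33333  /* 30 Hz nominal */

typedef struct { uint32_t seq; uint64_t stamp_us; } frame_meta_t;

/* A skipped frame appears as a sequence gap; the elapsed time
   must match that gap, or the frame/timestamp pairing slipped. */
bool pairing_consistent(frame_meta_t prev, frame_meta_t cur)
{
    uint64_t dt = cur.stamp_us - prev.stamp_us;
    uint64_t expect = (uint64_t)(cur.seq - prev.seq) * FRAME_PERIOD_US;
    uint64_t slack = FRAME_PERIOD_US / 2;
    return dt + slack > expect && dt < expect + slack;
}
```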
I don't have anything specific in mind that covers this (I'll look around), but maybe take a look at the work by Nicholas Strausfeld for some actual data on early nervous systems [0], and Leigh van Valen for the general theoretical framework [1].
I feel this bug in my bones. I work a lot on WAMI/aerial imagery. I've dealt with this exact problem of timestamp-off-by-one-image. I have 9 cameras with different exposure times all triggered by the same hardware pulse. Fortunately my system is data-collect only so the worst outcome is some wrong filenames.
I learned only way too late in the project that most GigE Vision cameras emit timestamps that can be synced using NTP and/or PTP. The camera specs/manual said nothing about this; I only found it by accident, looking through the GenICam XML API.
Moral of the story: ensure your cameras emit timestamps on device. Don't rely on system time.
Unless you need to sync times, such as with connected low-power devices that need to wake up by themselves (like Mars devices running for more than a few years). You cannot wake them up via a ping. Device time is then unreliable, so you need to sync with network time and then propagate the time corrections to your measurements, properly.
Google/Amazon have a similar problem with eventually consistent databases, but low-power (NB-IoT) meshes are harder, as synced wakeups are critical. Don't rely on device time.
Look at the quality of those images of the ground in that video. Those blades are whirring around at super-high rates to provide lift in the Martian atmosphere and taking 30 pictures per second, but in each exposure you can clearly see the shadow of the blades with very little blurring. And with ambient solar brightness quite a bit lower than Earth's "full sunlight", too. Those are quality cameras.
Only a factor of 2.3 lower. Sunlight is extremely bright: full sunlight on Mars is much brighter than a cloudy day on Earth, and an order of magnitude brighter than, say, an office or bedroom on Earth.
The first image, from the Perseverance rover's MastCam-Z, is at least as much "true color" as any ordinary camera. (That is, the precise color might be slightly off due to differences between the camera's sensitivity and your monitor's spectrum, but it should be pretty close.)
The video from Ingenuity itself is taken by a black-and-white camera and has no color information.
When you're sending stuff to other planets the cost of what you're sending is usually a trivial part of the cost of the mission. So long as it's off the shelf and not heavier you send the best there is.
"The modern concept of the vacuum of space, confirmed every day by experiment, is a relativistic ether. But we do not call it this because it is taboo."
I am aware of the theory of quantum vacuum, but I generally take Aether to mean the old proposal as the medium light waves travel through. It has been refuted.
Don't throw your laptop out yet. Flash it directly. Well, it depends on the laptop/chip of course, but do your research: https://youtu.be/Gdehz26lYWM?t=173
My favorite is we got details. Legitimate details that give us a pretty clear picture where it went wrong and how it can be corrected. No calling it "software glitch" and stopping there. Those kinds of articles frustrate me to no end.
As a very small portion of the HN crowd, I believe we are quite pleased.
The most incredible thing for me about this glitch is that as far as I know it's the first serious error in the entire mission. The live coverage of the landing, and the combined video that followed, was utterly astounding and humbling. Dare mighty things, indeed!
The communications from NASA's Mars teams have been incredibly thorough and transparent. That they share all of the raw images soon after they are received from Mars gives an incredible feeling of inclusion.
"For the majority of time airborne, the downward-looking navcams takes 30 pictures a second of the Martian surface and immediately feeds them into the helicopter’s navigation system. Each time an image arrives, the navigation system’s algorithm performs a series of actions: First, it examines the timestamp that it receives together with the image in order to determine when the image was taken. Then, the algorithm makes a prediction about what the camera should have been seeing at that particular point in time, in terms of surface features that it can recognize from previous images taken moments before (typically due to color variations and protuberances like rocks and sand ripples). Finally, the algorithm looks at where those features actually appear in the image. The navigation algorithm uses the difference between the predicted and actual locations of these features to correct its estimates of position, velocity, and attitude.
Approximately 54 seconds into the flight, a glitch occurred in the pipeline of images being delivered by the navigation camera. This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps. From this point on, each time the navigation algorithm performed a correction based on a navigation image, it was operating on the basis of incorrect information about when the image was taken. The resulting inconsistencies significantly degraded the information used to fly the helicopter, leading to estimates being constantly “corrected” to account for phantom errors. Large oscillations ensued."
As for why this didn't crash the helicopter, the hardware managed to keep up with the executed corrections and oscillations:
> Flight Six ended with Ingenuity safely on the ground because a number of subsystems – the rotor system, the actuators, and the power system – responded to increased demands to keep the helicopter flying. In a very real sense, Ingenuity muscled through the situation, and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways.
Here's a cool relevant podcast (from before flight 6 but gives some nice insights into what it takes to fly that Helicopter on Mars): [0]
Description:
Tim Canham, Mars Helicopter Operations Lead at NASA’s JPL joins us again to share technical details you've never heard about the Ingenuity Linux Copter on Mars. And the challenges they had to work around to achieve their five successful flights.
Let's put a pin in your prediction; it's definitely an interesting possibility!
54 seconds of 29.97 frames: 1618.38 frames
54 seconds of 30.00 frames: 1620.00 frames
Toss in a bit of rounding errors and/or tolerances and it's definitely suspiciously close to where a 1 frame error appears between those two.
I'd suggest that the evidence against it would be that while this may be the longest flight on Mars, it seems certain that they'd have taken longer flights on Earth than this. But still, the math nearly checks out.
Seems to be the main question with no clear answer yet. One of the purposes of Ingenuity was to demonstrate non-radhard electronics working on Mars, and a frame-dropping glitch could have happened anywhere from software to hardware.
Whatever the issue, presumably it did not arise during testing. That seems to point toward something that would only happen on Mars... so, yeah, maybe radiation?
It's incredible that the thing works at all, but the resilience they seem to have accidentally built in (switching away from visual navigation at landing), and its surviving such a wild flight envelope, is just remarkable.
I bought a cheap drone on a common importing website and the thing likes to get lost even though it has GPS and is in the perfect clear. It flew into my truck, uncommanded, at full speed. On Earth. With GPS. So for this thing on another planet to do what it just did? Wow.
The article links to NASA, but here it is anyway. Let me skim it; I might add some additional information here. But go and check out the first picture there yourself: it is amazing. By the looks of it, it could just as well be an aerial photograph of a desert area on Earth.
OK, so from the NASA article, it seems that they use an inertial measurement unit [0] (think an accelerometer/rotation sensor similar to what your smartphone uses) that reads out heading and acceleration 500 times per second and integrates it to get the helicopter's position. This is "dead reckoning" [1], which unfortunately suffers from accumulating errors. To work around these accumulating errors, they correct their prediction against camera data thirty times per second by predicting how features visible on the surface should have moved in that time (typically recognizable from color variations and protuberances like rocks and sand ripples). So, sensor fusion [2], in a sense.

The images are delivered with timestamps (I would have been surprised if they were not, to be honest), but one of the images went missing and, for some reason, its timestamp was not. (Variable not cleared? Are they maybe using parallel queues to match images to timestamps? That would be very weird, imo, but I am no helicopter scientist.) Either way, the following timestamps were now attributed to the wrong images. I really wonder how that happened and hope they expand on it later.

As the system predicts how far a feature in an image should have moved between frames, things were now considerably off (example: from the IMU prediction, a feature should have moved 1 unit per second, but with a missing frame and the previous timestamp, it now looks as if it moved 2 units in a second), and the state gets updated with this wrong information.
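A toy 1-D version of that IMU-plus-camera scheme (purely illustrative; the real system is a full state estimator, and the gain here is invented):

```c
#include <assert.h>
#include <math.h>

typedef struct { double pos, vel; } nav_state_t;

/* Dead reckoning: integrate acceleration into velocity, then
   velocity into position; errors accumulate over time. */
void imu_step(nav_state_t *s, double accel, double dt)
{
    s->vel += accel * dt;
    s->pos += s->vel * dt;
}

/* Periodic vision fix: nudge the estimate toward the
   camera-derived position to bound the accumulated drift. */
void vision_correct(nav_state_t *s, double pos_meas, double gain)
{
    s->pos += gain * (pos_meas - s->pos);
}
```

The failure mode in the article corresponds to calling vision_correct with a position derived from the wrong timestamp: the "fix" then injects error instead of removing it.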
Here is an interesting bit about safety margins from this article by the way:
> Despite encountering this anomaly, Ingenuity was able to maintain flight and land safely [...]. One reason it was able to do so is the considerable effort that has gone into ensuring that the helicopter’s flight control system has ample “stability margin”: We designed Ingenuity to tolerate significant errors without becoming unstable, including errors in timing. This built-in margin was not fully needed in Ingenuity’s previous flights, because the vehicle’s behavior was in-family with our expectations, but this margin came to the rescue in Flight Six.
None of this explains why images and timestamps are considered in an absolute sense. If they were only considered relative to the previous one, then after surviving the 1/30 s glitch (which, at the mentioned speed of 4 m/s, would be 13 cm or roughly 5" of distance covered: 11% of Ingenuity's rotor span of 1.2 m, or a maximum tilt of under 15 degrees at a height of 50 cm, though the base height used by the algorithm probably ignores the legs and just takes the body-to-rotor height), it should go back to acting normally.
Even in absolute terms, once you are 110 seconds into the flight, we are talking about an offset of (1/30) s over 110 s, or 0.03%. If the control algorithm can get confused by an error of this magnitude, it's not a very good one.
But honestly, I am sure the algorithm is much better than that, and I believe the explanation is "watered down" for the general populace, and a lot more is in play that's not being shared.
It sounds like the alignment error persisted for the remainder of the flight. Depending on the direction of the slip, the visual system would either record motion that wasn’t yet sensed by the IMU or it would record no motion when there was some recorded by the IMU. In both cases the control loops would likely try to correct for the discrepancy. The large control inputs sound a bit like ’integral windup’ in a PID loop where persistent error between desired and observed states are essentially summed over time resulting in larger control input (in the case of a drone this would be helpful for overcoming wind).
Unless they box that in with limits you could easily get to a point where it starts flying like there’s a chimp on the yoke and either crash or saturate the navigation system and have it fall back/fail entirely.
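For reference, the "boxing in with limits" looks like a clamped integrator in a PI loop; the gains and limits below are made up:

```c
#include <assert.h>
#include <math.h>

typedef struct { double kp, ki, integ, integ_max; } pi_t;

/* PI step with anti-windup: the integrator is clamped so a
   persistent error cannot wind it up without bound. */
double pi_step(pi_t *c, double err, double dt)
{
    c->integ += err * dt;
    if (c->integ >  c->integ_max) c->integ =  c->integ_max;
    if (c->integ < -c->integ_max) c->integ = -c->integ_max;
    return c->kp * err + c->ki * c->integ;
}
```

Without the clamp, a persistent phantom error like the timestamp slip would keep growing the integral term, and with it the control inputs.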
The control systems behavior is less interesting to me (yes it could be windup or many other reasons for oscillation) than why the (u?)kf behaved this way. The single frame could have caused a time slip for the rest of the flight but usually there’s something that syncs timestamps before the filter acts on it. I hope we can get a technical postmortem.
A small error in position would lead to a larger error in the speed estimate. To correct the speed, you have to pitch the helicopter. But the speed estimate keeps being wrong, leading to oscillation.
Because of using multiple, independent sensors (camera in the air, IMU near the ground) a disaster was averted. This is an important lesson: do not rely totally on one sensing modality. ( I wish Elon would learn this lesson: https://jalopnik.com/teslas-removing-radar-for-semi-automate... )
I wonder why not have a real-time clock capture timestamps for different events of the same frame (one when the trigger for the frame fires, one when it is written to memory, etc.) and do a comparative analysis. Discard the frame when there is a discrepancy, restart the capture from the beginning, and go into a default hover state when such a fault is found.
Maybe it was just "impossible". Hard real-time systems are not supposed to lose frames, sometimes provably so.
You have to make some assumptions. For example, that pure functions will always return the same result with the same arguments. In fact, in real life, it is not always true, especially in space where a cosmic ray can flip a bit. You may try do be defensive to account for an unreliable hardware, but you have to draw the line at some point.
Yes, it should have been tested, but it doesn't surprise me that it wasn't, even for a company with as good a track record as JPL.
Corner cases are hard. Here's a little C program; try to figure out what it prints before you compile & run it. It computes a step size between two numbers, given a number of steps:

  #include <stdio.h>

  int f(unsigned n_steps, int from_val, int to_val) {
      int step_size = (from_val - to_val) / n_steps;
      return step_size;
  }

  int main(void) {
      printf("%d\n", f(10, 0, 100));
      return 0;
  }
Yeah, they are, but your example is not good enough; it is like an example from a textbook on the traps of C.
Corner cases are hard when you are trying to do something new, because if you had done it before, you'd know most of them. Or if someone else did it before and wrote a textbook. =)
I wonder what the shutter speed of the nav camera used in that GIF was. They said the frames themselves are captured at 30 Hz, but for the shadow of the rotors to be frozen without blurring, the shutter must be much shorter. Does anyone here know?
It looks like the real issue is "we wrote all the camera code ourselves rather than using battle-tested libraries".
The android camera API for example will tag every frame with the timestamps and exposure settings it was captured with. Unit and integration tests can simulate dropped frames and verify the algorithm still runs as it should.
I would guess they tried to go for direct camera hardware access and ended up writing their own logic to schedule frames, and it contained a bug.
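The kind of unit test described above doesn't need real hardware; a simulated dropped frame is just a discontinuity in the sequence numbers (this helper is hypothetical, not anything from the actual codebase):

```c
#include <assert.h>
#include <stdint.h>

/* Count discontinuities in a stream of frame sequence numbers;
   a test can inject gaps and assert the pipeline notices them. */
int count_gaps(const uint32_t *seq, int n)
{
    int gaps = 0;
    for (int i = 1; i < n; i++)
        if (seq[i] != seq[i - 1] + 1)
            gaps++;
    return gaps;
}
```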
I mean they weren't "forced" to use anything. They chose the chip. If they had different requirements they could have chosen a different chip.
Most likely they just used the specific kernel version with closed binary blobs from Qualcomm for a typical embedded Linux, not the full Android system.
>And where, pray tell, would you find battle tested libraries for flying a drone on Mars?
this isn't magic. Image processing isn't a wholly unknown field when under the influence of Martian gravity.
It's a staggering achievement, and a testament to humankind's ability to wrangle technology, but -- to be frank here -- visual odometry is technology from 1960s era spy planes, not beyond-human voodoo.
Sometimes errors are stupid ones, even if the stupidity is causing problems a zillion miles away on distant unknown frontiers being first explored by humankind.
tl;dr : This visual odometry problem isn't a space problem, it's a computer science problem -- one that has been encountered by numerous people even here on earth.
The repercussions of such a simple non-space-problem might very well turn into space problems when the craft crashes, however -- and that's a damn shame given that this particular set of problems has been encountered-and-fixed by numerous earthlings over and over again.
super tl;dr : as an only half-informed-participant in this discussion I bet this problem is NIH-syndrome-related, a problem NASA has encountered a lot in the past, a problem that's entirely managerial and very hard to eradicate.
It does sound like something you’d expect unit and integration tests to catch, but I am not sure that necessitates the use of off the shelf libraries?
In the context of a space helicopter, I'd guess that deep understanding and ownership of each software component is super important when you're likely to have to support and debug it from millions of miles away. At this scale, and in this context, doesn't "rolling your own" critical components, in some cases, make sense?