> "A Single Line of Code Brought Down a Half-Billion Euro Rocket Launch"
Blaming a system failure on a single point like this dooms that system to repeat similar failures (albeit in another element) in the future.
There are numerous testing, quality and risk controls that could've been in place. There are probably even a few people who didn't do their job (besides the one person a decade ago who wrote the 'single line'). The point isn't to pin blame on any one point, but to look at the system (people, processes, technology) and try to understand why the system is fragile enough that a single person's error is able to escalate into a half-billion euro error.
By focusing on the point of failure, you end up falling victim to survivorship bias [0]. It is how you end up with developer teams swamped with unit-testing requirements and test coverage metrics, but still somehow end up with errors that impact the end-user anyway. It is how you get company surveys that always seem to miss the point, saying that the measures they implemented to improve company culture worked, yet everyone is burning out and miserable.
[0] - https://en.wikipedia.org/wiki/Survivorship_bias
They're not focusing on one line of code; the article covers the failover systems that failed as well. It's also a mistake to try and fix bad tools, languages and programming practices with higher-level processes. Just use better tools that do bounds checking (and unit checking, which has also caused failures), preferably at compile time, and the problem is fixed without all the rigmarole you describe.
A bounds check wouldn't have helped. The value would have saturated instead of rolling over, resulting in a similar failure.
The mistake was an incorrect specification. A programming tool can't tell you that you've built the wrong thing, which is why we need the "rigmarole" to validate the spec. That's what the systems engineers are for.
Saturation would have been fine; actually anything would have been fine, since the result was not actually used in flight. However, Ada traps, and the trap was not handled (because resources were tight and overflow was physically impossible on the Ariane 4, for which the code had been written), and the specification required that the system shut down entirely on an unhandled trap.
According to the article, the code was designed to run on the previous iteration of the rocket, the Ariane 4, whose first flight was on 15 June 1988. It's conceivable that the code was written in the mid-1980s, when better 'tooling' might not have been an option.
sounds like you're saying that being unable to write in-bounds code is the single-point-of-logic failure for a coder, and if you correct that part of the bad tool, all their other algos will be great...
I think people who can write in-bounds, type-correct code with no safety rails have a leg up when it comes to writing really good code.
Also, an issue like this going unnoticed points to a lack of proper QA. There were probably a fair few more bits of bad code, beyond the "single line", that could have fucked the launch if this one hadn't done it first.
There is actually a lot of research on how catastrophic failures in highly complex systems happen. Here is a brilliant article that summarizes the main findings:
I cannot read that one without thinking of the descriptions and analyses of disasters like the sinking of the RMS Titanic, the Chernobyl disaster, the loss of the Challenger space shuttle, or the Fukushima disaster. Many, many points in the article seem to hold for all of them.
> The system is designed to have a backup, standby system, which unfortunately, runs the exact same code.
At Boeing, the backup system runs on a different CPU architecture, with a different program design, a different programming language, and a different team that isn't allowed to talk with the team on the other path.
Actually, the electric thumb trim switches overrode MCAS. This is why the Lion Air crew managed to restore normal trim 25 times. The Ethiopian Airlines crew did so twice.
You can make up your own terminology as you please, but the normal trim position was restored because the thumb switches overrode MCAS commands.
The mistake the pilots made was not then turning off the trim system (which also overrides MCAS).
The first MCAS incident ended with the pilots restoring trim to normal with the thumb switches a couple times, then turning off the stab trim system, then continuing on and safely arriving at their destination. The media never mentions this.
No this isn't about semantics. You're trying to tell me a system that brought down 2 airliners isn't that bad because it didn't bring down others. What?
Inventing your own meanings for words in order to advance an argument is a waste of time for you and me.
> You're trying to tell me a system that brought down 2 airliners isn't that bad because it didn't bring down others.
I'm saying that pilot error was a contributing cause to the two crashes, and the media narrative that the pilots could not have saved the situation is incorrect. Boeing even sent out an Emergency Airworthiness Directive to all MAX pilots after the first crash, detailing the two step recovery procedure (restore normal trim with the thumb switches, then turn it off). The EA pilots did not do that.
There were numerous causes of the accident that all had to combine to result in the crash. One of the causes was the single path design of the MCAS system.
Either Boeing is serious about systems safety or it isn't. They let profit motive trump engineering concerns and hundreds paid with their lives. Finding nuance to defend the corporate player that killed hundreds of innocents is disgusting.
The errors in the MCAS system design did not save Boeing any money.
> Finding nuance to defend the corporate player that killed hundreds of innocents is disgusting.
I did not defend Boeing. Besides, do you want airliners to be safe or not? If you want to fly safely, you've got to address all factors in a crash, including pilot error.
They obviously hid a new system from view to avoid retraining, when the aerodynamics and automated behavior had changed substantially. It was motivated by profit and cost so many lives. It's pretty horrible and I cannot comprehend why you'd defend it in any way. Yeah plane crashes are complex and should be studied in detail. No that doesn't mean in this case corporate greed didn't play a decisive role.
>> The cause? A simple, and very much avoidable coding bug, from a piece of dead code, left over from the previous Ariane 4 mission, which started nearly a decade before.
>> The worst part? The code wasn’t necessary after takeoff, it was only part of the launch pad alignment process. But sometimes a trivial glitch might delay a launch by a few seconds and, in trying to save having to reset the whole system, the original software engineers decided that the sequence of code should run for an extra… 40 seconds after the scheduled liftoff.
The author appears to be using a different definition of "dead code" than I'm used to. To me, dead code is code that is no longer called by anything else, and has no chance of running. Maybe a more accurate term is "legacy code"?
> With 16-bit unsigned integers, you can store anything from 0 to 65,535. If you use the first bit to store a sign (positive/negative) and your 16-bit signed integer now covers everything from -32,768 to +32,767 (only 15 bits left for the actual number). Anything bigger than these values and you’ve run out of bits.
That's, oh man, that's not how they're stored, and not how you should think of it. If you think "oh, 1 bit for sign", that implies the representation has both a +0 and a -0 (which is the case for IEEE 754 floats) that differ in at least the sign bit, which isn't the case for signed ints. Plus, if you have that double zero that comes from dedicating a bit to the sign, then you can't represent 2^15 or -2^15, because those patterns are busy representing -0 and +0. Except you can represent -2^15, or -32,768, by the article's own prose. So there's either more than just 15 bits for negative numbers, or there's not actually a "sign bit."
Like, ok, sure, you don't want to explain the intricacies of 2's complement for this, but don't say there's a sign bit. Explain signed ints as shifting the range of possible values to include negative and positive values. Something like:
> With 16-bit unsigned integers, you can store anything from 0 to 65,535. If you shift that range down so that 0 is in the middle of the range of values instead of the minimum and your 16-bit signed integer now covers everything from -32,768 to +32,767. Anything outside the range of these values and you’ve run out of bits.
> ...If you shift that range down so that 0 is in the middle of the range of values instead of the minimum...
Not a downvoter, but: your concept of "shifting the range" is also misleading.
In the source domain of 16-bit numbers, [0...65535] can be split into two sets:
[0...32767]
[32768...65535]
The first set of numbers maps to [0...32767] in 2's complement.
But the second interval maps to [-32768...-1].
So it's not just a "shift" of [0...65535] onto another range. There's a discontinuous jump going from 32767 to 32768 (or -1 to 0 if converting the other direction).
And actually, we don't know if the processor used 2's complement or 1's complement -- if it was 1's complement, they would have a signed 0!
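If it helps to see that remapping concretely, here is a minimal C++ sketch (illustrative only, not from the article) that reinterprets the same 16-bit patterns both ways:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        // The same 16-bit patterns, read as unsigned and as two's-complement signed:
        // 0..32767 keep their value, 32768..65535 land on -32768..-1.
        uint16_t patterns[] = {0, 32767, 32768, 65535};
        for (uint16_t u : patterns) {
            int16_t s;
            std::memcpy(&s, &u, sizeof s);                 // reinterpret the same bits
            std::printf("%5u -> %6d\n", (unsigned)u, (int)s);
        }
    }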
I think they'd have to say "remapping" the range? On the whole, I think OP did about as well as you're going to do, given the audience.
> And actually, we don't know if the processor used 2's complement or 1's complement -- if it was 1's complement, they would have a signed 0!
We can infer it used two's complement, and absolutely rule out one's complement or any signed-zero system, because the range is [-2^15, 2^15), and with a signed zero you can't represent every integer in that range in 16 bits: the range contains one more distinct value than such a representation can distinguish.
The range is of values, not their representation in bits, which can be mapped in any order. You could specify that the bit representation for 0 was 0x1234 and the bit representation for 1 was 0x1134, proceed accordingly, and the range of values for those 16 bits could still independently be [-32768, 32767] or [0, 65535] or [65536, 131071] if you wanted.
We know the signed int they're talking about can't be the standard 1's complement because its stated range of values is [-32768, 32767]. If the representation were 1's complement the range would be [-32767, 32767] to accommodate the -0. It could be some modified form of 1's complement where they redefine the -0 to actually be -32768, but that's not 1's complement anymore.
Everything written in those three sentences you've highlighted from the article is correct. You may not like how they've chosen their three sentences, but these three sentences contain no lies.
Every negative number has 1 as its first bit, every positive number (including 0) has 0 as its first bit. Therefore the first bit encodes the sign. The other 15 bits encode the value. They may not use the normal binary encoding for negative integers that you'd expect from how we encode unsigned integers, but you cannot explain every detail every time.
It's not a sign bit or a range shift.
Signed integers are 2-adic numbers. In an n-bit signed integer, the "left" bit b stands for the infinitely repeated tail sum_{k >= n-1} b*2^k, which equals -b*2^(n-1) as a 2-adic integer.
I have found that among software engineers, it is surprisingly not common knowledge that floating point operations have all these sharp edges and gotchas.
The most common situation in which it crops up is when dealing with quantities that require fractional arithmetic on some normally discrete unit of measure. For example, you implement some complex logic to do request sampling, and in your binary you convert the total number of active requests to a float, add some stuff, divide some stuff, add some more stuff, multiply it again, then convert back to an int, something like the “number of requests that should be sampled.” Because floating point operations are non-associative, non-distributive, and commonly introduce remainder artifacts, you can end up with results like sampling 1 more request than there are total requests active, even when the arithmetic itself seems like that should be impossible.
This is also common when dealing with time, although the outcome is typically not as bad. Time has a simple workaround: change the unit of measure (e.g. use milliseconds instead of seconds) and do integer arithmetic on that. But because people don't know why they shouldn't use floating point operations in this case, they don't always reach for it.
The worst is when some complicated operation is done to report a float (or int converted from a float) as a metric. In the request sampling example, that would likely be noticed quickly and fixed. But when the float value looks reasonable enough and doesn’t violate some kind of system invariant, it can feed you bad data for a very long time before someone catches it.
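A minimal sketch of the request-sampling pitfall from the first paragraph (made-up numbers, not the commenter's actual code): ten additions of 0.1 don't quite reach 1.0, and a rate built up as 0.1 + 0.2 rounds up to one request too many.

    #include <cmath>
    #include <cstdio>

    int main() {
        // Accumulated rounding: ten 0.1's should be exactly 1.0, but aren't.
        double total = 0.0;
        for (int i = 0; i < 10; ++i) total += 0.1;
        std::printf("%.17g\n", total);                     // prints 0.99999999999999989

        // "30% of 10 requests", with the rate built up as 0.1 + 0.2:
        int to_sample = (int)std::ceil((0.1 + 0.2) * 10);
        std::printf("%d\n", to_sample);                    // prints 4, not the expected 3
    }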
Would you happen to have any resources on how to treat floating point values?
I noticed some odd behaviour recently when using ruby to save to Postgres where the handoff between the two systems introduced imprecision in the saved value. Didn’t get to dig into it because it wasn’t a priority but it’s definitely an annoying unanswered question.
In addition to what others mentioned, to start from the basics I would try to learn how floating point values are implemented and how processors evaluate floating point expressions: for example, that a float is split into a sign, an exponent and a mantissa, that it can't exactly represent many decimal numbers, etc.
A simple rule of thumb is to try to avoid using floating point values at all outside of contexts like scientific simulations. For basic situations, you can almost always use either a library (for big numbers, decimals, fractions, etc.) or express your logic with ints by using established patterns (make the unit of measurement smaller/bigger, explicitly round up or down, etc.). Any time you take something that is an int 99% of the time, convert it to a float for something, then convert it back to an int, you are doing something wrong.
If you want a thorough understanding: you'll want to look up "numerical computation", "numerical methods" or "computational methods"; in particular computing error bounds and error propagation. Typically covered at the bachelor's level in university Math, CS, or Engineering departments.
If you just want to fix the odd behavior: adjust the schema so that you only work with whole numbers. For example, in the database schema, you can use DECIMAL instead of REAL/DOUBLE columns, or use two columns to specify the ratio of two integers (for example num/denom in frame rates in various video containers/codecs). In the application code: work only in whole numbers (e.g. cents, satoshis) instead of fractionals, using bigint or string types as applicable instead of e.g. double.
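As a tiny illustration of the "work in whole numbers" advice (hypothetical amounts, sketch only): the same sum drifts in double but stays exact in integer cents.

    #include <cstdint>
    #include <cstdio>

    int main() {
        double euros = 0.10 + 0.20;              // 0.30000000000000004
        std::int64_t cents = 10 + 20;            // exactly 30
        std::printf("%.17g euros vs %lld cents\n", euros, (long long)cents);
    }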
Don't Fixed Point numbers add more issues?
Let's assume you have a machine that accepts 8-digit fixed-point numbers, with 6 digits for the integer part and 2 for the fraction. For simplicity, let's use decimal digits rather than bits.
If you must represent the number 2.013, the resulting number would be
+00002.01
So you cut the third decimal digit and wasted 4 digits on useless zeros. At the same time, if you had represented the same number in an 8-digit floating-point format, you would have kept all the digits.
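The binary analogue of that decimal example, as a hypothetical sketch: an 8-bit fixed-point format with 2 fraction bits stores a count of quarters, so 2.013 collapses to 2.00, just as the third decimal digit was cut above.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical 8-bit fixed point with 2 fraction bits: the stored integer
    // counts quarters, so the resolution is 0.25.
    int8_t encode(double x) { return (int8_t)std::lround(x * 4.0); }
    double decode(int8_t q) { return q / 4.0; }

    int main() {
        int8_t q = encode(2.013);                                      // 8 quarters
        std::printf("stored %d, decoded %.3f\n", (int)q, decode(q));   // stored 8, decoded 2.000
    }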
I’m very happy that the flight software codebase I’m currently working on doesn’t use any floating point. We don’t even have FPUs enabled. Then again, it’s not GN&C, so the stakes are not as high.
I hate these “single line of code did X” type headlines.
It will always be a single line of code. The nature of most programs is to execute commands in a sequence. Eventually you hit one that fails.
Hell, you could reduce it to even be less than a line of code. It could be a single variable. A single instruction. It could be a couple bits. A couple bad 1’s and 0’s in memory blew up a multibillion dollar rocket launch.
> "To achieve this, the guidance system converts the velocity readings, from 64 bit floating point to 16 bit signed integer".
Oh, an excellent possible interview question: "Write some code that reliably converts the full range of possible 64 bit floating point values to a 16 bit signed integer. What are the issues you'll have to deal with and what edge cases might arise?"
I’ve interviewed 8 web devs over the last two weeks, each with years of experience, each asking $115k+, and more than one couldn’t figure out how to take an array of objects each with a ‘category’ property and output an array with the unique values of ‘category’. (this is one line of code)
Only one could successfully wire up a <select> populated with the list of unique categories and then filter the original list based on the selected category.
By the way, these interviews were done over Zoom with screen sharing with the candidate able to use their own dev environment and browser of their choice.
The interviews were allotted one hour and all candidates took longer than the scheduled time.
It’s been beyond depressing. I’m ready to just start asking for FizzBuzz again.
".... i can't. No one can. It's a mathematical impossibility as a general solution for at least 2 separate reasons.
The first issue is that we're taking 64 bits of data and trying to squeeze them into 16 bits. Now, sure it's not that bad, because we have the sign bit and NANs and infinities, but even if you toss away the exponent entirely, that's still 53 bits of mantissa to squeeze into 16 bits of int.
The second issue is all the values not directly expressible as an integer, either because they're infinity, NAN, too big, too small, or fractional.
The only way we can overcome these issues is to decide what exactly we mean by "converts", because while we might not _like_ it, casting to an int64 and then masking off our 16 most favorite bits of the 64 available is a stable conversion. That might be silly, but it brings up a valid question. What is our conversion algorithm?
Maybe by "convert" we meant map from the smallest float to the smallest int and then the next smallest float to the next smallest int, and then either wrapping around or just pegging the rest to the int16.max.
Or maybe we meant from the biggest float to the biggest int and so on doing the inverse of the previous. Those are two very different results.
And we haven't even considered whether to throw on NaN or infinity, or what to do with -0 in both those cases.
Or maybe we meant translate from the float value to the nearest representable integer? We'd have a lot of stuff mapping to int16.max and int16.min, and we'd still have to decide how to handle infinity, NaN, and -0, but it's still possible.
Basically, until we know the rough conversion function, we can't even know if NaN, infinity and -0 are special cases, and we can't even know if clipping will be an edge case or not. There are lots of conversions where we can happily wrap around on ourselves and there are no edge cases, lots of conversions where we have edge cases but can clip or wrap, and lots of conversions where we have edge cases and clipping/wrapping."
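For what it's worth, here's what one arbitrary choice of "convert" might look like — round to nearest, clamp to the int16 range (including infinities), refuse NaN — as a C++ sketch, not the answer any particular interviewer is looking for:

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <optional>

    // One arbitrary conversion policy: round to nearest, clamp out-of-range
    // values and infinities to the int16 limits, and reject NaN outright.
    std::optional<int16_t> to_int16(double x) {
        if (std::isnan(x)) return std::nullopt;   // no sensible mapping for NaN
        if (x >= 32767.0)  return INT16_MAX;      // +inf and large values clamp
        if (x <= -32768.0) return INT16_MIN;      // -inf and small values clamp
        return (int16_t)std::lround(x);           // note: -0.0 simply maps to 0
    }

    int main() {
        for (double x : {1.5, 1e9, -0.0, std::nan("")}) {
            auto r = to_int16(x);
            if (r) std::printf("%g -> %d\n", x, (int)*r);
            else   std::printf("%g -> rejected\n", x);
        }
    }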
Step 1: Enumerate all possible inputs. <--- this is the important part
Step 2: Map each input to something in the output domain.
Step 3: Are you using Ada??? C++ is *right out*.
...it looks like the proper name for what is needed is a "non-injective surjective total function".
Or it’s a bad interview question as posed, because in practice you could look up the spec; a better question would be to probe how the candidate would approach the problem and what clarifying questions they would ask.
It's a bad question because it's completely impossible, so the interviewee has to guess whether the interviewer knew that and is looking for pushback, or didn't know it, in which case you have to figure out what they actually meant and what they incorrectly think the answer is supposed to be.
And most likely any serious Ada codebase worked on by people worrying about such a question has this in a generic, or has done it (correctly) 7000 times... 'I'd just instantiate the Quantizer or Quantization_Manager generic, y'all have one, right?'
Why would the program react like that to a SINGLE wrong signal that disagrees with everything else and produce a signal that cannot do anything good in any circumstances? This just smells like a truly naive piece of implementation.
There should be layers upon layers of safeties to prevent this dumb thing from happening. The computer should know the position, orientation and velocity of the rocket at any point in time, and new signals should be interpreted in the context of what the computer already knows and of what the other sensors are reporting. It is not like the rocket can turn itself around in 1ms, and if it does, there probably isn't much it can do anyway.
This suggests to me the problem is not the bug, it is the overall quality of development.
The article mentions there was a duplicate redundant signal, and they both agreed with each other. The problem indeed wasn’t the bug, you’re right. It was the assumption somewhere along the way that this older system couldn’t cause damage, the failure to review the fixed point range.
It’s easy to judge with hindsight though. Every rocket development project that has ever happened on the planet has had unintended explosions and accidents, and they are always staffed with brilliant people. Layers of safety at some level might only make the problem harder, more code just adds more complexity and failure points. Since there is an uncountable number of ways for something seemingly innocuous to break a rocket, and until we’ve all tried making rockets, it’s probably best to take away the lesson that fundamentally making rockets is highly prone to catastrophic failure.
Of course there was a redundant signal (because the contract probably required it).
What I meant was other signals -- other information about the state of the rocket. There are lots of sensors in these things, if only so that you can figure out when stuff goes wrong.
> It’s easy to judge with hindsight though. Every rocket development project that has ever happened on the planet has had unintended explosions and accidents,
No, that's a bad excuse. Accidents do happen, but they can only be excused when a reasonable effort to prevent them has been made.
For example, the Challenger disaster resulted in such a huge shakeup at NASA exactly because reasonable precautions were NOT taken, versus other accidents which were truly unforeseen and were due to lack of knowledge/experience.
The question is about taking reasonable precautions. It is a reasonable effort, when designing a system that will fly multiple half-billion-dollar rockets, to take some level of care to ignore absolutely idiotic signals.
I do it on my home projects and at work with non safety critical applications. Why can't they do it for such a critical project?
> Why can’t they do it for such a critical project?
They can, they did, and they still exploded a rocket! Again, hindsight is 20/20. This question isn’t very reasonable to ask this way (with incredulity) until you’ve successfully built several rockets yourself.
The Commission: So why did it blow up? Could anything have been done?
The Nasa: These things just happen! Hindsight is 20/20. Also, you don't have enough experience to point out our engineering problems until you've exploded a couple of them yourself!
I’m not excusing it. (Nor did they). I’m only suggesting that your armchair incredulity is waaay out of place, exposing your own assumptions more than the rocket team’s. I’m only adding this because you pushed back a second time, and I mean this with only love for a fellow programmer, not malice, but your comment implying that your home project code and work code is as well tested and executed as Ariane 5’s is more than a little amusing. Put your code up here for review and let’s all see if it’s crashproof and worthy of a high reliability embedded system safety-critical environment... ;) Like really, in case you’re young, saying something like that in an interview reveals so much hubris it might cost you the job, it is exceptionally presumptuous.
This is important because assuming that something dumb happened is part of the problem too, and that’s what your comments above attempt to communicate. Pretending like it was easy to avoid is to be intentionally ignorant of the fact that nobody ever has avoided this problem, not in rocket launches, not in web development, not in cars or video games, or in any code of any significant size. Safety critical engineering has to absorb this fact deeply, and nobody can walk into it thinking, well @twawaaay did it with their home project, so all we need is duh multiple layers of testing and redundancy. Yeah, they absolutely had multiple layers of testing and redundancy, they had everything you’ve ever thought was a good idea for writing safe code, and then 10x more than that, it might be worth reading more about the history before jumping to such conclusions.
Pretty much. The trajectory is preplanned anyway. I would expect nothing less than every launch being run through a simulated environment multiple times, if only to catch wrong launch information.
Also, this FP-to-integer conversion does not smell any better. Anybody who's been interested in programming for any length of time will learn this is just a bad idea asking for trouble.
If I was tech lead for the project I would definitely make sure there is static analysis that prevents these kinds of conversions from happening.
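For example (a sketch only; the function name and scenario are made up), mainstream compilers can already flag this class of narrowing if you ask them to, and C++ brace-initialization rejects it outright:

    #include <cstdint>
    #include <cstdio>

    int16_t horizontal_bias(double velocity) {
        // int16_t bh{velocity};   // brace-initialization: the compiler rejects this
        //                         // narrowing conversion outright, at compile time
        return velocity;           // implicit double -> int16_t: allowed, but gcc/clang
                                   // warn under -Wconversion / -Wfloat-conversion
    }

    int main() {
        std::printf("%d\n", (int)horizontal_bias(123.9));   // in range, prints 123
    }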
Even very earth-tied machines suffer similar issues. I worked on what was known as a "hot leveller" computer, a PDP-11/73, at the steelworks where I was employed. It had something like 9 rolls (maybe 200mm in diameter) that would be applied to a very hot steel plate (maybe 10mm to 150mm thick) after it had been rolled from a maybe 300mm thick slab.
The leveller's job was to smooth out any waves the plate might have acquired during the rolling process - almost like a clothes iron. The gap between the rolls needed to be adjusted by hydraulically positioning backup rolls that are even able to bend those work rolls across their width (maybe 3000mm). As you always intend to apply a huge amount of force anyway to achieve the desired results, the "setup" was a mix of metallurgy-driven algorithms and hard limits.
While there was always an operator who had to accept the setup before the run, there was always the risk of hitting the machine's surfaces too hard, straining components, and maybe causing a prolonged and expensive outage. Obviously the biggest risks were when there were changes or even experiments by both engineers and metallurgists. It was fun times as a quite junior engineer, and I think there were a few times when over-zealous setups resulted in some big noises. But I don't think I broke anything, fortunately.
It'd probably be more accurate to say that a technology environment which allowed any single line of code to cause catastrophic failure is what brought down the launch. Or a failure of sufficiently accurate testing brought down the launch.
The Y2K bug affected a lot of machines, yet the fear of the impact beforehand had a larger effect on society at large than any of the software consequences during the event.
>> However, the reading is larger than the biggest possible 16 bit integer, a conversion is tried and fails. Usually, a well-designed system would have a procedure built-in to handle an overflow error and send a sensible message to the main computer. This, however, wasn’t one of those cases.
This is so unbelievably untrue. I've never seen code anywhere that waits to fail before doing the right thing.
This is exactly why I think exceptions are mostly useless: someone has to anticipate the problem anyway, so why not write something that works right the first time? There are cases where exceptions can happen, but I don't think floating point arithmetic should be considered one of those cases.
These stories of gnat-brings-down-empire are, to me, always missing the point. There are going to be bugs. The hard part is in creating management around the actual software engineering such that this is not a problem.
“Exceptions” are just a name we give to a particular type of flow control and syntactic sugar.
In this case the syntactic sugar helps deal with passing data across function boundaries and up the stack in a way that can make the code more readable.
Every function could handle returning its own error states, and then higher-level functions could aggregate, check, and return the error states of every function they call… or you might find it saves a massive amount of boilerplate to just use try/catch!
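A minimal sketch of that trade-off (illustrative names only): the same two-step operation written once with error codes and once with exceptions.

    #include <cstdio>
    #include <stdexcept>
    #include <string>

    // Error-code style: every caller checks and forwards the status by hand.
    int parse_reading(const std::string& s, double& out) {
        if (s.empty()) return -1;
        out = 42.0;                                        // stand-in for real parsing
        return 0;
    }
    int scale_reading(const std::string& s, double& out) {
        double v;
        if (int err = parse_reading(s, v)) return err;     // boilerplate at every call site
        out = v * 2.0;
        return 0;
    }

    // Exception style: the failure path is written once, near the top of the stack.
    double parse_reading_ex(const std::string& s) {
        if (s.empty()) throw std::runtime_error("empty reading");
        return 42.0;
    }
    double scale_reading_ex(const std::string& s) {
        return parse_reading_ex(s) * 2.0;                  // no per-call error plumbing
    }

    int main() {
        double v;
        if (scale_reading("", v) != 0) std::puts("error-code path: failed");
        try { scale_reading_ex(""); }
        catch (const std::exception& e) { std::printf("exception path: %s\n", e.what()); }
    }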