> "A Single Line of Code Brought Down a Half-Billion Euro Rocket Launch"
Blaming a system failure on a single point like this dooms that system to repeat similar failures (albeit in another element) in the future.
There are numerous testing, quality and risk controls that could've been in place. There are probably even a few people who didn't do their job (besides the one person a decade ago who wrote the 'single line'). The point isn't to pin blame on any one point, but to look at the system (people, processes, technology) and try to understand why the system is fragile enough that a single person's error is able to escalate into a half-billion euro error.
By focusing on the point of failure, you end up falling victim to survivorship bias [0]. It is how you end up with developer teams swamped with unit-testing requirements and test coverage metrics, but still somehow end up with errors that impact the end-user anyway. It is how you get company surveys that always seem to miss the point, saying that the measures they implemented to improve company culture worked, yet everyone is burning out and miserable.
[0] - https://en.wikipedia.org/wiki/Survivorship_bias
They're not focusing on one line of code; the article covers the failover systems that failed as well. It's also a mistake to try and fix bad tools, languages and programming practices with higher-level processes. Just use better tools that do bounds checking (and unit checking, which has also caused failures), preferably at compile time, and the problem is fixed without all the rigmarole you describe.
A bounds check wouldn't have helped. The value would have saturated instead of rolling over, resulting in a similar failure.
The mistake was an incorrect specification. A programming tool can't tell you that you've built the wrong thing, which is why we need the "rigmarole" to validate the spec. That's what the systems engineers are for.
Saturation would have been fine; actually anything would have been fine, since the result was not actually used in flight. However, Ada traps, and the trap was not handled (because resources were tight and overflow was physically impossible on the Ariane 4, for which the code had been written), and the specification required that the system shut down entirely on an unhandled trap.
According to the article, the code was designed to run on the previous iteration of the rocket, the Ariane 4, whose first flight was on 15 June 1988. It's conceivable that the code was written in the mid-1980s, when better 'tooling' might not have been an option.
sounds like you're saying that being unable to write in-bounds code is the single-point-of-logic failure for a coder, and if you correct that part of the bad tool, all their other algos will be great...
I think people who can write in-bounds, type-correct code with no safety rails have a leg up when it comes to writing really good code.
Also, an issue like this going unnoticed points to a lack of proper QA. There were probably a fair few more bits of bad code, beyond the "single line", that could have fucked the launch if this one hadn't done it first.
There is actually a lot of research on how catastrophic failures in highly complex systems happen. Here is a brilliant article that summarizes the main findings:
I cannot read that one without thinking of the descriptions and analyses of disasters like the sinking of the RMS Titanic, the Chernobyl disaster, the loss of the Challenger space shuttle, or the Fukushima disaster. Many, many points in the article seem to hold for all of them.
> The system is designed to have a backup, standby system, which unfortunately, runs the exact same code.
At Boeing, the backup system runs on a different CPU architecture, with a different program design, a different programming language, and a different team that isn't allowed to talk with the team on the other path.
Actually, the electric thumb trim switches overrode MCAS. This is why the Lion Air crew managed to restore normal trim 25 times. The Ethiopian Airlines crew did so twice.
You can make up your own terminology as you please, but the normal trim position was restored because the thumb switches overrode MCAS commands.
The mistake the pilots made was not then turning off the trim system (which also overrides MCAS).
The first MCAS incident ended with the pilots restoring trim to normal with the thumb switches a couple times, then turning off the stab trim system, then continuing on and safely arriving at their destination. The media never mentions this.
No this isn't about semantics. You're trying to tell me a system that brought down 2 airliners isn't that bad because it didn't bring down others. What?
Inventing your own meanings for words in order to advance an argument is a waste of time for you and me.
> You're trying to tell me a system that brought down 2 airliners isn't that bad because it didn't bring down others.
I'm saying that pilot error was a contributing cause to the two crashes, and the media narrative that the pilots could not have saved the situation is incorrect. Boeing even sent out an Emergency Airworthiness Directive to all MAX pilots after the first crash, detailing the two step recovery procedure (restore normal trim with the thumb switches, then turn it off). The EA pilots did not do that.
There were numerous causes of the accident that all had to combine to result in the crash. One of the causes was the single path design of the MCAS system.
Either Boeing is serious about systems safety or it isn't. They let profit motive trump engineering concerns and hundreds paid with their lives. Finding nuance to defend the corporate player that killed hundreds of innocents is disgusting.
The errors in the MCAS system design did not save Boeing any money.
> Finding nuance to defend the corporate player that killed hundreds of innocents is disgusting.
I did not defend Boeing. Besides, do you want airliners to be safe or not? If you want to fly safely, you've got to address all factors in a crash, including pilot error.
They obviously hid a new system from view to avoid retraining, when the aerodynamics and automated behavior had changed substantially. It was motivated by profit and cost so many lives. It's pretty horrible and I cannot comprehend why you'd defend it in any way. Yeah plane crashes are complex and should be studied in detail. No that doesn't mean in this case corporate greed didn't play a decisive role.
>> The cause? A simple, and very much avoidable coding bug, from a piece of dead code, left over from the previous Ariane 4 mission, which started nearly a decade before.
>> The worst part? The code wasn’t necessary after takeoff, it was only part of the launch pad alignment process. But sometimes a trivial glitch might delay a launch by a few seconds and, in trying to save having to reset the whole system, the original software engineers decided that the sequence of code should run for an extra… 40 seconds after the scheduled liftoff.
The author appears to be using a different definition of "dead code" than I'm used to. To me, dead code is code that is no longer called by anything else, and has no chance of running. Maybe a more accurate term is "legacy code"?
> With 16-bit unsigned integers, you can store anything from 0 to 65,535. If you use the first bit to store a sign (positive/negative) and your 16-bit signed integer now covers everything from -32,768 to +32,767 (only 15 bits left for the actual number). Anything bigger than these values and you’ve run out of bits.
That's, oh man, that's not how they're stored, and not how you should think of it. If you think "oh, 1 bit for sign", that implies the representation has both a +0 and a -0 (which is the case for IEEE 754 floats) that differ in at least the sign bit, which isn't the case for signed ints. Plus, if you have that double zero that comes from dedicating a bit to the sign, then you can't represent 2^15 or -2^15, because those patterns are busy representing -0 and +0. Except you can represent -2^15, or -32,768, by the article's own prose. So there's either more than just 15 bits for negative numbers, or there's not actually a "sign bit."
Like, ok, sure, you don't want to explain the intricacies of 2's complement for this, but don't say there's a sign bit. Explain signed ints as shifting the range of possible values to include negative and positive values. Something like:
> With 16-bit unsigned integers, you can store anything from 0 to 65,535. If you shift that range down so that 0 is in the middle of the range of values instead of the minimum and your 16-bit signed integer now covers everything from -32,768 to +32,767. Anything outside the range of these values and you’ve run out of bits.
> ...If you shift that range down so that 0 is in the middle of the range of values instead of the minimum...
Not a downvoter, but: your concept of "shifting the range" is also misleading.
In the source domain of 16-bit numbers, [0...65535] can be split into two sets:
[0...32767]
[32768...65535]
The first set of numbers maps to [0...32767] in 2's complement.
But the second interval maps to [-32768...-1].
So it's not just a "shift" of [0...65535] onto another range. There's a discontinuous jump going from 32767 to 32768 (or -1 to 0 if converting the other direction).
And actually, we don't know if the processor used 2's complement or 1's complement -- if it was 1's complement, they would have a signed 0!
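If it helps to see that remapping concretely, here is a minimal C++ sketch (illustrative only, not from the article) that reinterprets the same 16-bit patterns both ways:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        // The same 16-bit patterns, read as unsigned and as two's-complement signed:
        // 0..32767 keep their value, 32768..65535 land on -32768..-1.
        uint16_t patterns[] = {0, 32767, 32768, 65535};
        for (uint16_t u : patterns) {
            int16_t s;
            std::memcpy(&s, &u, sizeof s);                 // reinterpret the same bits
            std::printf("%5u -> %6d\n", (unsigned)u, (int)s);
        }
    }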
I think they'd have to say "remapping" the range? On the whole, I think OP did about as well as you're going to do, given the audience.
> And actually, we don't know if the processor used 2's complement or 1's complement -- if it was 1's complement, they would have a signed 0!
We can infer it used two's complement, and absolutely rule out one's complement or any signed-zero system, because the range is [-2^15, 2^15), and with a signed zero you can't represent every integer in that range in 16 bits: the range contains one more distinct value than such a representation can distinguish.
The range is of values, not their representation in bits, which can be mapped in any order. You could specify that the bit representation for 0 was 0x1234 and the bit representation for 1 was 0x1134, proceed accordingly, and the range of values for those 16 bits could still independently be [-32768, 32767] or [0, 65535] or [65536, 131071] if you wanted.
We know the signed int they're talking about can't be the standard 1's complement because its stated range of values is [-32768, 32767]. If the representation were 1's complement the range would be [-32767, 32767] to accommodate the -0. It could be some modified form of 1's complement where they redefine the -0 to actually be -32768, but that's not 1's complement anymore.
Everything written in those three sentences you've highlighted from the article is correct. You may not like how they've chosen their three sentences, but these three sentences contain no lies.
Every negative number has 1 as its first bit, every positive number (including 0) has 0 as its first bit. Therefore the first bit encodes the sign. The other 15 bits encode the value. They may not use the normal binary encoding for negative integers that you'd expect from how we encode unsigned integers, but you cannot explain every detail every time.
It's not a sign bit or a range shift.
Signed integers are 2-adic numbers. In an n-bit signed integer, the "left" bit b stands for the infinitely repeated tail sum_{k >= n-1} b*2^k, which equals -b*2^(n-1) as a 2-adic integer.
I have found that among software engineers, it is surprisingly not common knowledge that floating point operations have all these sharp edges and gotchas.
The most common situation in which it crops up is when dealing with quantities that require fractional arithmetic on some normally discrete unit of measure. For example, you implement some complex logic to do request sampling, and in your binary you convert the total number of active requests to a float, add some stuff, divide some stuff, add some more stuff, multiply it again, then convert back to an int, something like the “number of requests that should be sampled.” Because floating point operations are non-associative, non-distributive, and commonly introduce remainder artifacts, you can end up with results like sampling 1 more request than there are total requests active, even when the arithmetic itself seems like that should be impossible.
This is also common when dealing with time, although the outcome is typically not as bad. Time has a simple workaround: change the unit of measure (e.g. use milliseconds instead of seconds) and do integer arithmetic on that. But because people don't know why they shouldn't use floating point operations in this case, they don't always reach for it.
The worst is when some complicated operation is done to report a float (or int converted from a float) as a metric. In the request sampling example, that would likely be noticed quickly and fixed. But when the float value looks reasonable enough and doesn’t violate some kind of system invariant, it can feed you bad data for a very long time before someone catches it.
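A minimal sketch of the request-sampling pitfall from the first paragraph (made-up numbers, not the commenter's actual code): ten additions of 0.1 don't quite reach 1.0, and a rate built up as 0.1 + 0.2 rounds up to one request too many.

    #include <cmath>
    #include <cstdio>

    int main() {
        // Accumulated rounding: ten 0.1's should be exactly 1.0, but aren't.
        double total = 0.0;
        for (int i = 0; i < 10; ++i) total += 0.1;
        std::printf("%.17g\n", total);                     // prints 0.99999999999999989

        // "30% of 10 requests", with the rate built up as 0.1 + 0.2:
        int to_sample = (int)std::ceil((0.1 + 0.2) * 10);
        std::printf("%d\n", to_sample);                    // prints 4, not the expected 3
    }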
Would you happen to have any resources on how to treat floating point values?
I noticed some odd behaviour recently when using ruby to save to Postgres where the handoff between the two systems introduced imprecision in the saved value. Didn’t get to dig into it because it wasn’t a priority but it’s definitely an annoying unanswered question.
In addition to what others mentioned, to start from the basics I would try to learn how floating point values are implemented and how processors evaluate floating point expressions: for example, that a float is split into a sign, an exponent and a mantissa, that it can't exactly represent many decimal numbers, etc.
A simple rule of thumb is to try to avoid using floating point values at all outside of contexts like scientific simulations. For basic situations, you can almost always use either a library (for big numbers, decimals, fractions, etc.) or express your logic with ints by using established patterns (make the unit of measurement smaller/bigger, explicitly round up or down, etc.). Any time you take something that is an int 99% of the time, convert it to a float for something, then convert it back to an int, you are doing something wrong.
If you want a thorough understanding: you'll want to look up "numerical computation", "numerical methods" or "computational methods"; in particular computing error bounds and error propagation. Typically covered at the bachelor's level in university Math, CS, or Engineering departments.
If you just want to fix the odd behavior: adjust the schema so that you only work with whole numbers. For example, in the database schema, you can use DECIMAL instead of REAL/DOUBLE columns, or use two columns to specify the ratio of two integers (for example num/denom in frame rates in various video containers/codecs). In the application code: work only in whole numbers (e.g. cents, satoshis) instead of fractionals, using bigint or string types as applicable instead of e.g. double.
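As a tiny illustration of the "work in whole numbers" advice (hypothetical amounts, sketch only): the same sum drifts in double but stays exact in integer cents.

    #include <cstdint>
    #include <cstdio>

    int main() {
        double euros = 0.10 + 0.20;              // 0.30000000000000004
        std::int64_t cents = 10 + 20;            // exactly 30
        std::printf("%.17g euros vs %lld cents\n", euros, (long long)cents);
    }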
Don't Fixed Point numbers add more issues?
Let's assume you have a machine that accepts 8-digit fixed-point numbers, with 6 digits for the integer part and 2 for the fraction. For simplicity, let's use decimal digits rather than bits.
If you must represent the number 2.013, the resulting number would be
+00002.01
So you cut the third decimal digit and wasted 4 digits on useless zeros. At the same time, if you had represented the same number in an 8-digit floating-point format, you would have kept all the digits.
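The binary analogue of that decimal example, as a hypothetical sketch: an 8-bit fixed-point format with 2 fraction bits stores a count of quarters, so 2.013 collapses to 2.00, just as the third decimal digit was cut above.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical 8-bit fixed point with 2 fraction bits: the stored integer
    // counts quarters, so the resolution is 0.25.
    int8_t encode(double x) { return (int8_t)std::lround(x * 4.0); }
    double decode(int8_t q) { return q / 4.0; }

    int main() {
        int8_t q = encode(2.013);                                      // 8 quarters
        std::printf("stored %d, decoded %.3f\n", (int)q, decode(q));   // stored 8, decoded 2.000
    }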
I’m very happy that the flight software codebase I’m currently working on doesn’t use any floating point. We don’t even have FPUs enabled. Then again, it’s not GN&C, so the stakes are not as high.
I hate these “single line of code did X” type headlines.
It will always be a single line of code. The nature of most programs is to execute commands in a sequence. Eventually you hit one that fails.
Hell, you could reduce it to even be less than a line of code. It could be a single variable. A single instruction. It could be a couple bits. A couple bad 1’s and 0’s in memory blew up a multibillion dollar rocket launch.
> "To achieve this, the guidance system converts the velocity readings, from 64 bit floating point to 16 bit signed integer".
Oh, an excellent possible interview question: "Write some code that reliably converts the full range of possible 64 bit floating point values to a 16 bit signed integer. What are the issues you'll have to deal with and what edge cases might arise?"
I’ve interviewed 8 web devs over the last two weeks, each with years of experience, each asking $115k+, and more than one couldn’t figure out how to take an array of objects each with a ‘category’ property and output an array with the unique values of ‘category’. (this is one line of code)
Only one could successfully wire up a <select> populated with the list of unique categories and then filter the original list based on the selected category.
By the way, these interviews were done over Zoom with screen sharing with the candidate able to use their own dev environment and browser of their choice.
The interviews were allotted one hour and all candidates took longer than the scheduled time.
It’s been beyond depressing. I’m ready to just start asking for FizzBuzz again.
".... i can't. No one can. It's a mathematical impossibility as a general solution for at least 2 separate reasons.
The first issue is that we're taking 64 bits of data and trying to squeeze them into 16 bits. Now, sure it's not that bad, because we have the sign bit and NANs and infinities, but even if you toss away the exponent entirely, that's still 53 bits of mantissa to squeeze into 16 bits of int.
The second issue is all the values not directly expressible as an integer, either because they're infinity, NAN, too big, too small, or fractional.
The only way we can overcome these issues is to decide what exactly we mean by "converts", because while we might not _like_ it, casting to an int64 and then masking off our 16 most favorite bits of the 64 available is a stable conversion. That might be silly, but it brings up a valid question. What is our conversion algorithm?
Maybe by "convert" we meant map from the smallest float to the smallest int and then the next smallest float to the next smallest int, and then either wrapping around or just pegging the rest to the int16.max.
Or maybe we meant from the biggest float to the biggest int and so on doing the inverse of the previous. Those are two very different results.
And we haven't even considered whether to throw on NaN or infinity, or what to do with -0 in both those cases.
Or maybe we meant translate from the float value to the nearest representable integer? We'd have a lot of stuff mapping to int16.max and int16.min, and we'd still have to decide how to handle infinity, NaN, and -0, but it's still possible.
Basically, until we know the rough conversion function, we can't even know if NaN, infinity and -0 are special cases, and we can't even know if clipping will be an edge case or not. There are lots of conversions where we can happily wrap around on ourselves and there are no edge cases, lots of conversions where we have edge cases but can clip or wrap, and lots of conversions where we have edge cases and clipping/wrapping."
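For what it's worth, here's what one arbitrary choice of "convert" might look like — round to nearest, clamp to the int16 range (including infinities), refuse NaN — as a C++ sketch, not the answer any particular interviewer is looking for:

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <optional>

    // One arbitrary conversion policy: round to nearest, clamp out-of-range
    // values and infinities to the int16 limits, and reject NaN outright.
    std::optional<int16_t> to_int16(double x) {
        if (std::isnan(x)) return std::nullopt;   // no sensible mapping for NaN
        if (x >= 32767.0)  return INT16_MAX;      // +inf and large values clamp
        if (x <= -32768.0) return INT16_MIN;      // -inf and small values clamp
        return (int16_t)std::lround(x);           // note: -0.0 simply maps to 0
    }

    int main() {
        for (double x : {1.5, 1e9, -0.0, std::nan("")}) {
            auto r = to_int16(x);
            if (r) std::printf("%g -> %d\n", x, (int)*r);
            else   std::printf("%g -> rejected\n", x);
        }
    }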
Step 1: Enumerate all possible inputs. <--- this is the important part
Step 2: Map each input to something in the output domain.
Step 3: Are you using Ada??? C++ is *right out*.
...it looks like the proper name for what is needed is a "non-injective surjective total function".
Or it’s a bad interview question as posed, because in practice you could look up the spec; a better question would be to probe how the candidate would approach the problem and what clarifying questions they would ask.
It's a bad question because it's completely impossible, so the interviewee has to guess whether the interviewer knew that and is looking for pushback, or didn't know it, in which case you have to figure out what they actually meant and what they incorrectly think the answer is supposed to be.
And most likely any serious Ada codebase worked on by people worrying about such a question has this in a generic, or has done it (correctly) 7000 times... 'I'd just instantiate the Quantizer or Quantization_Manager generic, y'all have one, right?'
Why would the program react like that to a SINGLE wrong signal that disagrees with everything else and produce a signal that cannot do anything good in any circumstances? This just smells like a truly naive piece of implementation.
There should be layers upon layers of safeties to prevent this dumb thing from happening. The computer should know the position, orientation and velocity of the rocket at any point in time, and new signals should be interpreted in the context of what the computer already knows and of what the other sensors are reporting. It is not like the rocket can turn itself around in 1ms, and if it does, there probably isn't much it can do anyway.
This suggests to me the problem is not the bug, it is the overall quality of development.
The article mentions there was a duplicate redundant signal, and they both agreed with each other. The problem indeed wasn’t the bug, you’re right. It was the assumption somewhere along the way that this older system couldn’t cause damage, the failure to review the fixed point range.
It’s easy to judge with hindsight though. Every rocket development project that has ever happened on the planet has had unintended explosions and accidents, and they are always staffed with brilliant people. Layers of safety at some level might only make the problem harder, more code just adds more complexity and failure points. Since there is an uncountable number of ways for something seemingly innocuous to break a rocket, and until we’ve all tried making rockets, it’s probably best to take away the lesson that fundamentally making rockets is highly prone to catastrophic failure.
Of course there was a redundant signal (because the contract probably required it).
What I meant was other signals -- other information about the state of the rocket. There are lots of sensors in these things, if only so that you can figure out when stuff goes wrong.
> It’s easy to judge with hindsight though. Every rocket development project that has ever happened on the planet has had unintended explosions and accidents,
No, that's a bad excuse. Accidents do happen, but they can only be excused when a reasonable effort to prevent them has been made.
For example, the Challenger disaster resulted in such a huge shakeup at NASA exactly because reasonable precautions were NOT taken, versus other accidents which were truly unforeseen and were due to lack of knowledge/experience.
The question is about taking reasonable precautions. It is a reasonable effort, when designing a system that will fly multiple half-billion-dollar rockets, to take some level of care to ignore absolutely idiotic signals.
I do it on my home projects and at work with non safety critical applications. Why can't they do it for such a critical project?
> Why can’t they do it for such a critical project?
They can, they did, and they still exploded a rocket! Again, hindsight is 20/20. This question isn’t very reasonable to ask this way (with incredulity) until you’ve successfully built several rockets yourself.
The Commission: So why did it blow up? Could anything have been done?
The Nasa: These things just happen! Hindsight is 20/20. Also, you don't have enough experience to point out our engineering problems until you've exploded a couple of them yourself!
I’m not excusing it. (Nor did they). I’m only suggesting that your armchair incredulity is waaay out of place, exposing your own assumptions more than the rocket team’s. I’m only adding this because you pushed back a second time, and I mean this with only love for a fellow programmer, not malice, but your comment implying that your home project code and work code is as well tested and executed as Ariane 5’s is more than a little amusing. Put your code up here for review and let’s all see if it’s crashproof and worthy of a high reliability embedded system safety-critical environment... ;) Like really, in case you’re young, saying something like that in an interview reveals so much hubris it might cost you the job, it is exceptionally presumptuous.
This is important because assuming that something dumb happened is part of the problem too, and that’s what your comments above attempt to communicate. Pretending like it was easy to avoid is to be intentionally ignorant of the fact that nobody ever has avoided this problem, not in rocket launches, not in web development, not in cars or video games, or in any code of any significant size. Safety critical engineering has to absorb this fact deeply, and nobody can walk into it thinking, well @twawaaay did it with their home project, so all we need is duh multiple layers of testing and redundancy. Yeah, they absolutely had multiple layers of testing and redundancy, they had everything you’ve ever thought was a good idea for writing safe code, and then 10x more than that, it might be worth reading more about the history before jumping to such conclusions.
Pretty much. The trajectory is preplanned anyway. I would expect nothing less than every launch being run through a simulated environment multiple times, if only to catch wrong launch information.
Also, this FP-to-integer conversion does not smell any better. Anybody who's been interested in programming for any length of time will learn this is just a bad idea asking for trouble.
If I was tech lead for the project I would definitely make sure there is static analysis that prevents these kinds of conversions from happening.
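For example (a sketch only; the function name and scenario are made up), mainstream compilers can already flag this class of narrowing if you ask them to, and C++ brace-initialization rejects it outright:

    #include <cstdint>
    #include <cstdio>

    int16_t horizontal_bias(double velocity) {
        // int16_t bh{velocity};   // brace-initialization: the compiler rejects this
        //                         // narrowing conversion outright, at compile time
        return velocity;           // implicit double -> int16_t: allowed, but gcc/clang
                                   // warn under -Wconversion / -Wfloat-conversion
    }

    int main() {
        std::printf("%d\n", (int)horizontal_bias(123.9));   // in range, prints 123
    }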
Even very earth-tied machines suffer similar issues. I worked on what was known as a "hot leveller" computer, a PDP-11/73, at the steelworks where I was employed. It had something like 9 rolls (maybe 200mm in diameter) that would be applied to a very hot steel plate (maybe 10mm to 150mm thick) after it had been rolled from a maybe 300mm thick slab.
The leveller's job was to smooth out any waves the plate might have acquired during the rolling process - almost like a clothes iron. The gap between the rolls needed to be adjusted by hydraulically positioning backup rolls that are even able to bend those work rolls across their width (maybe 3000mm). As you always intend to apply a huge amount of force anyway to achieve the desired results, the "setup" was a mix of metallurgy-driven algorithms and hard limits.
While there was always an operator who had to accept the setup before the run, there was always the risk of hitting the machine's surfaces too hard, straining components, and maybe causing a prolonged and expensive outage. Obviously the biggest risks were when there were changes or even experiments by both engineers and metallurgists. It was fun times as a quite junior engineer, and I think there were a few times when over-zealous setups resulted in some big noises. But I don't think I broke anything, fortunately.
It'd probably be more accurate to say that a technology environment which allowed any single line of code to cause catastrophic failure is what brought down the launch. Or a failure of sufficiently accurate testing brought down the launch.
The Y2K bug affected a lot of machines, yet the fear of the impact beforehand had a larger effect on society at large than any of the software consequences during the event.
>> However, the reading is larger than the biggest possible 16 bit integer, a conversion is tried and fails. Usually, a well-designed system would have a procedure built-in to handle an overflow error and send a sensible message to the main computer. This, however, wasn’t one of those cases.
This is so unbelievably untrue. I've never seen code anywhere that waits to fail before doing the right thing.
This is exactly why I think exceptions are mostly useless: someone has to anticipate the problem anyway, so why not write something that works right the first time? There are cases where exceptions can happen, but I don't think floating point arithmetic should be considered one of those cases.
These stories of gnat-brings-down-empire are, to me, always missing the point. There are going to be bugs. The hard part is in creating management around the actual software engineering such that this is not a problem.
“Exceptions” are just a name we give to a particular type of flow control and syntactic sugar.
In this case the syntactic sugar helps deal with passing data across function boundaries and up the stack in a way that can make the code more readable.
Every function could handle returning its own error states, and then higher-level functions could aggregate, check, and return the error states of every function they call… or you might find it saves a massive amount of boilerplate to just use try/catch!
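A minimal sketch of that trade-off (illustrative names only): the same two-step operation written once with error codes and once with exceptions.

    #include <cstdio>
    #include <stdexcept>
    #include <string>

    // Error-code style: every caller checks and forwards the status by hand.
    int parse_reading(const std::string& s, double& out) {
        if (s.empty()) return -1;
        out = 42.0;                                        // stand-in for real parsing
        return 0;
    }
    int scale_reading(const std::string& s, double& out) {
        double v;
        if (int err = parse_reading(s, v)) return err;     // boilerplate at every call site
        out = v * 2.0;
        return 0;
    }

    // Exception style: the failure path is written once, near the top of the stack.
    double parse_reading_ex(const std::string& s) {
        if (s.empty()) throw std::runtime_error("empty reading");
        return 42.0;
    }
    double scale_reading_ex(const std::string& s) {
        return parse_reading_ex(s) * 2.0;                  // no per-call error plumbing
    }

    int main() {
        double v;
        if (scale_reading("", v) != 0) std::puts("error-code path: failed");
        try { scale_reading_ex(""); }
        catch (const std::exception& e) { std::printf("exception path: %s\n", e.what()); }
    }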