Why would the program react like that to a SINGLE wrong signal that disagrees wi...

dahart · on Feb 4, 2023

The article mentions there was a redundant signal, and the two agreed with each other. The problem indeed wasn’t the bug, you’re right. It was the assumption somewhere along the way that this older system couldn’t cause damage, and the failure to review the fixed point range.

It’s easy to judge with hindsight though. Every rocket development project that has ever happened on the planet has had unintended explosions and accidents, and they are always staffed with brilliant people. Layers of safety at some level might only make the problem harder: more code just adds more complexity and failure points. Since there is an uncountable number of ways for something seemingly innocuous to break a rocket, and until we’ve all tried making rockets, it’s probably best to take away the lesson that making rockets is fundamentally prone to catastrophic failure.

twawaaay · on Feb 4, 2023

Of course there was a redundant signal (because the contract probably required it).

What I meant was other signals -- other information about the state of the rocket. There are lots of sensors in these things, if only so that you can figure out when stuff goes wrong.

> It’s easy to judge with hindsight though. Every rocket development project that has ever happened on the planet has had unintended explosions and accidents,

No, that's a bad excuse. Accidents do happen, but they can only be excused when a reasonable effort has been made to prevent them.

For example, the Challenger disaster resulted in such a huge shakeup at NASA exactly because reasonable precautions were NOT taken, unlike other accidents which were truly unforeseen and were due to lack of knowledge/experience.

The question is about taking reasonable precautions. It is a reasonable effort to design a system that will drive multiple half-billion-dollar rockets with enough care that it ignores absolutely idiotic signals.

I do it on my home projects and at work with non-safety-critical applications. Why can't they do it for such a critical project?

dahart · on Feb 4, 2023

> Why can’t they do it for such a critical project?

They can, they did, and they still exploded a rocket! Again, hindsight is 20/20. This question isn’t very reasonable to ask this way (with incredulity) until you’ve successfully built several rockets yourself.

twawaaay · on Feb 4, 2023

Still sounds like a bad excuse.

Imagine the discussion after the Challenger disaster.

The Commission: So why did it blow up? Could anything have been done?

NASA: These things just happen! Hindsight is 20/20. Also, you don't have enough experience to point out our engineering problems until you've exploded a couple of them yourself!

dahart · on Feb 4, 2023

I’m not excusing it. (Nor did they). I’m only suggesting that your armchair incredulity is waaay out of place, exposing your own assumptions more than the rocket team’s. I’m only adding this because you pushed back a second time, and I mean this with only love for a fellow programmer, not malice, but your comment implying that your home project code and work code is as well tested and executed as Ariane 5’s is more than a little amusing. Put your code up here for review and let’s all see if it’s crashproof and worthy of a high reliability embedded system safety-critical environment... ;) Like really, in case you’re young, saying something like that in an interview reveals so much hubris it might cost you the job, it is exceptionally presumptuous.

This is important because assuming that something dumb happened is part of the problem too, and that’s what your comments above attempt to communicate. Pretending like it was easy to avoid is to be intentionally ignorant of the fact that nobody has ever avoided this problem, not in rocket launches, not in web development, not in cars or video games, or in any code of any significant size. Safety critical engineering has to absorb this fact deeply, and nobody can walk into it thinking, well @twawaaay did it with their home project, so all we need is duh multiple layers of testing and redundancy. Yeah, they absolutely had multiple layers of testing and redundancy. They had everything you’ve ever thought was a good idea for writing safe code, and then 10x more than that. It might be worth reading more about the history before jumping to such conclusions.

gowld · on Feb 4, 2023

I know enough to not put my bad code on the rocket.

amelius · on Feb 4, 2023

I'm also wondering if they had a simulation environment, and why it didn't catch the problem.

twawaaay · on Feb 4, 2023

Pretty much. The trajectory is preplanned anyway. I would expect nothing less than every launch to be run through a simulated environment multiple times, if only to catch wrong launch information.

Also, this FP-to-integer conversion does not smell any better. Anybody who's been interested in programming for any length of time will learn this is just a bad idea asking for trouble.

If I were tech lead for the project I would definitely make sure there is static analysis that prevents these kinds of conversions from happening.
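
To make the complaint concrete, here is a rough sketch of what a guarded FP-to-integer conversion could look like. It assumes a 16-bit signed target, and every name in it (`to_int16_checked`, `to_int16_saturating`, `ConversionRangeError`) is made up for illustration. It is not the flight code, which was not written in Python, and it only shows the general idea of refusing or clamping a value that does not fit.

```python
import math

# Assumed target type for this sketch: a 16-bit signed integer.
INT16_MIN = -(2 ** 15)
INT16_MAX = 2 ** 15 - 1


class ConversionRangeError(ValueError):
    """Raised when a float cannot be represented as a 16-bit signed integer."""


def to_int16_checked(value: float) -> int:
    """Truncate toward zero, refusing NaN, infinities, and out-of-range values."""
    if math.isnan(value) or math.isinf(value):
        raise ConversionRangeError(f"not a finite number: {value!r}")
    truncated = math.trunc(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise ConversionRangeError(f"{value!r} does not fit in int16")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Clamp to the int16 range instead of raising; NaN is still rejected."""
    if math.isnan(value):
        raise ConversionRangeError("NaN has no integer equivalent")
    if math.isinf(value):
        return INT16_MAX if value > 0 else INT16_MIN
    return max(INT16_MIN, min(INT16_MAX, math.trunc(value)))
```

Whether to raise or to saturate is a design decision that depends on what the caller can safely do with a bad value. The point is only that the out-of-range case gets an explicit, reviewed answer instead of an unhandled one.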

allenrb · on Feb 4, 2023

And you'd think catching incorrect launch information would be an obvious focal point, and yet... https://spaceflightnow.com/2018/02/23/investigators-say-erro...

twawaaay · on Feb 4, 2023

Yup. Definitely seems like they have a basic quality issue.