> Peyton added that even though the update to the DynamicSource software had been tested over an extended period, the bug was missed because it only presented when many aircraft at the same time were using the system.
That seems horribly wrong to me. I can understand software being slow under load, but being wrong under load sounds like a horrible internal architecture problem.
> but being wrong under load sounds like a horrible internal architecture problem.
Tons of software can be "wrong under load" - things like race conditions and memory leaks are common problems, and they don't necessarily point to a huge architectural defect. E.g. Cloudflare had a bug years ago (Cloudbleed) that caused highly sensitive data to leak, but only for a (relatively) tiny portion of their traffic. A similar issue happened to GitHub: https://github.blog/2021-03-18-how-we-found-and-fixed-a-rare...
True, but software for domains where people can die is very different from something like Cloudflare. The scale is much higher in the latter, but it needs to be less "bulletproof".
If you're writing software for a giant metal flying cylinder carrying hundreds of people, you can't just brush mistakes off the same way you might with web-based software (and yes, privacy is important, but it's not life or death).
I'm not "brushing mistakes off", at all. The process that led to the bugs needs a full review and root-cause corrective action, especially enhanced testing scenarios. I'm just pointing out that software "being wrong under load" is not some sort of crazy, unusual problem that points to poor software architecture. These bugs can easily be caused by a single line of code that wasn't synchronized correctly.
In most industries outside IT, testing the IT systems thoroughly is needed but rarely achieved. In my area (FDA-regulated), performance testing is almost never done correctly and (for sure) never completely, but as long as it works and the auditors never look at these details, nobody in management is willing to know about it, because fixing it costs money. For many years the people signing off on all the software were intentionally non-IT business people with zero IT knowledge, so they could not be accused of knowingly signing off on low-quality code. That saved tens or hundreds of millions of dollars in the past 20 years alone.
For what it's worth, I had the same reaction upon reading this paragraph. It seems there are HN commenters who understand software design and those who do not.
To be fair to the other commenters, it's conceivable that the practical difference between the current unsound architecture and a sound replacement might come down to some small defect. But an attempt to fix the problem by fixing just the defect, even with some kind of RCA process on it, will fail if the architecture itself is not fixed, and that requires a change in attitudes and understanding by the devs responsible for the system, not just a software change. That is where the real flaw lies. But some people don't believe that it's possible for the flaw to be in these places, outside the code. All the RCAs in the world can't help them.
But calling something an "internal architectural problem" implies things are fundamentally designed wrong, and the code needs a rewrite. My point was the architecture can definitely be fine, but you can still have bugs (even catastrophic ones) because someone left out a `synchronized` keyword somewhere.
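Not claiming this is what actually happened here (we only have the marketing statement), but as a toy sketch, with invented names, of how a single missing `synchronized` produces results that are wrong only under concurrent load:

```java
// Toy illustration: a running total that is correct when called from one
// thread, but silently wrong under concurrent load because the
// read-modify-write on `totalKg` is not atomic.
public class WeightTotal {
    private long totalKg = 0;

    // BUG: without `synchronized`, two threads can interleave the read and
    // the write, and one update is lost - the result is wrong, not slow.
    public void addBag(long kg) {   // should be: public synchronized void addBag(long kg)
        totalKg = totalKg + kg;
    }

    public long getTotalKg() {
        return totalKg;
    }

    public static void main(String[] args) throws InterruptedException {
        WeightTotal t = new WeightTotal();
        Runnable load = () -> { for (int i = 0; i < 100_000; i++) t.addBag(1); };
        Thread a = new Thread(load), b = new Thread(load);
        a.start(); b.start();
        a.join(); b.join();
        // Expected 200000; with two threads this usually prints something smaller.
        System.out.println(t.getTotalKg());
    }
}
```

Run it with one thread and it always prints the right number; run it with two and updates get lost - exactly "correct in testing, wrong under load", with no architectural rewrite needed to fix it.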
For example, one way to use SQL is to escape your parameters and then concat them straight into your SQL string. However, if someone leaves out an ‘escape’ somewhere, you have a major security incident.

The other way is to pass parameters separately as data to your SQL driver. That completely negates the problem.

If you’ve chosen the first way in your project already, you’ve committed to a major internal architectural problem (rough sketch of both styles just below). In the same vein, maybe if your code requires sprinkling ‘synchronized’ everywhere, you did it wrong.
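Rough sketch of the two styles, with a made-up table and class just for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical lookup, only to show the two approaches described above.
class FlightDao {
    // Style 1: correctness depends on every caller remembering to escape.
    static ResultSet byTailNumberConcat(Connection conn, String tail) throws SQLException {
        String sql = "SELECT * FROM flights WHERE tail_number = '" + tail + "'"; // injection risk if `tail` isn't escaped
        Statement st = conn.createStatement();
        return st.executeQuery(sql);
    }

    // Style 2: parameters passed as data; the driver keeps query text and values separate.
    static ResultSet byTailNumberParam(Connection conn, String tail) throws SQLException {
        PreparedStatement ps = conn.prepareStatement("SELECT * FROM flights WHERE tail_number = ?");
        ps.setString(1, tail);
        return ps.executeQuery();
    }
}
```

Once every query in a codebase is built the first way, no single-line fix addresses the whole class of bug - which is what makes that choice architectural rather than a one-off defect.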
But that said, we don’t actually know what the problem is, since it’s just a marketing statement saying that high load was the cause, and that doesn’t tell us anything.
The key here seems to be the word "many."
Alaska Airlines has 289 airplanes (per Wikipedia).
All the other arguments seem to assume a large consumer type of load - tens of thousands of users, etc...
I just can't see undue strain being placed on a well-designed system from < 300 data points. And I haven't even accounted for the distribution of needing to compute takeoff data over the course of a day, nor how many planes are NOT taking off at the same time, etc...
Also, to somewhat change the topic: didn't Alaska Airlines disband their QA org a few years ago as part of cost cutting? IIRC, they did this to mimic the software-company model (ship bugs to consumers regularly), and they seem to be getting some data suggesting they need to bring that org back...
I was just about to post the _exact_ same thing. Each execution of the program should be completely independent of other calculations. Something is _horribly_ wrong with their architecture.
... Call to the baggage-tracking service returned 429 and an empty body, and that was treated as "there are no bags" because some dweeb [dev, but I'm leaving in what autocorrect said] had been reading too much design guidance from Netflix/Facebook/etc that assumes missing features on a fraction of page loads are NBD?
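Purely speculating, but the anti-pattern would look something like this (service name, endpoint, and types all invented for the sketch):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.List;

// Hypothetical client illustrating the "degrade silently" habit described above.
class BaggageClient {
    private final HttpClient http = HttpClient.newHttpClient();

    List<String> bagsForFlight(String flightId) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://baggage.example.internal/flights/" + flightId + "/bags")).build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());

        // The dangerous shortcut: any non-200 (including a 429 under load)
        // quietly becomes "no bags", and the weight calculation proceeds.
        if (resp.statusCode() != 200) {
            return Collections.emptyList();   // fine for a web page, wrong for weight-and-balance
            // Safer here: fail loudly so nobody computes with missing data, e.g.
            // throw new IllegalStateException("baggage service returned " + resp.statusCode());
        }
        return parseBagList(resp.body());
    }

    private List<String> parseBagList(String body) { /* parsing omitted for brevity */ return List.of(); }
}
```

On a consumer page, an empty list is a reasonable degradation; in this context it silently yields a plausible-but-wrong total, which matches what the article describes.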
I wonder how it works. Obviously you are correct, but the number of moving parts must be huge. The number of passengers, amount of cargo and the fuel load must be changing the whole time.
There's so much I can't understand here. Why would the load matter? Is this a web app? Why would that possibly be a good idea? This seems like software that should run locally, for security, assurance, and auditability reasons.
Follow-up thought: when software results are this critical, I wonder if a totally separate program should be used as well and the results compared. An independent implementation from another vendor.
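Something like the classic cross-check/voting pattern. A rough sketch, with invented interfaces and an invented tolerance, of what "compare the results" could mean:

```java
// Two independently developed implementations compute the figure, and we
// refuse to hand a number to the crew if they disagree beyond a tolerance.
interface TakeoffCalculator {
    double takeoffWeightKg(String flightId);
}

class CrossCheckedCalculator {
    private static final double TOLERANCE_KG = 50.0;   // made-up threshold

    private final TakeoffCalculator primary;
    private final TakeoffCalculator independent;        // separate vendor / separate codebase

    CrossCheckedCalculator(TakeoffCalculator primary, TakeoffCalculator independent) {
        this.primary = primary;
        this.independent = independent;
    }

    double takeoffWeightKg(String flightId) {
        double a = primary.takeoffWeightKg(flightId);
        double b = independent.takeoffWeightKg(flightId);
        if (Math.abs(a - b) > TOLERANCE_KG) {
            // Disagreement: fail loudly rather than pick a plausible-looking number.
            throw new IllegalStateException("weight calculators disagree: " + a + " vs " + b);
        }
        return a;
    }
}
```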
Not according to the article: the values returned were within the realm of possibility, but wrong. Giving out zeroes might actually have been fine; it’s easy enough to realise there’s a problem if you see your plane weighs 0 pounds.
I mean, it might be just one part. E.g. luggage gets counted as zero but passenger weight, fuel weight, and aircraft weight are correct - the total might be plausible, just low.
If you view software as a bunch of contracts between different services/libraries/etc., it is hard to ensure that each piece upholds its contract, so we trust them to a degree. I can see how a failure in one place due to load would be hard to catch or recover from gracefully.
That said, they have a pretty explicit cap on maximum activity (number of planes), so it is weird they didn't test around this. It isn't like 400,000 aircraft suddenly DDoSed their system.
I refuse to believe that summing up weights for each of 300 planes is so hard that it can't be made correct by a team of moderately competent developers operating under sane management. There might be some complexity, but it's 2023 and we have enough knowledge and tools to solve that kind of stuff routinely and reliably. Somebody fucked up big time and should be fired, never to work in software again.
I can give you a simple example of how this can happen, from a real production system in a different regulated environment: all the SQL SELECT statements use WITH (NOLOCK) on MS SQL Server, resulting in uncommitted (dirty) reads. Under very light load it works fine; under higher load it reads some dirty data. If you calculate the weight of the plane by adding up the weight of the checked-in luggage plus the passengers read from a shared database, differences will appear, and they will increase with the load as more transactions are in flight and may be read or missed.
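A rough sketch of what that looks like in code (table and column names invented); only the first query can see rows from transactions that are still in flight:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical weight roll-up showing the difference the hint makes.
class WeightQueries {
    // Dirty reads: under load, in-flight (possibly rolled-back) check-ins get included.
    static final String DIRTY =
        "SELECT SUM(weight_kg) FROM checked_bags WITH (NOLOCK) WHERE flight_id = ?";

    // Default READ COMMITTED: only committed check-ins are summed.
    static final String COMMITTED =
        "SELECT SUM(weight_kg) FROM checked_bags WHERE flight_id = ?";

    static long totalBagWeightKg(Connection conn, String flightId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(COMMITTED)) {
            ps.setString(1, flightId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```

With the NOLOCK variant the answer drifts only when there are enough concurrent writers, which is exactly the "fine in testing, wrong under load" pattern.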
I agree that it's hard to see from the outside how calculating weight figures for an airplane could be software load-dependent. But this all smacks of the point that Brooks makes in 'The Mythical Man-Month' all those years ago: two guys in a garage can make a program that does 'x', but making a programming system product that does 'x' is more than an order of magnitude more work. The complexities introduced don't relate to the 'x' but to the intricacies of large systems. Deciding that 'someone has fucked up and should never work again'? Sorry, competent and conscientious people regularly make such mistakes, which is why we need tests, software review, engineering processes et al. to catch those mistakes.
There is certainly not enough information in that quote to say one way or another what the bug was. I've definitely seen concurrency bugs under load because data that wasn't supposed to be shared actually was, e.g. I posted this serious GitHub bug in a comment above, https://github.blog/2021-03-18-how-we-found-and-fixed-a-rare....
Obviously sessions should be independent and not share data, but that's why it was a bug.
I believe any bug that only happens under load is a concurrency bug by definition. The shared resource is the thing under load. If it weren't shared, then the load from one computation would have no effect on another.
A proper transactional database would suffice. You could do this safely and performantly on '00s hardware in PHP3 for crying out loud, serving thousands of planes per second.