> Peyton added that even though the update to the DynamicSource software had been tested over an extended period, the bug was missed because it only presented when many aircraft at the same time were using the system.
That seems horribly wrong to me. I can understand software being slow under load, but being wrong under load sounds like a horrible internal architecture problem.
> but being wrong under load sounds like a horrible internal architecture problem.
Tons of software can be "wrong under load" - things like race conditions and memory leaks are common problems, and they don't necessarily point to a huge architectural defect. E.g. Cloudflare had a bug years ago (Cloudbleed) that caused highly sensitive data to leak, but only for a (relatively) tiny portion of their traffic. A similar issue happened to GitHub: https://github.blog/2021-03-18-how-we-found-and-fixed-a-rare...
True, but software for domains where people can die is very different from something like Cloudflare. The scale is much higher in the latter, but it needs to be less "bulletproof".
If you're writing software for a giant metal flying cylinder carrying hundreds of people, you can't just brush mistakes off the same way you might with web-based software (and yes, privacy is important, but it's not life or death).
I'm not "brushing mistakes off", at all. The process that led to the bugs needs a full review and root-cause corrective action, especially enhanced testing scenarios. I'm just pointing out that software "being wrong under load" is not some sort of crazy, unusual problem that points to poor software architecture. These bugs can easily be caused by a single line of code that wasn't synchronized correctly.
In most industries outside IT, testing the IT systems thoroughly is needed but rarely achieved. In my area (FDA-regulated), performance testing is almost never done correctly and (for sure) never completely, but as long as it works and the auditors never look at these details, nobody in management is willing to know about it, because fixing it costs money. For many years the people signing off on all the software were intentionally non-IT business people with zero IT knowledge, so they could not be accused of knowingly signing off on low-quality code. That saved tens or hundreds of millions of dollars in the past 20 years alone.
For what it's worth, I had the same reaction upon reading this paragraph. It seems there are HN commenters who understand software design and those who do not.
To be fair to the other commenters, it's conceivable that the practical difference between the current unsound architecture and a sound replacement might come down to some small defect. But an attempt to fix the problem by fixing just the defect, even with some kind of RCA process on it, will fail if the architecture itself is not fixed, and that requires a change in attitudes and understanding by the devs responsible for the system, not just a software change. That is where the real flaw lies. But some people don't believe that it's possible for the flaw to be in these places, outside the code. All the RCAs in the world can't help them.
But calling something an "internal architectural problem" implies things are fundamentally designed wrong, and the code needs a rewrite. My point was the architecture can definitely be fine, but you can still have bugs (even catastrophic ones) because someone left out a `synchronized` keyword somewhere.
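Not claiming this is what actually happened here (we only have the marketing statement), but as a toy sketch, with invented names, of how a single missing `synchronized` produces results that are wrong only under concurrent load:

```java
// Toy illustration: a running total that is correct when called from one
// thread, but silently wrong under concurrent load because the
// read-modify-write on `totalKg` is not atomic.
public class WeightTotal {
    private long totalKg = 0;

    // BUG: without `synchronized`, two threads can interleave the read and
    // the write, and one update is lost - the result is wrong, not slow.
    public void addBag(long kg) {   // should be: public synchronized void addBag(long kg)
        totalKg = totalKg + kg;
    }

    public long getTotalKg() {
        return totalKg;
    }

    public static void main(String[] args) throws InterruptedException {
        WeightTotal t = new WeightTotal();
        Runnable load = () -> { for (int i = 0; i < 100_000; i++) t.addBag(1); };
        Thread a = new Thread(load), b = new Thread(load);
        a.start(); b.start();
        a.join(); b.join();
        // Expected 200000; with two threads this usually prints something smaller.
        System.out.println(t.getTotalKg());
    }
}
```

Run it with one thread and it always prints the right number; run it with two and updates get lost - exactly "correct in testing, wrong under load", with no architectural rewrite needed to fix it.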
For example, one way to use SQL is to escape your parameters and then concat them straight into your SQL string. However, if someone leaves out an ‘escape’ somewhere, you have a major security incident.

The other way is to pass parameters separately as data to your SQL driver. That completely negates the problem.

If you’ve chosen the first way in your project already, you’ve committed to a major internal architectural problem (rough sketch of both styles just below). In the same vein, maybe if your code requires sprinkling ‘synchronized’ everywhere, you did it wrong.
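Rough sketch of the two styles, with a made-up table and class just for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical lookup, only to show the two approaches described above.
class FlightDao {
    // Style 1: correctness depends on every caller remembering to escape.
    static ResultSet byTailNumberConcat(Connection conn, String tail) throws SQLException {
        String sql = "SELECT * FROM flights WHERE tail_number = '" + tail + "'"; // injection risk if `tail` isn't escaped
        Statement st = conn.createStatement();
        return st.executeQuery(sql);
    }

    // Style 2: parameters passed as data; the driver keeps query text and values separate.
    static ResultSet byTailNumberParam(Connection conn, String tail) throws SQLException {
        PreparedStatement ps = conn.prepareStatement("SELECT * FROM flights WHERE tail_number = ?");
        ps.setString(1, tail);
        return ps.executeQuery();
    }
}
```

Once every query in a codebase is built the first way, no single-line fix addresses the whole class of bug - which is what makes that choice architectural rather than a one-off defect.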
But that said, we don’t actually know what the problem is, since it’s just a marketing statement saying that high load was the cause, and that doesn’t tell us anything.
The key here seems to be the word "many."
Alaska Airlines has 289 airplanes (per Wikipedia).
All the other arguments seem to assume a large consumer type of load - tens of thousands of users, etc...
I just can't see undue strain being placed on a well-designed system from < 300 data points. And I haven't even accounted for the distribution of needing to compute takeoff data over the course of a day, nor how many planes are NOT taking off at the same time, etc...
Also, to somewhat change the topic: didn't Alaska Airlines disband their QA org a few years ago as part of cost cutting? IIRC, they did this to mimic the software-company model (ship bugs to consumers regularly), and they seem to be getting some data suggesting they need to bring that org back...
I was just about to post the _exact_ same thing. Each execution of the program should be completely independent of other calculations. Something is _horribly_ wrong with their architecture.
... Call to the baggage-tracking service returned 429 and an empty body, and that was treated as "there are no bags" because some dweeb [dev, but I'm leaving in what autocorrect said] had been reading too much design guidance from Netflix/Facebook/etc that assumes missing features on a fraction of page loads are NBD?
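Purely speculating, but the anti-pattern would look something like this (service name, endpoint, and types all invented for the sketch):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.List;

// Hypothetical client illustrating the "degrade silently" habit described above.
class BaggageClient {
    private final HttpClient http = HttpClient.newHttpClient();

    List<String> bagsForFlight(String flightId) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://baggage.example.internal/flights/" + flightId + "/bags")).build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());

        // The dangerous shortcut: any non-200 (including a 429 under load)
        // quietly becomes "no bags", and the weight calculation proceeds.
        if (resp.statusCode() != 200) {
            return Collections.emptyList();   // fine for a web page, wrong for weight-and-balance
            // Safer here: fail loudly so nobody computes with missing data, e.g.
            // throw new IllegalStateException("baggage service returned " + resp.statusCode());
        }
        return parseBagList(resp.body());
    }

    private List<String> parseBagList(String body) { /* parsing omitted for brevity */ return List.of(); }
}
```

On a consumer page, an empty list is a reasonable degradation; in this context it silently yields a plausible-but-wrong total, which matches what the article describes.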
I wonder how it works. Obviously you are correct, but the number of moving parts must be huge. The number of passengers, amount of cargo and the fuel load must be changing the whole time.
There's so much I can't understand here. Why would the load matter? Is this a web app? Why would that possibly be a good idea? This seems like software that should run locally, for security, assurance, and auditability reasons.
Follow-up thought: when software results are this critical, I wonder if a totally separate program should be used as well and the results compared. An independent implementation from another vendor.
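Something like the classic cross-check/voting pattern. A rough sketch, with invented interfaces and an invented tolerance, of what "compare the results" could mean:

```java
// Two independently developed implementations compute the figure, and we
// refuse to hand a number to the crew if they disagree beyond a tolerance.
interface TakeoffCalculator {
    double takeoffWeightKg(String flightId);
}

class CrossCheckedCalculator {
    private static final double TOLERANCE_KG = 50.0;   // made-up threshold

    private final TakeoffCalculator primary;
    private final TakeoffCalculator independent;        // separate vendor / separate codebase

    CrossCheckedCalculator(TakeoffCalculator primary, TakeoffCalculator independent) {
        this.primary = primary;
        this.independent = independent;
    }

    double takeoffWeightKg(String flightId) {
        double a = primary.takeoffWeightKg(flightId);
        double b = independent.takeoffWeightKg(flightId);
        if (Math.abs(a - b) > TOLERANCE_KG) {
            // Disagreement: fail loudly rather than pick a plausible-looking number.
            throw new IllegalStateException("weight calculators disagree: " + a + " vs " + b);
        }
        return a;
    }
}
```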
Not according to the article: the values returned were within the realm of possibility, but wrong. Giving out zeroes might actually have been fine; it’s easy enough to realise there’s a problem if you see your plane weighs 0 pounds.
I mean, it might be just one part. E.g. luggage gets counted as zero but passenger weight, fuel weight, and aircraft weight are correct - the total might be plausible, just low.
If you view software as a bunch of contracts between different services/libraries/etc., it is hard to ensure that each piece upholds its contract, so we trust them to a degree. I can see how a failure in one place due to load would be hard to catch or recover from gracefully.
That said, they have a pretty explicit cap on maximum activity (number of planes), so it is weird they didn't test around this. It isn't like 400,000 aircraft suddenly DDoSed their system.
I refuse to believe that summing up weights for each of 300 planes is so hard that it can't be made correct by a team of moderately competent developers operating under sane management. There might be some complexity, but it's 2023 and we have enough knowledge and tools to solve that kind of stuff routinely and reliably. Somebody fucked up big time and should be fired, never to work in software again.
I can give you a simple example of how this can happen, from a real production system in a different regulated environment: all the SQL SELECT statements use WITH (NOLOCK) on MS SQL Server, resulting in uncommitted (dirty) reads. Under very light load it works fine; under higher load it reads some dirty data. If you calculate the weight of the plane by adding up the weight of the checked-in luggage plus the passengers read from a shared database, differences will appear, and they will increase with the load as more transactions are in flight and may be read or missed.
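A rough sketch of what that looks like in code (table and column names invented); only the first query can see rows from transactions that are still in flight:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical weight roll-up showing the difference the hint makes.
class WeightQueries {
    // Dirty reads: under load, in-flight (possibly rolled-back) check-ins get included.
    static final String DIRTY =
        "SELECT SUM(weight_kg) FROM checked_bags WITH (NOLOCK) WHERE flight_id = ?";

    // Default READ COMMITTED: only committed check-ins are summed.
    static final String COMMITTED =
        "SELECT SUM(weight_kg) FROM checked_bags WHERE flight_id = ?";

    static long totalBagWeightKg(Connection conn, String flightId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(COMMITTED)) {
            ps.setString(1, flightId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```

With the NOLOCK variant the answer drifts only when there are enough concurrent writers, which is exactly the "fine in testing, wrong under load" pattern.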
I agree that it's hard to see from the outside how calculating weight figures for an airplane could be software load-dependent. But this all smacks of the point that Brooks makes in 'The Mythical Man-Month' all those years ago: two guys in a garage can make a program that does 'x', but making a programming system product that does 'x' is more than an order of magnitude more work. The complexities introduced don't relate to the 'x' but to the intricacies of large systems. Deciding that 'someone has fucked up and should never work again'? Sorry, competent and conscientious people regularly make such mistakes, which is why we need tests, software review, engineering processes et al. to catch those mistakes.
There is certainly not enough information in that quote to say one way or another what the bug was. I've definitely seen concurrency bugs under load because data that wasn't supposed to be shared actually was, e.g. I posted this serious GitHub bug in a comment above, https://github.blog/2021-03-18-how-we-found-and-fixed-a-rare....
Obviously sessions should be independent and not share data, but that's why it was a bug.
I believe any bug that only happens under load is a concurrency bug by definition. The shared resource is the thing under load. If it weren't shared, then the load from one computation would have no effect on another.
A proper transactional database would suffice. You could do this safely and performantly on '00s hardware in PHP3 for crying out loud, serving thousands of planes per second.