You all seem to think this is similar in value or operation to a web app. It is not. It is a safety-critical system that requires very stringent operational and development guidelines ON PURPOSE. The idea that the FAA shouldn't be risk averse in this system is absolutely ridiculous. The complexity of operating the airspace of an entire nation is nothing to scoff at and the importance of the NOTAM system should not be minimized in any way. This is not some government corruption thing. There are hundreds of thousands of lives at stake every day.
I, for one, appreciate the work that the FAA does to keep us safe in the air and appreciate that this was handled appropriately. Everything breaks. It's just a matter of how and when and what we do with it when it does. The FAA handled this outage appropriately and in a timely manner. I feel for the engineers who had to work on this incident.
You've got it exactly right. There are a lot of people here who are completely deluded into the "move fast and break things" mindset not realizing that sometimes you really do not want to move fast, because you REALLY do not want to break things. A corrupted file throwing up panics like this is a good thing, because you don't want corrupted files to pass through like everything is fine.
If the corrupted file is in the backup, it DID pass through like everything was fine. What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.
> What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.
It is possible to have all of those mitigations in place and still experience a failure like this.
Post deployment validation is only as good as the validations executed. 99% coverage still leaves the door open to failure.
A DR strategy is just that - a strategy.
A failure of this sort is not an automatic implication that those things do not exist, just that they failed in this particular case.
I would find it incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident if none of those things were in place.
They’d be either incredibly lucky, or incredibly competent, and if they are the latter, they would not operate without such mitigations in place.
It seems far more believable that an organization of the FAA’s age and complexity missed something along the way.
> incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident
I'm not surprised. The FAA does not fly each plane. Government organizational complexity helps ensure the government organization survives through the next round of Congressional appropriations.
Org complexity + opaque oversight + 'safety' + 'homeland security' + taxpayer funded = playing around and more budget.
The pilot is responsible for safety. Air travel has rules to avoid collisions (eastbound gets altitude levels different than westbound, pilots shall broadcast on known frequencies) and pilots have distributed intelligence to keep their flight safe.
Yes, somehow there needs to be coordination of runway use, but there are many ways to provide reservations and queuing.
We can make excuses all day long. A simple query of the database/table would have produced an error. Sure, the FAA does some complex stuff, but the tech I see in airplanes looks ancient. I'm willing to bet most of the FAA complexity comes from budget (lack thereof) and old computer systems.
This has nothing to do with excuses - I’m challenging the assertion that “because something bad happened, they must not have any mitigations in place at all”.
This seems like a bad case of binary thinking, and my point was that the occurrence of an incident like this is not sufficient to support that claim. It’s just as likely that an ancient process that wasn’t accounted for somewhere in the architecture broke down, and this is how it manifested.
Clearly improvements are needed, as is always the case after an outage. That doesn’t justify wild speculation.
Anecdote time: I once worked for a large financial institution that makes money when people swipe their credit cards. The system that authorizes purchases is ancient, battle tested, and undergoes minimal change because the cost of an outage could be measured in the millions of $ per minute.
Every change was scrutinized, reviewed by multiple groups, discussed with executives, and tested thoroughly. The same system underwent regular DR testing that involved quite a lot of involvement from all related teams.
So the day it went down, it was obviously a big deal, and raised all of the natural questions about how such a thing could occur.
Turns out it had an unknown transitive dependency on an internal server - a server that had not been rebooted in literally a decade. When that server was rebooted (I think it was a security group insisting it needed patches despite some strong reasons to avoid that when considering the architecture), some of the services never came back up, and everyone quickly learned that a very old change that predated almost everyone there established this unknown dependency.
The point of this story is really about the unknowability of sufficiently complex legacy enterprise systems.
All of the right processes and procedures won’t necessarily account for that seemingly inconsequential RPC call to an internal system implemented by a grizzled dev shortly before his retirement.
And then you find an obscure service doesn’t come back up on the 10,000th or 100,000th reboot because of <any number of reasons>. And now you have multiple states, because you have to handle failover. It’s turtles all the way down.
It’s always easy to say that in hindsight. But keep in mind this is an environment with many core components built in the 80s. Regular reboots on old AIX systems weren’t a common practice - the sheer uptime capability of these systems was a big selling point in an environment that looks nothing like a modern cloud architecture.
But none of that is really the point. The point is that even with every correct procedure in place, you’ll still encounter failures.
Modern dev teams in companies that build software have more checks and balances in place from the get go that help head off some categories of failure.
But when an organization is built on core tech born of the 80s/90s, there will always be dragons, regardless of the current active policies and procedures.
The problem is that the cost to replace some of these systems was inestimable.
I don't even know if I'd call this a disaster recovery fail. Depending on what they meant by "corruption", a roughly 6-8 hour turn around time is not awful for a database restore.
People like to think that the alternative to "move fast and break things" is "move slowly and not break things" but it's not, it's "move slowly, break things anyway, then take days to resolve the problem because you never learned how to move fast".
You act like "moving fast" is all you need to know to "move fast". As if it's simply the skill of making time move faster, and you don't need any other skills than that, because once you've broken the laws of physics and changed the speed of time, everything just works faster without any differences or consequences. Do you watch a lot of Superhero movies?
You should work smarter, not harder. Just turn up your smart knob. But why didn't you ever think of that before? Probably because you had your smart knob turned all the way down.
Frantic is often a natural outcome of “moving faster” when the environment one is moving in is not conducive to that speed of movement.
In my experience, this tendency towards frantic is multiplied the larger and more complex the organization and architecture becomes.
The entire point of “move fast” in software circles is to leave behind the constraints of legacy tech and management practices in favor of building something “better”.
In a mature org that grew up before these ideas were mainstream, maybe one or two teams can manage to move faster, but invariably they end up depending on other teams, who in turn depend on deeply ingrained and established company culture and procedures.
We can talk about why those impediments are a Bad Thing, and I wouldn’t advise a consumer startup to adopt those methodologies in 2022, but there’s still the harsh reality that where they exist, “just move faster” doesn’t help much more than telling a depressed person to “just do cardio every day”. There’s often a lot of inner work that’s gotta happen to make way for the new.
The only way I’ve seen this sort of work in a large org is when a brand new “emerging tech” group is spun up and given autonomy to work outside of the legacy norms. This is not perfect either, and seems much better for greenfield projects. When applied to deeply entrenched legacy systems, all of the problems mentioned above come to a head.
This also creates a weird in/out group dynamic which tends to further stratify the old tech and widen the gap between the old practices and the new.
In the context of this particular conversation though, I think the concept of “move fast” has lost all meaning and has little to offer for an org like the FAA.
When I hear "move fast and break things", I am reminded of a guy that I worked with. He delivered work fast... full of bugs... zero planning... and his daily ritual was to just keep patching the "pile of shit" he put together.
He has now moved on and we sometimes chat, he works for a huge corp, still sucks at writing SQL.
And as I said, people think the alternative to "fast, full of bugs, 0 planning" is automatically "slow, no bugs, lots of planning" but it's often "slow, lots of planning, just as many bugs"
When you are in a complex spiderweb you simply cannot move fast. If you are moving fast you are not looking at everything and it will blow up in your face.
The worldwide air traffic control system is not simple or easy to understand and iterate on. And that's not because it was designed by idiots, or that you're so much smarter and more experienced and a vastly better programmer than the combined efforts of everyone in the world working on air traffic control, as you seem to be implying from your comfortable armchair.
Are you proposing the entire world simply give up air travel, because government regulations and industry standards and the laws of physics and chaos theory prevent you from having the simple easy to understand air traffic control system you envision?
Change is the most common reason for breaking things. Moving fast means more broken things, hence the slogan. The alternative is indeed move slow, break things less often. It's a bad strategy when you NEED a LOT of change. But if you don't NEED a LOT of change, and you do need a lot of stability, it seems perfectly valid?
So for the things that don't need a lot of change, what are some characteristics of the system?
Does CI/CD exist? Does CI even exist? Are deployments automated? Is data sanity checked before loading? Is there a development environment? Do things like hourly snapshots exist? Can you easily provision a replacement system from scratch and restore data from a known good snapshot?
Or is every process manual, slow, and error-prone because there's never been a need to move fast?
Look at this one sentence in the article:
> In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system -- a significant decision, because the reboot can take about 90 minutes, according to the source.
So they do a reboot, that takes 90 minutes for some reason, and then that didn't even fix the problem. Their system that needs a lot of stability is now broken.
The FAA has risk reduction baked into its very heart and soul, and this is evident the more you learn about how the system operates (go look at a flying textbook).
It is not a perfect organization, but I think it deserves more credit than it is receiving in this thread.
I guess this is the comment chain where we address the room?
I'm seeing a lot of misuse of the word "risk". The FAA prioritizes safety over mission. The mishap that resulted in downtime affected the mission. The common-cause failure of the backup system affected the mission. That is not evidence that they're bad at managing safety risk.
Given that there was an article published at the time about removing the dissimilar backup, it's probable that they explicitly accepted this mission risk.
The only way to optimize for lowest overall risk is to optimize for speed of change.
All the checklists in the world to prevent something from happening are fine and dandy until something happens anyway (which it will). And then they hamstring you from actually fixing it.
Instead, if you can move fast consistently, you can minimize the total downtime.
> Instead, if you can move fast consistently, you can minimize the total downtime.
In safety critical software where _a_ failure can result in loss of life, is “total cumulative duration of downtime” really the metric we’re optimizing for?
Yep, this is the exact point I tried to make above and got heavily downvoted.
If you can't move fast when things are working well, you can't move fast when things are broken. Acting like moving slow is going to prevent things from ever breaking is just wishful thinking.
Downtime isn’t the metric their procedures are optimised to minimise. It’s optimised to minimise air traffic accidents. Moving fast might minimise total down time (though I seriously doubt that), but what effect would it have on accuracy and reliability? Mistakes mean dead people. In this incident zero people died. You really sure you know you can improve on that?
This demonstrates a gross misunderstanding of how the FAA actually operates in its efforts to address “risk.” It’s far more theatre via bureaucracy and paperwork than actual proven engineering efforts that demonstrably reduce risk.
This is a sweeping claim with really nothing to back it up. Are you saying this from an inside knowledge of the FAA, or is this just an opinion?
On the surface, the relative safety of air travel and the lack of major stoppages over a span of 22 years seems like a major counter example.
You’re making this statement emphatically and authoritatively, though, so I’m curious to understand where that certainty comes from and how it accounts for the other publicly visible properties of the FAA and air travel.
2. The FAA's pilot medical vetting process, while thorough, is behind the times. There are people who took ADHD medicine in high school that are unable to obtain a medical certificate due to the FAA's overly-strict policies on prescription drugs. There are current pilots with serious mental issues who are afraid to see a doctor about them due to fear of losing their medical license (https://www.flyingmag.com/why-pilots-dont-want-to-talk-about...).
> 1. The FAA basically handed their risk-management keys over to Boeing when authorizing the 737-MAX, contributing to those deaths
Couldn’t this also be interpreted as: when the FAA holds the keys, disasters like the 737-MAX tend not to happen? Obviously this raises questions about how that decision came about in the first place, but as an example, it seems counterproductive to your point, i.e. evidence that shifting away from some long standing policies directly led to harm, implying the original policies might have been better ones.
In a thread that seems eager to move fast and break things, this seems like a big problem, and would seem to indicate the need for a return to founding principles, not the opposite.
> 2. The FAA's pilot medical vetting process
This is an interesting one for sure, but also seems like an incredibly complex issue. Have there been studies about the safety of operating machinery while on those drugs that would obviate the need for a policy change?
The potential risk averted by such a policy would need to be weighed against the negative impacts of the 2nd order undesirable behaviors - obviously it’s bad that the policy discourages much needed mental health support, but how bad this is depends entirely on how effective the initial screening process is.
I’m not saying these mental health policies shouldn’t be changed, but neither do they seem to have obvious or measurably better alternatives at the moment.
And taken in the context of the original claim - that people are grossly misunderstanding the FAA and all of this is theater - they seem like weak examples to use as evidence of broad organizational failure.
Regarding #2, while there are a lot of issues with the FAA's medical process, you can absolutely get cleared to fly after having previously taken ADHD medication.
What you have to do is:
1. Not take those meds for at least 1-2 years.
2. Show documentation that getting off the meds hasn't impeded your performance. This generally means showing a stable work history if you've been off them for a long time or documents showing no change in performance between before you stopped taking them and x months after if you recently got off them.
3. Take an FAA ADHD re-evaluation.
Then they'll clear you. It's an annoying process but it's absolutely doable.
The FAA was gutted in the name of deregulation and competition - you might want to ask the GOP what happens when you don't have adequate govt oversight and regulation.
Two planes crashed due to a design flaw - hundreds of people killed and one manufacturer and model forever tarnished, like McDonnell Douglas and their DC-10.
>On August 5, following the PATCO workers' refusal to return to work, the Reagan administration fired the 11,345 striking air traffic controllers who had ignored the order, and banned them from federal service for life. In the wake of the strike and mass firings, the FAA was faced with the difficult task of hiring and training enough controllers to replace those that had been fired. Under normal conditions, it took three years to train new controllers. Until replacements could be trained, the vacant positions were temporarily filled with a mix of non-participating controllers, supervisors, staff personnel, some non-rated personnel, military controllers, and controllers transferred temporarily from other facilities. PATCO was decertified by the Federal Labor Relations Authority on October 22, 1981. The decision was appealed but to no avail, and attempts to use the courts to reverse the firings proved fruitless.
My late friend Ron Reisman worked at NASA Ames Research Center on air traffic control and flight safety, and he hired up a bunch of the professional air traffic controllers who Reagan fired, and taught them to program.
Because it's much easier to teach an air traffic controller how to program, than it is to teach a programmer how to control air traffic.
And we have them to thank for how safe the air traffic control system is today.
>Ron Reisman has BA in Philosophy and Classical Greek, and an MS in Computer Science. He joined NASA Ames Research Center in 1988 as one of the original members of the Center Tracon Automation System development team. Since the late 1990s he has worked on traffic flow management research and development. He is currently supporting the Next Generation Air Traffic System research.
I saw him give an earlier version of this talk at the November 1989 Usenix Monterey Graphics Conference, where he discussed training air traffic controllers to program, and he subsequently gave me a tour of the flight simulators and air traffic control systems at NASA Ames:
>Ron Reisman and James Murphy, NASA Ames Research Center, and Rob Savoye, Seneca Software
>This introduction to air traffic control systems summarizes the operational characteristics of the principal Air Traffic Management (ATM) domains (i.e., en route, terminal area, surface control, and strategic traffic flow management) and the challenges of designing ATM decision support tools. The Traffic Flow Automation System (TFAS), a version of the Center TRACON Automation System (CTAS), will be examined. TFAS achieves portability across platforms (Solaris, HP/UX, and Linux) by adherence to software standards (ANSI, ISO, POSIX). Software engineering issues related to design, code reuse, portability, performance, and implementation are discussed.
Based on what I know about Ron's and other people's diligent methodological work on air traffic control and safety, I feel extremely safe and confident flying, and I find it insulting to his memory and the legacy of his work when the armchair architect ex-Facebook employees on this thread (and the GOP) glibly and patronizingly implore the FAA to "move fast and break things", as if they had no idea how many lives and fortunes are at stake.
Here's a video of Ron showing Marvin Minsky the flight simulator, an early AR headset, and the hydraulic lifts:
“It’s gotten to the point that I never say anything about intelligence in general. I don’t know what it means any more. I used to. But then I started trying to test it. And if you think about it for a while, you don’t know what it is.” -Ron Reisman
The 737 MAX saga has nothing to do with air traffic control.
"The FAA, citing lack of funding and resources, has over the years delegated increasing authority to Boeing to take on more of the work of certifying the safety of its own airplanes."
"There wasn’t a complete and proper review of the documents,” the former engineer added. “Review was rushed to reach certain certification dates.”
Seems like FAA certification was a disaster in the making.
Is this conjecture or based on documented shortcomings of their approach though? Much of this paperwork, especially if created in response to NTSB crash post-mortems, could very well be examples of https://fs.blog/chestertons-fence/ - is there reason to believe otherwise?
you nailed it. Plus the FAA is unable to even perceive that this culture is counterproductive to the goal, nor would they know how to address the problem in any way other than adding more layers of paperwork and bureaucracy.
FAA regulations, more so than nearly anything else regulated in the US, are written in blood. They are not written out of government corruption. Many many many hundreds of people have died in airplane accidents since airplanes were invented, and the reason why your average person can board a commercial flight and act like they just got on the bus (with a /better/ risk profile than a bus on public roads) is exactly due to these regulations. The FAA has and continues to lead the entire world in how to do reasonable and meaningful air traffic regulation.
If they’re so risk averse, why are their systems failing so severely?
It’s absolutely a corruption issue, in that the government prefers to pay 2-3x what they would to solve things internally to contractors who then perform a poor job and lobby to keep whatever they build in place for decades.
> It’s absolutely a corruption issue, in that the government prefers to pay 2-3x
Being serious, I wish I could find one of these 2-3x multiple payouts in government. Every time I've looked at anything government related (including direct contractors) the pay is garbage. Usually 15% to 50% of what the private market pays for my skill set.
The contracting company receives usually 2-3x what the actual developer makes- so if the gov't pays $300K for a developer, they get a $100K developer. Some of that overhead is justified, but a lot of it is the company's profit.
The old answer used to be for the government to hire directly, but that's been hamstrung for like 40 years by now.
At my old firm, our federal project profit margin was a bit lower than median, though once you factor in the sales, contracting, and legal overhead the bottom line was somewhat worse than the project actuals made them look. Since our side of the house did relatively short-burn contracts (6-12 months), federal work was generally not that valuable for us; I used it for filler work when our usual sales pipeline was weak. The real value was for the side of the house that did long-term software and support work or heavy citizen support outsourcing, when contract durations can be measured in decades. Same for a friend who inked a $5bn DOD deal; the margin isn’t great, but it’s a 10-year deal that gives her a stable cash flow basis to grow on.
(Also, OP is probably underestimating the full-sheet cost of a federal FTE, as well as the complexities of fund-based budgeting and forecasting.)
Worked for a govt contractor years ago and I would imagine the payout from the govt contract was at least 2-3x. There were usually 2-3 layers of people getting paid and I don't think any of them were hurting. Of course no employee was getting a massive payout compared to private.
Govt > contractor > sub-contractor > employee was pretty common. I never knew of anyone being a direct contractor to the agency since the contracts were so large and involved a lot of employees.
It's been several years since I've done contracting work (and for the FAA no less), but that can't really happen to the extent you're implying. The open roles for contract positions specify a pay range based on experience (degrees and/or years of experience). You can have a prime and then a sub contractor, but the amount that each can add for management overhead is limited and depends on the contract vehicle. I want to say it was around 5-10%.
The actual employer can pay the employed contractor whatever they want, but the rates are published and if they underpay too much the employee will be poached by a competitor. Because the rates are public info the employee can look up how much profit their employer is making any time they want.
I am not sure if you ever dealt with a very entrenched system that has been in place longer than you have been alive, but it is not easy. It is also not easy to hire to deal with these things. Especially at government rates which are lower than the private sector.
If you were the FAA and started up an internal startup to find only the best to replace systems that have been running forever you would face a lot of problems.
1. Nobody would give you the funding until something like what happened yesterday did
2. Your developers getting paid more than the FAA directors will get lots of political attention
3. Even a great internal engineering team would most likely take ages to do something like this. This isn't move fast and break stuff with a greenfield, it is painful deconstruction and analysis of a very complicated system.
4. Many people won't want to work on this no matter how much you are paying.
It is also out of the wheelhouse of an agency tasked with flight safety. So expensive contractors arise. Do they do a poor job often? Yes. But these are jobs that are really hard to scope and execute. If it is really a case of overpaid contractors coming in and not doing the work, people here should start a startup, hire a SEAL Team Six of 1970s computer system rip-and-replacers, and make a lot of money.
Imagine how bad it would be if Kim Jong Un *weren't* so pro-american. It's a ridiculous hypothetical. Everything can always be worse, that is not a testament to an entity's quality.
There’s a lot to unpack in your claim, and this really hinges on your definition of “effective”.
What do you consider to be effective, and how would you differentiate it from theater?
Right up front, it seems necessary to account for the relative safety of flying if that safety has nothing to do with FAA policies.
It also seems that any safety policy that is effective enough will eventually appear indistinguishable from theater as people become more and more disconnected from the possibility of disaster.
2) the fact the 737 Max disasters happened outside the US was a matter of chance, not exemplary policies by the FAA. FAA policies did not prevent the deployment and roll out of this dangerous aircraft in the United States, and did not lead to grounding of the aircraft until multiple events occurred.
3) I'm mostly talking about software here, and I believe a big part of this issue is shoe-horning software into the policies and procedures specifically designed for aviation. there are so many little pointless (in most contexts) requirements that cause the engineers working on these systems to lose the forest for the trees. the FAA creates its own complexity, which prevents thinking holistically about our systems in a meaningful or effective manner.
in many ways this is a force of entropy. the more lines of code you have to support thousands of requirements, and the more revisions you make to those lines of code without a top-to-bottom refactor, the more likely you are to have insidious bugs pop out like this incident.
Another way to think about it is that safety is good for profits, at least in so far as consumers care about it.
And I think consumers do care quite a bit.
Of course there are arguments to be made for regulation, but I think your statement is overly emphatic.
In theory, if you ask explicitly, of course they do. In practice, they choose the cheapest ticket and maybe avoid airlines that have had a high profile incident recently. They don't have the time or means to actually evaluate an airline's safety culture.
Meanwhile, executive decisions are driven by quarterly earnings reports, and you can cut a lot of corners for quite a number of quarters before your luck runs out and 200 people die.
My statement is, if anything, not emphatic enough.
That doesn't make the statement any less ridiculous.
What was the ultimate reason for the 737 MAX debacle? That airlines want to save money on type rating training.
Look at accident reports, and half the time the airline's safety culture (or lack thereof) is at least a contributing cause.
The FAA may be in many ways dysfunctional, but so are the airlines, and it's sure as hell not them who are pushing for better safety standards, it's the FAA and (especially) the NTSB.
On the other hand, if Boeing actually cared about safety instead of profits they wouldn't have done their utmost to hide the fact that they were avoiding the FAA's safety regulations to improve profits.
Demonstrably flu vaccines do not effectively avert risk. Demonstrably auto safety standards do not avert risk.
But they do. And they've been so incredibly effective that we have collectively forgotten what risk used to feel like, so we're ready to say we don't need these standards/organizations or that they're not working. Obviously this is not to say they are perfect or free from criticism. But it's not just theatre.
> You all seem to think this is similar in value or operation to a web app
This line of reasoning is pure, 100%, cope. There is a reason this was broken and grounded flights nationwide. Providing cover for systems that allowed this to happen is not productive.
There’s a meaningful difference between providing cover and reminding people that major systems like the one in question are in a different category than the average web app.
The fact that such a nationwide stoppage hasn’t occurred since 2001 speaks to the stability of these systems, and I’m not sure what you’re suggesting here.
> There is a reason this was broken and grounded flights nationwide. Providing cover for systems that allowed this to happen is not productive.
I’m not trying to sound snarky here, but things do tend to break for reasons. Are you suggesting that there is a cure?
Earlier today another HN user linked to a PDF from a previous 2018 (circa 2014) investigation that pointed to the "dual-channel back up" system being fragile and likely insufficient.
> ERAM’s original design did not include a dedicated backup system. FAA believed that ERAM did not need one due to the redundancy provided by the system’s dual channel design. This design was intended to prevent outages because it allows for seamless switching between the two channels without impacting air traffic control should a problem occur in the active channel. The Agency believed this redundancy would make it unlikely that a problem in one channel could migrate to the other channel. In addition, to provide additional backup capabilities during ERAM’s implementation, FAA planned to temporarily maintain EBUS, its pre-existing backup system, before phasing it out completely beginning in 2015, to rely solely on ERAM’s dual channels.
> However, problems experienced during and since ERAM’s implementation have shown that the system remains susceptible to dual channel failures. As a result, FAA decided to maintain EBUS much longer than intended because air traffic controllers currently rely on EBUS for backup. However, with FAA’s ongoing and planned upgrades to ERAM, which will span the next 7 years, EBUS will soon become incompatible with the new hardware. As such, FAA plans to begin phasing out EBUS in April 2019, leaving ERAM without a backup system to supplement the system’s redundant dual channels.
A $2 billion dollar Lockheed Martin system where a memory overflow takes down BOTH the main system and the backup system. Surely that's not a signal of fragility.
I spent a lot of time (6+ years) at the FAA writing software. There is a huge culture of process and policy, and not a whole lot of thinking or analysis or actual understanding of the problems they are working on. It is maddening.
Imagine a spreadsheet with 700 lines in it telling you that you need to do A, B, C, D, E, F, G. Each of those lines instructs you to write a document detailing a procedure, with chain-of-custody forms and keys and password rotations, etc. Follow this process, fill out this presentation template, wait 3 months, and then present it to a board who doesn't give a shit or have any understanding of your project.
it's all an insane amount of work and it gets us the opposite in terms of the goals these processes are intended to achieve.
I have zero doubt that the only thing that will happen as a result of this catastrophic incident is another 50 lines in a spreadsheet somewhere telling you to do stuff that nobody will comprehend or implement correctly or even verify until there is a similar incident causing an investigation into it.
Agencies like the FAA are notoriously risk averse. Basically, the motivation of most employees seems to be: "if I mess up once, I get fired; if I don't produce any movement, I can't get fired". Naturally, the output is glacial progress and the introduction of tons of safety-theatre procedures (that get the implementers promotions for "increasing a culture of safety").
We all know this from the subject the FAA regulates: flights. New unleaded gas takes forever to approve. Simple changes to instruments take years of certification, leaving 1960s technology in place when clear improvements have happened in the last 60 years.
I always wondered what would happen if this culture were carried over to another space. We know what it looks like in medicine because the FDA has similar priorities. Rarely do we see it in tech, which is known to "move fast and break stuff". But here we get a glimpse of the dystopian crossover between FAA-procedure-culture and software engineering.
We know what it's like in software because we have e.g. the example of how the Space Shuttle's flight control software was developed and audited. It's weeks of meetings, tests and change processes etc. obsessing over changing a single instruction in assembly.
Which also suggests to me that the problem isn't per-se the glacial progress & approval processes, but that they're using the wrong glacial processes.
If this FAA software was developed with something like the Space Shuttle's process it would still take forever to change something, but at least you'd end up with a function that you could mathematically prove would be able to handle any conceivable input.
I think you're never going to convince the government of a "move fast" culture. Even if the overall cost of grounding planes would be less than the cost of more reliability they'd never go for it.
Too many people's asses are on the line, and they're not having to spend their own money. The only thing they have to "pay" for is possible loss of face, or loss of political capital, both of which can be insured against effectively for free with taxpayer dollars.
But you might just be able to convince them that they're using the wrong sort of bureaucracy. You'd still spend a billion or two on something that should cost a million, but at least you'd get actual reliability as a result.
The NASA Space Shuttle flight software was among the best ever developed. There was never a defect that impacted safety.
But that team had kind of an "unfair" advantage in that were able to program in assembly code on bare metal with no real software stack. Whereas the rest of us are forced to build on a foundation of sand using multiple layers of low-quality third-party software in order to deliver any useful functionality.
> But that team had kind of an "unfair" advantage in that were able to program in assembly code on bare metal with no real software stack
The Space Shuttle’s avionics software was not written in assembly, but rather in HAL/S, a high-level language invented for the project. Assembly was mainly used for the custom real-time OS kernel. They also maintained their HAL/S toolchain, which was written in XPL, a PL/I dialect that was popular for compiler development in the 1970s. The development environment ran on IBM mainframes, and the main CPUs on the Shuttles were the aerospace derivatives of the IBM S/360 mainframe architecture, System/4pi, model AP-101. The same CPUs were used by the USAF (e.g. the B-1 bomber), but the USAF mainly used JOVIAL to program theirs. Another big user of JOVIAL was the FAA, who used it to write a lot of their original mainframe-based air traffic control software (FAA HOST).
The Space Shuttle team inventing their own programming language was a byproduct of the time the project started (1970s). If they’d started a decade later, they probably would have used Ada instead. But Ada didn’t exist yet, and they thought inventing their own language was a better choice than JOVIAL
It sounds less like risk averse and more like cover-your-ass though; go through long checklists and committees and people so that no one person can be held responsible for any problems.
The closest thing you'll see in tech will be at back-end software like banks, insurance companies, pension funds, investment companies, embedded engineering / SCADA, ERP systems, etc. The "move fast and break things" mindset seems to mainly be a thing in Silicon Valley internet companies / start-ups, and mainly the latter because they're driven to pump up their own value instead of provide reliable software. Because if Twitter is down, it's an inconvenience, but if airplane software fails, lives are on the line.
> Agencies like the FAA are notoriously risk adverse.
There is a difference from risk averse ("we require heavy testing before deployment") and dysfunctional ("we are terrified of breaking anything but are unwilling to invest in maintenance").
Take the Air Force. Their risk decision is: can we accomplish a specific mission at hand with ideally minimal loss of warfighter life. If they don't maintain their planes, they cannot achieve the warfighter life loss minimization.
Similar with IT generally. Kicking the can down the road just grows the problem.
yeah working for government is strange. I'm in a position where the product owners don't want to make any enhancements/changes to a production system out of budget concerns. However, the actual invoice for my team's time is the same whether they make changes or not. They don't want to "spend the money on enhancements to a functioning system" but the invoice amount is the same every month regardless.
oh and in my experience, government FTEs can screw up an infinite number of times with no risk to their job.
The early aviation industry had a very substantial "move fast and break stuff" mentality. YouTube has plenty of videos of those folks. However "break stuff" usually meant smearing aircraft and bodies all over the landscape. Much of FAA's regulation is to try to prevent killing "too many" people. If the engine on your car stops, you can usually pull over to the side of the road. If the engine on your aircraft stops, you're going to be landing soon and hopefully not landing on a building full of people. And if you're really lucky when the engine stops, all the people sitting in the back can walk away.
The FAA is frequently called a "tombstone agency" - who only act long after fatal accidents and the bodies have been buried.
When that risk-averse culture is carried over to another space, it just fails and never gets noticed. Risk-averse organizations get outcompeted by more efficient risk-tolerant ones, every time that such competition is possible.
Risk aversion only works for an entity that has a forcible monopoly in its space. The FAA and FDA do. Another is the Nuclear Regulatory Commission, which exhibits the same behavior: their job is to prevent accidents, and the surest way to do that is to never approve anything at all.
You're right; by "mess up once, get fired" I really mean that if a flight crashes and the FAA could have done something, they'll get political flak. High level political appointees may need to resign. Whereas if an agency does nothing for a term, the political leaders get to stay.
For routine FTEs, as a sibling post mentioned, you can probably mess up in a lot of ways (short of murder or being really non-PC) and still keep your job.
This is common for government agencies. I worked at Labor and your second paragraph hit very close to home. The only plus side was the insane amount of free time you spent waiting. I tried to go back on the civilian side of the contracts and I nope'd out of it as soon as I hit red tape
All the federal employees I have met are EXCELLENT. It is very difficult to get hired into one of the dwindling jobs at the agencies. A lot of the government has been contracted out. Having worked at a federal IT contractor, I can say that in my experience most of those workers are very good and dedicated to their work. HOWEVER, they don’t always understand well the mission at hand. Some of this is to be expected given that they are contractors who come and go more frequently than federal employees charged with implementing government programs.
Ironically extreme selection pressure is exactly what leads to extreme risk aversion.
"We only hire the absolute best" does not lead to "move fast and break things" (which sounds awful in a FAA context anyway) it leads to people who devoted their lives to being the absolute best at coloring inside the lines and never straying off the path, to being the best follower out there, to the ultimate authoritarians desiring to grow into being the authority.
The heaviest selection pressure usually does not lead to the most efficient system, it generally leads to a system able to endure heavy selection pressure.
There's a sociologist who wrote a famous book about bureaucracy, and it's in my library at home, and the name of the sociologist and his book are at the tip of my tongue, but he wasn't near the top of a quick Google search; the above is a paraphrase of his book. No, it's not Douglas Adams or even Scott Adams, although those two are correct about the problem in general LOL.
People who want a job like that should just be on UBI instead. Then at least we'd have systems that could change to meet the needs of their users in a timely way.
I don't think it is unreasonable to expect that government pay in a world where the government is a welfare program with a governing hobby might have less purchasing power than UBI in a world where we prioritize effective governance over bureaucracy. It's not a zero-sum game.
There are just under 3 M federal (civilian) employees. I think it's entirely unreasonable to think that we would pay a UBI to ~210 M adult citizens (a 70x multiple) at levels that would represent a greater amount of purchasing power than to the people nominally working for the federal government.
If you're firm in your view that that's reasonable, I'd like to learn more about the proposal as to how the math would work.
I'm not going to code up a simulation because the research hasn't been done to confirm my choice of constants, but I can sketch it. Each workday is a function of the macroeconomic climate and some set of cultural norms during which we exhibit some blend of the following personae. As we'll see, introducing UBI reduces the prevalence of the bureaucrat persona which has knock-on effects leading to surplus.
---
The Missionary - has a mission and is working towards it. Cares more about the mission than prestige.
The Worker - doesn't have a plan, but likes to be a part of something meaningful. Will gamble with prestige in order to ensure that the work stays meaningful.
The Bureaucrat - willing to tolerate or create waste in favor of preserving prestige. Sometimes manages to trick a worker into believing they're a missionary.
---
Obviously people are more complex than this. Also, I'll use dollars to indicate productive output even though I think that most of the time collapsing such things to a single dimension is a slippery slope to somewhere awful. All this to say: gimme a break, it's a model.
Here are my totally made up constants, note that X is a parameter which will depend on UBI:
---
A Missionary creates $100 of output always, plus a 1% daily chance to inspire a worker to become a missionary, a 1% chance to inspire a bureaucrat to become a worker, and a 1% chance to burn out and become a worker.
A Worker creates $80 of output if they're following a missionary and -$20 if they're following a bureaucrat, because it's likely that they're causing more harm than good. They have an X% chance of burning out and becoming a bureaucrat.
A Bureaucrat creates -$20 of output, because they're definitely doing more harm than good.
Now let's say that everybody consumes $5 each day to stay alive.
---
So X is our worker burn-out rate.
As with most systems of this kind, it's very sensitive to initial conditions. If you start with a high enough concentration of workers and missionaries, your bureaucrat rate will be very low and you'll have a surplus. Too many bureaucrats and most of your workers are doing more harm than good, the system is carried (if it survives at all) by the missionaries and the minority of workers following them.
Critically, X is a function of risk tolerance. The worker becomes a bureaucrat because they cannot tolerate the risk of pointing out the wastefulness of the bureaucrat above them.
Introducing UBI does two things. It makes standing up to your Bureaucrat less risky, reducing X, and it creates a fourth type, the Video Gamer, who consumes $5 to stay alive but doesn't sabotage the output of any workers like the bureaucrat does.
Some percentage of the Bureaucrats will become Video Gamers if UBI is implemented. That percent depends on the size of the surplus. If the surplus gets big enough, UBI can be so comfortable that there's no reason to be a bureaucrat, because it doesn't afford a significant quality of life increase.
---
So to answer your question about the 3M and the 210M, I'd guess that today we've got 213M people living on the positive output of maybe 50M--the rest are bureaucrats or are following bureaucrats. They're busy fighting over their slice of the pie instead of baking it. Bureaucrats sort of expand to consume available resources, so as automation improves worker output, that ratio will get worse unless we find a place to put them.
We'd have to do research to come up with better constants and run that model for real to be sure, but I don't think it's unreasonable to assume that reducing both the bureaucrat concentration and the worker burnout rate by 50% would triple the system's output once you let the personae conventions find a new equilibrium. I'm not sure how much more federal employees will get paid above UBI, but I think there's room for the end result to be that future UBI is cushier than today's government work.
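(If anyone wants to poke at it, here's a rough Python sketch of the model above. The constants are the made-up ones from this comment; I attached the 1% transition chances to the worker/bureaucrat rather than the missionary, approximated "following a missionary" by the current missionary share, and invented the 0.5% bureaucrat-to-video-gamer rate, so treat it as a toy, not a result.)

    import random

    # Toy Monte Carlo of the persona model sketched above. All constants are the
    # made-up ones from the comment; the 0.005 bureaucrat-to-video-gamer rate and
    # the "missionary share" approximation are my own additions.

    PERSONAE = ("missionary", "worker", "bureaucrat", "video_gamer")

    def simulate(n_people=1000, days=365, x_burnout=0.02, ubi=False,
                 initial_mix=(0.1, 0.5, 0.4, 0.0), seed=0):
        rng = random.Random(seed)
        people = rng.choices(PERSONAE, weights=initial_mix, k=n_people)
        surplus = 0.0

        for _ in range(days):
            missionary_share = people.count("missionary") / len(people)
            next_people = []
            for p in people:
                if p == "missionary":
                    surplus += 100
                    if rng.random() < 0.01:             # burns out
                        p = "worker"
                elif p == "worker":
                    # Workers following a missionary produce; the rest do harm.
                    surplus += 80 if rng.random() < missionary_share else -20
                    roll = rng.random()
                    if roll < 0.01:                     # inspired upward
                        p = "missionary"
                    elif roll < 0.01 + x_burnout:       # burns out downward
                        p = "bureaucrat"
                elif p == "bureaucrat":
                    surplus -= 20
                    if rng.random() < 0.01:             # inspired to become a worker
                        p = "worker"
                    elif ubi and rng.random() < 0.005:  # opts out instead of obstructing
                        p = "video_gamer"
                # video_gamer: no output, no sabotage
                surplus -= 5                            # everyone eats $5/day
                next_people.append(p)
            people = next_people

        return round(surplus), {q: people.count(q) for q in PERSONAE}

    # Compare no-UBI against UBI with a lower burnout rate (the claimed effect).
    print(simulate(ubi=False, x_burnout=0.02))
    print(simulate(ubi=True, x_burnout=0.01))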
We issue it to ourselves, more or less like CirclesUBI is doing it in Berlin.
They're just letting it be inflationary and setting the payout to increase over time to adjust for inflation. So maybe you get $5 per week this year and $8 per week next year... This can be balanced so that it amounts to a more or less constant purchasing power.
Personally I prefer the demurrage approach where account balances just have a decay rate--that way you've got a better shot at $5 written down today having the same meaning to people who read it next year, but the economics are the same (more on the theory here: http://en.trm.creationmonetaire.info/ ).
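(A tiny worked check of that demurrage/inflation equivalence for an idle balance; the numbers are arbitrary and this says nothing about the payout side:)

    # A constant balance under price inflation loses purchasing power exactly
    # like a demurrage-charged balance in a stable-price unit, when the decay
    # rate is d = i / (1 + i).
    balance = 100.0
    inflation = 0.05                         # 5% price inflation per year
    demurrage = inflation / (1 + inflation)  # ~4.76% decay rate on balances

    for year in range(1, 6):
        held_under_inflation = balance / (1 + inflation) ** year  # real value of a constant balance
        held_with_demurrage = balance * (1 - demurrage) ** year   # decaying balance, stable prices
        print(year, round(held_under_inflation, 2), round(held_with_demurrage, 2))
    # Both columns print the same values every year.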
It's gotta be decoupled from the government so that, as discussed in my model, it can act as a safety net while you're ridding yourself of wasteful bureaucracy. It doesn't really work if the bureaucrat you're deposing can threaten to take away your UBI.
The people who need a pyramid structure to strive for, and office politics to fight, will never settle for "from each according to their ability and to each according to their need", and those are the people we've selected for.
Hear, hear. Also, if there weren’t red tape, then there would be more corruption and lack of accountability. The staff of agencies are damned if they do, damned if they don’t. If one wants to critique government agencies, criticize the political appointees who are in thrall to the industries they are supposed to be regulating. The rank and file generally work hard and in good faith. They are just trying to be good stewards of public resources. I’ve seen this at the federal level and state levels (primarily in North Carolina and Louisiana).
I just spent a month doing an E-Business Suite platform migration and it was very similar: follow the step-by-step instructions to apply patches and run commands. Each patch has a README file with dependent patches or commands which need to be completed first. It works mind-numbingly great until you run into the first of many circular dependencies.
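To make the dependency tangle concrete, here's a toy sketch of the prerequisite-ordering problem (the patch names are made up); the point is that a cycle should be surfaced up front rather than discovered halfway through a migration:

    from graphlib import TopologicalSorter, CycleError

    # Each entry maps a patch to the set of patches its README says must be
    # applied first. Names are placeholders, not real patch numbers.
    deps = {
        "app_patch":      {"db_patch"},
        "db_patch":       {"base_patch"},
        "base_patch":     set(),
        # The kind of circular pair that only surfaces mid-migration:
        "security_patch": {"config_patch"},
        "config_patch":   {"security_patch"},
    }

    try:
        order = list(TopologicalSorter(deps).static_order())
        print("apply in this order:", order)
    except CycleError as err:
        # graphlib reports the offending cycle in err.args[1]
        print("circular dependency detected:", err.args[1])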
That's one problem with treating the implementer as a machine to run code. The whole procedure can't be tested, so when parts are changed they can break the whole. It relies on the human in the loop to resolve the conflicts, which is not repeatable.
The other problem is the "mind-numbing" part. No-one can maintain 100% perfection all of the time. And in the context of presenting to people who don't know what it all means, I can see why mistakes would be made.
The problem is that spelling out complicated things is hard. Take the law for example - in theory we have a coherent code that specifies exactly what things are crimes and the appropriate methods of dealing with them. In practice, it takes teams of highly trained professionals and an elaborate system of courts to clarify what these laws mean in all but the most trivial cases.
Generally you need some flexibility to handle slight variations in circumstance whenever making a decision, and at times things come down to judgement calls that cannot be turned into an algorithm. But bureaucracies don't like empowering their workers to make decisions, and so you get ever more convoluted instructions to shift the decision making process higher up the ladder.
That is a different use of the word theory, and serves as an excellent example of why the problem of communicating complex ideas so unambiguously as to eliminate the need for interpretation is so intractable.
In my experience, this arises as an unintended consequence of the quest to lower costs and reduce bureaucracy.
About ten years ago, a new manager was brought in to make us act less like a moribund government department and behave more efficiently. As an example of government waste, he pointed to the money we were spending on storage for data back-ups. We wouldn't need back-ups if we stopped making mistakes.
You might think that this no-back-ups policy would be an instant disaster, but it lasted years without issue. When there was a failure, the manager would hand the sys-admin a soldering iron, the admin would fix the hard drive, and we would be back on track. Finally, the sys-admin retired and a new one replaced him. Not long after, a critical system failed and data that we were required by law to maintain was lost. The manager handed the admin a soldering iron and told him to fix the hard drive. The admin said it was impossible and the manager fired him (yes, you can get fired from a government job). Other candidates were interviewed, but no one applying for a $30k job was confident that they could repair a broken hard drive.
Finally, there was talk of hiring the old admin to come out of retirement and fix the drive. Except he explained that it had always been impossible. During his tenure, he'd spent 5% of his salary (gross, not net) paying for back-ups and replacement drives. When the manager gave him a soldering iron, he'd just chuck out the old drive, buy a replacement off Newegg with his personal credit card, and load it with the data he'd backed up to his personal S3 storage. His back-up script was still running on the server, but he'd stopped paying for the storage space the moment he retired.
Eventually, the manager was forced to spend a whole year's budget on an expensive data-retrieval firm to recover the data (which still cost an order of magnitude less than the fine the department would have had to pay if we'd lost the data). He was fired and a new manager brought on board. Because of the money which had been lost on the data retrieval, new measures were put in place to prevent this from ever happening again. This included a new back-up system and audits to ensure that other employees weren't using personal funds to pay for departmental resources. Of course, this meant rigorously documenting exactly what resources each employee was using...
Six years after the manager was brought in to decrease cost and increase agility, we were now more over budget and tightly controlled than we'd ever been.
I see this a lot with people who are experts in real-time operating system environments, particularly in aviation/space stuff (maybe because that’s where I worked for a while).
They have excellent intuition around making things redundant to single pieces of hardware failing but don’t really grok making stuff resilient to wider failures.
Anything involving transaction logs, rollbacks, and plain old backups takes a backseat to live hardware-redundant environments. “It’s OK though because we follow the NASA software development process which has a rigorous set of validation steps that prevent bugs.”
> They have excellent intuition around making things redundant to single pieces of hardware failing but don’t really grok making stuff resilient to wider failures.
I always feel like making single components redundant is a fairly well-defined process -- generally speaking, the mechanisms are the same (1+ redundant components, failover, STONITH, etc), where making things resilient on a higher level is not as well-defined, and often requires bespoke solutions to each unique situation.
BFT state machine replication is well-understood and well-defined: use N of M agreement for inputs and run them through a deterministic state machine. Optionally, do N of M signature of outputs.
OTOH, what are the properties of failover? "Failover" seems like an attempt to cheat the Byzantine generals' problem: the generals send mail and then confirm the results in a Zoom call. But what if Zoom doesn't work? What are the assumptions for 1+ redundant components/failover/STONITH?
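For what it's worth, here's a toy sketch of the "N of M agreement on inputs feeding a deterministic state machine" shape described above. It ignores everything that makes real BFT hard (asynchrony, equivocation, signatures, view changes); it only shows the basic structure:

    from collections import Counter

    M = 4   # total replicas
    N = 3   # quorum size (tolerates one faulty replica when M = 3f + 1)

    def agree_on_input(votes):
        """Accept a command only if at least N of the M replicas proposed it."""
        value, count = Counter(votes).most_common(1)[0]
        return value if count >= N else None

    def step(state, command):
        """Deterministic state machine: same state + same input -> same next state."""
        if command == "inc":
            return state + 1
        if command == "reset":
            return 0
        return state

    replica_states = [0] * M
    rounds = [
        ["inc", "inc", "inc", "reset"],   # one faulty replica disagrees; quorum still holds
        ["inc", "inc", "inc", "inc"],
        ["reset", "inc", "noop", "inc"],  # no quorum -> input is dropped, nobody diverges
    ]
    for votes in rounds:
        cmd = agree_on_input(votes)
        if cmd is not None:
            replica_states = [step(s, cmd) for s in replica_states]

    print(replica_states)   # honest replicas stay in lockstep: [2, 2, 2, 2]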
Formal verification that such fatal bugs are absent would allow a real-time system to avoid screwing up without relying on backups, while of course still keeping logs/rollbacks for (human) input actions to cope with mistakes.
It's just that production software essentially never uses formal verification to a sufficient extent.
Note that this most recent outage (NOTAMs) had nothing to do with ERAM.
ERAM is employed at the 23 air route traffic control centers [1] throughout the nation as their primary operating system. If there were a system-wide outage of ERAM, the consequences would be orders of magnitude worse than any NOTAM outage. Basically every flight in the air and not close to a terminal facility would lose radar contact, and controllers would be working blind, causing widespread chaos and likely many safety incidents. Non-radar air traffic control is a thing, but generally controllers do not have adequate training or currency to do it safely, and definitely not at anywhere near normal capacity.
The NOTAM system is something that a room full of decent engineers could easily build from scratch and make infinitely better in a short time. It’s essentially just a database of categorized posts with some APIs for sending entries and returning them when requested. These government IT teams spend way more than what it should cost and end up with bloated ancient tech that barely works.
That’s not speaking poorly of the engineers (who in my experience can be very good) but of the management and innovation culture of these agencies, which is too often terribly broken. They would say they are “risk averse,” but as yesterday highlights, their poor approach to this creates a ton of risk.
"I haven't seen the requirements, know little to nothing about the system, but I could knock that out in a weekend with a few Red Bulls." This stuff is such cringe, it is the type of response you see from fresh CS students who haven't started working yet and think everything is a piece of demo work where the requirements don't matter and that everything is simple if you just start writing some code.
Obviously the NOTAM system isn't up to scratch; that's why we're talking about it. But "hard problems are easy" isn't constructive nor very realistic. I bet someone could put together a three-hour presentation on all the complexity that led us to this point.
I hear you. But the flip side is that there is often a tendency to massively overcomplicate things in a way that bakes in valueless complexity and bloat, which is very much the norm in large government systems. In practice this attitude is far more common and destructive long term.
I think it's way worse than adding complexity and bloat. There are processes that specifically prevent anyone from understanding or owning the system. The complexity and bloat is a side product of the fact that the work was siloed, contracted out, and everyone washed their hands of the result. Which also takes exponentially more time and people.
this is extremely accurate. it's like a giant Rube Goldberg machine that takes both good code and garbage as input and produces a tremendous amount of garbage that nobody can understand once it comes out the other end.
Do you know that each of these systems actually has a defined "owner"? It's part of the FISMA process and every so often these "owners" have to attest that certain security processes are in place and functioning and that no significant changes have been made without a full review... among many other things.
That "owner" is also a federal employee, not a contractor.
I've worked directly with this system a number of times. it's basically a pub sub rss feed optimized for low latency and molested by bureaucrats for decades
> This stuff is such cringe, it is the type of response you see from fresh CS students who haven't started working yet and think everything is a piece of demo work where the requirements don't matter and that everything is simple if you just start writing some code.
This stuff is such cringe, it is the type of response you see from jaded boomers more focused on box-ticking and punching out as opposed to doing anything new.
This is a situation where tossing the whole damn thing out and starting over again would be productive. The systems that lead to the creation of these half-fossilized government projects (that still don't work!) will not change and need to be tossed.
> "I haven't seen the requirements, know little to nothing about the system, but I could knock that out in a weekend with a few Red Bulls." This stuff is such cringe
There is a whole group of people whose jobs depend on the system existing like it does. One dude I worked with re-did a procurement system with a few spreadsheets. He got tired of waiting for 30 people to 'do it'. As soon as his boss found out, the whole project he was on was scrapped and he was demoted shortly thereafter. His 'sin'? He had accidentally found a way to put 30 people out of a job. These systems exist like this because our gov wants them that way. Not because they are the best.
Yesterday should have been a 'run this on these 2 old boxes and make sure they still work' regression test. Sounds like that step either is not there, was totally skipped, or does not match reality. Building a real five-9s type system means taking each piece and pondering 'what are the different ways this can fail', then mitigating each of those. It is mind-numbingly tedious work that takes a long time to do. Also most of this is probably run by contract houses. Which means the people who use it do not really 'own it'. Which is by design, for CYA. Which costs more and takes more time because it is all paperwork.
Well, they claim to be "risk averse"; the wisdom there seems to be that if you fill out a million forms, that qualifies as risk averse because it shows you did due diligence. The problem is they took that process wholesale from the rest of the org and applied it to software, where it doesn't work.
It's like making a crap sandwich and filling out a bunch of forms proving that it's not a crap sandwich, rather than just spending the time you would spend filling out all those forms on, I don't know... not making a crap sandwich.
One common problem is building new things with tech that is already obsolete. Given the choice between new and old, the old is perceived as more tried and tested. Granted it's sometimes difficult to distinguish "new and going to last" from "new and shiny", but when it comes to software choosing to go with "old and tested" can be counter to cyber security since the old software is no longer updated.
Of course you also have "old but still developed, and likely to be supported in near-perpetuity". Things like SQLite fall in this category. Unfortunately there's this other problem of management being inexplicably down on anything open-source. I don't know whether they've been exposed to too much FUD from vendors selling proprietary solutions, or just that the idea that security through obscurity is no security at all has failed to reach them.
Engineers following current best practices? They'd run a Kubernetes cluster with Kafka, because those are the best COTS tools for reliability. Battle-tested. The system will be down every week because Kubernetes needs patching.
It's risk avoidance to the point that the avoidance leads to new kinds of risks. The whole idea that you can architect yourself out of failure modes to the point that you no longer need to make backups is one that I see every other week or so, and the number of companies out there that believe that because they have redundancies they don't need backups any more is staggering.
What's funny about this is that the FAA obviously knows what can go wrong with "yeah, we have two of them," as they wrote the ETOPS regulations to avoid some of the common pitfalls and amateur mistakes. They then failed to apply that to their software.
Obviously at a big government agency, the same person is not writing both aviation regulations and software procurement contracts, but the institutional knowledge is there. Nobody thought to be as paranoid about software as they are about planes flying over the ocean, but honestly, paranoia is good if you want reliability.
> they wrote ETOPS regulations ... then failed to apply that to their software
you are pointing out that they prioritized that planes in the air could land safely over allowing more planes to take off. I'm actually quite reassured now.
This seems to pose an interesting question that's out of my pay grade. The fundamental problem seems to be: you've replaced two distinct systems (one new and far more capable + 1980s-era one that always works but lacks [new feature x100]) with the same one running on 2x different machines. So the weak point is you ultimately share the same database/data structures/memory+logic flows between two systems. So if you keep them in sync the distinction comes down to hardware and lower-end systematic issues.
But most orgs can't realistically have two distinct software systems. How do you create proper isolation or failure mechanisms between them?
I'm guessing this sort of thing is what you mean by their experience with ETOPS.
It’s a fundamental limitation of identical redundant systems that they have vulnerability to some of the same threats, particularly bad inputs and capacity issues. It’s important to understand it’s only giving you physical redundancy, such as if one data centre goes down. But the same software bugs, the same bad input data, even the same memory overruns are likely to hit both systems.
It’s not bad design, it’s just you have to understand what resiliency you have and plan against each of various such threats according to your risk appetite.
My intuition on that is that two is also a bad number to choose in that case. One could go full lunar mission on the thing and have three of them, and in case of inconsistency the majority wins.
you could go to a model that verifies the integrity of the data coming in and makes sure the limits on the data are sane before committing it to the db. using a language with strong safety principles (that are not very "hip") like Ada or Fortran. or you design it so that the system is robust to failure and expects failure, like something from telco, e.g. Erlang. redundant hardware is fine and great but having them do verification on the data and monitoring the limits of the system is pretty important too.
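whatever the language, the "sanity-check before commit" part is tiny. a rough sketch in Python (the field names and limits are invented, not the real NOTAM schema):

    import sqlite3

    MAX_TEXT_LEN = 2000  # invented limit, purely for illustration

    def validate(notam):
        """Reject obviously bad records before they ever reach the database."""
        if not notam.get("id") or not notam["id"].isalnum():
            raise ValueError("missing or malformed id")
        text = notam.get("text", "")
        if not text or len(text) > MAX_TEXT_LEN or "\x00" in text:
            raise ValueError("empty, oversized, or binary-contaminated text")
        return notam

    conn = sqlite3.connect("notams.db")
    conn.execute("CREATE TABLE IF NOT EXISTS notams (id TEXT PRIMARY KEY, text TEXT)")

    record = {"id": "A0123", "text": "RWY 09/27 CLSD 1200-1800Z"}
    with conn:  # the transaction commits only if validation and the insert both succeed
        conn.execute("INSERT OR REPLACE INTO notams VALUES (:id, :text)",
                     validate(record))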
This reminds me of a particular hardware system I'm familiar with whose design specified "dual power supplies". However late in the acceptance process it was discovered that the condition of one power supply up and one power supply down causes the system to lock up. (My guess would be a phantom current path between the powered and un-powered halves causing an unintended circuit state. Or perhaps just a software bug; maybe the one-supply-down notification code path was never tested.)
The vendor simply changed the procedures to say the user must use a two-fingers procedure to simultaneously flip both power supply switches on or both off at once. It bothers me that we still don't know WHY the original problem happened. How do we know there isn't electrical damage occurring during the brief period between the two switch contacts (since no human can do that perfectly)? If it's a software problem, what is the most time one power supply can be up and one down before the bug is triggered? That's not been characterized, to my knowledge.
But in the context of this thread, what's relevant is that, to avoid addressing the problem, the vendor changed the meaning of "dual power supplies" from an OR condition to an AND condition! They met the letter of the spec while completely violating the spirit of the requirement.
Turns out having 2 copies of something doesn't matter if the same message that crashed the first node gets re-tried and crashes the other node too. Who could possibly predict that /s
$2 billion is an insane amount of money even for a major gov software project
And even after $2B it was completely broken and late and required another year and hundreds of millions.
I get the feeling people see thousands of millions of dollars as some abstract thing. You could build an A+ team with 1/100th that cash. Yet it still ends up sucked down a black hole that's only designed to consume more money.
And there's zero consequences for failure. The same few contractors will get the contract next time.
Quote from FAA: "Our preliminary work has traced the outage to a damaged database file." UK news source The Independent is reporting that Nav Canada's NOTAM system also suffered an issue.[1] Speculation: corrupting input, either international or North American? E.G. UTF-8, SQL escape, CSV quoting.
edit: Better reporting of Canada's issue from Canada's CBC (and frankly, better reporting about the US, too). [2] "In Canada, pilots were still able to read NOTAMs, but there was an outage that meant new notices couldn't be entered into the system, NAV Canada said on social media." "NAV Canada said it did not believe the outage was related to the one in the U.S., but it said it was investigating."
Why would this be an indictment of any specific database technology? If your disk fails and corrupts the filesystem, you're toast, regardless of what database you are using.
Having worked with critical infrastructure that lacks true in-depth oversight, it wouldn’t surprise me if DR plans were never executed or exercised in a meaningful manner.
Yep and if you ship WAL transaction logs to standby databases/replicas, corrupt blocks or lost writes in the primary database won't be propagated to the standbys (unlike with OS filesystem or storage-level replication).
Neither checks the checksum on every read as that would be performance-prohibitive. So "bad data on drive -> db does something with corrupted data and saves corrupted transformation back to disk" is very much possible, just extremely unlikely.
But they said nothing about it being a bad drive, just a corrupted data file, which very well might be a software bug or operator error.
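Either way, the per-page checksum idea itself is cheap to sketch. This is a generic illustration, not any particular engine's page format:

    import zlib

    def write_page(payload: bytes) -> bytes:
        """Prefix each page with a CRC32 so corruption is caught when the page is read back."""
        return zlib.crc32(payload).to_bytes(4, "big") + payload

    def read_page(page: bytes) -> bytes:
        stored, payload = int.from_bytes(page[:4], "big"), page[4:]
        if zlib.crc32(payload) != stored:
            raise IOError("page checksum mismatch: refusing to use corrupted data")
        return payload

    page = write_page(b"NOTAM A0123 RWY 09/27 CLSD")
    assert read_page(page) == b"NOTAM A0123 RWY 09/27 CLSD"

    try:
        read_page(page[:4] + b"X" + page[5:])  # one flipped byte in the payload
    except IOError as err:
        print(err)  # caught on read instead of silently propagating to the replica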
RAID does not really protect you from bit rot that tends to happen from time to time.
ZFS might because it checksums the blocks.
But if the corruption happens in memory and then it is transferred to disk and replicated, then from a disk perspective the data was valid.
journaling databases are specifically designed to avoid catastrophic corruption in the event of disk failure. the corrupt pages should be detected and reported, and the database will function fine without them
If you mean journaling file systems, no. They prevent data corruption in the case of system crash or power outage.
That's different from filesystems that do checksumming (zfs, btrfs). Those can detect corruption.
In any case, if you use a database it handles these things by itself (see ACID). However I don't believe they can necessarily detect disk corruption in all cases (like checksumming file systems).
Well, for example, MySQL/MariaDB using utf8 tables will instantly go down if someone inserts a single multibyte emoji character, and the only way out is to recreate all tables as utf8mb4 and reimport all data.
MySQL historically isn't very good about blocking bad data. Sometimes it would silently truncate strings to fit the column type, for example. It's getting better as time goes on, though.
I have had customer production sites go down due to this issue when emojis first arrived. It was a common issue in 2015. I would hope it is fixed by now!
Having dealt with utf8mb4 data being inserted into the utf8mb3 columns many many times in the past, I've never had a table "instantly go down". You either get silent truncation or a refusal to insert the data.
In MySQL the `utf8` character set is originally an alias for `utf8mb3`. The alias is deprecated as of 8.0 and will eventually be switched to mean `utf8mb4` instead. The `utf8mb3` charset means it's UTF8 encoded data, but only supports up to 3 bytes per character, instead of the full 4 bytes needed.
Imagine you have one node which is running as a replica of another and it takes the backups. Well, let’s pretend it is backing up the corrupted data once in a while and it happened to overwrite their cold backup. They could have any number of databases and still had this failure. It’s more their methodology for taking backups. They should have many points in time to choose from to rebuild their database. They should be testing their databases before backing them up blindly.
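One way to make "test before backing up blindly" concrete, sketched with SQLite only because it is easy to show; the actual system is presumably nothing like this:

    import sqlite3

    def snapshot_if_healthy(db_path: str, backup_path: str) -> None:
        """Refuse to overwrite the cold backup with a copy that fails its own consistency check."""
        src = sqlite3.connect(db_path)
        try:
            (status,) = src.execute("PRAGMA integrity_check").fetchone()
            if status != "ok":
                raise RuntimeError(f"integrity check failed ({status}); keeping the old backup")
            dst = sqlite3.connect(backup_path)
            with dst:
                src.backup(dst)  # consistent online copy via SQLite's backup API
            dst.close()
        finally:
            src.close()

    snapshot_if_healthy("notams.db", "notams-hourly.db")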
This could be due to some third-party service caching the NOTAMs. Even in the US, ForeFlight had all NOTAMs available but just couldn’t fetch new ones.
I have no idea how often NOTAMs typically get updated but would that explain why flights were able to operate for a few hours before ultimately a ground stop was called for?
In the past years, I've grown increasingly concerned about backups. To me, it feels like whatever you back up should be validated, before it's considered "okay".
So, if you have a database of some sort, a part of the backup process would be checking that the backup can be used to run an instance of it. If you have images or videos or PDF files, all of those should be validated as well, to make sure that they're not corrupted (e.g. display/playback fails). If you have compression as a part of the backup process, then the compressed archive should also be validated, that it's not corrupt and that the data can also be extracted.
Only then would checksums of the backup be actually useful, once you'd know that you don't just have a bunch of mostly useless binary data on hand. Sadly, the tooling just isn't there (e.g. CLI utilities for validating everything) and there are far more file formats than most people want to concern themselves with.
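Some of that can still be scripted with generic tooling: record a digest when the backup is written, then re-hash the archive, prove it decompresses, and only then call it good. Everything below (file names, manifest format) is made up for illustration:

    import gzip
    import hashlib
    import json

    def sha256(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_backup(archive: str = "notams.db.gz", manifest: str = "manifest.json") -> bytes:
        # 1. The archive on disk still matches the digest recorded when it was written.
        with open(manifest) as f:
            expected = json.load(f)["sha256"]
        if sha256(archive) != expected:
            raise RuntimeError("archive no longer matches the digest recorded at backup time")
        # 2. The archive actually decompresses end to end.
        with gzip.open(archive, "rb") as f:
            restored = f.read()
        # 3. Still missing: open `restored` with whatever normally reads it (database
        # engine, PDF renderer, video decoder) before declaring the backup usable.
        return restored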
> In the past years, I've grown increasingly concerned about backups. To me, it feels like whatever you back up should be validated, before it's considered "okay".
I've been trying to toot this horn at every place I've consulted/worked at, with the most common response from software engineers being: "We're using the AWS backup system of the database, of course it works, no need to test it"
People have way too much trust in systems overall, especially ones they have no inside information about. And people also seems to default to "Of course it works" without testing it, instead of "How would I know it works if I haven't tested it?" or even "I don't know".
Seems like it would be prohibitively time-consuming. For a moderately large data set (of, say, a few hundred TB), by the time the backup of your data is vetted and validated, it's months old and effectively useless.
Thanks! I haven't seen this before, though I have heard these 7 general principles. My personal strategy is broadly:
1. daily encrypted restic backups to my local NAS, running ZFS raidz2
2. weekly rsyncs of the backup directory to an offsite machine
3. yearly full offsites of the entire NAS
I test my backups frequently if only because I have some odd persistent bug that eats my zsh history file about once a month. I now have a "recover-latest-history" script that pulls that file back down from the latest restic backup. I have just completed an offsite and exchanged it for last year's, so I can test that now too!
Well I would say that you must ensure that using the backup will solve the problem before choosing that solution. This implies that you must take time to analyze the source of the problem first.
What’s even more hilarious is expert professionals being openly derisive and spreading stereotypes about others based on programming language preferences.
Do you hate JavaScript programmers?
Do you also agree with your friends that people of a different color are all the same?
Do you feel the need to call out other people’s sexual orientation, at least behind their back when only your friends are listening?
Some of your friends might be JavaScript programmers, you know.
Are you aware that an Amtrak passenger train was stuck in South Carolina for 39 hours this month? Some of the passengers felt they were being held hostage. At least at an airport you can walk outside and catch an Uber.
There’s a reason for that. Freight has priority because the freight companies actually own the rails, and actually use them, and it’s a significant contributor to the US economy. Amtrak is an unwelcome guest that they are required by law to put up with. If they wanted better service, Amtrak could pay for it, except the big cross country routes are already crazy unprofitable.
The actual sane solution is to shut down those routes and run buses. American heavy rail outside the NEC is better suited for freight anyway. Build new dedicated corridors for fast passenger services, or stay home.
Your comment gives the impression that Amtrak has no business operating on the tracks they do.
The reason the freight companies own those rails is primarily the land grants made by the federal government, which came with obligations, such as providing passenger service. Amtrak has trackage rights because the freight companies wanted to divest their passenger rail operations. Maybe an unwelcome guest, but essentially a former part of their own operations.
Anyhow, buses are an insufficient substitute for passenger trains and already available from other operators. If they were sufficient people wouldn't be on the trains.
Yeah. They gave railroads the initial land — in 1872. There’s been a lot that’s happened since then, like the bankruptcy and near-complete collapse of functional passenger rail transportation in most areas in the 1970s-1980s, as it faced stunning new competition from cars and planes.
But the argument of “is this justified given the history?!” should take a back seat; certainly Congress is able to force the industry’s hand whether or not it’s justified. The first question should be whether it’s a good idea in the first place, since rail is doing useful things for the US economy, which would ultimately shoulder more costs for it than the railroads themselves as a business.
Damaging the supply chain and raising prices across the economy while putting more trucks on the taxpayer-funded roads emitting more carbon dioxide is a steep price. Incremental improvements to the reliability of seldom-used cross-country routes through the sparsely inhabited West at speeds of about 60mph aren’t worth that price.
Well, another option is that the government could pay for Amtrak’s track the way they do for roads. We don't make car manufacturers or taxi/bus companies pay for the roads, so I don't see why it's on Amtrak to pay for their own rail.
On the highways, the government has taxed motor vehicle fuels at the federal and state level (with occasional infusions from the general fund) and occasionally set up tolls for specific projects, generally setting rates higher for more intensive use (trucks). Thus bus and truck operators pay for use of the highway, perhaps occasionally joined by taxpayers generally.
By contrast, on the railroads, Burlington Northern Santa Fe or whoever purchased the rail from the decrepit husk of bankruptcy that preceded it, and upgraded it, and installed all the new safety equipment, and maintained it over time. The capital involved here is, by and large, private.
So there shouldn’t be any surprise that the situations are different.
I don’t know why you want to drag in manufacturers, though. It seems to muddy the water.
> Build new dedicated corridors for fast passenger services
If only. One of the best investments for NA would be an actual HSR network outside freight. California is trying, and there are so many people doing everything they can to stop it. Including Musk inventing an impossible alternative he never intended to build [1], hyperloop, because HSR would compete with Tesla and hurt his sales.
Doesn't Amtrak pay for that? I have a feeling if there was an effort to shut them down and replace it with buses, the freight companies would block it. They benefit from the relationship.
This feeling is grossly at odds with the reality: hostility between host railroads and Amtrak is well known as the order of the day. Amtrak pays a well-below-market sweetheart rate, for trains that often miss their slots and snarl operations. No host railroad would miss them.
Half a dozen news stories about passengers getting trapped on planes that were sitting on the tarmac from the past 30 days alone.
I'll take being "trapped" on a train that has a cafe car, numerous bathrooms, power outlets, likely cell phone service or wifi, plenty of room to walk around, and considerably more leg room and seat-reclining...over being trapped in a metal tube, crunched into a tiny seat, with a bathroom that probably won't function past a few hours, limited food, no power, and nowhere to get up and walk around.
Not to mention, if you're anywhere near civilization, if push comes to shove: you have at least some possibility of being able to just leave. On a jet airliner in an airport, you are completely trapped.
Federally airlines should be required to deplane passengers after a certain amount of time, or immediately if the plane becomes too hot/cold, runs out of water, or the bathroom stops working....and the flight crew criminally punished if they don't. But that will never happen because of airline industry lobbyists.
That's terrible. On the other hand, airports for me are guaranteed to be awful, and I've always enjoyed the Amtrak. If I'm in a hurry to get someplace I'll take a plane, sure. I wish we had high speed rail in this country (or I'd never fly again except internationally) but of course we don't. If there's no particular rush I'll take my sweet time on the Amtrak, and if it's delayed for 30 hours - like I said, you'll find me minding my reading with no complaints.
in some countries in Europe you actually can do that too from trains (and not only in Romania, where a family crossed the tracks (2 <4 y.o. kids, 4 pieces of luggage) to conveniently enter the parking lot, where their car was located)
Well, sometimes it's because you already boarded and started to taxi and there are so many planes on the ground that there are no gates open to get back to the terminal so you spend hours on the plane waiting for a slot to open so you can get off, meanwhile the galley runs out of food/drink and the toilets fill up. Then when you finally get into the terminal, it's chaos, no one knows when planes will be flying again so you're not sure if you should stay there or try to get a hotel (which fill up quickly from all of the stranded travelers)
It was United or Alaska and was a few years back and fortunately it wasn't me on the plane, it was my wife - a big east coast storm and airport closures diverted her flight (along with a bunch of other flights).
More or less. The chairs are hostile to stop you from getting too comfortable. I've had 6-hour layovers before and had to sleep in those terrible chairs under fluorescent lights. On the last flight I took there was barely any food near my gate; there were like 4 bars instead. You can't leave your luggage anywhere, so you're dragging it all over the place. You get jammed like sardines into the plane, and you can't stretch your legs. You can't really relax because you have to stay abreast of announcements, in case they change your gate or cancel your flight or what have you. Everyone is on edge. Et cetera.
On the train there's more room, you can move around, and the atmosphere is more relaxed.
Some people like planes. I saw an HN comment where someone said they took lots of "flights to nowhere" over COVID and used the plane as their office. I don't understand that at all. But to each their own.
It's hard to agree with that. There are some amazing train routes through Austria, Norway, Switzerland, Germany that keep me looking out of the window the whole way.
I love the views from the train too, but I feel like the view flying is something magical everyone should experience at least once in their life.
The ground fading away, the tiny cars, glittering rivers, seeing it in reverse for landing. That is something I'll wax poetic about. But if I could take the train the rest of my life I would.
Clearly the middle ground here is to bring back zeppelins. /s
No sarcasm needed for zeppelins: the idea of gliding over the landscape low enough to hear through the openable window what's happening below fast enough to be useful for a domestic voyage, and in reasonable comfort: What's not to like?
I was waiting for a Ryanair flight from Friedrichshafen when one of the Zeppelin NT craft flew past, heading over the Bodensee towards Switzerland. It looked simultaneously classic and futuristic, the sort of shiny utopia future of 1950s sci-fi book covers. I envied HARD.
The expectation of speed at the airport, because planes move fast, creates urgency for everyone. When you're on a ship or a train there's a time component to your expectations of travel. A delay in an airport feels worse than a delay at a train station or at a port.
Final decision vests with the TSA agent at the desk.
But there are needles which are short metal/plastic parts, with a long flexible back (eg for circular knitting) which are very unlikely to cause an issue: the point is smaller than a pen.
My seat on the Amtrak Cascades in September was more comfortable and had about 6 inches more legroom than any airplane I’ve ever flown on. First or Business class on an international flight will be way better than a roomette on Amtrak however.
I remember a time in the mid 2000s where a 20 hour delay was the norm for the northbound Coast Starlight. I used to catch the previous day's train from Salem to Seattle...
They do have an odor to them, especially since most people tend to boil them to death but I can think of a hundred things more offensive in a confined space.
I never had an egg that smelled unless it was bad (which too hasn't actually happened to me). My sense of smell is pretty good, I smell all kinds of little things (incidentally, I think that loss of sense of smell can be an early signal for approaching death, if there is no non-deadly cause to explain it).
I read the replies here: https://www.quora.com/Why-do-boiled-eggs-stink-but-scrambled... -- but I've never seen this in five decades of egg eating. I think I would remember a stinking egg (would make me more reluctant to make more for quite some time)? What does "overcook" mean in this context? I've done up to a little under 10 minutes for some larger eggs.
I think there may be something very wrong with the food, and subsequently the bodies, of your mass-incarcerated and terribly treated chickens. I did see some funny eggs (inside colors) - that I refused to eat - in places that bought the cheap mass-produced eggs. My mother also gets eggs with unnaturally deep colors; I think that's something they must have added to the chickens' prison food.
That's an issue I have with discussions - but also with studies - about this or that food, for example "meat is bad", but anything really, tomatoes or potatoes too. Each food type has a vast spread, and everybody just uses the name of the food but people are all talking past one another because they have vastly different actual food in mind.
It’s common to over-boil eggs, which results in a green iron sulfide patina on the yolk. Some people vehemently prefer that. We had some friends visit us and they took that on a plane… Yikes!
"The source said the NOTAM system is an example of aging infrastructure due for an overhaul."
The concept of "linting" a file and the concept of verifying a backup are truly ancient, and coincidentally both are completely absent from the description of the problem. People who felt no need to do either 20 years ago are certainly not going to start doing it today, especially when the inevitable system failure in the distant future results in yet another lucrative replacement contract for system 3.0.
I think you're misunderstanding what the "backup" was in this context (in particular as the article is unhelpful by calling it a backup file, which isn't accurate).
In this case it is two PRODUCTION systems running concurrently. The primary and the secondary (article calls the "backup"). Primary went down due to corruption, but the identical secondary system couldn't be switched to because the corruption also occurred there.
No, it's the RAID vs Backup distinction. If you have 2 disks in Raid 1, you have a system that can survive one drive dying (high availability), but you also need a backup system (offline).
It sounds like their high availability system failed but they probably were able to restore from backup (offline)
>Due to temporary lack of access to Internet and malfunction of the electronic document flow system of Rosaviatsia the Federal Agency for Air Transport is switching to paper version.
>“The document flow procedure is being determined by the current records management instructions.
>“Information exchange will be carried out via AFTN channel (for urgent short message) and postal mail.
>“Please make this information available to all Civil Aviation Organizations.”
I recall years ago in the mid 2000s a moderately sized US ISP that sold internet access in office buildings, mostly to businesses, had a rather interesting outage. They had switches (Extreme switches - known for their purple color, aka Barney the dinosaur switches) in the basements of many buildings, backhauled to a colo and the rest of their network.
They were doing some mass network upgrade during the early-hours maintenance window and devices weren't coming back. I don't think they had OOB access to their boxes, and it required someone going out and recovering them after hours of downtime and getting techs there.
The culprit? They downloaded the OS image using FTP and forgot to set binary mode. Various other network kit vendors I recall would do some level of validation (and anyone downloading should be checking the hash). RCA sent to customers was vague and just said it was a corrupted image / failed upgrades.
I had a boss purchase those switches without consulting me, he liked the color. They were pure garbage. Thanks for reminding me of that, I'll have to ping a former co-worker and have a laugh.
> The culprit? They downloaded the OS image using FTP and forgot to set binary mode. Various other network kit vendors I recall would do some level of validation (and anyone downloading should be checking the hash). RCA sent to customers was vague and just said it was a corrupted image / failed upgrades.
I've always been paranoid about this sort of thing when applying BIOS or firmware updates, despite those almost always having checksum validation, but I guess my caution is not unwarranted.
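The "check the digest before you flash" step is small enough that there's little excuse to skip it. A sketch; the filename is a placeholder and the expected digest would come from the vendor's release notes:

    import hashlib
    import sys

    def verify_image(path: str, published_sha256: str) -> None:
        """Refuse to install an image whose digest doesn't match the published value."""
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()  # firmware images fit in memory
        if actual != published_sha256.lower():
            sys.exit(f"digest mismatch for {path}: got {actual}, expected {published_sha256}")
        print(f"{path}: digest OK, safe to proceed")

    # verify_image("switch-os-1.2.3.img", "<sha256 from the vendor's release notes>")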
I was recently rewatching some old Taleb talks about fragility. Our software systems are extremely fragile. I wonder if there is any way we can make them anti-fragile? What would this look like?
The telecom industry ran into similar issues and developed Erlang.
I think a big takeaway from it is that designing systems which are failure-free is a fool's errand - no matter how hard you try, you can never get rid of 100% of the bugs.
Instead, make it failure-tolerant: sooner or later every part of the system will break, so it should be constructed in such a way that it can gracefully recover from failures, and even operate with some parts of it unavailable. Crashes are expected, so the system is designed to handle them properly.
I think they are. When a crash happens, the cause is identified, we patch the bug, and redeploy the system. After that, the software cannot fail for the same reason again—it's less fragile than before.
A fragile thing is like a wine glass. Once it breaks, it cannot be restored to its original state and especially not made better than before.
However, if you're talking about patching the system while it's still running, check out "Stop Writing Dead Programs" from Strangeloop '22: https://youtu.be/8Ab3ArE8W3s
TDD helps a lot against fragility imo. The main issue is explaining why your project takes so much longer as your tests force you to work out every edge case.
A long time since I read it but I think he gave examples like mithridatism, poisoning yourself to build up resistance. Anti-fragility is about incorporating change as an expectation instead of seeking stability and fearing it.
For software, that would mean practices like Chaos Monkey. If your production system stays up while an external process is constantly killing processes and deliberately corrupting memory and files then you have good confidence of riding through unexpected failures.
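In miniature, that's just a loop pointed at something you claim is resilient. A deliberately tiny sketch, nothing like the real Chaos Monkey; `get_worker_pids` is whatever enumerates your workers:

    import os
    import random
    import signal
    import time

    def chaos_loop(get_worker_pids, max_interval_s=60):
        """Periodically SIGKILL one random worker; if users never notice, supervision works."""
        while True:
            time.sleep(random.uniform(1, max_interval_s))
            pids = get_worker_pids()
            if not pids:
                continue
            victim = random.choice(pids)
            os.kill(victim, signal.SIGKILL)
            print(f"chaos: killed worker {victim}; the service should recover on its own")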
Actually, it would've been pretty reliable if it ran on an IBM mainframe. That's their entire selling point.
There are two fundamental philosophies in fault tolerant systems. One is designing fault-tolerant hardware and running non-fault-tolerant software on it. This is what mainframes do. Practically any component of a mainframe can be hotswapped without shutting down the OS.
The other is designing fault-tolerant software and running it on non-fault-tolerant ("commodity") hardware. The latter is so popular that it's pretty much the default now, but it's not the only way of doing things.
> Actually, it would've been pretty reliable if it ran on an IBM mainframe.
How would an IBM mainframe help you with a corrupted database file? I understand that reliable hardware makes the corruption less likely to happen for hardware reasons, but it can also be the result of a software bug, or some unexpected and not correctly checked input.
From the article and comments I don't understand - are they talking about backup (copy of data) or backup system (nuke your primary system, failover to backup system and just keep working) ?
I think it is a SYSTEM
> It has a backup, which officials switched to when problems with the main system emerged, according to the source.
> Officials ultimately found a corrupt file in the main NOTAM system, the source told CNN. A corrupt file was also found in the backup system.
Perhaps the backup system is just using data from main system which currently is compatible but won't be in the near future?
Still there should be data copy somewhere, right (with corrupt data...)?
I am starting to wonder if this "backup" is an online log replica of the production system.
Failover doesn't work out in the situation where you are replicating trash. A "reboot" from an actual backup/snapshot would be required if you ate a bad log stream.
>In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system – a significant decision, because the reboot can take about 90 minutes, according to the source.
How often is the system rebooted? I haven't heard of it happening before and some quick searching didn't find any historical examples. Is it a scheduled event and no flights have departure times while this maintenance is taking place?
I've found that issues tend to manifest on production systems with infrequent restarts.
It sounds more like there are a pair of systems where one operates as the primary and the other as a backup, and these can swap in case of problems with the primary. That's fairly typical in critical infrastructure.
I've done testing on these types of systems in the past (carefully) and the owners will often let you test against the system that's currently the "backup".
So they probably restarted the "primary" after performing a fail-over.
I wonder if the corruption could have been caused by a cosmic ray bit flip. It would be funny to think that one single tiny subatomic particle knocked down our entire flight system.
Cosmic rays are a cop out but a good plot device. They're so rare, but one could write a bit flip into a script for a novel or a tv show as the solution to any computer mystery.
Cosmic rays are more common than you think. Google's early infrastructure was impacted by a supernova (because their nodes were so cheap). But something like NOTAM can handle these single bit flips without a problem.
You can expect 250 or so cosmic ray events per second in a 42 litre sodium iodide crystal pack at 100m above sea level.
Source: 10 years airborne geophysics, radiometric calibrations.
Addendum:
In-flight upset 154 km west of Learmonth, WA 7 October 2008 VH-QPA Airbus A330-303 [1] was a probable (but uncertain) example of cosmic ray events causing multiple spikes in one of three air data inertial reference units (ADIRUs) that also went on to cause a failure mode of the "best of three" reporting system leading to a pitch down in which [2]
> 110 of the 303 passengers and nine of the 12 crew members were injured; 12 of the occupants were seriously injured and another 39 received hospital medical treatment.
HOWEVER .. despite 250 events per second in a 42 litre volume, it took 128 million hours of unit operation to see a failure mode.
It's a lot of billiard balls going through a lot of space and a high bar for "something bad" ( just the right bit flip ) to happen.
One cosmic-ray bit-flip per 256MB per month. Significantly more for computers on aircraft at higher altitudes. I kinda wish I hadn't learned this fact, how fragile everything is. On the other hand, it makes me appreciate the importance of error tolerance and recovery.
> Cosmic ray flux depends on altitude. Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The rate of upsets in aircraft may be more than 300 times the sea level upset rate.
The reason we all now wear our seatbelts through the entire flight is there was a cross-Pacific flight that had a sudden altitude drop where several people were injured from rebounding off the ceiling. When the news reported on it, they reported the cause was believed to have been an undetected massive downdraft that shoved the plane down.
I lost the thread on that story before the investigation concluded the actual cause was autopilot error. It is, in some ways, more comforting to me to know that the issue wasn't novel atmospheric phenomena, but instead relatively-mundane cosmic radiation flipping one packet of data from the sensors to the avionics that the avionics lacked sufficient redundancy to detect or discard. As a result, the autopilot believed the plane had suddenly pitched 90 degrees and drastically corrected to escape stall.
Surprising that only two people in all of the comments here suggest to use ZFS. It could have detected any broken data before it was moved to the backup system.
It seems like every year or so we have some gigantic technology meltdown in this industry.
Imagine something like this happening at the NYSE, CME, et al. Or, simply think about the last time you heard about a nationwide credit/debit card outage...
Why can't we have our national infrastructure systems running at least as reliably as the Amex network?
These systems are all information clearinghouses at the end of the day. If we have matching engines that flawlessly process millions of trades per second every day and mainframes that provide resilient source of truth, I think we could consider the same for a life safety critical system as well.
All of those systems have outages all the time, mostly isolated incidents when it comes to NYSE, CME and similar (here is the outage page for NYSE for example: https://www.nyse.com/market-status/history). Visa, Mastercard and others have nationwide outages from time to time too, they're in no way invincible like you seem to imply. Latest Visa outage I can remember must have been around 2018/2019 sometime, but probably it happened later than that too, except I didn't notice it then.
Makes one think if they should have hourly snapshots combined with integrity checks of the snapshots (and the main database, assuming it is feasible to run that for an online database). They will know within an hour if a snapshot is corrupt + they can restore from the last known good snapshot and not lose too much data and be up relatively quickly. Obviously like most software, there is a bunch of "it depends" but would be interesting to know more about the system.
I have a feeling the combined age of the hard drives in that system exceeds the age of the United States itself (which is 247 years). When fsck was introduced in 4BSD in 1980, it checked every filesystem on boot because it had no better idea. If this thing is, say, forty years old then that's exactly the right age for this...
Obviously it was due to details being entered using the now-standard international foot when the software expected the now-obsolete US survey foot [1] .. the discrepancy across a transcontinental flight was large enough for a pilot to fall through.
> In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system -- a significant decision, because the reboot can take about 90 minutes, according to the source.
90 minutes; this certainly appears to be an advertisement about Windows OS.
I worked for an organization in the 1990s that ran a legacy IBM mainframe (later replaced with an AS/400). After an unplanned power outage, the disk confidence test and rebuild took over 1/2 day.
It sounds like the kind of problem that blockchains were invented to prevent. Important systems should not blindly replicate data, they should validate it.
Ok so it wouldn't be a public blockchain. The point is that when you replicate data you reject it if it doesn't conform to some ruleset. That way data that would break the system fails to replicate, and the scope of the problem is limited to whoever (or whatever) entered the bad data.
Data validation is a basic expectation in many technology stacks. I don't understand why blockchain would be uniquely positioned to resolve this problem.
Accident or not, there are zero situations where an op/drill or a real APT hack would be officially ack'ed. Plausible deniability is a requirement of many centralized control systems.
AFAICT, this NOTAM system is a nationwide bulletin board, using some cryptic standard abbreviations (to save space as if they were paying 1990s SMS rates), usually filtered by locale/coordinates/path, so pilots have the latest news that might affect their flight plans.
Has anyone seen exactly how many NOTAM messages are generated per day, and how long of a look-back is required?
From 50k feet, it looks like something that could be replaced with a cryptographically-authenticated massively-replicated virtual data structure that's oblivious about, and robust against all sorts of failure in, the exact systems, languages, update-paths, etc used to keep it in sync or implement any one user's view.
All that pilots need to know is: "I have a full local copy, signed by the right update-authorities, as of roughly-now."
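That last check is genuinely a solved problem. A sketch with Ed25519 via the `cryptography` package; the key handling and bundle format here are invented, and in reality only the public key would ship to clients:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Stand-in for the update authority's key pair.
    authority_key = Ed25519PrivateKey.generate()
    authority_pub = authority_key.public_key()

    bundle = b"NOTAM snapshot 2023-01-11T06:00Z\nA0123 RWY 09/27 CLSD\n"
    signature = authority_key.sign(bundle)

    def have_valid_local_copy(bundle: bytes, signature: bytes) -> bool:
        """The only question the client needs answered: does the snapshot verify?"""
        try:
            authority_pub.verify(signature, bundle)
            return True
        except InvalidSignature:
            return False

    assert have_valid_local_copy(bundle, signature)
    assert not have_valid_local_copy(bundle + b" tampered", signature)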
From a glance, NOTAM's uptime this year looks worse than Ethereum, but a bit better than Solana or Binance Smart Chain.
This is great stuff, thanks! Can't shake the feeling a tiny team of professional modern software/system designers, paired with some aerospace old hands, could create a far-better (but also backward-compatible) system in short order.
> using some cryptic standard abbreviations (to save space as if they were paying 1990s SMS)
When I did my initial pilot training in the late 1980’s, the codes for METAR, TAF, and NOTAM had already long been in place. It was explained to me at the time that the encoding was a practice that dated back to its origins in the teletype era. I suppose the limited baud rate of these devices meant that economizing on symbol density was a good idea. I’d still much rather read these succinct formats because it’s easier to chunk it at a glance.
The abbreviations way pre-date SMS - they were standardised back in the teletype days, 1940s-1950s, when printing speed was 30-100cps! Now NOTAMs (mostly) comply with global standards so, like VHF AM aviation radio, substantial changes are impractical.
There is significant debate concerning the number (too many) and size (too long) of NOTAMs, but I can assure you that when you get an unexpected rerouting on a dark, bumpy, busy night, you do not want to be reading pages of plain-text prose! All experienced pilots are comfortable with the cryptic NOTAMs, and are used to scanning the abbreviations for important items :)
> cryptographically-authenticated massively-replicated virtual data structure that's oblivious about, and robust against all sorts of failure in, the exact systems, languages, update-paths, etc used to keep it in sync or implement any one user's view. All that pilots need to know is: "I have a full local copy, signed by the right update-authorities, as of roughly-now."
You've solved one problem and created several more problems.
Crypto-authenticated update & replication works really, really well nowadays. Lots of open-source support, extensive testing, proven track records even in adversarial deployments.
Especially for a simple log-like system, with a limited number of permissioned authorities.
cryptic abbreviations in the NOTAM system reminds me of telcos using CLLI codes for unique geographical locations of physical telecom infrastructure sites - a legacy of 1960s mainframe stuff where the number of characters in a database field for text entry was extremely constrained.