You all seem to think this is similar in value or operation to a web app. It is not. It is a safety-critical system that requires very stringent operational and development guidelines ON PURPOSE. The idea that the FAA shouldn't be risk averse in this system is absolutely ridiculous. The complexity of operating the airspace of an entire nation is nothing to scoff at and the importance of the NOTAM system should not be minimized in any way. This is not some government corruption thing. There are hundreds of thousands of lives at stake every day.
I, for one, appreciate the work that the FAA does to keep us safe in the air and appreciate that this was handled appropriately. Everything breaks. It's just a matter of how and when and what we do with it when it does. The FAA handled this outage appropriately and in a timely manner. I feel for the engineers who had to work on this incident.
You've got it exactly right. There are a lot of people here who are completely deluded into the "move fast and break things" mindset not realizing that sometimes you really do not want to move fast, because you REALLY do not want to break things. A corrupted file throwing up panics like this is a good thing, because you don't want corrupted files to pass through like everything is fine.
If the corrupted file is in the backup, it DID pass through like everything was fine. What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.
> What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.
It is possible to have all of those mitigations in place and still experience a failure like this.
Post deployment validation is only as good as the validations executed. 99% coverage still leaves the door open to failure.
A DR strategy is just that - a strategy.
A failure of this sort is not an automatic implication that those things do not exist, just that they failed in this particular case.
I would find it incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident if none of those things were in place.
They’d be either incredibly lucky, or incredibly competent, and if they are the latter, they would not operate without such mitigations in place.
It seems far more believable that an organization of the FAA’s age and complexity missed something along the way.
> incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident
I'm not surprised. The FAA does not fly each plane. Government organizational complexity helps ensure the government organization survives through the next round of Congressional appropriations.
Org complexity + opaque oversight + 'safety' + 'homeland security' + taxpayer funded = playing around and more budget.
The pilot is responsible for safety. Air travel has rules to avoid collisions (eastbound gets altitude levels different than westbound, pilots shall broadcast on known frequencies) and pilots have distributed intelligence to keep their flight safe.
Yes, somehow there needs to be coordination of runway use, but there are many ways to provide reservations and queuing.
We can make excuses all day long. A simple query of the database/table would have produced an error. Sure, the FAA does some complex stuff, but the tech I see in airplanes looks ancient. I'm willing to bet most of the FAA complexity comes from budget (lack thereof) and old computer systems.
This has nothing to do with excuses - I’m challenging the assertion that “because something bad happened, they must not have any mitigations in place at all”.
This seems like a bad case of binary thinking, and my point was that the occurrence of an incident like this is not sufficient to support that claim. It’s just as likely that an ancient process that wasn’t accounted for somewhere in the architecture broke down, and this is how it manifested.
Clearly improvements are needed, as is always the case after an outage. That doesn’t justify wild speculation.
Anecdote time: I once worked for a large financial institution that makes money when people swipe their credit cards. The system that authorizes purchases is ancient, battle tested, and undergoes minimal change because the cost of an outage could be measured in the millions of $ per minute.
Every change was scrutinized, reviewed by multiple groups, discussed with executives, and tested thoroughly. The same system underwent regular DR testing that involved quite a lot of involvement from all related teams.
So the day it went down, it was obviously a big deal, and raised all of the natural questions about how such a thing could occur.
Turns out it had an unknown transitive dependency on an internal server - a server that had not been rebooted in literally a decade. When that server was rebooted (I think it was a security group insisting it needed patches despite some strong reasons to avoid that when considering the architecture), some of the services never came back up, and everyone quickly learned that a very old change that predated almost everyone there established this unknown dependency.
The point of this story is really about the unknowability of sufficiently complex legacy enterprise systems.
All of the right processes and procedures won’t necessarily account for that seemingly inconsequential RPC call to an internal system implemented by a grizzled dev shortly before his retirement.
And then you find an obscure service doesn’t come back up on the 10,000th or 100,000th reboot because of <any number of reasons>. And now you have multiple states, because you have to handle failover. It’s turtles all the way down.
It’s always easy to say that in hindsight. But keep in mind this is an environment with many core components built in the 80s. Regular reboots on old AIX systems weren’t a common practice - the sheer uptime capability of these systems was a big selling point in an environment that looks nothing like a modern cloud architecture.
But none of that is really the point. The point is that even with every correct procedure in place, you’ll still encounter failures.
Modern dev teams in companies that build software have more checks and balances in place from the get go that help head off some categories of failure.
But when an organization is built on core tech born of the 80s/90s, there will always be dragons, regardless of the current active policies and procedures.
The problem is that the cost to replace some of these systems was inestimable.
I don't even know if I'd call this a disaster recovery fail. Depending on what they meant by "corruption", a roughly 6-8 hour turn around time is not awful for a database restore.
People like to think that the alternative to "move fast and break things" is "move slowly and not break things" but it's not, it's "move slowly, break things anyway, then take days to resolve the problem because you never learned how to move fast".
You act like "moving fast" is all you need to know to "move fast". As if it's simply the skill of making time move faster, and you don't need any other skills than that, because once you've broken the laws of physics and changed the speed of time, everything just works faster without any differences or consequences. Do you watch a lot of Superhero movies?
You should work smarter, not harder. Just turn up your smart knob. But why didn't you ever think of that before? Probably because you had your smart knob turned all the way down.
Frantic is often a natural outcome of “moving faster” when the environment one is moving in is not conducive to that speed of movement.
In my experience, this tendency towards frantic is multiplied the larger and more complex the organization and architecture becomes.
The entire point of “move fast” in software circles is to leave behind the constraints of legacy tech and management practices in favor of building something “better”.
In a mature org that grew up before these ideas were mainstream, maybe one or two teams can manage to move faster, but invariably they end up depending on other teams, who in turn depend on deeply ingrained and established company culture and procedures.
We can talk about why those impediments are a Bad Thing, and I wouldn’t advise a consumer startup to adopt those methodologies in 2022, but there’s still the harsh reality that where they exist, “just move faster” doesn’t help much more than telling a depressed person to “just do cardio every day”. There’s often a lot of inner work that’s gotta happen to make way for the new.
The only way I’ve seen this sort of work in a large org is when a brand new “emerging tech” group is spun up and given autonomy to work outside of the legacy norms. This is not perfect either, and seems much better for greenfield projects. When applied to deeply entrenched legacy systems, all of the problems mentioned above come to a head.
This also creates a weird in/out group dynamic which tends to further stratify the old tech and widen the gap between the old practices and the new.
In the context of this particular conversation though, I think the concept of “move fast” has lost all meaning and has little to offer for an org like the FAA.
When I hear "move fast and break things", I am reminded of a guy that I worked with. He delivered work fast... full of bugs... zero planning... and his daily ritual was to just keep patching the "pile of shit" he put together.
He has now moved on and we sometimes chat, he works for a huge corp, still sucks at writing SQL.
And as I said, people think the alternative to "fast, full of bugs, 0 planning" is automatically "slow, no bugs, lots of planning" but it's often "slow, lots of planning, just as many bugs"
When you are in a complex spiderweb you simply cannot move fast. If you are moving fast you are not looking at everything and it will blow up in your face.
The worldwide air traffic control system is not simple or easy to understand and iterate on. And that's not because it was designed by idiots, or that you're so much smarter and more experienced and a vastly better programmer than the combined efforts of everyone in the world working on air traffic control, as you seem to be implying from your comfortable armchair.
Are you proposing the entire world simply give up air travel, because government regulations and industry standards and the laws of physics and chaos theory prevent you from having the simple easy to understand air traffic control system you envision?
Change is the most common reason for breaking things. Moving fast means more broken things, hence the slogan. The alternative is indeed move slow, break things less often. It's a bad strategy when you NEED a LOT of change. But if you don't NEED a LOT of change, and you do need a lot of stability, it seems perfectly valid?
So for the things that don't need a lot of change, what are some characteristics of the system?
Does CI/CD exist? Does CI even exist? Are deployments automated? Is data sanity checked before loading? Is there a development environment? Do things like hourly snapshots exist? Can you easily provision a replacement system from scratch and restore data from a known good snapshot?
Or is every process manual, slow, and error-prone because there's never been a need to move fast?
Look at this one sentence in the article:
> In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system -- a significant decision, because the reboot can take about 90 minutes, according to the source.
So they do a reboot, that takes 90 minutes for some reason, and then that didn't even fix the problem. Their system that needs a lot of stability is now broken.
The FAA has risk reduction baked into its very heart and soul, and this is evident the more you learn about how the system operates (go look at a flying textbook).
It is not a perfect organization, but I think it deserves more credit than it is receiving in this thread.
I guess this is the comment chain where we address the room?
I'm seeing a lot of misuse of the word "risk". The FAA prioritizes safety over mission. The mishap that resulted in downtime affected the mission. The common-cause failure of the backup system affected the mission. That is not evidence that they're bad at managing safety risk.
Given that there was an article published at the time about removing the dissimilar backup, it's probable that they explicitly accepted this mission risk.
The only way to optimize for lowest overall risk is to optimize for speed of change.
All the checklists in the world to prevent something from happening are fine and dandy until something happens anyway (which it will). And then they hamstring you from actually fixing it.
Instead, if you can move fast consistently, you can minimize the total downtime.
> Instead, if you can move fast consistently, you can minimize the total downtime.
In safety critical software where _a_ failure can result in loss of life, is “total cumulative duration of downtime” really the metric we’re optimizing for?
Yep, this is the exact point I tried to make above and got heavily downvoted.
If you can't move fast when things are working well, you can't move fast when things are broken. Acting like moving slow is going to prevent things from ever breaking is just wishful thinking.
Downtime isn’t the metric their procedures are optimised to minimise. It’s optimised to minimise air traffic accidents. Moving fast might minimise total down time (though I seriously doubt that), but what effect would it have on accuracy and reliability? Mistakes mean dead people. In this incident zero people died. You really sure you know you can improve on that?
This demonstrates a gross misunderstanding of how the FAA actually operates in its efforts to address “risk.” It’s far more theatre via bureaucracy and paperwork than actual proven engineering efforts that demonstrably reduce risk.
This is a sweeping claim with really nothing to back it up. Are you saying this from an inside knowledge of the FAA, or is this just an opinion?
On the surface, the relative safety of air travel and the lack of major stoppages over a span of 22 years seems like a major counter example.
You’re making this statement emphatically and authoritatively, though, so I’m curious to understand where that certainty comes from and how it accounts for the other publicly visible properties of the FAA and air travel.
2. The FAA's pilot medical vetting process, while thorough, is behind the times. There are people who took ADHD medicine in high school that are unable to obtain a medical certificate due to the FAA's overly-strict policies on prescription drugs. There are current pilots with serious mental issues who are afraid to see a doctor about them due to fear of losing their medical license (https://www.flyingmag.com/why-pilots-dont-want-to-talk-about...).
> 1. The FAA basically handed their risk-management keys over to Boeing when authorizing the 737-MAX, contributing to those deaths
Couldn’t this also be interpreted as: when the FAA holds the keys, disasters like the 737-MAX tend not to happen? Obviously this raises questions about how that decision came about in the first place, but as an example, it seems counterproductive to your point, i.e. evidence that shifting away from some long standing policies directly led to harm, implying the original policies might have been better ones.
In a thread that seems eager to move fast and break things, this seems like a big problem, and would seem to indicate the need for a return to founding principles, not the opposite.
> 2. The FAA's pilot medical vetting process
This is an interesting one for sure, but also seems like an incredibly complex issue. Have there been studies about the safety of operating machinery while on those drugs that would obviate the need for a policy change?
The potential risk averted by such a policy would need to be weighed against the negative impacts of the 2nd order undesirable behaviors - obviously it’s bad that the policy discourages much needed mental health support, but how bad this is depends entirely on how effective the initial screening process is.
I’m not saying these mental health policies shouldn’t be changed, but neither do they seem to have obvious or measurably better alternatives at the moment.
And taken in the context of the original claim - that people are grossly misunderstanding the FAA and all of this is theater - they seem like weak examples to use as evidence of broad organizational failure.
Regarding #2, while there are a lot of issues with the FAA's medical process, you can absolutely get cleared to fly after having previously taken ADHD medication.
What you have to do is:
1. Not take those meds for at least 1-2 years.
2. Show documentation that getting off the meds hasn't impeded your performance. This generally means showing a stable work history if you've been off them for a long time or documents showing no change in performance between before you stopped taking them and x months after if you recently got off them.
3. Take an FAA ADHD re-evaluation.
Then they'll clear you. It's an annoying process but it's absolutely doable.
The FAA was gutted in the name of deregulation and competition - you might want to ask the GOP what happens when you don't have adequate govt oversight and regulation.
Two planes crashed due to a design flaw - hundreds of people killed and one manufacturer and model forever tarnished, like McDonnell Douglas and their DC-10.
>On August 5, following the PATCO workers' refusal to return to work, the Reagan administration fired the 11,345 striking air traffic controllers who had ignored the order, and banned them from federal service for life. In the wake of the strike and mass firings, the FAA was faced with the difficult task of hiring and training enough controllers to replace those that had been fired. Under normal conditions, it took three years to train new controllers. Until replacements could be trained, the vacant positions were temporarily filled with a mix of non-participating controllers, supervisors, staff personnel, some non-rated personnel, military controllers, and controllers transferred temporarily from other facilities. PATCO was decertified by the Federal Labor Relations Authority on October 22, 1981. The decision was appealed but to no avail, and attempts to use the courts to reverse the firings proved fruitless.
My late friend Ron Reisman worked at NASA Ames Research Center on air traffic control and flight safety, and he hired up a bunch of the professional air traffic controllers who Reagan fired, and taught them to program.
Because it's much easier to teach an air traffic controller how to program, than it is to teach a programmer how to control air traffic.
And we have them to thank for how safe the air traffic control system is today.
>Ron Reisman has BA in Philosophy and Classical Greek, and an MS in Computer Science. He joined NASA Ames Research Center in 1988 as one of the original members of the Center Tracon Automation System development team. Since the late 1990s he has worked on traffic flow management research and development. He is currently supporting the Next Generation Air Traffic System research.
I saw him give an earlier version of this talk at the November 1989 Usenix Monterey Graphics Conference, where he discussed training air traffic controllers to program, and he subsequently gave me a tour of the flight simulators and air traffic control systems at NASA Ames:
>Ron Reisman and James Murphy, NASA Ames Research Center, and Rob Savoye, Seneca Software
>This introduction to air traffic control systems summarizes the operational characteristics of the principal Air Traffic Management (ATM) domains (i.e., en route, terminal area, surface control, and strategic traffic flow management) and the challenges of designing ATM decision support tools. The Traffic Flow Automation System (TFAS), a version of the Center TRACON Automation System (CTAS), will be examined. TFAS achieves portability across platforms (Solaris, HP/UX, and Linux) by adherence to software standards (ANSI, ISO, POSIX). Software engineering issues related to design, code reuse, portability, performance, and implementation are discussed.
Based on what I know about Ron's and other people's diligent methodological work on air traffic control and safety, I feel extremely safe and confident flying, and I find it insulting to his memory and the legacy of his work when the armchair architect ex-Facebook employees on this thread (and the GOP) glibly and patronizingly implore the FAA to "move fast and break things", as if they had no idea how many lives and fortunes are at stake.
Here's a video of Ron showing Marvin Minsky the flight simulator, an early AR headset, and the hydraulic lifts:
“It’s gotten to the point that I never say anything about intelligence in general. I don’t know what it means any more. I used to. But then I started trying to test it. And if you think about it for a while, you don’t know what it is.” -Ron Reisman
The 737 MAX saga has nothing to do with air traffic control.
"The FAA, citing lack of funding and resources, has over the years delegated increasing authority to Boeing to take on more of the work of certifying the safety of its own airplanes."
"There wasn’t a complete and proper review of the documents,” the former engineer added. “Review was rushed to reach certain certification dates.”
Seems like FAA certification was a disaster in the making.
Is this conjecture or based on documented shortcomings of their approach though? Much of this paperwork, especially if created in response to NTSB crash post-mortems, could very well be examples of https://fs.blog/chestertons-fence/ - is there reason to believe otherwise?
you nailed it. Plus the FAA is unable to even perceive that this culture is counterproductive to the goal, nor would they know how to address the problem in any way other than adding more layers of paperwork and bureaucracy.
FAA regulations, more so than nearly anything else regulated in the US, are written in blood. They are not written out of government corruption. Many many many hundreds of people have died in airplane accidents since airplanes were invented, and the reason why your average person can board a commercial flight and act like they just got on the bus (with a /better/ risk profile than a bus on public roads) is exactly due to these regulations. The FAA has and continues to lead the entire world in how to do reasonable and meaningful air traffic regulation.
If they’re so risk averse, why are their systems failing so severely?
It’s absolutely a corruption issue, in that the government prefers to pay 2-3x what they would to solve things internally to contractors who then perform a poor job and lobby to keep whatever they build in place for decades.
> It’s absolutely a corruption issue, in that the government prefers to pay 2-3x
Being serious, I wish I could find one of these 2-3x multiple payouts in government. Every time I've looked at anything government related (including direct contractors) the pay is garbage. Usually 15% to 50% of what the private market pays for my skill set.
The contracting company receives usually 2-3x what the actual developer makes- so if the gov't pays $300K for a developer, they get a $100K developer. Some of that overhead is justified, but a lot of it is the company's profit.
The old answer used to be for the government to hire directly, but that's been hamstrung for like 40 years by now.
At my old firm, our federal project profit margin was a bit lower than median, though once you factor in the sales, contracting, and legal overhead the bottom line was somewhat worse than the project actuals made them look. Since our side of the house did relatively short-burn contracts (6-12 months), federal work was generally not that valuable for us; I used it for filler work when our usual sales pipeline was weak. The real value was for the side of the house that did long-term software and support work or heavy citizen support outsourcing, when contract durations can be measured in decades. Same for a friend who inked a $5bn DOD deal; the margin isn’t great, but it’s a 10-year deal that gives her a stable cash flow basis to grow on.
(Also, OP is probably underestimating the full-sheet cost of a federal FTE, as well as the complexities of fund-based budgeting and forecasting.)
Worked for a govt contractor years ago and I would imagine the payout from the govt contract was at least 2-3x. There were usually 2-3 layers of people getting paid and I don't think any of them were hurting. Of course no employee was getting a massive payout compared to private.
Govt > contractor > sub-contractor > employee was pretty common. I never knew of anyone being a direct contractor to the agency since the contracts were so large and involved a lot of employees.
It's been several years since I've done contracting work (and for the FAA no less), but that can't really happen to the extent you're implying. The open roles for contract positions specify a pay range based on experience (degrees and/or years of experience). You can have a prime and then a sub contractor, but the amount that each can add for management overhead is limited and depends on the contract vehicle. I want to say it was around 5-10%.
The actual employer can pay the employed contractor whatever they want, but the rates are published and if they underpay too much the employee will be poached by a competitor. Because the rates are public info the employee can look up how much profit their employer is making any time they want.
I am not sure if you ever dealt with a very entrenched system that has been in place longer than you have been alive, but it is not easy. It is also not easy to hire to deal with these things. Especially at government rates which are lower than the private sector.
If you were the FAA and started up an internal startup to find only the best to replace systems that have been running forever you would face a lot of problems.
1. Nobody would give you the funding until something like what happened yesterday did
2. Your developers getting paid more than the FAA directors will get lots of political attention
3. Even a great internal engineering team would most likely take ages to do something like this. This isn't move fast and break stuff with a greenfield, it is painful deconstruction and analysis of a very complicated system.
4. Many people won't want to work on this no matter how much you are paying.
It is also out of the wheelhouse of an agency tasked with flight safety. So expensive contractors arise. Do they do a poor job often? Yes. But these are jobs that are really hard to scope and execute. If it is really a case of overpaid contractors coming in and not doing the work, people here should start a startup, hire a SEAL Team Six of 1970s computer system rip-and-replacers, and make a lot of money.
Imagine how bad it would be if Kim Jong Un *weren't* so pro-american. It's a ridiculous hypothetical. Everything can always be worse, that is not a testament to an entity's quality.
There’s a lot to unpack in your claim, and this really hinges on your definition of “effective”.
What do you consider to be effective, and how would you differentiate it from theater?
Right up front, it seems necessary to account for the relative safety of flying if that safety has nothing to do with FAA policies.
It also seems that any safety policy that is effective enough will eventually appear indistinguishable from theater as people become more and more disconnected from the possibility of disaster.
2) the fact the 737 Max disasters happened outside the US was a matter of chance, not exemplary policies by the FAA. FAA policies did not prevent the deployment and roll out of this dangerous aircraft in the United States, and did not lead to grounding of the aircraft until multiple events occurred.
3) I'm mostly talking about software here, and I believe a big part of this issue is shoe-horning software into the policies and procedures specifically designed for aviation. there are so many little pointless (in most contexts) requirements that cause the engineers working on these systems to lose the forest for the trees. the FAA creates its own complexity, which prevents thinking holistically about our systems in a meaningful or effective manner.
in many ways this is a force of entropy. the more lines of code you have to support thousands of requirements, and the more revisions you make to those lines of code without a top-to-bottom refactor, the more likely you are to have insidious bugs pop out like this incident.
Another way to think about it is that safety is good for profits, at least in so far as consumers care about it.
And I think consumers do care quite a bit.
Of course there are arguments to be made for regulation, but I think your statement is overly emphatic.
In theory, if you ask explicitly, of course they do. In practice, they choose the cheapest ticket and maybe avoid airlines that have had a high profile incident recently. They don't have the time or means to actually evaluate an airline's safety culture.
Meanwhile, executive decisions are driven by quarterly earnings reports, and you can cut a lot of corners for quite a number of quarters before your luck runs out and 200 people die.
My statement is, if anything, not emphatic enough.
That doesn't make the statement any less ridiculous.
What was the ultimate reason for the 737 MAX debacle? That airlines want to save money on type rating training.
Look at accident reports, and half the time the airline's safety culture (or lack thereof) is at least a contributing cause.
The FAA may be in many ways dysfunctional, but so are the airlines, and it's sure as hell not them who are pushing for better safety standards, it's the FAA and (especially) the NTSB.
On the other hand, if Boeing actually cared about safety instead of profits they wouldn't have done their utmost to hide the fact that they were avoiding the FAA's safety regulations to improve profits.
Demonstrably flu vaccines do not effectively avert risk. Demonstrably auto safety standards do not avert risk.
But they do. And they've been so incredibly effective that we have collectively forgotten what risk used to feel like, so we're ready to say we don't need these standards/organizations or that they're not working. Obviously this is not to say they are perfect or free from criticism. But it's not just theatre.
> You all seem to think this is similar in value or operation to a web app
This line of reasoning is pure, 100%, cope. There is a reason this was broken and grounded flights nationwide. Providing cover for systems that allowed this to happen is not productive.
There’s a meaningful difference between providing cover and reminding people that major systems like the one in question are in a different category than the average web app.
The fact that such a nationwide stoppage hasn’t occurred since 2001 speaks to the stability of these systems, and I’m not sure what you’re suggesting here.
> There is a reason this was broken and grounded flights nationwide. Providing cover for systems that allowed this to happen is not productive.
I’m not trying to sound snarky here, but things do tend to break for reasons. Are you suggesting that there is a cure?
Earlier today another HN user linked to a PDF from a previous 2018 (circa 2014) investigation that pointed to the "dual-channel back up" system being fragile and likely insufficient.
> ERAM’s original design did not include a dedicated backup system. FAA believed that ERAM did not need one due to the redundancy provided by the system’s dual channel design. This design was intended to prevent outages because it allows for seamless switching between the two channels without impacting air traffic control should a problem occur in the active channel. The Agency believed this redundancy would make it unlikely that a problem in one channel could migrate to the other channel. In addition, to provide additional backup capabilities during ERAM’s implementation, FAA planned to temporarily maintain EBUS, its pre-existing backup system, before phasing it out completely beginning in 2015, to rely solely on ERAM’s dual channels.
> However, problems experienced during and since ERAM’s implementation have shown that the system remains susceptible to dual channel failures. As a result, FAA decided to maintain EBUS much longer than intended because air traffic controllers currently rely on EBUS for backup. However, with FAA’s ongoing and planned upgrades to ERAM, which will span the next 7 years, EBUS will soon become incompatible with the new hardware. As such, FAA plans to begin phasing out EBUS in April 2019, leaving ERAM without a backup system to supplement the system’s redundant dual channels.
A $2 billion dollar Lockheed Martin system where a memory overflow takes down BOTH the main system and the backup system. Surely that's not a signal of fragility.
I spent a lot of time (6+ years) at the FAA writing software. There is a huge culture of process and policy, and not a whole lot of thinking or analysis or actual understanding of the problems they are working on. It is maddening.
Imagine a spreadsheet with 700 lines in it telling you that you need to do A, B, C, D, E, F, G. Each of those lines instructs you to write a document detailing a procedure, with chain-of-custody forms and keys and password rotations, etc. Follow this process, fill out this presentation template, wait 3 months, and then present it to a board who doesn't give a shit or have any understanding of your project.
it's all an insane amount of work and it gets us the opposite in terms of the goals these processes are intended to achieve.
I have zero doubt that the only thing that will happen as a result of this catastrophic incident is another 50 lines in a spreadsheet somewhere telling you to do stuff that nobody will comprehend or implement correctly or even verify until there is a similar incident causing an investigation into it.
Agencies like the FAA are notoriously risk averse. Basically, the motivation of most employees seems to be: "if I mess up once, I get fired; if I don't produce any movement, I can't get fired". Naturally, the output is glacial progress and the introduction of tons of safety-theatre procedures (that get the implementers promotions for "increasing a culture of safety").
We all know this from the subject the FAA regulates: flights. New unleaded gas takes forever to approve. Simple changes to instruments take years of certification, leaving 1960s technology in place when clear improvements have happened in the last 60 years.
I always wondered what would happen if this culture were carried over to another space. We know what it looks like in medicine because the FDA has similar priorities. Rarely do we see it in tech, which is known to "move fast and break stuff". But here we get a glimpse of the dystopian crossover between FAA-procedure-culture and software engineering.
We know what it's like in software because we have e.g. the example of how the Space Shuttle's flight control software was developed and audited. It's weeks of meetings, tests and change processes etc. obsessing over changing a single instruction in assembly.
Which also suggests to me that the problem isn't per-se the glacial progress & approval processes, but that they're using the wrong glacial processes.
If this FAA software was developed with something like the Space Shuttle's process it would still take forever to change something, but at least you'd end up with a function that you could mathematically prove would be able to handle any conceivable input.
I think you're never going to convince the government of a "move fast" culture. Even if the overall cost of grounding planes would be less than the cost of more reliability they'd never go for it.
Too many people's asses are on the line, and they're not having to spend their own money. The only thing they have to "pay" for is possible loss of face, or loss of political capital, both of which can be insured against effectively for free with taxpayer dollars.
But you might just be able to convince them that they're using the wrong sort of bureaucracy. You'd still spend a billion or two on something that should cost a million, but at least you'd get actual reliability as a result.
The NASA Space Shuttle flight software was among the best ever developed. There was never a defect that impacted safety.
But that team had kind of an "unfair" advantage in that were able to program in assembly code on bare metal with no real software stack. Whereas the rest of us are forced to build on a foundation of sand using multiple layers of low-quality third-party software in order to deliver any useful functionality.
> But that team had kind of an "unfair" advantage in that were able to program in assembly code on bare metal with no real software stack
The Space Shuttle’s avionics software was not written in assembly, but rather in HAL/S, a high-level language invented for the project. Assembly was mainly used for the custom real-time OS kernel. They also maintained their HAL/S toolchain, which was written in XPL, a PL/I dialect that was popular for compiler development in the 1970s. The development environment ran on IBM mainframes, and the main CPUs on the Shuttles were the aerospace derivatives of the IBM S/360 mainframe architecture, System/4pi, model AP-101. The same CPUs were used by the USAF (e.g. the B-1 bomber), but the USAF mainly used JOVIAL to program theirs. Another big user of JOVIAL was the FAA, who used it to write a lot of their original mainframe-based air traffic control software (FAA HOST).
The Space Shuttle team inventing their own programming language was a byproduct of the time the project started (1970s). If they’d started a decade later, they probably would have used Ada instead. But Ada didn’t exist yet, and they thought inventing their own language was a better choice than JOVIAL
It sounds less like risk averse and more like cover-your-ass though; go through long checklists and committees and people so that no one person can be held responsible for any problems.
The closest thing you'll see in tech will be at back-end software like banks, insurance companies, pension funds, investment companies, embedded engineering / SCADA, ERP systems, etc. The "move fast and break things" mindset seems to mainly be a thing in Silicon Valley internet companies / start-ups, and mainly the latter because they're driven to pump up their own value instead of provide reliable software. Because if Twitter is down, it's an inconvenience, but if airplane software fails, lives are on the line.
> Agencies like the FAA are notoriously risk adverse.
There is a difference from risk averse ("we require heavy testing before deployment") and dysfunctional ("we are terrified of breaking anything but are unwilling to invest in maintenance").
Take the Air Force. Their risk decision is: can we accomplish a specific mission at hand with ideally minimal loss of warfighter life. If they don't maintain their planes, they cannot achieve the warfighter life loss minimization.
Similar with IT generally. Kicking the can down the road just grows the problem.
yeah working for government is strange. I'm in a position where the product owners don't want to make any enhancements/changes to a production system out of budget concerns. However, the actual invoice for my team's time is the same whether they make changes or not. They don't want to "spend the money on enhancements to a functioning system" but the invoice amount is the same every month regardless.
oh and in my experience, government FTEs can screw up an infinite number of times with no risk to their job.
The early aviation industry had a very substantial "move fast and break stuff" mentality. YouTube has plenty of videos of those folks. However "break stuff" usually meant smearing aircraft and bodies all over the landscape. Much of FAA's regulation is to try to prevent killing "too many" people. If the engine on your car stops, you can usually pull over to the side of the road. If the engine on your aircraft stops, you're going to be landing soon and hopefully not landing on a building full of people. And if you're really lucky when the engine stops, all the people sitting in the back can walk away.
The FAA is frequently called a "tombstone agency" - who only act long after fatal accidents and the bodies have been buried.
When that risk-averse culture is carried over to another space, it just fails and never gets noticed. Risk-averse organizations get outcompeted by more efficient risk-tolerant ones, every time that such competition is possible.
Risk aversion only works for an entity that has a forcible monopoly in its space. The FAA and FDA do. Another is the Nuclear Regulatory Commission, which exhibits the same behavior: their job is to prevent accidents, and the surest way to do that is to never approve anything at all.
You're right; by "mess up once, get fired" I really mean that if a flight crashes and the FAA could have done something, they'll get political flak. High level political appointees may need to resign. Whereas if an agency does nothing for a term, the political leaders get to stay.
For routine FTEs, as a sibling post mentioned, you can probably mess up in a lot of ways (short of murder or being really non-PC) and still keep your job.
This is common for government agencies. I worked at Labor and your second paragraph hit very close to home. The only plus side was the insane amount of free time you spent waiting. I tried to go back on the civilian side of the contracts and I nope'd out of it as soon as I hit red tape
All the federal employees I have met are EXCELLENT. It is very difficult to get hired into one of the dwindling jobs at the agencies. A lot of the government has been contracted out. Having worked at a federal IT contractor, I can say that in my experience most of those workers are very good and dedicated to their work. HOWEVER, they don’t always understand well the mission at hand. Some of this is to be expected given that they are contractors who come and go more frequently than federal employees charged with implementing government programs.
Ironically extreme selection pressure is exactly what leads to extreme risk aversion.
"We only hire the absolute best" does not lead to "move fast and break things" (which sounds awful in a FAA context anyway) it leads to people who devoted their lives to being the absolute best at coloring inside the lines and never straying off the path, to being the best follower out there, to the ultimate authoritarians desiring to grow into being the authority.
The heaviest selection pressure usually does not lead to the most efficient system, it generally leads to a system able to endure heavy selection pressure.
There's a sociologist who wrote a famous book about bureaucracy, and it's in my library at home, and the name of the sociologist and his book are at the tip of my tongue, but he wasn't near the top of a quick Google search; the above is a paraphrase of his book. No, it's not Douglas Adams or even Scott Adams, although those two are correct about the problem in general LOL.
People who want a job like that should just be on UBI instead. Then at least we'd have systems that could change to meet the needs of their users in a timely way.
I don't think it is unreasonable to expect that government pay in a world where the government is a welfare program with a governing hobby might have less purchasing power than UBI in a world where we prioritize effective governance over bureaucracy. It's not a zero-sum game.
There are just under 3 M federal (civilian) employees. I think it's entirely unreasonable to think that we would pay a UBI to ~210 M adult citizens (a 70x multiple) at levels that would represent a greater amount of purchasing power than to the people nominally working for the federal government.
If you're firm in your view that that's reasonable, I'd like to learn more about the proposal as to how the math would work.
I'm not going to code up a simulation because the research hasn't been done to confirm my choice of constants, but I can sketch it. Each workday is a function of the macroeconomic climate and some set of cultural norms during which we exhibit some blend of the following personae. As we'll see, introducing UBI reduces the prevalence of the bureaucrat persona which has knock-on effects leading to surplus.
---
The Missionary - has a mission and is working towards it. Cares more about the mission than prestige.
The Worker - doesn't have a plan, but likes to be a part of something meaningful. Will gamble with prestige in order to ensure that the work stays meaningful.
The Bureaucrat - willing to tolerate or create waste in favor of preserving prestige. Sometimes manages to trick a worker into believing they're a missionary.
---
Obviously people are more complex than this. Also, I'll use dollars to indicate productive output even though I think that most of the time collapsing such things to a single dimension is a slippery slope to somewhere awful. All this to say: gimme a break, it's a model.
Here are my totally made up constants, note that X is a parameter which will depend on UBI:
---
A Missionary creates $100 of output always, plus a 1% daily chance to inspire a worker to become a missionary, a 1% chance to inspire a bureaucrat to become a worker, and a 1% chance to burn out and become a worker.
A Worker creates $80 of output if they're following a missionary and -$20 if they're following a bureaucrat, because it's likely that they're causing more harm than good. They have an X% chance of burning out and becoming a bureaucrat.
A Bureaucrat creates -$20 of output, because they're definitely doing more harm than good.
Now let's say that everybody consumes $5 each day to stay alive.
---
So X is our worker burn-out rate.
As with most systems of this kind, it's very sensitive to initial conditions. If you start with a high enough concentration of workers and missionaries, your bureaucrat rate will be very low and you'll have a surplus. Too many bureaucrats and most of your workers are doing more harm than good, the system is carried (if it survives at all) by the missionaries and the minority of workers following them.
Critically, X is a function of risk tolerance. The worker becomes a bureaucrat because they cannot tolerate the risk of pointing out the wastefulness of the bureaucrat above them.
Introducing UBI does two things. It makes standing up to your Bureaucrat less risky, reducing X, and it creates a fourth type, the Video Gamer, who consumes $5 to stay alive but doesn't sabotage the output of any workers like the bureaucrat does.
Some percentage of the Bureaucrats will become Video Gamers if UBI is implemented. That percent depends on the size of the surplus. If the surplus gets big enough, UBI can be so comfortable that there's no reason to be a bureaucrat, because it doesn't afford a significant quality of life increase.
---
So to answer your question about the 3M and the 210M, I'd guess that today we've got 213M people living on the positive output of maybe 50M--the rest are bureaucrats or are following bureaucrats. They're busy fighting over their slice of the pie instead of baking it. Bureaucrats sort of expand to consume available resources, so as automation improves worker output, that ratio will get worse unless we find a place to put them.
We'd have to do research to come up with better constants and run that model for real to be sure, but I don't think it's unreasonable to assume that reducing both the bureaucrat concentration and the worker burnout rate by 50% would triple the system's output once you let the personae conventions find a new equilibrium. I'm not sure how much more federal employees will get paid above UBI, but I think there's room for the end result to be that future UBI is cushier than today's government work.
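(If anyone wants to poke at it, here's a rough Python sketch of the model above. The constants are the made-up ones from this comment; I attached the 1% transition chances to the worker/bureaucrat rather than the missionary, approximated "following a missionary" by the current missionary share, and invented the 0.5% bureaucrat-to-video-gamer rate, so treat it as a toy, not a result.)

    import random

    # Toy Monte Carlo of the persona model sketched above. All constants are the
    # made-up ones from the comment; the 0.005 bureaucrat-to-video-gamer rate and
    # the "missionary share" approximation are my own additions.

    PERSONAE = ("missionary", "worker", "bureaucrat", "video_gamer")

    def simulate(n_people=1000, days=365, x_burnout=0.02, ubi=False,
                 initial_mix=(0.1, 0.5, 0.4, 0.0), seed=0):
        rng = random.Random(seed)
        people = rng.choices(PERSONAE, weights=initial_mix, k=n_people)
        surplus = 0.0

        for _ in range(days):
            missionary_share = people.count("missionary") / len(people)
            next_people = []
            for p in people:
                if p == "missionary":
                    surplus += 100
                    if rng.random() < 0.01:             # burns out
                        p = "worker"
                elif p == "worker":
                    # Workers following a missionary produce; the rest do harm.
                    surplus += 80 if rng.random() < missionary_share else -20
                    roll = rng.random()
                    if roll < 0.01:                     # inspired upward
                        p = "missionary"
                    elif roll < 0.01 + x_burnout:       # burns out downward
                        p = "bureaucrat"
                elif p == "bureaucrat":
                    surplus -= 20
                    if rng.random() < 0.01:             # inspired to become a worker
                        p = "worker"
                    elif ubi and rng.random() < 0.005:  # opts out instead of obstructing
                        p = "video_gamer"
                # video_gamer: no output, no sabotage
                surplus -= 5                            # everyone eats $5/day
                next_people.append(p)
            people = next_people

        return round(surplus), {q: people.count(q) for q in PERSONAE}

    # Compare no-UBI against UBI with a lower burnout rate (the claimed effect).
    print(simulate(ubi=False, x_burnout=0.02))
    print(simulate(ubi=True, x_burnout=0.01))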
We issue it to ourselves, more or less like CirclesUBI is doing it in Berlin.
They're just letting it be inflationary and setting the payout to increase over time to adjust for inflation. So maybe you get $5 per week this year and $8 per week next year... This can be balanced so that it amounts to a more or less constant purchasing power.
Personally I prefer the demurrage approach where account balances just have a decay rate--that way you've got a better shot at $5 written down today having the same meaning to people who read it next year, but the economics are the same (more on the theory here: http://en.trm.creationmonetaire.info/ ).
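(A tiny worked check of that demurrage/inflation equivalence for an idle balance; the numbers are arbitrary and this says nothing about the payout side:)

    # A constant balance under price inflation loses purchasing power exactly
    # like a demurrage-charged balance in a stable-price unit, when the decay
    # rate is d = i / (1 + i).
    balance = 100.0
    inflation = 0.05                         # 5% price inflation per year
    demurrage = inflation / (1 + inflation)  # ~4.76% decay rate on balances

    for year in range(1, 6):
        held_under_inflation = balance / (1 + inflation) ** year  # real value of a constant balance
        held_with_demurrage = balance * (1 - demurrage) ** year   # decaying balance, stable prices
        print(year, round(held_under_inflation, 2), round(held_with_demurrage, 2))
    # Both columns print the same values every year.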
It's gotta be decoupled from the government so that, as discussed in my model, it can act as a safety net while you're ridding yourself of wasteful bureaucracy. It doesn't really work if the bureaucrat you're deposing can threaten to take away your UBI.
The people who need a pyramid structure to strive for, and office politics to fight, will never settle for "from each according to their ability and to each according to their need", and those are the people we've selected for.
Hear, hear. Also, if there weren’t red tape, then there would be more corruption and lack of accountability. The staff of agencies are damned if they do, damned if they don’t. If one wants to critique government agencies, criticize the political appointees who are in thrall to the industries they are supposed to be regulating. The rank and file generally work hard and in good faith. They are just trying to be good stewards of public resources. I’ve seen this at the federal level and state levels (primarily in North Carolina and Louisiana).
I just spent a month doing an E-Business Suite platform migration and it was very similar: follow the step-by-step instructions to apply patches and run commands. Each patch has a README file with dependent patches or commands which need to be completed first. It works mind-numbingly great until you run into the first of many circular dependencies.
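To make the dependency tangle concrete, here's a toy sketch of the prerequisite-ordering problem (the patch names are made up); the point is that a cycle should be surfaced up front rather than discovered halfway through a migration:

    from graphlib import TopologicalSorter, CycleError

    # Each entry maps a patch to the set of patches its README says must be
    # applied first. Names are placeholders, not real patch numbers.
    deps = {
        "app_patch":      {"db_patch"},
        "db_patch":       {"base_patch"},
        "base_patch":     set(),
        # The kind of circular pair that only surfaces mid-migration:
        "security_patch": {"config_patch"},
        "config_patch":   {"security_patch"},
    }

    try:
        order = list(TopologicalSorter(deps).static_order())
        print("apply in this order:", order)
    except CycleError as err:
        # graphlib reports the offending cycle in err.args[1]
        print("circular dependency detected:", err.args[1])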
That's one problem with treating the implementer as a machine to run code. The whole procedure can't be tested, so when parts are changed they can break the whole. It relies on the human in the loop to resolve the conflicts, which is not repeatable.
The other problem is the "mind-numbing" part. No-one can maintain 100% perfection all of the time. And in the context of presenting to people who don't know what it all means, I can see why mistakes would be made.
The problem is that spelling out complicated things is hard. Take the law for example - in theory we have a coherent code that specifies exactly what things are crimes and the appropriate methods of dealing with them. In practice, it takes teams of highly trained professionals and an elaborate system of courts to clarify what these laws mean in all but the most trivial cases.
Generally you need some flexibility to handle slight variations in circumstance whenever making a decision, and at times things come down to judgement calls that cannot be turned into an algorithm. But bureaucracies don't like empowering their workers to make decisions, and so you get ever more convoluted instructions to shift the decision making process higher up the ladder.
That is a different use of the word theory, and serves as an excellent example of why the problem of communicating complex ideas so unambiguously as to eliminate the need for interpretation is so intractable.
In my experience, this arises as an unintended consequence of the quest to lower costs and reduce bureaucracy.
About ten years ago, a new manager was brought in to make us act less like a moribund government department and behave more efficiently. As an example of government waste, he pointed to the money we were spending on storage for data back-ups. We wouldn't need back-ups if we stopped making mistakes.
You might think that this no-back-ups policy would be an instant disaster, but it lasted years without issue. When there was a failure, the manager would hand the sys-admin a soldering iron, the admin would fix the hard drive, and we would be back on track. Finally, the sys-admin retired and a new one replaced him. Not long after, a critical system failed and data that we were required by law to maintain was lost. The manager handed the admin a soldering iron and told him to fix the hard drive. The admin said it was impossible and the manager fired him (yes, you can get fired from a government job). Other candidates were interviewed, but no one applying for a $30k job was confident that they could repair a broken hard drive.
Finally, there was talk of hiring the old admin to come out of retirement and fix the drive. Except he explained that it had always been impossible. During his tenure, he'd spent 5% of his salary (gross, not net) paying for back-ups and replacement drives. When the manager gave him a soldering iron, he'd just chuck out the old drive, buy a replacement off Newegg with his personal credit card, and load it with the data he'd backed up to his personal S3 storage. His back-up script was still running on the server, but he'd stopped paying for the storage space the moment he retired.
Eventually, the manager was forced to spend a whole year's budget on an expensive data-retrieval firm to recover the data (which still cost an order of magnitude less than the fine the department would have had to pay if we'd lost the data). He was fired and a new manager brought on board. Because of the money which had been lost on the data retrieval, new measures were put in place to prevent this from ever happening again. This included a new back-up system and audits to ensure that other employees weren't using personal funds to pay for departmental resources. Of course, this meant rigorously documenting exactly what resources each employee was using...
Six years after the manager was brought in to decrease cost and increase agility, we were now more over budget and tightly controlled than we'd ever been.
I see this a lot with people who are experts in real-time operating system environments, particularly in aviation/space stuff (maybe because that’s where I worked for a while).
They have excellent intuition around making things redundant to single pieces of hardware failing but don’t really grok making stuff resilient to wider failures.
Anything involving transaction logs, rollbacks, and plain old backups takes a backseat to live hardware-redundant environments. “It’s OK though because we follow the NASA software development process which has a rigorous set of validation steps that prevent bugs.”
> They have excellent intuition around making things redundant to single pieces of hardware failing but don’t really grok making stuff resilient to wider failures.
I always feel like making single components redundant is a fairly well-defined process -- generally speaking, the mechanisms are the same (1+ redundant components, failover, STONITH, etc), where making things resilient on a higher level is not as well-defined, and often requires bespoke solutions to each unique situation.
BFT state machine replication is well-understood and well-defined: use N of M agreement for inputs and run them through a deterministic state machine. Optionally, do N of M signature of outputs.
OTOH, what are the properties of failover? "Failover" seems like an attempt to cheat the Byzantine generals' problem: the generals send mail and then confirm the results in a Zoom call. But what if Zoom doesn't work? What are the assumptions for 1+ redundant components/failover/STONITH?
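For what it's worth, here's a toy sketch of the "N of M agreement on inputs feeding a deterministic state machine" shape described above. It ignores everything that makes real BFT hard (asynchrony, equivocation, signatures, view changes); it only shows the basic structure:

    from collections import Counter

    M = 4   # total replicas
    N = 3   # quorum size (tolerates one faulty replica when M = 3f + 1)

    def agree_on_input(votes):
        """Accept a command only if at least N of the M replicas proposed it."""
        value, count = Counter(votes).most_common(1)[0]
        return value if count >= N else None

    def step(state, command):
        """Deterministic state machine: same state + same input -> same next state."""
        if command == "inc":
            return state + 1
        if command == "reset":
            return 0
        return state

    replica_states = [0] * M
    rounds = [
        ["inc", "inc", "inc", "reset"],   # one faulty replica disagrees; quorum still holds
        ["inc", "inc", "inc", "inc"],
        ["reset", "inc", "noop", "inc"],  # no quorum -> input is dropped, nobody diverges
    ]
    for votes in rounds:
        cmd = agree_on_input(votes)
        if cmd is not None:
            replica_states = [step(s, cmd) for s in replica_states]

    print(replica_states)   # honest replicas stay in lockstep: [2, 2, 2, 2]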
Formal verification that such fatal bugs are absent would allow a real-time system to avoid screwing up without relying on backups, while of course still keeping logs/rollbacks for (human) input actions to cope with mistakes.
It's just that production software essentially never uses formal verification to a sufficient extent.
Note that this most recent outage (NOTAMs) had nothing to do with ERAM.
ERAM is employed at the 23 air route traffic control centers [1] throughout the nation as their primary operating system. If there were a system-wide outage of ERAM, the consequences would be orders of magnitude worse than any NOTAM outage. Basically every flight in the air and not close to a terminal facility would lose radar contact, and controllers would be working blind, causing widespread chaos and likely many safety incidents. Non-radar air traffic control is a thing, but generally controllers do not have adequate training or currency to do it safely, and definitely not at anywhere near normal capacity.
The NOTAM system is something that a room full of decent engineers could easily build from scratch and make infinitely better in a short time. It’s essentially just a database of categorized posts with some APIs for sending entries and returning them when requested. These government IT teams spend way more than what it should cost and end up with bloated ancient tech that barely works.
That’s not speaking poorly of the engineers (who in my experience can be very good) but of the management and innovation culture of these agencies, which is too often terribly broken. They would say they are “risk averse,” but as yesterday highlights, their poor approach to this creates a ton of risk.
"I haven't seen the requirements, know little to nothing about the system, but I could knock that out in a weekend with a few Red Bulls." This stuff is such cringe, it is the type of response you see from fresh CS students who haven't started working yet and think everything is a piece of demo work where the requirements don't matter and that everything is simple if you just start writing some code.
Obviously the NOTAM system isn't up to scratch; that's why we're talking about it. But "hard problems are easy" isn't constructive nor very realistic. I bet someone could put together a three-hour presentation on all the complexity that led us to this point.
I hear you. But the flip side is that there is often a tendency to massively overcomplicate things in a way that bakes in valueless complexity and bloat, which is very much the norm in large government systems. In practice this attitude is far more common and destructive long term.
I think it's way worse than adding complexity and bloat. There are processes that specifically prevent anyone from understanding or owning the system. The complexity and bloat is a side product of the fact that the work was siloed, contracted out, and everyone washed their hands of the result. Which also takes exponentially more time and people.
this is extremely accurate. it's like a giant Rube Goldberg machine that takes both good code and garbage as input and produces a tremendous amount of garbage that nobody can understand once it comes out the other end.
Do you know that each of these systems actually has a defined "owner"? It's part of the FISMA process and every so often these "owners" have to attest that certain security processes are in place and functioning and that no significant changes have been made without a full review... among many other things.
That "owner" is also a federal employee, not a contractor.
I've worked directly with this system a number of times. it's basically a pub sub rss feed optimized for low latency and molested by bureaucrats for decades
> This stuff is such cringe, it is the type of response you see from fresh CS students who haven't started working yet and think everything is a piece of demo work where the requirements don't matter and that everything is simple if you just start writing some code.
This stuff is such cringe, it is the type of response you see from jaded boomers more focused on box-ticking and punching out as opposed to doing anything new.
This is a situation where tossing the whole damn thing out and starting over again would be productive. The systems that lead to the creation of these half-fossilized government projects (that still don't work!) will not change and need to be tossed.
> "I haven't seen the requirements, know little to nothing about the system, but I could knock that out in a weekend with a few Red Bulls." This stuff is such cringe
There is a whole group of people whose jobs depend on the system existing like it does. One dude I worked with re-did a procurement system with a few spreadsheets. He got tired of waiting for 30 people to 'do it'. As soon as his boss found out, the whole project he was on was scrapped and he was demoted shortly thereafter. His 'sin'? He had accidentally found a way to put 30 people out of a job. These systems exist like this because our gov wants them that way. Not because they are the best.
Yesterday should have been a 'run this on these 2 old boxes and make sure they still work' regression test. Sounds like that step either is not there, was totally skipped, or does not match reality. Building a real five-9s type system means taking each piece and pondering 'what are the different ways this can fail', then mitigating each of those. It is mind-numbingly tedious work that takes a long time to do. Also most of this is probably run by contract houses. Which means the people who use it do not really 'own it'. Which is by design, for CYA. Which costs more and takes more time because it is all paperwork.
Well, they claim to be "risk averse"; the wisdom there seems to be that if you fill out a million forms, that qualifies as risk averse because it shows you did due diligence. The problem is they took that process wholesale from the rest of the org and applied it to software, where it doesn't work.
It's like making a crap sandwich and filling out a bunch of forms proving that it's not a crap sandwich, rather than just spending the time you would spend filling out all those forms on, I don't know... not making a crap sandwich.
One common problem is building new things with tech that is already obsolete. Given the choice between new and old, the old is perceived as more tried and tested. Granted it's sometimes difficult to distinguish "new and going to last" from "new and shiny", but when it comes to software choosing to go with "old and tested" can be counter to cyber security since the old software is no longer updated.
Of course you also have "old but still developed, and likely to be supported in near-perpetuity". Things like SQLite fall in this category. Unfortunately there's this other problem of management being inexplicably down on anything open-source. I don't know whether they've been exposed to too much FUD from vendors selling proprietary solutions, or just that the idea that security through obscurity is no security at all has failed to reach them.
Engineers following current best practices? They'd run a Kubernetes cluster with Kafka, because those are the best COTS tools for reliability. Battle-tested. The system will be down every week because Kubernetes needs patching.
It's risk avoidance to the point that the avoidance leads to new kinds of risks. The whole idea that you can architect yourself out of failure modes to the point that you no longer need to make backups is one that I see every other week or so, and the number of companies out there that believe that because they have redundancies they don't need backups any more is staggering.
What's funny about this is that the FAA obviously knows what can go wrong with "yeah, we have two of them," as they wrote the ETOPS regulations to avoid some of the common pitfalls and amateur mistakes. They then failed to apply that to their software.
Obviously at a big government agency, the same person is not writing both aviation regulations and software procurement contracts, but the institutional knowledge is there. Nobody thought to be as paranoid about software as they are about planes flying over the ocean, but honestly, paranoia is good if you want reliability.
> they wrote ETOPS regulations ... then failed to apply that to their software
you are pointing out that they prioritized that planes in the air could land safely over allowing more planes to take off. I'm actually quite reassured now.
This seems to pose an interesting question that's out of my pay grade. The fundamental problem seems to be: you've replaced two distinct systems (one new and far more capable + 1980s-era one that always works but lacks [new feature x100]) with the same one running on 2x different machines. So the weak point is you ultimately share the same database/data structures/memory+logic flows between two systems. So if you keep them in sync the distinction comes down to hardware and lower-end systematic issues.
But most orgs can't realistically have two distinct software systems. How do you create proper isolation or failure mechanisms between them?
I'm guessing this sort of thing is what you mean by their experience with ETOPS.
It’s a fundamental limitation of identical redundant systems that they have vulnerability to some of the same threats, particularly bad inputs and capacity issues. It’s important to understand it’s only giving you physical redundancy, such as if one data centre goes down. But the same software bugs, the same bad input data, even the same memory overruns are likely to hit both systems.
It’s not bad design, it’s just you have to understand what resiliency you have and plan against each of various such threats according to your risk appetite.
My intuition on that is that two is also a bad number to choose in that case. One could go full lunar mission on the thing and have three of them, and in case of inconsistency the majority wins.
you could go to a model that verifies the integrity of the data coming in and makes sure the limits on the data are sane before committing it to the db. using a language with strong safety principles (that are not very "hip") like Ada or Fortran. or you design it so that the system is robust to failure and expects failure, like something from telco, e.g. Erlang. redundant hardware is fine and great but having them do verification on the data and monitoring the limits of the system is pretty important too.
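whatever the language, the "sanity-check before commit" part is tiny. a rough sketch in Python (the field names and limits are invented, not the real NOTAM schema):

    import sqlite3

    MAX_TEXT_LEN = 2000  # invented limit, purely for illustration

    def validate(notam):
        """Reject obviously bad records before they ever reach the database."""
        if not notam.get("id") or not notam["id"].isalnum():
            raise ValueError("missing or malformed id")
        text = notam.get("text", "")
        if not text or len(text) > MAX_TEXT_LEN or "\x00" in text:
            raise ValueError("empty, oversized, or binary-contaminated text")
        return notam

    conn = sqlite3.connect("notams.db")
    conn.execute("CREATE TABLE IF NOT EXISTS notams (id TEXT PRIMARY KEY, text TEXT)")

    record = {"id": "A0123", "text": "RWY 09/27 CLSD 1200-1800Z"}
    with conn:  # the transaction commits only if validation and the insert both succeed
        conn.execute("INSERT OR REPLACE INTO notams VALUES (:id, :text)",
                     validate(record))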
This reminds me of a particular hardware system I'm familiar with whose design specified "dual power supplies". However late in the acceptance process it was discovered that the condition of one power supply up and one power supply down causes the system to lock up. (My guess would be a phantom current path between the powered and un-powered halves causing an unintended circuit state. Or perhaps just a software bug; maybe the one-supply-down notification code path was never tested.)
The vendor simply changed the procedures to say the user must use a two-fingers procedure to simultaneously flip both power supply switches on or both off at once. It bothers me that we still don't know WHY the original problem happened. How do we know there isn't electrical damage occurring during the brief period between the two switch contacts (since no human can do that perfectly)? If it's a software problem, what is the most time one power supply can be up and one down before the bug is triggered? That's not been characterized, to my knowledge.
But in the context of this thread, what's relevant is that, to avoid addressing the problem, the vendor changed the meaning of "dual power supplies" from an OR condition to an AND condition! They met the letter of the spec while completely violating the spirit of the requirement.
Turns out having 2 copies of something doesn't matter if the same message that crashed the first node gets re-tried and crashes the other node too. Who could possibly predict that /s
$2 billion is an insane amount of money even for a major gov software project
And even after $2B it was completely broken and late and required another year and hundreds of millions.
I get the feeling people see thousands of millions of dollars as some abstract thing. You could build an A+ team with 1/100th that cash. Yet it still ends up sucked down a black hole that's only designed to consume more money.
And there's zero consequences for failure. The same few contractors will get the contract next time.
Quote from FAA: "Our preliminary work has traced the outage to a damaged database file." UK news source The Independent is reporting that Nav Canada's NOTAM system also suffered an issue.[1] Speculation: corrupting input, either international or North American? E.G. UTF-8, SQL escape, CSV quoting.
edit: Better reporting of Canada's issue from Canada's CBC (and frankly, better reporting about the US, too). [2] "In Canada, pilots were still able to read NOTAMs, but there was an outage that meant new notices couldn't be entered into the system, NAV Canada said on social media." "NAV Canada said it did not believe the outage was related to the one in the U.S., but it said it was investigating."
Why would this be an indictment of any specific database technology? If your disk fails and corrupts the filesystem, you're toast, regardless of what database you are using.
Having worked with critical infrastructure that lacks true in-depth oversight, it wouldn’t surprise me if DR plans were never executed or exercised in a meaningful manner.
Yep and if you ship WAL transaction logs to standby databases/replicas, corrupt blocks or lost writes in the primary database won't be propagated to the standbys (unlike with OS filesystem or storage-level replication).
Neither checks the checksum on every read as that would be performance-prohibitive. So "bad data on drive -> db does something with corrupted data and saves corrupted transformation back to disk" is very much possible, just extremely unlikely.
But they said nothing about it being a bad drive, just a corrupted data file, which very well might be a software bug or operator error.
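Either way, the per-page checksum idea itself is cheap to sketch. This is a generic illustration, not any particular engine's page format:

    import zlib

    def write_page(payload: bytes) -> bytes:
        """Prefix each page with a CRC32 so corruption is caught when the page is read back."""
        return zlib.crc32(payload).to_bytes(4, "big") + payload

    def read_page(page: bytes) -> bytes:
        stored, payload = int.from_bytes(page[:4], "big"), page[4:]
        if zlib.crc32(payload) != stored:
            raise IOError("page checksum mismatch: refusing to use corrupted data")
        return payload

    page = write_page(b"NOTAM A0123 RWY 09/27 CLSD")
    assert read_page(page) == b"NOTAM A0123 RWY 09/27 CLSD"

    try:
        read_page(page[:4] + b"X" + page[5:])  # one flipped byte in the payload
    except IOError as err:
        print(err)  # caught on read instead of silently propagating to the replica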
RAID does not really protect you from bit rot that tends to happen from time to time.
ZFS might because it checksums the blocks.
But if the corruption happens in memory and then it is transferred to disk and replicated, then from a disk perspective the data was valid.
journaling databases are specifically designed to avoid catastrophic corruption in the event of disk failure. the corrupt pages should be detected and reported, and the database will function fine without them
If you mean journaling file systems, no. They prevent data corruption in the case of system crash or power outage.
That's different from filesystems that do checksumming (zfs, btrfs). Those can detect corruption.
In any case, if you use a database it handles these things by itself (see ACID). However I don't believe they can necessarily detect disk corruption in all cases (like checksumming file systems).
Well, for example, MySQL/MariaDB using utf8 tables will instantly go down if someone inserts a single multibyte emoji character, and the only way out is to recreate all tables as utf8mb4 and reimport all data.
MySQL historically isn't very good about blocking bad data. Sometimes it would silently truncate strings to fit the column type, for example. It's getting better as time goes on, though.
I have had customer production sites go down due to this issue when emojis first arrived. It was a common issue in 2015. I would hope it is fixed by now!
Having dealt with utf8mb4 data being inserted into the utf8mb3 columns many many times in the past, I've never had a table "instantly go down". You either get silent truncation or a refusal to insert the data.
In MySQL the `utf8` character set is originally an alias for `utf8mb3`. The alias is deprecated as of 8.0 and will eventually be switched to mean `utf8mb4` instead. The `utf8mb3` charset means it's UTF8 encoded data, but only supports up to 3 bytes per character, instead of the full 4 bytes needed.
Imagine you have one node which is running as a replica of another and it takes the backups. Well, let’s pretend it is backing up the corrupted data once in a while and it happened to overwrite their cold backup. They could have any number of databases and still had this failure. It’s more their methodology for taking backups. They should have many points in time to choose from to rebuild their database. They should be testing their databases before backing them up blindly.
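One way to make "test before backing up blindly" concrete, sketched with SQLite only because it is easy to show; the actual system is presumably nothing like this:

    import sqlite3

    def snapshot_if_healthy(db_path: str, backup_path: str) -> None:
        """Refuse to overwrite the cold backup with a copy that fails its own consistency check."""
        src = sqlite3.connect(db_path)
        try:
            (status,) = src.execute("PRAGMA integrity_check").fetchone()
            if status != "ok":
                raise RuntimeError(f"integrity check failed ({status}); keeping the old backup")
            dst = sqlite3.connect(backup_path)
            with dst:
                src.backup(dst)  # consistent online copy via SQLite's backup API
            dst.close()
        finally:
            src.close()

    snapshot_if_healthy("notams.db", "notams-hourly.db")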
This could be due to some third-party service caching the NOTAMs. Even in the US, ForeFlight had all NOTAMs available but just couldn’t fetch new ones.
I have no idea how often NOTAMs typically get updated but would that explain why flights were able to operate for a few hours before ultimately a ground stop was called for?
In the past years, I've grown increasingly concerned about backups. To me, it feels like whatever you back up should be validated, before it's considered "okay".
So, if you have a database of some sort, a part of the backup process would be checking that the backup can be used to run an instance of it. If you have images or videos or PDF files, all of those should be validated as well, to make sure that they're not corrupted (e.g. display/playback fails). If you have compression as a part of the backup process, then the compressed archive should also be validated, that it's not corrupt and that the data can also be extracted.
Only then would checksums of the backup be actually useful, once you'd know that you don't just have a bunch of mostly useless binary data on hand. Sadly, the tooling just isn't there (e.g. CLI utilities for validating everything) and there are far more file formats than most people want to concern themselves with.
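Some of that can still be scripted with generic tooling: record a digest when the backup is written, then re-hash the archive, prove it decompresses, and only then call it good. Everything below (file names, manifest format) is made up for illustration:

    import gzip
    import hashlib
    import json

    def sha256(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_backup(archive: str = "notams.db.gz", manifest: str = "manifest.json") -> bytes:
        # 1. The archive on disk still matches the digest recorded when it was written.
        with open(manifest) as f:
            expected = json.load(f)["sha256"]
        if sha256(archive) != expected:
            raise RuntimeError("archive no longer matches the digest recorded at backup time")
        # 2. The archive actually decompresses end to end.
        with gzip.open(archive, "rb") as f:
            restored = f.read()
        # 3. Still missing: open `restored` with whatever normally reads it (database
        # engine, PDF renderer, video decoder) before declaring the backup usable.
        return restored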
> In the past years, I've grown increasingly concerned about backups. To me, it feels like whatever you back up should be validated, before it's considered "okay".
I've been trying to toot this horn at every place I've consulted/worked at, with the most common response from software engineers being: "We're using the AWS backup system of the database, of course it works, no need to test it"
People have way too much trust in systems overall, especially ones they have no inside information about. And people also seems to default to "Of course it works" without testing it, instead of "How would I know it works if I haven't tested it?" or even "I don't know".
Seems like it would be prohibitively time-consuming. For a moderately large data set (of, say, a few hundred TB), by the time the backup of your data is vetted and validated, it's months old and effectively useless.
Thanks! I haven't seen this before, though I have heard these 7 general principles. My personal strategy is broadly:
1. daily encrypted restic backups to my local NAS, running ZFS raidz2
2. weekly rsyncs of the backup directory to an offsite machine
3. yearly full offsites of the entire NAS
I test my backups frequently if only because I have some odd persistent bug that eats my zsh history file about once a month. I now have a "recover-latest-history" script that pulls that file back down from the latest restic backup. I have just completed an offsite and exchanged it for last year's, so I can test that now too!
Well I would say that you must ensure that using the backup will solve the problem before choosing that solution. This implies that you must take time to analyze the source of the problem first.
What’s even more hilarious is expert professionals being openly derisive and spreading stereotypes about others based on programming language preferences.
Do you hate JavaScript programmers?
Do you also agree with your friends that people of a different color are all the same?
Do you feel the need to call out other people’s sexual orientation, at least behind their back when only your friends are listening?
Some of your friends might be JavaScript programmers, you know.
Are you aware that an Amtrak passenger train was stuck in South Carolina for 39 hours this month? Some of the passengers felt they were being held hostage. At least at an airport you can walk outside and catch an Uber.
There’s a reason for that. Freight has priority because the freight companies actually own the rails, and actually use them, and it’s a significant contributor to the US economy. Amtrak is an unwelcome guest that they are required by law to put up with. If they wanted better service, Amtrak could pay for it, except the big cross country routes are already crazy unprofitable.
The actual sane solution is to shut down those routes and run buses. American heavy rail outside the NEC is better suited for freight anyway. Build new dedicated corridors for fast passenger services, or stay home.
Your comment gives the impression that Amtrak has no business operating on the tracks they do.
The reason the freight companies own those rails is primarily the land grants made by the federal government, which came with obligations, such as providing passenger service. Amtrak has trackage rights because the freight companies wanted to divest their passenger rail operations. Maybe an unwelcome guest, but essentially a former part of their own operations.
Anyhow, buses are an insufficient substitute for passenger trains and already available from other operators. If they were sufficient people wouldn't be on the trains.
Yeah. They gave railroads the initial land — in 1872. There’s been a lot that’s happened since then, like the bankruptcy and near-complete collapse of functional passenger rail transportation in most areas in the 1970s-1980s, as it faced stunning new competition from cars and planes.
But the argument of “is this justified given the history?!” should take a back seat; certainly Congress is able to force the industry’s hand whether or not it’s justified. The first question should be whether it’s a good idea in the first place, since rail is doing useful things for the US economy, which would ultimately shoulder more costs for it than the railroads themselves as a business.
Damaging the supply chain and raising prices across the economy while putting more trucks on the taxpayer-funded roads emitting more carbon dioxide is a steep price. Incremental improvements to the reliability of seldom-used cross-country routes through the sparsely inhabited West at speeds of about 60mph aren’t worth that price.
Well, another option is that the government could pay for Amtrak’s track the way they do for roads. We don't make car manufacturers or taxi/bus companies pay for the roads, so I don't see why it's on Amtrak to pay for their own rail.
On the highways, the government has taxed motor vehicle fuels at the federal and state level (with occasional infusions from the general fund) and occasionally set up tolls for specific projects, generally setting rates higher for more intensive use (trucks). Thus bus and truck operators pay for use of the highway, perhaps occasionally joined by taxpayers generally.
By contrast, on the railroads, Burlington Northern Santa Fe or whoever purchased the rail from the decrepit husk of bankruptcy that preceded it, and upgraded it, and installed all the new safety equipment, and maintained it over time. The capital involved here is, by and large, private.
So there shouldn’t be any surprise that the situations are different.
I don’t know why you want to drag in manufacturers, though. It seems to muddy the water.
> Build new dedicated corridors for fast passenger services
If only. One of the best investments for NA would be an actual HSR network outside freight. California is trying, and there are so many people doing everything they can to stop it. Including Musk inventing an impossible alternative he never intended to build [1], hyperloop, because HSR would compete with Tesla and hurt his sales.
Doesn't Amtrak pay for that? I have a feeling if there was an effort to shut them down and replace it with buses, the freight companies would block it. They benefit from the relationship.
This feeling is grossly at odds with the reality: hostility between host railroads and Amtrak is well known as the order of the day. Amtrak pays a well-below-market sweetheart rate, for trains that often miss their slots and snarl operations. No host railroad would miss them.
Half a dozen news stories about passengers getting trapped on planes that were sitting on the tarmac from the past 30 days alone.
I'll take being "trapped" on a train that has a cafe car, numerous bathrooms, power outlets, likely cell phone service or wifi, plenty of room to walk around, and considerably more leg room and seat-reclining...over being trapped in a metal tube, crunched into a tiny seat, with a bathroom that probably won't function past a few hours, limited food, no power, and nowhere to get up and walk around.
Not to mention, if you're anywhere near civilization, if push comes to shove: you have at least some possibility of being able to just leave. On a jet airliner in an airport, you are completely trapped.
Federally airlines should be required to deplane passengers after a certain amount of time, or immediately if the plane becomes too hot/cold, runs out of water, or the bathroom stops working....and the flight crew criminally punished if they don't. But that will never happen because of airline industry lobbyists.
That's terrible. On the other hand, airports for me are guaranteed to be awful, and I've always enjoyed the Amtrak. If I'm in a hurry to get someplace I'll take a plane, sure. I wish we had high speed rail in this country (or I'd never fly again except internationally) but of course we don't. If there's no particular rush I'll take my sweet time on the Amtrak, and if it's delayed for 30 hours - like I said, you'll find me minding my reading with no complaints.
in some countries in Europe you actually can do that too from trains (and not only in Romania, where a family crossed the tracks (2 <4 y.o. kids, 4 pieces of luggage) to conveniently enter the parking lot, where their car was located)
Well, sometimes it's because you already boarded and started to taxi and there are so many planes on the ground that there are no gates open to get back to the terminal so you spend hours on the plane waiting for a slot to open so you can get off, meanwhile the galley runs out of food/drink and the toilets fill up. Then when you finally get into the terminal, it's chaos, no one knows when planes will be flying again so you're not sure if you should stay there or try to get a hotel (which fill up quickly from all of the stranded travelers)
It was United or Alaska and was a few years back and fortunately it wasn't me on the plane, it was my wife - a big east coast storm and airport closures diverted her flight (along with a bunch of other flights).
More or less. The chairs are hostile to stop you from getting too comfortable. I've had 6-hour layovers before and had to sleep in those terrible chairs under fluorescent lights. On the last flight I took there was barely any food near my gate; there were like 4 bars instead. You can't leave your luggage anywhere, so you're dragging it all over the place. You get jammed like sardines into the plane, and you can't stretch your legs. You can't really relax because you have to stay abreast of announcements, in case they change your gate or cancel your flight or what have you. Everyone is on edge. Et cetera.
On the train there's more room, you can move around, and the atmosphere is more relaxed.
Some people like planes. I saw an HN comment where someone said they took lots of "flights to nowhere" over COVID and used the plane as their office. I don't understand that at all. But to each their own.
It's hard to agree with that. There are some amazing train routes through Austria, Norway, Switzerland, Germany that keep me looking out of the window the whole way.
I love the views from the train too, but I feel like the view flying is something magical everyone should experience at least once in their life.
The ground fading away, the tiny cars, glittering rivers, seeing it in reverse for landing. That is something I'll wax poetic about. But if I could take the train the rest of my life I would.
Clearly the middle ground here is to bring back zeppelins. /s
No sarcasm needed for zeppelins: the idea of gliding over the landscape low enough to hear through the openable window what's happening below fast enough to be useful for a domestic voyage, and in reasonable comfort: What's not to like?
I was waiting for a Ryanair flight from Friedrichshafen when one of the Zeppelin NT craft flew past, heading over the Bodensee towards Switzerland. It looked simultaneously classic and futuristic, the sort of shiny utopia future of 1950s sci-fi book covers. I envied HARD.
The expectation of speed at the airport, because planes move fast, creates urgency for everyone. When you're on a ship or a train there's a time component to your expectations of travel. A delay in an airport feels worse than a delay at a train station or at a port.
Final decision vests with the TSA agent at the desk.
But there are needles which are short metal/plastic parts, with a long flexible back (eg for circular knitting) which are very unlikely to cause an issue: the point is smaller than a pen.
My seat on the Amtrak Cascades in September was more comfortable and had about 6 inches more legroom than any airplane I’ve ever flown on. First or Business class on an international flight will be way better than a roomette on Amtrak however.
I remember a time in the mid 2000s where a 20 hour delay was the norm for the northbound Coast Starlight. I used to catch the previous day's train from Salem to Seattle...
They do have an odor to them, especially since most people tend to boil them to death but I can think of a hundred things more offensive in a confined space.
I never had an egg that smelled unless it was bad (which too hasn't actually happened to me). My sense of smell is pretty good, I smell all kinds of little things (incidentally, I think that loss of sense of smell can be an early signal for approaching death, if there is no non-deadly cause to explain it).
I read the replies here: https://www.quora.com/Why-do-boiled-eggs-stink-but-scrambled... -- but I've never seen this in five decades of egg eating. I think I would remember a stinking egg (would make me more reluctant to make more for quite some time)? What does "overcook" mean in this context? I've done up to a little under 10 minutes for some larger eggs.
I think there may be something very wrong with the food, and subsequently the bodies, of your mass-incarcerated and terribly treated chickens. I did see some funny eggs (inside colors) - that I refused to eat - in places that bought the cheap mass-produced eggs. My mother also gets eggs with unnaturally deep colors; I think that's something they must have added to the chickens' prison food.
That's an issue I have with discussions - but also with studies - about this or that food, for example "meat is bad", but anything really, tomatoes or potatoes too. Each food type has a vast spread, and everybody just uses the name of the food but people are all talking past one another because they have vastly different actual food in mind.
It’s common to over-boil eggs, which results in a green iron sulfide patina on the yolk. Some people vehemently prefer that. We had some friends visit us and they took that on a plane… Yikes!
"The source said the NOTAM system is an example of aging infrastructure due for an overhaul."
The concept of "linting" a file and the concept of verifying a backup are truly ancient, and coincidentally both are completely absent from the description of the problem. People who felt no need to do either 20 years ago are certainly not going to start doing it today, especially when the inevitable system failure in the distant future results in yet another lucrative replacement contract for system 3.0.
I think you're misunderstanding what the "backup" was in this context (in particular as the article is unhelpful by calling it a backup file, which isn't accurate).
In this case it is two PRODUCTION systems running concurrently. The primary and the secondary (article calls the "backup"). Primary went down due to corruption, but the identical secondary system couldn't be switched to because the corruption also occurred there.
No, it's the RAID vs Backup distinction. If you have 2 disks in Raid 1, you have a system that can survive one drive dying (high availability), but you also need a backup system (offline).
It sounds like their high availability system failed but they probably were able to restore from backup (offline)
>Due to temporary lack of access to Internet and malfunction of the electronic document flow system of Rosaviatsia the Federal Agency for Air Transport is switching to paper version.
>“The document flow procedure is being determined by the current records management instructions.
>“Information exchange will be carried out via AFTN channel (for urgent short message) and postal mail.
>“Please make this information available to all Civil Aviation Organizations.”
I recall years ago in the mid 2000s a moderately sized US ISP that sold internet access in office buildings, mostly to businesses, had a rather interesting outage. They had switches (Extreme switches - known for their purple color, aka Barney the dinosaur switches) in the basements of many buildings, backhauled to a colo and the rest of their network.
They were doing some mass network upgrade during the early-hours maintenance window and devices weren't coming back. I don't think they had OOB access to their boxes, and it required someone going out and recovering them after hours of downtime and getting techs there.
The culprit? They downloaded the OS image using FTP and forgot to set binary mode. Various other network kit vendors I recall would do some level of validation (and anyone downloading should be checking the hash). RCA sent to customers was vague and just said it was a corrupted image / failed upgrades.
I had a boss purchase those switches without consulting me, he liked the color. They were pure garbage. Thanks for reminding me of that, I'll have to ping a former co-worker and have a laugh.
> The culprit? They downloaded the OS image using FTP and forgot to set binary mode. Various other network kit vendors I recall would do some level of validation (and anyone downloading should be checking the hash). RCA sent to customers was vague and just said it was a corrupted image / failed upgrades.
I've always been paranoid about this sort of thing when applying BIOS or firmware updates, despite those almost always having checksum validation, but I guess my caution is not unwarranted.
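The "check the digest before you flash" step is small enough that there's little excuse to skip it. A sketch; the filename is a placeholder and the expected digest would come from the vendor's release notes:

    import hashlib
    import sys

    def verify_image(path: str, published_sha256: str) -> None:
        """Refuse to install an image whose digest doesn't match the published value."""
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()  # firmware images fit in memory
        if actual != published_sha256.lower():
            sys.exit(f"digest mismatch for {path}: got {actual}, expected {published_sha256}")
        print(f"{path}: digest OK, safe to proceed")

    # verify_image("switch-os-1.2.3.img", "<sha256 from the vendor's release notes>")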
I was recently rewatching some old Taleb talks about fragility. Our software systems are extremely fragile. I wonder if there is any way we can make them anti-fragile? What would this look like?
The telecom industry ran into similar issues and developed Erlang.
I think a big takeaway from it is that designing systems which are failure-free is a fool's errand - no matter how hard you try, you can never get rid of 100% of the bugs.
Instead, make it failure-tolerant: sooner or later every part of the system will break, so it should be constructed in such a way that it can gracefully recover from failures, and even operate with some parts of it unavailable. Crashes are expected, so the system is designed to handle them properly.
I think they are. When a crash happens, the cause is identified, we patch the bug, and redeploy the system. After that, the software cannot fail for the same reason again—it's less fragile than before.
A fragile thing is like a wine glass. Once it breaks, it cannot be restored to its original state and especially not made better than before.
However, if you're talking about patching the system while it's still running, check out "Stop Writing Dead Programs" from Strangeloop '22: https://youtu.be/8Ab3ArE8W3s
TDD helps a lot against fragility imo. The main issue is explaining why your project takes so much longer as your tests force you to work out every edge case.
A long time since I read it but I think he gave examples like mithridatism, poisoning yourself to build up resistance. Anti-fragility is about incorporating change as an expectation instead of seeking stability and fearing it.
For software, that would mean practices like Chaos Monkey. If your production system stays up while an external process is constantly killing processes and deliberately corrupting memory and files then you have good confidence of riding through unexpected failures.
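In miniature, that's just a loop pointed at something you claim is resilient. A deliberately tiny sketch, nothing like the real Chaos Monkey; `get_worker_pids` is whatever enumerates your workers:

    import os
    import random
    import signal
    import time

    def chaos_loop(get_worker_pids, max_interval_s=60):
        """Periodically SIGKILL one random worker; if users never notice, supervision works."""
        while True:
            time.sleep(random.uniform(1, max_interval_s))
            pids = get_worker_pids()
            if not pids:
                continue
            victim = random.choice(pids)
            os.kill(victim, signal.SIGKILL)
            print(f"chaos: killed worker {victim}; the service should recover on its own")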
Actually, it would've been pretty reliable if it ran on an IBM mainframe. That's their entire selling point.
There are two fundamental philosophies in fault tolerant systems. One is designing fault-tolerant hardware and running non-fault-tolerant software on it. This is what mainframes do. Practically any component of a mainframe can be hotswapped without shutting down the OS.
The other is designing fault-tolerant software and running it on non-fault-tolerant ("commodity") hardware. The latter is so popular that it's pretty much the default now, but it's not the only way of doing things.
> Actually, it would've been pretty reliable if it ran on an IBM mainframe.
How would an IBM mainframe help you with a corrupted database file? I understand that reliable hardware makes the corruption less likely to happen for hardware reasons, but it can also be the result of a software bug, or some unexpected and not correctly checked input.
From the article and comments I don't understand - are they talking about backup (copy of data) or backup system (nuke your primary system, failover to backup system and just keep working) ?
I think it is a SYSTEM
> It has a backup, which officials switched to when problems with the main system emerged, according to the source.
> Officials ultimately found a corrupt file in the main NOTAM system, the source told CNN. A corrupt file was also found in the backup system.
Perhaps the backup system is just using data from main system which currently is compatible but won't be in the near future?
Still there should be data copy somewhere, right (with corrupt data...)?
I am starting to wonder if this "backup" is an online log replica of the production system.
Failover doesn't work out in the situation where you are replicating trash. A "reboot" from an actual backup/snapshot would be required if you ate a bad log stream.
>In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system – a significant decision, because the reboot can take about 90 minutes, according to the source.
How often is the system rebooted? I haven't heard of it happening before and some quick searching didn't find any historical examples. Is it a scheduled event and no flights have departure times while this maintenance is taking place?
I've found that issues tend to manifest on production systems with infrequent restarts.
It sounds more like there are a pair of systems where one operates as the primary and the other as a backup, and these can swap in case of problems with the primary. That's fairly typical in critical infrastructure.
I've done testing on these types of systems in the past (carefully) and the owners will often let you test against the system that's currently the "backup".
So they probably restarted the "primary" after performing a fail-over.
I wonder if the corruption could have been caused by a cosmic ray bit flip. It would be funny to think that one single tiny subatomic particle knocked down our entire flight system.
Cosmic rays are a cop out but a good plot device. They're so rare, but one could write a bit flip into a script for a novel or a tv show as the solution to any computer mystery.
Cosmic rays are more common than you think. Google's early infrastructure was impacted by a supernova (because their nodes were so cheap). But something like NOTAM can handle these single bit flips without a problem.
You can expect 250 or so cosmic ray events per second in a 42 litre sodium iodide crystal pack at 100m above sea level.
Source: 10 years airborne geophysics, radiometric calibrations.
Addendum:
In-flight upset 154 km west of Learmonth, WA 7 October 2008 VH-QPA Airbus A330-303 [1] was a probable (but uncertain) example of cosmic ray events causing multiple spikes in one of three air data inertial reference units (ADIRUs) that also went on to cause a failure mode of the "best of three" reporting system leading to a pitch down in which [2]
> 110 of the 303 passengers and nine of the 12 crew members were injured; 12 of the occupants were seriously injured and another 39 received hospital medical treatment.
HOWEVER .. despite 250 events per second in a 42 litre volume, it took 128 million hours of unit operation to see a failure mode.
It's a lot of billiard balls going through a lot of space and a high bar for "something bad" ( just the right bit flip ) to happen.
One cosmic-ray bit-flip per 256MB per month. Significantly more for computers on aircraft at higher altitudes. I kinda wish I hadn't learned this fact, how fragile everything is. On the other hand, it makes me appreciate the importance of error tolerance and recovery.
> Cosmic ray flux depends on altitude. Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The rate of upsets in aircraft may be more than 300 times the sea level upset rate.
The reason we all now wear our seatbelts through the entire flight is there was a cross-Pacific flight that had a sudden altitude drop where several people were injured from rebounding off the ceiling. When the news reported on it, they reported the cause was believed to have been an undetected massive downdraft that shoved the plane down.
I lost the thread on that story before the investigation concluded the actual cause was autopilot error. It is, in some ways, more comforting to me to know that the issue wasn't novel atmospheric phenomena, but instead relatively-mundane cosmic radiation flipping one packet of data from the sensors to the avionics that the avionics lacked sufficient redundancy to detect or discard. As a result, the autopilot believed the plane had suddenly pitched 90 degrees and drastically corrected to escape stall.
Surprising that only two people in all of the comments here suggest to use ZFS. It could have detected any broken data before it was moved to the backup system.
It seems like every year or so we have some gigantic technology meltdown in this industry.
Imagine something like this happening at the NYSE, CME, et al. Or, simply think about the last time you heard about a nationwide credit/debit card outage...
Why can't we have our national infrastructure systems running at least as reliably as the Amex network?
These systems are all information clearinghouses at the end of the day. If we have matching engines that flawlessly process millions of trades per second every day and mainframes that provide resilient source of truth, I think we could consider the same for a life safety critical system as well.
All of those systems have outages all the time, mostly isolated incidents when it comes to NYSE, CME and similar (here is the outage page for NYSE for example: https://www.nyse.com/market-status/history). Visa, Mastercard and others have nationwide outages from time to time too, they're in no way invincible like you seem to imply. Latest Visa outage I can remember must have been around 2018/2019 sometime, but probably it happened later than that too, except I didn't notice it then.
Makes one think if they should have hourly snapshots combined with integrity checks of the snapshots (and the main database, assuming it is feasible to run that for an online database). They will know within an hour if a snapshot is corrupt + they can restore from the last known good snapshot and not lose too much data and be up relatively quickly. Obviously like most software, there is a bunch of "it depends" but would be interesting to know more about the system.
I have a feeling the combined age of the hard drives in that system exceeds the age of the United States itself (which is 247 years). When fsck was introduced in 4BSD in 1980, it checked every filesystem on boot because it had no better idea. If this thing is, say, forty years old then that's exactly the right age for this...
Obviously it was due to details being entered using the now-standard international foot when the software expected the now-obsolete US survey foot [1] .. the discrepancy across a transcontinental flight was large enough for a pilot to fall through.
> In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system -- a significant decision, because the reboot can take about 90 minutes, according to the source.
90 minutes; this certainly appears to be an advertisement about Windows OS.
I worked for an organization in the 1990s that ran a legacy IBM mainframe (later replaced with an AS/400). After an unplanned power outage, the disk confidence test and rebuild took over 1/2 day.
It sounds like the kind of problem that blockchains were invented to prevent. Important systems should not blindly replicate data, they should validate it.
Ok so it wouldn't be a public blockchain. The point is that when you replicate data you reject it if it doesn't conform to some ruleset. That way data that would break the system fails to replicate, and the scope of the problem is limited to whoever (or whatever) entered the bad data.
Data validation is a basic expectation in many technology stacks. I don't understand why blockchain would be uniquely positioned to resolve this problem.
Accident or not, there are zero situations where an op/drill or a real APT hack would be officially ack'ed. Plausible deniability is a requirement of many centralized control systems.
AFAICT, this NOTAM system is a nationwide bulletin board, using some cryptic standard abbreviations (to save space as if they were paying 1990s SMS rates), usually filtered by locale/coordinates/path, so pilots have the latest news that might affect their flight plans.
Has anyone seen exactly how many NOTAM messages are generated per day, and how long of a look-back is required?
From 50k feet, it looks like something that could be replaced with a cryptographically-authenticated massively-replicated virtual data structure that's oblivious about, and robust against all sorts of failure in, the exact systems, languages, update-paths, etc used to keep it in sync or implement any one user's view.
All that pilots need to know is: "I have a full local copy, signed by the right update-authorities, as of roughly-now."
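That last check is genuinely a solved problem. A sketch with Ed25519 via the `cryptography` package; the key handling and bundle format here are invented, and in reality only the public key would ship to clients:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Stand-in for the update authority's key pair.
    authority_key = Ed25519PrivateKey.generate()
    authority_pub = authority_key.public_key()

    bundle = b"NOTAM snapshot 2023-01-11T06:00Z\nA0123 RWY 09/27 CLSD\n"
    signature = authority_key.sign(bundle)

    def have_valid_local_copy(bundle: bytes, signature: bytes) -> bool:
        """The only question the client needs answered: does the snapshot verify?"""
        try:
            authority_pub.verify(signature, bundle)
            return True
        except InvalidSignature:
            return False

    assert have_valid_local_copy(bundle, signature)
    assert not have_valid_local_copy(bundle + b" tampered", signature)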
From a glance, NOTAM's uptime this year looks worse than Ethereum, but a bit better than Solana or Binance Smart Chain.
This is great stuff, thanks! Can't shake the feeling a tiny team of professional modern software/system designers, paired with some aerospace old hands, could create a far-better (but also backward-compatible) system in short order.
> using some cryptic standard abbreviations (to save space as if they were paying 1990s SMS)
When I did my initial pilot training in the late 1980’s, the codes for METAR, TAF, and NOTAM had already long been in place. It was explained to me at the time that the encoding was a practice that dated back to its origins in the teletype era. I suppose the limited baud rate of these devices meant that economizing on symbol density was a good idea. I’d still much rather read these succinct formats because it’s easier to chunk it at a glance.
The abbreviations way pre-date SMS - they were standardised back in the teletype days, 1940s-1950s, when printing speed was 30-100cps! Now NOTAMs (mostly) comply with global standards so, like VHF AM aviation radio, substantial changes are impractical.
There is significant debate concerning the number (too many) and size (too long) of NOTAMs, but I can assure you that when you get an unexpected rerouting on a dark, bumpy, busy night, you do not want to be reading pages of plain-text prose! All experienced pilots are comfortable with the cryptic NOTAMs, and are used to scanning the abbreviations for important items :)
> cryptographically-authenticated massively-replicated virtual data structure that's oblivious about, and robust against all sorts of failure in, the exact systems, languages, update-paths, etc used to keep it in sync or implement any one user's view. All that pilots need to know is: "I have a full local copy, signed by the right update-authorities, as of roughly-now."
You've solved one problem and created several more problems.
Crypto-authenticated update & replication works really, really well nowadays. Lots of open-source support, extensive testing, proven track records even in adversarial deployments.
Especially for a simple log-like system, with a limited number of permissioned authorities.
cryptic abbreviations in the NOTAM system reminds me of telcos using CLLI codes for unique geographical locations of physical telecom infrastructure sites - a legacy of 1960s mainframe stuff where the number of characters in a database field for text entry was extremely constrained.