One of the most important pieces of legislation in the UK that helped improve safety in workplaces was the Health and Safety at Work Act 1974, which placed a duty of care on organisations to look after the wellbeing of their staff. One of the tenets is to take near misses seriously. According to a health and safety engineer in a course I attended, near misses are like gifts from the heavens, as they show you exactly how something could go wrong despite no one getting hurt, giving organisations a chance to improve their processes. If a workplace accident does occur, the Health and Safety Executive (who enforce the act in large organisations) can levy enormous fines for preventable accidents especially where there is evidence of a near miss beforehand.
- An accident is a combination of the system being in a hazardous state and unfavourable environmental conditions that turn the hazard into an accident.
- A near miss is when the system was in the hazardous state but the environmental conditions were in our favour so the accident didn't happen.
We can't control environmental conditions, we can just make sure to keep the system out of the hazardous state. This means a near miss contains all we need to perform a proper post mortem. The environmental conditions are really immaterial.
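The two definitions above amount to a small truth table. Here's a minimal sketch of it in Python; the names `Outcome`, `classify` and `needs_post_mortem` are made up for illustration and don't come from any safety standard:

```python
from enum import Enum


class Outcome(Enum):
    SAFE = "safe"
    NEAR_MISS = "near miss"
    ACCIDENT = "accident"


def classify(hazardous_state: bool, unfavourable_environment: bool) -> Outcome:
    """Classify an event using the two definitions above."""
    if not hazardous_state:
        # Outside the hazardous state, the environment has nothing to act on.
        return Outcome.SAFE
    if unfavourable_environment:
        return Outcome.ACCIDENT
    # Hazardous state, but the environment happened to be in our favour.
    return Outcome.NEAR_MISS


def needs_post_mortem(outcome: Outcome) -> bool:
    # Near misses and accidents share the one input we control: the hazardous state.
    return outcome in (Outcome.NEAR_MISS, Outcome.ACCIDENT)
```

The environment flag only decides which of the two bad labels you get. Whether a post mortem is warranted depends on `hazardous_state` alone, which is the point being made above.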
Near misses are absolutely like gifts from the heavens. I was part of a post-investigation for a near miss that would've looked like this if things had gone a bit differently:
By this logic, the recent Boeing 737 incident should be considered a gift from the heavens. Which I think is actually a good way to look at it, though we don’t yet know whether this gift will be squandered.
I think a company taking things seriously would view it as a gift from the heavens, but the fact that Boeing is seeking an exemption for certification doesn't really seem like they're putting safety ahead of profits.
With the number of data breaches we see cropping up, I wonder if a similar law could be written to hold companies liable for the safe handling of personal data.
If you want to call yourself "engineer" then at a minimum all those standards and minimum requirements should apply, no questions asked.
I've heard stories of hauling civil engineers out of retirement and in front of a tribunal after some structure collapsed before its time, or it was found that the original design was flawed in some way.
An engineer "signing off" on something actually means something, with actual stakes.
Of course the "developers" will revolt against this, because they are not engineers. A developer does not get to sign off on anything, and everything they do must be scrutinised by the engineer in charge before it can be deployed.
Doing this for your common run-of-the-mill CRUD app or HTML website is obviously overkill. Just like you don't need an engineer to replace a broken window or hang a new door. But when it comes to things that actually matter, like data safety and privacy, you must ensure you bring in the right culture (let alone job title).
> An engineer "signing off" on something actually means something, with actual stakes.
It's all about the social impact of a mistake.
The reason "engineers" sign off stuff is not because they are engineers. It's because society frowns upon people dying due to a faulty building, wants to see those responsible for the disaster punished and not be behind another project again.
Society's response to ensuring incompetent professionals can't sign off on stuff is to manage an Allowlist of individuals who can sign off, and have mechanisms in place to remove from the Allowlist those who screwed up badly.
These Allowlists are things like professional engineering institutions, and someone who is registered in the Allowlist is called an engineer.
Now, does the software engineering world have anything of the sort in terms of importance in a society? No. A service crashing or a bug being sneaked into a codebase does not motivate a strong response from society demanding some people do not ever get behind a keyboard again.
There is absolutely no use case for misuse of any mundane service or desktop app that comes even close to matching the impact on society of an apartment building collapsing and killing occupants.
> How many lives have been destroyed by shoddy software?
In relative terms, if you compare any grievance you might have with the sheer volume of services and applications running 24/7 all over the world... Zero dot zero percent.
You can't even make a case against using free software developed by amateur volunteers in production environments.
Self-driving cars absolutely will kill someone. The driver may be at fault for being distracted, but as long as the system allows for distracted drivers, the software is the process. See: Aviation post-mortems.
Medical equipment is another example. We do not need to go as far as Therac; there have been a number of near-misses which, in the absence of independent post-mortems, are never recorded anywhere.
Drones and autonomous warfare also occupy many developers.
Social scoring systems, and taxation and payment systems, should also not be trivialized. Even if they do not make spectacular media coverage, they absolutely kill people. No independent post-mortems to be seen, blameless or otherwise.
Social media would like a word. The suicide rates of teen girls have tripled since the introduction of social media. To say nothing of the corrosive force on democracy and our ability to make sense of the world as a society.
A building collapsing might kill 10 people. A rare tragedy. Social media impacts billions. Even if you want to downplay the negative impact on each person, the total impact across the globe is massive.
Software has eaten the world, for good and ill. I think it’s high time we treated it seriously and took some responsibility for how our work impacts society.
Adolescent suicides rose from 8.4 per 100,000 during the 2012-2014 timeframe to 10.8 deaths per 100,000 in 2018-2020, according to the new edition of America's Health Rankings Health of Women and Children Report from the United Health Foundation.
But regardless, I find it pretty ridiculous to claim that us software engineers don't have a massive impact on the world. We've inserted software into every crevice of human civilisation, from my washing machine and how I interact with loved ones, all the way up to global finance and voting systems. You think technology companies would be the largest industry on the planet if we didn't have an impact on people's lives?
Leaving the social media point aside, all I'm arguing is that when harm actually occurs due to negligence, companies need to actually be held responsible. Just like in every other industry that doesn't have convenient EULAs to protect them from liability. For example, if Medibank leaks the health records of all of their customers, they should be punished in some way as a result - either fined by a regulatory agency or sued by their customers. Right now, they shift all of the harm caused by negligent behaviour onto their customers. And as a result, they have no incentive to actually fix their crappy software.
I don't want companies the world over to look at things like that and say "Well, I guess Medibank got away with leaking all their customers' data. Let's invest even less time and effort into information security". I don't want my data getting leaked to be a natural and inevitable consequence of using someone's app.
Even from a selfish point of view, people will slowly use less and less information technology as a result, because they have no ability to know which companies they can trust. This is already happening in the IoT space. And that's ultimately terrible for our industry.
Oh, social media companies have an enormous impact on the world - through decisions made at C-suite and senior management level about what to demand of software engineers and how to deploy that work.
The impact by software engineers perhaps falls more in the "failed to whistle blow" category than the "evil Dr. Strangelove" box, save for those very few that actually rise to a position of significance in strategic decision making.
That aside, the teen girl suicide rate underpinning your reference seems to be about 2x, from 2.8 per 100K (ish) circa 2000 to 5.5 per 100K in 2017.
As a research letter from JAMA I take that as fair reporting of some raw CDC data - I don't know how representative that result is in the fullness of reflection, normalisation, and other things that happen with data over time. To be clear I'm not quibbling and I thank you for the link.
Haidt also makes clear that "Correlation does not prove causation" and argues that "No other suspect is equally plausible".
I'm 100% willing to align myself with the "social media as it stands is a scourge on humanity and young minds (for the most part)" camp.
I'm equally on board with the view that corporations are shit at personal data security and should have their feet held to the fire until they improve.
The link to mental health and suicide rates is far from shown, and could have any number of confounding factors.
Perhaps a better example would be the situation in Myanmar. It has been shown beyond doubt that it was in fact a genocide in the full meaning of the term, and that it was made much worse by Facebook.
Both by design where their algorithms are designed to maximize the impact of this type of social contagion, but also by their manned staff which were either unwilling or not allowed to help. Both situations are equally bad.
Not to mention building software that decides who gets health care, insurance or mortgage and discriminates based on bugs and faulty premises. And we're not even at Tesla killing people with software bugs.
All those engineers need to be hauled up because they’re killing people. Software engineers by contrast are high performance: barely any fatalities and so much value created. It’s why it’s not construction firms that are the most valuable companies but software companies. You can count on software engineers to build things that won’t kill, for the most part. Other kinds of engineers, on the other hand, are constantly killing people.
We need higher standards for engineers. They could learn from software folks. If you can’t do it safely don’t do it.
I have a friend who works in a hospital. Apparently the hospital software they use constantly freezes for like, 30+ seconds while they're trying to get work done.
Meanwhile my friend has had her passport, entire medical history and all sorts of personal information leaked by Medibank then Optus, within a few months of each other. Neither company as far as I can tell has been held in any way to account for the blunder.
Meanwhile the Post Office Scandal is rocking the UK - where a software mistake landed a bunch of completely innocent people in jail and led to multiple suicides.
And don't even get me started on the impact of social media.
We might not kill as many people as engineers. Maybe. But we certainly cause more than our share of harm to society.
I think you can prevent a lot of revolt by just finding a way of putting this other than "software engineer output" should be held to standards. The whole point of this article is that the individual is not the problem. Rather, the system should prevent the individual from being the problem.
Nobody wants software quality to be regulated by people who don't know anything about software. So, probably we should be regulating ourselves. It's not an easy problem though. Can you really state a minimum safety standard for software that is quantifiable?
I don't think it would be the developers that would revolt. Well some would for sure, but the real opposition will come from the C-level, as this would give their employees way too much power to say no.
I'm an engineer by many definitions, although I don't have or wear a stripey hat, so not the most important definition.
I'm not licensed though and have no signoff rights or responsibilities. If I were to consider signing off on my work or the work of my colleagues after review, the industry would have to be completely different.
I have been in the industry for more than 20 years and I can count the number of times I've had a complete specification for a project I worked on with zero hands. I can't sign off that the work is to spec if the spec is incomplete or nonexistent.
Writing, consuming, and verifying specs costs time and money and adds another layer of people into the process. The cost of failure and the cost to remediate failures discovered after release for most software is too low to justify the cost of rigor.
There are exceptions: software in systems involved with life safety, avionics, and the like obviously has a high cost for failure, and you can't take a crashed plane, turn it off and on and then transport the passengers.
You don't get a civil engineer to sign off on a one-story house in most cases either.
Any software system that might produce legal evidence should also be an exception… it feels a bit too much of a common case to call it an exception though.
(This is in reference to the Horizon scandal, yes.)
NCEES, which administers the regulation of Professional Engineers in the US, made a Software Engineering Exam which had some standards: https://ncees.org/wp-content/uploads/2016/01/Software-Engine... However, so few people took the test that they discontinued it. But I at least liked the idea of it.
So you propose no idea of a "minimal set". If a liability is proposed, it has to be a reasonably binary state: compliant/non-compliant.
Just like every time, there is no concrete proposal for what constitutes a "minimal set". That's like "make education better", with no concrete plan. We can agree on the goal, but not on the method.
There are several ways we could write a list of best practices. But the simplest would be to simply attach a financial cost to leaking any personal data to the open internet. This is essentially how every other industry already works: If my building falls down, the company which made it is financially liable. If I get sick from food poisoning, I can sue the companies responsible for giving me that food. And so on.
We could also write a list of "best practices" - like they do in the construction and aviation industries. Things like:
- Never store personal data on insecure devices (eg developer laptops which have FDE disabled or weak passwords)
- Install (or at least evaluate) all security updates from your OS vendor and software dependencies
- Salt all passwords in the database
And so on. If you locked some competent security engineers in a room for a few days, it would be pretty easy to come up with a reasonable list of practices. There would have to be some judgement in how they're applied, just like in the construction industry. But if companies are held liable if customer data was leaked as a result of best practices not being followed, well, I imagine the situation would improve quite quickly.
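To make the "salt all passwords" item concrete, here is a minimal sketch using only the Python standard library. It goes slightly beyond what the list says: on top of a per-user random salt it uses a deliberately slow key-derivation function (PBKDF2), since salting a fast hash on its own is generally considered insufficient. The function names and the iteration count are assumptions for illustration, not a recommendation from any particular standard:

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; a real system should tune it for its hardware.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest). Store both; never store the password itself."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)
```

Because every user gets a different salt, two users with the same password end up with different stored digests, so a leaked table can't be attacked with one precomputed lookup. A check like "is every password stored this way?" is also close to the binary compliant/non-compliant test asked for upthread.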
It's boring work. Compliance is always boring. But it might be better than the current situation, and our endless list of data breaches.
If a system is attacked or fails in a novel way, and the system design was done in accordance with best practices, the follow-up investigation will note this, the system implementers will in all likelihood not face any repercussions (because they weren't negligent) and the standards will be updated to account for the new vulnerability. Just like other engineering disciplines.
Yes, things are constantly evolving in computer engineering, but how many of the failures that make the news are caused by novel attack vectors, versus not following best practices?
We don't live in a hard, binary world of extremes. Yes -- buildings and locks are, for the most part, easily compromised. But...
If I make a window of bulletproof glass, it's a lot harder to compromise.
If I surround my building with a wall topped by razor wire, it's a lot harder to compromise. (Unless, of course, I'm making an action film.)
Depending on what or who I'm protecting, I might find these solutions very valuable. Similarly in software, depending on what or who I am protecting, I might be very interested in various security measures.
So far, we've managed to escape an incident where bad actors have compromised systems in a way that's led to severe loss or harm. People are terrible at evaluating risk, so people tend to assume the status quo is okay. It's not, and it's only a matter of time before a bad actor does some real damage. Think 9/11 style, something that causes an immediate reaction and shocks the population. Then, suddenly, our politicians will be all in favor of requiring software developers to be licensed; liability for companies that allow damages; etc.
> If I get sick from food poisoning, I can sue the companies responsible for giving me that food.
Except that is not how food safety is achieved. There are strict rules about how food must be stored and prepared, how the facilities must be cleaned and how the personnel must be trained. If these are not met, inspectors can and will shut a commercial operation down even if nobody got sick.
And if they follow all best practices they have a good chance of arguing they were not at fault for your illness (besides the point that following them makes it much less likely to happen in the first place).
Food safety is achieved via a combination of strict rules (equivalent of best practices) and liability in case of harm. There's been plenty of quite famous court cases over the years stemming from bad food being sold leading to people getting sick and dying.
> But if companies are held liable if customer data was leaked as a result of best practices not being followed, well, I imagine the situation would improve quite quickly.
One might argue that GDPR does exactly this, it holds companies financially liable for data leaks. Would you say it has improved the situation?
Well, I interpreted your question 'How do you define "minimum safety, correctness and quality standard"?' as to be about the process, not about the outcome.
I actually have not invested much thought into what the outcome should/would be as I think it's unlikely to happen anyway. So why invest time? But maybe ask the author of the original comment?
The EU is doing something like this with some proposed legislation. Basically it's a CE mark for software. The current version is very much a bare minimum, though. But it does impose liability on people selling software.
> But of course the developers will revolt against that.
Why would developers revolt? OTOH users would definitely be unhappy when every little app for their iPhone suddenly costs as much as an airplane ticket.
It might do the opposite. Imagine if the company was held responsible for the quality of the software regardless of how it was made. I suspect it would be much easier to meet any quality standards if software was written in-house.
It would get outsourced to the first provider who can indemnify the company against any failures in the software. Whether or not any provider would dare to provide such a service however...
However it happens, it still attaches a legal & financial cost to lazy security practices. And makes it actually in companies' best interest to do security auditing. I think that would be a net win for computer security - and consumers everywhere.
I've been obsessed with the US Chemical Safety Board videos on YouTube that describe in great detail the events that lead up to industrial accidents. One of the common themes I've seen among them is that there's usually some sort of warning sign or near miss that goes ignored by the people responsible for them since they don't cause any major damage. Then a few days or months later that system fails in an entirely predictable way with catastrophic consequences. A good example of this is the fatal phosgene gas release at a DuPont chemical plant[1].
It is worth keeping in mind that you don't see the other side of the equation in these reports: how many warning signs and near misses didn't result in a major accident. Part of that is just the odds, and why people and organisations can become complacent about them, and part of it is that while most of them may be addressed, some can still slip through the cracks.
Yes. Almost everyone has seen at factories the prominently posted signs updated daily: "XX Days Since a Time Lost Accident". Some years ago, I noticed in some more technically advanced shops a change: "XX Days Since a Near Miss", along with an invitation and method to report near-misses.
Seems like progress; a near-miss noticed is indeed a real gift, if followed-up.
> One of the most important pieces of legislation in the UK that helped improve safety in workplaces was the Health and Safety at Work Act 1974, which placed a duty of care on organisations to look after the wellbeing of their staff.
I've not been following closely, but I suspect the Post Office did not follow this. Are there ramifications happening along these lines for the Post Office fiasco? Or is it already a big enough thing that this would just be "oh, and also that too I guess" kind of thing?
Amongst many other complications in the UK Post Office scandal that make it different from what's being discussed here: the victims in the post office scandal were franchisees, not employees.
...by Kyra Dempsey, a.k.a. Admiral Cloudberg, who has a long series of in-depth airplane crash investigation articles on Medium: https://admiralcloudberg.medium.com/
The USAir 1493 crash happened in 1991, a decade after the PATCO controllers were fired. I have a hard time believing that there is any connection - by 1986 or 87 the controller positions had already been fully and properly staffed with trained controllers. Throwing the mention of PATCO into a discussion of a 1991 incident would just further a political agenda and add nothing to the discussion.
Yeah, that's probably why she didn't do it. But if you imagine having to suddenly hire 10,000+ people in conditions which made the previous occupants of those positions so frustrated that they went on strike despite not being allowed to, I guess you can't be too picky with who you hire...
The article makes it pretty clear that Washer was not incompetent at her job, and that the role of the PATCO aftermath hiring spree was not to allow her to get hired when she otherwise couldn't, but to push her to apply when she otherwise might not have for specific personal reasons.
A different, and likely more productive, way of looking at a possible connection between PATCO firings and USAir 1493 is in the form of institutional knowledge lost.
I think a decent argument here is that the mass firing of a large portion of existing ATC personnel damaged the institutional knowledge and rigor of the ATC profession as a whole. What highly technical profession wouldn't be damaged by that? And this damage just increased the risk of serious mishap across the board; it suddenly became much more likely an ATC would make a mistake, and this one is one of those mistakes.
And if you like this sort of thing, you might also enjoy the various people that do reconstructions on YouTube - the real audio played over animations/flight sim of what the planes are doing. The most interesting/lively ones are those with a 'possible pilot deviation' or 'number to call' IMO if you want a search term to narrow them down.
I'm not a pilot or planespotter or particularly interested in planes generally, I've just fallen down that rabbit hole a couple of times for some reason, can be quite interesting. It has made me wish the in-flight entertainment included a 'listen to pilot/tower comms' option.
So, "fun" tale: the https://en.wikipedia.org/wiki/American_Airlines_Flight_191 crash had a new feature, a closed circuit television system that showed passengers the view from the cockpit. They very possibly got to watch their demise first hand from that camera. For that reason, it might never happen again.
Too bad really. So what if you get to watch yourself die horribly, you die either way! You will still be pasta sauce on the other end of the incident, you really shouldn't care. You might have a stronger spike of adrenaline than otherwise, but that's not going to hurt you any more than being pulverized will.
There's a reason it doesn't. The last thing any pilot or controller would want is to have something sensationalized or misconstrued and then spewed all over social media.
United Airlines used to have this back in the days where your seat had an audio jack and a few channels of music to listen to. Channel 9 was the audio from the cockpit.
Nowadays you can listen to ATC from nearly anywhere by visiting https://liveatc.net/
I've listened to liveatc.net, and then took a flight in an Extra 330LS stunt plane and got to listen to ATC from an actual headset...
...and I can't understand hardly a word they say. Even having a slight idea of the phraseology from countless hours in MSFS and watching VATSIM streams on Twitch, when it comes to the real thing, they talk so fast and there's so much static that it blows my mind that anybody can understand them at all.
Commercial rated pilot here. It's definitely a bit of a learned skill, but it doesn't honestly take that long to learn, and it works better than any other alternative might. The most important thing is for everyone in the system to be willing to speak up if they don't clearly understand what's been said to them. "Say again slower" is a perfectly acceptable response, and is a required one of a pilot or controller who did not understand what was said.
Controller cadence is a bit geographical - the busier the airspace the faster the tempo, and the more important it is to speak up.
AM simplex radio works amazingly well overall for this purpose, and none of the potential technology that could replace it would be as safe or reliable.
As a former aviator working in tech, it amuses me how periodically on this site, someone learns how the aviation community is flying around using GPS and simple UHF/VHF radios and NAVAIDs, immediately goes into a tizzy about how unsecure and hackable it allegedly is, and then goes round the bend about things like encrypted comms and starts answering a question no one asked.
Like, sure, in the military there's the ability to do various and sundry things to secure your comms and navigation which are not worth discussing here. But in the civilian world, there's largely no reason to go to the trouble and expense.
I don't think that's a reason, they're not secret (the YouTube reconstructors I mentioned (legally) get them after all - there are public if paid archives) they're just not available on that system. Just like all kinds of other stuff isn't. I think it's simply niche interest. Plus it's only any good when you haven't taken off yet and are about to land, i.e. the times cabin crew will do what they can to have you not enjoy any kind of entertainment whatsoever anyway.
Thank you for pointing that out! I’m a big fan of Admiral Cloudberg, and was in fact even thinking of her while reading this article, but I did not make the connection. She’s such a fantastic writer.
Does anyone know how she became so good at writing about airline disasters? She doesn’t seem to have any sort of aviation background.
Great article. I agree a lot with the safety culture in commercial aviation and the commitment to investigate and fix root causes.
Now can we please apply the same standards to car crashes? The same human errors and bad infrastructure keep getting repeated over and over again. And the problems are getting worse, with SUVs and distracted driving on the rise.
The cognitive dissonance people have around this issue is astonishing. We are willing to ground the 737MAX fleet when a few people get a (surely terrifying) open air flying experience; but 44,000 people are killed on the road in the US _every_ year (and rising!) and very few people seem to care. In most age cohorts, death by car is the largest killer. In the US you have a 1 in 107 chance of dying in a car crash in your lifetime. Even simple and completely reasonable measures to reduce these insane numbers are seen as some kind of tyrannical affront to one's freedom (see the current CA measure to add speed limiters).
The car industry, car culture, and car centric thinking in the US and much of the world is totally out of control.
I too have thought a lot about this cognitive dissonance. I think that there are several differences to the two issues, and so, in people's attitude towards it too.
One, I think, is that flying is something that is being done to us, and driving is something that we do ourselves. So the agency, the point of view is very different. Something bad happening while being passive is much more horrifying because of the powerlessness.
The other is that cars and driving environments differ a lot, while planes are much more similar to each other. What I mean by this is that it's easier to dissociate the car deaths, because that happens to some other people over there, nothing like me, but plane badness happens to everyday folk in a big winged tube, like me.
I think that if we drove the planes ourselves, the issues would be much more similar. And similarly, if everyone took the train, the bus, or a ship, and similar things would happen to a train, bus or ship, the freakout would be similar to what we see now with planes.
> is that flying is something that is being done to us, and driving is something that we do ourselves. So the agency, the point of view is very different.
Correct. An all-too-popular viewpoint is, "I'm a good driver, unlike everybody else on the road!".
> if everyone took the train, the bus, or a ship, and similar things would happen to a train, bus or ship, the freakout would be similar to what we see now with planes
I believe this is the case already. A train or bus kills a few hundred a year, it makes national news for a week. Don't get me wrong, it's a tragedy and needs to be fixed. But then, those 44k car-related deaths are continually brushed aside.
The illusion of control. Most humans are instinctually predisposed to it, and it generally has to be unlearned. It's why blackjack is much more popular than roulette in a casino, even though there is much more room to play blackjack sub-optimally, but effectively zero chance of playing roulette in a way that hurts your chances of winning.
Even if you know you aren't a good driver, you would still carry around the idea that it was your own fault for not being a better driver. It's really the level of personal agency involved (and the sense of that agency being taken away) that causes the brushing aside.
I think the dissonance is surely curbed when sitting in a plane at cruising altitude realising that I am much more worried if these engines break down as opposed to my car, as I am likely to fall out of the sky.
Planes that break down fall from the sky. Cars that break down don't. The people operating the plane are well trained; any Tom, Dick and Harry can drive a car without any check on their mental and physical state before they hop behind the wheel.
TIL that car fatalities were declining until 2019, and then reversed and are getting worse.
What happened in the last five years? Safety features of the cars themselves are improving (emergency braking). Alcohol may be a factor, but why in the last five years? Cars have been big for a while. WFH probably reduced commuting time. Other countries appear to be on the decline.
Ironically, heavy traffic is one of the better "safety features" of our automobile transportation system. Since crowded roads are higher-conflict roads, there is a bit of luck in the fact that traffic slows down when it gets crowded. There may be more collisions, but they are less deadly.
Suddenly there is a pandemic, and there are orders of magnitude fewer people on the roads. The number of collisions goes down, but the number of deaths goes up, because all of the collisions are at higher, deadlier speeds.
---
Driving Went Down. Fatalities Went Up. Here's Why. by Charles Marohn
> TIL that car fatalities were declining until 2019, and then reversed and are getting worse.
Per your link, fatalities per mile driven bottomed out in 2014 and barely dropped after 2010. There's been some speculation that this reversal in safety was tied to the rise of smartphones and a corresponding increase in distracted driving.
Note that deaths per vehicle mile traveled (VMT) went up during Covid. Reducing commuting time doesn't necessarily reduce deaths under that metric -- just the opposite! Commuter trips are probably the safest per VMT since they're the most familiar.
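To make the mix effect concrete, here is a toy calculation. Every number in it is made up purely for illustration (the miles, the per-mile rates, and the size of the commuting drop are all assumptions, not real statistics). It only shows that removing the safer trips can push deaths per VMT up even while total deaths fall.

```python
def totals(trips):
    """trips: list of (millions_of_vehicle_miles, deaths_per_100M_miles).

    Returns (total_deaths, deaths_per_100M_miles) for the whole mix.
    """
    total_miles = sum(miles for miles, _ in trips)
    # miles are in millions, rates are per 100 million miles
    total_deaths = sum(miles * rate / 100 for miles, rate in trips)
    return total_deaths, total_deaths / total_miles * 100


# Hypothetical mix: (commuter trips, all other trips).
# Commuter miles are assumed to be safer per mile than other miles.
before = [(2000, 0.8), (1000, 1.6)]
# Hypothetical lockdown: commuting drops sharply, other driving is unchanged.
after = [(500, 0.8), (1000, 1.6)]

before_deaths, before_rate = totals(before)
after_deaths, after_rate = totals(after)

print(f"before: {before_deaths:.0f} deaths, {before_rate:.2f} per 100M VMT")
print(f"after:  {after_deaths:.0f} deaths, {after_rate:.2f} per 100M VMT")
```

With these invented inputs the total goes down while the per-mile rate goes up, which is the shape of the argument; it says nothing about how large the real effect was.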
There were stories about this effect, even as total collisions went down because of the bigger drop in VMT[1]. I speculated that this was a combination of
a) the above effect (stripping out the safer commuter trips), plus
b) the roads being dominated by people least willing to follow the advice to stay home, which correlates with being anti-social and reckless (mean though that sounds! [2])
My facebook friends suggested
c) the immense stress of coping with the Covid world made the average person less able to concentrate.
I also suspect:
d) traffic enforcement was reduced and drivers gradually started branching out into more aggressive maneuvers as they became aware of this.
Note that people saw their car insurance rates go down during covid because the typical personal policy only cares about accidents per unit time, and rarely adjusts for miles driven.
But I don't know why it hasn't regressed to the pre-covid levels -- probably because WFH hasn't completely reversed.
The driving majority has members who support buying tall heavy SUVs and pickup trucks "for their family's safety", running over pedestrians, speeding wherever they please, and protesting any kind of traffic calming or enforcement. And then there are the teens who run over cyclists for fun, such as in the case of Andreas Probst. Do you support that methodology?
I don't necessarily know that you can conclude from a few memes that people are actually out there doing this, I see it more as venting frustration. They're certainly advocating for it, though.
The vandals in question call themselves "Tyre Extinguishers". Search for that phrase in that subreddit, and you'll see it's full of people defending them.
I would never recommend this methodology to a friend or a family member. Drivers are vindictive, violent people and the risk is too high to be worth it.
And I hope they all go to prison. Used to be you'd be hanged for stealing a horse. Taking a person's means of transport is evil. I am, indeed, a vindictive and violent driver, apparently.
You just wished prison and positively mentioned hanging people in response for relatively harmless activism (tyres can be inflated again, you know). So yes, I think you are violent, at least when cars are mentioned.
To be clear, deflating tyres is illegal, but most successful activism campaigns are. I support the cause they're fighting for, unnecessarily huge cars hurt our environment and neighborhoods.
Wishing prison is morally correct. If you read what I was saying as "we should hang people for deflating tires", there may be another issue here.
I think my reaction highlights the clear issue in their thinking. I will never support anything these people want, because they hate me personally. Taking away people's transportation is a bad thing to do, period. If I go kick a bunch of puppies to campaign against people dying, I'm still a bad person.
> Taking away people's transportation is a bad thing to do
Drivers have been taking away transportation from non-drivers for decades. Look at suburbs without sidewalks. Look at winding suburban culs-de-sac that massively increase travel distance and make walking and cycling impractical. Look at the drivers who shout down any attempt to install bike lanes, or those who refuse to fund public transit. https://www.youtube.com/watch?v=dqQw05Mr63E
> they hate me personally
Sounds similar to drivers who want to run cyclists off the road. "Get off the road!" "You don't even pay insurance and road tax!" Also sounds similar to drivers who deny pedestrians the right to exist.
Overall, rpmisms, I don't think you appreciate your privileged position as a driver. Society has spent the last century bending over backwards to serve motorists like you. I hope you can develop a sense of compassion for those who are not in a car and are continually subjected to verbal and physical aggression from motorists.
I want better walkability and bicycle paths. I ride a motorcycle, I also have issues with cars acting awfully.
It's still wrong to damage others' property, even if your feelings are hurt. Motorists are the majority of the country, especially geographically. Don't you think more people would appreciate the privilege of being close enough to ride a bike? The real display of privileged behavior is injurious action against moral enemies. I have a lot more tolerance for someone losing their temper, as opposed to what is essentially gentle terrorism.
<whiney rant> After notjustbikes got big, he dropped any humility and started preaching to the choir, calling people who didn't agree with him racist idiots. I complained in the comments that I used to be able to recommend his videos sight-unseen to the people who needed to be convinced. He replied and insulted me as someone who just found his channel and couldn't take the truth. </whiney rant>
Unpopular opinion, but the truth: We have to drive and it's a right. I agree about the infrastructure issues. Still the modern "safe driving" mentality that people want is at odds with that REALITY.
This could change with public transit, however we don't do this. Even the best countries fall short, and most don't get close. It's actively discouraged in the USA.
Actually, you could argue that people already have to deal with too much BS to drive. The paperwork, rules, and so forth clearly are challenging for people to adhere to. So we throw the majority of people into a situation they aren't equipped for on a daily basis. Then this becomes a major cause of legal efforts, that alienate perfectly good citizens from society and cause us to forfeit our civil rights.
You can claim that people just need to be responsible, or we need to pass some safety law but it's delusional. The truth is we don't have a replacement for this. The average person can't afford to be carted around, and we would need to get to a place where driving was purely recreation.
So we are going to need very good public mass transit options, and IMO some kind of vast network of professional drivers that are more affordable than Uber, for the occasional "cargo trips" or to go "the last mile" (more like 10-50).
WTF? Driving is by definition a privilege. You're maneuvering a 2-ton murder machine.
We got into the current mess because people treat driving as a right - I have to have parking, I have to get there fast, I have to bend everything in society to fit my car. And when someone gets caught for a DUI, judges go lenient because they know that taking away their license will obliterate their means to work and live.
> So we are going to need very good public mass transit options, and IMO some kind of vast network of professional drivers that are more affordable than Uber
Sort of. We need denser cities and mixed-use neighborhoods so that people can live close to the goods and services they need. Transit is a feasible problem as evidenced by Europe and Asia. No society has operated under the assumption of a vast network of professional private vehicle drivers (so, disregard bus drivers).
> go "the last mile" (more like 10-50)
That's the problem. We have to stop designing cities like that.
The point was to give an UNpopular opinion. The current driving system is a joke in need of reform. The poor suffer the most.
IMO people just need to cut some of the BS with this driving stuff. I don't think judges really go particularly leniently anymore. You can't function without a car.
The reality is that we probably need to accept that driving is here to stay as a basic NEED and act accordingly. Certain people that are extremely dangerous drivers are going to get their rights curtailed, however we should also support everyone functioning in society.
I think we need to get closer to the system I saw in the Netherlands, however IMO it really fell short of what I want to see. Until we surpass this, I think we just need to accept driving.
Lots of jurisdictions, even in the EU, have vast areas with poor service. Most cities I've been in would be fine with public transit. Networks of large population centers can be easily connected, and with good bike culture we could be just like the Netherlands. It's the large swaths of non-city land containing people and towns that would be a challenge.
You can't just not drive. It raises alarm bells and attracts unwanted attention. I'm not the only one that's experienced this, I think you can find this on /r/fuckcars. An employer almost called the cops on me once. It's fucking weird. That's not even getting into how unsafe it can be.
How can we pretend this is a "privilege"? IMO it's like calling healthcare a privilege (which btw I also think is a right)
> Driving is by definition a privilege. You're maneuvering a 2-ton murder machine.
I think this point needs sharpening a bit by making the implicit explicit: "Driving" per se isn't the issue, it's where one does it and who else they may endanger when doing it badly or recklessly. People can maneuver 2-ton murder machines alone on their personal closed track as much as they want.
At least in the US I can see a reasonable argument for most people needing to drive, but it's much less obvious people need to drive the vehicles they do the way that they do to meet basic transportation requirements. One obvious if almost certainly unpopular solution would be to have the basic driver's license you get now with essentially no effort only cover small speed-limited cars that would pose far less risk to everyone else on the road. If you want or need something larger or faster, require a separate license with testing, experience, and re-certification requirements in line with the extra responsibility operating such a vehicle should involve. You abuse that even a little, back to a GPS speed-limited Corolla hatchback you go.
This addresses the issue with a lack of driving alternative while also addressing the cases where people need more than basic transportation. But I guarantee people would lose their minds and start setting fire to their local DMV if such a change were even theoretically proposed. Because the problem with car culture in the US (and elsewhere) isn't that driving is a necessity, it's that being a driver comes with such a massive sense of unearned entitlement that any restrictions at all for any reason are treated as a massive violation of God given rights. Even attempts to enforce existing laws are met with absurd levels of hostility, like the opposition to red light or speed cameras.
I say this as someone who likes driving and likes cars, but as someone who is also a private pilot, the reason we don't have a driving safety culture like we have a flying safety culture is that the absolute worst possible example of an unsafe pilot you could imagine is basically the average driver. And not only is that driving behavior accepted in a way it never would be in the aviation community, but it's treated as an unassailable right.
They see it as an "entitlement", because that's how society works. Don't want to drive and you will immediately start to experience discrimination, including microaggressions (no, it's not like being the victim of racial oppression, just another way to wind up in an outgroup).
The REALITY that I think is still being ignored in this thread is we don't rely on planes for our day to day travel, on average. I'm never going to need to hop in a plane just to eat, make it to work, or meet my friends.
I don't think it's "an entitlement" to want food, to get to work, and not be treated as a burdensome freak. To the vast vast majority of people, even frequent business travelers, flying is for occasional long distance travel, never for day to day activities.
If we really get to the point where non-driving is treated equally to driving, I might agree. However this seems exceedingly unlikely and is incredibly expensive. It would require me to be able to get around as a non-driver with the same cost and convenience. This turns out not to be economically feasible.
People only have to drive because they (either directly or via elected officials) chose to design large swaths of land that were only livable if cars were elevated to the preferred mode of travel. Better infrastructure design (in general, points of interest being built closer together) would obviate something like 90% of car trips.
But being forced to drive was absolutely a choice, and one that, with some effort, can be undone.
There are a lot of factors that really complicate this system.
Enforcement focuses on the wrong parameters. Police love to remind us that, "Speed is always a factor." Well, it's certainly the most easily measured one. The trouble is, what speed is wrong? Is it the 5 cars who want to drive 90mph on the freeway, or the 2 cars who want to drive 65 in both lanes? By obsessively enforcing speed limits, police have managed to perversely incentivize the difference in speed between drivers who are willing to risk a ticket, and drivers who self-righteously bottleneck traffic. I have never even heard of the law, "keep right except to pass" being enforced, even though it seems pretty obvious that it would help.
People don't want to drive. People have to drive. That reality perversely incentivizes drivers to minimize effort. There is very little motivating drivers to drive better than they need to. How could we possibly change this incentive? Fear tactics are not working. Threats are not working. This is an open question: what if it has no answer?
Infrastructure demands backwards-compatibility. We can't just build a new part of a city without roads. Where would people put their cars? If people living there didn't have cars, then where could they go? Where could they work? Could anyone visit? Utopia must have a compatibility layer: somewhere that alternative transit can transition to highways. But where? If you have to park your car away from home, then you need to manage the risk of unattended property. If you have to get a ride in someone else's car, is that affordable? If you are going to take a bus, how long will you have to wait? How many busses will you have to transfer between? No matter what we do, the compatibility layer will always be inefficient.
Infrastructure demands infrastructure. The only solution to more traffic is more road. The only solution to more road is longer driving distance. The only solution to longer driving distance is more time in traffic, which is identical to more traffic. The only way to stop growing this system is to stop growing the amount of drivers. How can that be done? In order to become a non-driver, you need a minimum viable alternative, otherwise you will have to drive just like everyone else. In order for alternative transit to be viable, there needs to be enough of it around that it feeds its own infrastructure demand. If alternative transit is built with a highway-compatibility layer, then that compatibility will effectively remove the need for alternative-transit growth. Realistically, alternative transit won't start demanding itself until its utility is equal to the highway system. We can't build that all at once, so how can we overcome the already-present highway infrastructure demand cycle?
No matter how you look at it, the highway system is feeding the worst aspects of itself into its own growth. The only way to change this cycle is to build a lot of alternatives really fast, and make them dirt cheap. It's a political nightmare. Even so, I would much rather live in that nightmare than the one where rubber meets the road.
I don't really agree that this is the "trouble". Any competent transit-related professional knows what type of driving is wrong. Even American cops know [0] that in terms of traffic the key thing is to not impede traffic and target dangerous driving.
However, hunting school zone speeders, left lane hoggers and traffic weavers will never generate as much revenue as using an automatic machine that points at cars going faster than an arbitrary number.
I love driving in the country, too. What I hate is driving in the city.
There is no need to completely eliminate the highway system. The need is to provide viable alternatives so that people who participate in the system do so by choice.
Besides, public transit doesn't need to involve any interpersonal interaction. You could have a private compartment and still be more safe and more efficient than commuter traffic.
This is where I stopped reading the story. I couldn't tell that we were getting to the point of answering the titled question. I'm guessing it's the blameless culture/process that keeps information flowing and improvements made.
Reminds me of the guy at GitLab, I believe it was, who accidentally and irreversibly deleted a bunch of repositories. GitLab had a fantastic approach to that whole ordeal and public embarrassment. They essentially blamed themselves for not having the proper contingencies in place instead of blaming the individual. It's worth reading up on.
It happened to me and my team like 10 years ago or so: An engineer deleted the production database. We were able to recover it. But the postmortem basically focused on why the heck we let that happen. The guy stayed in the company for 8 more years IIRC. We fixed our systemic issue as well.
I've read about blameless postmortems from Medical professionals and I always tell my teams: if those guys (doctors) can do blameless PMs after someone has died, we can do them.
Also reminds me of the Reddit thread [0] where someone accidentally used write access credentials that were on an onboarding document and wiped the production database. GitLab guy actually joined the conversation.
From that link you should also be able to find conversations on Hacker News and the like. It was talked about in a lot of places at the time.
If I’m remembering correctly (take this paragraph as imperfect memory), at the time a lot of people on the outside were looking to assign blame but the team tweeted something to the effect of “yes, we know who did it, and no, they won’t be fired” and didn’t even reveal who it was. Then they live-streamed the process of trying to recover as much as they could. They got a ton of community encouragement and it was widely viewed as the right way to handle things.
Intended to focus on maximal learning about all factors that contributed to an accident. It subsumes several approaches, including a blameless analysis and that each factor doesn't just have a single "cause" but that different factors form a network of interaction.
Also nitpicking about the headline: You haven't been in a plane crash because you wouldn't be reading this if you had been :)
Whoever got curious enough to enter this comment section: carve out some time to read the CAST handbook. It changed how I look at accidents but also how I look at organisations and the rest of the world. Should be mandatory reading for anyone in any position of responsibility.
I have long wanted to write a review/summary of it on my blog but it's so dense in useful content it's hard to compress further. (There's a reason I have not published it so please try not to judge it for its rough edges, but this is what the draft looked like when I gave up last time: https://two-wrongs.com/root-cause-analysis-youre-doing-it-wr...)
Thank you for posting your summary! I appreciate summaries because each individual finds something different to emphasize about the text.
Have you looked at the book "Handbook of Systems Thinking Methods" (2023, Salmon, Stanton, et al)? It's all about applying systems thinking to safety and talks about the STAMP-CAST model.
Thanks for that, I really appreciated this breakdown of the accident at Three Mile Island and the concept of 'the second story' with regards systems thinking; https://www.youtube.com/watch?v=1xQeXOz0Ncs
Yeah, probably, and the counterfactual where airlines were worse at preventing crashes would probably have worse crash survival rates as well. Maybe planes bumping into each other at very low speeds on the ground would skew the stats in that case, but you know what I mean by "plane crash".
> In round and, of course, fluctuating figures it is estimated that of the 1500 who die each year in air transport accidents some 900 die in non-survivable accidents. The other 600 die in accidents which are technically survivable and crashworthiness, fire and evacuation issues are all important. Of these 600 perhaps 330 die as a direct result of the impact and 270 due to the effects of smoke, toxic fumes, heat and resulting evacuation problems.
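For anyone who wants the proportions rather than the raw counts, a quick sanity check on the quoted figures follows. It uses only the numbers in the quote above, which are themselves described as round estimates; the variable names are just labels chosen for this sketch.

```python
# Rough annual figures taken from the quote above.
total_deaths = 1500
non_survivable = 900
survivable = 600
impact = 330                 # died as a direct result of the impact
smoke_heat_evacuation = 270  # died from smoke, fumes, heat, evacuation problems

# The quoted breakdown is internally consistent.
assert non_survivable + survivable == total_deaths
assert impact + smoke_heat_evacuation == survivable

# Integer percentages, so there is no floating-point rounding to worry about.
pct_in_survivable = 100 * survivable // total_deaths
pct_post_impact_of_all = 100 * smoke_heat_evacuation // total_deaths
pct_post_impact_of_survivable = 100 * smoke_heat_evacuation // survivable

print(pct_in_survivable)              # 40
print(pct_post_impact_of_all)         # 18
print(pct_post_impact_of_survivable)  # 45
```

So on these estimates about 40% of deaths occur in accidents that were technically survivable, and a bit under half of those are attributed to what happens after the impact.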
> Also nitpicking about the headline: You haven't been in a plane crash because you wouldn't be reading this if you had been :)
Do you mean to imply that being in a plane crash is certain to result in death? Because that's definitely not true. Even when they crash hard enough to explode, there are sometimes still survivors.
Well-executed RCAs are just so satisfying to me. Blameless culture is absolutely critical to getting to the "right answers".
Across the teams in my company, I hit nirvana when reviewing something like a major/impactful outage and the involved members in discussion are drilling down to the essential state and sequence of events that caused said outage. Focus on actions and outcomes, including both those central/peripheral to the outage and those with zero knowledge prior to the event. It takes a high degree of trust in yourself, your peers, and your organization to get to nirvana.
Go through a few of these in a proper way, and a simple principle tends to emerge: if your mechanism is dependent on the perfection of humans, it will eventually fail. The only real discussion beyond that is basically what to do next -- do we need a mechanism that protects against human imperfections? Is the cost of implementing a solution worth the mechanism it would be designed to protect? Can we live with the fallibility of humans in this scenario?
Organizations that can achieve this level of discourse have a distinct advantage in execution.
That was a well-written article. I think my favorite thing about it is how it showcases the power of long-term thinking over short-term thinking. It also showcases an organization that was not using hope as a strategy.
I think the focus on "blameless" is incorrect. It is a culture of responsibility that results in better outcomes and it starts with engaged responsible leaders. Blame is in many ways the opposite of taking responsibility, but you can have a blameless culture without having a culture of responsibility. A culture of blame is always a culture of irresponsibility.
Blamelessness was a function of a leader actively choosing to take responsibility for the problem. They said "it is our organization that is responsible for this tragedy, not the individual." "We caused this through institutional negligence, not the ATCer."
Boeing still has problems because Boeing leadership has not taken responsibility. Boeing leaders have not said "I have created a system of incentives and punishments that have resulted in unsafe airplanes," which is why they are still having safety problems.
The paradox of leadership is that "while a leader is responsible for the actions of the organization, the actions occur from the individual decisions of those who follow."
If the air traffic controller was not consciously making an error, it is a clear problem for leadership to solve. Leadership has a responsibility to make a change. Blame would have prevented that change.
Admiral Rickover brought this culture, a culture of responsibility, to the nuclear navy, which has quite a good record of safety. This article echoes a good amount of what I have read about America's naval engineering tradition. These are quick short reads to give a taste of Admiral Rickover:
I am very confident that the author would enjoy reading about Admiral Rickover and his philosophy if they have not already.
I also think anyone who enjoyed this article would also enjoy reading Extreme Ownership, which is a much much much better book than the cover and subtitle imply and is applicable to every job in Silicon Valley.
I also just wanted to highlight two things. Not only did he create a highly safety-conscious organization but an incredibly technologically innovative one that was born despite huge resistance from the US Navy. Think about trying not only to build a small nuclear reactor, something that had never been done before, but one that could be put on a submarine. Not an easy technical goal. And on top of that, aiming to complete this task in the face of resistance from the highest levels of the organization he made his life, including efforts to get rid of him entirely. He wasn't perfect, but he achieved quite a lot. Definitely worth reading up on more, particularly if you work in the mechanical engineering world.
Second set of Rickover Rules:
Rule 1: You must have a rising standard of quality over time, and well beyond what is required by any minimum standard.
Rule 2: People running complex systems should be highly capable.
Rule 3: Supervisors have to face bad news when it comes and take problems to a level high enough to fix those problems.
Rule 4: You must have a healthy respect for the dangers and risks of your particular job.
Rule 5: Training must be constant and rigorous.
Rule 6: All the functions of repair, quality control, and technical support must fit together.
Rule 7: The organization and members thereof must have the ability and willingness to learn from mistakes of the past.
Just this month, there was a court case in Switzerland where an ATC was charged and convicted because he gave wrong instructions to a military jet, which then crashed into a rock wall. The case was quite interesting to follow, due to the various implications of a conviction or acquittal. It had quite some media coverage.
(Note that this was a military court, not a civil court. Proceedings might be different in these cases, even though both civil and military ATCs work for the same company - Skyguide.)
I had the privilege to work at a company that _actually_ had "blameless postmortems." That phrase is easy to say, but when you're on-call and get woken up at 3 in the morning to deal with someone else's mess, let's just say "blameless" isn't _my_ first reaction.
However--getting to the root of the issue, allowing everyone involved to come up with solutions, and continuously improving the runbook is clearly the best way forward. Sure, it's reactive, but proactive guidance is never going to be as imaginative in breaking things as production traffic.
In a team of individuals, the term isn't necessarily "blameless". I'd never hold a weird or accidental outage against you personally. The major question would be: Why did that happen?
However, you're forever going to be the fucking guy causing a database outage on Christmas. You might have had no choice, but fuck you.
All of us have these scars.
But then the question is: Why do we all have these scars? Maybe the system we're dealing with is not as great as it should be? Maybe the localized amount of anger is just a thing leading towards a more systematic problem. Oops, back into the ideas of blameless post-mortems.
Except my sister and her husband died in a plane crash 2 years ago. A lot of these ultra cautious safety requirements apply a lot more to bigger commercial airlines and less to small private chartered planes.
Pilot errors also have underlying causes. Was the pilot properly trained for the particular situation that arose? How is the safety culture of the airline? Did some aspect of the airplane design make that mistake more likely, or hinder the possibility of recovery? Was the pilot tired or not well and should not be flying in the first place? There’s a thousand questions like that which can and should be asked in a proper investigation. Hopefully, they were.
(I have been binge reading my way through Admiral Cloudberg’s blog on medium lately. Perhaps not to be recommended for you, as reading about a hundred fatal plane crashes may be too emotionally painful.)
The ability to drop blame from the equation is a credit to the aviation industry and would drastically improve any team, organisation, or industry to adopt it.
When the conference LeadDev happens in London, they often have Nickolas Means give a talk about an aeronautic topic. It’s usually 45 minutes of nail-biting in-depth analysis of a very complicated problem, and ends with a 5 minute rotation to how the lessons from that incident apply to engineering management in general. (You also feel like you could engineer an airplane, but no you can’t: that’s just Nickolas’ talent for explanation that is fooling you. Airplanes are very hard to make.)
This is all done in such a smart, seamless, obvious way to deliver a lesson that would make the Brothers Grimm feel cheated.
This very specifically works for Accidents - events nobody actually wanted to happen, where we can learn from experience and prevent it happening again.
It specifically won't work for things which were not accidents at all, where blame remains a useful tool.
There's some nuance, the Captain of the Titanic didn't intend to sink the ship and kill loads of people, in that sense it was an accident - but he also didn't need to head into an ice field at full speed. Various UK politicians didn't set out to drive postmasters to suicide, but they did give political cover to business people they must have suspected weren't being truthful.
Any decent accident investigation should highlight that the Captain was under strict instructions from the ship owners to break a record, wealthy influential owners that could and would destroy his career if he failed to push the ship.
Good investigations identify causes with a view to prevent repetition of circumstances.
Cases like these (Titanic, UK Post) strengthen the case for whistle blower protection.
Among other things, the investigation into the Titanic accident gave us the requirement of having enough life boats for everyone and standards in how to evacuate a vessel.
Captains being pressured by management / organization to do hazardous stuff is a thing still, in aviation as well as in seafaring.
SOLAS (Safety Of Life At Sea), an international convention (i.e. international law agreed by most countries), is indeed in big part a result of Titanic. SOLAS covers a lot of safety improvements and has continued to improve over time, especially since a later version of the treaty made updates "tacitly accepted": instead of signing treaties periodically, the members agree that they're all automatically bound by any changed rules unless enough of them object. SOLAS is handled by IMO, the UN's specialized agency for the sea, which is based in London, on the far side of the Thames not terribly far from Westminster.
Unfortunately the thing most people remember (and which you highlighted) is life boats, and perhaps those are actually a bad idea, at least for most ships.
Here's how that goes: SOLAS requires life boats, but almost always you won't use them. Titanic is a rare example of a situation where life boats are very useful: an ocean liner breaks apart in the middle of the ocean. In most cases you're not very far from land, and so almost always you just get the people onto land and maybe the ship is damaged/destroyed or maybe not, that's just stuff and it's insured. Fire? Control the fire, go to port. Hole in the ship? Pumps control sinking, go to port. Engine failure? Tow the ship to port. So in all these cases you don't use the life boats, doing so is basically a last resort.
But, even though you would very rarely need them, likely never for a vessel which operates close to shore, they must be maintained periodically because SOLAS, and maintenance of lifeboats is pretty dangerous because they're on the outside of a ship. So you may end up killing or seriously injuring more people by having lifeboats.
> But, even though you would very rarely need them, likely never for a vessel which operates close to shore, they must be maintained periodically because SOLAS, and maintenance of lifeboats is pretty dangerous because they're on the outside of a ship. So you may end up killing or seriously injuring more people by having lifeboats.
Did the designers of the Titanic write this? You're arguing having enough lifeboats for all your passengers is bad because you have to maintain the lifeboats
I'm confident that I did not design the Titanic, a ship launched before my grandmother was born IIRC, however yes, I'm saying that this trade might well not be worth it in the bigger picture, not for all the ships covered by SOLAS.
Titanic is the sweet spot for wanting more lifeboats, they had a long time, but they were in the middle of the ocean and nobody was coming to help.
If you go down very quickly lifeboats are useless. Herald of Free Enterprise could have had ten lifeboats per customer and it wouldn't have made a difference: there were 90 seconds between "nothing is wrong" and "oops, the ship is lying on its side in the water". Lots of people are going to die in that scenario.
On the other hand if the port isn't far you can make for port. Despite a ship being on fire, or badly holed it may have hours left, the Titanic had almost three hours.
If the maintenance of life boats is so dangerous and even costs more lives than it is expected to save, then in my eyes it would be the right reaction to invest into safety procedures for life boat maintenance, not getting rid of life boats.
If more people are killed by installing and maintaining lifeboats than are saved, then why wouldn't it be reasonable to disagree strongly with the status quo on life boats?
They didn't argue that the maintenance is bad because it's expensive. How could you miss that point? You literally quoted it in your comment:
Exactly. When we have an incident, I want us to be very clear about who did what, what they were thinking that led them to that course of action, and the other facts about the incident. Knowing which individual took which action and when is very, very different from holding an individual responsible (in terms of suffering repercussions) for it. If team members trust they won't suffer from it, you get much clearer data, often without even asking for it.
Related, I recommend that teams subject to SOX-404 controls consider adopting a variant of FAR 91.3*. Our SOX docs explicitly permit our incident managers (and two additional high-level tech employee roles) to authorize any action they deem justified during a production emergency.
This system does in fact work very well for aviation in the USA. But I think we can't just blindly apply it to everything everywhere, assuming it will work the same. We should be able to consider what cultural factors allow it to work and what might cause it to fail. Among those reasons:
* It expects everybody in the organization to be earnest rule-followers doing the best they can to achieve the goals of their organization. The uncaring, reckless, rebellious, etc people have been screened out of the organization long before it gets to that point.
* It expects that the whole organization has cultural unity. Nobody is trying to hoard status, power, influence, etc for one subset of the organization at the expense of another.
* It expects that the organization as a whole is not, and does not perceive itself to be, under attack from hostile outside forces. Organizations can effectively fear being blamed and destroyed just as well as individuals can.
There are and have been plenty of uncaring, reckless and rebellious people in aviation. Part of a just culture is, in fact, efficiently identifying and weeding out such elements as well as building systems that are resilient enough to contain their impact.
But of course, it's easier to wring our hands over not having the perfect culture to implement what was itself the culture-builder than to actually put in the decades of work required.
As an aside, I think a lot of readers would be better served by setting aside their existing notions of what "blameless" means and looking into what it _actually_ means for the industry in question.
Yep, all you've done is created a scapegoat - which in some cases does satiate the public's bloodthirst - but does not actually address the fundamental issues.
How does the United States "lead the world in air safety" with only one of their airlines (and not one of the major ones) even making the top 10 from a US source[1]?
Looking back over the years, the US has never hit this metric. Have I misunderstood how it's being calculated?
Over 20 years ago, at my first job, a co-worker told me that the evening before, he got a phone call from their best friend who phoned him in shock and told him they just survived a plane crash: https://en.wikipedia.org/wiki/Crossair_Flight_3597
Wow, I remember this! Glad you were a-ok. Would love to hear more about your story and your method of egress from that bird.
For those that don't know, LAS to BUR is the best way to get from LA to Vegas (save private) and always has been.
I've been on that Sunday afternoon flight from LAS->BUR many times; with the afternoon flight informally known as the Hangover Express and the morning flight known as the Stripper Express. Good times.
And FWIW, I still prefer FORTRAN over Perl any day...
Mentour Pilot on YouTube, fabulous channel BTW, talks about the Swiss Cheese Model.
I’ve started to use this model when talking about software testing.
Unit tests, integration tests, functional tests, UAT, etc are all layers that help prevent defects from reaching production.
Those testing layers are deliberately redundant.
When we do a postmortem on a defect, it often turns out there was only one layer of testing between the defect and production and the incident would have been less likely to occur with multiple layers in place.
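A rough way to put numbers on the layered-testing idea above. This is only a sketch: it assumes each layer misses defects independently (real test layers overlap, so the holes are correlated), and the miss rates are made up for illustration, not measured.

```python
from math import prod

def escape_probability(miss_rates):
    """Chance a defect slips through every layer, assuming layers miss independently."""
    return prod(miss_rates)

# Hypothetical per-layer miss rates (illustrative, not real data).
layers = {"unit": 0.5, "integration": 0.4, "functional": 0.3, "uat": 0.5}

one_layer = escape_probability([layers["unit"]])
all_layers = escape_probability(layers.values())
print(f"one layer: {one_layer:.2f}, all layers: {all_layers:.2f}")
```

Even with mediocre individual layers, the product shrinks quickly, which is the Swiss cheese point: no single slice has to be perfect.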
Over the last week or so I've become very fond of the "Pilot Debrief" channel on YouTube - basically an ex-military and current commercial pilot talking through the analysis of various incidents - I find it completely fascinating.
Something that appealed to me during my years at Bridgewater is the willingness to attribute failure to individuals without making those individuals suffer.
In "blameless" there are no personalities mentioned at all, but in reality personality and individuals matter. E.g. if we had an outage because I put code into production w/o review, yes it's a systemic failure but it's also something about me, something about my manager, etc. By talking about those things as well, you make it easier for both the system and the individuals to evolve.
It's interesting how people carry their preconceived notions into articles like this, especially when those articles are about entirely different industries and/or contexts than they're used to.
For example, reading any NTSB report would make it quite clear that the actions of specific individuals - from pilots to controllers to maintenance techs and more - are very much mentioned and dissected under the "blameless" style of investigation that international aviation adheres to. But people already have their own ideas of what "blameless" means that inevitably colours their reading - in some cases, colours it so strongly that not even the evidence of success is enough to budge the conclusion that there must be something wrong with the way "blameless" culture is implemented.
The issue with attributing failure, no matter how little suffering there is, is that it provides a huge incentive to not self report and/or downplay incidents.
Unless you are acting with malice, your actions are a product of the system and procedures in place. Those should be the focus of improvement, not individuals. E.g. preventing prod pushes that contain unreviewed code.
Of course we need to talk about individuals, but that's performance management and not incident management.
Why don't we as a society apply similar safety culture to driving cars and trucks as we do to aviation?
MANY more people are harmed or killed by automobiles than by aircraft per person-mile traveled. A small improvement in automobile safety could make a large impact on society. Many of the types of improvements implemented by aviation mentioned in this article, when applied to the automotive world, would look like improvements or changes to road design.
Obviously, if you're a pilot and flying drunk who harms others or a driver who drives drunk and harms others, YOU are clearly to blame. But for the other kinds of crashes, it feels like the automotive world learning from aviation would have a lot of value.
I can think of many reasons it's difficult, though:
* The fact that accident investigations are designed to figure out which driver was at fault, for insurance purposes
* The fact that road designers are shielded from liability (good, right?), but only if they can show they followed the standard design handbooks, which not only fail to prioritize safety over traffic flow but also don't require or even contemplate performing a thorough case-by-case investigation into how to prevent a certain type of accident from happening again. (Crash Analysis Studio tries to demonstrate how to do this.)
* The sheer scale of the changes to roads that would be needed. In the States, access to hundreds of thousands of businesses and probably millions of homes depends on roads that are supposed to get people from A to B fast but also have lots of access points for businesses on them, which is just like mixing taxiing and holding with takeoff and landing, but without the air traffic controllers. And don't get me started about unprotected lefts across multiple lanes of traffic.
* The lack of buy-in from the citizenry for enforcing professional-like standards on drivers. Despite the blameless culture that helps identify flaws in the system, air traffic controllers, pilots, maintenance crews, train engineers, etc also know that their job depends on making a sincere effort to follow the rules that are written, and in most cases their professional identity is tied up with following the rules. (I've heard that to be a pilot you really have to be comfortable with doing what you're told, all day long.) There's really no obvious way to get the masses of drivers to think and feel that way about their driving, and of course we can't just sanction our way to compliance because people need their licenses to go about their lives. And of course usually there is no reasonable alternative to driving yourself where you need to go.
> I've heard that to be a pilot you really have to be comfortable with doing what you're told, all day long.
Quite the opposite. The Pilot in Command is ultimately the person who is legally responsible for the safe operation of the aircraft. While you are following ATC instructions the majority of the time you are in contact with them, you are also culpable if you blindly allow them to put you in danger. If ATC tells you to do something unsafe, the proper response is to say "unable" and then say why. As described in the article, there are safeguards to protect pilots acting in good faith, and this is also one reason why airline pilots are unionized.
Ah, I worded that badly. Ultimately the pilot is the decider as you say. I was getting at the idea that (so I am told) the day to day experience of piloting is very much about working within a regimented system.
Strongly regulated: yes; however, the rules are written in a way that requires _a lot_ of due diligence and experience to make sound decisions. For instance, EASA rules see no problem whatsoever with dispatching an aircraft with minimum fuel (fuel planning regs) and no alternate planned (alternate planning regs) towards a destination that has thunderstorms in its weather forecast (weather regs)... it is up to the flight crew to mentally "fuse" different regulations together and make sound and safe decisions.
Well we do. That's why automobiles have seatbelts and airbags, rear-view cameras, antilock brakes, traction control, and have to pass crash safety tests.
Sure but we still have tens of thousands of traffic fatalities every year in the USA and the stats are not decreasing in any meaningful way. What we have done is not working to actually reduce automobile deaths.
When I first started to program professionally, I was fascinated with the blameless postmortem culture that my ex-employer had. I thought (and still do) it is incredibly conducive to systemic improvement. I’m happy it is legally codified in the aviation industry.
We need something like that to deal with the danger of genetically modified viruses causing pandemics. Covid may or may not have leaked from a lab in Wuhan but the investigation has been kind of shockingly bad with scientists in charge calling anyone suggesting that conspiracy theorists while privately saying a lab leak was "so friggin likely."
Just recently it was reported that Chinese scientists were experimenting with a new Covid-like virus where every rodent that was infected with the pathogen died within eight days, which the researchers described as 'surprisingly' quick. This time it's a pangolin virus rather than a bat one, though still on humanised mice. I think a rule that anyone doing such research would have to get, say, $100bn liability insurance first might make them more cautious. Covid cost $12.5 trillion+ according to the IMF, with deaths ~100x all aviation accidents in history.
Boeing management should read this so they can maybe understand the nature of their quality problems is systemic and not the result of a few little oopses.
Boeing fucking up with the 737 Max so much is merely a symptom.
The reason they so desperately needed it to remain a 737, so that it wouldn’t be a new type and therefore wouldn’t need a new type rating, was mostly that the airlines demanded a better 737.
Yes, Boeing is at fault for the MCAS disaster (no redundancies, holy fucking hell), but in reality it was the entire industry.
My point is, had nothing changed, it would have happened anyway, and perhaps with Airbus and not Boeing.
The one big oopsie is “we want a better plane and we also don’t want to retrain our pilots”, which is an oopsie that absolutely was not Boeing’s fault.
> The one big oopsie is “we want a better plane and we also don’t want to retrain our pilots”, which is an oopsie that absolutely was not Boeing’s fault.
Of course it is. Customers always ask for the impossible. Responsible firms say no to unreasonable requests.
There should definitely be some introspection around what constitutes an aircraft type and what doesn't. Perhaps it would be better if manufacturers could have more leeway in terms of redesigning newer models of aircraft without requiring retraining. Say e.g. if Boeing would make a completely different undercarriage to a new 737 so it was as tall on its legs as a 320.
That would have meant the MAX engines could have sat more like on the legacy 737. Which perhaps would have been a lesser change to the system overall than the changes made on the MAX? (larger engines, different engine placement, systems to counter the behavior coming from the new engine placement and so on and so forth). I'm not sure what the solution would be here, but it seems like any time a set of regulations is rigid-yet-full-of-holes it's almost better if it isn't so rigid.
I didn’t suggest that Airbus has similar issues. What I meant by the “entire industry” also includes the airlines.
I suggested that had nothing changed, it might have been Airbus that was low-key blackmailed into making a new plane that’s of the same type, with similar issues that would become evident only after a couple of crashes.
The very root cause of the issue was that Boeing was simply allowed to say “trust me bro, this is just a 737, it’s like any other 737, no retraining of any kind needed”, when, in fact, it wasn’t just the same old 737. Now this can’t happen anymore.
>> It’s often much more productive to ask why than to ask who. In some industries, this is called a “blameless postmortem,” and in aviation, it’s a long-standing, internationally formalized tradition.
If people die in an accident, blameless postmortem shouldn't be the answer but accountability. Otherwise events like the MCAS disaster would be just a "happy accident" with no one to blame.
No, accountability is a very insufficient answer, because it helps very little in identifying and fixing broken systems.
In a lot of cases, this one being a particularly good example, nobody was particularly incompetent, negligent, malicious, or greedy. But the system as a whole lacked safeguards so that a single person's inconspicuous mistake could cause a disaster, and overloaded that person's attention.
Sure, there are other cases where you can clearly identify negligence, malice, or greed of specific people as causes of the disaster. But there as well, holding those people accountable doesn't fix the system that incentivised the greed, doesn't prevent incompetent, malicious or negligent people from getting into positions where actions based on those traits cause disasters, and doesn't provide safeguards against such actions.
The deterrence value of holding individuals accountable for extremely rare events is approximately zero.
The thing with the two MCAS accidents is that you don't have too few people to blame, but too many:
- the airlines who wanted a more fuel-efficient plane that should still have the same type certificate as the (at that time) 40+ year old Boeing 737
- Boeing management, who wanted to keep those airlines as customers
- Boeing engineering, who did what they did at the management's request
- the FAA, who failed to spot the issue
- the mechanics of the involved airlines, who dispatched the planes without working angle-of-attack sensors
- the airlines who ordered planes without the (optional) "angle-of-attack disagree" alert
- the pilots (just for the sake of completeness), who failed to notice the issue and either refuse to fly the plane or correct it in time to avoid the crash
"Blameless", in this instance, is referring more to individual people, not absolving organizations and processes of responsibility (assuming no negligence). The ideal it's trying for is to approach an accident not as a wrong to be avenged with punishment but a technical failure to be understood, allowing changes to be made to prevent it from reoccurring.
> Otherwise events like the MCAS disaster would be just a "happy accident" with no one to blame.
...it's the same aviation industry with the same approach to investigations outlined in the article that did, in fact, expose everything that you know about the 737 MAX's MCAS issues and led to their ultimate mitigation.
I don’t think that’s what’s implied here at all - the focus on blamelessness is to get to the root cause of failure and prevent future occurrences rather than to punish existing failures without effecting any actual change.
Crashed once in Nassau when I was a kid. Brakes failed on the left side on a Beechcraft Model 18, spun the plane around on touchdown. Slid sideways down the runway, broke the landing gear then rolled breaking the left wing. Everyone splashed through fuel walking away. There was no fire.
A year later I almost crashed in St Thomas in another Model 18. That was pilot error. Carburetor heat was on and we took all the runway trying to get off the ground. We took off towards the ocean and flew a long way so close to the water that we were sucking water up and spraying the plane. I was blissfully unaware as a kid. Thought the pilot was playing around.
> The United States leads the world in airline safety. That’s because of the way we assign blame when accidents do happen.
The communists were great at assigning blame, just usually not to the right person. If I remember correctly during the VW emissions scandal [1] they tried to pin it all on one engineer. The UK Post Office similarly tried to blame Fujitsu for a systemic failure [2].
> In the aftermath of a disaster, our immediate reaction is often to search for some person to blame. Authorities frequently vow to “find those responsible” and “hold them to account,” as though disasters happen only when some grinning mischief-maker slams a big red button labeled “press for catastrophe.”
With Health and Safety (H&S) in the UK, each person is themselves responsible for raising concerns as they see them. If something were to happen in our work place for example, I think questions would rightfully be asked of all employees: 'if you saw it and agree it was dangerous, why didn't you say anything?'. To foster this all you need is a work culture that accepts criticism and is open to discussion about it.
> It’s often much more productive to ask why than to ask who. In some industries, this is called a “blameless postmortem,” and in aviation, it’s a long-standing, internationally formalized tradition.
It's all good and well, but the blame is then just assigned to a system with no real accountability - any accident can be explained away with tighter control measures. I remember a story about nuclear lab technicians taking photos of radioactive materials and almost causing a serious incident. I couldn't find an article on it, but it happens regularly enough at nuclear facilities [3].
You can't use a system to entirely automate your way out of negligence. If you hire people that have no drive to improve their work place, you will inevitably have issues.
Just a few days ago I was at a theme park type event and noticed that a structure that was fully on the ground (easily accessible) had been allowed to rust. Most paints for this use have a 10 year rating, but the employees were young and had no interest in the future of this place. An accident will at some point occur and some rules will be changed, but the company itself will continue to hire people with no incentive for the company's future.
This concept of a blameless culture reminds me of one time when I was talking to a SWE at Facebook around 2010. I don’t know whether the story is actually true or just folklore, but apparently someone brought down the whole site on accident once, and it was pretty obvious who did it.
Zuckerberg was in the office and walked up to the guy and said something along the lines of “Just so you are aware, it would probably take a lifetime or more to recoup the revenue lost during that outage. But we don’t assign blame during these sorts of events, so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again.”
Whenever this happens to someone it's always a horrible feeling where one feels very guilty and ashamed no matter what people say to you and unfortunately mistakes like these are almost the bread and butter of any extremely experienced grey beard so it's kind of normal that something like this happens to someone at some point sooner or later. The only people who never make costly mistakes are those who were never trusted with responsibility in the first place.
So having said that I would like to emphasise that the cost which often gets quoted with those mistakes is not a real cost, it's an unrealised opportunity cost and sure it hurts, but you know what, the same company culture that allows such mistakes to happen and miss out on opportunity costs is the same culture which also allows engineers to quickly roll out important features and updates and therefore create more opportunity in the first place, and much faster as a whole, so in theory the cost doesn't come without the opportunity and it all evens itself out. Don't feel too bad about it.
> the cost which often gets quoted with those mistakes is not a real cost
It is still money they would have made that now wasn't made. It is very important to explain to people how much value is lost during these events so that we also correctly value the work to prevent such events in the future.
You're comparing "reality where accident happened" to "an alternate reality where everything is exactly the same but the accident did not happen" and this is not a sensible comparison.
The reality we have produced the accident. You can't have that reality and have it not produce the accident, because it was set up to produce the accident. Proof: it produced the accident.
To avoid the accident, you need an alternative reality that is sufficiently different so as not to produce the accident, and some of those differences may well have resulted in lower profit overall.
(You may argue that you're able to set up an alternate reality that does not produce the accident and results in higher profit overall – that's a completely different argument, but it also requires you to specify some more details to make it a falsifiable hypothesis. Without those details we can not guarantee a higher profit in that alternate reality.)
And to add to that - the number is almost always wrong because people tend to just count the money hose throughput times the downtime. But many of the people who would have spent money during the downtime will do so later. I guess maybe that's not true of advertising revenue? Although I imagine advertisers tend to have some monthly spend.
Sure, the probability that things that have happened will have happened is 1.
The real test for hard determinists is being able to conclude that the probability of things that will happen is also 1. At that point there's no such thing as "falsifiable".
If your shop takes $3600 an hour in revenue, but there's a problem with the till which means that people can't pay for 10 seconds, you haven't lost $10 in revenue, you've just shifted revenue from $1/second to $0/second for 10 seconds and $2/second for the next 10 seconds.
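The till example can be worked through in a few lines. This is a sketch under the comment's own assumption that delayed customers still pay once the till is back; the function name and the `abandon_fraction` parameter are made up for illustration.

```python
def revenue_lost(rate_per_hour, outage_seconds, abandon_fraction=0.0):
    """Revenue actually lost to an outage.

    Only customers who give up (abandon_fraction) count as lost;
    everyone else is assumed to pay later, which just shifts revenue in time.
    """
    rate_per_second = rate_per_hour / 3600
    deferred = rate_per_second * outage_seconds
    return deferred * abandon_fraction

naive = 3600 / 3600 * 10  # "money hose throughput times downtime"
print(naive, revenue_lost(3600, 10))                 # nobody abandons
print(revenue_lost(3600, 10, abandon_fraction=0.1))  # 10% walk away
```

The naive figure and the real loss only coincide when every affected customer abandons, which is the disagreement in this subthread in one parameter.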
Yup, the only "real" cost there is a customer who decides not to buy after all, or buys elsewhere instead. But that's pretty unlikely, especially for short outages. And it's even less of an issue for entities with a lot of stickiness like social networks (Facebook, Twitter) or shopping websites with robust loyalty programs (Amazon Prime).
It's also hard to understand because it's largely illusory. If, say, Facebook is down and ad spending ceases for an hour, that money didn't just go up in smoke. It's still in somebody's ad budget, and there's a very good chance they're still going to spend it on ads. Thus, while there will be a temporary slow down in ad spend rate, over the course of the quarter the average may be completely unaffected due to catch up spending to use the budget.
Some usage (sales, ad views, whatever) will be delayed, some usage will be done somewhere else, some usage will be abandoned.
But costs are likely down too. If there's any paid transit, that usually drops during an incident. If you're paying for electricity usage, that drops too.
And significant outages can generate news stories which can lead to more usage later. Of course, it can also contribute to a brand of unreliability that may deter some use.
> the cost which often gets quoted with those mistakes is not a real cost
Oh, that depends entirely on the industry. In social media maybe not; in banking and fintech those can most certainly be real costs. And I can tell you, that feels even worse.
But that isn't quite what you want in a blameless culture. The right response looks something like ignoring the engineer, gathering the tech leads and having an extremely detailed walkthrough of exactly what went wrong, how they managed to put an engineer in a position where an expensive outage happened and then they explain why it is never going to happen again. And if anyone talks about disciplining the responsible engineer, shout at them.
Also maybe check a month later and if anything bad happened to the engineer responsible as a result of the outage, probably sack their manager. That manager is a threat to the corporate culture.
Maybe Zuck did all that too of course. What do I know. But the story emphasises inaction and inaction after a crisis is bad.
They'll also be the person most able to identify what went wrong with your processes to allow the failure to occur and think through a mechanism to systematically avoid it happening again.
Also, they're probably the person least likely to make that class of mistake again. If you can keep them, you've added a lot of experiential value to your team.
Perhaps one slight amendment - maybe don't ignore the engineer, but ask them (in a separate, private meeting) if they have any thoughts on the factors that led to it, and any ideas they have on how it could be avoided in future. Could be useful when sanity-checking the tech leads' ideas.
Describing my last company’s incident process exactly.
We’d have like 3 levels of peer review on the breakdown too.
Once there was an incorrect environment variable configured for a big client’s instance which caused 2 hours of downtime (as we figured out what was wrong) and I had to write a 2 page report on why it happened.
That whole thing got tossed into our incident report black hole.
Personally I feel like the right thing to do is let the engineer closest to the incident lead the response and subsequent action items. If they do well commend them, if they don't take it seriously then it may be time to look for a new job.
I don’t think “blameless” and “shared responsibility” are mutually exclusive, in fact, they are two halves to this same coin. The dictionary definition of “blameless” does not encompass the practical application of a “blameless” culture, which can be confusing.
The “blameless” part here means the individual who directly triggered the event is not culpable as long as they acted reasonably and per procedure. The “shared responsibility” part is how the organization views the problem and thus how they approach mitigating for the future.
But when I think of “shared responsibility”, I think of everyone as sharing fault.
When something goes wrong, I think someone, somewhere likely could have mitigated it to some degree. Even if you’re following procedures, you could question the procedure if you don’t fully understand the implications. Sure, that’s a high bar, but I think it’s preferable to pointing the finger at the people who wrote the procedures.
On that note, someone or some group being at fault doesn’t necessitate punitive action.
> ... but I think it’s preferable to pointing the finger at the people who wrote the procedures ...
It is better to point the finger at the people who wrote the procedures. Their work resulted in a system failure.
If the person doing the work is expected to second guess the procedures, then there was little point having procedures in the first place, and management loses all control of the situation because they can't expect people to follow procedures any more.
Sure the person involved can literally ask questions, but after they ask questions the only option they have is to follow the procedure, so there isn't much they can do to avert problems.
When I was only a few years into my career I accidentally deleted all the Cisco phones in the municipality where I was a software developer. I did it following the instructions of the IT operations guy in charge of them, but it was still my fault. My reaction was to go directly to the IT boss (who wasn’t my boss) and tell him about it.
He told me he wasn’t happy about the clean up they now needed to do, but that he was very happy about my way of handling the situation. He told me that everyone makes mistakes, but as long as you’re capable of owning them as quickly as possible, then you’re the best type of employee because then we can get to fix what is wrong fast, and nobody has to investigate. He also told me that he expected me to learn from it. Then he sent me on my way. A few hours later they had restored the most vital phone lines, but it took a week to get it all back up.
It was a good response, and it’s stuck with me since. It was also something I made sure to bring into my own management style for the period I was into that.
So I think it’s perfectly natural to react this way. It’s also why CEOs who fuck up have an easy time finding new jobs, despite a lot of people wondering why that is. It’s because mistakes are learning experiences.
I'd much rather hear about a problem from a team member than hear about it from the alert system, or an angry customer.
Plus, when the big fuckup happens and the person causing it is there, there is an immediate root cause, and I can save cycles on diagnosis and go straight into troubleshooting and remedy.
I don’t know when this was turned into a Facebook trope, but I’ve heard it before as an engineer asking “Am I being fired?”, to which the director responds “We just invested four million dollars in your education. You are now one of our most valuable employees!”
Four million is definitely in the range of an outage at peak, and that's not counting reallocated engineering resources to root cause and fix the problem, the opportunity cost of that fix in lost features, extra work by PR, potential contractual obligations for uptime, outage aftershocks, recruiting implications, customer support implications, etc.
If you have a once a year outage, how many employee-hours do you think you are going to lose to socially talking about it and not getting work done that day?
$116.6 Billion in revenue is ~13 million an hour. Outages usually happen under greater load, so very likely closer to ~25 mil an hour in practice.
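For anyone who wants to check that back-of-the-envelope figure, here is a minimal sketch of the arithmetic. The $116.6 billion figure comes from the comment above; the 2x peak multiplier is purely an assumption chosen to land near the "~25 mil" estimate, not a measured number.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def revenue_per_hour(annual_revenue: float) -> float:
    """Average revenue per hour, assuming revenue is spread evenly over the year."""
    return annual_revenue / HOURS_PER_YEAR


annual = 116.6e9          # annual revenue quoted in the comment above
average = revenue_per_hour(annual)

peak_multiplier = 2.0     # assumption: outages tend to hit at roughly 2x average load
peak = average * peak_multiplier

print(f"average: ${average / 1e6:.1f}M per hour")
print(f"at peak: ${peak / 1e6:.1f}M per hour")
```

The even-spread assumption is the weak point: real revenue is lumpy across the day and year, so this is only good for an order-of-magnitude estimate.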
> revenue wouldn't be lost if you had a 100ms outage
If that little blip cascades briefly and causes 1000 users to see an error page, and a mere five of them (0.5%) to give up on purchasing, boom you just lost those $700 (at least in the travel industry where ticket size is very high). Probably much more.
An error page can be enough for a handful of customers to decide to “come back later” or go with a competitor website.
If you think about experiments with button colors and other nearly imperceptible adjustments, that we know can affect conversion rates, an error page is orders of magnitude more impactful.
Probably, though when your business is making billions this is still just a few hours outage, or one long-running experiment dragging your conversion down by a few percentage points.
> Just so you are aware, it would probably take a lifetime or more to recoup the revenue lost during that outage. But we don’t assign blame
Assuming that’s accurate, it’s a pretty shitty way to put it. “Hey man, just so you know you should owe me for life (and I pay your salary so I decide that), but instead of demanding your unending servitude, I’m going to be a chill dude and let it slide. I’m still going to point it out so you feel worse than you already do and think about it every time you see me or make even the smallest mistake, though. Take care, see you around”.
It’s the kind of response someone would give after reading the Bob Hoover fuelling incident¹ or the similar Thomas Watson quote² and trying to be as magnanimous in their forgiveness while making it a priority that everyone knows how nice they were (thus completely undermining the gesture).
But it’s just as likely (if not more so) the Zuckerberg event never happened and it’s just someone bungling the Thomas Watson story.
I was an FB infra engineer in 2010. It's not accurate, there was already a "retro" SEV analysis process with a formal meeting run by Mike Schroepfer, who was then Director of Engineering. I attended many of them. He is a genuinely kind person who wouldn't have said anything so passive-aggressive. Also, many engineers broke the site at one time or another. I agree this is just a mutation of the Watson quote.
The only time I ever saw an engineer get roasted in the meeting was when they broke the site via some poor engineering (it happens), acknowledged the problem (great), promised to fix it, then the site went down two weeks later for the same reason (not great but it happens) and they tried to weasel out of responsibility by lying. Unfortunately for them there were a bunch of smart people in the room who saw right through it.
Look to your left, look to your right, count the heads. Now divide the money that was lost through the number of heads. This is the theoretical ceiling: how much you could make if there were no shareholders and you had your own company, or had a union.
> Now divide the money that was lost through the number of heads. This is the theoretical ceiling
So, if we assume a $10 million loss divided by 100 heads, that means your ceiling is -$100,000 if you were to organize yourself.
Let's see: Six months to build a Facebook clone on an average developer salary plus some other business costs will put you in the red by approximately $100k, and then you'll give up when you realize that the world doesn't need another Facebook clone. So, yeah, a -$100,000 ceiling sounds just about right.
Eh, that’s a really strange way to phrase it. Singling out the engineer isn’t blameless. Sure it’s a learning opportunity but it’s a learning opportunity for everyone involved. One person shouldn’t be able to take the site down. I have always thought of those situations as “failing together.”
Considering that everyone already knew who was responsible, I think saying "you won't be held accountable for this mistake" is the most blameless thing you can do.
> Sure it's a learning opportunity but it's a learning opportunity for everyone involved. One person shouldn't be able to take the site down.
The way I read the comment, it sounds to me exactly like what Zuckerberg said.
> Considering that everyone already knew who was responsible, I think saying "you won't be held accountable for this mistake" is the most blameless thing you can do.
What you’re describing isn’t blamelessness, it’s forgiveness. It’s still putting the blame on someone but not punishing them for it (except making them feel worse by pointing it out). Blamelessness would be not singling them out in any way, treating the event as if no one person had caused it.
> The way I read the comment, it sounds to me exactly like what Zuckerberg said.
Allegedly. Let’s also keep in mind we only have a rumour as the source of this story. It’s more likely that it never happened and this is a recounting of the Thomas Watson quote in other comments.
> But we don’t assign blame during these sorts of events, so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again.
It's the latter half of the sentence that makes it blameless. Zuckerberg is very clearly saying the problem is that it was allowed at all.
sometimes the root cause is someone fucking up, if you're not willing to attribute the root cause to someone making a mistake then being blameless is far less useful.
What part of "so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again" doesn't mean that it's happened, but let's see how we got there?
"It should not have been possible for one person to take the site down" - yes, and that's exactly what Zuck is addressing here? Maybe such controls are there across the development teams and some SRE did something to bring it down, and now there need to be even better controls in that department as well?
As told this is clearly not a Zuck quote because it’s shitty leadership. There’s no way Facebook got where it is with such incompetence. This is clearly a mistelling of older more coherent anecdotes.
Not really. If you single someone out as CEO that's a punishment. Even if your words are superficially nice what he really did was blame the engineer and told him not to do it again. He should have left it with the engineer's line manager to make that comment, if at all because essentially he's telling the employee nothing that he didn't know already.
> because essentially he's telling the employee nothing that he didn't know already.
The employee did not know that the CEO would be so forgiving. And it helps set that culture as others hear about the incident and the response.
Also, why is this so important? If your punishment for bringing down Facebook is your boss' boss telling you "Hey even if this is a serious mistake, I don't want you to worry that you're going to be out of a job. Consider this a learning opportunity," then that seems more than fair to me.
> Even if your words are superficially nice what he really did was blame the engineer and told him not to do it again.
The person being told that may feel that way, but IMO nothing from his phrasing implies that:
"let's just consider it an expensive learning opportunity to redesign the system so it can't happen again"
Note the "can't" in the "can't happen again" - he isn't telling the employee "don't you dare do that again!" as you seem to be saying, he's saying "let's all figure out how to protect our systems from such mistakes".
Strange way to describe the same situation and Zuck's thrust there in different words. Zuck is literally saying "failing together" and "learning together".
The other story you are referencing is "this was an expensive education in which you have learnt to not do stupid stuff".
This is framed as "this was an expensive learning opportunity for us to learn that we have a gap in our systems that allowed this downtime to happen".
These are different sentiments! To me the above quote is very explicitly the latter and directly refutes the notion of "this was expensive training for you" by stating that it's impossible for an individual to apply that learning in a way that would recoup the loss.
We don't count those that are eaten after the crash as they survived the impact and immediate aftermath.
More seriously:
But for those unlucky enough to be involved in the small percent of fatal air accidents, what are the odds of survival if your plane does crash?
The NTSB says that despite more people flying than ever, the accident rate for commercial flights has remained the same for the last two decades, and the survivability rate is a high 95.7 percent.
--
The European Transport Safety Council (ETSC) has also examined the survivability of aircraft accidents worldwide, estimating that 90 percent are survivable (no passengers died) or “technically survivable," where at least one occupant survives.
Most of those fatalities were a result of impact and fire-related factors including smoke inhalation after impact.
I guess it counts minor accidents where the plane managed to land successfully, and more serious ones where it crash-landed with few or no casualties? As far as I know, there were no survivors in the recent Boeing crashes. Correct me if I'm wrong.
> You, reader, have never been in a plane crash, because if you’d have been in one, you’d be dead.
You’d be surprised:
> According to a study by the European Transport Safety Council, plane crashes technically have a 90% survivability rate, and this figure is increasing, largely thanks to modern aircraft design, which features enough exits to allow for a full passenger evacuation in around 90 seconds.
> Recent proof of plane crash survival came in October last year when a Boston-bound private plane taking off at Houston Executive Airport struck a fence and burst in flames. All 21 people onboard survived. In 2018, all 103 passengers survived a flaming plane crash in Mexico when strong winds brought down Aeromexico flight 2431.
Total loss of control basically never happens. I’ve been reading and watching a lot of videos on aircraft accidents and near misses, and I haven’t seen a single one where there truly was a total loss of all controls over the aircraft.
Even in the case of Japan Airlines Flight 123 (https://en.m.wikipedia.org/wiki/Japan_Air_Lines_Flight_123) the pilots still had some degree of control over the plane after its tail was ripped off due to a structural failure leading to explosive decompression at cruising altitude which also caused a total loss of all hydraulic systems.
Unfortunately what they had left (control over the engine thrust) was not enough, but they sure did their best. Apparently you can still control the plane (to some extent) using just the engines, with the vertical stabilizer completely gone and no hydraulics. And this was in 1985.
I’m not saying that JAL 123 was in any way survivable (although some miraculously did survive), just pointing to the fact that even in the worst possible scenario there’s still probably a way to somehow control the plane.
Also see United Airlines Flight 232 https://en.wikipedia.org/wiki/United_Airlines_Flight_232 which suffered a loss of all hydraulics and thus control surfaces. The crew and a check airman who was aboard as a passenger managed to somewhat control the plane using engine thrust and steered it to an airport before making a crash landing. They saved more than half the people on the plane.
One of the harrowing details is this quote: "The crew contacted United Airlines maintenance personnel via radio, but were told that the possibility of a total loss of hydraulics in a DC-10 was considered so remote that no procedure was established for such an event."
Amazing, I didn't know of this incident. Awesome airmanship.
> One of the harrowing details is this quote: "The crew contacted United Airlines maintenance personnel via radio, but were told that the possibility of a total loss of hydraulics in a DC-10 was considered so remote that no procedure was established for such an event."
Has this even changed? What can you even do in the extremely unlikely scenario of loss of all hydraulic systems? I don't think it's even possible to train for such a scenario. You are truly on your own. What would the checklist even look like? "try doing the best you can with engine thrust alone, and may god help your soul"?
As far as I'm aware, the engineering has gotten better (introduction of hydraulic fuses that keep a closed loop in case there's indications of leakage), but truly, what can you even be expected to do without flight controls.
I think it's a fairly safe assumption that if you are alive to read the article, you are unlikely to have been in a plane crash (yes, I agree some plane crashes are survivable).
Next article: Why you've never been struck by lightning.