Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?
$600,000, pfft, try £350 million, Niall Fitzgerald at Unilever.
We launched a product with a wonderful new element in it, which had the capacity to clean clothes better than anything that had ever been seen. Unfortunately, if it wasn’t used in exactly the right spec, it cleaned so well that it cleaned away the clothes as well...I was personally responsible for Unilever’s largest marketing disaster.
He quoted his Chairman’s response to this at the time:
We’ve just invested £350m in your education. If you think you’re going to take that somewhere else, forget it.
An old coworker at a clearing firm once accidentally rebooted two very critical servers during a very critical time. From what I'm told, it ended up costing the company about a billion dollars in fees and fines.
He managed to keep his job.. just with no more prod access. :)
Sounds like someone else’s fault, unless he owned and designed the system. No single engineer should be able to cause that kind of damage. Quorum rules, etc.
It was a team of five senior level linux sysadmins who oversaw about 2k servers. Lots of arguments for and against pushing blame here or there, but at the end of the day - he fucked up big time and shoulda known better.
I’m sure that’s true, but my point is that if you have that much money riding on a system you should have to figuratively (if not literally!) put two keys in and turn the lock at the same time to break shit. There should be systematically enforced mandatory reviews, a two-plus-person policy for issuing commands, etc.
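Roughly what a "two keys" control could look like in practice; this is a minimal sketch with made-up names (the command list, the approval-token directory, and the file format are all illustrative, not anyone's real tooling):

```python
# Hypothetical sketch of a "two keys" / two-person rule for destructive
# commands. The command list, the approval-token directory, and the token
# format are all made up for illustration.
import os
import subprocess
import sys

DESTRUCTIVE = {"reboot", "shutdown", "mkfs.ext4"}   # commands that need a second person
APPROVAL_DIR = "/var/run/approvals"                 # a second operator drops a token here


def second_operator_approved(command, requester):
    """True if someone other than the requester left an approval token for `command`."""
    token = os.path.join(APPROVAL_DIR, command + ".approved")
    if not os.path.exists(token):
        return False
    with open(token) as f:
        approver = f.read().strip()
    return bool(approver) and approver != requester


def run_guarded(argv):
    requester = os.environ.get("USER", "unknown")
    if argv and argv[0] in DESTRUCTIVE and not second_operator_approved(argv[0], requester):
        print("refusing to run %r: no second-operator approval on file" % argv[0],
              file=sys.stderr)
        return 1
    return subprocess.call(argv)


if __name__ == "__main__":
    sys.exit(run_guarded(sys.argv[1:]))
```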
You have to expect people to make mistakes. I’m not saying he didn’t fuck up, but if a company is down a billion dollars the story should be of multiple people making multiple mistakes.
Hm. I mean yes, but still - the guy made a legit rookie mistake by not checking the hostname of the host he rebooted before typing "reboot". Kinda 101 stuff there. :/
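For what it's worth, even that 101-level check can be automated. A minimal sketch (the double confirmation of expected host plus retyped hostname is illustrative, not a real tool):

```python
# Minimal sketch of the "check the hostname before typing reboot" habit, automated.
import socket
import subprocess
import sys


def guarded_reboot(expected_host):
    actual = socket.gethostname()
    if actual != expected_host:
        sys.exit("aborting: you are on %r, not %r" % (actual, expected_host))
    answer = input("Reboot %s? Type the hostname again to confirm: " % actual)
    if answer.strip() != actual:
        sys.exit("aborting: confirmation did not match")
    subprocess.run(["sudo", "reboot"], check=True)


if __name__ == "__main__":
    guarded_reboot(sys.argv[1])   # e.g. python guarded_reboot.py db-staging-03
```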
That's all I know, really. It was one of the strangest companies I've ever worked at.
Another story from the same place (won't confirm nor deny if it was the same sysadmin) - someone accidentally pushed a puppet config out that changed all hosts' timezones to CDT. A bit later, I received a random and kinda joking text from a buddy from a past job - who couldn't get an answer from my company's tech support - asking why all of their clearing reports had the wrong times.
He was the only person outside of three other sysadmins who ever mentioned anything was wrong. Not even my boss brought it up! It was odd.
Why would it be 1940 at the very latest? He ran IBM until well into the 50s. The story has multiple variants, the longer ones involving a sale and a million dollars.
It's about intent. Was the guy's intent to bring the site down? Does he have a track record of making such mistakes? Do they demonstrate a lack of concern or remorse? No?
Then blame is irrelevant. It happened, they learned from it, they now have that experience under their belt.
Disappointed that one of the lessons was not “don’t deploy on Friday and then immediately run out the door.” I know most of you will say that that shouldn’t be an issue if you have proper tests, devops, etc., but this type of thing is the reason that Ops usually controls the releases. Yes, I know, Ops is obsolete and can go suck a lemon, but it’s stuff like this that shows the wisdom of the older ways.
So yeah, be nice to Ops too, because they actually have experience in stuff like this and one weekend of downtime is not an appropriate price to pay for every developer to learn a lesson.
Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.
Also, if something goes down in a way that requires a human to work on the weekend, it should result in a postmortem, and all of the components in the deployment chain related to the failure should be evaluated, with new tasks to fix their causes. If it happens multiple times, all project work should stop until it's fixed.
This of course is balanced against how much failure your business can tolerate. If the service goes down and nobody loses money, do you really need your engineers working overtime to fix it?
> Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.
Or being afraid to deploy last thing on Fridays is an admission that maybe... just maybe... you're not infallible
My team deploys probably 5 times in a given day, including Friday. They are all small deploys, and none can happen at the end of the day. If shit hits the fan, we rollback and maybe people figure out the root cause over the weekend but they aren't sweating bullets.
I guess all the people commenting are much more competent than the people I've worked with.
Even with tests, I've seen startups deploy double charges on accounts and whatnot that didn't show up until the next day. I've also seen ops people updating OES where the storage service would segfault a day later. How do DevOps and OES go together in one sentence, you ask? They don't, but it just means not all ops people are pure wisdom either. That guy caused others to waste 72 hours of compute resources because of it. So it's not limited to Dev. And yes, the first company did learn from the double-charging bug, but why learn on a Saturday?
So even if your DevOps practices are amazing and you have 70% test coverage on all your components, that doesn't mean you can't deploy faulty components where the deployment itself appears successful. Now what? Things aren't failing, they appear just fine. Someone has to go in and debug the problem, it may affect multiple components, it may be critical, and a simple rollback may not cut it.
Friday deployments are fine for certain components, but surely not as a general rule for everything? Friday deployments are like Monday morning or Friday meetings. You can do them, and most of the time they'll be fine, but maybe out of respect for your colleagues you shouldn't anyway.
I've worked in places that required a checklist worthy of NASA to deploy software, and also had blackout periods where no changes could be made to the system without an executive team sign-off[1]. The thought of any of them deploying 5 times a day is just beyond anything they would be allowed to do. I would expect most enterprises are closer to event-style deploys than rapid, multiple deployments.
With that type of "event"-style deployment, weeknight deploys probably are safe, but anything scheduled for Friday is trouble.
> Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.
Our philosophy is that if nothing ever breaks in production, you are being too conservative with your controls and development. Or, if you look at it another way: you can allocate resources towards stability or towards new features, and (near) 100% testing/verification/auto-healing/rollback coverage means that too much of your resources are allocated to stability and not enough towards new features. Running a service with uptime too close to 100% also causes pathologies in downstream services, and if you never have to fix anything manually, the skills you need to fix things manually will atrophy.
Or, for our service,
- There should be a pager with 24 hour coverage, because our service is critical,
- That pager should receive some pages but not too many, so operations stays sharp but not burdened,
- Automation and service improvements should eliminate the sources of most pages, and new development should create entirely new problems to solve,
- If the service uptime is too high, it should be periodically taken down manually to simulate production failures, and development controls should be reevaluated to see if they are too restrictive.
Eliminating all the production errors takes a long time and a lot of effort. Yes, we are spending that effort, but the only way this process will actually “finish” is if the product is dead and no more development is being done. The operations and development teams can then be disbanded and reallocated to more profitable work. A healthy product lifecycle, in general (and not in every case), should see production errors until around the time the team is downsized to just a couple of engineers doing maintenance.
You can phrase it as “afraid to deploy on Friday”, but I think “afraid to cause outages in production” indicates that the blast radius of your errors is too large or that you’re being too conservative.
The product has 24/7 pager coverage, but that does not mean that one person has the pager the whole time! At any given time the pager is covered by two or three people in different time zones. The way my team is structured, I will only get paged after midnight if someone else drops the page. And I only have a rotation for one week every couple months or so.
There are definitely employees who don’t enjoy having the pager, but we get compensated for holding the pager with comp time or cash (our choice). The comp time adds up to something like 3 weeks per year, and yes, there are people who take it all as vacation. No, these people are not passed over for promotions. No, this is not Europe.
So the trade off is that seven weeks a year you carry your laptop with you everywhere you go, maybe do one or two extra hours of work those weeks, and don’t go to movies or plays, and then you get three extra weeks off. Yes, it's popular. People like pager duty because they get to spend extra time with their families, because they like to go camping, or because they want the extra cash.
> People like pager duty because they get to spend extra time with their families, because they like to go camping, or because they want the extra cash.
Adequately compensating on-call is, of course, the right way to do it. All sorts of considerations that were, otherwise, problems, such as how to ensure a "fair" rotation, magically go away [1].
Unfortunately, it's vanishingly rare, at least among "silicon valley" startups (and maybe all tech companies). I suspect it's one of those pieces of Ops wisdom that's vanished from the startup ecosystem because Ops, in general, is viewed as obsolete, especially by CTOs who are really Chief Software Development Officers.
Insofar as it's a prerequisite to all your other suggestions, it makes them non-starters in such companies.
[1] Although I suppose if the compensation is too generous, there may still end up being complaining about unfairness in allocation
> Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.
I worked in a shop like that -- they had such great testing policies that they did continuous deployment, code went from commit to production as soon as the tests passed.
Until the holiday weekend when two code changes had an unexpected interaction and ended up black-holing all new customer activity that weekend. (existing customers were fine, they only lost data for new customers).
They could have recovered the data from a log on the front-end servers, but one of the admins noticed an unusual amount of disk space used on the front ends monday morning and just replaced them all (since their auto-healing allowed this without any interruption of service)... and since those logs were only used for debugging problems, they weren't persisted anywhere.
It turns out that tests aren't perfect - they only test what you think you need to test.
> If the service goes down and nobody loses money, do you really need your engineers working overtime to fix it?
Money is not the only way to value a service.... but if the service goes down and no one cares, why run the service at all?
Well ... there are different kinds of deploys with different kinds of risks.
I usually don’t like a blanket don’t deploy on Friday rule.
We can usually rollback with one command easily, have good monitoring and health checks so even though something makes it into prod, it’s super easy to go back.
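One common way the "rollback with one command" part is implemented is the releases-directory-plus-symlink pattern. A generic sketch under that assumption (the paths and naming scheme are made up, and this is not necessarily how the commenter's setup works):

```python
# Generic sketch of a one-command rollback using a releases directory and a
# "current" symlink. Release directories are assumed to sort chronologically
# by name; all paths here are illustrative.
import os
import sys

RELEASES_DIR = "/srv/app/releases"   # e.g. /srv/app/releases/20180824-1915
CURRENT_LINK = "/srv/app/current"    # the web server serves whatever this points at


def rollback():
    releases = sorted(os.listdir(RELEASES_DIR))
    live = os.path.basename(os.path.realpath(CURRENT_LINK))
    if live not in releases:
        sys.exit("current release %r not found in %s" % (live, RELEASES_DIR))
    idx = releases.index(live)
    if idx == 0:
        sys.exit("already at the oldest release; nothing to roll back to")
    previous = releases[idx - 1]
    tmp = CURRENT_LINK + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(os.path.join(RELEASES_DIR, previous), tmp)
    os.replace(tmp, CURRENT_LINK)    # atomic swap: readers see old or new, never neither
    print("rolled back: %s -> %s" % (live, previous))


if __name__ == "__main__":
    rollback()
```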
Unless you have changes like you mentioned: weird side effects, db schema changes, config changes that affect machine configuration. Those are the unknown-unknown changes. Good practice to hold them back.
As for web asset changes, like a CSS file or a self-contained JS change, those should only re-deploy the files that changed and are generally low risk.
Oh no no, bad code would be deployed all the time =) We just didn't notice it, because auto healing brought back the working site so quickly. Occasionally something would break design parameters and that would require a manual fix, but then that manual fix was added into auto healing...
It takes a good deal of design work to get a high level of resiliency, but it's completely within the realm of possibility. Most shops just don't dedicate the effort to it, because they're more worried about shipping new features, and this is understandable. Just different priorities.
+1. Blocking deploys on a Friday is a symptom of a lot of room to grow your operational maturity: the on-call person should be on call and able to handle most issues (in this case, probably a rollback), and that's only if your tests didn't block the deploy in the first place.
Code freezes (and that's what blocking deploys are) are a great tool, but primarily for managing your on-call more effectively.
This is what I was thinking. If one person is able to break a system with a small mistake, then processes are at least as much to blame as the person. Mistakes will always happen because programmers are fallible, so a deployment process designed around infallibility is destined to fail.
We have a company policy to never deploy on Friday, and only under ideal circumstances on Thursday. This protects everyone from unneeded overtime and panicking clients during off hours.
It's one of the central policies to ensure happy clients and smooth running operations that gets regular review and questions from clients when they are in a hurry. But when it was implemented, stress levels across the board plummeted. And only a small amount of client education was needed before they agreed it was a good policy.
There are emergency circumstances that override this rule of course.
Even if your automation is great, there’s often a series of events that goes like this:
Error discovered -> Call person responsible -> Roll back
It’s not just about whether you have the ability to fix the error, it’s about whether deploying on Friday is likely to disrupt people’s personal lives. Or, put another way, it’s not kind to deploy on Friday, it’s selfish. It looks good if you push out features quickly, and if it blows up someone else has to take time out of their weekend. On my team ops controls releases and if you miss the Thursday build you’re not getting anything in production until Monday.
If you deploy on Friday, run out the door, and soon find out that your contribution to a deployment caused an outage, wouldn't you immediately return to work to at least give the appearance of personal responsibility? (on any day of the week, even)
Also, wouldn't you just do a rollback to the last viable build?
I wiped out some pretty important databases (i.e. our ERP) at my first job. While it was extremely stressful, I don't remember being scared of being fired (who else was going to fix it?). The one thing I learned, beyond not doing that again, was that being confident makes everyone else not worried, and that gives you a good bit of grace with leadership teams.
I still laugh thinking about how the president, who had never showed any emotion before and was as serious as they come, brought in a Burger King meal for me while I was up late working on the restore.
List of things I've done in my first ~9 months at my company:
- Messed up the Software Engineering dept. Jira when I tried to customize my dept. Jira too much. Took hours to fix, during which time Software Engineering dept. couldn't do any ticketing actions.
- `sudo shutdown -h now` on a remote battery control PC because my terminal was still logged into it via SSH from 5 hours previous. I was trying to shutdown my laptop at the end of the day, and don't like the buttons when I have Tilda hotkeyed to 'F1'. Had to send a technician to the site 2 days later at a cost of several hundred bucks, plus the battery was not operational for 2 days during the most operational time we had in recent months, so we lost money there. I've done several more of this sort of thing that has required a tech, too.
- Forecast that a certain day would be the best day to do a specific operational test, but then fucked up inputting the ISO-format date/time string so it started at 7PM on a Saturday rather than 7AM (I know the format pretty well now: `2018-08-24T23:15:00-07:00`; a sanity-check sketch follows this list).
- Forecast that a certain day would be the best day to do a specific operational test, but fucked up the script, so it mis-calculated everything and the forecast turned out to be worse than useless (lost the company $30k over 4 hours).
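The 7PM/7AM bullet above is the kind of thing a small sanity check can catch. A sketch of one, assuming Python and a made-up "daytime window" policy (the window bounds and weekend check are illustrative, not anyone's real rule):

```python
# Sketch of a sanity check that would have flagged the 7PM-instead-of-7AM typo:
# parse the ISO-8601 string and refuse start times outside an assumed daytime
# window, or on weekends. The bounds below are made-up policy.
from datetime import datetime


def parse_test_start(iso_string, earliest_hour=6, latest_hour=18):
    start = datetime.fromisoformat(iso_string)   # handles "2018-08-24T07:15:00-07:00"
    if not (earliest_hour <= start.hour < latest_hour):
        raise ValueError("start time %s is outside %02d:00-%02d:00 local; AM/PM typo?"
                         % (start.isoformat(), earliest_hour, latest_hour))
    if start.weekday() >= 5:                      # 5 = Saturday, 6 = Sunday
        raise ValueError("%s falls on a weekend; is that intentional?" % start.date())
    return start


# parse_test_start("2018-08-24T19:15:00-07:00") raises, catching the PM mix-up;
# parse_test_start("2018-08-24T07:15:00-07:00") returns a timezone-aware datetime.
```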
Luckily, my company was fine with all this and we learned a ton from it (other people made similar mistakes, too), so it was useful in some way. I am also way more careful and deliberate about anything now, no more "iterative-keyboard-banging" (my original programming style) in Python while connecting directly to the production database!
In my experience workplaces that actually value mistakes are rare. That is, not only tolerate them, but value them as learning opportunities, as I believe they should be, and as the linked article proposes. Such workplaces do exist (I've heard), but not in my own anecdotal experience. Mistakes have always tended to be treated as something shameful, something to be hidden unless disclosure is absolutely unavoidable, and something to accuse people of, unless the offender has been a management favorite or something similar.
It's come to the point where I've acquired a nagging suspicion that this is how it needs to be. That 'to be kind' will always be icing on the cake so to speak, no more. Maybe I've grown too cynical.
In my experience, believing in people is far more likely to get them to "live up" to your belief than is the alternative. This is a great story about using a mistake as a lesson, but also about building a strong and cohesive work culture.
At the end of the day, knowing that your coworkers don't have bad intentions really helps.
I have a great manager who gives me room to learn and push myself to work harder.
My work involves doing a lot of work for other people's projects. The nature of the work often means a simple mistake will result in days or weeks of work being lost.
This article reflects the difference in how people have treated me when I have told them I made a mistake. Those who treated me nicely I will now tell without hesitation. But when people have been less than nice about past failures, I will consider not telling them I made a mistake.
Good story. I also screwed up something at work in my early 20s. I expected to get fired, but unexpectedly got a reaction similar to the one in this story.
It was a wake-up call. It taught me to be far more careful when deploying and testing software.
Thought I read this before and sure enough here's the previous discussion about this post [1]. The way I look at it is if you're not nice to someone who screwed up it's probably because you've never royally screwed anything up yourself, but I'm sure your time is coming. We are all humbled at some point during our careers. Mine was 10 years into it and I'll never forget it.
There are two kinds of lessons you can learn when you mess up:
1. Be more careful next time so that you're less likely to make the same mistake again.
2. Fix the system so that nobody can make that kind of mistake again.
Learning lesson #1 means that you're less likely to make the same mistake again, but learning lesson #2 will prevent you and your whole team from making it.
For something that's as easy to test as "is the site working", the real lesson there is that you set up your deployment system so that the website needs to respond to a health check before the deploy finishes.
(I realize this is nitpicking and isn't the point of the article, just thought I'd mention it :P)
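Concretely, a rough sketch of that health-check gate (the URL, retry budget, and rollback script are placeholders, not the article's actual pipeline):

```python
# Rough sketch of "the deploy doesn't finish until the site answers a health
# check", with an automatic rollback otherwise. Everything named here is a
# placeholder for illustration.
import subprocess
import sys
import time
import urllib.request

HEALTH_URL = "https://example.com/healthz"   # assumed health endpoint


def site_is_healthy(retries=10, delay_s=3.0):
    for _ in range(retries):
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass                              # not up (or not healthy) yet; retry
        time.sleep(delay_s)
    return False


def finish_deploy():
    if site_is_healthy():
        print("health check passed; marking deploy successful")
        return 0
    print("health check failed; rolling back", file=sys.stderr)
    subprocess.run(["./rollback.sh"], check=True)   # placeholder rollback step
    return 1


if __name__ == "__main__":
    sys.exit(finish_deploy())
```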
> I talked about the need for proper QA. About thoroughly testing my changes.
Though somewhat beside the point, if you have a dedicated test team, know that I don't trust my test infrastructure to catch all your screw-ups any more than you have confidence that there won't be any. If it passes the tests, I'm pretty confident we didn't break anything, but it can wait until Monday, right?
On to the point, everyone has to do this once. We've all got a story (I have an anthology). If you're confident that a lesson has been learned, no reason to belabor the point.
"Hello babies. Welcome to Earth. It's hot in the summer and cold in the winter. It's round and wet and crowded. On the outside, babies, you've got a hundred years here. There's only one rule that I know of, babies-"God damn it, you've got to be kind."
(Kurt Vonnegut, God Bless You, Mr. Rosewater)