[flagged] Firing Myself (backintyme.substack.com)
108 points by banzin 71 days ago | 67 comments



I once made a huge fuckup.

A couple years into my career, I was trying to get my AWS keys configured right locally. I hardcoded them into my .zshrc file. A few days later on a Sunday, forgetting that I'd done that, I committed and pushed that file to my public dotfiles repo, at which point those keys were instantly and automatically compromised.

After the dust settled, the CTO pulled me into the office and said:

1. So that I know you know: explain to me what you did, why it shouldn't have happened, and how you'll avoid it in the future.

2. This is not your fault - it's ours. These keys were way overpermissioned and our safeguards were inadequate - we'll fix that.

3. As long as it doesn't happen again, we're cool.

Looking back, 10 years later, I think that was exactly the right way to handle it. Address what the individual did, but realize that it's a process issue. If your process only works when 100% of people act perfectly 100% of the time, your process does not work and needs fixing.


Besides the obvious takeaway of the story, to anyone who reads this: use pre-commit hooks (or something equivalent) to avoid this kind of problem.

With the pre-commit framework, an example hook would be https://github.com/Yelp/detect-secrets
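
For anyone who hasn't set it up before, a minimal .pre-commit-config.yaml wiring that hook in looks roughly like this (a sketch: pin rev to whatever release is current, and generate the baseline first with `detect-secrets scan > .secrets.baseline`):

  repos:
    - repo: https://github.com/Yelp/detect-secrets
      rev: v1.4.0   # use the latest tagged release
      hooks:
        - id: detect-secrets
          args: ['--baseline', '.secrets.baseline']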


Yep, I've been adjacent to a couple of large incidents like this over my career (close enough to see the details) and up close to a few others, and this is the right way to approach it.

Did the person know they screwed up? Did they show remorse and a willingness to dive in and sort it out? They likely feel like absolute shit about the whole thing and you don't need to come down on them like a ton of bricks. If that much damage could be done with a single person then you have a gap in your process/culture/etc and that should be addressed from the top.

One of the best takes I've seen on this was from a previous manager confronted with a situation similar to the one in the article (it was a full DB drop). The person tried to hand in their resignation on the spot; the manager instead (and I'm paraphrasing here) said: "You're the most qualified person to handle this risk in the future as we've just spent $(insert revenue hit here) training you. Moving forward we want you to own backup/restore and making sure those things work".

That person ended up being one of their best engineers and they had fantastic resiliency moving forward. It turns out if you give someone a bit of grace and trust when they realize they screwed up you'll end up with a stronger organization and culture because of it.


> the manager instead (and I'm paraphrasing here) said: "You're the most qualified person to handle this risk in the future as we've just spent $(insert revenue hit here) training you.

This is an old quote that has been attributed to different people over the years. It shows up in a lot of different management books and, more recently, in LinkedIn influencer posts.

It’s good for lightening the situation and adding some levity, but after hearing it repeated 100 different times from different books, podcasts, and LinkedIn quotes it has really worn on me as somewhat dishonest. It feels clever the first time you hear it, but really the cost of the mistake is a separate issue from the decision to fire someone for it.

In real-world situations, the decision to let someone go involves a deeper dive: assessing whether the incident was really a one-off mistake, or the culmination of a pattern of careless behavior, failure to learn, or refusal to adopt good practices.

I’ve seen situations where the actual dollar amount of the damage was negligible, but the circumstances that caused the accident were so egregiously bad and avoidable that we couldn’t justify allowing the person to continue operating in the role. I wish it was as simple as training people up or having them learn from their mistakes, but some people are so relentlessly careless that it’s better for everyone to just cut losses.

However when the investigation shows that the incident really was a one-time mistake from someone with an otherwise strong history of learning and growing, cutting that person for a single accident is a mistake.

The important thing to acknowledge is point #3 from the post above: Once you’ve made an expensive mistake, that’s usually your last freebie. The next expensive mistake isn’t very likely to be joked away as another “expensive training”


> This is an old quote that has been attributed to different people over the years. It shows up in a lot of different management books and, more recently, in LinkedIn influencer posts.

And a similar story has been recounted by pilot Bob Hoover, where a member of the ground crew fuelled Hoover’s airplane incorrectly. Instead of a dollar amount, the cost is that Hoover and the two passengers could have died.

https://sierrahotel.net/blogs/news/a-life-lesson#:~:text=%E2...


I'm fairly certain it occurred, since the story was first-hand and from about 12+ years ago (although they may have lifted the line from similar sources). It's not a bad way to defuse things if it's clear there was an honest mistake.

Your point on willingness to learn is bang on. If there's no remorse, or the behavior was intentionally negligent, then yes, that's a different story.


Oh I’m sure it occurred. The CEO was just repeating it from the countless number of management books where the quote appears.

My point was that it's a story that gets overlaid on top of the real decision-making process.


To quote a statistician friend: 100% of humans make mistakes.

OP's leadership was shit. The org let a junior dev delete shit in prod and then didn't own up to _their_ mistake? Did they later go on to work at a genetics company and blame users for being the subject of password sprays?


Here's one of my favorite anecdotes / fables on the topic.

A young trader joined a financial company. He tried hard to show how good and useful he was, and he indeed was, at the rookie level.

One day he made a mistake, directly and undeniably attributable to him, and lost $200k due to that mistake.

Crushed and depressed, he came to his boss and said:

— Sir! I failed so badly. I think I'm not fit for this job. I want to leave the company.

But the boss went furious:

— How dare you, yes, how dare you ask me to let you go right after we've invested $200k in your professional training?!


So much this.

There is a great book which I think should be on the desk of every single person (especially leadership) working in any place which involves humans interacting with machines:

https://www.amazon.com/Field-Guide-Understanding-Human-Error...


Is that a referral link on HN? If so, please remove the referral part.

The title is The Field Guide to Understanding ‘Human Error’.


No, it is not a referral link. I searched for the book and then removed anything that looked like it would keep state/extra info. In retrospect, posting the title would indeed have been a simpler option :) thanks!


Never worked with AWS, but besides that it obviously shouldn't happen - is it really that bad? Couldn't the keys be invalidated/regenerated immediately after you realized they were compromised?


Oh, they can and were. But bad actors scrape github constantly for access keys. If you commit yours to a repo, some script somewhere will find those keys and use them to spin up EC2 boxes mining bitcoin or use SES to send scam emails within minutes. You can invalidate the keys and scrub your AWS account once you notice the issue - it just depends on how much damage the bad actors are able to do before you do that.

In my case, our CTO was messaging me (either Slack or Hipchat - whatever we were using at the time) within an hour or two. IIRC they only managed to accrue a few thousand dollars in charges before we got it under control.


I want a boss like that. Say no more.


> One of the peculiarities of my development environment was that I ran all my code against the production database.

Hahaha. I still see this being done today every now and then.

> The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue". His co-founder (remotely present via Skype) chimed in "you're lucky to still be here".

this type of leadership needs to be put on blast. 2010 or 2024, doesn’t matter.

If it’s going to cost “millions in revenue”, then maybe it would have been prudent to invest time in proper data access controls, proper backups, and rollback procedures.

Absolutely incompetent leadership should never be hired ever again. There should be a public blacklist so I don’t make the mistake of ever working with such idiocy.

The only people ever "fired" should be leadership. Unless the damage was intentional, in which case you should be subject to jail time.


Yeah, this is the lesson learned: why were they working in production? Not doing that is rule one, even in small startups. That was the real mistake, and the head of development (or whoever allowed this to happen) should have been fired.


They let you stay after costing them millions in revenue then? Doesn't sound like the worst leadership to me?


It sounds like they know they fucked up, but blamed their employee to avoid taking responsibility. Whether or not they chose to terminate the scapegoat is irrelevant to determining that they're bad leaders.


Oh please. It's shared blame, not no blame.

The process needs to improve, but the query didn't run itself. Leaving the person in their position while giving them what sounds like a light chewing-out isn't a sign of bad leadership.


I think the correct thing to do, no matter the scale of the fuck-up, is for you, as the person doing the confronting, to admit what you did wrong that led to this outcome, as well as address what they did.

They didn't do that, and they threw around "you" a lot at an "us" problem. That's bad leadership.

Leading by example means readily admitting to your own faults and acknowledging that the team has a shared responsibility for outcomes.


They “let you stay” after saying to your face you’ve done “a monumental fuck up” and are “lucky to still be [there]”. That is shit leadership. It reveals they will hold a grudge about that event which the employee won’t ever live down and is likely to impair their growth in the company. After that event one should start looking for a new job.


Blaming the employee means you can emotionally extort them for the rest of their tenure and use them to balance out the raise pool.


This sounds like a company that does not learn from errors and looks for "junior engineer" scapegoats instead of the systemic processes that facilitated this - not a great place to stay, tbh. This was a chance for the company to reflect on some of their processes and take measures that would avoid similar issues (and the steps to take are pretty obvious). And the description of what happened afterwards shows a probably toxic environment.

It should never be like this, and especially in this case I blame OP 0%. This is something that could happen to anybody in such circumstances. I have not deleted a full database, but I have had to restore stuff a few times; I have made mistakes myself and have rushed to fix problems caused by others' mistakes, and every single time the whole discussion was about improving our processes so that it does not happen again.


Agreed, and who allows the devs to work in production, even at small companies? People are fallible, so you're just waiting for something to blow up.


> I found myself on the phone to Rackspace, leaning on a desk for support, listening to their engineer patiently explain that backups for this MySQL instance had been cancelled over 2 months ago. Ah.

This is the issue, not what the author did. It was only a matter of time before the database was accidentally deleted somehow.


Forget accidental deletion - this same exact incident would've occurred as soon as the database's hard drive failed, or some similar infrastructure disaster occurred.

Lack of backups is inexcusable, and ten times more egregious than the author's mistake.


So a company gives junior engineers full access to a production database, without backups, so they can work on developing features that require DDL SQL commands. I've seen it happen before; what I've never seen is someone blaming the junior employee when things inevitably go south.

I'm not sure I even believe that part of the story. This was either a very dysfunctional company or a looooong time ago.


> I'm not sure I even believe that part of the story. This was either a very dysfunctional company

The first sentence of the article tells us it was "a Social Gaming startup", and with that, everything we needed to know.


I haven't personally seen this particular case either but I have no doubt it could happen. I've seen orgs where a blameless type culture isn't natural, and I've had to explain to the leadership that publicly humiliating (in jest) someone for getting caught by the phishing tests or posting private data to a pastebin type service is a bad idea.

And I've interacted with plenty of people who externalize everything that goes wrong to them, naturally some of these folks will be in leadership positions.


There's a lot of responsibility there resting on your superiors because they weren't following "best practises". Sure you fucked up, but if they had backups, it wouldn't have been such a disaster, and if you had a Dev environment to test against, it would have been a non-issue entirely. Straight out of Uni, you shouldn't have been expected to know that, but I bet you grew as a consequence.


Yep, whether the leadership recognizes it or not this is an organizational failure. No access controls for destroying prod data, no backups, no recovery plan, told to do testing in prod, whatever horrible process they have that required engineers regularly directly accessing the database.


Have been in situations just like this, on pretty much every side (the fuck-upper, the person who has to fix the fuck up, and the person who has to come up with a fuck-up remediation plan)

The most egregious case involved an incompetent configuration that resulted in hundreds of millions $ in lost data and a 6-month long automated recovery project. Fortunately, there were traces of the data across the entire stack - from page caches in a random employee's browser, to automated reports and OCR dumps. By the end of the project, all data was recovered. No one from outside ever found out or even realized anything had happened - we had redundancy upon redundancy across several parts of the business, and the entire company basically shifted the way we did ops to work around the issue for the time being. Every department had a scorecard tracking how many of their files were recovered, and we had little celebrations when we hit recovery milestones. To this day only a few people know who was responsible (wasn't me! lol)

Blame and derision are always inevitable in situations like this. It's how it's handled afterwards that really marks the competence of the company.


I can relate to this with my own story, where I managed to delete an entire database — my first day on the job, no less.

I was hired by a little photo development company, doing both walk-in jobs and electronic B2B orders. I was brought in to pick up the maintenance and development of the B2B order placement web service the previous developer had written.

Sadly, the previous dev designed the DB schema and software under the assumption that there would only ever be one business customer. When that ceased to be the case, he decided to simply create another database and spin up another process.

So here I am on my first day, tasked with creating a new empty database to bring on another customer. I used the Microsoft SQL Server admin GUI to generate the DDL from one of the existing tables, created (and switched the connection to) a pristine, new DB, and ran the script.

Little did I know, in the middle of many thousands of lines of SQL, the script switched the connection back to the DB from which the DDL was generated, and then proceeded to drop every single table.

Oops.
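
To make the failure mode concrete, the generated script presumably contained something like the following (table and database names here are made up; the point is the buried USE statement that silently retargets the connection):

  -- you run the script while connected to the new, empty database...
  CREATE TABLE dbo.Orders (OrderId INT PRIMARY KEY, PlacedAt DATETIME);
  GO
  -- ...but thousands of lines in, the generated output quietly switches back
  -- to the database the DDL was scripted from:
  USE [OriginalCustomerDb];
  GO
  DROP TABLE dbo.Orders;   -- and every statement after this point hits live data
  GO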

Of course, the last dev had disabled backups a couple months before I joined. My one saving grace was that the dev had some strange fixation on logging every single thing that happened in a bunch of XML log files; I managed to quickly write some code to rebuild the state of the DB from those log files.

I was (and am) grateful to my boss for trusting my ability to resolve the problem I had created, and placing as much value as he did in my ownership of the problem.

That was about 16 years ago. One of the best working experiences in my career, and a time of rapid technical growth for myself. I would have missed out on a lot if that had been handled differently.


> Sadly, the previous dev designed the DB schema and software under the assumption that there would only ever be one business customer.

What kind of an assumption is that?!


It's not uncommon for B2B software to isolate each customer in their own database.


Yes, but that is not the same as assuming there would only be one customer. I think your interpretation makes more sense.


That's a mistake I've made before, except we did have nightly backups.

I was not the one who set up the backups, nor did I perform the restore. I just told my senior I made a big mistake, and he said thanks for saying so right away and we're going to handle it. Our client company told their people to take the rest of the day off. It must have been costly. I learned a lesson, but I also internalized some guilt about it.

Reflecting on this, I was in an environment where it was the norm to edit production live, imagining we could be careful enough. I'm suggesting that the error the author publicly took the fall for was not all their fault. Everyone up the chain was responsible for creating that risky situation. How could they not have backups?


> I found myself on the phone to Rackspace, leaning on a desk for support, listening to their engineer patiently explain that backups for this MySQL instance had been cancelled over 2 months ago. Ah.

There is no part of this story that’s the protagonist’s fault. What a mess.


Agreed. Negligence bordering on criminal all the way up the management chain. The fact that they blamed the author is telling about the culture as well.


Yeah. And even if this cancellation was to save money, it’s not even an excuse.

What is hard and expensive is backups that are easy to restore.

Simple backups that may be a pain to restore, but that are going to save your company from bankruptcy, only cost you some cheap storage and someone who more or less routinely checks that your data is safe somewhere.

I mean, of course I'm not advocating for flaky backups, but even when you are cheap as fuck, you can afford a cron script running rsync and a couple of hard disk drives.
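
Even the bare-minimum version is a couple of crontab lines, something like this sketch (paths, hosts, and schedule are placeholders; DB credentials assumed to live in ~/.my.cnf):

  # nightly logical dump, then sync it off the box
  15 2 * * * mysqldump --single-transaction --all-databases | gzip > /backups/db-$(date +\%F).sql.gz
  45 2 * * * rsync -a /backups/ backup-host:/mnt/offsite/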

Going fully yolo on this topic should not be even remotely acceptable in any business.


Yeah, cannot help but agree. It should have been impossible for this to happen in the first place.


> It should have been impossible for this to happen in the first place.

Exactly. The CEO should have fired himself for allowing that environment to exist.


No junior should have been able to cause this much damage on their own without a safety net of some kind.

It's on the company for cancelling their backups.


I wouldn’t blame you for resigning, it sounds like an awful environment.

But individuals will always make mistakes; systems and processes are what prevent individuals' mistakes from doing damage. That's what was lacking here, not your fault at all. I just hope lessons were learned.


Should have spun it as a novel game feature. Like burning the library at Alexandria.


If it's true that the company had no backups of their production database, then the engineer was nearly blameless. The CTO should have been fired. If there was no CTO, then the Board Of Directors should have asked the CEO why they felt they were competent to run a software company without a CTO. If the CEO was technical enough to take responsibility for technical decisions, then the CEO should have been fired or severely reprimanded.

If a software company is making money, but developers develop against the production database, and there are no backups, then it's the leadership that is at fault. The leadership deserves severe criticism.


Clearly the fault of a terribly led engineering organization. Mistakes are almost guaranteed to happen. This is why good engineering orgs have guardrails in place. There were no guardrails whatsoever here. Accounts used to manually access production databases ad hoc should not have delete permissions for critical data. And worst of all, no backups.


> I found myself on the phone to Rackspace, leaning on a desk for support, listening to their engineer patiently explain that backups for this MySQL instance had been cancelled over 2 months ago. Ah.

As usual, a company with legitimately moronic processes experiences the consequences of those moronic processes when a "junior" person breaks something. Whoever turned off those backups as well as whoever thought devs (especially "junior" devs) should be mutating prod tables by hand are ultimately accountable.


I can remember my huge fuckups on a few fingers of one hand.

I'm not implying there are few of them, more like they each have a dedicated finger, and I remember them and the cold sweat feel of each specific one.

The first one was decades ago. There was a server room. They had a hardcopy printer in there, and behind it was a big red mushroom button that shut down the whole room. It was about head level if you were looking behind the printer...


I witnessed a similar one. The big box had an emergency power-off switch in back that stuck out. Our operator was dealing with cabling behind the big box, and when he stood up, his belt caught the switch. Oops.


Why is the first paragraph in the article an affiliate link to some medical scam site?

Edit: Also, the user name 'drericlevi' seems to be used on pretty much all other social media outlets by a very online ENT doctor from Melbourne. I think somebody got the password to an unused Substack and is just posting blogspam.

Edit again: Also the story in the post is taken from the 2023 book 'Joy of Agility' by Joshua Kerievsky and modified (presumably by AI) to be told from the first-person perspective:

https://books.google.com/books?id=0pZuEAAAQBAJ&pg=PA96&lpg=P...


I've really come to appreciate blameless postmortems. People hide mistakes when you have a culture that punishes.

My biggest foible: our 400-person company had issued blackberries, and I was pulled into a project where all the work was in India. I covered EMEA, and phone rates were like a quarter a minute, or something like that. There was no concern about using the phone/data.

I ended up spending a significant amount of time over there, and one thing I noticed was that it looked like I had a different carrier every time I looked at the device. I did not think much of it in the couple months I was there. When I got back, one of our Ops guys called me and we discovered it was closer to $9/min, with absurd data charges. I sat down with the CFO, as the $12k charge was not projected. I had no idea if I was about to be canned or not.

Instead, leadership and sales got brand new blackberries, with an unlocked SIM card. Celebrations by all the brass... Was very happy to have a job.


After 20+ years in tech, the only time I have ever seen a developer fired for making a mistake was when he leaked credentials because he used one of our repos as the base for an interview project at another company (on a day when he was allegedly working from home). We had a hardcoded key in our package.json to allow it to access a private repo, and he pushed that to a public repo which had an incriminating name like company-x-interview-project-hisname.

We immediately got an email from Github because they scan for such things. We rotated our keys within a few minutes, and he was gone the next day.

Obviously we should have been more careful with our keys, so all would have been forgiven if not for the fact that he leaked keys while looking for another job on company time.


So the question is… did he get the job with Company X?


Any privileged access or secrets given to engineers in plaintext on their daily driver workstations -will- be compromised or misused. Full stop. There is never a good reason to do this apart from management inexperienced in security and infrastructure management.

Every single use of secrets or privileged action of consequence must go through code review, followed by multiple signatures, with an automated system doing the dangerous things on your behalf and leaving a paper trail.

Never follow any instructions to place privileged secrets or access in plaintext on a development OS, or you might well be the one a CEO chooses as a sacrificial lamb.


I can't imagine putting someone who's new to this work in that kind of precarious position. If I let someone make a mistake that severe, I'd apologize to them and work with them through the solution and safeguards to prevent it from happening again.

A little bit of room for error is essential for learning, but this is insane. I'm so glad the only person who has ever put me in that kind of position is me, haha. This career would have seemed so much scarier if the people I worked with early on were willing to trust me with such terrifying error margins.


> The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue". His co-founder (remotely present via Skype) chimed in "you're lucky to still be here".

Should expose the CEO’s name. Between this and forcing you to work 3 days straight, that was the least professional way to handle this situation.


I recently dealt with something like this. Someone ran a `delete *` statement which truncated a bunch of tables in production with critical business data. They had autocommit on in their database client. Luckily, there was a backup available which restored the data. After the analysis, it was decided to provision less privileged roles and explicitly turn off autocommit in all database clients. I am also introducing a PR workflow with static analysis that detects these issues. Nobody was fired and no names were ever mentioned in the announcements.
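
For the curious, the gist of those two mitigations in MySQL-flavored SQL looks something like this (a sketch; the role, schema, and user names are made up):

  -- session-level safety net; most clients also expose this as a config option
  SET autocommit = 0;

  -- day-to-day role with no destructive privileges (no DELETE, DROP, or TRUNCATE)
  CREATE ROLE 'app_read_write';
  GRANT SELECT, INSERT, UPDATE ON appdb.* TO 'app_read_write';
  GRANT 'app_read_write' TO 'analyst'@'%';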


It makes you a better developer. I back up obsessively BECAUSE I fucked up almost this badly, and more than once. Hire yourself back and charge a bit more for the extra wisdom.


> backups for this MySQL instance had been cancelled over 2 months ago.

Uhh, there's the problem, not that someone accidentally deleted something lol


A typical commercial change control process exists to factor in human error. There's a cost to setting it up. This company never set it up, but ended up paying another way.


It's really the fault of the process decided by upper management; a junior dev shouldn't have that much access. Least-privilege access wasn't done correctly here.


It is ALWAYS the fault of management when the databases are lost.

Engineers must never feel guilty if the company was run in such a way as to make that possible.


Less "Firing yourself" and more like liberating yourself from a toxic unprofessional clown show.


the fact that the leadership vilified a JUNIOR developer on the erasure of a PRODUCTION database speaks volumes as to how toxic of a workplace this was. glad to hear you promptly walked out. screw places like this. disgusting.


She got gaslit. As a junior. Pretty sad.



