
Not really germane to the topic. This type of op fuck-up happens everywhere. It's hard to build solid process, particularly in growth phases.

Unless there are 2x a year restore tests, I personally assume a 60% backup fail rate.
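
For what it's worth, a restore test doesn't have to be elaborate. Here is a minimal sketch in Python, assuming the backup is a plain tar archive of a directory and that the source hasn't changed since the backup was taken (in practice you'd compare against a manifest captured at backup time). The names `file_digests` and `restore_test` are made up for illustration, not taken from any real tool.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_test(source_dir: Path, backup_archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to the source.

    Assumption: the archive was created with paths relative to source_dir.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(backup_archive) as tar:
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source_dir)
```

The point is that the check actually restores the data and compares content, rather than trusting that the backup job exited cleanly.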



The only reason they are still in existence is due to a chance backup they took for a tangential reason. From the sounds of it, their solution is held together with bubble gum, some tape and lots of hand waving. Being in 160 different locations probably doesn't help much either.


I'm pretty sure most solutions on the internet consist of bubble gum, some tape and lots of hand waving. GitLab's screwup is hardly unique, even if it was very public.

It's difficult and expensive to build and maintain a solid system. Even if you want to, time and financial pressures often don't let you, and it's hard to communicate the need for solid engineering, since that need usually only becomes apparent once problems start occurring.


Unless I'm misreading things, the reason they only lost 6 hours of data instead of 24 hours was a chance backup, but there was never an existential crisis here.

Downtime happens to pretty much every service out there. In this case the company was incredibly forthright and so we can make fun of their stupid mistakes, but really most mistakes are stupid when you look at them -- when you make thousands of decisions a day, some of them will seem silly in hindsight. We just never learn about most of them.


I'm not defending them -- but that is the norm.

I had a customer once in the 90s take a 30 hour outage that cost them nearly $6M in fines because some asshole put a budget freeze on anything related to cleaning, including tape drive cleaner carts. The dopey ops guy kept reusing one cleaning tape on multiple drives, so the cleanings did nothing.

I could personally rattle off a dozen stories like this at late stage startups, Fortune 10 and .gov.

The only reason many businesses are alive is luck and reliable SAN.


"It's hard to build solid process, particularly in growth phases." Hard is an understatement; it is insanely difficult.

However, the interview seems to suggest that "write everything down" is the way to solve this problem and is what allowed for all their growth and success. So the solid process they have built is to write everything down, which they obviously didn't do, or they wouldn't have had a 7-layer disaster recovery failure.


"Write everything down" is a terrible process, for two main reasons.

First, no one has the discipline to write much of anything down.

It's a very boring process and it's always going to be low-resolution. Your most meticulous documentation writers will get fired for failing to get their "real work" done. If you personally recognize the value of the documentation they furnish and thus refuse to fire them, all of their peers will feel that they are dead weight, which is still a ticket out of the company.

Second, after all that work, no one has the discipline to read much of anything that you've written!

They skim. They glance. They Ctrl+F. They don't read. When you're in a pickle and you can pick out a life-saving bit of documentation, it's amazing, but that happens quite rarely and requires a lot of energy.

How many times have you pulled up docs, tried to follow them, gotten a really confusing error that you spent hours trying to troubleshoot only to find out that there's a one-sentence explanation tucked away in the third sentence of the fourth paragraph on the page you originally pulled up? This just happened to me _last week_, and frequently the most frustrating problems are small things like that.

People don't read. It's nothing personal, they just don't read. It takes a lot of cognitive energy. People are biologically programmed to conserve as much energy as possible. Good programmers are both lazy and dumb!

If you want documentation that means something, it needs to be part of the process of actually working. I don't mean you need to add "write docs" to your checklist, I mean meeting the operational standards should be the only way things can get done in the first place.

The operating procedure needs to be married to the actual completion of the task, and that means setting good baseline project standards and setting reliable enforcement on those standards.

Code should be self-documenting to a reasonable extent. Tests should be mandatory. Peer reviews and signoffs should be mandatory. Internal company discussions should be recorded and referenceable. Documentation only works when it's self-generating.

In short, it should be run like a mature open-source project with an open IRC channel, mailing list, bug tracker, commit history, mandatory tests, maintainer signoffs, merge processes, code and style standards, docstrings and good automated documentation generators, and so on.


> First, no one has the discipline to write much of anything down.

This is why it's baked into the company culture. It's very common for someone to ask where something is in the handbook or if an issue has been created for something.

> It's a very boring process and it's always going to be low-resolution. Your most meticulous documentation writers will get fired for failing to get their "real work" done. If you personally recognize the value of the documentation they furnish and thus refuse to fire them, all of their peers will feel that they are dead weight, which is still a ticket out of the company.

Fwiw, everyone is responsible for maintaining the handbook/our process and procedure documentation. The docs team isn't on the hook for it, nor is it the sole responsibility of engineering.

> Second, after all that work, no one has the discipline to read much of anything that you've written!

After enough reminders, you'd be amazed at how quickly people learn to RTFM at work.

> They skim. They glance. They Ctrl+F. They don't read. When you're in a pickle and you can pick out a life-saving bit of documentation, it's amazing, but that happens quite rarely and requires a lot of energy.

It's true that people skim the handbook (the guide with all of the "Here's how you get access to Twitter accounts" stuff), but runbooks are actually looked at when stuff hits the fan. I think it's important to differentiate these two things since they have different purposes. Imo runbooks should be as lean as possible for that very reason.

> Good programmers are both lazy and dumb!

This level of documentation is helpful for non-developers who make up a significant part of many organizations. We're not just talking about documenting code here.

> If you want documentation that means something, it needs to be part of the process of actually working.

100%, this is the only way it works.


> Fwiw, everyone is responsible for maintaining the handbook/our process and procedure documentation. The docs team isn't on the hook for it, nor is it the sole responsibility of engineering.

I find that in practice, with non-mission-critical activities, especially documentation, when everyone is responsible for getting it done (and making it useful), no one is.


These are human problems. Doctors used to bitch about shit like this, because they felt it was beneath them to follow checklists. As a result people have had limbs removed and undergone life-threatening, unnecessary surgery. Guess what? The insurance companies demand checklists and controls to prevent fuckups.

There is an easy solution to dealing with people who refuse to read things. (Hint: It doesn't involve recording hours of meetings.) You need standard operating procedures to document what people do at an appropriate level of detail. Period.

Operational procedure is different than code documentation.


Yeah, I'm not disagreeing with checklists and SOPs. I'll edit my post to make sure that's clear as it's absolutely not what I'm trying to say at all. Quite the contrary: I'm trying to say that to the fullest extent possible, those SOPs should be integrated into the process so that skipping them isn't an option. Well-run open-source projects have given us good examples of how this works.

The goal is to marry the SOP and the actual completion of the task, to block off the shortcuts and require a minimum expenditure from external discipline/motivation reserves to get people to follow those SOPs.
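
To make that concrete, here's a rough sketch in Python of what I mean by marrying the SOP to the task: the only way to run the task is through the procedure, and the procedure refuses to run it while any step is unmet. `Procedure` and the step labels are hypothetical names for illustration; a real setup would more likely live in CI or a merge gate than in application code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Procedure:
    """An SOP expressed as named checks that must all pass before the task runs."""

    name: str
    checks: dict[str, Callable[[], bool]] = field(default_factory=dict)

    def run(self, task: Callable[[], str]) -> str:
        # Evaluate every step so the operator sees all unmet steps at once.
        failed = [label for label, check in self.checks.items() if not check()]
        if failed:
            raise RuntimeError(
                f"{self.name} blocked; unmet steps: {', '.join(failed)}"
            )
        return task()
```

Usage would look like `Procedure("db-maintenance", {"backup verified": check_backup}).run(do_maintenance)`, where `check_backup` and `do_maintenance` are whatever callables you already have. Nobody needs the discipline to read the SOP, because skipping it simply isn't available.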


Well I guess I must concede that we violently agree! :)


Right, and the biggest problem is that they never get updated so they don't match the reality of today's system.


"Write everything down" is the ante to function at all!

I agree... but I guess I read that coming from the perspective of someone who isn't a super fan of distributed teams. Many teams think that futzing around in Slack or Hangouts is enough. "Writing things down" is practically discovering the wheel to some places!


> This type of op fuck-up happens everywhere. It's hard to build solid process, particularly in growth phases

7 layers of backup processes that they had written down all failed, and went unchecked by any of their employees across the 160 locations.



