I would guess that the root cause of most outages with a human factor is disorganization and sloppiness, because if that weren't the case there wouldn't be an outage.
It’s interesting to me that GitLab are so public and honest. I don’t think that appeals to everyone, but it is a unique selling point to some.
Being public and honest is always cited when this happens to GitLab - which I can say because my fragile memory recalls a number of incidents. That should be alarming, but apparently their psy ops is better than their dev ops, because we all react with fondness and awe. Maybe I should do more of that at work!
I think that is because HN has a lot of people who know first hand that very few places are free of these kinds of issues.
In 25+ years of working in tech, I can honestly say I've never worked anywhere that hasn't had one or more serious issues where part of the cause was something everyone knew was a bad idea, but that slipped through because of time constraints, or a mistaken belief it'd get fixed before it came back to bite people.
That's ranged from 5-person startups to 10,000-person companies.
Most of the time, customers and people in the company outside the immediate team only get a very sanitized version of what happened, so it's easy to assume it doesn't happen very often.
GitLab doesn't seem like the best ever at operating these services, but they also don't look any worse than average to me, which is in itself an achievement, as most of the best companies in this respect tend to be ones with more resources that have had a lot more time to run into and fix issues. For a company their age, they seem to be doing fairly well to me.
So they went off and implemented a brand new fancy service discovery tool for what I bet was a problem they didn't have, but couldn't do the basics of tracking 2kb of data for the CA. I don't think that's an age issue, and there's nothing that prevents companies of any size from reflecting on what they're doing and what's important.
Also what’s the point of transparency if you’re not getting critical feedback from it and learning?
That last phrase is what I disagree with. Every company makes stupid mistakes, but GitLab seems to make a lot - more than average, compared to companies I've seen the insides of (of course a small sample).
Most places with decent devops hygiene have defense-in-depth around their backups.
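As a rough illustration of what one such layer can look like - not any particular company's setup; the database, table and bucket names below are made up - the key habit is that a backup that has never been restored is just a hope:

    # Sketch of a "restore-test your backups" job. prod_db, the projects
    # table and the S3 bucket are all hypothetical; assumes the PostgreSQL
    # CLI tools can connect locally without prompting for a password.
    import subprocess
    import datetime

    DUMP = f"/backups/prod-{datetime.date.today()}.dump"

    # 1. Take the backup (custom format so pg_restore can read it back).
    subprocess.run(["pg_dump", "--format=custom", "--file", DUMP, "prod_db"],
                   check=True)

    # 2. Restore it into a throwaway database to prove the dump is usable.
    subprocess.run(["createdb", "scratch_restore_check"], check=True)
    subprocess.run(["pg_restore", "--dbname", "scratch_restore_check", DUMP],
                   check=True)

    # 3. Cheap sanity check: the restored copy should not be empty.
    out = subprocess.run(
        ["psql", "-t", "-A", "-d", "scratch_restore_check",
         "-c", "SELECT count(*) FROM projects;"],
        check=True, capture_output=True, text=True,
    )
    assert int(out.stdout.strip()) > 0, "restore produced an empty projects table"

    # 4. Ship a copy somewhere the production credentials can't delete it.
    subprocess.run(["aws", "s3", "cp", DUMP, "s3://example-offsite-backups/"],
                   check=True)

    # 5. Clean up the scratch database.
    subprocess.run(["dropdb", "scratch_restore_check"], check=True)

The specific tools matter less than the fact that the dump gets restored, sanity-checked, and copied somewhere a compromised or fat-fingered production box can't touch.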
I've heard of people dropping production databases in big companies (but saved by backups).
There are some stories around the BitLocker blackmail thing that had a similar impact, but that was with a malicious opponent.
The only similar thing I've heard of is the notorious self-modifying MIT program (for geo-political coding) in the 1990s, which destroyed itself without backups.
> This incident caused the GitLab.com service to be unavailable for many hours. We also lost some production data that we were eventually unable to recover. Specifically, we lost modifications to database data such as projects, comments, user accounts, issues and snippets, that took place between 17:20 and 00:00 UTC on January 31. Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts.
How do you know what most incidents result in? For example, when GitHub deleted their production database[1], they simply gave no numbers of affected users/repositories. We do know that the platform already had over 1M repositories[2], so 5,000 affected seems perfectly possible, but their lack of transparency protected them against such claims. And that lack of transparency seems to me to be the norm.
That's the point: we know about that one. It's hard to believe "this happens everywhere" when we only know of a few instances, and any instance would be picked up by the media.
I've had to help clean up after any number of data losses or near losses that have never been made public, ranging from someone mkfs'ing the wrong device on a production server to truncating the wrong table. In some cases that meant people afterwards writing awful scripts to munge log files (that were never intended for that purpose) to reconstruct data that was too recent to be in the last backup.
Of course there are people who avoid this, but I've seen very few places with processes sufficient to fully protect against it - a lot of people get by more on luck than proper planning. Often these incidents come down to cold hard risk calculations: people know they're taking risks with customer data and have deemed them acceptable.
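To make the log-munging idea above concrete, those one-off reconstruction scripts tend to look roughly like this - the log format, timestamps and comments table are entirely hypothetical, and the fragility is the point:

    # Scrape an application log for events newer than the last backup and
    # turn them back into SQL. Everything about the format here is made up.
    import re
    import datetime

    LAST_BACKUP = datetime.datetime(2017, 1, 31, 17, 20)
    PATTERN = re.compile(
        r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO comment_created "
        r"id=(?P<id>\d+) project=(?P<project>\d+) body=(?P<body>.*)$"
    )

    with open("application.log") as log, open("replay.sql", "w") as out:
        for line in log:
            m = PATTERN.match(line.strip())
            if not m:
                continue
            ts = datetime.datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            if ts <= LAST_BACKUP:
                continue  # already covered by the backup
            body = m["body"].replace("'", "''")  # crude SQL escaping
            out.write(
                "INSERT INTO comments (id, project_id, body, created_at) "
                f"VALUES ({m['id']}, {m['project']}, '{body}', '{m['ts']}');\n"
            )

It only recovers whatever happened to be logged, in whatever shape it was logged - which is exactly why it's a last resort and not a backup strategy.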