I would guess that the root cause of most outages with a human factor is disorganization and sloppiness, because if that weren't the case there wouldn't be an outage.
It’s interesting to me that GitLab are so public and honest. I don’t think that appeals to everyone, but it is a unique selling point to some.
Being public and honest is always cited when this happens to GitLab - which I can say because my fragile memory recalls a number of incidents. That should be alarming, but apparently their psy ops is better than their dev ops, because we all react with fondness and awe. Maybe I should do more of that at work!
I think that is because HN has a lot of people who know first hand that very few places are free of these kinds of issues.
In 25+ years of working in tech, I can honestly say I've never worked anywhere that hasn't had one or more serious issues where part of the cause was something everyone knew was a bad idea, but that slipped through because of time constraints, or a mistaken belief it'd get fixed before it came back to bite people.
That's ranged from 5-person startups to 10,000-person companies.
Most of the time, customers and people in the company outside the immediate team only get a very sanitized version of what happened, so it's easy to assume it doesn't happen very often.
GitLab doesn't seem like the best ever at operating these services, but they also don't look any worse than average to me, which is in itself an achievement, as most of the best companies in this respect tend to be ones with more resources that have had a lot more time to run into and fix issues. For a company their age, they seem to be doing fairly well to me.
So they went off and implemented a brand new fancy service discovery tool for what I bet was a problem they didn't have, but couldn't do the basics of tracking 2kb of data for the CA. I don't think that's an age issue, and there's nothing that prevents companies of any size from reflecting on what they're doing and what's important.
Also what’s the point of transparency if you’re not getting critical feedback from it and learning?
That last phrase is what I disagree with. Every company makes stupid mistakes, but GitLab seems to make a lot - more than average, compared to companies I've seen the insides of (of course a small sample).
Most places with decent devops hygiene have defense-in-depth around their backups.
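As a rough illustration of what one such layer can look like - not any particular company's setup; the database, table and bucket names below are made up - the key habit is that a backup that has never been restored is just a hope:

    # Sketch of a "restore-test your backups" job. prod_db, the projects
    # table and the S3 bucket are all hypothetical; assumes the PostgreSQL
    # CLI tools can connect locally without prompting for a password.
    import subprocess
    import datetime

    DUMP = f"/backups/prod-{datetime.date.today()}.dump"

    # 1. Take the backup (custom format so pg_restore can read it back).
    subprocess.run(["pg_dump", "--format=custom", "--file", DUMP, "prod_db"],
                   check=True)

    # 2. Restore it into a throwaway database to prove the dump is usable.
    subprocess.run(["createdb", "scratch_restore_check"], check=True)
    subprocess.run(["pg_restore", "--dbname", "scratch_restore_check", DUMP],
                   check=True)

    # 3. Cheap sanity check: the restored copy should not be empty.
    out = subprocess.run(
        ["psql", "-t", "-A", "-d", "scratch_restore_check",
         "-c", "SELECT count(*) FROM projects;"],
        check=True, capture_output=True, text=True,
    )
    assert int(out.stdout.strip()) > 0, "restore produced an empty projects table"

    # 4. Ship a copy somewhere the production credentials can't delete it.
    subprocess.run(["aws", "s3", "cp", DUMP, "s3://example-offsite-backups/"],
                   check=True)

    # 5. Clean up the scratch database.
    subprocess.run(["dropdb", "scratch_restore_check"], check=True)

The specific tools matter less than the fact that the dump gets restored, sanity-checked, and copied somewhere a compromised or fat-fingered production box can't touch.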
I've heard of people dropping production databases in big companies (but saved by backups).
There are some stories around the BitLocker blackmail thing that had a similar impact, but that was with a malicious opponent.
The only similar thing I've heard of is the notorious self-modifying MIT program (for geo-political coding) in the 1990s, which destroyed itself without backups.
> This incident caused the GitLab.com service to be unavailable for many hours. We also lost some production data that we were eventually unable to recover. Specifically, we lost modifications to database data such as projects, comments, user accounts, issues and snippets, that took place between 17:20 and 00:00 UTC on January 31. Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts.
How do you know what most incidents result in? For example, when GitHub deleted their production database[1], they simply gave no numbers of affected users/repositories. We do know that the platform already had over 1M repositories[2], so 5,000 affected seems perfectly possible, but their lack of transparency protected them against such claims. And that lack of transparency seems to me to be the norm.
That's the point: we know about that one. It's hard to believe "this happens everywhere" when we only know of a few instances, and any instance would be picked up by the media.
I've had to help clean up after any number of data losses or near losses that have never been made public, ranging from someone mkfs'ing the wrong device on a production server to truncating the wrong table. In some cases that meant people afterwards writing awful scripts to munge log files (that were never intended for that purpose) to reconstruct data that was too recent to be in the last backup.
Of course there are people who avoid this, but I've seen very few places with processes sufficient to fully protect against it - a lot of people get by more on luck than proper planning. Often these incidents come down to cold hard risk calculations: people know they're taking risks with customer data and have deemed them acceptable.
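To make the log-munging idea above concrete, those one-off reconstruction scripts tend to look roughly like this - the log format, timestamps and comments table are entirely hypothetical, and the fragility is the point:

    # Scrape an application log for events newer than the last backup and
    # turn them back into SQL. Everything about the format here is made up.
    import re
    import datetime

    LAST_BACKUP = datetime.datetime(2017, 1, 31, 17, 20)
    PATTERN = re.compile(
        r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO comment_created "
        r"id=(?P<id>\d+) project=(?P<project>\d+) body=(?P<body>.*)$"
    )

    with open("application.log") as log, open("replay.sql", "w") as out:
        for line in log:
            m = PATTERN.match(line.strip())
            if not m:
                continue
            ts = datetime.datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            if ts <= LAST_BACKUP:
                continue  # already covered by the backup
            body = m["body"].replace("'", "''")  # crude SQL escaping
            out.write(
                "INSERT INTO comments (id, project_id, body, created_at) "
                f"VALUES ({m['id']}, {m['project']}, '{body}', '{m['ts']}');\n"
            )

It only recovers whatever happened to be logged, in whatever shape it was logged - which is exactly why it's a last resort and not a backup strategy.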