
I would guess that the root cause of most outages with a human factor is disorganization and sloppiness, because if that weren’t the case there wouldn’t have been an outage.

It’s interesting to me that GitLab are so public and honest. I don’t think that appeals to everyone, but it is a unique selling point to some.




We used to (half) joke that in our “5 whys” process, #4 was often “because we were lazy [or in a hurry]”.


Being public and honest is always cited when this happens to Gitlab. Which I can say because my fragile memory recalls a number of incidents. This should be alarming but apparently their psy ops is better than their dev ops because we all react with fondness and awe. Maybe I should do more of that at work!


I think that is because HN has a lot of people who know first-hand that very few places are free of these kinds of issues.

In 25+ years of working in tech, I can honestly say I've never worked anywhere that hasn't had one or more serious issues where at least part of the cause was something everyone knew was a bad idea, but that slipped through because of time constraints or a mistaken belief it'd get fixed before it came back to bite people.

That's ranged from 5-person startups to 10,000-person companies.

Most of the time, customers and people in the company outside the immediate team only get a very sanitized version of what happened, so it's easy to assume it doesn't happen very often.

Gitlab doesn't seem like the best ever at operating these services, but they also don't look any worse than average to me, which is in itself an achievement: most of the best companies in this respect tend to be companies with more resources that have had a lot more time to run into and fix issues. For a company their age, they seem to be doing fairly well to me.


So they went off and implemented a brand new fancy service discovery tool for a problem I bet they didn’t have, but couldn’t do the basics of tracking 2kb of data for the CA. I don’t think that’s an age issue, and there’s nothing that prevents companies of any size from reflecting on what they’re doing and what’s important.

Also what’s the point of transparency if you’re not getting critical feedback from it and learning?


I mean, I much prefer them telling us about all their stupid mistakes to keeping all of the stupid mistakes hidden.

I know every company makes stupid mistakes, but all of the ones Gitlab made are public, and there are comparatively few.


That last phrase is what I disagree with. Every company makes stupid mistakes, but Gitlab seems to make a lot - more than average, compared to companies I've seen the insides of (of course a small sample).


For me, as soon as a company gets bigger, the number of mistakes becomes seemingly endless.


Yeah. “We ran rm -rf on a production server, and our backups turned out to be useless, but we’re public and honest!” Sorry, not impressed.


This happens everywhere. You just don’t know about it precisely because companies are normally not public and honest about it.


It really doesn't happen everywhere.

Most places with decent devops hygiene have defense-in-depth around their backups.
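
Roughly what I mean by that, as a minimal hypothetical sketch (the paths, thresholds, and pg_dump-style archive are assumptions, not any particular company's setup): a periodic check that the newest dump exists, is fresh, is non-trivial in size, and is actually readable, wired up to page a human when it fails.

    #!/usr/bin/env python3
    """Hypothetical backup sanity check: verify the newest dump is recent,
    non-trivial in size, and readable by pg_restore before trusting it."""
    import subprocess
    import sys
    import time
    from pathlib import Path

    BACKUP_DIR = Path("/var/backups/db")   # assumed location
    MAX_AGE_SECONDS = 24 * 3600            # alert if the newest dump is over a day old
    MIN_SIZE_BYTES = 10 * 1024 * 1024      # alert if the dump is suspiciously small

    def newest_dump() -> Path:
        dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
        if not dumps:
            sys.exit("FAIL: no dumps found at all")
        return dumps[-1]

    def main() -> None:
        dump = newest_dump()
        age = time.time() - dump.stat().st_mtime
        if age > MAX_AGE_SECONDS:
            sys.exit(f"FAIL: newest dump {dump} is {age / 3600:.1f}h old")
        if dump.stat().st_size < MIN_SIZE_BYTES:
            sys.exit(f"FAIL: {dump} is only {dump.stat().st_size} bytes")
        # pg_restore --list reads the archive's table of contents without restoring,
        # so a truncated or corrupt dump fails here instead of during an emergency.
        result = subprocess.run(["pg_restore", "--list", str(dump)],
                                capture_output=True, text=True)
        if result.returncode != 0:
            sys.exit(f"FAIL: pg_restore cannot read {dump}: {result.stderr.strip()}")
        print(f"OK: {dump} is fresh, non-empty, and readable")

    if __name__ == "__main__":
        main()

The actual defense-in-depth is having more than one backup mechanism and doing periodic restores into a scratch instance; a check like this just catches the failure mode where backups have been silently empty or broken for months.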

I've heard of people dropping production databases in big companies (but saved by backups).

There are some stories around the bitlocker blackmail thing that had similar impact, but that was with a malicious opponent.

The only similar thing I've heard of is the notorious self-modifying MIT program (for geo-political coding) from the 1990s, which destroyed itself without backups.


Gitlab was "saved by backups" as well. They lost some data since the latest backup, which is rather common.


Most places don't have decent "devops hygiene".


> You just don’t know about it precisely because companies are normally not public and honest about it.

If a big company lost a ton of user data, I'd absolutely know about it, whether they have Apple-level secrecy or not.


The incident described did not result in loss of tons of user data, and neither will most incidents, whether you choose to be open about them or not.


What are you talking about?

> This incident caused the GitLab.com service to be unavailable for many hours. We also lost some production data that we were eventually unable to recover. Specifically, we lost modifications to database data such as projects, comments, user accounts, issues and snippets, that took place between 17:20 and 00:00 UTC on January 31. Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts.

https://about.gitlab.com/blog/2017/02/10/postmortem-of-datab...

Yes, most incidents from most companies don’t result in this kind of data loss, which is why GitLab stood out.


How do you know what most incidents result in? For example, when Github deleted their production database[1], they simply gave no numbers of affected users/repositories. We do know that the platform already had over 1M repositories[2], so 5000 affected seems perfectly possible, but their lack of transparency protected them against such claims. And that lack of transparency seems to me to be the norm.

[1] https://github.blog/2010-11-15-today-s-outage/

[2] https://github.blog/2010-07-25-one-million-repositories/


MySpace lost all its music from 2003 to 2015: https://news.ycombinator.com/item?id=19417640

Probably a few hundred TB or so. Maybe nearly a petabyte?


That’s the point: we know about that. It’s hard to believe “this happens everywhere” when we only know of a few instances, and any such instance would be picked up by the media.


I've had to help clean up after any number of data losses or near-losses that have never been made public, ranging from someone mkfs'ing the wrong device on a production server to truncating the wrong table. In some cases people afterwards had to write awful scripts to munge log files (that were never intended for that purpose) to reconstruct data that was too recent for the last backup.
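
For the curious, a hypothetical sketch of what those "awful scripts" tend to look like - everything here (log format, field names, the cutoff time) is invented for illustration. The idea is to filter application log lines newer than the last backup and turn them into SQL for someone to review, hoping the log happened to capture enough of each write:

    #!/usr/bin/env python3
    """Hypothetical replay of application logs to recover writes made after the
    last backup. The log format and field names are invented for illustration."""
    import json
    import sys
    from datetime import datetime, timezone

    # Assumed cutoff: timestamp of the last good backup.
    LAST_BACKUP = datetime(2024, 1, 31, 17, 20, tzinfo=timezone.utc)

    def replay(log_path: str) -> None:
        with open(log_path) as fh:
            for line in fh:
                try:
                    event = json.loads(line)
                except ValueError:
                    continue  # the log was never meant to be a source of truth; skip junk
                # Timestamps assumed to be ISO 8601 with a UTC offset.
                ts = datetime.fromisoformat(event["timestamp"])
                if ts <= LAST_BACKUP or event.get("action") != "create_comment":
                    continue
                # Emit SQL for a human to review rather than writing to the database directly.
                body = event["body"].replace("'", "''")
                print(f"INSERT INTO comments (project_id, body, created_at) "
                      f"VALUES ({int(event['project_id'])}, '{body}', '{ts.isoformat()}');")

    if __name__ == "__main__":
        replay(sys.argv[1])

It works just well enough to be tempting, which is exactly the problem.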

Of course there are people who avoid this, but I've seen very few places whose processes are sufficient to fully protect against it - a lot of people get by more on luck than proper planning. Often these incidents come down to cold, hard risk calculations: people know they're taking risks with customer data and have deemed them acceptable.



