
"This" being remote work?

Also, I'm a little surprised by the anti-GitLab sentiment on this thread. I thought the consensus was that they had bad luck but didn't do anything particularly more wrong than anyone else? (I may have missed some more analysis of the cause of the failure.)



> didn't do anything particularly more wrong than anyone else?

This is almost completely on their admin staff; maybe other people aren't willing to say it, but I will. Test your backups, or at least make sure they're non-zero in size. It should really be Operations 101.

Whether you do this automatically or manually, with a reminder on your calendar once a week or even once a month, doesn't matter. Something this simple would have solved their entire issue. I do this and we run a much smaller shop than GitLab. Heck, if we were larger I'd have hot spare database servers in another datacenter in case the primary got nuked by disk failure, network outages or mistakes.
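
To make that concrete, even a dumb cron'd check along these lines would have caught it. This is just a sketch; the path and age threshold are made-up placeholders for wherever your backups actually land:

    #!/usr/bin/env python3
    # Sanity-check the most recent backup: it exists, is non-zero, and is recent.
    # BACKUP_GLOB and MAX_AGE_HOURS are placeholders; hook the failure exit
    # into whatever alerting you already have.
    import glob
    import os
    import sys
    import time

    BACKUP_GLOB = "/var/backups/db/*.sql.gz"   # hypothetical backup location
    MAX_AGE_HOURS = 26                         # daily backups plus some slack

    backups = sorted(glob.glob(BACKUP_GLOB), key=os.path.getmtime)
    if not backups:
        sys.exit("ALERT: no backup files found at all")

    latest = backups[-1]
    size = os.path.getsize(latest)
    age_hours = (time.time() - os.path.getmtime(latest)) / 3600

    if size == 0:
        sys.exit("ALERT: latest backup %s is zero bytes" % latest)
    if age_hours > MAX_AGE_HOURS:
        sys.exit("ALERT: latest backup %s is %.1f hours old" % (latest, age_hours))

    print("OK: %s (%d bytes, %.1f hours old)" % (latest, size, age_hours))

And that's only the bare minimum; a real test restores the dump somewhere and checks the data is actually there.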


As part of our disaster recovery plan when I was working as a sysadmin at a 150-person company, we had a replicated server (database replication plus rsync'd web files) on hot standby; we just switched the front-facing servers over manually. GitLab has a very short blurb on a similarly styled HA setup. I'm not sure if or how they have implemented such a thing themselves, or whether it would have helped prevent or shorten the recent downtime. They have probably documented their own setup somewhere.

"Automated failover can be achieved with pacemaker alongside STONITH network management. Keep in mind that application servers need to be prepared for transitioning to the new network addresses.

In this situation you can also opt to synchronize the database via a database specific protocol instead of DRBD. In the documentation for each database you can find out more about the options for MySQL and the options for PostgreSQL." https://about.gitlab.com/high-availability/#filesystem-stora...
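
The web-files half of a setup like that doesn't need much; a cron'd rsync push to the standby does it (hostname and paths below are made up for illustration):

    #!/usr/bin/env python3
    # Push the web root from the primary to the hot standby; run from cron.
    # SRC and DEST are hypothetical placeholders.
    import subprocess
    import sys

    SRC = "/var/www/app/"
    DEST = "standby.example.internal:/var/www/app/"

    result = subprocess.run(["rsync", "-az", "--delete", SRC, DEST],
                            capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit("rsync to standby failed: " + result.stderr)

The database side is the harder part, which is where the pacemaker/DRBD or streaming-replication options from that page come in.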


We had a secondary server. The secondary we were using was basically dead.

That's what led to the problem. We were trying to fix it, but ran the wrong procedure on the primary instead of the secondary, thus wiping the primary instead of what might have been left on the secondary.

The secondary was already removed at that point.

Basically the procedure was the following:

1. Secondary falls behind too much, stops replicating. At this point you need to manually re-sync the whole thing

2. A re-sync requires an empty PostgreSQL data directory, so this data was removed

3. Re-sync doesn't work, leading to the other problems

4. At some point team-member-1 thought previous re-sync attempts left data behind, so team-member-1 wiped the data directory again to be sure; except team-member-1 ran this on db1.cluster.gitlab.com (the primary) instead of db2.cluster.gitlab.com (the secondary)


Synchronizing the database isn't quite what you want. It's true that in the case of an errant rm -rf it would almost certainly have helped, but it's approximately as easy to run a "DELETE FROM importantdata" and leave off the "WHERE" clause, which would get replicated. And certainly if you're using DRBD (replicate the volume, not the database), an rm -rf will get replicated.

I'm just genuinely unsure what a better outcome would have been here. (It's certainly a process failure that no backups existed other than the manual 6-hour-old snapshot, but I'm not sure you can do much better than automating that.)


You can easily add snapshotting on top of that, using on-disk technologies like LVM or ZFS, and gain resilience against that kind of issue, though, as well as the ability to take a full text backup (i.e. to SQL) from your replicated server with less performance impact than doing it on production.
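
A rough sketch of the snapshot half on LVM, taken on the replica so production doesn't notice (volume group, LV name and snapshot size are placeholders; on ZFS it would be a zfs snapshot instead):

    #!/usr/bin/env python3
    # Take a timestamped LVM snapshot of the replica's database volume.
    # VG, LV and SNAP_SIZE are placeholders; the snapshot is crash-consistent,
    # which is fine for a replica you only keep for recovery.
    import datetime
    import subprocess

    VG = "vg_data"        # hypothetical volume group
    LV = "pg_data"        # hypothetical logical volume holding the data directory
    SNAP_SIZE = "20G"     # copy-on-write space reserved for the snapshot

    stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
    snap_name = "%s_snap_%s" % (LV, stamp)

    subprocess.run(["lvcreate", "--snapshot", "--size", SNAP_SIZE,
                    "--name", snap_name, "%s/%s" % (VG, LV)],
                   check=True)
    print("created snapshot %s/%s" % (VG, snap_name))

You'd pair that with an lvremove of the oldest snapshots and, ideally, periodically mounting one to prove it's usable.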


They had LVM snapshots every 24 hours. They lost 6 hours of data because someone had coincidentally triggered a snapshot 6 hours prior to the deletion event. Otherwise, they would've lost several more hours of data.


Correct me if I'm wrong, but wasn't that from staging data, not from an LVM snapshot?


The "trigger" of the failure (accidentally deleting the wrong database) is bad luck that happens, but the lack of preparedness for it isn't. If I understood the document right even if the regular backups would have worked, they only happened every 24 hours (same as the staging replicas they recovered from, except one was manually created out of schedule that day).

Maybe being fully remote helps to let stuff like this "slip through the cracks". (Not great phrasing, but I don't think anyone made a conscious decision that "losing X hours of data is OK", and nobody questioned what goal the current practices could (not) achieve.) I don't know, but it seems at least like a question one might ask.


What's the best-practice approach for taking backups that are significantly more frequent than every 24 hours, but also robust to things like an rm -rf or a DELETE FROM table;? Something like continuous data protection seems like it would be far too much data for an active database server, no?

(Or are we just saying that they should have been taking backups every 15 minutes or hour or so?)


With modern database servers, people will take checkpoints at some regular interval and automatically stream append-only copies of all intermediate transaction logs to a storage server some place (the same transaction logs that the database servers use for replication).

In the event of accidental deletion, the most recent snapshot will be restored and the transactions up to a certain timestamp will be replayed on top of it. If done correctly, this can prevent all but a few minutes of data loss.

AWS Aurora (and possibly other database-as-a-service providers) has this functionality built-in.
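
For the archiving half (a generic PostgreSQL sketch, not a claim about GitLab's setup), you point archive_command at something that copies each finished WAL segment off the host. The destination below is a made-up placeholder, and in practice people usually reach for tooling like WAL-E or pgBackRest rather than rolling their own:

    #!/usr/bin/env python3
    # Used as: archive_command = '/usr/local/bin/archive_wal.py %p %f'
    # Copies each completed WAL segment to separate storage. Must exit non-zero
    # on failure so PostgreSQL retries instead of recycling the segment.
    import os
    import shutil
    import sys

    ARCHIVE_DIR = "/mnt/wal-archive"   # hypothetical mount backed by off-host storage

    def main():
        wal_path, wal_name = sys.argv[1], sys.argv[2]
        dest = os.path.join(ARCHIVE_DIR, wal_name)
        if os.path.exists(dest):
            return 0                   # already archived (simplified; a real script compares contents)
        tmp = dest + ".part"
        shutil.copy2(wal_path, tmp)    # copy then rename so a restore never sees a partial file
        os.rename(tmp, dest)
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Recovery is then: restore the newest base backup and replay the archived WAL up to a recovery target time just before the mistake.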


I'm not really qualified to confidently talk about best practices, so these points have question marks attached, since I can't judge what's possible in their setup and what's not. I guess I'm not sure if they did judge it, see last paragraph.

More frequent backups would of course help, if their impact is tolerable. They also need to be tested to actually exist and work.

At least LVM snapshots apparently are cheap enough that they can be done out-of-schedule just to get slightly newer data to staging, so they likely could have been done more often (but they probably weren't thought of as backups, which is why 24 hours seemed enough and nobody had a prepared plan to get back to production from them).

Similarly, Azure-side snapshots are mentioned as not enabled for the database hosts. Maybe that's not viable to do (they had performance issues with Azure, and I don't know how much overhead those snapshots cause), maybe it's just something they forgot to set up.

Others have asked "why is there only one replica", which also seems like a good question (but my experience with database replication is close to non-existent, and maybe the higher load of more replicas would have caused other issues. Don't know.).

My point is more that I don't have the impression that the 24 hours is a figure they arrived at by evaluating what "service level" they could achieve, but rather the result of someone at some point setting up a backup with some interval. I don't think 24 hours is a figure where they can say "that's the best we can currently achieve", or even "that's what we planned for", but "that's what we have because that's what we have and nobody has paid closer attention". There is at the very least a cultural angle to it, which is why how they work could be relevant.


One job I worked at had a slave set up with 30 minutes of replication lag intentionally applied, to reduce the impact of someone accidentally nuking a table.

Then, as someone else mentioned, you can stream your replication log to storage elsewhere, giving you the ability to do point-in-time restores.
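
With PostgreSQL that kind of deliberate lag is the recovery_min_apply_delay setting on the standby; MySQL has MASTER_DELAY for the same thing. It's worth watching that the delay is what you think it is. A quick check, assuming psycopg2 and a made-up DSN:

    #!/usr/bin/env python3
    # Report how far behind the intentionally delayed standby is.
    # DSN and the alert threshold are placeholders.
    import psycopg2

    DSN = "host=delayed-standby.example.internal dbname=postgres user=monitor"
    TARGET_DELAY_MIN = 30

    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) / 60")
    delay_min = cur.fetchone()[0]
    conn.close()

    if delay_min is None:
        raise SystemExit("ALERT: standby has not replayed anything yet")
    print("replica is %.1f minutes behind" % delay_min)
    if delay_min > TARGET_DELAY_MIN * 4:
        # Far beyond the intended delay usually means replication is broken, not delayed.
        raise SystemExit("ALERT: delayed replica looks broken")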


Use that live database replication along with LVM, Btrfs or ZFS snapshotting every hour, purging the old ones as you go. Take a full, proper backup every hour if you can, or every day if you can't. If you can afford to, do a SQL dump at reduced priority and compress it instead of taking a binary copy, as a text dump is easier to check.
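
For the reduced-priority dump piece, something along these lines (host, database and output path are placeholders; nice/ionice keep it from competing with other load):

    #!/usr/bin/env python3
    # Low-priority, compressed logical dump taken from the replica.
    # Connection details and paths are placeholders for illustration.
    import datetime
    import subprocess

    stamp = datetime.date.today().isoformat()
    outfile = "/var/backups/db/app_%s.sql.gz" % stamp

    cmd = ("nice -n 19 ionice -c3 "
           "pg_dump --host=replica.example.internal --username=backup app_production "
           "| gzip > " + outfile)
    # pipefail so a failing pg_dump isn't hidden behind a successful gzip
    subprocess.run(["bash", "-o", "pipefail", "-c", cmd], check=True)
    print("wrote " + outfile)

Being plain text, even a quick zcat piped to tail tells you whether there's real data in it or not.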

Anyone with more experience with large databases have anything to add or any concerns with this sort of scheme?


Hourly snapshots that you purge every 24 hours (after the nightly is successfully taken).


It's unfair to characterize the entire organization by one failure, but their data loss issue is not attributable to some obscure, convoluted technical issue or freak accident. The simple fact is that they didn't verify that their backup and recovery infrastructure was working correctly. Moreover, they didn't have any automated monitoring confirming that the backups even appeared to be working. Those are things that every moderately experienced IT person knows need to be in place and occurring routinely.

If you look up the doc, they have like 6 different places where they looked for backups and each point is like "We should have these backups, checking for them now... woops, it hasn't been working."

The only reason GitLab's data loss wasn't even more disastrous is that an employee just so happened to take a manual snapshot 6 hours before the incident occurred.

Now, I'm sure we've all been in situations like that. I don't think it necessarily reflects on the individual technical employees of GitLab. I think it says more about the organizational and management resources at the top. It was their job to ensure that the appropriate resources were committed to data integrity, and they failed to do so.

90% of us probably work at places with similar problems, but publishing "the secret to [managing technical employees]" fresh on the heels of such a spectacular organizational failure is probably unwise.


Are you kidding? How is not ensuring your backups have EVER worked, or were even set up, considered "bad luck" or "anything particularly more wrong"? We're not even talking about testing backups, but literally whether they had been set up to run at all. The only reason this wasn't a complete meltdown was that the tech decided he'd snapshot before he did work six hours earlier.


I think you're confusing sympathy and solidarity towards their engineers (which I was very glad to see) with "wow, this could happen to anyone". Quoting from their post-mortem:

> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

It's not true that having zero tested working backup/restore strategies happens to everyone. It's a catastrophic failure in process. Let's not blame any specific engineers, but it seems totally appropriate to blame the CEO (who was the subject of the puff piece!) and decrease your trust in the company appropriately.


Look, everyone is supportive of GitLab and how they recovered, but to come out boasting after being down for that long is a hard sell.


I totally agree that we should not be boasting but figuring out how to improve our service.


It may be due to downtime earlier today (this being only days after the data loss). Things don't seem to be going well for GitLab right now.


This user account is not active or something and his tone is a little offensive but I think he makes a point. Writing everything down does nothing if no one reads it.

writing things down is not the problem... failing to read the things they write down is the problem. they probably don't have time to read, because they are too busy always writing. they can't find what they need to read in the sea of endless wasteful and pointless drivel that never should have been written down in the first place. you're all idiots.



