How to do cheap backups (mixpanel.com)
134 points by suhail on Feb 21, 2012 | 48 comments



The quoted numbers per GB are honest calculations, but they can be a little misleading because they don't reflect the REAL costs you are going to end up paying PER MONTH and UP FRONT. As with anything, you always have to run the calculations first.

An illustrative example: My co-founders and I recently looked at a bunch of new office spaces and were doing similar comparisons with costs per square foot for each space. We ended up in a situation where option A (a much larger space) looked MUCH cheaper at $17/sq. ft., but we ended up going with option B (a smaller space) at nearly $40/sq. ft., because we just didn't need all the space in option A, and option B was in an office facility where we wouldn't have to buy any additional furniture or appliances like couches, chairs, a coffee maker, a fridge, etc. (they were supplied to all tenants in a large common area as part of the cost). So the REAL cost difference to us PER MONTH was about $400 LESS with option B (the smaller, more "expensive" space), and it came with a lot of extra conveniences to boot.

So in the example given in the article, if you're not using the full 45TB of backup storage space (24 x 2TB in RAID-6), you could actually end up paying significantly more per GB for what you have stored (what you're actually using) - ESPECIALLY when you include the UP FRONT costs of buying and co-locating the server and the maintenance costs that go along with it.

Moral of the story: Just because something LOOKS more expensive per unit doesn't mean it's actually GOING TO BE when it comes to cashflow. ALWAYS do the math for your own situation before making decisions like these.


Not sure how honest these calculations are because I couldn't repeat them and the article gives no REAL detail.

The cheapest matching configuration I could find at SoftLayer was $559 for a dual-processor Xeon 5504 with 12GB RAM ("Speciality: Mass Storage"). Each 2TB drive costs an additional $60. I'm assuming the RAID controllers come free if you can plug in that many drives, so I'm not adding those to the build. Total for the build with 24 drives: $1999. (Assuming a free OS, and that you don't need to upgrade the network port speed.)

24x2TB in RAID-6 gives you 45TB usable capacity (all 24 drives, madness) or 20TB usable capacity (12 drives in mirrored RAID-6 configuration, probably more sane).

45TB @ $1999/mo is ~$0.045/GB, but if more than 2 disks fail you're screwed (quite likely to happen in a 24 disk configuration!)

20TB @ $1999/mo is ~$0.1/GB, and you can stand to lose at most 4 disks (at most 2 in each of the mirrored arrays).
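
(Quick sanity check on those per-GB numbers, assuming the $1999/mo build above; awk is just doing the division:)

  awk 'BEGIN { printf "45TB usable: $%.3f/GB/mo\n", 1999 / (45 * 1024) }'  # ~$0.043
  awk 'BEGIN { printf "20TB usable: $%.3f/GB/mo\n", 1999 / (20 * 1024) }'  # ~$0.098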

Compare that with the $0.11/GB Amazon S3 charges you to store data in the TB range (excluding data transfer costs, which are free for incoming data, which ought to be the bulk of it), and I'm not really sure I follow their argument, given that maintaining this backup system is going to be a PITA and the risks don't seem worth it.


It isn't a sale if you weren't planning on buying it in the first place.


You need to hire a sysadmin (or operations engineer or whatever they're called these days). You're talking about developers spending time on implementing backup, which is wrong. That's not what developers should be doing. Sysadmins may occasionally write software to solve a problem, but they're not software developers. Software developers may occasionally do sysadmin-type work, but in my experience most of them are notoriously bad at it (setting up mod_python because they have a blog post with step-by-step instructions from 2004 bookmarked where that was the best way to do stuff - or leaving gigantic virtualenv turds with complete Python binaries in your SCM because they don't understand what virtualenv is for).

Sysadmins also have toolkits (you know, screwdrivers, torx bits, zip-ties) and have scars on their arms that prove they aren't afraid of sharp-edged hardware stuff that sometimes starts smoking for no discernible reason and doesn't turn on any blinkenlights when you push the button (many software developers panic at this point). This comes in very handy when you just "left the cloud" and you're experiencing first-hand the reasons why people moved into the cloud in the first place (hardware sucks).

Sysadmins also know about this backup stuff and will tell you to shut up when you start talking about doing it with cobbled-together shell scripts. They'll probably recommend using something like Amanda (or a commercial equivalent), which makes sure your backups happen regularly, are complete, and actually contain the stuff you needed to back up. Good ones may even know to test the backup occasionally by restoring a server just to see if it actually works afterwards.

(Apologies to any software developers who know their sysadmin stuff.)


Cheap backups? Use de-duplication.

I have a cronjob with daily dumps of several MySQL databases, in the usual textual format. One day's worth of dumps takes about 470MB now. Two years ago, we started at about 20MB/day and it has been growing ever since.

Each dump is committed into one common Git repo. After two years (that's just over 700 dumps), this whole Git repo is about 180MB.

Yep, much less than one daily dump. Git performs thorough de-duplication and delta-compression.
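
The whole setup is roughly this (a minimal sketch - paths and database names are made up, and the exact mysqldump flags are covered in the replies below):

  #!/bin/sh
  # daily dump-and-commit job; /srv/db-dumps is an ordinary repo made with `git init`
  set -e
  cd /srv/db-dumps
  for DB in app analytics; do
      mysqldump --single-transaction --order-by-primary --skip-dump-date "$DB" > "$DB.sql"
  done
  git add -A
  git commit -q -m "dumps for $(date +%F)" || true   # don't fail the job if nothing changed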

Cheap backups? Use de-duplication.


Very cool. What options do you pass to `mysqldump' for best diffability?


Straight from the dump scripts:

  --complete-insert --hex-blob --skip-add-drop-table --single-transaction --order-by-primary --skip-dump-date
  --force # so mysqldump does not bail out on invalid view
  --no-create-info # do not emit CREATE TABLE, because its AUTO_INCREMENT changes often, and would create unnecessary differences

where the `order-by-primary' is probably the most important, and `skip-dump-date' sure helps.

Also, I make a big deal out of spreading every large table into a set of smaller dumps, each with a fixed number of rows, sorted by record ID. For various reasons, most of our tables are usually appended to, and changes (UPDATE, DELETE) are less common. Thanks to this, changes are usually confined to the last file of a set (with the newest records), and the other files stay mostly unchanged -- and so they pack the best.

  --where="_rowid >= $FROM AND < $TO"
I try to keep individual files down to about 8...16MB, 32MB max, so git's repack (upon pull/push/automatic gc) doesn't take too much time or RAM.


How do you split the files? Is that part of mysqldump (if so, how), or is it a handrolled thing?


For now, handrolled. The idea is to be able to do either `cat *.sql' or just `cat LAST-PART.sql'. I run mysqldump once per each large table with a --where="_rowid >= $FROM AND _rowid < $TO" argument, and call mysqldump in a loop with consecutive $FROM and $TO. It works, it gets the job done, but it's not transaction safe.

That `_rowid' is a reserved symbol in MySQL. It refers to the table's PRIMARY KEY (but only if it's a single INT column). In the usual case, the script doesn't have to know the table's PRIMARY KEY.
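
Roughly, the loop looks like this (a hand-wavy sketch; database, table name and chunk size are made up):

  #!/bin/sh
  # chunked dump of one large table into fixed-size parts
  TABLE=events; DB=mydb; CHUNK=100000
  MAX=$(mysql -N -e "SELECT MAX(_rowid) FROM $TABLE" "$DB")
  FROM=0; PART=0
  while [ "$FROM" -le "$MAX" ]; do
      TO=$((FROM + CHUNK))
      mysqldump --order-by-primary --skip-dump-date \
          --where="_rowid >= $FROM AND _rowid < $TO" \
          "$DB" "$TABLE" > "$(printf '%s-%06d.sql' "$TABLE" "$PART")"
      FROM=$TO; PART=$((PART + 1))
  done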

Another way would be to use a `rolling checksum' to split the files; the concept is described at http://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/ But you could end up with dump files split in the middle of an SQL statement, which is not very cool.


I looked into backup options for my company and ended up rolling our own solution using an open-source program called Duplicity: http://duplicity.nongnu.org/ . I've been really impressed with Duplicity. Incremental backups are fast, the data is encrypted and you can target many different source/destination types (local file system, ssh, Amazon S3, ftp).
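
For anyone curious, a typical run looks something like this (bucket name and paths are made up, and it assumes your AWS keys and a GPG key are already set up in the environment):

  # incremental backup to S3, forcing a fresh full backup every 30 days
  duplicity --full-if-older-than 30D /var/www s3+http://my-backup-bucket/www

  # restore the latest backup somewhere else, to check it actually works
  duplicity restore s3+http://my-backup-bucket/www /tmp/www-restore

  # throw away backup chains older than three months
  duplicity remove-older-than 3M --force s3+http://my-backup-bucket/www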


The main reason for me to choose Duplicity is the multiple-protocol support. I've had so many issues in the past switching to other storage facilities, only to have to completely change the backup process. Plus, I'm using multiple storage sites for really important data.

Duplicity can be a bit finicky though, especially when it comes to cleaning up after itself, even more so if the backup was somehow interrupted.


Duplicity is also unable to incrementally remove old backups.

A backup set starts with a full backup, then has incrementals after that. You can remove the incrementals, but if you remove the (old) full backup, you have to do a full backup again.

A typical strategy is incrementals every day and a full backup once a week. However, this is really hard on home connections - it's just too much data to do a full backup every week.

One of its pros is that everything is completely encrypted from when it leaves your machine - so you can backup to a completely untrusted repository without worry.

However, in order to do that it locally stores block-checksums of every file it backs up. (So it can detect differences.) These files can get large.

Another option from the same source is rdiff-backup. This has none of those limitations - it uses a CVS-style reverse diff (so old increments can simply be deleted, and the most current version is stored as a plain file, which makes restores very easy). However, it's not encrypted before it leaves the source - so you have to trust the repository at least enough to create a local encrypted volume on it.
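
For comparison, rdiff-backup usage is about as simple as it gets (host and paths are made up):

  # mirror /var/www to the backup host, keeping reverse diffs of older versions
  rdiff-backup /var/www backup@backuphost::/srv/backups/www

  # drop increments older than four weeks
  rdiff-backup --remove-older-than 4W backup@backuphost::/srv/backups/www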


Security reminder:

The live server should not have write access to the backup machine.

Instead the backup machine should have read access to the live server.

This prevents disaster in case of hacks.
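
One common way to arrange that (hostnames and paths made up): the backup box pulls over SSH with a key that the live server locks down to read-only rsync, e.g. with the rrsync helper that ships with rsync.

  # on the live server, in ~backupuser/.ssh/authorized_keys (one line):
  #   command="rrsync -ro /var/backups",no-pty,no-port-forwarding ssh-rsa AAAA...

  # on the backup machine, run from cron; with rrsync the remote path is
  # interpreted relative to the restricted directory (/var/backups here)
  rsync -az backupuser@live.example.com:mysql/ /srv/backups/live/mysql/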


That's one option, but I prefer doing it the opposite way around: Have the live server push data to the backup server via an append-only interface. This is much simpler in terms of access control if you want to back up some of the live server's data but not all of it.


I think you missed his point. He meant it's safer to give the live server no access, because if it gets hacked or infected with a virus, it cannot impact the backup server.


I think you missed my point. I was suggesting that the live server should access the backup server via an append-only interface, i.e., one which doesn't allow it to delete backups or modify them.
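
For example (entirely hypothetical names and paths), a tiny SSH forced command on the backup host that accepts new files but can never read, overwrite or delete existing ones:

  #!/bin/sh
  # /usr/local/bin/append-only-receive, set as the forced command for the live
  # server's key in authorized_keys:
  #   command="/usr/local/bin/append-only-receive",no-pty,no-port-forwarding ssh-rsa AAAA...
  # the live server then sends a backup with:
  #   ssh backup@backuphost put db-2012-02-21.sql.gz < db-2012-02-21.sql.gz
  set -eu
  BACKUP_DIR=/srv/backups/live

  set -- ${SSH_ORIGINAL_COMMAND:-}            # intentionally unquoted: split into words
  [ "$#" -eq 2 ] && [ "$1" = "put" ] || { echo "usage: put <name>" >&2; exit 1; }
  name=$(basename "$2")                       # drop any directory components
  target="$BACKUP_DIR/$name"

  [ -e "$target" ] && { echo "refusing to overwrite $name" >&2; exit 1; }
  cat > "$target"                             # new files in, nothing out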


Now the security of your backups is completely dependent on the construction of the append-only interface. Are you 100% certain it can't be compromised or permission-escalated?


It's much easier to build an append-only interface than a read-only-and-only-read-some-files-not-others interface.


Tarsnap pricing is for deduplicated data. You need to divide it by orders of magnitude for a proper comparison.

Apples and oranges - what you are doing is on site backup, Tarsnap is offsite. Both are needed.


Tarsnap pricing is for deduplicated data. You need to divide it by orders of magnitude for a proper comparison.

To be fair, that depends on how much duplicate data you're backing up. If you weren't using Tarsnap, you probably wouldn't have very many mostly-duplicate archives.

I think the better way to look at it is that Tarsnap makes it easy to have many archived snapshots of your data without paying very much more. Whether being able to have daily snapshots for the past year is worth more than always having just your latest archive... well, that depends on what you're doing. It's a feature I think is important for my own use cases, but I'm not so bold as to think that everybody has exactly the same needs as me.


Dedupe isn't magic. If your data isn't duplicated on the block level, it isn't going to do anything for you. It does wonders on backing up 10,000 windows machines, it doesn't do anything at all for a server's data store.


I disagree. The deduping enables low-cost snapshot semantics which has drastically simplified my life.

Every 24 hours, I dump a database and let tarsnap work out what to actually send to S3. And it does such a good job, for such a low price, that it boggles the mind.
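
Concretely, the nightly job boils down to something like this (database and archive names are illustrative):

  # nightly snapshot; tarsnap only uploads blocks it hasn't already seen
  mysqldump --single-transaction mydb > /var/backups/mydb.sql
  tarsnap -c -f "mydb-$(date +%Y-%m-%d)" /var/backups/mydb.sql

  # prune old archives now and then
  tarsnap --list-archives | sort
  tarsnap -d -f mydb-2012-01-01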


Fair enough, but I'm reacting to the unsupported claim that you can just divide it by "orders of magnitude", when the original post does not contain enough information to claim that - and indeed contains enough information to suggest it's probably not true. If you have 100GB of server-type service data, you're not looking at "orders [plural] of magnitude" less space taken up on the backup service. The huge multipliers that dedupe is sometimes cited as giving apply to certain datasets; you do not get "orders of magnitude" shrinking in the general case.


The claim is based on personal experience.

With Mixpanel I'd expect even better results since their data is append-only by nature. I.e. think about deduplication when backing up a large append-only log file on a daily basis.

Tarsnap has weaknesses but cost of storage for Mixpanel type of workload is not one of them.


Note that mixpanel is talking about full dumps. Tarsnap does cross-backup deduplication, so while your first 100GB dump may take 80GB, the next may take 100MB.


Good point, but I would have said the big win with Tarsnap is the zero-effort encryption. It means that everything you send is safe, so you don't have to worry about whether there is sensitive data in some particular DB / directory / logfile / whatever.


Tarsnap addresses the security concern of cloud backup, making it effectively as secure as on-site backup, but this article seems to stipulate that security isn't a big issue with offsite backup.


Anybody else disappointed that an article called "How to do cheap backups" didn't at ALL describe how they, well, actually do the backups? I was expecting some smart copy algorithm, not a post about the price of hardware. Also, they compare their hardware at full capacity to the scaling pricing models of AWS and others. Their first GB will cost a lot more than the price/GB stated here.


Takes about 25 lines in bash to do rotating, encrypted S3 backups with Timkay's awesome aws script.
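
The core of it is something like this (bucket, GPG recipient and paths are made up; if I remember the syntax right, `aws put' is the upload command in Timkay's script):

  #!/bin/bash
  set -e
  DAY=$(date +%u)                      # 1..7 => a rolling one-week rotation
  DUMP=/var/backups/db-$DAY.sql.gz

  mysqldump --single-transaction mydb | gzip > "$DUMP"
  gpg --batch --yes -r backups@example.com -o "$DUMP.gpg" -e "$DUMP"

  # same object name each week, so last week's copy gets overwritten
  aws put "my-backup-bucket/db-$DAY.sql.gz.gpg" "$DUMP.gpg"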

I actually had to throttle our servers when sending to amazon as they seem to be able to receive at impossible maximum speeds and eat the whole pipe!


Is this the one? http://timkay.com/aws/

I somehow hadn't seen it before; thanks!



I hope the backup machine is connecting to the live machine and they're sealed off from each other. I've heard of cases where hackers have managed to get into a machine and then access backup machines to completely wipe all copies of a database.


I think there's an argument for periodic off-line, "hold it in your hands" backups.

Good luck wiping that...


or wipe record of them being there


If all your servers are on Amazon EC2, and your backups are in S3, then you have all your eggs in one basket. One billing dispute with Amazon and your servers and backups are gone.

Backups are there to protect you in case the worst happens.


Excellent point. While Amazon is unlikely to vanish overnight, can your business really rely on it being there tomorrow? Even if it's still there, you could be prevented from accessing it for a variety of reasons (billing, legal, technical). Do you have a plan for that? How fast can you recover from it?

We're very careful about not putting all our eggs in one basket and have servers at two different VPS providers (Amazon-EU and Gandi.net in Paris) and backup (and only that) at a third. Our DR manual includes instructions on how to restore an EC2 machine to a Gandi VPS and vice versa.


Make a second account?


Do the Amazon EC2 terms allow that? If not, you might give Amazon (or a zealous front-line fraud watcher) an excuse to lock out both your accounts before you get into any dispute.


Yes, you can have multiple AWS accounts, as long as you're not doing so for bad reasons. According to the AWS Service Terms, "You may not access or use the Services in a way intended to avoid any additional terms, restrictions, or limitations (e.g., establishing multiple AWS accounts in order to receive additional benefits under a Special Pricing Program)" but that's really just a "don't try to cheat" clause.

Prior to the creation of IAM, the standard way of creating restricted-privilege access keys was to have multiple accounts; I have at least 5 AWS accounts (I say "at least" because it's possible I've forgotten some which I'm not using any more...), lots of people at Amazon know about this, and nobody has ever suggested that there is anything wrong with it.

On the other hand, if Amazon decided to close your account, they would probably look to see if there were any other accounts owned by the same person. On the gripping hand, I've never heard about anyone ever having a billing dispute with Amazon Web Services, which at their scale tells me that they're very reasonable people and not prone to Paypalesque random account closing.


I use a nice OS X app called Arq which does encrypted backups to my own S3 bucket. Since I only have a few GB of stuff that warrants offsite backup (git repos, etc.), the ~$0.25/month in storage fees is well worth the convenience.


Any good Windows/Linux alternatives?


For Windows, Duplicati is a good option: http://code.google.com/p/duplicati/

For Linux, try Duplicity (which I mentioned in another comment): http://duplicity.nongnu.org/

Both are open source and have similar features to Arq (including Amazon S3 support).


It's nice, but they still need someone to keep an eye on the backup machine.

Also, S3 is expensive because it keeps many copies of your data (though they appear as one) and checks them for corruption, so it would be more reliable than a single backup machine.


One important feature they didn't mention: if you have a decently powerful backup server hosting all of your data, it may be relatively easy in case of an emergency to use it to serve production data directly. For instance, you could start a mysql instance with the backup data directly from the backup server, if your production server (or datacenter) is fried.

There is no easy way to achieve that from S3 or tarsnap.


Mirroring is not backup. If you keep your "backup" server in sync with your live server so that you can promote it to production quickly, it's also going to be in sync with any loss which occurs through sysadmin error or deliberate attack.

Ideally you should have both a mirror on standby and a backup server with multiple earlier copies of your data.


What they said doesn't necessarily have anything to do with mirroring.

When we take a backup of MySQL to our backup server, we also create a my.cnf file configured to use said backup. As in, we have multiple config files, each created at time of backup, specifically referencing that individual backup. Each can be easily used to start MySQL up using that X day old backup. Then I can shut it down, and start it up again using a completely different backup just by specifying that backup's my.cnf file, all without modifying either backup.

This is also nice because it makes it simple to make sure your backup actually works.
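
To make that concrete (paths and port are illustrative, and it assumes the backup is a raw copy of the data directory), each backup directory gets something like:

  # /srv/backups/2012-02-21/my.cnf (written at backup time):
  #   [mysqld]
  #   datadir = /srv/backups/2012-02-21/data
  #   socket  = /srv/backups/2012-02-21/mysql.sock
  #   port    = 3307

  # spin up a throwaway instance against that specific backup:
  mysqld_safe --defaults-file=/srv/backups/2012-02-21/my.cnf &
  mysql --socket=/srv/backups/2012-02-21/mysql.sock -e 'SHOW DATABASES'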


Small plug but we're always looking for awesome people to join our ops team! Link: http://mixpanel.theresumator.com/apply/Xm0tLy/Software-Engin...


And SoftLayer is by no means the cheapest provider of dedicated hosting in the US (though there are few comparable providers, in terms of price and quality, with multiple regions).



