Yep, I've backed up 350GB of data, but since most of it is duplicated, I pay for storing 6.3GB. Win.
One word of caution though - this isn't a mainstream consumer backup service. If you lose your keys you lose your data. No chance of recovery. So make sure you back those up properly too, ideally in a different geography.
Just make a consistent snapshot of your data (I'm using UFS snapshots), point Tarsnap at it, and you're good to go.
You're using the --snaptime option, right? It's necessary when you're backing up a filesystem snapshot, in order to work around a race condition: if a file is modified, the filesystem snapshot is created, and then the file is modified again, all within a single time quantum, it can trick Tarsnap into thinking that the file hasn't changed since the previous archive (which triggers a "this must be the same blocks as it was last time" optimization in place of the usual "read the file and split it into blocks" behaviour).
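For reference, a minimal sketch of that workflow on FreeBSD (the paths, snapshot name, and md unit are illustrative, and mksnap_ffs argument forms vary slightly between releases):

    touch /root/snap-marker                  # mtime predates the snapshot
    mksnap_ffs /home /home/.snap/hourly      # take the UFS snapshot
    mdconfig -a -t vnode -o readonly -f /home/.snap/hourly -u 4
    mount -r /dev/md4 /mnt/snap
    tarsnap -c -f "home-$(date +%Y%m%d-%H%M)" --snaptime /root/snap-marker /mnt/snap
    umount /mnt/snap && mdconfig -d -u 4 && rm /home/.snap/hourly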
Finally, compression and deduplication is amazing:
Well, if we're going to be posting statistics here...
                         Total size  Compressed size
    All archives             269 TB           121 TB
      (unique data)          177 GB            72 GB
That's 269 TB of data backed up from my laptop, deduplicated and compressed down to 72 GB. This is what I get for taking a backup of my entire home directory every hour...
Ah, that was you -- I remembered sending an email about snaptime recently but couldn't remember who it was to (and HN user names don't always correlate anyway...)
Multiple copies of a file, or parts of a file. These can be spatial (your archive contains an SVN checkout, so there's an extra copy of every file in the .svn/pristine directory) or temporal (you take regular backups, and many of the files haven't changed very much from one backup to the next).
Ah okay, thanks. Another question: Why wouldn't compression take care of that? Isn't the point of compression to compact as many repeating sequences as possible?
Deduplication is a form of compression, yes. Most forms of compression are "local" however -- looking to match data against bits from within the past few MB -- so they won't detect duplicated data spread across entire archives.
In order for compression to find redundancy across files, the data must be 'solid', i.e. compressed as one continuous stream. Which means that to add or remove something from the archive, you must reprocess the entire archive. This isn't a very good model for backups, especially when you have lots of them. (As an aside, .zip files aren't 'solid' -- each entry is compressed separately -- so two copies of a file won't compress well. This is also the reason why most archives on Linux are done with tar: it creates a single stream of data.)
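A quick way to see the 'solid' effect (a sketch; the file is kept under gzip's 32 kB window so the duplicate is visible to it):

    head -c 10240 /dev/urandom > a && cp a b
    zip -q two.zip a b && wc -c < two.zip     # ~20 kB: each zip entry is compressed separately
    tar czf two.tgz a b && wc -c < two.tgz    # ~10 kB: the solid tar stream dedups the second copy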
Tarsnap uses variable-length blocks (split in such a way that inserting into the middle of a file creates minimal differences). If a new block is detected during a backup, only that block needs to be sent, as the rest are already stored on the server. It also means that new archives can refer to previously stored blocks, so each archive is independent, and unused blocks are removed when the last archive using them is deleted. Blocks are compressed and then encrypted before sending, so whole-archive compression wouldn't help anyway, since you might not have all the old data locally.
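You can watch this happen with --print-stats (a sketch, assuming an already-configured client; paths are illustrative):

    tarsnap -c -f docs-1 --print-stats ~/docs
    echo "one new line" >> ~/docs/notes.txt
    tarsnap -c -f docs-2 --print-stats ~/docs   # "New data" is a few kB, not the whole tree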
You can also deal with backups using a master and incremental diffs. This doesn't work well with the tarsnap model, as archives are no longer independent.
> Why Tarsnap pricing is defined in terms of picodollars per byte rather than dollars per gigabyte: Tarsnap's author is a geek. Applying SI prefixes to non-SI units is a geeky thing to do.
I find that so amazingly annoying. To me it says "yeah, I know many people might find it hard to get their head around the units I defined, but I don't really care about that because I find it cool." We have standard units for a reason, because people can immediately get the scale of something in their mind. With this, you can't. I went to their site open to what they were selling, but I'm very turned off by this.
Note that "they" are one person, Colin Percival. That's all Tarsnap is; one person, some backup code he wrote, and data stored on Amazon S3. As the sole owner and only person working on a product aimed at a niche audience, I don't think it's unreasonable that he has a little fun with the way he runs it. If this kind of thing bothers you, this service probably isn't for you.
> If prices were listed in dollars per GB instead of picodollars per byte, it would be harder to avoid the what-is-a-GB confusion (a GB is 10^9 bytes, but some people don't understand SI prefixes). Picodollars are perfectly clear — nobody is going to think that a picodollar is 2^(-40) dollars.
> Specifying prices in picodollars reinforces the point that if you have very small backups, you can pay very small amounts. Unlike some people, I don't believe in rounding up to $0.01 — the Tarsnap accounting code keeps track of everything in attodollars and when it internally converts storage prices from picodollars per month to attodollars per day it rounds the prices down.
And finally, the prices in dollars per GB are also prominently displayed, right after the prices in picodollars per byte. So really, you're just being bothered that he's having a little bit of geeky fun, even though it has absolutely no effect on you.
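For anyone who wants the conversion spelled out, assuming for illustration a price of 250 picodollars per byte-month:

    echo 'scale=2; 250 * 10^9 / 10^12' | bc   # 250 pd/byte * 10^9 bytes/GB = $0.25/GB-month

i.e. just shift the decimal point three places.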
I get how it could seem too self-indulgent, but I think that was mostly meant tongue-in-cheek. The real reasons are right below:
> If prices were listed in dollars per GB instead of picodollars per byte, it would be harder to avoid the what-is-a-GB confusion (a GB is 10^9 bytes, but some people don't understand SI prefixes). Picodollars are perfectly clear — nobody is going to think that a picodollar is 2^(-40) dollars.
> Specifying prices in picodollars reinforces the point that if you have very small backups, you can pay very small amounts. Unlike some people, I don't believe in rounding up to $0.01 — the Tarsnap accounting code keeps track of everything in attodollars and when it internally converts storage prices from picodollars per month to attodollars per day it rounds the prices down.
Plus, as others have pointed out, prices are listed in standard units (dollars per GB) just below the oddball ones.
Then don't use it if you find that a simple conversion is "annoying". The author is catering to an audience that, bar none, wants transparency and privacy in their backups. He's done a phenomenal job of this. It's no loss if he's out a user such as yourself.
But the price is also given in $/GB-month, so you can at least figure it out using units that people will probably find more convenient. I couldn't see the quoted comment on the opening page, which is probably lucky, because it annoys me more than simply giving the prices in picodollars/byte. (Pricing it like that at least reveals that the pricing is actually per byte, rather than being, say, $0.30/GByte with some kind of rounding.)
More generally, I've always found it to be a good idea to be at least somewhat circumspect if you're going to have some kind of an asocial relationship with somebody (e.g., getting them to give you money). It's impossible to foretell what people will get annoyed by, so you might as well give them as few things to get annoyed by as possible. (I suppose this is the "better to keep quiet and be thought a fool..." principle, in a way, though obviously foolishness is not the precise issue here. Oh well. I don't claim to be original.)
Interesting: in my browser, this comment is both greyed out and upvoted to the top of the list. Does this mean that comment placing is determined by upvotes, and comment greying is determined by downvotes, but those two processes are independent of each other?
As an alternative, I use Arq continuously on all my computers and I highly recommend it (Sorry I'm on my iPhone and won't be able to give a link). It lets you use your own AWS credentials for backup and you can encrypt the data before it is sent to AWS.
The issue I have with Tarsnap is that the data is still in the hands of a small operation, as far as I can tell, and honestly I'm afraid we won't get our data back if something happens to the guy. This is fine of course for many services, but data backup is inherently as mission-critical as it gets. The whole reason for it is reliability, assurance and redundancy. It is not a nice-to-have; for many people it is the only place they fully trust to keep their data forever.
I wish Tarsnap had an option that made it possible to use it with one's (or an organization's) own AWS credentials. An on-site mode, if you will. Otherwise it has always seemed to me like a great piece of software.
Haven't thought of that tbh. I only have a single reservation that I mentioned in my above post. Not to put any words in his mouth, but I'm not sure there can be a solution to it given his current architecture of the system. It is inherently a multi-tenant SaaS backup service...
I've just sent an email to Colin about this. Will edit my comment as soon as I have a response.
EDIT: Wow, got a response in less than 5 minutes:
> It's not something I'm looking at doing right now. The way the Tarsnap server side is designed, in order to keep costs low (and performance high), data is aggregated between multiple Tarsnap users and stored in S3 as large chunks; keeping each user's data segregated would add a lot of additional complexity and cost.
I have the same thought. I'm using Backblaze at the moment but am actively looking to move to either Arq or Tarsnap. I like that Arq stores to your own account, and the file format is open so you can work with it yourself. Also, storing to Glacier means it's dirt cheap. It's reassuring to hear someone having a positive experience with it. Backblaze has been a mixed bag for me.
Have you considered Crashplan? I've had positive experiences with it. The only downsides are that the client program uses a ton of RAM and there's no API.
I just started using Arq myself, and it's perfect for what I need. The Glacier backup is the killer feature. Tarsnap is really nice, but it would bankrupt me. I have 300+ gigs of photos and video (I have children and I'm a total tool with my camera, I know). That's $90 a month for Tarsnap vs $3 with Arq using Glacier. For $90/month it would probably be cheaper to rent a machine somewhere and just use rsync.
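For the curious, the arithmetic behind those numbers (assuming roughly $0.25/GB-month for Tarsnap storage and ~$0.01/GB-month for Glacier at the time):

    echo '360 * 0.25' | bc   # ~360 GB at Tarsnap's rate -> $90.00/month
    echo '300 * 0.01' | bc   # 300 GB at Glacier's rate  -> $3.00/month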
http://www.hashbackup.com has dedup, compression, encryption, and lets you use your own storage: AWS or compatibles, rsync, ssh, ftp, imap, local dir, mounted remote dir. Disclosure: I'm the author.
All the data is encrypted before it ever leaves your machine. Not even cperciva should be able to read it.
You can also create a write-only key. If you run tarsnap from a server which gets pwned, the attackers can't touch the existing backups. Don't be the next Astalavista[1].
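Creating one is a single tarsnap-keymgmt invocation (paths illustrative):

    # Derive a key that can create archives but not read or delete them.
    tarsnap-keymgmt --outkeyfile /root/tarsnap-write.key -w /root/tarsnap.key
    # Deploy tarsnap-write.key on the server; keep the full key offline.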
With Crashplan, all data is encrypted before it leaves your machine if you use a private key. The pricing is MUCH better and it's not a single man operation.
If you're paranoid about it being closed source, you can make a quick script to encrypt sensitive data, copy it to another folder, then sync that encrypted folder online. I do something similar with a small % of my data.
As for server backups, it's trivial to script a copy to your local machine and then let Crashplan sync that.
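A minimal sketch of that encrypt-then-sync idea (the recipient and paths are hypothetical):

    mkdir -p ~/staging
    for f in ~/sensitive/*; do
      gpg --batch --yes -r backup@example.com -o ~/staging/"$(basename "$f").gpg" -e "$f"
    done
    # Point Crashplan (or any sync tool) at ~/staging instead of ~/sensitive.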
The thing is that Colin Percival has done genuinely novel computer science, real heavy lifting, to make both strong encryption and smart de-duplication possible in the same service.
So far as I know, nobody else has done that.
In practice tarsnap is cheaper than everything else because of the dedupe.
Well you have me there. I'll fall back on the fact that Colin's code is available and that he's published papers covering all the maths and computer science that leads up to being able to dedupe without sending stuff to the server or decrypting on the server side.
The source code is available; it's available under a "shared source" license rather than free software/open source (you can look at it, but not modify it), but it is available for review. https://www.tarsnap.com/download.html
He also has a bug bounty http://www.tarsnap.com/bugbounty.html, and several substantial security bugs have been found and fixed due to the bug bounty (http://www.tarsnap.com/bounty-winners.html). In fact, the first of those, the AES CTR nonce bug, was found before he had offered the bounty program; the bounty program was inspired by that bug, and has since led to the discovery of several other more minor issues.
So, the source is available, and there's a bounty out for discovering bugs ranging from cosmetic issues to major security issues. Feel free to review it and submit any bugs you find!
"At the present time, pre-built binaries are not available for Tarsnap — it must be compiled from the source code." https://www.tarsnap.com/download.html
That is indeed a problem which has yet to be solved. Or a potential problem, rather... I'm rather hoping it will never actually happen. ;-)
Seriously though, it is on my list of issues which needs to be addressed. Bringing in someone else and getting them up to speed on how to run everything is an expensive prospect, though.
Have you considered some sort of "enterprise" variant where large organizations use their own storage backend? Just 1-4 serious enterprise-sized customers would cover the salary and overhead of one or more good engineers. There's a lot of opportunity in that direction. Any medium-sized or larger company that has to comply with education or medical privacy regulations would find your tech a great backup solution, and if carefully done it doesn't increase your overheads much, if at all.
Setting up tarsnap to use non-AWS infrastructure would be a significant amount of work. Setting up a "private" Tarsnap (but still on AWS) is something I could do for a company needing to store a large amount of data (say, 10+ TB).
Doesn't AWS have some special clouds for companies that have compliance needs?
Either way, it might be worth looking into even just the "private" Tarsnap on AWS business direction as a way of growing revenue in a way that isn't tied strictly to data storage volume.
One way to go about this is to ask some of your larger business users if they would be interested in such a "private for them" self hosted Tarsnap variant. I think many of them would love a way to help you have revenues sufficient to support having an additional engineer (or two) working with you, which isn't possible for them to do with your current usage based revenue model.
Point being, there's probably an "enterprise" business model that stays true to your quality goals but gives you substantially more up-front revenue. For some of your customers, there might be more value in supporting your ability to hire some engineers than there is in the cost-savings element of the current revenue model. This can be an ancillary product that isn't the core one, but which still helps you have more resources to make the core better.
Talk with your larger customers, they're probably happy to chat with you given the chance.
OP here. I found them looking for a good backup solution.
They look amazing. Bug bounties for everything (including cosmetic stuff), completely transparent architecture, data deduplication and compression on the fly, they will stay up even if two of Amazon's data centers fail, you pay per byte (traffic/storage), and for all that they are pretty cheap.
Tip: forget everything you knew about scheduling full and incremental backups, because you don't have to. Tarsnap provides logical snapshots and does all the diff magic for you.
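In practice that means a backup "schedule" is just a dated create in cron, e.g. (a sketch; paths and names are illustrative):

    tarsnap -c -f "home-$(date +%Y-%m-%d_%H%M)" /home/me   # dedups against all earlier archives
    tarsnap --list-archives                                # every archive restores independently
    tarsnap -d -f home-2013-01-01_0000                     # deleting one frees only its unshared blocks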
Interesting. Tarsnap and rsync.net seem to alternate coverage on HN, and for the longest time I kept forgetting they were different, even though I had vague sense of confusion.
I understand data is encrypted before it ever leaves your machine, but I certainly wouldn't want encrypted data at rest being exposed. Which gives me concern about Tarsnap's terms: "I may provide information concerning your account and your use of the service to 3rd parties, at my sole discretion, if ... It is requested by law enforcement authorities ..." Note: no requirement for a court order or subpoena.
Note the last paragraph of that: However, I'm serious about saying "at my sole discretion" — if a law enforcement agency wants information, they'd better have a good reason for asking for it... and I don't consider the NSA saying "we want to have all the information you have, just because we feel like it and someone somewhere might be a terrorist" to be a good reason. Also note that unlike the situation with certain illegal wiretaps, I can't give your data to anyone, because it's all encrypted such that I can't read it.
This situation has never arisen, but if I'm confronted by a police officer and enough evidence that I'm sure they could get a court order, I'd rather be cooperative than force them to go through the courts. This doesn't mean that I'd give them any more data than they would get from a court order -- in fact, quite the opposite, since police tend to err on the side of requesting more than they need when going through the courts, and cooperating could change "seize a server" into "get a copy of the required data".
They would all be channeled through your local law enforcement, though. If Swedish police come to you directly, you don't have to comply, but if they go through the proper channels, the request to you comes from Canadian police.
I'm thinking of using Tarsnap. Can I absolutely, positively, definitely trust that everything on Tarsnap's end is encrypted to best practice standards and that there is no reasonable way to get to my data (outside of the usual contract provided by encryption I mean)?
I don't have the option of knowing for sure by analyzing the source code myself, so I'll have to trust the popular opinion of Very Smart People here on HN (well, I suppose I could if I spent a non-trivial chunk of the coming year reading up on crypto stuff).
The encryption happens on your end, not Tarsnap's.
The bar you're setting, though, is impossibly high. Can you absolutely, positively, definitely trust that your machine is not rooted and some nefarious entity isn't quietly collecting your every keystroke and snickering in the dark while stroking a white cat?
At that level of paranoia, you're probably best off using a device personally soldered together with hand-selected transistors that XORs all your backups with the white noise collected from your tv (while disconnected from cable, of course).
Client-side encryption and deduplication, with source code. 8GB free, $10/mo for personal unlimited use, 10c/GB/month for business/enterprise. My main reservations are they seem to be based in one datacenter, and don't seem to have support for multiple keyfiles with separate read/write/delete/machine restrictions. Also not in FreeBSD ports :P
One of the things I love about Tarsnap is the bug bounties, which range from $2000 for being able to decrypt user data right down to $1 for cosmetic issues.
Quick question here: is there a delay in Recent Activity?
I just signed up and used it on two servers like 30 minutes ago, but I don't see anything in the account activity except the payment info.
I'm quite sure my servers sent stuff, because I monitored bandwidth usage.
I use it to save backups of my desktop and laptop home directories to a home NAS mounted with NFS, which later gets synced to another NAS.
It's a decent tool. I encrypt with a separate gpg key, do mostly incremental backups, and a full one every few months. Incremental backups take under a minute on my desktop (100K files, 11G). Full ones are kind of slow (which is why I set it to only do it every few months).
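For comparison, with a tool like duplicity that kind of gpg-keyed incremental/full cycle looks roughly like this (the key ID and paths are hypothetical):

    duplicity --encrypt-key ABCD1234 /home/me file:///mnt/nas/backup        # incremental by default once a full exists
    duplicity full --encrypt-key ABCD1234 /home/me file:///mnt/nas/backup   # the occasional full backup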
Have used Duplicity for a couple of years now; it works very well, although it's tricky to get it to store data anywhere other than the US-East region. Currently backing up about 8 servers with it, a mix of Ubuntu & Amazon Linux.
Colin - I dig what you're doing, but every time I go to the Tarsnap website, I'm turned off from using it for all of the reasons that have been discussed here ad nauseam since 2009. I'd love to see you succeed more; I think you deserve it, and I wish you'd just grab it.
I'm still not sure whether I can trust somebody else with my data, but I'm growing more and more concerned about hardware failure of my own backups. Might try Tarsnap one of these days.
It allegedly encrypts all your data on the local machine before sending it to the server.
Now, of course, if you are truly paranoid, you'll want to review their code first. I don't get why I can't simply mount a volume with encryption and write there. Using code that is already on my machine (in the kernel, no less) would make it a much simpler decision.
> I don't get why I can't simply mount a volume with encryption and write there.
You can (and you could even use tarsnap to back up the encrypted filesystem image if you want), but writing your data to an encrypted filesystem tends to expand the amount of data changing -- in the extreme case, if you create a copy of a file you'll write that many blocks of new encrypted data which needs to be backed up, whereas tarsnap would just say "hey, I recognize all these blocks, it's those ones I backed up earlier" -- so Tarsnap's encrypted backups of a filesystem tend to be many times more efficient than backups of an encrypted filesystem.
Did you just assume the source code wasn't available? It's not linked from the front page of the site, but if you go to the 'Download' page, it's right there for your review:
I would love to have something like the Backblaze client but working with Tarsnap as a backend: you install it and you forget about it. The sensible default configuration is good enough for the average Joe, but you can tweak it if you want.
Well, it is not using Tarsnap as a backend and it seems that you have to add folders on the first launch, that is definitely not what I am looking for :)
Time Machine and Backblaze know how Mac OS is architected, and they back up everything except the non-useful files (mainly logs).
That's not exactly rocket science these days (see bup). What you're paying for with tarsnap is making it totally rock solid and as usual the last 90% of the work is also 90% of the cost.
The point that suxnoll is making, though, is that the cost is nowhere near that high to begin with, because the data is dedup'ed and compressed. You're justifying an issue that doesn't quite exist.
The query that suxnoll responded to supposed that you have 400 GB of data with small deltas. But that's only possible if you're filling your hard drive with files created from random noise from /dev/random and updating all your files monthly with more random noise.
> That's not exactly rocket science these days (see bup)
Actually, combining crypto and dedupe in such a way that the server can never tell what's on the client computer but the client can still reliably pick what's changed and dedupe it?
That's honest-to-goodness computer science. And Colin is the guy who invented this stuff.
I did not miss that. If a lot of the data in question is in the form of sparsebundles, mp3s, and/or video — for me, it would be — then good luck meaningfully de-duplicating and compressing that.
I'm a little cautious with BackBlaze now (looking at switching to Arq [0]). I have about 700GB with them, but a while ago my backup metadata became corrupted in the storage on their side. I worked through it with them to try to diagnose the issue. I even went to the extreme length of buying a whole new Mac mini in the hope it would fix it. No such luck, so I had to reupload the full 700GB to them again. It's not an isolated case either - it happened to a friend of mine too.
More recently we'd been trying to get to the bottom of some unusably slow Macs (Mountain Lion). Turns out that the BackBlaze filelist service (which watches for changes to files) is very poorly behaved. Initially we discovered that it fights with Apple's mds. Even once we'd fixed that (by stopping mds from watching a load of folders), it still ignores the scheduled backup times, so it runs all day. BackBlaze support acknowledged the issue, but the only workaround we've found is to have a cron job unload the BackBlaze daemon during the day to stop it destroying performance.
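For anyone fighting the same thing, the workaround amounts to a couple of lines in root's crontab (the plist path is a guess from memory; check /Library/LaunchDaemons for the exact name):

    0 9  * * * launchctl unload /Library/LaunchDaemons/com.backblaze.bzserv.plist
    0 18 * * * launchctl load /Library/LaunchDaemons/com.backblaze.bzserv.plist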
> Turns out that the BackBlaze filelist service (which watches for changes to files) is very poorly behaved. Initially we discovered that it fights with Apple's mds
That's really disturbing, does it misbehave and interact with /dev/fsevents directly instead of using the public fsevents api? If so somebody needs to get flogged over that choice.
That's not really a question I have the knowledge to answer (though with some more guidance I'd be happy to look into it). I'm aware of fsevents, and I know for example that dropbox have their own dbfsevents so they can filter out what they need closer to the kernel. My understanding is that it gives you an api to be notified of changes to the fs. Maybe the BackBlaze case is complicated because they backup the whole filesystem by default?
The behaviour we saw was a constant scanning of all the metadata of all the files on the filesystem - not just things that were changing. It seemed that instead of being notified about changes it was relying on comparing to its cache by polling. There were huge folders with years of photos in that hadn't been touched in months being scanned again and again all day.
To be honest, my business partner contacted BB about it and they didn't seem as concerned as we were. It was at that point that for the first time in a couple of years I found myself with the desire to investigate backup services again. It's a shame because BB has been good when I've needed them. A drive died recently with a lot of important data on it (years of photos and music) and they had a replacement to me in a matter of days.
I'm still running their backups at the moment but I also started Arq running last night. It seems to fit my use-case perfectly.
Edit: I just ran it again to check. The bzfilelist process does an lstat on every file in the system, one by one.
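(Anyone who wants to reproduce this can watch the scanner's filesystem calls directly on OS X:

    sudo fs_usage -w -f filesys bzfilelist

and see the lstat storm in real time.)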
> Edit: I just ran it again to check. The bzfilelist process does an lstat on every file in the system, one by one.
Wow, so it doesn't even listen for fsevents? What a terrible design, I understand running a full scan every now and then to ensure you haven't missed anything while the filesystem has been offline (in case it's mounted on another machine), but holy crap.
Sync is not backup. For backups you also need the ability to look back at the previous versions of your files. For example in your sync solution if a file gets corrupted, all copies of the file will also have the corrupted bytes sync'd. With a backup solution you'll be able to rollback to a previous non corrupted version.
I've been poking around the site and I couldn't see an answer to this question: why do you need to encrypt the communication if the data themselves are encrypted? Maybe I am misunderstanding, but it seems like each block is encrypted and the pipe between client and server is encrypted. Is it because there is additional interesting metadata (if so, what)? Or have I misunderstood?
There is metadata; whether you consider it interesting is up to you. The tarsnap client has to say "I'm machine X, and I want to store a block of data with tag Y"; and when you extract an archive, "I'm machine X, and I want to retrieve the block of data with tag Y". This could allow someone to figure out (a) that it's the same machine, and (b) that you're extracting an archive which contains data which was stored at a particular point in time.
Paranoia means encrypting everything which might be sensitive, even if you can't see any way for it to be abused.
I concur. I've been very happy with the price and features, but the ever-increasing amount of RAM (the larger your backup, the larger its RAM use) is worrying.
The truly paranoid are (or should be) using Bitcoin or Litecoin. I don't get why pricing is in USD only; it just seems that cryptocurrencies are perfect for this kind of service.
As others have said, this depends on your definition of paranoia. If you're paranoid right here and right now, you can always go to your local 7-11/Walgreens/etc and get a prepaid credit card to pay for this.
If your opponent uses geographic profiling, you want to pick a point far away from you and then go to stores far away from that point -- then they'll identify that far-away point as being your origin. If your opponent is a game theoretician, on the other hand, you want to pick stores to visit completely at random, in order to avoid providing any information.
Perfectly serious snarkless question - how would that work, financially? You're reselling S3 storage with your value-added service on top - obviously you set your prices in a way that pays for your costs, time and generates some sort of profit. Short of adjusting your prices on a daily (or hourly!) basis, how would you be able to accept payment in a currency of such extreme volatility?
This is why I don't accept BTC yet -- unless the exchange rate settles down I'm only going to be able to do it via a service which lets me say "I want X USD; make it happen".
Coinbase comes close -- the only thing they don't do is allow me to specify at run-time how many USD I want; for some odd reason they need you to "create a button" with an API call before displaying that button on your site, which is a pain to deal with when you have variable payment amounts.
Why not add a simple public page to the Tarsnap website that allows the visitor to gift x dollars to the foo@example.com account? That way, bitcoiners can use bitspend.net etc. as a proxy payment provider for the time being.
Not leaking metadata on credit card statements that would allow someone to infer that valuable data is stored at Tarsnap (and approximately how much) would be a practical increase in security: such a leak might prompt an attacker to allocate more resources towards compromising the Tarsnap user's client computer.
I'm guessing it suggests that those who are truly paranoid about others snooping into their tarsnap-saved data may want to make their tarsnap payments anonymously as well.
Well, the BC PST is stupid. I find it amazing that during the HST referendum campaign a large number of people said they thought the HST was good, but they were going to vote against it anyway to punish the BC Liberal government... and now we don't have the HST, and the government they wanted to punish has gotten re-elected.
The documentation is thorough, and Colin (the owner/operator/author) responds quickly to emails.