I use rclone's B2 driver (https://rclone.org/b2/) as an rsync-style backup solution for about 1TB of my pictures and other media. It's also encrypted with my own local key using rclone's crypt module (https://rclone.org/crypt/).
rclone supports multithreaded upload, and even has experimental support for FUSE mounting. However, the sync command gets you Dropbox-like behavior and can be cronned: https://rclone.org/commands/rclone_sync/
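For reference, the sync itself is a one-liner that cron runs nightly - something like this (remote and path names here are placeholders, not my real config):

    # "b2crypt" is a crypt remote layered on a B2 remote in rclone.conf (placeholder name)
    # sync is one-way: make the encrypted bucket path mirror the local folder
    rclone sync /data/pictures b2crypt:pictures --transfers 8 --log-file /var/log/rclone.log

    # crontab entry to run the same command every night at 03:00
    0 3 * * * /usr/bin/rclone sync /data/pictures b2crypt:pictures --transfers 8 --log-file /var/log/rclone.log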
I really like the price of B2, I hope it stays low :-)
I'm also backing up my photo archive on B2 via rclone and it works great, it costs like $1 for 200 GB which is awesome. But I'm not encrypting the files.
Encryption raises the barrier for both third parties and family. In case something happens to me, I want the technical barrier to be low enough for my family to discover the backup. Another reason is that in my experience encrypted data is more sensitive to bit rot and bugs than unencrypted data. I'm backing up important work stuff with Arq Backup, for example, and I've had my archive corrupted once. Not sure if it was the software's fault or the storage.
My rule of thumb is ... if the data should be discovered by my family in case I'm not around, then I won't encrypt it. Photos are not worth encrypting anyway, since a lot of them end up being shared on Facebook, Flickr, Instagram, etc, as photos are meant to be shared, at least with your family.
That said I still expect Backblaze or Dropbox to keep my data private. Not secret, but private and there is a difference.
I've thought about that a fair amount recently myself, and came to a different conclusion, but I like yours as a counterpoint. What I've been considering is going analog: writing down how to get to the few places I've kept my secrets, employing something like a YubiKey to access an admin account on my computer, and leaving paper instructions in an envelope in the family safety deposit box. It's not as complex as a two-person scheme I read about once, but it seems like a nice compromise where I can still feel like I'm protected, but the nuclear keys exist if someone ever needed them.
Where do you draw the line around data that needs to be discovered? I'm thinking about instructions to access things like bank accounts or such, which they may or may not already have access to, where I'd want them encrypted but accessible. Not that I've got secret Cayman accounts or anything, but financials are usually things I want heavily encrypted, but do want family access to in case of the worst.
I normally think of financials as things that don't need to be heavily encrypted, because we have laws limiting liability in case financial information is stolen and misused. What makes you feel it needs encryption?
Not in all cases, and even if there are legal remedies, it takes months to clear it up, and while you are clearing up the effects of the fraud, the person who has stolen your identity is committing new fraud until they get caught.
I know people that were arrested for fraud because their identity was stolen and someone else was committing financial crimes in their name. They always have to carry official police reports with them just in case they get arrested again.
Honestly, I don't know, I've just always held banking info as something that must be treated as such. That said, it's not like I don't use credit cards on the Internet, but something about someone getting my bank account numbers does worry me. Plenty of people who know me well know my mother's maiden name, for example, and those two pieces of info together could spell trouble. I'm also in the camp of unless it needs to be shared, may as well treat it as private, and that includes encryption.
That's an interesting viewpoint to strike a balance between privacy and secrecy.
Can you shed some light on how you share the photos with non-technical family and friends, given that B2 has no app as such?
I have some experience with AWS/Azure and both of them do not support folders, and the workaround is to have slashes in the filename to create a virtual directory. Is it the same with B2?
I'm only using B2 for backup and it's doing a fine job, since from what I understand it also does unlimited history of the files, so in case files get corrupted, theoretically I still have previous versions around.
I keep my photos on Dropbox too, which is how I share them with family, besides sending files over WhatsApp, which is popular these days. But they only provide the history of changes for 1 month, or 3 months for Pro. As has been said before, solutions like Dropbox are not reliable for doing backups without specialized software like Rclone or Arq Backup, that can keep a version history.
My archive is currently less than 150 GB, so B2 is really cheap. I also have an offline backup on a portable hard drive. The idea with backups is that if you have data you care about, then it's a good idea to have at least 2 backups in different locations, made via different software.
> I have some experience with AWS/Azure and both of them do not support folders, and the workaround is to have slashes in the filename to create a virtual directory. Is it the same with B2?
B2 has folders; you can navigate them in the online interface. That said, the service doesn't have polished apps available, being a platform like S3: there are no official desktop or mobile apps currently. Although if it survives, given its price, I'm sure apps will happen at some point.
The online interface simply assumes that a slash in the filename should be represented as a folder, and they encourage apps to do the same. I believe they also enforce a max distance between slashes that is smaller than the max filename length.
What this means is that there is no way to, for instance, query what the root directories are, short of listing all files.
If you have a directory, you can list its contents using a prefix search (although the prefix need not be a directory, and this will not just list the top-level elements).
>there is no way to, for instance, query what the root directories are, short of listing all files
This is not true! Try this from the b2.py command line:
b2.py ls <bucketName>
That would list all the top level folders. The APIs are designed to support two things: 1) listing all files, or 2) navigating and listing the contents of each folder.
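If you want to do the same thing directly against the API, it's roughly a b2_list_file_names call with a delimiter - something like this (bucket ID and tokens are placeholders):

    # assumes you already called b2_authorize_account and have $API_URL and $AUTH_TOKEN
    # delimiter "/" collapses everything below the first slash into "folder" entries
    curl -s -H "Authorization: $AUTH_TOKEN" \
        -d '{"bucketId": "BUCKET_ID", "delimiter": "/"}' \
        "$API_URL/b2api/v1/b2_list_file_names"
    # add a "prefix" such as "photos/" to the JSON body to descend into one folder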
Yes, but you're paying for that storage. If you sync 100GB of photos then locally make a small EXIF data change to all of them and sync again, you're now paying for 200GB of storage. B2 has Lifecycle Rules [0] to help keep versions from getting out of control and the API has methods for handling versions for clients like rclone [1] to use.
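If it helps, the rule itself is just a small JSON blob attached to the bucket - roughly something like this via the API (field names from memory, double-check against the docs in [0] before relying on it):

    # keep hidden (old) versions for 30 days, then delete them; applies to the whole bucket
    # $API_URL and $AUTH_TOKEN come from b2_authorize_account
    curl -s -H "Authorization: $AUTH_TOKEN" \
        -d '{"accountId": "ACCOUNT_ID", "bucketId": "BUCKET_ID",
             "lifecycleRules": [{"fileNamePrefix": "",
                                 "daysFromHidingToDeleting": 30,
                                 "daysFromUploadingToHiding": null}]}' \
        "$API_URL/b2api/v1/b2_update_bucket"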
B2 doesn't have its own desktop app, but 3rd party desktop apps like Cyberduck use the API to work with B2.
I would like to point out that, given [1], in the US online services can't really deliver that balance. Anything that is hosted is accessible to law enforcement, and potentially to anyone that has a legal disagreement with you.
+1 for restic. I switched to restic(+B2) from duply+duplicity(+S3) (another backup tool supporting dumb remote storage, encryption, incremental backups, and snapshots) and life is so much better. Duplicity needs the duply front-end for too many basics, it regularly needs full backups to be made and stored, it's not built at all for random access (it takes eons in order to list or fetch a specific file from a specific incremental backup), and it needs a barely documented command line switch in order to not bug out with S3 if you have too many files (why is that option not default? Its default S3 configuration has a max limit on file size. Duplicity splits the archives containing your files, but it has one un-split archive listing filenames or some kind of metadata that it doesn't support splitting, so if that file gets too big, then your backups just start failing). Restic is nice.
rclone is an excellent piece of code and has solved a large part of the problem for my own project. I maintain a slightly modified fork of rclone [1] and integrated that into daptin [2] as a server-side piece, so I can seamlessly work with any cloud storage for assets/uploads through daptin.
I wrote up a walk-through some time back. The changes basically amount to replacing all the "fatal logs" with "error logs". I keep merging upstream back in regularly.
I'm currently backing up about 1.7TB of pictures to B2 from my Qnap NAS. Qnap has a backup app called Hybrid Backup Sync.
The problem is, while doing the one-way upload sync, the Qnap app downloaded a lot of data as well. I was confused about why I was seeing a lot of 'b2_download_file_by_name' API calls on the Backblaze reports page (a 600 GB upload resulted in 700 GB worth of download calls).
Contacted Qnap support and they said a little bit of download is normal but this looks abnormal. Logs are all fine on the Qnap so they suggested I contact Backblaze.
It's a "cheap, easy, reliable: pick any two" situation.
Cheap and easy: buy a 2 TB drive and keep it at home. If some disaster affects your home -- flood, fire, burglary -- it can take out your data and its backup.
Cheap and reliable: buy a 2 TB hard drive and keep it somewhere else. Keeping the backup up-to-date means regularly bringing the drive home, updating it, and putting it back.
Easy and reliable: pay for a service like Backblaze that automatically backs up all your files to a remote server.
There are other benefits to services like B2 especially: being able to access your backed-up files from any device or location, or being able to link people to your files on a high-speed server.
You put the 2 TB drive somewhere else (at a relative's) and keep it updated regularly via network.
That's my set up (but with a bigger drive).
At home, I have the master copy of the data on my file server.
Then I have backup #1 that is in the same location and backup #2 that is in a different location.
Both #1 and #2 get updated at night with a "timemachine-like" backup system based on rsnapshot. The network traffic goes over ssh.
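For anyone wanting to replicate this, the rsnapshot side is only a few lines of config plus cron - roughly something like this (hostnames and paths are examples, not my real setup):

    # /etc/rsnapshot.conf on the backup box (fields must be tab-separated)
    snapshot_root   /srv/backups/
    cmd_ssh         /usr/bin/ssh
    retain  daily   7
    retain  weekly  4
    # pull the master copy from the file server over ssh
    backup  backupuser@fileserver:/data/   fileserver/

    # crontab entries
    30 2 * * *    /usr/bin/rsnapshot daily
    0  4 * * 0    /usr/bin/rsnapshot weekly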
Remote backup system #2 cost a UPS, a Raspberry Pi and an 8 TB drive, which is about $250-$300 total.
The initial sync is best done locally of course, but deltas can generally easily go over network at night.
Cheap, reliable, and (relatively) easy (if you're a geek, that is).
I remember there being software back in the day that did exactly this.
The name escapes me right now, but basically you had to add "friends" in the software, then dedicate a certain amount of HDD space to it. It would then back up your files to your friends' computers, and theirs to yours. Backups were encrypted so your friends wouldn't be able to see your files.
It was a super neat idea, I wish I could remember what it was called so I could see if they're still around...
There's an open source one called Tahoe: https://tahoe-lafs.org/trac/tahoe-lafs
There used to be a company with a more usable similar product called allmydata, but it seems to be defunct.
Nothing bad, if you use "bup" or "borg"; the latter has better delete support, so it is a better choice for rolling backups if you sometimes delete data. "bup" has the advantage that its repo format is git, which makes it easier to hack.
Both use rsync-style deltas to only send changes, but they use a content-addressable scheme like git, so renames are just a small metadata change.
Also, both offer ftp and fuse interfaces if you need to access an older backup.
Bad things! rsync isn't smart enough (AFAIK) to know that files have been renamed or moved: it just sees files disappearing on one side and appearing on the other side, so the daily delta can get big.
DRBD may be a good solution to this problem, although I haven't spent the time to see what it would take to replicate over ssh, or what kind of traffic is incurred versus the changes at the origin.
Online/offsite backup is a different use case. They are paying $60/year so that, if their house burns down, gets flooded, disk gets fried by lightning, they still have their family pictures.
Local backup is cheap and fast, and you should do it too. But it doesn't provide geographic redundancy.
Another $60 buys a water- and fireproof safe for the hard drive. I assume it also helps against lightning :-) I honestly think remote backup is overkill for personal needs, and the new risks you take on by placing your valuable data on someone else's hard drive are not always internalized.
And then you take your non-redundantly stored hard drive out in 5 years and you can't retrieve the data on it.
But realistically, isn’t it worth $60 a year to have a constantly backed up hard drive? The alternative is to take the hard drive out of the safe every so often and do a backup and put it back.
I've recently recovered data from a pair of 15GB and 20GB drives I last used in 2001 (and which were stored in an ordinary closet of a house that experienced inside temperatures ranging from 5C to 30C over that time, and great humidity/dryness fluctuations over those 17 years). There were 16 bad 512-byte sectors on one of the drives, but otherwise everything worked.
Modern higher-density drives are probably less resilient, and who knows how flash drives will fare after 17 years in the closet - but my experience so far is that HDDs trump backup tapes on every measure including cost, except at extreme sizes (at this point in time, into the petabytes).
> I've had a few hard drives fail in the past decades but I've always been able to retrieve the data
Care to share what kind of procedures you use?
I've recovered some data for friends and employers who want it back but aren't prepared to pay > USD1000 for it but if I cannot connect to the disk I'm lost.
(My tricks: tilting the disk, freezing the disk, leaving ssds powered on but sata unconnected, and even before that photorec and ddrescue etc.)
Note: don't do any of the above if data needs to be recovered at any cost, in that case just contact a data recovery company.
By putting your drive in a safe, you now have to unlock it, take it out, plug it in, sync, unplug it, and lock it away. Which is a hassle if you want to stick to regular (hourly? daily?) backups.
It is proven that if you introduce friction into a process, over time that process will be followed less.
True, the standard Backblaze offering will erase your backup if you’re not online for six months. If you have an external drive that’s offline for 30 days, they will erase your backup.
A B2 based backup solution costs more but you don’t have those limitations.
According to Backblaze, the retention period is thirty days for the personal plan [1]. Where do you see six months? If a file is deleted on the source device, and the deletion is synchronized to the BB repository, the recovery window is 30 days.
If you have an externally attached drive on your computer and it isn’t connected in 30 days and your computer is online, they erase that backup of the external drive.
If you reconnect your computer after a month and you don’t have the external drive connected, they erase your backup.
I perform hourly backups of my VPS and personal computers, storing it all into a giant repo on OVH and B2. If my house goes up in flames, I have to redo, at worst, 1 hour of work.
Additionally I won't have to deal with expanding to a 4TB drive eventually.
I use Backblaze's standard service. I look at it as cheap belt-and-suspenders insurance. I do onsite backups using Time Machine, along with the occasional sync to another drive. Backblaze is offsite and it's a completely independent backup mechanism. $50/year or so is essentially not worth worrying about in this context.
WRT your other comment: Yes, there's some small level of incremental security risk but there's so little that's genuinely sensitive in my storage, I'm willing to take that risk. And, yes, it's probably overkill but for the cost, there are a lot of things I spend money on that are probably unnecessary :-)
> I do onsite backups using Time Machine, and also Backblaze.
You are doing everything correctly. You are following the 3-2-1 backup philosophy, which is: "3 copies of the data, 2 copies locally, 1 copy remote". Here is a blog post we wrote about it: https://www.backblaze.com/blog/the-3-2-1-backup-strategy/
Hard drives fail, and you need physical access to them to get at the data. Online backups let you access your data from anywhere and are generally much more reliable than a single, local disk.
Am I reading this right? Google/B2/... might send your data to another URL you didn't expect.
Not sure why that matters, or why it's an attack. Since they have your data anyway, as that's the whole point of the service, to store your data on their hard drives. Why go through the trouble of sending it elsewhere? To play games with your data for giggles?
No, the API can tell your software to send some private LAN files, e.g. from some IP-filtered secret NFS store, to a URL of its choosing (so to itself, or to your competitor).
This is bad unless you heavily jail and firewall the software to prevent it from ever accessing anything it shouldn't (need to).
I quickly skimmed, but this entire attack is assuming that the attacker has successfully MITMed the API. At that point everything is already nuked, so of course you can fabricate any number of attacks. Did I miss something important?
Cool setup. I'm not a Backblaze customer, so I'm curious. How is your setup better than using their client for personal backup? The webpage mentions threading and encryption.
Backblaze's backup product doesn't support backing up a NAS. It is for backing up a single Windows or Mac computer, and priced as such. They state this policy is to avoid abuse. Fair enough.
Backblaze's object storage product, B2, is priced per GB-month, so you pay for what you use. Fair enough. Because it is charged this way, it is open for whatever creative use developers can come up with.
I use B2 because I'm locked out of using Backblaze Online Backup - and that's fine with me, because it's the right product for the job.
It seems like nobody has mentioned it yet: another great product is https://www.rsync.net/ and it just works. There are no bad surprises. You can overshoot your backup limits, and they will send you an email asking you to fix it, but you still have your backup.
Your interface is rsync/scp/ssh.
They give you ZFS snapshots, and you can use s3cmd from their machines, so you can delegate uploads to S3 via rsync.net.
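In practice that means the backup job can be as simple as a cron'd rsync over ssh, something like this (the hostname is a placeholder for whatever server they assign you):

    # push a local folder to rsync.net over ssh; incremental after the first run
    rsync -az --delete -e ssh /home/me/documents/ user@usw-s001.rsync.net:backups/documents/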
Our prior backup setup was duplicity with GPG hitting S3, and this sometimes was flaky for listing the current keys.
Glad I read HN; that's how I heard about rsync.net. They even have/had an HN discount. You should use the search functionality to find other threads.
For those that don't know, borg is a backup utility[1] that has been called the "holy grail of backups"[2].
It takes your plaintext files and directories, chops them into gpg-encrypted chunks with encrypted, random filenames, and will upload (and maintain) them, with an efficient, changes-only update, to any SFTP/SSH capable server.
My understanding is that the reason people are using borg instead of duplicity is that duplicity forces you to re-upload your entire backup set every month or two or three, depending on how often you update ... and borg just lets you keep updating the remote copy forever.
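If it helps anyone evaluating it, day-to-day borg usage against an SSH host looks roughly like this (host and repo names are placeholders):

    # one-time: create an encrypted repo on the remote host
    borg init --encryption=repokey user@backuphost:archives

    # each run: a deduplicated, changes-only archive
    borg create --stats --compression lz4 user@backuphost:archives::'{hostname}-{now}' ~/Documents ~/Pictures

    # thin out old archives instead of ever re-uploading a full backup
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 user@backuphost:archives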
The Borg/Attic/HN discounted price is a quarter of the regular price IIRC. Well worth it IMHO. They're reliable, answer emails very fast, and are happy to provide technical help should you need it to configure your system.
Our current headline price is 4c per GB per month and the borg accounts are 2c - so it is half-priced. (We're in the middle of a price drop this month - that page still had the old 3c rate on it ...)
The ZFS-created snapshots of your filesystem are disabled - it is assumed that you will handle your retention/point-in-times with the borg tool itself (we don't like doing snapshots of snapshots ...) Also, while you get full technical support for the use of rsync.net in general we offer no technical support for your use of borg.
The assumption is that borg users know what they are doing - and that assumption has proved to be correct.
I use their Borg discount as well, and am extremely happy with it. I do wish it were cheaper, but I get 150 GB for $50/yr, which is enough for me with careful rationing.
I wish I had a TB for $50 so I didn't have to be so judicious with my photos, but the ability to use Borg is so fantastic that I can't complain.
You can get a 500GB storage server for about that price, and a bit more for a TB. I run an Ubuntu instance and use borg to back up my home systems to this server.
my affiliate link: https://billing.time4vps.eu/?affid=1881
It's almost like they try to be a self-sustaining, viable business instead of burning through VC money with irresponsible pricing that kills fair competitors.
The snark is really not necessary or contributing to the conversation.
AWS Glacier, hardly a VC-backed startup, charges a literal tenth of the cost. Given that most people are going to be holding on to their backups rather than retrieving them regularly, the pricing math works out better even though it's a bit more complicated.
Say you push 2TB up to Rsync, AWS Glacier, and Backblaze B2, and you need that data back a year later.
Rsync will cost you $80x12: $960, bottom line.
Glacier will cost you $8.00x12: $96 for the storage, plus $.01 for a thousand retrieve requests, plus $0.01 per gigabyte retrieval, plus $0.09 per gigabyte transfer.
$96 + $.01 + $20 + $180 = $296.01
Backblaze B2: $10x12 = $120 for the storage, plus 0.01 per gigabyte retrieved:
$120 + $20 = $140
I'm guessing the "startup" dig was directed at Backblaze, but they're actually charging more for the plain storage than AWS, where you're paying more for the bandwidth!
> I'm guessing the "startup" dig was directed at Backblaze
And ironically, Backblaze is 99% self-funded: it doesn't have VC funding and has no deep pockets. We're profitable, the only way to stay in business without VC funding.
(Note: we did have a tiny "friends and family" round in 2009 which was 9 years ago. Plus we sold a small percentage of the company to a silent investor who didn't even get a board seat, no votes, no control. 100% of the board of directors are founders of Backblaze.)
"AWS Glacier ... charges a literal tenth of the cost."
Amazon Glacier and Google Nearline are not comparable products. What we offer at rsync.net is a live, online, random access filesystem - so the appropriate comparison is with Amazon S3.
I believe our current pricing is reasonably comparable to S3 - and at larger quantities is actually cheaper. Also, the borg pricing (2 cents) is cheaper at any quantity.
Fine... but your marketing is literally all about backups. The front page of rsync.net is "cloud storage for offsite backups".
If you hadn't told me this, or if I didn't call a human at the phone number (why? this is an immediate turnoff) listed on your cloud storage page, or go read the "open platform" page (which sounds less like a tech page and more like a marketing page), I'd never know about it.
Speaking of ZFS, I use B2 with zfsbackup-go. Sanoid is making snapshots and zfsbackup-go is uploading them.
[zfsbackup-go](https://github.com/someone1/zfsbackup-go)
Hetzner Storage Box[1] is an interesting alternative to Backblaze B2. It's not cloud-based, but provides free automated snapshots, free 1 Gbps bandwidth, and supports FTP, FTPS, SFTP, SCP, rsync and BorgBackup[2].
Be very careful with those - they do not use ECC memory and thus suffer from the many potential security attacks and your software needs to handle the odd memory error.
> Be very careful with those - they do not use ECC memory and thus suffer from the many potential security attacks
ECC hyperbole much?
When you've decided to put your personal data somewhere in a cloud on the other side of the internet, this kind of stuff should probably be absolutely on the bottom of the list of things you need to worry about.
tl;dr - it shifted my view of the hosting market by exposing me to cheap dedicated servers.
I had previously thought that dedicated servers were doomed to be too expensive/heavy weight for me. I also felt like most VPS providers charged too much (especially true in the case of AWS -- $10/mo for a t2.micro is ridiculous).
I first found INIZ (http://iniz.com/) and was super happy with them, then someone introduced me to the Hetzner Robot Marketplace and I was blown away by the affordable prices (+/- setup fee) and have had one ever since. Hetzner also has a cloud offering that is pretty great - slight limits on operating system choice and some other features, but you can have very competitively priced machines in a more cloud-friendly fire-up-and-go format.
Now I have a ~6 Core (12 vCore/hyper-thread) 24GB RAM monster that I can run experiments with for a decent monthly price.
If you go to other providers like Packet, OVH or Amazon, you're going to see way higher prices - I don't have too many requirements, so Hetzner worked for me.
Yes, very usual for EU companies. Make sure to blank anything sensitive. If this seems weird, consider how many places require your Social Security number in the US.
Hetzner is a bare-metal company that has been operating since 1997 and hasn't released a cloud-based storage product yet.
"Your files on Storage Boxes are safeguarded with a RAID configuration which can withstand several drive failures. Therefore, there is a relatively small chance of data being lost. Please note, however, that you are responsible for your data and there is no guarantee from Hetzner against potential loss of data. The data is not mirrored onto other servers."
Thank you - so it's just less redundant than e.g. Backblaze (I assume). That's an important distinction. See, I'm not a fan of using buzzwords instead of describing anything in detail. "Cloud", "AI", "Big Data", "NoSQL" etc. are (sometimes) fine to get non-technical people interested, but useless for saying anything meaningful about a system IMO.
"Backblaze will make commercially reasonable efforts to ensure that B2 Cloud Storage is available and able to successfully process requests during at minimum 99.9% of each calendar month.
"
Can I ask, what makes that setup "not cloud" vs Amazon S3? Amazon doesn't make public their hardware setup, merely that they offer various "9s" of reliability against data loss.
What if, for argument's sake, Amazon's secret setup is exactly the same as Hetzner's hardware-wise, with Amazon merely putting a number against the reliability that setup offers?
It's colocated hardware, with a thin service layer on top so they set it up for you as a service. The service and hardware are quite reliable, but you can (and will) still lose all your data in case the hardware fails. You have to create your own processes and layering to get to an adequate number of 9s for whatever kind of reliability you're looking for.
Right, I meant to say it's "like colocated hardware". Hetzner own the hardware, but the service guarantees are similar to where you own the hardware. If the hardware fails, tough luck.
Indeed. It's just never been publicized and personally I think the implementation could be better. Although with so many stars on this repo now I plan on maintaining it for the foreseeable future. Start adding feature requests everyone! :)
PS: They also don't even use their own library in their code examples so I don't think they meant it to be used in that fashion.
That would be great – it's always good to have some competition ;)
Regarding feature requests I'd love to see a well-maintained B2 Django Storage. I'm currently using an existing implementation, but it's not that well maintained:
I saw that one. I'm not a particular fan of Django but integrating my library apart from Django's storage library wouldn't be difficult. Neither would be building a django library on top of mine. Any takers? :)
I use Backblaze now, and once I get my NAS, I'll probably end up using a B2-based backup. But let's make an honest comparison. Backblaze does not replicate your data across data centers. The standard S3 storage class does ($0.023/GB). The comparable storage class for S3 is One Zone Infrequent Access ($0.01/GB). B2 still comes out ahead, but I wouldn't use either one for primary storage. For their suggested "3-2-1" backup strategy, sure.
Then again, just for backup, I could use S3 Glacier for $.004/GB. That's cheaper than B2 and I get multiple-AZ storage. The data charges would be higher - but it's backup. If catastrophe struck and I lost my primary and my local backups, getting my data back fast is the last thing I would worry about.
> Then again, just for backup, I could use S3 glacier for $.004/gb
Having done that in the past, I have to say that's just a million times less practical than basic S3-like storage. And if you want to automate that setup, Glacier is even worse.
I could see using something like rsync + Cloudberry (maps S3 and make it look like a network drive). Set it up to use one zone infrequent access, and then after x days use a lifecycle policy to move it to Glacier.
My use case for backups is solely movies and music. For source code I use hosted git repos, for pictures Google Photos, and regular office documents are either on Google Docs or OneDrive.
Last time I used Glacier, it was a separate product from S3 and had its own API.
You had to upload pre-prepared "tapes" for backups. You couldn't mutate an existing backup, you had to create a new one. And frequently fetching and/or deleting existing "tapes" (backups) would cost you money (more so than the original cost of the backup).
That meant you couldn't just ZIP it all up, back up the latest version and then delete the previous one to avoid being doubly charged for storage either.
Basically at time of archiving you needed to determine what was already archived and create a new bundle with only what's new, and archive that only. In the same spirit, restore meant piecing together multiple such tapes into a full restore-set.
Absolutely terrible. It was like having traditional backup-software constraints, but none of the software-support.
If Amazon has improved on that now, good for them, but I figured they probably had to if they wanted any users at all.
Honestly, I’ve never used the Glacier api directly. I’ve only used it as part of a lifecycle policy where objects were stored in S3 and then using the console to have AWS migrate data after a certain amount of time.
My offsite backup would only be accessed in the case of catastrophic failure - my primary and local backup data is unavailable. Data transfer does cost more but if I had that type of catastrophe, worrying about getting my movies back for my Plex server would be of little concern. Everything that I would care about - source code, photos, documents etc are stored other places.
That’s another strike against Backblaze backups (not B2 based backups). When we were in between residences last year - we left our apartment when the lease was up and stayed in an extended stay waiting for our house to be built, my main computer was offline for 5 months. One more month and my Backblaze backup would have been deleted. I forgot about it and I restarted my computer before I reconnected my external drive - so my backup from my external drive was erased from Backblaze as soon as I came back online. It wasn’t catastrophic but irritating. Luckily I have gigabit upload.
C14 is really not at all an object store. Getting data in and out is a huge pain, even compared to other cold stores like AWS or OVH. We evaluated them and passed.
S3 - not much to say, fast, durable, expensive...the gold standard. Given limitations of below, we use for rotating nightly backups despite cost.
Glacier - great for cold storage/archive, but has 90 day minimum
OVH hot - OpenStack-based, cheaper than S3 but not absurdly cheap; charges for egress even intra-DC, which is absurd and kills many use cases. They have crippled OpenStack permission management (i.e. no write-only keys with lifetime management per bucket, which is necessary for doing backups securely)
OVH cold - charges for ingress but then storage is crazy cheap, and egress not as bad as Glacier. This is our preferred archival option.
C14 - not object storage, more like a "cold" ftp dump
B2 - pricing is epic; the S3-incompatibility is a pain, as is the lack of Backblaze-sponsored libraries (the library in the python b2 cli is not a proper API)...we've been working on adding B2 to WAL-E. However, their permission/user management doesn't cut it.
Wasabi - S3 compatible, great pricing if not for 90 day minimum, which they hide in the fine print
> B2 - their permission/user management doesn't cut it
Have you seen the new "Multiple Application Keys" APIs we have published docs for (and the release coming in a week or two)? I'm curious if they satisfy your permission needs. The docs are here: https://www.backblaze.com/b2/docs/application_keys.html
A screenshot of the web GUI to these keys is here: https://i.imgur.com/RdlgdAs.jpg
(NOTE: the web GUI does not expose the full power of the multiple application keys; it is meant to be easy to use and hopefully satisfy 95% of customers' needs.)
That looks great for me at least! I'd been using B2 a bit personally, but had written off using it for any serious projects because of the inability to make extra restricted per-project(bucket) API keys.
> Q: How do safe-deposit boxes work?
> A: The safe-deposit box is a free temporary storage space that lets you upload your files before creating an archive.
> The safe-deposit box can be accessed for free using Rsync, FTP, SFTP, SCP protocols for a period of 7 days and supports up to 40TB.
> After 7 days or when you archive your safe-deposit box, your data are permanently stored on C14.
> When unarchiving, your data are delivered untouched, including file metadata.
Is there a fundamental reason why B2 is (and will remain) cheaper than S3, or is it just because they need to compete with AWS and once successful the prices will be the same (or higher)?
From my understanding, they've put a lot of work into lowering the cost of storage. I know at one point they were using arrays of consumer-grade drives, and they've done a bunch of analysis on the cost and reliability of drives on the market. They also created the "Storage Pod" [1] to maximize storage density.
Every cloud company does that. Google, Amazon, MS - all use consumer-grade drives with software on top to reduce costs and increase reliability at a fraction of the cost of traditional enterprise storage solutions. Scality even provides a proprietary solution to do the same on-premise.
Objects on S3 are replicated on 3 AZs (datacenters) by default and I can't find any info on B2 if they also replicate the data on multiple DCs. That can definitely change the cost per GB for them if that's the case.
Also, AWS costs a lot just in traffic. A lot of people store things on S3 and then make that publicly available.
AWS is "da cloud" for a lot of people. So they ride that wave high and mighty, charging a lot for everything they can easily measure. People will just pay it and will try to [post-]rationalize how it's cheaper than other providers, because AWS is better.
I posted a longer version earlier. But the correct B2 vs S3 comparison is 1 zone infrequent access. B2 still comes out cheaper especially when you consider transfer costs but not by as much.
Are you able to try multithreaded uploads too? I found that single stream uploads were too slow (< 10 megabytes per second) but I could get ~35 megabytes per second from packet.net to B2 by using 4 threads.
> If you use the large_file API (needed for multithreaded uploads)
We recommend for small files that you use multi-threaded where each thread sends a totally separate file. So if you have to upload both cat.jpg and dog.jpg, you upload cat.jpg in one thread and dog.jpg in another thread.
Based on the Backblaze architecture, that means cat.jpg will be sent to one "vault" in the Backblaze datacenter with one thread, and dog.jpg will be sent to a totally different "vault" in the Backblaze datacenter with another thread. This scales incredibly well, in that it should be twice as fast for two files, and 20 times as fast for 20 files if you do it correctly.
Source: I wrote a lot of the Backblaze Personal Backup client, which uses this philosophy.
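To make that concrete, even from a shell you can get the "one file per thread" behavior by just running the uploads in parallel, e.g. something like this with the b2 CLI (bucket and file names are just examples; rclone's --transfers flag does the equivalent for a whole tree):

    # each upload gets its own process, and therefore its own upload URL / vault
    b2 upload-file my-bucket cat.jpg photos/cat.jpg &
    b2 upload-file my-bucket dog.jpg photos/dog.jpg &
    wait   # block until both uploads finish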
Okay, I'll have to go back and have a look at some of the client libraries I tried - it may have been that the machine I was using wasn't quick enough to hash ~30GB in a reasonable amount of time.
I haven't added multi-threaded stuff yet because I wanted it to be compatible with single-threaded web servers like Flask and Django. I can and will add it if you want to add an issue.
This whole library wouldn't be necessary if Backblaze implemented an S3-compatible API. They give reasons like being able to load balance on the client side for their API (which I do not think is a good reason), but ultimately they just push work from their end onto a lot of applications and developers.
Maybe it also has a strategic advantage? Now every product has to announce they support B2 whereas nobody has to announce they support Wasabi, because they support any S3 compatible storage such as AWS S3, Google Cloud Storage or Wasabi.
Meh, I can see why they didn't. They aren't really in the same business. It makes sense for followers to implement APIs compatible with market leaders. Riak CS is API-compatible with S3, which is nice. But it's literally intended to be an open source version of S3 that you can host and scale yourself.
Backblaze is in a different market. They may be finding out that there's overlap and allowing that use. But they are not the same and probably aren't prepared for developers to start using b2 en masse.
I think it makes business sense. You want to save some money? Do a little extra work for the cheaper product. Want to save even more money? Roll your own with Riak CS. Cloud services all work along the same spectrum where you pay more for convenience and ease of use, and you pay less up front if you're willing to pay in developer or devops or infrastructure costs. I think this fits in nicely on that spectrum.
Object stores unfortunately innovate on their APIs instead of their implementations. I wrote S3Proxy to bridge the gap between S3 applications and a variety of object stores including B2:
Hah, I hadn't heard about Backblaze in a while, and I was even thinking about creating an Ask HN asking if anyone has been using Backblaze for a longer while and can say something about them (speed, data reliability). Now I'll take my chance:
Have you used Backblaze B2? How was your experience?
It's very cheap and effective for archival storage without having to deal with time/cost issues when you actually need to retrieve something. I use it for all my media so I can store terabytes and download in minutes for viewing.
Bandwidth is limited since they aren't connected like the major clouds, but it's workable if you don't need gigabit speeds. Single API key for permissions and lacks all the other features like events, object lifecycle, etc. Basic reporting but shows bucket size in real-time which is nice.
API can be annoying because it requires a request to "start" an upload (to get the address of where to upload), then doing the actual upload itself, but this can be automated away. Only single region for now (with multiple datacenters that aren't visible to you) so no global replication for extra durability or locality.
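For anyone who hasn't used the API, the dance looks roughly like this with curl (v1 endpoint names from memory; keys, IDs and tokens are placeholders):

    # 1. authorize: returns apiUrl and an account-level authorizationToken
    curl -s -u "KEY_ID:APPLICATION_KEY" https://api.backblazeb2.com/b2api/v1/b2_authorize_account

    # 2. ask where to upload: returns uploadUrl plus an upload-specific token
    curl -s -H "Authorization: $AUTH_TOKEN" \
        -d '{"bucketId": "BUCKET_ID"}' "$API_URL/b2api/v1/b2_get_upload_url"

    # 3. the actual upload, POSTed to the uploadUrl from step 2
    curl -s -H "Authorization: $UPLOAD_TOKEN" \
        -H "X-Bz-File-Name: photos/cat.jpg" \
        -H "Content-Type: b2/x-auto" \
        -H "X-Bz-Content-Sha1: $(sha1sum cat.jpg | cut -d' ' -f1)" \
        --data-binary @cat.jpg "$UPLOAD_URL"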
They have a partnership with https://www.packet.net (cloud bare metal) for free interconnect between their servers and B2 so you can do processing on your data without the public internet bottleneck and fees. Allows for an interesting data lake/warehouse option.
Use Cyberduck for a decent GUI client. If you just need personal computer backup, then use their actual backup offering which is unlimited storage and has auto-uploading background app.
Perhaps. Personally I use Arq with B2 for a back end, and this usually costs me less than $1/mo (their regular backup starts at $5). In addition, using B2 with something like duplicity is the better approach for backing up Linux or NAS boxes, where their official client is not supported.
I've used B2 for some internal backup handling (several 100s of GB but millions of files) and have largely found it inexpensive and performant.
A few considerations: their web UI cannot handle large numbers of files (support said that after a few million the file browser will not work). Sometimes when making a large number of deletions at once, the API may serve 500 errors and the web UI give Java Servlet errors (this only happened a few times and resolved itself in an hour or two). As another user noted, the per file/fragment upload speed isn't fantastic, but I could max out my gigabit fiber with many concurrent downloads/uploads. The API has no concept of folders, only file path strings (which is mildly annoying to work with). Lastly, I think all the data is currently housed in one geographic area, but they are working on a DC in Phoenix.
Overall a pretty smooth experience but I was mostly using it for cold data.
Cool, you guys should send out mailings to your customers when new offerings come online. I asked about s3 at the end of last year and was told it is currently unavailable for new customers. I haven't thought to check if it was available until now and may have gone to a more expensive competitor.
I can max out the upload of my residential internet connection (about 1 megabyte/sec) and that's enough for me.
I use it to back up my NAS, where all my other computers are backed up. I set it up with duplicacy-cli, rate-limited the upload to 700KByte/sec (the internet stays usable that way), and the script that launches duplicacy checks that it is not already running.
Since I never upload more than 60GB per day on average to my NAS, I don't have any issues.
We are opening a European datacenter in 2018, so stay tuned!
For now, we recommend you use multiple threads and you should be able to saturate any network connection, including yours in Europe. However, we do realize not all programmers or applications are capable of using threads and it would be more convenient to have lower latencies, thus the European datacenter in 2018. :-)
I tried it from the UK a couple of years ago, and had the same experience. I have ~1TB of data to backup, and it was going to take months to upload vs a few days for Azure or AWS hosted storage.
Maybe I misunderstand his point, but isn't transferring data from Australia to the US or reverse always going to be slower due to the speed of light? What he's said doesn't negate the fact that he's only got one POP.
Speed of light affects latency, not throughput. TCP works badly with large delays, which is why it's recommended to use several TCP connections to saturate the link.
I'm not sure how many POPs they have, so I do agree with that. I'm not shilling for B2 and have no affiliation with them. I've had good performance from them, but I am US-based.
I'm trying to use it as an S3 replacement for audio content delivery - seems slow and laggy unfortunately. Uploads also fail frequently enough. I don't know if CDNs would make a big difference. (Europe)
Edit: lack of webhooks or something similar for doing follow-up after successful uploads is also irritating.
This feels like a handy tool! The first thing I read when opening new code is the test suite - it's worth getting that right at the start. Would you consider deeper unit testing? S3 (and AWS) have the indispensable `moto` boto mocks; I think something similar would be dead handy here.
Yes, on my TODO list is much deeper unit testing. I made this in four days and was just testing that it worked. It already has about 92% code coverage but I want to cover that fields are returned properly and such. Some help would be appreciated if people would like to, including mocking it up.
I just wish Backblaze would fix their snapshots. You still have no way to tag a snapshot, put in any notes, anything. You can literally make two snapshots 5 minutes apart and the only thing that differentiates them is the timestamp. Unforgivable.
> You still have no way to tag a snapshot, put in any notes, anything.
We totally agree, and the project is fully spec'ed, just waiting for an available engineer to implement it! On a side note, we also have open reqs for engineers. :-)
It's fairly standard for network clients to assume potential malicious control of the server they are connecting to.
It helps reduce the blast radius of a compromised server.
In the case where the server is operated by a third party (as is the case with the B2 API server), there can be many compliance implications if that third-party-operated server has access to an internal network.
We don't accept SSH clients or web browsers having the ability to do things they shouldn't based on instructions sent by the server they connect to.
Why would we suddenly have lower expectations of our file storage API clients? (or any other network/HTTP clients for that matter)
Ah, I see what you're getting at. It'd be better if the URL for get_upload_url (I think that's what the API was called) could be calculated client side.
At the moment, you're probably still more at risk of downloading a malicious library from PyPi or npm but this is sure to turn up in a CTF at some point - even curl is technically vulnerable.
Have you talked to anyone from Backblaze about this?
Yes, client-side URL calculation and/or a whitelist of acceptable URLs would be a significant improvement.
Thankfully command-line curl won't follow redirects unless you pass it a special flag, though if you do need it to follow redirects, I'm not sure what the best way is to restrict the range of redirects that it will follow.
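For the record, curl's knobs here can cap the hop count and the protocols, though as far as I know not the destination hosts - something like:

    curl -L --max-redirs 2 --proto-redir =https https://example.com/some/file
    # -L             follow redirects at all (off by default)
    # --max-redirs   cap how many hops it will follow
    # --proto-redir  limit which protocols a redirect may switch to (https only here)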
This issue was part of a broader coordinated disclosure and was only published today. I've gotten in touch with B2 support & I'm hoping my support ticket will make it to the correct people.
We are big users of OVH, AWS and B2 object storage. OVH charges for ingress and egress even if local. AWS Glacier has a 90-day minimum storage time. For most use cases, B2 is much, much cheaper.
EDIT: Response to your edit, Wasabi also has 90 day minimum storage policy.
Wasabi has a 90 day minimum storage period, just like AWS Glacier. This means it's pretty unusable for things like nightly backups if your retention is less than 90 days.
If your backup process involves downloading the backup artifacts in a different region (say for true off site DR + validation) then it's still a net win as the 90-day storage costs are less than the insanely high $.09/GB AWS charges for outbound bandwidth.
I'm loving B2 for my Linux desktop backup. I used Crashplan for many years until they pulled the plug recently. Now I'm using B2 via Duplicati and I'm actually saving money (I have about 500GB of backup).
Borg was a runner up, but Duplicati had built in B2 support, provided scheduling, and a web interface which makes navigating for specific files in a tree easy when needed.
Crashplan had a nice Linux client, but it was a black box / closed source, so problems came up from time to time that were hard to debug. So it's nice to have more control of my data as well.
> uploading large amounts of data to [Backblaze B2] is very slow
All the reports we (Backblaze) hear are that if you only use one thread, Backblaze B2 is slightly slower than S3 (like maybe 90% of the performance). If somebody has better numbers, I would LOVE to see them!
If clients use multiple threads, this issue goes entirely away. Using 500 threads can provably be 500 times as fast with Backblaze. This is because the Backblaze B2 architecture means there are no "choke points" like Amazon S3 has. Each thread will most likely be talking to a completely different "Backblaze Vault" maybe even in a completely separate Backblaze datacenter. Since they don't share any network switches or load balancers in common, there is no way they will slow down.
But again, I would love any measurements or reproducible tests showing differences so we can chase them down and improve Backblaze B2!