Petabytes on a budget: How to build cheap cloud storage (2010)

KaiserPro · on Oct 26, 2014

I look after about 15PB of tier1 storage, and I'd recommend not doing it the backblaze way.

Its grand that its worked out for them, but there are a few big drawbacks that backblasze have software'd their way around.

Nowadays its cheaper to use an engenio based array like the MD3260 (also sold/made by netapps/LSI)

First you can hot swap the disks. Second you don't need to engineer your own storage manager. Thirdly you get much much better performance(2gigabytes a second sustained, which is enough to saturate to 10 gig nics). Fourthly you can get 4 hour 24/7 response. Finally the air flow is a bit suspect.

we use a 1u two socket server with SAS to server the data.

If you're brave, you can skip the raid controller and the JBOD enclosure instead and ZFS over the top. However ZFS fragments like a bitch, so watch out if you're running at 75% plus

brianwski · on Oct 26, 2014

Disclaimer: Backblaze employee here.

> Nowadays its cheaper to use an engenio based array like the MD3260

Cheaper than what? Backblaze still builds it's own storage pods because we can't find anything cheaper. We would LOVE getting out of that part of our business, just show up with something 1 penny cheaper than our current pods and we'll buy from you ENDLESSLY. NOTE: we have evaluated two solutions from 3rd parties in the last 3 months that were promising, but when the FINAL pricing came in (not just some vague promise, wait for the final contract/invoice) it was just too much money, we can build pods cheaper.

If your goal is to keep your job as an IT person - spend 10 or 100 times as much money on super highly reliable NetApp or EMC. You won't regret it. Your company might regret the cost, but YOUR job will be a bit easier. But if you want to save money - build your own pods. NOTE: remember Backblaze makes ZERO dollars when you build your own pods.

qthrul · on Oct 27, 2014

Disclaimer: VCE employee here. VCE is an EMC Federation Company [tm] with investments from Cisco and VMware.

> We would LOVE getting out of that part of our business [...]

That's a common theme when I meet with some unique companies that are still building out their own stacks of kit and just enough software to keep their consumption patterns consistent. It is about necessity some years but often comes down to the assembly line methods and sunk investments.

I'm curious how much variance you guys have / tolerate or if you ever see reaching VMI (maybe you are already there?) procurement levels?

Market wise, it isn't too outlandish to expect many of the "super highly reliable" options to have a future that is a spectrum of DIY + software (100% roll your own) all the way up to a full experience (status quo). Today, that roll your own option is a progression of hardware compatibility lists to reference architectures but there are only a few companies that go beyond that.

I'm curious to hear how much you see as "too much money" to keep your team from getting out of the rack/stack/burn/test/QA/QC/deploy. Do you plow the "too much money" less "build pods cheaper" into a value that can be compared an opportunity cost for the labor involved?

brianwski · on Oct 29, 2014

> I'm curious to hear how much you see as "too much money" to keep > your team from getting out of the rack/stack/burn/test/QA/QC/deploy.

Early on the "too much money" would have been maybe a 20 percent premium. But now that we have scaled up and have all the teams in place, "too much money" is probably even a 1 percent premium. In other words, if we moved to a commercial solution now it would be because we actually SAVE money over using our own pods. That shouldn't be impossible - a commercial operation could make their profit margin out of a variety of savings, like maybe their drives fail less often than ours do because they do better vibration dampening. Or since this is "end-to-end cost of ownership" if the commercial system could save half of our electricity bill by being lower power then they can keep 100 percent of that.

We have a simple spreadsheet that calculates the cost of space rental for datacenter cabinets, the electricity pods use, the networking they use, the cost of all the switches, etc. It isn't overwhelmingly complex or anything. It includes the cost of salaries of datacenter techs with a formula that says stuff like "for every 400 pods we hire one more datacenter tech", stuff like that. Hopefully if we went to a commercial built storage system we could increase the ratio like only hire one more datacenter tech every 800 pods. The storage vendor could pocket the salary of that employee we did not hire.

Or heck, maybe a large vendor can get a better price on hard drives than we get. We buy retail.

> reaching VMI (maybe you are already there?) procurement levels?

I'm not completely sure I understand the question, but here is one answer: in a modern web and email and connected world with tons of competition, I'm not sure there are any economies of scale past the first 10 computers you build. Places will laser cut and fold sheet metal in units of "1" for extremely reasonable prices, it's AMAZING what the small manufacturers can do for you nowadays.

sargun · on Oct 27, 2014

So, looking at what you guys are doing. At least based on your design from four years ago, have you looked at what the open computer project is doing?

At 4U, you guys can fit 45 drives in your case. The equivalent OpenCompute Knox would be 6U, at 45 drives. They have a different backplane arrangement, and more redundancy than y'all, so I expect it'll be a little bit more expensive, but at the benefit of not having to manage your own supply chain.

mattzito · on Oct 26, 2014

I think it's a use case question - obviously you'd never want to use a backblaze storage pod for scenarios where you'd be using only one or two pods, or for anything with high random I/O requirements, for all the reasons you describe.

But there's a lot of use cases these days where the majority of the data in question is "cool to warm" as opposed to hot/active. While vendors like NetApp and EMC would be happy for you to pay tier-1 vendor prices to store cool data on high-performance storage, if your app can handle it, it's not really necessary.

Think about photo storage, backups, database archive/transaction logs, log aggregation, etc. etc.

KaiserPro · on Oct 26, 2014

Oh I agree completely.

The Pods are not general purpose storage devices. They are literally as close to tape as you can get without getting dirty.

the average throughput of our teir1 storage is around 10 gigabytes a second (not bits) with a peak of 25-30. however they are just 60disk with 4 raid6 (with 4 hot spares) and dual 10 gig servers, 4tb disks.

They are just plain NFS servers with XFS. They are fairly good general purpose fileservers, great for high streaming IO. replace those 4tb disks with SSDs or 15k disks, you've got amazing random IOPs

brianwski · on Oct 26, 2014

Disclaimer: I work at Backblaze.

> the average throughput of our tier1 storage is around 10 gigabytes a second (not bits)

Any one Backblaze storage pod is absolutely limited by the 1 Gbit (not bytes) ethernet interface. So if you need your data sequentially from one pod it is not a very fast solution.

But remember we (Backblaze) have 863 pods (with another 35 being installed each month), each pod has 1 Gbit connection directly on the internet. So actually our "fleet" supports peak transfer rates of close to 863 Gbits/sec (107 GBytes/sec).

Backblaze storage pods aren't the right solution to all problems, but if your application happens to be parallelizable don't count them out as "slow".

KaiserPro · on Oct 26, 2014

For sure, you're optimising for HTTP, wide and slow. In VFX we need big gobbets of data yesterday to feed the many hungry CPUs.

Do you see the need yet to dump in a 10 gig card? or does the economic seem a way off?

brianwski · on Oct 26, 2014

> Do you see the need yet to dump in a 10 gig card?

We have a few of the pods in the datacenter running with 10 Gbit copper on the motherboard as an experiment. I'd have to double check, but it raises the price of the motherboard about $300? That's not bad AT ALL (only a 3% cost increase), the real cost hit is if you want to utilize the 10 Gbit you have to buy a 10 Gbit switch, which raises the cost of a 48 port network switch by thousands of dollars.

Netgear ProSAFE XS712T is a (recent) fairly cheap 10 Gbit switch. I haven't tried it yet, but soon.... And hopefully the fact that it exists will drive down the ridiculous price gouging of the more "datacenter" brands. Not to say Backblaze is above putting consumer hardware in our datacenter when the quality and price are right. :-)

devonkim · on Oct 27, 2014

For my home storage systems I've been looking at using Infiniband instead of Ethernet beyond the 1 Gbps link layer (for iSCSI and FCoE LUNs to boot thin clients off of). The cost of Infiniband switches is far lower than Ethernet and comparable 10Gbps cards are a bit cheaper, too. With Ethernet you don't have to think about software compatibility much and since Infiniband is available on enterprise class hardware a lot, you'll get a fair chance at support across OSes too.

rincebrain · on Oct 27, 2014

But in this case, they're exposed to the internet at large, not using a high-speed internal backbone, so Infiniband isn't going to let him talk to the internet, and then he's got two links per machine.

devonkim · on Oct 29, 2014

I'm not sure what the actual network topology is supposed to be nor do I really understand their data replication methods across these storage pods, but they can always use an Infiniband to ethernet bridge / gateway and that'd centralize the conversion to the egress side of their switches without mucking up overhead locally. Not like you can stick an F5 load balancer right on top of a Mellanox switch and expect it to be hunky dorey, but there's lots of cheap options to be explored is all I'm saying.

iSloth · on Oct 26, 2014

It's an old post with old pricing, I simply don't see vendors pricing close to there system ( assuming your looking at the fourth iteration ) especially not including support.

KaiserPro · on Oct 26, 2014

Does the price include the cost of building the software, and the helping hands to replace/source broken drives? (that includes the 4th iteration)

UnoriginalGuy · on Oct 26, 2014

You're suggesting something that costs $30K minimum. Given the number of vaults they have, they can afford a lot of additional pairs of hands for that saving.

Plus the software has an inherent value, it isn't purely a cost centre like you imply. The value of Backblaze is partly made by that very same software you scoff at.

KaiserPro · on Oct 26, 2014

I'm not scoffing, its a cost. The only reason to do a backblaze is to save on cash.

they are using really wide raid6 stripes, with horrific rebuild times. no hot swap ability.

put in anotherway: with 100 pods, you're going to be loosing 10 drives a week (on average, sometimes more) it takes 48 hours to rebuild an array (although I heard it was upwards of 72 hours, without load.)

so every week, you have to powerdown upto 10% of your storage to make sure you don't loose future data. or you can wait till you've exhausted all your hot spares, but then you're in the poo if you loose a disk on power recycle.(yes raid 6 allows two disk failures, but whats to say you wont loose two disks, as I said they are using large stripes (24?) so there is a high chance of a raid rebuild failing)

hotswap is awesome. running low on hot spares? sure replace online, walk away happy.

to make a backblaze work you need to keep more data in tier 1 storage, which means higher cost. (even if you use a software layer over the top to do FEC you still need at least 1/3rd more disks. )

Then there is the woeful performance.

if you want JBOD, with large scalability, just look at GPFS. The new declustered raid is awesome, plus you get a single namespace over zetabytes of storage, with HSM so you can spin out cold blocks to slower cheaper storage (or tape)

It blows everything else out of the water, ceph, gluster lustre, GFS2 & hadoop. Fast, scalable, cross platform and real POSIX, so a real filesystem which you can backup easily.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

> they are using really wide raid6 stripes,... no hot swap ability.

SATA drives are inherently hot-swappable, you can hot swap drives in a Backblaze Storage Pod if you want to. In our datacenter for our particular application we just choose not to. We shut down the storage pod, calmly swap any drives that failed, then boot it back up. The pod is "offline" for about 12 minutes or there about. The RAID sync then happens as the pod is live and the pod can be actively servicing requests (like restoring customer files) while the RAID sync occurs.

KaiserPro · on Oct 26, 2014

Interesting, do you find that the power up causes more disks to pop? (with the change in temperature?)

How often do you see multi-drive failures(in the same lun)?

We've just past a peak of drive failures (had a power outage) where we've had multi-drive fails in the same lun. The awesome part about that is that each rebuild caused more drive to fail. I think one array swallowed 5 disks in two weeks. It didn't help that the array was being hammered as well as rebuilding (70 hour rebuilds at one point)

brianwski · on Oct 26, 2014

Anytime we reboot a pod, it runs a "significant" risk of not coming back up for a variety of reasons, and if a pod is powered off for several hours it is even worse. My completely unfounded theory is cooling off and heating back up is bad, short power downs things don't really cool off as much.

> How often do you see multi-drive failures(in the same lun)?

It happens - usually for a reason. Most recently a bunch of drives of one model all seemed to fail very closely together after running for 2 years, it's like some part wore out like clockwork. We are pretty relaxed about 1 drive popping, nobody gets out of bed and runs to the datacenter for that, it can wait until the next morning (the pods instantly and automatically put themselves in a state where customers can prepare restores but we stop writing NEW data to them to lighten the load on the RAID arrays). So if 2 drives fail within 24 hours on a Sunday then Monday morning we go fix it.

> each rebuild caused more drives to fail.

Same exact experience here, we feel your pain. We even have a procedure where we line up all brand new drives, clone each individual original drive to a new drive, then take all the new drives and insert them back into the original pod AND THEN run the RAID rebuild.

rincebrain · on Oct 27, 2014

You may find "Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures" relevant to your theory.

the_ancient · on Oct 26, 2014

>>sold by netapp

I think you missed the "on a budget" part.....

KaiserPro · on Oct 27, 2014

Its OEM'd by a large amount of companies, and are dirt cheap

the_ancient · on Oct 30, 2014

I dont think you understand what the words "dirt cheap" mean

Netapp and dirt cheap do not belong in the same thread

2close4comfort · on Oct 26, 2014

https://www.backblaze.com/blog/backblaze-storage-pod-4/ here is the latest version. I have loved being able to follow the iterations of the storage pod. They are very thoughtful about HW choices but leave it open which I think is the best part!

atYevP · on Oct 26, 2014

Yev from Backblaze here -> Glad you're enjoying the posts! We enjoy spreading that information, it's a rare thing, so we're pleased as punch that folks are reading them :)

cc439 · on Oct 26, 2014

It's more than rare, it seems that you're the only real-world, large-scale source on a lot of the topics you've been covering.

I may not work in IT but I crave well-written accounts of technical problem solving. Yours are among the most interesting as you're so open about sharing details that no one else would consider releasing.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

Awww, you are so kind you make us blush. :-)

I do wonder why big companies like Google and Yahoo and Apple don't release more interesting statistics on drive failure. I've heard many people assume they are getting some sweet pricing on hard drives with an agreement never to release those numbers. But if that were true, why won't Seagate or Western Digital or HGST ever give Backblaze a deal to shut us up? Those three companies have refuse to sell to Backblaze "directly" (their business plan is to sell to distributors like Ingram who mark up the product and sell it to NewEgg who marks up the product and sells to Backblaze). I assure you, there are no backroom undisclosed pricing deals for Backblaze.

So I suspect Google and Apple pay retail drive prices also, and is not disclosing drive failure rates because they don't see the profit in it?

atlbeer · on Oct 27, 2014

2007: http://research.google.com/pubs/pub32774.html

atYevP · on Oct 28, 2014

But no names named :(

antr · on Oct 26, 2014

Hi Yev, do you know if Backblaze is thinking of supporting NAS backups in the near future? I really want to use Backblaze, but what is really stopping me from using you guys is not being able to backup my NAS. Thanks

brianwski · on Oct 26, 2014

Brian from Backblaze here.

The NAS problem isn't a technical one, it is a pricing problem. We charge a fixed amount of money "per computer" ($5) and we don't throttle or limit customers in any way. We survive on "averages" - most laptop users only have less than 1 TByte and we make money, some have 50 TBytes and we lose money, but on average we are profitable.

If we allowed a single user for $5 to mount network drives with Petabytes of data on them and back those up, this would tip the averages and put us out of business. So we need to solve the pricing problem.

With all that said, stay tuned, we are actively trying to figure out a pricing system that will make everybody happy (NAS owners with Petabytes of data included).

antr · on Oct 26, 2014

I totally understand, and respect that. I'm sure there are ways (price plans) to get NAS owners to signup. I might have a NAS, but defiantly don't have petabytes of data, but on average 1TB to 1.5TB on my NAS. I'd be happy to pay $2-4 more dollars more if I could include my NAS... this is my thinking out loud, I haven't run any numbers.

e12e · on Oct 27, 2014

What I'd like to see (been waiting for, for a long time, actually) is "enterprise" or "pro" backblaze, more along the lines of rsync.net or similar. Something that can (willingly) handle backup from servers (at 1Gbs speed). In other words an Amazon S3/glacier-like product -- but with more transparent pricing.

I'd be happy to pay a premium (way over the $5 per computer) for something like that -- I'm sure I'm not alone. Extra if the software was released under the AGPL or something -- so that I could migrate to self-host if I wanted/needed to. I don't think too many would be able to start up and compete directly with you wrt your experience with pods etc... so I think it should be viable. Less sure if anyone would "want" to move off Amazon.

cerberusss · on Oct 27, 2014

Purely for this reason, I have an account and actively promote you online and to everyone I know.

Oculus · on Oct 26, 2014

I've seen their hard drive reliability blog posts, but I never knew they were so open about their storage server designs. I think it's really cool how open they are about something that is definitely a competitive advantage.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

> open they are about something that is definitely a competitive advantage.

From our perspective, our biggest competitor is apathy. Something like 90 percent of people simply don't backup their computers. So pretty much anything we can do to have people hear about Backblaze and possibly overcome their apathy and start backing up their laptops is good for us. These blog posts have worked very well for us - people enjoy the information, and by disclosing the info we have potential customers hear our name and we get a bump in subscriptions. It's a win-win, the BEST kind of win. :-)

aftbit · on Oct 26, 2014

The real competitive advantage is the software and IT principles that make managing a dumb array of disks possible.

falsestprophet · on Oct 26, 2014

Is it though? Running OpenStack Swift on these pods would get you pretty far.

http://docs.openstack.org/developer/swift/

caraboga · on Oct 26, 2014

Having inherited an openstack install with the swift implementation implemented on earlier generation of pods, these are the issues I and my cohorts ran into:

1) The object replication between the pods will be hard on the array, as the object replication is simply an rsync wrapped inside nested for loops. If you have a bunch of small files with a lot of tenants, it'll hurt.

2) While swift allows you to simply unmount a bad disk, change around the ring files and let replication do its thing, there are actual issues. First of all, out of band smart monitoring of bad sectors actually cases the disk to pre-empt some sata commands and do the smart checks first. On a heavily loaded cluster, under a smart check for a bad block count could kill the drive, and take out the sata controller along with it. We've downed storage pods before by that way. The only way we got around it was to take a pod offline once a week, run all our smart checks, then put it back.

3) To replace a drive, you have to open the machine up and power it off. As any old operator will tell you, drives that have been running for awhile do not like to be turned off. If you are to power off a machine to replace a bad drive, do realize that you might actually break more drives just from the power cycle.

4) Once you change a drive out and use the ring replication of files to rebuild your storage pods, your entire storage cluster will take a non-trivial hit.

5) Last but not least, it is almost of tantamount importance to move the proxy, account and container services away from the hardware that also host the object servers. It's probably good to note that the account and container meta data is stored in sqlite files, with fsync writes. If you add and remove a bunch of files to multiple accounts, the container service is going to get hit first, then the object service. Further more, every single transaction to every meta data/data service, including replication, is federated through the proxy servers. If you look at a swift cluster, the proxy services take up a large chunk of the processing space.

Source: Was Openstack admin for a research group in Middle Tennessee. Ran an openstack cluster with 5 gen1.5 pods for the entire swift service, then moved account/container/proxy to three 12 core 2630s with 64 gigs of ram. Cluster was for a DARPA vehicular design project, first part fielded 3000 clients, second part fielded about 1/10 of that, but with more files and bigger files (these were cad and test results respectively).

rdkls · on Oct 27, 2014

Thanks caraboga. I was contemplating this setup a year or so ago, and great to learn your insights from actual implementation. It looked to me like Supermicro JBOD enclosures would be superior (mainly hot swappability) if a bit more expensive, would you agree?

caraboga · on Oct 27, 2014

My predecessor went with Backblaze clones for the drive and mb enclosure. You'll run into issues with the backblaze way as they have two power supplies, but one is for the motherboard and the boot drives, and the other power supply is for the actual drives. Further more, as this was a backblaze pod clone, there's no ipmi on the pod motherboards. It makes certain things a bit more annoying than they have to be.

If I were to do it again, I would stay away from using enclosures that were inspired by the BackBlaze models. Supermicro enclosures are fine.

rincebrain · on Oct 27, 2014

I'd be curious to know what drives and controllers you had that pre-empted requests due to SMART queries...

caraboga · on Oct 27, 2014

I don't recall the model, but they were 3 terabyte Seagates that were bought 3 years ago. I think my predecessor employed the same controller cards as version 1 of the pods. You could tell that the smart queries pre-empted normal io as the smart query would be executed, and disk activity for certain swift object servers would just stall. The object-servers would not return requests for at least some of the drives until the smart query was finished.

Curiously enough, I've also ran smartctl against Samsung 840Pros using Highpoint RocketRaid controller cards during the same project. Sometimes this crashes the controller card.

(This was for a gluster cluster, when gluster had broken quorum support. A copy of a piece of data, when out of sync with the other copies in a replica set, would be left unchecked. This was produced after one smartctl command took out the controller card, and then the machine along with it.)

jpalomaki · on Oct 26, 2014

I've always been very suspicious about backup vendors offering unlimited space for fixed price. These storage pod posts by Backblaze was primary reason why I decided to give their service a try. Knowing the technology behind the system made it much more credible for me.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

This is EXACTLY one of the reasons we released the information. People would write about Backblaze and say we were losing money and it was an unsustainable business funded by deep pockets. That was frustrating to hear because we self-funded the company for years (no VCs) and I assure you there are no deep pockets here.

So we released the info on the pods partly so we could point at it and say "see, the math really DOES work!"

harel · on Oct 27, 2014

That: no VCs. That is a sign of a healthy company if you can do it without someone else's money (read: selling your soul for cash). I often see people with an idea worrying more about getting VC money than executing the idea and I can never understand it. If your idea is solid you don't need external money. There are valid reasons to chase it (facilitating rapid growth for example) but I rarely see VC money sought after for the right reasons.

atYevP · on Oct 26, 2014

Yev from Backblaze here -> Glad we could win you over :)

andyidsinga · on Oct 26, 2014

> A Backblaze Storage Pod is a Building Block

> But the intelligence of where to store data and how to encrypt it, deduplicate it, and index it is all at a higher level (outside the scope of this blog post).

I'm curious about their software that works outside the nodes too. I've been working on storage clusters over this past 9 months using the Ceph ( http://ceph.com/ ) open source storage software. Its pretty amazing -- and I suspect it could be deployed to a set of backblaze pods too.

It seems to be that for production environment where you wanted to maintain availability you would to build at least 3 of those pods for any deployment - enabling replication across pods/storage nodes.

xorcist · on Oct 26, 2014

Care to write something about your experiences with Ceph? I was recently looking at something that could be a use case for it, but it is a bit too close to the holy grail of storage for me to feel comfortable. I've toyed with similar systems in the part and they've always been very iffy with their POSIX semantics.

andyidsinga · on Oct 27, 2014

We're not doing anything explicitly posix with ceph - not using the ceph filesystem. We're using ceph for block storage under virtual machines that use KVM/QEMU with libvirt etc. The block devices exposed to the VMs end up using having an ext4 filesystem on them but that has nothing to do with ceph.

So far seems fantastic from a VM perspective -- I can take down storage nodes, add new nodes etc, all while the VMs running up top stay alive and well. I once lost 1/3 of my cluster (on a 3 node cluster with 30 disks) ..and everything stayed running until I fixed the broken node.

Ceph also supports Object storage and an S3 interface -- I haven't experimented much with that yet, but I hear its good.

xorcist · on Oct 27, 2014

Thanks, that sounds like a very interesting use case. It sounds almost perfect for KVM storage. How do you expose the file system to the KVM nodes? Just mount it?

What's performance like so far?

andyidsinga · on Oct 27, 2014

qemu has support for ceph's "rbd" ( rados block device). you create an image of a particular size using the ceph tools, then when you create a vm xml definition you specify and rbd disk. ( see here http://libvirt.org/formatdomain.html#elementsDisks )

finally, when the vm is booted you can use that block device like any other machine. for first install it will be empty ( need to attach a cdrom image to get going ) then youncan start making snapshots for "thin provisioning" using copy on write clones ...very sweet stuff :)

re performance ...i think its good, but i have heard its not as good as expensive proprietary systems. the product we're building is around storage analytics on open source. ..so eventually well know more :)

jdub · on Oct 26, 2014

But in 2014, 1PB in S3 – with 11 nines of data durability – costs ~USD$30,000.

epistasis · on Oct 26, 2014

That's $30,000 per month. These pods will last 3-5 years, which is $1-$1.8M of S3 storage.

Say you store 3-4 copies of the data on the BackBlaze pods in order to attempt comparable durability, and then the cost comparison comes down to power, space, network, and manpower costs. These costs are extremely variable, so depending on access to these, I could see someone still going with S3. S3 also scales larger and smaller quite quickly, unlike investing in your own equipment.

jdub · on Oct 27, 2014

Yeah, so, over three years in us-east-1:

  S3 = ~USD $1,085,552.64

Or more realistically:

  Glacier = ~USD $377,487.36

epistasis · on Oct 27, 2014

I wish Glacier was a realistic option for me, but I need to look at my data again.

bkruse · on Oct 26, 2014

This has always interested me. I need to do decently big storage for genomic data. It doesn't have to be fast, but it needs to be able to survive one data center blowing up. If I have 3 data centers, need to store 2-3 petabytes and need storage to survive in the case of a data center failure - the solutions really narrow down when you have to get under the $200/tb range.

Playing with Swift now - but it has really opened my eyes to how much more difficult 2-3 petabytes of storage is (disk failures, number of disks in your infrastructure, the time to redeploy a datacenter on a 1gpbs connection). All the little problems become much bigger!

ddorian43 · on Oct 26, 2014

Don't know if swift has erasure coding yet, but QFS[0] has it.

[0]: http://quantcast.github.io/qfs/

thrownaway2424 · on Oct 26, 2014

Single metadata node still? Seems like a good way to manage a large amount of data, and then lose it all.

harel · on Oct 26, 2014

Tech aside, I'm quite curious about the economics of storage here. By the price tags and 'Debian 4' I'm guessing this is an older post. But still, $7867 per 67TB and $5 per month, means they need 131 users pay for one year to recoup the cost of one pod, assuming those 131 do not generate over 67TB worth of storage in that period of time. I've not factored in data centre costs, salaries etc. Just a pod. I'm guessing they have enough users as they have been around for many years now, but still, $5 seems a bit on the cheap side to me (not that I'm complaining)

brianwski · on Oct 26, 2014

Brian from Backblaze here.

> I'm quite curious about the economics of storage here.

When we started almost 8 years ago, we were CLUELESS about the economics of it all, and scared witless over it. :-) But I'm happy to report the grand experiment has been a modest success, we are profitable (by a small amount). We even have a little money left over every month to spend on marketing efforts now, and we're growing by every metric - more customers each year, more revenue each year, more employees to swap failed drives (38,790 spinning drives as of yesterday - some fail like clockwork each and every single day).

coned88 · on Oct 27, 2014

any change you will ever offer backup without the need for a client. Sometime a user could just use sftp or something to send the files to. Something more open so I don't have to install your software. That's what holds me back.

Bluestrike2 · on Oct 26, 2014

I'd imagine--based on purely anecdotal evidence--that their market research indicated consumers are highly responsive to changes in price. People (even those who should know better) tend to have an irrational confidence in the idea that they won't need to fall back on a backup in the future. Hard drives crash, but mine? Unlikely. Foolish in the extreme, but far too common nevertheless.

Faced with that sort of thinking, the lower price likely helped them convince consumers. At $5, it's a low enough fee for most people to kind of just forget about it rather than consciously weigh the opportunity cost involved. The last think Backblaze wants is for its customers to think that they'd be better off spending the money on a really good hot dog or two rather than a backup service.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

> their market research indicated consumers are highly responsive to changes in price

That makes it sound like we carefully thought it through. :-) Our "market research" was the 5 Backblaze founders asking their friends and family questions like "Mom, how do you feel about paying $5/month for a backup?" After asking that question to 15 different people, it seemed like there were reasonable tipping points around $5/month and then another one at $1/month. The math didn't work at $1/month - so $5/month it was! And we've never changed prices, because we got lucky and it worked out for us. So we've never run the experiment of whether we would lose half our customer base if we raised the price to $6/month. And I hope we never have to raise prices, because subscription customers seem to HATE that - like when Netflix raised prices it was a hate fest towards them.

harel · on Oct 27, 2014

That's the second best market research. I'll definitely consider your service, assuming you have a Linux client.

corv · on Oct 26, 2014

I wonder how they can guarantee data integrity.

Are they checksumming on a higher level and is that cheaper than using ZFS with necessarily more expensive hardware?

tachion · on Oct 26, 2014

I am also a bit suprised they're running Debian 4 + JFS combination instead of something a bit more modern, like FreeBSD and ZFS, for example, my default choice when it comes to filesystem nowdays. I'd love to hear abou their reasoning behind that and possible plans (if any) to move to something more recent, when it comes to software layer.

epistasis · on Oct 26, 2014

As of 2012, they checksum each file and check it every few weeks, and if a file went bad they have the backup client retransmit it:

https://help.backblaze.com/entries/20926247-Security-Questio...

brianwski · on Oct 26, 2014

Brian from Backblaze here.

Yep, we checksum the file on the client before sending it. In practice for any one user the number of "silent bit flips" on disk is very low and you probably wouldn't notice it (maybe 1 note in your mp3 file is slightly off). But when we check the checksums every few weeks we absolutely see it occur in our datacenter at our scale probably every day and take steps to heal the files (either from other copies we have or ask the client running on the customer's laptop to retransmit the file).

corv · on Oct 26, 2014

Ok, I see.

immortalx · on Oct 26, 2014

I decided to try it. You can only select entire hard drives and work around this by excluding folders. That's odd but what i don't understand is why you cannot exclude your c:/ (or Main Drive). Why should anyone be forced to backup something?

Seems to me like the design is backwards and doesn't make any sense.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

The focus of our backup system is being "easy to use". We get a lot of advanced users who are computer savvy, but we also get some mom and pop novice who aren't very good with computers.

Ok, so we don't allow you to unselect your C:\ (main) drive. Here is the reason: when given that choice an alarmingly high number of novices accidentally unselected the drive. (sigh)

We don't charge customers any more or less on what they backup, so there are almost no legitimate reasons to UNSELECT a drive. What data do you have that you hate so much you MUST lose it when your computer is stolen? So we error on the side of helping these novice users and don't allow them to unselect their main laptop drive. The down side is it sometimes irritates the advanced customers who know what they are doing. :-)

atYevP · on Oct 26, 2014

Yev from Backblaze here -> There are certainly items related to Backblaze that we need to back up. You can exclude pretty much everything inside of the C: drive, but we do require the drive itself to be selected. Also, we chose the "backup everything, and let the user exclude data" route because the other way seems like a gap. For example, a lot of backup solutions make you select what you want to back up, but what if you forget something? Or forget to add something later? That would leave a hole in the backup, so we go the reverse route!

ciupicri · on Oct 26, 2014

If you submit an old article even if newer versions of it exist, at least mention the year in parenthesis.

hendzen · on Oct 26, 2014

Honest question, why JFS vs say, ext4?

bluedino · on Oct 26, 2014

Wasn't mature enough at the time, and JFS allowed larger volume sizes. After the second or third pod revision, BackBlaze switched to ext4.

Oculus · on Oct 26, 2014

In later blog posts they discuss moving over to ext4 despite the 16TB size limit.

brianwski · on Oct 27, 2014

Brian from Backblaze here.

Correct, we are entirely on ext4 now (there might be one or two old pods running JFS left, I would need to double check).

We figured out a way to work around the 16 TByte volume limit originally, but now ext4 even supports larger volumes so that isn't a problem at all anymore.

More random history: At the time we chose JFS it looked like XFS might lose support (XFS was created by a now defunct company I used to work for called SGI). Well, the world is a strange place and XFS now has more support than JFS, despite SGI going out of business. But in the end, ext4 is the "most commonly used" linux file system and therefore is arguably the most supported with extra tools and such, and the performance of ext4 in many areas seems higher than JFS (at least for our use case) so the decision becomes easier and easier to go with ext4.

fredsted · on Oct 26, 2014

>In the future, we will dedicate an entire blog post to vibration.

In the meantime, does anyone have a link?

aliakhtar · on Oct 26, 2014

Do they only do backups or also cloud storage? If they have an API for uploading / deleting / viewing files, I'd use them over S3 given how lower their costs are. But, I can't find any info on that on their website.

lee · on Oct 27, 2014

Also they only support Windows/OSX. I've been a customer of theirs for a few years, but I've recently switched to a competitor just so that I can backup my Linux NAS.

eeZi · on Oct 26, 2014

No API. Only backups, and you need their client.

mikeash · on Oct 26, 2014

It is possible to retrieve individual files or small collections of files, so you can use it as a "I forgot to upload my presentation" solution in addition to just plain disaster recovery. However, it's somewhat cumbersome and pretty slow (restoring a single file still takes on the order of minutes) so not a practical solution for general-purpose data storage and retrieval.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

To expand on this, we require our client because of our pricing model, which is $5 per computer for unlimited storage FOR THAT ONE laptop or desktop. An API would mess with that pricing model, so we haven't opened it up yet.

mschuster91 · on Oct 26, 2014

I'd love it if either Backblaze or a 3rd party makes a business of selling these pods!

edit: just spotted it, their boot drive is PATA?! Why is this, given that PATA drives are slower and more expensive than SATA ones?

spartas · on Oct 26, 2014

45Drives http://45drives.com is one company that's selling pre-built versions of the storage pod.

mschuster91 · on Oct 26, 2014

http://www.45drives.com/store/order-beerinator.php

A 4U beer cooler, not sure if awesome or stupid... but still, a nice idea.

atYevP · on Oct 26, 2014

Yev from Backblaze here -> We are about to have one on order...maybe a whole cabinet full ;-)

cik · on Oct 26, 2014

Awesome - thanks for the find. I actually needed something like this and was way too lazy ;)

vertex-four · on Oct 26, 2014

https://www.backblaze.com/blog/backblaze-storage-pod-4/ seems to suggest their boot drive is SATA (google "WD5000LPVX"), where are you getting the idea that they're PATA?

justinsb · on Oct 26, 2014

Looks like they were PATA for version 1 (search for 'PATA').

My guess is that they wanted to keep the SATA ports for their storage, and the motherboard had a PATA port available.

It may also have been non-trivial to boot from JFS, which is probably why they wanted a separate disk.

kabdib · on Oct 26, 2014

Doesn't matter how fast your boot drive is. (A friend of mine runs a home RAID setup, a couple hundred terabytes, and boots off of thumb drives).

ksec · on Oct 26, 2014

Need the Add the Year at the title. This post is old.

tkinom · on Oct 26, 2014

Great article!

Love to see more write up on software selection process, tradeoff, failure recovery process/methods and benchmark data.

sidcool · on Oct 26, 2014

Quite detailed post. Loved reading it.

iflyun · on Oct 26, 2014

why such an old and expensive cpu?

tjl · on Oct 26, 2014

It's an old post. See the latest post,

https://www.backblaze.com/blog/backblaze-storage-pod-4/

for more details on the new one.

codeonfire · on Oct 26, 2014

What do you do when a drive goes bad? Do you move ~50TB of data, pull the entire pod out of the rack, and then try to determine which of 45 drives is bad?

X-Istence · on Oct 26, 2014

I don't know what Backblaze does, but if I were building their infrastructure I wouldn't care if a disk goes bad. They are a backup solution, it is a write once, read almost never. It's pure cold storage. 99.99% of their time after they are filled they will be powered down sitting in a rack waiting for someone to come fetch the data.

As long as they have stored my data across multiple replicas, then there is no caring if one of the drives goes bad...

us0r · on Oct 26, 2014

From another blog post:

"Funny thing about providing an online backup service that is currently storing over 100 Petabytes of customer data, sometimes people need their data back. In fact, nearly half (46%) of our customers restored one or more of their files in 2013 and that adds up."

https://www.backblaze.com/blog/six-billion-files-restored-an...

vidarh · on Oct 26, 2014

That doesn't contradict the earlier point. 46% of customers restoring one or more of their files is practically nothing; assume that many customers only restores a few files and that most customers do nightly backups, and the vast volume of backed up data is likely never touched.

So as was said earlier: As long as there are replicas (and presumably a repair process to maintain a reasonable replica count), a dead disk can sit in the enclosure until a high enough percentage is dead that it's cost effective to pull a a couple of them and consolidate into a new one.

Not just because it's awkward to pull them, but data centre visits gets expensive when you factor in the cost of ops people so there are definite tradeoffs in terms of how often it is cost effective to deal with.

brianwski · on Oct 26, 2014

Brian from Backblaze here.

> data centre visits gets expensive

Early on going to the datacenter was expensive (and time consuming, the datacenter was a 1 hour drive away from the office). We had some success with training "remote hands" (the datacenter provides some basic services like rebooting servers for hourly fees).

Now we employ several full time people who go to work in the datacenter (they don't even have desks at the corporate office). It makes our world SO much better and smoother now that there is enough work that they can be there full time.

Spooky23 · on Oct 26, 2014

I use Backblaze to backup something like 400,000 files on my Mac. Worst case scenario, I only care about maybe 100,000 files total. The rest is associated with the OS, apps, etc

atYevP · on Oct 26, 2014

Yev from Backblaze here -> If a disk goes bad we put the pod in a read-only mode, which limits the stress on the pod, and allows restores to continue. We then swap the drive, usually fairly quickly, but because we designed the system around failure, it's usually not critical, though we keep an eye on each one.

bluedino · on Oct 26, 2014

In their 2.0 blog post, they said they had a guy spending 1 day a week replacing approximately 10 drives. I would imagine that consists of identifying the drives and pods, issuing software commands to power them down, physically replacing the drive(s), then re-assembling and powering the unit back up and then bringing the storage software back online and issuing a rebuild command.

They only had 9,000 drives back then.

atYevP · on Oct 26, 2014

Yev from Backblaze -> We've staffed up ;-)

acmecorps · on Oct 26, 2014

Hi there Yev,

I'm a long time crashplan user and the reason why is because they have a family plan ($15 for 10machines). I have multiple machines that I need backed up, so that's why I go with this. I wished i can use backblaze because it looks very good and have native apps. If you have a family plan, I'll switch in a heartbeat. :)

brianwski · on Oct 26, 2014

Brian from Backblaze here.

CrashPlan is awesome, I have ALWAYS liked those guys.

Backblaze focuses on being really simple, we already have 3 different pricing choices which is probably two more than we should have: 1) $5/month, 2) $50/year, or 3) $95/two years. All are unlimited, none have throttles or caps, we don't even price businesses differently. Simple.

Our theory is (may not be correct) that if you offer customers too many complicated choices, they get stuck in the decision making process and don't buy any of the choices. If we offer a family plan that is MORE expensive for two computers but LESS expensive for three computers, blah blah, too much thought. It's $5/computer/month - done. You can get started IMMEDIATELY with one computer, and add them as you go, you don't have to decide to commit to 10 licenses to get the best price, etc, etc.

We don't have any magic knowledge more than anybody, so internally we sometimes angst over stuff like a family plan, but for now we just want it to be simple and straight-forward.

xcrunner529 · on Oct 27, 2014

Yeah, not a big deal to me. The no throttling is awesome.

However, I am seriously considering Crashplan myself when my subscription expires in February because I do not like the 30 day expiration policy. It is definitely a pain with large external drives. But I suppose I might not be profitable so maybe that's not worth fixing :P

brianwski · on Oct 27, 2014

We really want to bump it up to 6 months, and the decision all hinges on a slight mystery we have surrounding "data churn". In a nutshell: we aren't sure we can afford to bump it up to a 6 month retention without changing price, but we need to do some additional analysis or tests to make sure.

> I suppose I might not be profitable so maybe not worth fixing

No no, you are (one of) our target customers! Here is why: you post to ycombinator and you are technical enough to understand the implications of the 30 day roll back policy, and you are actually evaluating software on merits - this alone puts you in something like the upper 5 percent of technical computer society. Yeah, maybe we lose a dollar a month on your backup, but if you are using Backblaze you'll probably recommend it to five other people we make money off of. Your recommendation is GOLD to us.

rythie · on Oct 26, 2014

Why would you need to do any of that? RAID6 can tolerate 2 drive failures and Linux will tell you which drive is bad. Just slide the pod out and replace the drive, no data lost, very little down time.

codeonfire · on Oct 26, 2014

Three drive failures? My question is how do you practically determine which drive to swap. I don't see any labels or anything. Also I read the current version supports rails. The one in the article looks bolted to the rack. The article had no date on it and it sounds like a lot of the issues have been addressed.

brianwski · on Oct 27, 2014

Brian from Backblaze here.

> how do you determine which drive to swap

Every two minutes we query every drive (all 45 drives inside the pod) with the built in linux smartctl command. We send this information to a kind of monitoring server, so even if the pod entirely dies we know everything about the health of the disks inside it up until 2 minutes earlier. We remember the history for a few weeks (the amount of data is tiny by our standards).

Then, when one of the drives stops responding, several things occur: 1) we put the entire pod into a "read only" mode where no more customer data is written (this lowers the risk of more failures), and 2) we have a friendly web interface that informs the datacenter techs which drive to replace, and 3) an email alert is sent out to whomever is oncall.

Each drive maps has a name like /dev/sda1 and these drive names are reproducibly in the same location in the pod every time. In addition (before it disappeared) the drive also reported a serial number like "WD‑WCAU45178029" as part of the smartctl commands which is ALSO PRINTED ON THE OUTSIDE OF THE DRIVE.

TL;DR - it's easy to swap the correct drive based on external serial numbers. :-)

codeonfire · on Oct 28, 2014

Ok, thanks for the info. That doesn't sound too bad.

mikeash · on Oct 26, 2014

Wouldn't the SATA controller tell you which port the bad drive was on, and then you'd presumably have some standard mapping from ports to physical locations in the case?

codeonfire · on Oct 26, 2014

Yes, I'm not saying its not doable. It just seems error prone and time intensive to replace a drive.

sneak · on Oct 26, 2014

I don't understand why they're using raid6 instead of file-level replication/integrity. They're already running an application over top of it - do the replication there and skip replacing disks...

rythie · on Oct 26, 2014

I don't think they are doing replication at all, don't see it mentioned. RAID6 is cheaper than replicating the files, which would mean twice as many servers.

aftbit · on Oct 26, 2014

But you'll need some level of multi-data-center durability (or at least across racks), so you'll want to replicate user content anyway. Otherwise a dead server could prevent a restore.

rythie · on Oct 26, 2014

I see no mention that they do that, clearly they trying to make this as cheap as possible. If the server is down people can wait for their recovery, many tape based backup systems require you to wait for 30 mins or more to get the backup which covers most outages. Even waiting a days for a recovery isn't the end of the world, especially given very few recoveries are being done and big recoveries require sending of a drive that takes days anyway.

The worst case of course is that actually lose your data, probably as the result of a data center fire/explosion, though people should, in most cases, still have the primary copy of the data on their machines. However re-backing this up would take a long time.

vidarh · on Oct 26, 2014

You don't run a serious data storage service on just RAID6.

Data centres do fail, have fires etc.

rythie · on Oct 26, 2014

It's backup not primary storage. If they were going to spend the twice the cost, do you think there would be a least a single mention of this somewhere? anywhere? (I couldn't find one)

This outage post makes no mention of a backup datacenter and they say backups were halted as a result of this outage: https://www.backblaze.com/blog/data-center-outage/

keypusher · on Oct 27, 2014

No, you just replace the drive. I'm am sure they have automated means to notify them of exactly which drive has failed and this is a routine occurrence.