I look after about 15 PB of tier 1 storage, and I'd recommend not doing it the Backblaze way.
It's grand that it's worked out for them, but there are a few big drawbacks that Backblaze have software'd their way around.
Nowadays it's cheaper to use an Engenio-based array like the MD3260 (also sold/made by NetApp/LSI).
First, you can hot-swap the disks. Second, you don't need to engineer your own storage manager. Third, you get much, much better performance (2 gigabytes a second sustained, which is enough to saturate two 10 gig NICs). Fourth, you can get 4-hour 24/7 response. Finally, the airflow in the Backblaze design is a bit suspect.
We use a 1U two-socket server with SAS to serve the data.
If you're brave, you can skip the RAID controller, use a JBOD enclosure instead, and run ZFS over the top. However, ZFS fragments like a bitch, so watch out if you're running at 75%-plus capacity.
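If you do go that route, something as simple as the sketch below is worth running: it just shells out to the stock `zpool` CLI and flags pools past the 75% mark. The threshold and the rest of it are purely illustrative.

```python
#!/usr/bin/env python3
"""Rough sketch: flag ZFS pools that are full enough for fragmentation to
start hurting (the ~75% mark mentioned above). Assumes the standard `zpool`
CLI is on the PATH; the threshold is just the number from the comment."""
import subprocess

CAPACITY_WARN = 75  # percent used

def check_pools() -> None:
    # -H: no header, tab-separated; capacity and fragmentation are standard zpool properties
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,capacity,fragmentation"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        name, cap, frag = line.split("\t")
        if int(cap.rstrip("%")) >= CAPACITY_WARN:
            print(f"WARNING: pool {name} is {cap} full (fragmentation {frag})")

if __name__ == "__main__":
    check_pools()
```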
> Nowadays it's cheaper to use an Engenio-based array like the MD3260
Cheaper than what? Backblaze still builds its own storage pods because we can't find anything cheaper. We would LOVE getting out of that part of our business, just show up with something 1 penny cheaper than our current pods and we'll buy from you ENDLESSLY. NOTE: we have evaluated two solutions from 3rd parties in the last 3 months that were promising, but when the FINAL pricing came in (not just some vague promise, wait for the final contract/invoice) it was just too much money, we can build pods cheaper.
If your goal is to keep your job as an IT person - spend 10 or 100 times as much money on super highly reliable NetApp or EMC. You won't regret it. Your company might regret the cost, but YOUR job will be a bit easier. But if you want to save money - build your own pods. NOTE: remember Backblaze makes ZERO dollars when you build your own pods.
Disclaimer: VCE employee here. VCE is an EMC Federation Company [tm] with investments from Cisco and VMware.
> We would LOVE getting out of that part of our business [...]
That's a common theme when I meet with some unique companies that are still building out their own stacks of kit and just enough software to keep their consumption patterns consistent. Some years it's about necessity, but more often it comes down to assembly-line methods and sunk investments.
I'm curious how much variance you guys have / tolerate, or whether you ever see yourselves reaching VMI (vendor-managed inventory) procurement levels (maybe you are already there?).
Market-wise, it isn't too outlandish to expect many of the "super highly reliable" options to have a future that spans a spectrum from DIY + software (100% roll your own) all the way up to a full experience (the status quo). Today, that roll-your-own option is a progression from hardware compatibility lists to reference architectures, but there are only a few companies that go beyond that.
I'm curious to hear how much you see as "too much money" to keep your team from getting out of the rack/stack/burn/test/QA/QC/deploy cycle. Do you plow the "too much money" less "build pods cheaper" into a value that can be compared to the opportunity cost of the labor involved?
> I'm curious to hear how much you see as "too much money" to keep
> your team from getting out of the rack/stack/burn/test/QA/QC/deploy.
Early on the "too much money" would have been maybe a 20 percent premium. But now that we have scaled up and have all the teams in place, "too much money" is probably even a 1 percent premium. In other words, if we moved to a commercial solution now it would be because we actually SAVE money over using our own pods. That shouldn't be impossible - a commercial operation could make their profit margin out of a variety of savings, like maybe their drives fail less often than ours do because they do better vibration dampening. Or since this is "end-to-end cost of ownership" if the commercial system could save half of our electricity bill by being lower power then they can keep 100 percent of that.
We have a simple spreadsheet that calculates the cost of space rental for datacenter cabinets, the electricity pods use, the networking they use, the cost of all the switches, etc. It isn't overwhelmingly complex or anything. It includes the cost of salaries of datacenter techs with a formula that says stuff like "for every 400 pods we hire one more datacenter tech", stuff like that.
Hopefully if we went to a commercially built storage system we could improve that ratio, like only hiring one more datacenter tech for every 800 pods. The storage vendor could pocket the salary of that employee we did not hire.
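As a rough illustration of that kind of end-to-end calculation, a toy cost model might look like the sketch below. Every dollar figure in it is a made-up placeholder rather than a real Backblaze number; only the 400- and 800-pods-per-tech ratios come from the comments above.

```python
# Toy cost-per-pod-year model in the spirit of that spreadsheet. Every dollar
# figure here is a made-up placeholder, NOT a real Backblaze number; only the
# 400-pods-per-tech ratio (and the hoped-for 800) come from the text above.

def cost_per_pod_year(
    pod_hardware: float = 10_000.0,        # hypothetical pod build cost
    pod_life_years: float = 5.0,           # amortization period
    cabinet_rent_per_pod: float = 1_200.0, # hypothetical cabinet space, per pod-year
    power_per_pod: float = 900.0,          # hypothetical electricity, per pod-year
    network_per_pod: float = 300.0,        # hypothetical switch/bandwidth share
    tech_salary: float = 80_000.0,         # hypothetical loaded datacenter-tech salary
    pods_per_tech: int = 400,              # "for every 400 pods we hire one more tech"
) -> float:
    hardware = pod_hardware / pod_life_years
    labor = tech_salary / pods_per_tech    # salary share carried by each pod
    return hardware + cabinet_rent_per_pod + power_per_pod + network_per_pod + labor

diy = cost_per_pod_year(pods_per_tech=400)
vendor = cost_per_pod_year(pods_per_tech=800)
print(f"1 tech per 400 pods: ${diy:,.0f} per pod-year")
print(f"1 tech per 800 pods: ${vendor:,.0f} per pod-year")
print(f"labor saved:         ${diy - vendor:,.0f} per pod-year")
```

With these placeholder numbers the halved tech ratio is worth on the order of a hundred dollars per pod-year, which is the kind of margin a vendor could pocket.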
Or heck, maybe a large vendor can get a better price on hard drives than we get. We buy retail.
> reaching VMI (maybe you are already there?) procurement levels?
I'm not completely sure I understand the question, but here is one answer: in a modern web and email and connected world with tons of competition, I'm not sure there are any economies of scale past the first 10 computers you build. Places will laser cut and fold sheet metal in units of "1" for extremely reasonable prices, it's AMAZING what the small manufacturers can do for you nowadays.
So, looking at what you guys are doing, at least based on your design from four years ago, have you looked at what the Open Compute Project is doing?
At 4U, you guys can fit 45 drives in your case. The equivalent Open Compute Knox would be 6U, at 45 drives. They have a different backplane arrangement, and more redundancy than y'all, so I expect it'll be a little bit more expensive, but with the benefit of not having to manage your own supply chain.
I think it's a use case question - obviously you'd never want to use a backblaze storage pod for scenarios where you'd be using only one or two pods, or for anything with high random I/O requirements, for all the reasons you describe.
But there's a lot of use cases these days where the majority of the data in question is "cool to warm" as opposed to hot/active. While vendors like NetApp and EMC would be happy for you to pay tier-1 vendor prices to store cool data on high-performance storage, if your app can handle it, it's not really necessary.
Think about photo storage, backups, database archive/transaction logs, log aggregation, etc. etc.
The Pods are not general purpose storage devices. They are literally as close to tape as you can get without getting dirty.
The average throughput of our tier 1 storage is around 10 gigabytes a second (not bits), with a peak of 25-30. However, they are just 60-disk arrays with four RAID-6 LUNs (with 4 hot spares), dual 10 gig servers, and 4 TB disks.
They are just plain NFS servers with XFS. They are fairly good general-purpose fileservers, great for high streaming IO. Replace those 4 TB disks with SSDs or 15k disks and you've got amazing random IOPS.
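As a back-of-envelope on that layout (the even 14-disks-per-LUN split is an assumption; the 60-disk, 4-LUN, 4-spare, 4 TB figures are the ones above):

```python
# Back-of-envelope usable capacity for the layout described above: 60 disks,
# four RAID-6 LUNs, 4 hot spares, 4 TB drives. The even 14-disks-per-LUN
# split is an assumption, not something stated in the comment.

TOTAL_DISKS = 60
HOT_SPARES = 4
LUNS = 4
DISK_TB = 4

disks_per_lun = (TOTAL_DISKS - HOT_SPARES) // LUNS   # 14
data_disks_per_lun = disks_per_lun - 2               # RAID-6 spends 2 disks on parity
usable_tb = LUNS * data_disks_per_lun * DISK_TB      # 4 * 12 * 4 TB = 192 TB

print(f"{disks_per_lun} disks per LUN: {data_disks_per_lun} data + 2 parity")
print(f"~{usable_tb} TB usable per 60-disk array, before filesystem overhead")
```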
> the average throughput of our tier1 storage is around 10 gigabytes a second (not bits)
Any one Backblaze storage pod is absolutely limited by the 1 Gbit (not bytes) ethernet interface. So if you need your data sequentially from one pod it is not a very fast solution.
But remember we (Backblaze) have 863 pods (with another 35 being installed each month), each pod has 1 Gbit connection directly on the internet. So actually our "fleet" supports peak transfer rates of close to 863 Gbits/sec (107 GBytes/sec).
Backblaze storage pods aren't the right solution to all problems, but if your application happens to be parallelizable don't count them out as "slow".
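Spelling out the arithmetic behind those figures (the pod count and the 1 Gbit/s per-pod uplink are the numbers quoted above):

```python
# The fleet arithmetic from the comment above: one pod is capped at 1 Gbit/s,
# but pods are independent, so aggregate throughput scales with pod count.
# The 863-pod figure is the one quoted above.

PODS = 863
GBIT_PER_POD = 1                        # single 1 Gbit/s uplink per pod

aggregate_gbit = PODS * GBIT_PER_POD    # 863 Gbit/s across the fleet
aggregate_gbyte = aggregate_gbit / 8    # ~108 GB/s (the comment quotes 107)

print(f"fleet peak: ~{aggregate_gbit} Gbit/s (~{aggregate_gbyte:.0f} GB/s)")
print(f"single pod: 1 Gbit/s (~{1000 / 8:.0f} MB/s) for any one sequential stream")
```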
> Do you see the need yet to dump in a 10 gig card?
We have a few of the pods in the datacenter running with 10 Gbit copper on the motherboard as an experiment. I'd have to double check, but it raises the price of the motherboard about $300? That's not bad AT ALL (only a 3% cost increase), the real cost hit is if you want to utilize the 10 Gbit you have to buy a 10 Gbit switch, which raises the cost of a 48 port network switch by thousands of dollars.
Netgear ProSAFE XS712T is a (recent) fairly cheap 10 Gbit switch. I haven't tried it yet, but soon.... And hopefully the fact that it exists will drive down the ridiculous price gouging of the more "datacenter" brands. Not to say Backblaze is above putting consumer hardware in our datacenter when the quality and price are right. :-)
For my home storage systems I've been looking at using InfiniBand instead of Ethernet beyond the 1 Gbps link layer (for iSCSI and FCoE LUNs to boot thin clients off of). The cost of InfiniBand switches is far lower than Ethernet, and comparable 10 Gbps cards are a bit cheaper, too. With Ethernet you don't have to think about software compatibility much, and since InfiniBand shows up on a lot of enterprise-class hardware, you get a fair chance at support across OSes too.
But in this case, they're exposed to the internet at large, not using a high-speed internal backbone, so Infiniband isn't going to let him talk to the internet, and then he's got two links per machine.
I'm not sure what the actual network topology is supposed to be, nor do I really understand their data replication methods across these storage pods, but they can always use an InfiniBand-to-Ethernet bridge/gateway, and that'd centralize the conversion to the egress side of their switches without mucking up overhead locally. Not like you can stick an F5 load balancer right on top of a Mellanox switch and expect it to be hunky-dory, but there are lots of cheap options to be explored is all I'm saying.
It's an old post with old pricing; I simply don't see vendors pricing close to their system (assuming you're looking at the fourth iteration), especially not including support.
You're suggesting something that costs $30K minimum. Given the number of vaults they have, they can afford a lot of additional pairs of hands for that saving.
Plus the software has inherent value; it isn't purely a cost centre like you imply. The value of Backblaze is partly made by that very same software you scoff at.
I'm not scoffing, it's a cost. The only reason to do a Backblaze is to save on cash.
They are using really wide RAID-6 stripes, with horrific rebuild times. No hot-swap ability.
Put another way: with 100 pods, you're going to be losing 10 drives a week (on average, sometimes more), and it takes 48 hours to rebuild an array (although I heard it was upwards of 72 hours, and that's without load).
So every week you have to power down up to 10% of your storage to make sure you don't lose future data. Or you can wait till you've exhausted all your hot spares, but then you're in the poo if you lose a disk on power recycle. (Yes, RAID-6 allows two disk failures, but what's to say you won't lose two disks? As I said, they are using large stripes (24?), so there is a high chance of a RAID rebuild failing.)
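To sanity-check the 10-drives-a-week figure, here's the arithmetic under a few assumed failure rates (45 drives per pod is assumed, and the AFRs are illustrative):

```python
# Sanity check on "10 drives a week per 100 pods". Assumes 45 drives per pod;
# the annualized failure rates (AFRs) below are illustrative, not measured.

DRIVES = 100 * 45   # 100 pods x 45 drives = 4,500 drives

for afr in (0.04, 0.08, 0.12):
    per_week = DRIVES * afr / 52
    print(f"AFR {afr:.0%}: ~{per_week:.1f} failed drives per week")

# It takes an AFR north of ~11% to hit 10 drives/week; at a more typical
# 4-5% AFR the fleet sheds roughly 3-4 drives a week instead.
```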
Hot swap is awesome. Running low on hot spares? Sure, replace online, walk away happy.
To make a Backblaze work you need to keep more data in tier 1 storage, which means higher cost (even if you use a software layer over the top to do FEC, you still need at least 1/3rd more disks).
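For reference, the "1/3rd more disks" figure lines up with common erasure-coding layouts; the overhead is just parity shards over data shards (the (k, m) pairs below are illustrative):

```python
# Where "at least 1/3rd more disks" can come from: with k data shards plus
# m parity shards of erasure coding, the raw-capacity overhead is m/k.
# The (k, m) pairs below are illustrative, not anyone's actual layout.

for k, m in [(12, 4), (17, 3), (10, 5)]:
    print(f"{k}+{m}: {m / k:.0%} extra raw capacity")

# 12+4 is ~33% overhead ("1/3rd more disks"); how much you really need
# depends on the durability target and how wide you're willing to stripe.
```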
Then there is the woeful performance.
If you want JBOD with large scalability, just look at GPFS. The new declustered RAID is awesome, plus you get a single namespace over zettabytes of storage, with HSM so you can spin cold blocks out to slower, cheaper storage (or tape).
It blows everything else out of the water: Ceph, Gluster, Lustre, GFS2 & Hadoop. Fast, scalable, cross-platform and real POSIX, so a real filesystem which you can back up easily.
> They are using really wide RAID-6 stripes, ... no hot-swap ability.
SATA drives are inherently hot-swappable; you can hot-swap drives in a Backblaze Storage Pod if you want to. In our datacenter, for our particular application, we just choose not to. We shut down the storage pod, calmly swap any drives that failed, then boot it back up. The pod is "offline" for about 12 minutes or thereabouts. The RAID sync then happens while the pod is live, and the pod can be actively servicing requests (like restoring customer files) while the sync occurs.
Interesting, do you find that the power up causes more disks to pop? (with the change in temperature?)
How often do you see multi-drive failures (in the same LUN)?
We've just passed a peak of drive failures (we had a power outage) where we've had multi-drive fails in the same LUN. The awesome part about that is that each rebuild caused more drives to fail. I think one array swallowed 5 disks in two weeks. It didn't help that the array was being hammered as well as rebuilding (70-hour rebuilds at one point).
Anytime we reboot a pod, it runs a "significant" risk of not coming back up for a variety of reasons, and if a pod is powered off for several hours it is even worse. My completely unfounded theory is that cooling off and heating back up is bad; with short power-downs things don't really cool off as much.
> How often do you see multi-drive failures (in the same LUN)?
It happens - usually for a reason. Most recently a bunch of drives of one model all seemed to fail very closely together after running for 2 years, it's like some part wore out like clockwork. We are pretty relaxed about 1 drive popping, nobody gets out of bed and runs to the datacenter for that, it can wait until the next morning (the pods instantly and automatically put themselves in a state where customers can prepare restores but we stop writing NEW data to them to lighten the load on the RAID arrays). So if 2 drives fail within 24 hours on a Sunday then Monday morning we go fix it.
> each rebuild caused more drives to fail.
Same exact experience here, we feel your pain. We even have a procedure where we line up all brand new drives, clone each individual original drive to a new drive, then take all the new drives and insert them back into the original pod AND THEN run the RAID rebuild.