> You have a moment where you envision the future of virtualized storage and think about how great it will be when storage is nearly free and outsourceable and you can stop buying disks from Amazon every few months
The future is now! Instead of you sending money to Amazon and them sending you disks, they keep the disks and you send them the money anyway.
Sadly, depending on what you want to accomplish with your storage, the future is still not here. For example, if you want to share a virtual disk with multiple servers through NFS, you will still need to manually create an NFS high-availability cluster using e.g. DRBD on most cloud providers (e.g. EC2 or Azure). To be fair, MS Azure offers fully managed SMB shares as a service, but the transfer speed is quite low (60 MB/s in the best case) so they are not very useful for most tasks.
Yes. One can build a system that holds a petabyte of data, fits in less than one rack, has multiple redundancy, for about a hundred grand these days. Multiplied by three locations, that'll get you to $300000. Add in another $35000 for three years of hosting at three different datacenters, and you get to $335000. Sounds like a lot, right?
Amazon's even more expensive. Storing one petabyte on Amazon, at the slowest, cheapest Glacier level, will cost you $10000/month, so at about 10 months you've paid a hundred grand. At thirty-six months, you've paid three hundred sixty grand. Plus, you have no hardware to show for it, while Amazon sells that broken, outdated junk to a reseller to regain some of their investment. Had you hosted it yourself, you could have sold all that equipment off to refurbishers and resellers and recovered at least a portion of your investment. So yeah, it is pretty terrible to choose Amazon.
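For anyone who wants to poke at the numbers, here is a rough sketch of the arithmetic above. It uses only the figures quoted in these two comments; the names are made up for illustration, and staff, bandwidth, API calls, and resale value are all left out.

```python
# Back-of-the-envelope comparison using the figures quoted above.
# All names are illustrative; staff, bandwidth and resale value are ignored.

HARDWARE_PER_SITE = 100_000   # one ~1 PB system, under a rack
SITES = 3
HOSTING_3_YEARS = 35_000      # hosting at three datacenters for three years
GLACIER_PER_MONTH = 10_000    # 1 PB at the cheapest tier, as quoted


def self_hosted_total() -> int:
    """Up-front hardware plus three years of hosting."""
    return HARDWARE_PER_SITE * SITES + HOSTING_3_YEARS


def glacier_total(months: int) -> int:
    """Cumulative storage bill after the given number of months."""
    return GLACIER_PER_MONTH * months


def break_even_month() -> int:
    """First month in which the cumulative Glacier bill exceeds self-hosting."""
    month = 1
    while glacier_total(month) <= self_hosted_total():
        month += 1
    return month


if __name__ == "__main__":
    print(self_hosted_total())   # 335000
    print(glacier_total(36))     # 360000
    print(break_even_month())    # 34
```

On these inputs the two options cross near the end of the third year, so the conclusion is sensitive to anything left out of the estimate.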
Commodity cloud services (and pricing) are meant for the 99%, not the outliers. If you have a petabyte of data to store you should either be doing enough other business with Amazon that you can negotiate rates down or you should be rolling your own.
It costs about a hundred and fifty grand annually (fully loaded, base salary 80-90k) to have the lowest end person who could conceivably keep one of these things running well for three years.
n+1 means you need two of them in case one gets the flu.
Tell me again about how Amazon's a ripoff, please.
You're still going to need n+1 sysadmins to maintain the rest of your infrastructure. Maintaining a storage system like this really isn't much different for the day-to-day operations. So, yes, it's one more system, but you're already going to need those admins anyways.
I don't believe maintaining a high availability system with proper checking is less than 3 hours a day of work, or ~10% of someone's time on duty if we're talking around the clock.
So we're talking about $75,000-90,000 a year in salary to maintain the cluster if you want 24x365 coverage (which Amazon provides), as you'll need 2 people per shift, 3 shifts per day at a minimum to have people in house, even if they're only spending 10% of their time actually working on that particular issue. In reality, these are unrealistically small numbers. Each employee will have a cost to the corporation of $125,000-150,000 a year. Six employees of that caliber each spending 10% of their time on supervising a cluster for a year is $75,000-90,000. I'm amortizing the amount of work over your 3 datacenters by imagining you already have the staff and only counting out the number of hours needed for just this work.
So the reality is that you left off something like $225,000-270,000 over three years in actual cost of running your own cluster from your analysis, because while it's not "much different", it is a few hours a week from at least 6 employees if you're really talking about running managed, highly reliable storage.
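The staffing figures can be reproduced with a few lines. This is only a sketch of the estimate as stated above (2 people per shift, 3 shifts, 10% of each person's time, $125,000-150,000 loaded cost); the function name is made up.

```python
# Staffing cost implied by the figures above. Names are illustrative.

SHIFTS_PER_DAY = 3
PEOPLE_PER_SHIFT = 2
TIME_PERCENT = 10   # share of each person's time spent on the cluster


def staffing_cost(loaded_cost_per_employee: int, years: int = 1) -> int:
    """Cost of the cluster's share of everyone's time over the given years."""
    headcount = SHIFTS_PER_DAY * PEOPLE_PER_SHIFT   # 6 people
    return headcount * loaded_cost_per_employee * years * TIME_PERCENT // 100


if __name__ == "__main__":
    print(staffing_cost(125_000), staffing_cost(150_000))        # 75000 90000
    print(staffing_cost(125_000, 3), staffing_cost(150_000, 3))  # 225000 270000
```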
I feel like these comparisons oversell the level of support that actually comes with Amazon. Yes AWS as a whole very rarely goes down, but instances have problems all the time, and who do you call?
When you have your own servers and staff, even if they are just on-call with a pager, you know they are going to work for you on your problem until it is fixed.
In comparison, the sentiment about AWS is this: better build redundancy into the application because at the server layer, you get whatever you get.
This is simply a perspective on systems engineering.
My argument would be that part of the system is always down, the question is which part and how long, and how that impacts the system performance on the whole.
AWS services work well if you build a stateless system which is in some senses "embarrassingly parallelizable", because you can talk about the capacity of such a system, and the impact of non-functioning components is easy to predict. This is how most standard engineering is done, across disciplines.
Traditionally, it has not been the case in computer systems, but most modern techniques advocate using such systems, because they're MUCH more reliable.
You just sound like you need a safety blanket for emotional reasons, not that you're making sound engineering points about how to most cheaply engineer a high availability system.
I mean, do you really believe AWS engineers aren't working hard to keep their system fully functional?
The first post was a back of the envelope estimation, excluding externalities on both sides, to show why someone would want to keep all that "old, obsolete hardware". For Amazon, I didn't put in the costs of pushing data to Amazon in terms of API calls used, bandwidth, etc. Additionally, I posted the cost estimate using their slowest storage system with the least amount of flexibility. The cloud doesn't always save you time and money.
I'm just saying that you omitted a major component, as no one would argue that the trade-off in engineering time isn't one of the main cost-benefit components of considering AWS (as we can see here, where it weighed in at a substantial fraction of your estimate).
It's like forgetting to count the price of hardware, and only talking about the relevant cost of electricity.
> you can stop buying disks from Amazon every few months
This is something that has been puzzling me. Many years ago I purchased 4x 2TB 5900 RPM drives for a 4 bay ReadyNAS (cost about $300 for drives plus ReadyNAS). They have been spinning nonstop for ~4 years [1] and I haven't had to replace a single one. Not even an increase in errors to signal that a drive is going.
Yet - I've worked on a SAN that cost hundreds of thousands of dollars, and we would have to replace a disk about every month.
Granted the disks in SANs probably spin faster (thus faster data access/lower MTBF) - but that high failure rate seems rather suspicious to me.
AFR buddy, AFR. AFR is the annual failure rate; it's typically 2 - 5% of the population per year. With 4 drives you don't have a large enough set to see this in action, just every day you're in danger of losing a drive by a small statistical amount.
In the Blekko cluster we have just under 10,000 drives. We have two 20-drive 'boxes' (40 drives) from Western Digital; as drives fail we pull replacements from the 'new/refurbished' box, and we put the dead one in the outgoing box. When we get up to 20 we RMA them in bulk: 20 go out, 20 more come in. That becomes the new 'new/refurbished' box.
It really isn't SAN vs non-SAN; it is all statistics.
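A small sketch of why 4 drives and 10,000 drives look so different under the same AFR. It assumes failures are independent and the rate is constant over time, which is a simplification; the function names are made up.

```python
# Drive failures under a constant annual failure rate (AFR).
# Assumes independent failures; the 2-5% range is the one quoted above.

def expected_failures_per_year(drives: int, afr: float) -> float:
    """Average number of drives lost per year in a population."""
    return drives * afr


def prob_no_failures(drives: int, afr: float, years: int) -> float:
    """Chance that none of the drives fails over the whole period."""
    return (1.0 - afr) ** (drives * years)


if __name__ == "__main__":
    for afr in (0.02, 0.05):
        print(afr,
              expected_failures_per_year(10_000, afr),
              round(prob_no_failures(4, afr, 4), 2))
```

With these assumptions a 4-drive NAS has roughly a 44-72% chance of going four years without a single failure, while a 10,000-drive cluster expects a few hundred failures a year, which is several per week.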
That said, if you're running your ReadyNAS with raid 10 (mirrored drives in a RAID 0 config) you may find some unpleasantness when a drive does fail. Statistically you have a 1/10 chance of not being able to re-silver the mirror for a 5900 RPM desktop SATA drive. That gets a bit painful.
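The comment doesn't say where the 1/10 figure comes from. One common way to get a number of that order is the unrecoverable read error (URE) rate: a rebuild has to read the whole surviving disk, and a single URE can abort it. The sketch below assumes a spec-sheet rate of one URE per 1e14 bits (a typical figure for desktop SATA drives, not something stated in this thread) and the 2 TB drive size mentioned earlier.

```python
import math

# Probability that reading an entire disk hits at least one unrecoverable
# read error. The 1e14 bits-per-error default is an assumed spec-sheet value.

def prob_rebuild_hits_ure(bytes_to_read: float, bits_per_ure: float = 1e14) -> float:
    bits = bytes_to_read * 8
    # 1 - (1 - p)^bits with p = 1/bits_per_ure, computed stably for tiny p
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))


if __name__ == "__main__":
    print(round(prob_rebuild_hits_ure(2e12), 2))   # one 2 TB mirror member
```

For a 2 TB disk this comes out around 0.15, the same ballpark as 1 in 10. Real drives often do better than their spec sheet, so treat it as an upper-end estimate.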
Just out of curiosity, has anyone done any research into determining whether a drive which fails after X days/years has some properties in the first Y days that could be a signal for future failure?
I'm certain there are more recent statistics but Google's "Failure Trends in a Large Disk Drive Population"[1] (2007) is a good start:
"In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity."
There is also a more recent open source dataset from Backblaze[2] that includes:
"Every day, the software that runs the Backblaze data center takes a snapshot of the state of every drive in the data center, including the drive’s serial number, model number, and all of its SMART data"
which forms the basis of an article correlating SMART data with drive failures at Backblaze[3].
The TL;DR answer is yes, there are some hard drive SMART values that can indicate failure is likely, but they vary by model and don't necessarily show before failure.
yeah, I was wondering if there were measurables that could be correlated with failure before SMART kicked in, even if they were something like date of year, or location of manufacture, or shipping route they took. :P
Generally the most common measurable is sector reallocation errors. This comes from various random things going wrong at the wrong time and the disk re-allocates a new sector from the spare pool to deal with one that has gone bad. In operation, our disks at Blekko pick up sector reallocation errors at a low statistical rate that picks up prior to total failure. Since our infrastructure is triply redundant (three disks hold a copy of every piece of data) we can simply reformat drives which develop sector errors. If you plot the time between sector errors developing over the life of the drive, it gets shorter rapidly as the drive nears complete failure. Sometimes, however, there is no warning; the drive simply fails. As with my previous experience at Google and NetApp before that, there is a small rise in early failure (infant mortality) then a long tail toward a steep failure rate after about 10 years.
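The "time between sector errors gets shorter" signal can be sketched as a simple check. This is a toy heuristic for illustration, not how Blekko or anyone else actually does it; the input is a hypothetical list of drive-age values (in days) at which a new reallocation was observed, and the window size is arbitrary.

```python
from typing import List, Sequence

# Toy heuristic: flag a drive when the gaps between successive sector
# reallocation events keep getting shorter. Input format is hypothetical.

def gaps(event_days: Sequence[int]) -> List[int]:
    """Days elapsed between consecutive reallocation events."""
    ordered = sorted(event_days)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def intervals_shrinking(event_days: Sequence[int], window: int = 3) -> bool:
    """True if each of the last `window` gaps is shorter than the one before."""
    recent = gaps(event_days)[-window:]
    if len(recent) < window:
        return False   # not enough events to call it a trend
    return all(b < a for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    print(intervals_shrinking([0, 400, 600, 680, 700]))   # True
    print(intervals_shrinking([0, 100, 200, 300, 400]))   # False
```

As noted above, some drives fail with no warning at all, so a check like this can only ever catch part of the population.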
> just every day you're in danger of losing a drive by a small statistical amount.
Which is why for important data I always use some sort of RAID (or cloud syncing). If I lose a drive I won't lose all my data (presumably though if I bought both drives at the same time there is a chance that both could fail at the same or close to the same time).
> That said, if you're running your ReadyNAS with raid 10 (mirrored drives in a RAID 0 config)
ReadyNAS and other products use a special type of RAID that is actually kind of clever (that will use all space regardless of disk size). I know many would criticize the special RAID but I've had more issues with Linux software RAID than I have had with ReadyNAS (nothing related to data loss). But suffice it to say if 1 drive fails then in theory the data should still be ok (at least that is what their claim to fame is).
My personal opinion - I feel like the ReadyNAS will die before the drives. I'm not saying they couldn't or won't die - but I feel like that would be highly unlikely and even more unlikely for more than 1 to fail at once.
Maybe I'm just paranoid - but when I worked on my last SAN it felt like the company used the cheapest possible drives with a short MTBF for the simple reason that my employer would have to keep using them and keep paying for their warranty service (this wasn't a big name like Dell).
> Which is why for important data I always use some sort of RAID (or cloud syncing). If I lose a drive I won't lose all my data
Just a friendly reminder that RAID != backup. There are numerous data loss cases that RAID does not deal with.
Personally I use striped ZFS with important volumes periodically snapshotted, replicated to external (and encrypted) disks and then stored offsite (cycle through a couple of sets of disks). Most important data is also periodically synced to cloud storage (as well as offsite disk).
This accounts for:
- Single disk failure (striping)
- Bitrot (ZFS scrubbing can reveal bitrot on disk and correct it from parity)
- Human error (snapshots)
- Catastrophic damage to home NAS (offsite backups)
RAID alone (depending on the particular implementation) will generally not deal with the 3 latter failure cases.
I knew EMC storage was utter shit when, upon attempting to create a new RAID group, I realized that the configuration tool's default was to stripe across drives within a shelf, not to create stripes that span shelves.
Worse, to create the more fault-tolerant, shelf-spanning RAID volumes, one must manually add drives, one by one to the array, in a process that involves about 44 (slight hyperbole) clicks per disk.
And then there was the fact that the configuration tool was Windows-only.
Actually it sounds like the EMC software was trying to save you from something which is a bad practice. If you want raid 1+0 it is better to have two disks on the same controller/chassis in raid 0, then raid 1 (or raid 5/6) across chassis. Otherwise a chassis failure will take out both leaves of your raid mirror.
Having said that, EMC is stupidly overpriced bloat.
Did I really need to spell out the geometry of the RAID I was creating to avoid this kind of nitpicky follow-up?
It was a 20-something disk RAID 10 [1], arranged so that every mirrored pair of disks spanned different enclosures, in order to mitigate the failure of any one shelf — that is, interleaving mirrors across controllers and shelves, exactly as you suggest I should have done — and further, such that any one shelf failing only affects the mirrors that had disks on that shelf.
EMC's software wanted to allocate the drives from two shelves, with an unequal number of drives per shelf. It was just grabbing the next however many disks, linearly.
So, no, they weren't trying to balance the mirror across enclosures or controllers. They just weren't thinking.
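To make the difference concrete, here is a sketch comparing the two allocations. The shelf size of 15 and the idea that linear allocation pairs adjacent disks are assumptions for illustration; the comment only says the tool grabbed the next disks linearly across two shelves.

```python
from typing import List, Tuple

Disk = Tuple[int, int]          # (shelf, slot)
Mirror = Tuple[Disk, Disk]


def linear_mirrors(disk_count: int, shelf_size: int) -> List[Mirror]:
    """Grab the next `disk_count` disks in order and pair adjacent ones."""
    disks = [(i // shelf_size, i % shelf_size) for i in range(disk_count)]
    return [(disks[i], disks[i + 1]) for i in range(0, disk_count, 2)]


def spanning_mirrors(disk_count: int) -> List[Mirror]:
    """Pair slot N on shelf 0 with slot N on shelf 1."""
    return [((0, slot), (1, slot)) for slot in range(disk_count // 2)]


def mirrors_lost(mirrors: List[Mirror], failed_shelf: int) -> int:
    """Mirrors with both halves on the failed shelf (data loss)."""
    return sum(1 for a, b in mirrors
               if a[0] == failed_shelf and b[0] == failed_shelf)


if __name__ == "__main__":
    linear = linear_mirrors(20, shelf_size=15)
    spanning = spanning_mirrors(20)
    print(mirrors_lost(linear, 0), mirrors_lost(linear, 1))      # 7 2
    print(mirrors_lost(spanning, 0), mirrors_lost(spanning, 1))  # 0 0
```

Under these assumptions the linear layout loses whole mirrors when either shelf dies, while the shelf-spanning layout only ever loses one half of each mirror.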
[1] By "RAID 10", I mean "striped mirrors" — that is, create a bunch of mirrors that span shelves and then stripe across them — not "mirrored stripes" which is what you appear to be suggesting, with "it is better to have two disks on the same controller/chassis in raid 0, then raid 1".
A striped mirror is recommended in everything I've ever read on the subject, because it puts the redundancy at the lowest level of the array's geometry.
Using a mirrored stripe, on the other hand, means that when one disk fails, any other disks striped with it, still presumably perfectly functional, can't be used; the controller must instead read from and write to the mirror. If a disk in that mirror subsequently fails, you've lost data — and remember that when striping, the chance of failure is multiplied by the number of disks in the stripe.
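A quick way to see the difference is to ask: after one disk has died, what is the chance that the next failure loses data? The sketch below assumes two-way mirroring, exactly two stripes in the mirrored-stripe case, and that the second failure is equally likely to hit any surviving disk.

```python
from fractions import Fraction

# Chance that a second disk failure is fatal, given one disk is already dead.
# Assumes two-way mirrors and a uniformly random second failure.

def fatal_second_failure_striped_mirrors(disks: int) -> Fraction:
    """RAID 1+0: only the dead disk's mirror partner is fatal."""
    return Fraction(1, disks - 1)


def fatal_second_failure_mirrored_stripes(disks: int) -> Fraction:
    """RAID 0+1: the dead disk's stripe is offline, so any disk in the
    other stripe (half of the array) is fatal."""
    return Fraction(disks // 2, disks - 1)


if __name__ == "__main__":
    print(fatal_second_failure_striped_mirrors(20))    # 1/19
    print(fatal_second_failure_mirrored_stripes(20))   # 10/19
```

For a 20-disk array that is roughly 5% versus 53%, which is the point about putting the redundancy at the lowest level of the geometry.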
Don't get all upset. You said stripe disks in your first comment, not mirror volumes.
In a big system I am usually more concerned with avoiding any SPOF (that will cause downtime), then adding redundancy for the most likely failures.
Individual mirrored pairs on split chassis will mitigate against long rebuild times for a single disk failure but it requires an all software architecture as the individual HBA can't do hot spare rebuild. It also reduces the benefit of the HBA cache. So intra chassis, striped mirrors; inter chassis, mirrored stripes. Of course, mirroring across striped mirrors would be better, but the cost to peak write and capacity might rule that out.
Also, yes I believe you, EMC sucks. But they do have some good engineers.
I guess a "shelf" shares some piece of hardware, such as a controller card or communication backplane, which could itself fail and thus disable access to all the drives on that shelf, which a RAID 4 or 5 stripe that spans shelves could survive?
It reminds me of Douglas Crockford's talk where he mentioned that a company asked him for an exemption to the "do no evil" clause in the JSON License, and he said he didn't want to name the company as that would embarrass them so he would instead give their initials: IBM.
It's not embarrassing; their lawyers are just being cautious, as befits a company with huge customers like theirs. The "do no evil" clause is stupid and utterly ambiguous. If Crockford becomes a fundie, then is any gay-related org in violation of the license?
It's his right to make up ridiculous licenses. Like the sisterware license (you can use the software if you send me a pic of your sister if you have one), people shouldn't take it seriously and should avoid code licensed like that. That Crockford doesn't get this is either him trolling or being clueless.
Yes. The main cause of my beeper going off at 3 AM during the time frame he talks about (early to late 00s) was brand new EMC storage hardware continually collapsing under a moderate load, even though the hardware's specs claimed it should have more than adequately handled the load. A load that the three year old, cheaper storage hardware that the EMC unit replaced had handled fine. Like they did, we did not really fix the EMC hardware, we just found ways to work around the problem.
Come on, read Extremely Massive Corporation again, and try to think of an Extremely Massive Corporation which happens to deal in computer storage systems.
There was a time when they had adverts on seemingly every page of every computing magazine. Although that may have been back when magazines were a thing and Sun were a hardware company.
I'm too young to drink, I figured this reference would work just fine on the GP here... Though the five downvotes I got on my message tell a different story. :-P
Props then, mate. I have been in tech for almost a decade now, and I have literally never heard of EMC until now. Probably has something to do with never working in a datacenter, and never doing anything super cray with storage.
I was thinking similar thoughts from the other direction - that they have turned to money wringing desperation might mean there is no money to get the thing back up. If the situation turns out to be "it's dead Jim, we need to spend several million on a replacement EMC filer or shut down forever", it might be gone along with all the valuable projects hosted there.
In some ways the bundleware served like a SMART warning, causing people to back up and migrate their projects.
The way I discovered SF was down was when I was trying to install some Slackbuilds and kept getting MD5 mismatches on the source downloads. I noticed I was getting the same MD5 for several different packages, so I tried to manually grab a package and that's when I hit SF's error page (which sbopkg was happily downloading and trying to pass off as somesourcepkg.tar.gz).
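A download script can at least tell "corrupt tarball" apart from "the server handed me an HTML error page". This is a generic sketch, not how sbopkg works; the function names and the HTML sniffing rule are made up for illustration.

```python
import hashlib

# Distinguish a good download, an HTML error page saved under the tarball's
# name, and a plain checksum mismatch. The heuristics are illustrative only.

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def looks_like_html(data: bytes) -> bool:
    """Crude sniff: does the payload start like an HTML document?"""
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def check_download(data: bytes, expected_md5: str) -> str:
    if md5_hex(data) == expected_md5.lower():
        return "ok"
    if looks_like_html(data):
        return "error page"
    return "md5 mismatch"


if __name__ == "__main__":
    good = b"pretend this is a tarball"
    page = b"<html><body>Disaster Recovery Mode</body></html>"
    print(check_download(good, md5_hex(good)))   # ok
    print(check_download(page, md5_hex(good)))   # error page
```

Getting the same MD5 for several different packages, as described above, is another cheap tell that every request is returning the same error page.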
I found myself in the odd position of hoping SF would come back up so I could finish what I was doing, when I'd normally welcome the news that they had shut down.
Sadly I too found it was screwed a few days ago when trying to pull some source code hosted there (ironically so I would have my own copy in case things went wrong there...). I just got their 'Disaster Recovery Mode' notice and a subset of binary packages available.
Lots of Google searching for what's wrong with Sourceforge revealed nothing (other than complaints about their packaged installers/crapware). Now I have the answer to the question I was really asking.
I just need to wait and see if the developers of the packages I need will somehow migrate to GitHub so I can get the source...
What's happened to the mailing list archives for all these projects, are they just down the toilet now? And of course the source, this is bloody terrible!
One can hope that if this hypothetical public-facing service never returns, they will ship the backup tapes to the Internet Archive instead of Honest Bob's Social Data Mining and Market Manipulation.
> It writes a full 32 bits of numeric user ID to its filesystem, but to save a few bytes it only stores 16 bits of group IDs. Some engineer probably thought that'd be enough for anybody.
I'm having the same issue with the number of hardlinks, which, for Linux ext4 systems, is limited to 65000.
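For the 16-bit group ID field in the quote, a sketch of what goes wrong once IDs pass 65535. Whether a given filer silently truncates or rejects the value isn't stated in the article; both behaviours are shown here as assumptions, and the 6-byte record layout is invented for the example.

```python
import struct

GID_BITS = 16
GID_MAX = (1 << GID_BITS) - 1   # 65535


def fits_in_gid_field(gid: int) -> bool:
    return 0 <= gid <= GID_MAX


def truncated_gid(gid: int) -> int:
    """What a silent truncation to 16 bits would store (an assumption)."""
    return gid & GID_MAX


def pack_ids(uid: int, gid: int) -> bytes:
    """Pack a 32-bit uid and a 16-bit gid; raises struct.error if gid is too big."""
    return struct.pack("<IH", uid, gid)


if __name__ == "__main__":
    print(fits_in_gid_field(70000))   # False
    print(truncated_gid(70000))       # 4464
    print(len(pack_ids(100000, 4464)))  # 6
```

Silent truncation is the nasty case, since files end up owned by an unrelated low-numbered group.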
"Let's talk about a hypothetical public-facing service that offers tools for collaboration, revision control, and software publishing." - it was hard not to notice that he was referring to SourceForge. :)
> The future is now! Instead of you sending money to Amazon and them sending you disks, they keep the disks and you send them the money anyway.
Progress! :-D