"A relevant observation from our Operations team on the Seagate drives is that they generally signal their impending failure via their SMART stats. Since we monitor several SMART stats, we are often warned of trouble before a pending failure and can take appropriate action. Drive failures from the other manufacturers appear to be less predictable via SMART stats."
~10 years ago, I remember google research put out a highly cited paper wherein they found that SMART stats were not a particularly strong indicator of impending drive failure (50% of drives had no SMART indications of problem before failure). http://research.google.com/pubs/pub32774.html
Has this now changed (at least for Seagate)?
Reliability/longevity is nice but a signal of impending failure is far more valuable from an operations point of view.
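For anyone wanting to watch for the same signals on their own machines, here is a minimal sketch built on smartmontools' smartctl, assuming a Linux box with a drive at /dev/sda. The five attribute IDs are the ones Backblaze has said they monitor; the "any nonzero raw value is a warning" rule is a simplification of mine, not their actual policy.

    import subprocess

    # SMART attributes Backblaze has cited as failure predictors.
    WATCHED = {
        5:   "Reallocated_Sector_Ct",
        187: "Reported_Uncorrect",
        188: "Command_Timeout",
        197: "Current_Pending_Sector",
        198: "Offline_Uncorrectable",
    }

    def check(device="/dev/sda"):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            # Attribute rows look like: ID# NAME FLAG VALUE WORST THRESH
            # TYPE UPDATED WHEN_FAILED RAW_VALUE, so fields[9] is the raw value.
            if len(fields) >= 10 and fields[0].isdigit():
                attr = int(fields[0])
                if attr in WATCHED and fields[9].isdigit() and int(fields[9]) > 0:
                    print(f"{device}: {WATCHED[attr]} raw value is {fields[9]}")

    check()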
Hi! Yev from Backblaze here -> Yes, we only report the stats of what we have in our environment. As much as we'd love to have a test of SSDs in a pod (augmented for SSDs of course) they're just not feasible from a cost per GB perspective. Hopefully sometime though :)
Input/Output rate, bandwidth and IO roundtrip delay.
* even the slowest SSDs have significantly higher I/O rates than the best mechanical drives, and the comparison between best-in-class mechanical and enterprise-class PCIe SSDs is just ridiculous: a 15K SAS drive will do 200 IOPS, a high end SSD will do a million
* 15K SAS drives will top out around 250MB/s on bulk sequential reads (that's a best-case scenario), high-end PCIe SSD are in the 2.5GB/s range
* HDDs have a latency of 10~20ms, SSDs have a latency of 100~200µs (RAM has a latency of ~100ns)
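A quick sanity check on those numbers: at queue depth 1, IOPS is just the reciprocal of per-I/O latency (the latencies below are the rough figures from the list above, treated as assumptions).

    # At queue depth 1, IOPS = 1 / latency. An HDD can't hide seek latency
    # behind parallelism (one head assembly); SSDs reach ~1M IOPS by serving
    # many requests concurrently at high queue depths.
    cases = {
        "15K SAS HDD, ~5 ms seek+rotate": 5e-3,
        "SATA SSD, ~100 us": 100e-6,
        "PCIe SSD, ~20 us": 20e-6,
    }
    for name, latency_s in cases.items():
        print(f"{name}: ~{1 / latency_s:,.0f} IOPS at queue depth 1")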
Have you productized these learnings in a powertop-like tool for Linux?
Smartmontools are not intuitive enough for the layman to use in any meaningful way, and Backblaze has really built some serious learning here that could be of use to everyone.
I suspect rotating drives have a variety of failure modes, some of which can be predicted by SMART and others which likely cannot.
Each new model is probably bound to have a different Pareto distribution of failure modes.
Now, if only Seagate had human-readable SMART values.
(I say this as I've recently built a FreeNAS box with a combination of Seagate NAS and WD Red HDDs - the WDs make it easy to look at the SMART stats and know what's going on. The Seagate ones, not so much.)
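For what it's worth, the decoding commonly reported in forums for Seagate's error-rate attributes (1, Raw_Read_Error_Rate, and 7, Seek_Error_Rate) is that the 48-bit raw value packs an error count in the high bits and a total operation count in the low 32 bits, which is why the raw numbers look alarmingly huge. That's community folklore rather than a documented format, so treat this sketch as an assumption:

    def decode_seagate_raw(raw: int) -> tuple[int, int]:
        # Reported interpretation: high 16 bits = errors, low 32 bits = total
        # operations. Unverified against any Seagate documentation.
        errors = raw >> 32
        operations = raw & 0xFFFFFFFF
        return errors, operations

    # Hypothetical raw value as smartctl might display it:
    errors, ops = decode_seagate_raw(0x00000D2F12A4)
    print(f"{errors} errors in {ops:,} operations")  # 0 errors in 221,188,772 operations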
HGST, formerly Hitachi Global Storage Technologies, part of WD as of 2012. The cool thing about them was that their consumer Deskstars were at least as reliable as enterprise disks by other manufacturers. I still have 12-year-old PCs here with HGST 80 and 160 GB drives that were subject to daily use and a lot of inappropriate handling. The Deskstars don't mind.
Very unfortunately, HGST has apparently scaled back Deskstar sales and development significantly since the acquisition. I guess it has to do with WD selling off some of HGST's 3.5" assets to Toshiba in order to appease competition authorities. See also https://news.ycombinator.com/item?id=10057519
"In May 2012, WD divested to Toshiba assets that enabled Toshiba to manufacture and sell 3.5-inch hard drives for the desktop and consumer electronics markets to address the requirements of regulatory agencies."
Added link to an older comment of mine that addresses the HGST/Toshiba thing. To me, it looks like newer Toshiba 3.5" models are based on Fujitsu tech (if the enclosure design is any indication). Also, Toshiba might abandon their HDD business completely. [1]
Fujitsu :( I worked at a Fujitsu wholesale distributor around 1999; EVERY SINGLE drive sold between 1999-2001 died within 3 years (PB15/PB16). Those were great drives, cheap, silent, fast, and they smelled great fresh from the factory due to pine sap rosin.
Allegedly the Cirrus Logic controller had a manufacturing defect and died due to heat. Myself, I always suspected that very peculiar and strong-smelling rosin flux. The PCB was drenched in it, and this type of flux is usually highly activated and requires cleaning; otherwise the acid will eat the solder joints and copper away, especially in humid and hot environments.
I had one fail, got a replacement, and with the replacement and its replacements continued that cycle until a new generation of drives came out, at which point I sold the stupid thing on eBay. It was unreal.
Of course, the reality is China wanted a piece of WD, and used the merger as leverage to get it. I would expect that by 2017, HGST drives will be just as shit as WD's. Which is unfortunate, because the Japanese designed one hell of a hard drive.
Thank you for sharing this! I actually was naive enough to believe that WD would continue to let HGST operate as a separate entity.
There is probably some good news in the article though, for what it's worth: "At that time John Coyne ran WD. He since retired, with HGST boss Steve Milligan taking on his job." ... "In other news Western Digital has announced a new executive management team, and it looks almost like HGST executed a reverse take-over of Western Digital." ... "A person who was close to the corporate action in Western Digital and HGST said: 'All key positions are with HGST people; it's a reverse buyout. First HGST took Coyne's money to buy themselves (probably with a clause that Milligan is becoming CEO) and then they watched WD dismantling itself.'"
The 400GB Hitachi Deskstar in my old old Dell Desktop (Dimension 9100) so think early 2005, was still going strong before the power supply in that desktop died in 2013 or so. It had about 20 bad sectors according to SMART, but it still was chugging along.
Quantum was the premier maker of SCSI drives back in the day. They were beating IBM and IBM needed more capacity so IBM bought them. Then IBM sold to Hitachi, who sold the drive business to Western Digital who sold the drive business to Toshiba.
I believe these Deskstar types and derivatives are the same essential mechanism and processes as those old quantum drives (probably especially in terms of the QA processes). The heads and the technology have improved to give better capacity of course, but I've been buying and relying on these drives for ~25 years at this point.
I'm not surprised to see them showing up well on these charts.
Not quite the full picture. And there's bad news: Newer Toshiba desktop drives look more like the server stuff that they have been manufacturing since acquiring Fujitsu's HDD business. Plus, Toshiba might give up the HDD business entirely. [1]
Seems like the WD effect has appeared on their 8TB drives :-/ My NAS has 5x 4TB HGST drives I bought due to previous Backblaze reports, and I'm waiting to figure out which 8TB drives I should buy - the Seagate Archive had really bad real-world usage reviews and now HGST seems to be slipping as well :-(
If you look at the number of drives and the length of time they've had the drives, you quickly realize that you need to take it with a grain of salt. They even state in the article they don't have enough data to come to any conclusion on those drives.
I don't really understand their methodology for computing failure rate. The page says they calculate the rate on a per annum basis as:
([# drives] * [# failures]) / [operating time across all drives]
Wat? The numerator and denominator seem unrelated. What is being measured here?
To me, it would make more sense to look at time to failure. Together with data on the age of the drive and the proportion of failures each year one could create an empirical distribution to characterise the likelihood of failure in each year of service. That would give a concrete basis from which to compare failure rates across different models.
Are you referring to the "(100*drive-failures)/(drive-hours/24/365)"? There's no multiplication of total # of drives and # of failures in there.
It's all just a scaling: you have a number of broken drives in a corner of the datacenter in the wire bucket that says "broke during 2015", you count them, divide by total hours of that type of disk running (since they may have been brought in commission at different points), and then scale it so you get it in percent-per-year, not likelihood-per-hour.
It smells of someone explaining code, rather than illustrating an important engineering formula, but there's nothing wrong with the rescaling calculation per se.
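For concreteness, here is that rescaling as a minimal Python sketch (the function name is mine; the formula is the one quoted above):

    def annualized_failure_rate(failures: int, drive_hours: float) -> float:
        # Backblaze's published formula: failures per drive-year, in percent.
        drive_years = drive_hours / 24 / 365
        return 100 * failures / drive_years

    # 100 drives running all of 2015, 5 failures:
    print(annualized_failure_rate(5, 100 * 24 * 365))  # -> 5.0 (% per drive-year)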
> Are you referring to the "(100*drive-failures)/(drive-hours/24/365)"? There's no multiplication of total # of drives and # of failures in there.
Perhaps the problem is the specific example given. 100 is the size of the drive fleet and also the multiplier required to convert to percentages. Let's assume you are right and the 100 in the equation is not #drives.
Even so, I find the approach questionable. If the point is to calculate the proportion of failures then that (overly simplistic) calculation is: [# failures] / [# drives] = 5 / 100 = 5%.
But this isn't what's calculated. Instead the author calculates the proportion of drive-years per annum affected by failure. For the 100 drives in the example the cumulative number of operational hours given in 2015 is 750K hours (out of a possible 876K hours, had the drives been operating 100% of the time).
That's a problem because 750K / 876K = 85.6% of total time, i.e. 85.6 drive-years for the fleet.
5 / 85.6 = 5.84% "failure rate", which seems to me an overstatement.
The problem gets worse as the number of operational hours decreases. Imagine for a moment the 100 drives only operated 50% of the time in 2015. We have:
(100 * 5) / ((876K * 0.5) / 24 / 365) = 10% "failure rate". This despite only 5% of the drives having failed.
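Reusing the annualized_failure_rate sketch from above, the disagreement is easy to reproduce with the example's numbers (100 drives, 5 failures):

    print(annualized_failure_rate(5, 876_000))        # 100% uptime  -> 5.0%
    print(annualized_failure_rate(5, 750_000))        # 85.6% uptime -> ~5.84%
    print(annualized_failure_rate(5, 876_000 * 0.5))  # 50% uptime   -> 10.0%
    # Per drive-year of operation, 10% is correct; as a fraction of drives
    # owned, it's 5%. Which one "overstates" depends on the question asked.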
Survival data is tricky to model. They have just done an overall average number of failures per running hour. There do seem to be a lot of factors not taken into account here, such as drive age (in both running time and elapsed time), etc.
With right-censored data (as this is), if you measure age at death then you're only modelling already-failed drives, so you'll under-represent the good drives.
It would be good to see some statistics done so we can see confidence intervals around a hazard rate at different ages.
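For anyone who wants to try exactly that, a minimal right-censored survival sketch using the lifelines package (the toy table and column names are made up; real inputs would be per-drive ages and failure flags):

    import pandas as pd
    from lifelines import KaplanMeierFitter

    # Per-drive table: age in days, and 1 = failed, 0 = still running (censored).
    df = pd.DataFrame({
        "age_days": [30, 400, 800, 1200, 1500, 1500],
        "failed":   [1,   1,   0,    1,    0,    0],
    })

    kmf = KaplanMeierFitter()
    kmf.fit(durations=df["age_days"], event_observed=df["failed"])
    print(kmf.survival_function_)    # estimated P(drive survives past age t)
    print(kmf.confidence_interval_)  # the confidence bands asked for above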
Yep, it is not science: they don't even consider a null hypothesis.
It is raw data that they provide in an excellent fashion.
You can consume it raw and trust that their high drive counts make the numbers statistically meaningful, or you can use their data for real science. Either way, it is a great deal that they take the time to share it.
A useful additional metric is the age of the drive at failure.
This would determine whether the failure rate is constant over the life of the drive (meaning random failure) or age related (infant mortality or old age).
25 drives that fail after 1 week plus 25 that fail after 50 weeks is different to 50 drives that fail one per week.
Luckily they open source the operational and SMART status for all of their drives[0]. This means that you can do this additional analysis (and more). Which is awesome.
Brilliant, thanks for letting me know. I'm studying Logistics Engineering and reliability analysis is part of my degree. This will make a great case study :)
To save you some time, they did this analysis previously. I can't find it, but the summary was that the failures happen at the two ends and not a lot in the middle. ie. a bunch die early (infant mortality) and the rest die pretty late (old age).
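A rough sketch of that age-at-failure analysis against the published daily CSVs (their schema includes date, serial_number, model, failure, and smart_9_raw = power-on hours; the file name is illustrative, and you would concatenate every daily file to get the complete failure set):

    import pandas as pd

    df = pd.read_csv("2015-12-31.csv",
                     usecols=["model", "failure", "smart_9_raw"])
    failed = df[df["failure"] == 1].dropna(subset=["smart_9_raw"])

    # Bucket age-at-failure into whole years of power-on time; the bathtub
    # shape (many early deaths, many late ones) shows up in this histogram.
    failed = failed.assign(
        age_years=(failed["smart_9_raw"] / 24 / 365).astype(int))
    print(failed.groupby("age_years").size())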
The whole article reads like the excuses of someone with a vested interest in discrediting evidence of their favorite brand's poor performance. I don't think the takeaway from the data provided by Backblaze is "I can expect to get a failure rate of exactly 1.231971% if I buy brand X's hard drives." The end-user-useful conclusions are things like "HGST's drives are the best," and "6TB drives are less reliable than 4TB drives right now."
Sure, all of the factors listed in the criticism may play a role in the failure rates (except the external enclosure bit, since A. the majority of the "shucked" drives were 3TB, and B. they've outgrown that practice). But they only have the weakest of justifications for believing that those factors vary systematically across the manufacturers. And indeed, even if those factors did vary systematically, we'd still get the right answer if we had made the more general conclusions. For example, if the vibrations in the Seagate-only enclosures are greater than the vibrations in the HGST-only enclosures, that can only be because the HGST drives are better and vibrate less. Or alternatively, maybe the pods all vibrate the same, but HGST is better because it is more resistant to vibrations.
True, and what those criticisms actually show is that Backblaze's data is highly relevant for the average consumer.
I regularly buy external HDDs, rip them out and put them into desktops and laptops, put them back in different enclosures, and so on. As a result, my HDDs experience a lot of movement and extreme temperatures (e.g. being left in the trunk of a car on a hot summer day). It's good to know which models are the most likely to survive such abuse in the long term.
Thanks for this. Although it makes some important points, it reads as if the author is annoyed that Backblaze's data from tens of thousands of drives is getting so much press, compared to the rather useless single-drive reviews published by sites like TweakTown.
It's also a bit disingenuous to criticize Backblaze's methodology when you know that a 'comprehensive study' under more controlled conditions will NEVER actually happen with the necessary sample size to draw conclusions.
Stress testing is a valid methodology for determining reliability - e.g. car makers crash their cars into walls at high speed to make sure they are safe, or use a robot to push the brake pedal a million times to see when it fails - so they hardly deserve criticism for pushing the drives hard. More information for the consumer is a good thing.
I read the same thing a year ago and I came away actually upset at the tweaktown article. Off the top of my head, I remember some of the complaints being that the drives were subject to abnormal amounts of heat and that the drives were consumer-level drives.
I remember a study Google did on hard drive reliability and it seemed to show that heat had little to no effect. I also don't regard consumer-level as being a bad thing. As a consumer, I kind of want to know which drives are built to take abuse better. All drives fail; which drives fail more and at what cost?
The tweaktown article did talk about temperature. I think you were right to feel they were being silly with that. Temperature MAY correlate with failure but Backblaze found it did not do so within the ranges they actually see in their environment. Something about which it appears they would have more than enough data to be able to compute.
Google's study some time ago found that temperature either didn't correlate with failures or, in the ranges they ran their machines, had an inverse correlation with failure. It would appear to be one problem disk manufacturers have largely surmounted.
The Tweaktown article is a straight hit piece. I'm not really sure what the motivation would be. The writing is so sloppy and negative that it's hardly compelling.
I didn't buy their argument that Backblaze's early drive failures (first week or what have you) can be explained by their purchasing methods. My understanding is that they still see these well after they stopped buying from Costco etc.
I've noticed this to an extreme degree on HN lately to the point that I upvote posts I disagree with because they present a valid point. The only reason I can see for it is that they have a different opinion from most HN readers.
Not just the Bay Area either. They coined the phrase "hard drive shucking", 'cause they were buying consumer external USB drives, then digging the drive out and throwing away the enclosures.
That story is linked down near the end of the article:
Yev from Backblaze here -> Not ALL the hard drives, but yes. In fact we explicitly told anyone that was out buying hard drives for us to leave some on the shelves for the average consumers going into the stores; hopefully they listened.
HGST and its cousin HDS don't get nearly the recognition they deserve in the North American Enterprise storage market. Their products have, in my experience, always offered phenomenal value and rock-solid reliability at very reasonable prices. HDS arrays in particular are pretty great at outperforming 'big name' storage vendors at far lower prices.
I think they changed brands too many times to keep their enterprise reputation intact. Few people even remember that HGST used to be Hitachi which used to be IBM.
On the other hand, those who do remember IBM hard drives probably remember them as the Deathstar, so HGST might not want to be associated with their old home so much.
What would be really helpful is if they could simply put some amazon links on this report to the drives with the best reliability according to their tests.
People will always be sour and accuse you of things.
But actually I don't see how this makes them biased in any way. All drives essentially sell for the same amount (and Amazon pays a percentage of that), so if you trust the info as being accurate (and why wouldn't it be?) then how could it be biased, given there is such little latitude in pricing?
And who is going to accuse them anyway? People who read HN? If so, so what?
The data presented is a nice shortcut answering the question of "which drive should I buy" without having to read all of the charts and most importantly think.
Lastly, you don't have to buy from amazon just because they give you a link but it does make it easier to see a price and compare to whatever vendor you might typically use (or provide several links to different vendors).
Brian from Backblaze here. If Backblaze would become an Amazon affiliate, IF you clicked our link and then purchased a hard drive, Backblaze would get about 3% "kickback" from Amazon! (That's the way the Amazon Affiliate program works, you provide a link and you get 3% kickbacks.) The problem is we would look like we are "pushing" drives to get the 3% kickback and it damages our credibility and reputation.
As a backup company, we hold ALL our customers data, so our reputation is incredibly important to us. People MUST trust us as impartial and trustworthy and not sleazy or we would go out of business quickly.
> The problem is we would look like we are "pushing" drives to get the 3% kickback and it damages our credibility and reputation.
1) So what does it look like now with what you are doing? For example, you are offering free, credible information about drive reliability, which cuts against what you actually do, which is make worrying about drive reliability irrelevant. While I am sure that the following is not the case, I could easily say that you are doing this to make people think drives aren't reliable and hence they need Backblaze! Wow, look at drive failure, I shouldn't DIY this! (Do I think that is your strategy? To repeat, I don't.)
2) Note that http://www.dpreview.com was purchased by Amazon and it has only grown larger and more reputable (in terms of the reviews) since then. And they openly link to Amazon and they could easily be accused of a tremendous bias but apparently they either aren't worried about that or the effect is nominal.
3) I can fully understand, as a business decision, why you might not want to "cheesy" up (my words) your site with amazon links or perhaps you might feel the 3% is not consequential enough to do so. It is certainly a judgement call. However don't assume that everyone that would be a potential user of your company really would think that way because I can assure you that isn't the case.
> we hold ALL our customers data, so our reputation is incredibly important to us.
The fact that you are earning money from affiliate links does not mean you are not reputable and doesn't give me any less confidence that my data will be safe. It's a non issue (for that reason). You have a right to earn money in any reasonable fashion. Affiliate links are an accepted way to earn money (we aren't talking about selling customer data). If anything I think almost the opposite. I want to know that you are making money and robust in business practices so you have the funds to insure your operation will continue for the foreseeable future.
> perhaps you might feel the 3% is not consequential enough to do so
We struggle with it internally, I assure you we doubt ourselves all the time. :-) Some companies have an "informal fun loving" outward appearance, like if you purchase from Zappos they send emails like "the magic elves are making your shoes, we will send them along very soon..." But bankers tend to wear suits and ties and appear "very serious" in their communications even while frittering away your money on sub prime mortgages.
Anyway, the point is I'll forward your note along and heck, maybe next quarter our drive stats blog post will have Amazon links and we'll make a little extra money. :-)
However it's important that you wrap this in the proper words [1] not just plop the links on the page.
You need to explain the links but without apologizing for putting them there. You can even say perhaps that you were asked to do this (because you were). And don't chicken out and say you are donating the $$ to charity or anything like that.
Depending on how you write this, you will minimize the whiny blowback (if any). That said, running a business is not running a popularity contest to the tune of the most vocal commenters on HN or reddit or wherever.
If you are not doing so already you might want to issue traditional press releases with your results as well.
Of course if you do the links (and I would try this for more than one quarter) if it works or if it doesn't work you can then do a blog post on that!
[1] In the business I am in we charge for a service that our other competitors give away for free. By wrapping it in the proper words we often get a thank you instead of a complaint.
Considering a 4 TB hard drive has to track 32,000,000,000,000 individual bits, allowing reading and writing repeatedly of each one, on platters that are spinning 120x per second, spaced a hair's width from their heads... I think it's actually incredible.
As for SSDs, we keep wishing that we could switch to them, but they're still 10x more expensive on a $/TB basis. That may change in the next few years, and if it does, we'll look forward to sharing data on SSD usage at scale as well.
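For the skeptical, the arithmetic behind that bit count (assuming decimal terabytes and a 7200 RPM spindle):

    bits = 4e12 * 8   # 4 TB -> 3.2e13, i.e. 32 trillion bits
    revs = 7200 / 60  # 7200 RPM -> 120 revolutions per second
    print(f"{bits:.1e} bits, {revs:.0f} revolutions/second")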
I guess my point about hard drives is most people never back them up and kind of always expect them to hold up over 5-10 years. They have years of photos, videos and documents stored on them. Then there are friends savvy enough to set up a RAID system, and invariably the RAID hardware fails before the drive does and they can't get a replacement.
Thanks again for sharing the drive reliability statistics.
Is it possible for this data to ever be useful? Given the time necessary to acquire the data, and the rate at which improvements are made to drives, can't we assume that drives purchased today probably won't behave in exactly the same manner as drives purchased a year ago?
I don't mean to insult, just to ponder the relevance of such long-term studies on tech that changes so quickly.
My takeaway in the long run is in trying to narrow the list of HD manufacturers I am comfortable purchasing. Your point has truth in it, but it also seems fair to observe that companies known for producing reliable products consistently will continue to do so, all other things equal.
When the data is consistent for several years you can already figure that Seagate is not going to improve that fast; when they do improve in the Backblaze data, you can start buying them again.
Large companies may buy Seagate due to the price advantage and the fact that their storage systems can better handle the drive failure rate.
The Seagate drives do seem to be improving in reliability though. The higher-capacity Seagate drives, which I presume are newer models, have better failure-rate numbers than the lower-capacity drives. The 4 and 6TB drives seem to have reasonable failure rates compared to the other manufacturers - only HGST is better than Seagate at 4TB, and the Seagate 6TB drive has a lower failure rate than the HGST 8TB. For >4TB drives the Seagate 6TB has the lowest failure rate.
6TB: 1.89%
4TB: 2.19%/2.99% (depending on model)
3TB: 5.1%/28.34% (depending on model)
2TB: 10.1%
1.5TB: 10.16%/23.86% (depending on model)
This is definitely useful, although maybe not for purchasing. It lets you know which SMART attributes are most useful, for example. Also, given the periodic reports, you can make judgments about how brands are trending (although you have to be careful about age factors). I think their reports showed 3TB drives are not great as well.
It'd be interesting and quite helpful to see the failure rate vs. drive age, per manufacturer.
For example, for less reliable manufacturers there might be a "if you get past the first N weeks, you are fine" pattern, or a failure cliff exactly 1 week past the warranty period, or something equally entertaining.
I've got 5 Western Digital drives which have failed out of original purchase of 6. Now I'm wondering if it's really worth it trying to go through the RMA process (I need to figure out exactly how old they are and how long the warranty is) or if I should just give up on Western Digital and go with a different manufacturer... though I am not looking forward to spending that amount of money all at once.
Great stuff. Does anybody have any stats for drives' Bit Error Rates (BER) / maximum unrecoverable read errors (URE) / non-recoverable read error rates ? By my understanding, manufacturer quoted BERs for commodity drives, often 10^14, tend to be 10^15 or higher in practice.
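For context, here is what a quoted rate of 1 error per 10^14 bits implies for one full read of a 4 TB drive, assuming independent errors (a strong assumption):

    import math

    ber = 1e-14          # quoted URE rate: 1 error per 1e14 bits read
    bits = 4e12 * 8      # one end-to-end read of a 4 TB drive

    expected = bits * ber                    # ~0.32 expected errors
    print(f"P(clean full read): {math.exp(-expected):.1%}")  # ~72.6% (Poisson)
    # At 1e-15, as suggested for real-world drives, this rises to ~96.9%.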
This information is super useful. I have an ST3000DM001 and only trust it because its smart stats are still all in the green (and of course I have local and cloud backups of anything important).
I've had it for four years now and there are no warnings of any kind yet, so I guess I got one from a good batch.
On the basis of buying a single personal hard drive this data is interesting but wouldn't have much impact on your purchasing. As usual, the advice is to have multiple backups of everything.
That's the wrong message to take away. The right message is: every manufacturer goes through periods of good and bad disks. Don't depend on any drive to be perfect.
The Seagate 3TB were awful, but their 4TB seem to be just fine.
I've had two Seagate drives "fail" recently after <2 months - the drive is fine but the (OEM) USB3 enclosure is dead. No idea who manufactures those for Seagate but I'm not impressed.
Similar experience with WD. I bought a western digital USB drive. It failed, I RMA'd it for a new one. It failed. A year later I bought a bigger one, and it failed after another year.
I cracked open the enclosures and the drives are just fine. I still use them for backups with no errors.
Honestly, not that much: to feel comfortable at home I would need 20 TB of storage (of course it's only Linux ISOs ;-). A bit more than ten thousand people like me and they would have to reorder more drives.
Would be more interesting to find out reliability figures for high-throughput data-center models of hard drives instead of backup-drive models with low access rates.
It's a common scam to sell flash drives with modified firmware that causes them to report a larger size than the underlying flash chips provide. Usually once you write past that point, the writes will loop around or drop the data entirely. It's likely that the $17 flash drive you linked to is a scam.
Beyond that, flash drives tend to have low write durability and horrible performance on large writes (because of poorly implemented garbage collection).
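Tools like f3 and H2testw catch the fake-capacity trick by filling the device with offset-dependent data and reading it all back; a minimal sketch of the idea (path and sizes are illustrative):

    BLOCK = 1024 * 1024  # 1 MiB

    def pattern(i: int) -> bytes:
        # Deterministic, offset-dependent data: if the firmware wraps writes
        # around (block N silently overwriting block 0), read-back won't match.
        return i.to_bytes(8, "little") * (BLOCK // 8)

    def fill_and_verify(path: str, blocks: int) -> None:
        with open(path, "wb") as f:
            for i in range(blocks):
                f.write(pattern(i))
        with open(path, "rb") as f:
            for i in range(blocks):
                if f.read(BLOCK) != pattern(i):
                    print(f"mismatch at block {i} - capacity is likely fake")
                    return
        print("all blocks verified")

    # e.g. fill_and_verify("/media/usbstick/test.bin", blocks=15_000)  # ~15 GB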
I have 2x ST4000DM and 1x ST4000VX in my desktop, plus one 4TB Seagate 'surveillance' drive as a USB luggable, though OS X, to which it is currently connected, doesn't want to give me the specifics (neither right-click Info nor DiskUtil).
"A relevant observation from our Operations team on the Seagate drives is that they generally signal their impending failure via their SMART stats. Since we monitor several SMART stats, we are often warned of trouble before a pending failure and can take appropriate action. Drive failures from the other manufacturers appear to be less predictable via SMART stats."
~10 years ago, I remember google research put out a highly cited paper wherein they found that SMART stats were not a particularly strong indicator of impending drive failure (50% of drives had no SMART indications of problem before failure). http://research.google.com/pubs/pub32774.html
Has this now changed (at least for Seagate)?
Reliability/longevity is nice but a signal of impending failure is far more valuable from an operations point of view.