> you can stop buying disks from Amazon every few months
This is something that has been puzzling me. Many years ago I purchased 4x 2TB 5900 RPM drives for a 4-bay ReadyNAS (roughly $300 for the drives plus the ReadyNAS). They have been spinning nonstop for ~4 years [1] and I haven't had to replace a single one. Not even an increase in errors to signal that a drive is going.
Yet - I've worked on a SAN that cost hundreds of thousands of dollars, where we had to replace a disk about every month.
Granted the disks in SANs probably spin faster (thus faster data access/lower MTBF) - but that high failure rate seems rather suspicious to me.
AFR buddy, AFR. AFR is the annual failure rate, and it's typically 2 - 5% of the population per year. With only 4 drives you don't have a large enough set to see this in action; it's just that every day you're in danger of losing a drive by a small statistical amount.
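To put rough numbers on it, here's a back-of-the-envelope sketch assuming independent drives and a flat AFR (real fleets only approximate both assumptions):

```
# Back-of-the-envelope: failure odds for a fleet, assuming independent
# drives and a flat annual failure rate (AFR).

def p_at_least_one_failure(n_drives, afr):
    """P(>= 1 failure in a year) = 1 - (1 - AFR)^n"""
    return 1.0 - (1.0 - afr) ** n_drives

for n in (4, 10_000):
    for afr in (0.02, 0.05):
        p = p_at_least_one_failure(n, afr)
        print(f"{n:>6} drives @ {afr:.0%} AFR: "
              f"{p:6.1%} chance of >= 1 failure/year, "
              f"~{n * afr:g} expected failures/year")
```

At those rates a 4-drive NAS has somewhere around a 1-in-10 chance of seeing any failure in a given year, while a 10,000-drive fleet expects a few hundred failures a year, i.e. one every day or two.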
In the Blekko cluster we have just under 10,000 drives. We have two 20-drive 'boxes' (40 drives) from Western Digital; as drives fail we pull replacements from the 'new/refurbished' box and put the dead ones in the outgoing box. When that box gets up to 20 we RMA them in bulk: 20 go out, 20 more come in, and that becomes the new 'new/refurbished' box.
It really isn't SAN vs non-SAN; it is all statistics.
That said, if you're running your ReadyNAS with RAID 10 (mirrored drives in a RAID 0 config) you may find some unpleasantness when a drive does fail. Statistically you have about a 1-in-10 chance of not being able to re-silver the mirror with a 5900 RPM desktop SATA drive. That gets a bit painful.
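For the curious, the usual back-of-the-envelope behind that kind of figure (my own numbers and assumptions, so treat it as a sketch): desktop SATA drives are commonly specced at around one unrecoverable read error (URE) per 10^14 bits, and re-silvering a mirror means reading the surviving drive end to end.

```
# Sketch of the resilver risk: a URE anywhere during the full-drive read of
# the surviving mirror can abort the rebuild. Numbers are typical desktop
# SATA spec values, not measurements from any particular drive.

DRIVE_BYTES = 2 * 10**12      # 2 TB drive
BITS_READ = DRIVE_BYTES * 8   # whole surviving drive is read during resilver
URE_PER_BIT = 1e-14           # ~1 unrecoverable read error per 1e14 bits

p_clean = (1.0 - URE_PER_BIT) ** BITS_READ
print(f"P(>= 1 URE during rebuild) = {1.0 - p_clean:.1%}")
# ~15% for a 2 TB desktop drive -- the same ballpark as a 1-in-10 chance
```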
Just out of curiosity, has anyone done any research into determining whether a drive which fails after X days/years has some properties in the first Y days that could be a signal for future failure?
I'm certain there are more recent statistics but Google's "Failure Trends in a Large Disk Drive Population"[1] (2007) is a good start:
"In addition to presenting failure statistics,
we analyze the correlation between failures and several
parameters generally believed to impact longevity."
There is also a more recent open source dataset from Backblaze[2] that includes:
"Every day, the software that runs the Backblaze data center takes a snapshot of the state of every drive in the data center, including the drive’s serial number, model number, and all of its SMART data"
which forms the basis of an article correlating SMART data with drive failures at Backblaze[3].
The TL;DR answer is yes, there are some hard drive SMART values that can indicate failure is likely, but they vary by model and don't necessarily show up before a failure.
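If anyone wants to poke at the Backblaze data themselves, here's a minimal pandas sketch; the file name is just an example of one of their published daily snapshot CSVs, and smart_5_raw is the reallocated-sector count:

```
# Minimal look at one daily snapshot from the Backblaze drive-stats dataset.
# The file name is an example; column names follow the published CSVs.
import pandas as pd

df = pd.read_csv("2015-01-01.csv")

# SMART 5 (reallocated sectors) for drives flagged failed vs. healthy that day.
snapshot = df[["failure", "smart_5_raw"]].dropna()
print(snapshot.groupby("failure")["smart_5_raw"].agg(["count", "mean", "median"]))
```

That kind of comparison tends to back up the TL;DR above: failed drives show elevated reallocation counts far more often than healthy ones, but plenty of drives still fail with a count of zero.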
yeah, I was wondering if there were measurables that could be correlated with failure before SMART kicked in, even if they were something like date of year, or location of manufacture, or shipping route they took. :P
Generally the most common measurable is sector reallocation errors. These come from various random things going wrong at the wrong time, and the disk re-allocates a new sector from the spare pool to deal with one that has gone bad. In operation, our disks at Blekko pick up sector reallocation errors at a low statistical rate, and that rate climbs prior to total failure. Since our infrastructure is triply redundant (three disks hold a copy of every piece of data) we can simply reformat drives which develop sector errors. If you plot the time between sector errors developing over the life of the drive, it gets shorter rapidly as the drive nears complete failure. Sometimes, however, there is no warning; the drive simply fails. As with my previous experience at Google, and NetApp before that, there is a small rise in early failure (infant mortality), then a long tail toward a steep failure rate after about 10 years.
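A toy illustration of the plot described above; the event days are invented just to show the shape, not real drive data:

```
# Toy version of the "time between sector reallocation events" plot.
# event_days are made-up drive ages (in days) at which the reallocated-sector
# count ticked up; the gaps shrink as the drive approaches failure.
import matplotlib.pyplot as plt

event_days = [120, 610, 890, 1010, 1075, 1110, 1128, 1137, 1141]
intervals = [b - a for a, b in zip(event_days, event_days[1:])]

plt.plot(event_days[1:], intervals, marker="o")
plt.xlabel("Drive age (days)")
plt.ylabel("Days since previous reallocation event")
plt.show()
```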
> just every day you're in danger of losing a drive by a small statistical amount.
Which is why for important data I always use some sort of RAID (or cloud syncing). If I lose a drive I won't lose all my data (presumably though if I bought both drives at the same time there is a chance that both could fail at the same or close to the same time).
> That said, if you're running your ReadyNAS with raid 10 (mirrored drives in a RAID 0 config)
ReadyNAS and other products use a special type of RAID that is actually kind of clever (it will use all the available space regardless of disk sizes). I know many would criticize the special RAID, but I've had more issues with Linux software RAID than I have with the ReadyNAS (nothing related to data loss). Suffice it to say, if 1 drive fails then in theory the data should still be ok (at least that is their claim to fame).
My personal opinion - I feel like the ReadyNAS will die before the drives. I'm not saying the drives couldn't or won't die - but I feel like that would be highly unlikely, and it's even more unlikely for more than 1 to fail at once.
Maybe I'm just paranoid - but when I worked on my last SAN it felt like the company used the cheapest possible drives with a short MTBF for the simple reason that my employer would have to keep using them and keep paying for their warranty service (this wasn't a big name like Dell).
> Which is why for important data I always use some sort of RAID (or cloud syncing). If I lose a drive I won't lose all my data
Just a friendly reminder that RAID != backup. There are numerous data loss cases that RAID does not deal with.
Personally I use striped ZFS with important volumes periodically snapshotted, replicated to external (and encrypted) disks, and then stored offsite (cycling through a couple of sets of disks). Most important data is also periodically synced to cloud storage (as well as offsite disk).
This accounts for:
- Single disk failure (striping)
- Bitrot (ZFS scrubbing can reveal bitrot on disk and correct it from parity)
- Human error (snapshots)
- Catastrophic damage to home NAS (offsite backups)
RAID alone (depending on the particular implementation) will generally not deal with the 3 latter failure cases.
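For concreteness, a rough sketch of the periodic snapshot + replicate-to-external-disk step (the dataset and target names are placeholders, not an actual pool layout):

```
# Rough sketch of the snapshot + replicate step. Pool/dataset names are
# placeholders; a real script would also prune old snapshots and send
# incremental streams (zfs send -i) instead of a full stream every time.
import subprocess
from datetime import datetime, timezone

DATASET = "tank/important"          # source dataset (placeholder)
BACKUP_TARGET = "backup/important"  # dataset on the external/offsite disk

stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
snap = f"{DATASET}@auto-{stamp}"

# Cheap point-in-time snapshot: this is the "human error" protection.
subprocess.run(["zfs", "snapshot", snap], check=True)

# Replicate the snapshot to the external pool (the offsite copy).
send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
subprocess.run(["zfs", "receive", "-F", BACKUP_TARGET],
               stdin=send.stdout, check=True)
send.stdout.close()
send.wait()
```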
[1] http://i.imgur.com/NRQCedj.png