Isn't running raid1/5/6 on ssds silly b/c they'll all die at the same time? And hardware raid on top of that? Why?
SSDs have a fairly consistent failure curve (excluding firmware bugs and other random events) for a given model, so they'll wear evenly in a raid setup. That means they can all die at around the same time, since reads and writes are distributed fairly evenly across the disks. Given the size of today's drives, you may not complete a rebuild before losing another disk.
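To put a rough number on that rebuild window, here's a minimal back-of-the-envelope sketch; the 4 TB capacity and 150 MB/s rebuild rate are assumptions for illustration, not figures from the thread:

```python
# Rough rebuild-window estimate with hypothetical numbers.
capacity_bytes = 4e12    # assumed 4 TB member drive
rebuild_rate = 150e6     # assumed 150 MB/s sustained rebuild throughput

rebuild_seconds = capacity_bytes / rebuild_rate
print(f"Rebuild window: {rebuild_seconds / 3600:.1f} hours")  # -> about 7.4 hours

# If the surviving members are at the same wear level, that's several
# hours during which a second, similarly worn drive could also drop out.
```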
Has this been shown not to be true within the past few years? I don't run redundant raid on ssds; it's either raid0 or jbod.
Not necessarily. We've been running Intel SSDs in production at Stack Exchange for 4+ years, and only recently had our first 2.5" drive die.
That said, most of the drives in this article are consumer drives. The problem with consumer drives is that they don't have capacitors for power-loss protection. Since your writes are cached by the drive before they go to the NAND, if you lose power, all of your drives will be corrupted in the exact same way at the exact same time.
If you don't care about the data, go ahead and use them. If you do, pay the extra for enterprise drives. They really aren't _that_ much more expensive these days.
Interesting. Do you have more info on what you mean by "Not necessarily?" From what I've seen during reliability studies on SSDs, they have a fairly tight failure curve. This is very dissimilar from hard disks where there's much more variance from drive to drive.
Nothing that would pass deep scrutiny. Just our experience running SSDs in almost every server at Stack. We've only had one mass failure of drives. That was when 5/8 Samsung drives died around the same time in our packet capture box. The remaining 3 are still alive, although we don't really use them.
We have only had two Intel drives die on us. I'm interested (academically, not professionally) in whether they will die at the same time or keep dropping off one at a time.
We tend to retire the machines or the drives in them before they fail.
In my experience (Flash Platforms Group, HSGT), a significant number of flash device failures are caused by mechanical issues such as on-die faults, wire bonding problems, and solder joint/package stresses. These may occasionally be down to a production issue, but are more often attributed to rough handling during installation or thermal stress. This is less of an issue with SFF SSDs, but especially true of PCIe products crammed into 2U servers with poor airflow.
In general, thermals tend to be a significant issue for all form factors when devices are retrofitted. Less so for 'products' (all-flash arrays, etc.), which are designed as whole products from the outset.
This creates a failure pattern that is totally separate from predictable wear due to use and means that 'they'll all die at the same time' becomes much less certain for some categories of device.
That's what I had considered as well. Specifically, wear-out failures sit at the right-hand end of the failure curve, meaning other issues tend to dominate. What you said makes a lot of sense: failures due to thermal/electrical stress, manufacturing issues, and handling.
Also, there are the controllers, which tend to be a significant source of issues.
So yeah, theoretically wear-out might be a concern for raid. In practice, though, it's rarely an issue, because other failure modes randomize the distribution enough that simultaneous wear-out almost never causes a problem in situ.
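To illustrate that point, here's a toy simulation; the failure-time model (a tight 5-year wear-out spread plus a 30% chance of an earlier random failure) is entirely made up for illustration and isn't based on any real drive data:

```python
import random

# Illustrative only: hypothetical failure-time model, not real drive data.
# Each drive has a tightly clustered wear-out time (mean 5 years, small spread),
# plus a chance of an earlier "random" failure (controller, thermal, handling).
def drive_lifetime_years():
    wear_out = random.gauss(5.0, 0.1)      # tight wear-out distribution
    if random.random() < 0.3:              # assumed 30% chance of an early random failure
        early = random.uniform(0.5, 5.0)
        return min(wear_out, early)
    return wear_out

def mean_gap_between_first_two_failures(n_drives=8, trials=10_000):
    """Average gap (years) between the first and second drive failures in a set."""
    gaps = []
    for _ in range(trials):
        lifetimes = sorted(drive_lifetime_years() for _ in range(n_drives))
        gaps.append(lifetimes[1] - lifetimes[0])
    return sum(gaps) / len(gaps)

print(f"Mean gap between first two failures: {mean_gap_between_first_two_failures():.2f} years")
```

In this toy model the gap between the first two failures typically comes out to months or years, not minutes, even though the wear-out distribution itself is very tight.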