Isn't running raid1/5/6 on ssds silly b/c they'll all die at the same time? And hardware raid on top of that? Why?
SSDs have a fairly consistent failure curve (excluding firmware bugs and other random events) for a given model, so they'll wear evenly in a raid setup. That means they can all die at around the same time, since reads and writes are distributed fairly evenly across the disks. Given the size of today's drives, you may not complete a rebuild before losing another disk.
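To put a rough number on that rebuild window, here's a minimal back-of-the-envelope sketch; the 4 TB capacity and 150 MB/s rebuild rate are assumptions for illustration, not figures from the thread:

```python
# Rough rebuild-window estimate with hypothetical numbers.
capacity_bytes = 4e12    # assumed 4 TB member drive
rebuild_rate = 150e6     # assumed 150 MB/s sustained rebuild throughput

rebuild_seconds = capacity_bytes / rebuild_rate
print(f"Rebuild window: {rebuild_seconds / 3600:.1f} hours")  # -> about 7.4 hours

# If the surviving members are at the same wear level, that's several
# hours during which a second, similarly worn drive could also drop out.
```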
Has this been shown not to be true within the past few years? I don't run redundant raid on ssds; it's either raid0 or jbod.
Not necessarily. We've been running Intel SSDs in production at Stack Exchange for 4+ years, and only recently had our first 2.5" drive die.
That said, most of the drives in this article are consumer drives. The problem with consumer drives is that they don't have capacitors for power-loss protection. Since your writes are cached by the drive before they go to the NAND, if you lose power, all of your drives will be corrupted in the exact same way at the exact same time.
If you don't care about the data, go ahead and use them. If you do, pay the extra for enterprise drives. They really aren't _that_ much more expensive these days.
Interesting. Do you have more info on what you mean by "Not necessarily?" From what I've seen during reliability studies on SSDs, they have a fairly tight failure curve. This is very dissimilar from hard disks where there's much more variance from drive to drive.
Nothing that would pass deep scrutiny. Just our experience running SSDs in almost every server at Stack. We've only had one mass failure of drives. That was when 5/8 Samsung drives died around the same time in our packet capture box. The remaining 3 are still alive, although we don't really use them.
We have only had two Intel drives die on us. I'm interested (academically, not professionally) in whether they will die at the same time or keep dropping off one at a time.
We tend to retire the machines or the drives in them before they fail.
In my experience (Flash Platforms Group, HSGT), a significant number of flash device failures are caused by mechanical issues such as on-die faults, wire bonding problems, and solder joint/package stresses. These may occasionally be down to a production issue, but are more often attributed to rough handling during installation or thermal stress. This is less of an issue with SFF SSDs, but especially true of PCIe products crammed into 2U servers with poor airflow.
In general, thermals tend to be a significant issue for all form factors when devices are retrofitted. Less so for 'products' (all-flash arrays, etc.), which are designed as whole products from the outset.
This creates a failure pattern that is totally separate from predictable wear due to use and means that 'they'll all die at the same time' becomes much less certain for some categories of device.
That's what I had considered as well. Specifically, wear-out failures sit at the right-hand end of the failure curve, meaning other issues tend to dominate. What you said makes a lot of sense: failures due to thermal/electrical stress, manufacturing issues, and handling.
Also, there are the controllers, which tend to be a significant source of issues.
So yeah, theoretically wear-out might be a concern for raid. In practice, though, it's rarely an issue, because other failure modes randomize the distribution enough that simultaneous wear-out almost never causes a problem in situ.
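To illustrate that point, here's a toy simulation; the failure-time model (a tight 5-year wear-out spread plus a 30% chance of an earlier random failure) is entirely made up for illustration and isn't based on any real drive data:

```python
import random

# Illustrative only: hypothetical failure-time model, not real drive data.
# Each drive has a tightly clustered wear-out time (mean 5 years, small spread),
# plus a chance of an earlier "random" failure (controller, thermal, handling).
def drive_lifetime_years():
    wear_out = random.gauss(5.0, 0.1)      # tight wear-out distribution
    if random.random() < 0.3:              # assumed 30% chance of an early random failure
        early = random.uniform(0.5, 5.0)
        return min(wear_out, early)
    return wear_out

def mean_gap_between_first_two_failures(n_drives=8, trials=10_000):
    """Average gap (years) between the first and second drive failures in a set."""
    gaps = []
    for _ in range(trials):
        lifetimes = sorted(drive_lifetime_years() for _ in range(n_drives))
        gaps.append(lifetimes[1] - lifetimes[0])
    return sum(gaps) / len(gaps)

print(f"Mean gap between first two failures: {mean_gap_between_first_two_failures():.2f} years")
```

In this toy model the gap between the first two failures typically comes out to months or years, not minutes, even though the wear-out distribution itself is very tight.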