
Yep: if you buy a pair of disks together, there's a fair chance they'll both be from the same manufacturing batch, which correlates with disk defects.



Yeah, just coming here to say this. Multiple disk failures are pretty probable. I've had batches of both disks and SSDs with sequential serial numbers, subjected to the same workloads, all fail within the same ~24-hour period.


Had the same experience with (identical) SSDs, two failures within 10 minutes in a RAID 5 configuration.

(Thankfully, they didn't completely die but just put themselves into read-only mode.)


Seems like it was only a few days ago that a former Dropbox engineer commented here that a lot of the disk drives they bought when they stood up their own datacenter turned out to share a common flaw involving tiny metal slivers.


This makes total sense, but I've never heard of it. Is there any literature or writing about this phenomenon?

I guess in some cases proper redundancy also means using different brands of equipment.


I hadn't heard of it either, until disks in our storage cluster at work started failing faster than the cluster could rebuild, in an event our ops team named the SATApocalypse. It was a perfect storm of cascading failures.

https://web.archive.org/web/20220330032426/https://ops.faith...


Great read, thank you!


I also don't know about literature on this phenomenon, but I recall HP had two different SSD recalls because the drives would fail when their uptime counter rolled over. That's not even load-dependent; it just depends on whether you got a batch and powered them all on at the same time. Excessive uptime causing issues isn't that unusual for storage, unfortunately.
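
For illustration only: the widely reported failure point of 32,768 hours (roughly 3.7 years) is what you'd expect if the power-on-hours counter were stored as a signed 16-bit integer. That's my assumption about the mechanism, not something the vendor confirmed, but a minimal sketch shows why every drive powered on together would hit the cliff together:

    # Sketch of a signed 16-bit power-on-hours counter wrapping around.
    # The 32,768-hour figure is illustrative; firmware that trusts the
    # counter misbehaves once it goes negative.
    import ctypes

    def as_int16(hours: int) -> int:
        """Wrap an hour count the way a signed 16-bit counter would."""
        return ctypes.c_int16(hours).value

    for hours in (32766, 32767, 32768):
        print(hours, "->", as_int16(hours))
    # 32766 -> 32766
    # 32767 -> 32767
    # 32768 -> -32768  (rollover)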

It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power-on-time diversity. That adds a lot of variables if you need to track down issues, though.

[1] You don't want versions with known issues that affect you, but it's helpful to have different versions when diagnosing unknown issues.
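
If you want to audit an existing array for that kind of diversity, something like the rough sketch below works, assuming smartmontools is installed. The device list is a placeholder, and the fields it parses ("Device Model", "Firmware Version") are what smartctl -i prints for SATA drives; SAS/NVMe output differs.

    # Rough sketch: flag drives in an array that share the same model and
    # firmware, since those are the ones most likely to fail together.
    # Device paths are placeholders; adjust for your system.
    import subprocess
    from collections import Counter

    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]

    def identity(dev):
        out = subprocess.run(["smartctl", "-i", dev],
                             capture_output=True, text=True, check=False).stdout
        info = {}
        for line in out.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                info[key.strip()] = value.strip()
        return info.get("Device Model", "?"), info.get("Firmware Version", "?")

    counts = Counter(identity(d) for d in DEVICES)
    for (model, firmware), n in counts.items():
        if n > 1:
            print(f"{n} drives share {model} / firmware {firmware} -- "
                  "consider mixing batches, firmware, or vendors")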


The Crucial M4 had this too, but it was fixable with a firmware update.

https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...


That one looks not too bad; it seems like you can fix it with a firmware update after it fails. A lot of disk failures due to firmware bugs end up with the disk not responding on the bus, so it becomes somewhere between impossible and impractical to update the firmware.


I don't know about literature, but in the world of RAID this is a common warning.

Having a RAID5 array crash and burn because a second disk failed during the rebuild that followed the first disk's failure is a common story.
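
The back-of-the-envelope math shows why, using purely illustrative numbers: a spec-sheet unrecoverable-read-error (URE) rate of 1 per 1e14 bits (common for consumer drives), a hypothetical 4 x 12 TB RAID5, and the assumption that errors are independent (correlated batches make it worse):

    # Illustrative only: the URE rate, drive size, and independence
    # assumption are all rough; real drives often beat their spec sheet,
    # and same-batch drives can fail together outright.
    import math

    URE_RATE = 1e-14        # unrecoverable read errors per bit read
    DRIVE_TB = 12
    SURVIVING = 3           # a 4-drive RAID5 rebuild reads the other 3 in full

    bits_read = SURVIVING * DRIVE_TB * 1e12 * 8
    expected_errors = URE_RATE * bits_read          # ~2.9 expected UREs
    p_clean = math.exp(-expected_errors)            # Poisson approximation
    print(f"Expected UREs during rebuild: {expected_errors:.1f}")
    print(f"Chance of a URE-free rebuild: {p_clean:.0%}")   # roughly 6%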


Not sure about literature, but that was a known thing in the ops circles I was in 10 years ago: never use the same brand for disk pairs, to minimize the chance of wear-and-tear-related defects arising at the same time.


We used to use the same brand but different models, or at least ensure the disks were from different manufacturing batches.


Wikipedia has a section on this. It's called "correlated failure." https://en.wikipedia.org/wiki/RAID#Correlated_failures


Not sure about literature, but there are past anecdotes and HN threads, yes.

https://news.ycombinator.com/item?id=4989579


This is why I try to mismatch manufacturers in RAID arrays. I'm told there is a small performance hit (things run toward the speed of the slowest disk, separately in terms of latency and throughput), but I doubt the difference is large, and I like the reduction in potential failure-during-rebuild rates. Of course I have off-machine and off-site backups as well as RAID, but having to use them to restore a large array would be a greater inconvenience than just rebuilding the array in place (followed by checksum verification over the whole lot, for paranoia's sake).


Eek - now I'm glad I wait a few months before buying each disk for my NAS.

I'm not doing it for this reason but rather for financial ones :) But since I have a totally mixed bunch of sizes, I have no RAID, and a disk loss would be horrible.


You have to be careful doing that too, or you'll end up with subtly different revisions of the same model. This may or may not cause problems depending on the drives/controller/workload, but it can result in chasing down weird performance gremlins or thinking you have a drive that's going bad.


That's why serious SAN vendors take care to provide you with a mix of disks (e.g. on a brand-new NetApp you can see that the disks are of 2-3 different types, with quite different serial numbers).



