"How does it manage to stay consistent if a cosmic ray strikes it and flips one or more bits?"
At the time (and I think it's still true), cosmic rays did not have sufficient energy to flip a magnetic domain on disk. Memory bit flips are detected by ECC, and errors on the channels (between the I/O card and memory and/or disk) are detected with CRC codes.
"How does it manage to stay consistent if you physically bump into the drives and cause physical damage by having the disk head briefly touch the disk surface?"
The disks are part of a RAID 4 or RAID 6 group (RAID 6 preferred for drives > 500GB, required for drives >= 2TB), so physically damaging a drive results in the group reconstructing the data on that drive.
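To make the reconstruction step concrete, here is a minimal sketch of single-parity (RAID 4 style) recovery in Python. It is only an illustration of the XOR idea, not NetApp's implementation; the function names and the tiny 4-byte "blocks" are made up for the example. RAID 6 adds a second, independent parity so that two failed drives can be survived, which is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_data, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_data + [parity])


# One stripe spread across three data drives and one dedicated parity drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the second drive is damaged; rebuild its block from the rest.
rebuilt = reconstruct([data[0], data[2]], parity)
```

Because XOR is its own inverse, XORing the surviving blocks with the parity block leaves exactly the missing block, which is why the group can rebuild one drive's contents.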
NetApp has always had a pretty solid "don't trust anything" sort of mantra that has been tested and fortified a few times by various events. The ones I got to see firsthand were an HBA that corrupted traffic passing through it in flight, drives that returned a different block than the one you asked for, and drives that acknowledged they had written data to the drive when in fact they had not.
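One common way to catch all three of those failures is to store identity information and a checksum alongside each block and verify them on every read. The sketch below is a hypothetical layout, not NetApp's actual on-disk format: the header fields, the `seal` and `verify` names, and the use of zlib's CRC-32 are all assumptions made for illustration. It also assumes the expected generation number is tracked somewhere other than the block itself (for example in the parent metadata), since a lost write leaves an old but internally valid block behind.

```python
import struct
import zlib

# Hypothetical per-block header: block number, write generation, CRC-32.
HEADER = struct.Struct("<QQI")


def seal(block_no, generation, data):
    """Prefix data with a header whose checksum covers identity and payload."""
    crc = zlib.crc32(struct.pack("<QQ", block_no, generation) + data)
    return HEADER.pack(block_no, generation, crc) + data


def verify(raw, expected_block_no, expected_generation):
    """Return the payload, or raise if the block is corrupt, wrong, or stale."""
    block_no, generation, crc = HEADER.unpack_from(raw)
    data = raw[HEADER.size:]
    if zlib.crc32(struct.pack("<QQ", block_no, generation) + data) != crc:
        raise IOError("checksum mismatch: block corrupted in flight or at rest")
    if block_no != expected_block_no:
        raise IOError("drive returned a different block than requested")
    if generation != expected_generation:
        raise IOError("stale block: an acknowledged write never reached the disk")
    return data
```

A read of block 7 at generation 2 would call `verify(raw, 7, 2)`; any of the three checks failing would be the cue to rebuild the block from the rest of the RAID group instead of trusting the drive.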
Back in the early 2000s, anything that could happen to a disk with a probability of one in a billion operations or higher, they got to see once a month. It was an interesting challenge that required a certain discipline to deal with. When I went to Google and saw their "we assume everything is crap, we just fix it in software" model, it gave me another perspective on how to tackle the problem of storage reliability.
Both schemes work and have their plusses and minuses.