
I'm having trouble finding information about this. It's disturbing how little attention seems to be given to load balancing relative to how important it is.



Yeah, it's interesting what sort of emergent properties come out of massively-scalable, massively-distributed systems. For example, when you write software in school or for single-machine deployment, you're taught to assume that when there's a bug it's your fault, a defect in your software. That's no longer the case when you get into massive (10K+ machine) clusters, where when your program fails, it might be your software, or it might be the hardware, or it might be a random event like a cosmic ray (seriously...in early Google history, there were several failed crawls that happened because cosmic rays caused random single-bit errors in the software).

And so all the defect-prevention approaches you learn for writing single-machine software - testing, assertions, invariants, static typing, code reviews, linters, coding standards - need to be supplemented with architectural approaches: retries, canaries, phased rollouts, supervisors, restarts, checksums, timeouts, distributed transactions, replicas, Paxos, quorums, recovery logic, etc. A typical C++ or Java programmer thinks of reliability in terms of "How many bugs are in my program?" The Erlang guys figured out a while ago that this is insufficient for reliability, because the hardware might (and in a sufficiently large system, will) fail, and so to build reliable systems you need at least two computers, and it's better to let errors kill the process and trigger a fallback than to try to never have errors.
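As a rough illustration of that "let it crash and restart" philosophy (a minimal Python sketch, not anything from the comment itself; `flaky_worker`, `supervise`, and the backoff parameters are made up for the example):

```python
import random
import time

def flaky_worker() -> str:
    """A worker that sometimes fails, standing in for hardware faults,
    network blips, or the occasional cosmic ray."""
    if random.random() < 0.5:
        raise RuntimeError("simulated transient failure")
    return "ok"

def supervise(task, max_restarts: int = 5, backoff: float = 0.01):
    """Run a task under a tiny Erlang-style supervisor: let a failure
    crash the attempt, then restart with exponential backoff, instead
    of trying to handle every possible error inside the task itself."""
    for attempt in range(max_restarts):
        try:
            return task()
        except Exception as exc:
            print(f"attempt {attempt} crashed: {exc}; restarting")
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"gave up after {max_restarts} restarts")

if __name__ == "__main__":
    print(supervise(flaky_worker))
```

Real supervisor trees (as in Erlang/OTP) also cap restart *rates* and escalate to a parent supervisor when a child keeps dying, rather than retrying forever in place.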


