
I'm having trouble finding information about this. It's disturbing how little attention seems to be given to load balancing relative to how important it is.



Yeah, it's interesting what sort of emergent properties come out of massively-scalable, massively-distributed systems. For example, when you write software in school or for single-machine deployment, you're taught to assume that when there's a bug it's your fault, a defect in your software. That's no longer the case when you get into massive (10K+ machine) clusters, where when your program fails, it might be your software, or it might be the hardware, or it might be a random event like a cosmic ray (seriously...in early Google history, there were several failed crawls that happened because cosmic rays caused random single-bit errors in the software).

And so all the defect-prevention approaches you learn for writing single-machine software - testing, assertions, invariants, static typing, code reviews, linters, coding standards - need to be supplemented with architectural approaches: retries, canaries, phased rollouts, supervisors, restarts, checksums, timeouts, distributed transactions, replicas, Paxos, quorums, recovery logic, etc. A typical C++ or Java programmer thinks of reliability in terms of "How many bugs are in my program?" The Erlang guys figured out a while ago that this is insufficient for reliability, because the hardware might (and in a sufficiently large system, will) fail, and so to build reliable systems you need at least two computers, and it's better to let errors kill the process and trigger a fallback than to try to never have errors.
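As a rough illustration of that "let it crash and restart" philosophy (a minimal Python sketch, not anything from the comment itself; `flaky_worker`, `supervise`, and the backoff parameters are made up for the example):

```python
import random
import time

def flaky_worker() -> str:
    """A worker that sometimes fails, standing in for hardware faults,
    network blips, or the occasional cosmic ray."""
    if random.random() < 0.5:
        raise RuntimeError("simulated transient failure")
    return "ok"

def supervise(task, max_restarts: int = 5, backoff: float = 0.01):
    """Run a task under a tiny Erlang-style supervisor: let a failure
    crash the attempt, then restart with exponential backoff, instead
    of trying to handle every possible error inside the task itself."""
    for attempt in range(max_restarts):
        try:
            return task()
        except Exception as exc:
            print(f"attempt {attempt} crashed: {exc}; restarting")
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"gave up after {max_restarts} restarts")

if __name__ == "__main__":
    print(supervise(flaky_worker))
```

Real supervisor trees (as in Erlang/OTP) also cap restart *rates* and escalate to a parent supervisor when a child keeps dying, rather than retrying forever in place.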


