
I'm not sure that should be a concern at the routing layer, or even necessarily a concern of Heroku's. It's not their job to ensure that your code isn't blowing up.

That being said, health checks are nice for other reasons and could be used outside of the routing layer (which needs to sail along as quickly as possible).
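
For what it's worth, a minimal sketch of what an out-of-band health check could look like, assuming a hypothetical /health path polled by a monitor rather than checked by the router on every request (not Heroku's actual mechanism):

    # Hypothetical /health endpoint, polled by an external monitor
    # outside the request-routing hot path.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def app_is_healthy():
        # Made-up self-check; a real app might ping its DB or cache here.
        return True

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ok = self.path == "/health" and app_is_healthy()
            self.send_response(200 if ok else 503)
            self.end_headers()
            self.wfile.write(b"ok" if ok else b"unhealthy")

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()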




The problem with that is when your code fails to run properly because of their hardware problem - whose fault is it then?

I don't know how big they are. 50k machines? Could be off by an order of magnitude either way but I'll go with that. Suppose that your servers have, let's be generous, a 5 year mean time between failure. That's 10k machines dying every year. About 27 per day. A bit over 1 per hour.
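
Spelling out that arithmetic (both the fleet size and the MTBF are guesses from above):

    # Back-of-envelope failure math; 50k machines and 5-year MTBF are guesses.
    machines = 50_000
    mtbf_years = 5
    failures_per_year = machines / mtbf_years      # 10,000
    failures_per_day = failures_per_year / 365     # ~27.4
    failures_per_hour = failures_per_day / 24      # ~1.1
    print(failures_per_year, round(failures_per_day, 1), round(failures_per_hour, 2))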

Machines don't necessarily die cleanly. They get flaky. Bits flip. Memory gets corrupted. Network interfaces claim to have sent data they didn't. Every kind of thing that can go wrong, will go wrong, regularly. And every one of them opens you up to following "impossible" code paths where the machine still looks pretty good, but your software did something that should not be possible, and is now in a state that makes no sense. Eventually you figure it out and pull the machine.

Yeah, it doesn't happen to any individual customer too often. But it is always happening to someone, somewhere. And if you use least-connections routing, many of those failures will be much, much bigger deals than they would be otherwise. And every time it happens, it will be Heroku's fault.
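
To make that failure mode concrete, here's a toy simulation (hypothetical numbers, nothing to do with Heroku's actual router): a backend that errors out instantly always has the fewest open connections, so least-connections routing funnels almost all traffic into it.

    # Toy sketch: one "black hole" backend that fails instantly vs. nine healthy ones.
    import random

    HEALTHY_TICKS = 100   # ticks a healthy backend holds a connection
    BROKEN_TICKS = 1      # the broken backend errors out and frees its slot immediately

    backends = [{"broken": i == 0, "in_flight": [], "served": 0} for i in range(10)]

    for tick in range(10_000):          # one incoming request per tick
        for b in backends:              # release finished connections
            b["in_flight"] = [t for t in b["in_flight"] if t > tick]
        fewest = min(len(b["in_flight"]) for b in backends)
        target = random.choice([b for b in backends if len(b["in_flight"]) == fewest])
        target["served"] += 1
        hold = BROKEN_TICKS if target["broken"] else HEALTHY_TICKS
        target["in_flight"].append(tick + hold)

    for i, b in enumerate(backends):
        kind = "broken" if b["broken"] else "healthy"
        print(f"backend {i} ({kind}): {b['served']} requests")

The broken backend ends up swallowing the vast majority of requests, which is how one flaky machine turns into an outage for a whole app.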


I'm having trouble finding information about this. It's disturbing how little attention seems to be given to load balancing relative to how important it is.


Yeah, it's interesting what sort of emergent properties come out of massively-scalable, massively-distributed systems. For example, when you write software in school or for single-machine deployment, you're taught to assume that when there's a bug it's your fault, a defect in your software. That's no longer the case when you get into massive (10K+ machine) clusters, where when your program fails, it might be your software, or it might be the hardware, or it might be a random event like a cosmic ray (seriously... in early Google history, several crawls failed because cosmic rays caused random single-bit memory errors).

And so all the defect-prevention approaches you learn for writing single-machine software - testing, assertions, invariants, static typing, code reviews, linters, coding standards - need to be supplemented with architectural approaches: retries, canaries, phased rollouts, supervisors, restarts, checksums, timeouts, distributed transactions, replicas, Paxos, quorums, recovery logic, etc. A typical C++ or Java programmer thinks of reliability in terms of "How many bugs are in my program?" The Erlang guys figured out a while ago that this is insufficient, because in a sufficiently large system the hardware will fail. To build reliable systems you need at least two computers, and it's better to let errors kill the process and trigger a fallback than to try to never have errors.
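
A rough sketch of that let-it-crash / supervisor idea in Python (the worker and its failure mode are made up; Erlang/OTP supervisors do this for real, with configurable restart strategies and escalation):

    # Run a worker in a separate process; when it dies for any reason, restart it.
    import multiprocessing, random, time

    def worker():
        # Pretend to do work; occasionally hit an "impossible" state and die.
        while True:
            time.sleep(0.1)
            if random.random() < 0.05:
                raise RuntimeError("impossible state reached")

    def supervise(max_restarts=5):
        for attempt in range(max_restarts):
            p = multiprocessing.Process(target=worker)
            p.start()
            p.join()   # returns when the worker dies, for any reason
            print(f"worker died (exit code {p.exitcode}), restart #{attempt + 1}")
        print("giving up; escalate to a human or fail over to another machine")

    if __name__ == "__main__":
        supervise()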



