I understand the need for reliability, i.e. "come back up if you get bad data so it doesn't just die on you", but this seems like something that dates back to when any kind of raised exception would bring the entire system down. Don't we have modern frameworks and systems now that can tolerate individual failures instead of completely dying just because someone tried to POST some bad data?
Plus, having reliability and coming back up if it goes down doesn't solve the root problem, which is that there's a bug in the code, or some sort of unhandled edge case. 99.9999999% uptime doesn't help if your code doesn't work right.
My personal anecdote to illustrate the difference: I've worked on a Python app that had a bug which sometimes made the websocket-closed callback throw an exception. The exception was still being caught by the framework, so the system kept working fine.
...or did it? Eventually, we noticed instances of the app locking up, not responding to any request. There was no obvious error; it just kept running but did not log or do anything. It turned out to be the callback bug: the exception was being caught all right, but we were leaking file descriptors like a sieve and eventually running out. Because of this, a minor bug affecting only some requests was able to take down the entire app.
That's the kind of problem Erlang solves well. Every request runs in its own lightweight process. The process can open (and thus own) files, sockets, OS processes, in-memory DB tables, etc. If the process crashes or dies for any reason, its memory and any other resource it owned gets cleaned up by the VM. If the process tries to hog the CPU or even enters an infinite loop, the preemptive scheduling means other requests will still get a chance to run.
In short: having strong isolation between processes means it's unlikely for a bug in a minor feature somewhere to affect the overall application health.
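To make the isolation concrete, here is a minimal sketch (my own example, not the app from the anecdote above): four "requests", each handled in its own process, one of which crashes. The crashed one is cleaned up by the VM and logged; the others still answer.

```elixir
parent = self()

# Spawn one lightweight process per "request"; they are not linked to us,
# so a crash in one of them stays in that process.
for n <- [1, 2, 0, 4] do
  spawn(fn ->
    # 100 / 0 raises ArithmeticError and kills only this process.
    send(parent, {:result, n, 100 / n})
  end)
end

# Collect whatever arrives; the crashed request simply never reports back.
for _ <- 1..3 do
  receive do
    {:result, n, result} -> IO.puts("request #{n} -> #{result}")
  after
    500 -> IO.puts("a request never answered (its process crashed)")
  end
end
```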
Imagine there is a bug in the code or an unhandled edge case, only exhibited on 0.01% of transactions. The Elixir/OTP handler dies on this edge case, the supervisor spawns another worker, and 99.99% of user transactions continue to be served.
The programmer wakes up the next morning, comes to work, reads the exception log, fixes the edge case, optionally hot-upgrades the code in production if that's easier than doing a full restart, problem solved.
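Roughly, the moving parts of that scenario look like this (a hedged sketch; PaymentWorker and the message shapes are made up for illustration, not taken from any real project):

```elixir
defmodule PaymentWorker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def init(opts), do: {:ok, opts}

  # The happy path: well-formed transactions are served as usual.
  def handle_call({:charge, amount}, _from, state) when is_integer(amount) and amount > 0 do
    {:reply, {:ok, amount}, state}
  end
  # The 0.01% edge case has no matching clause: the call raises, this process
  # dies, and the supervisor below immediately starts a fresh worker. The
  # crash shows up in the log for the programmer to read the next morning.
end

# :one_for_one restarts only the crashed child; everything else keeps running
# and keeps serving the other 99.99% of transactions.
{:ok, _sup} = Supervisor.start_link([PaymentWorker], strategy: :one_for_one)
```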
Other frameworks have to carefully implement the same semantics (that is, preventing bad state from propagating and failures from cascading). Elixir/OTP/Erlang comes with this by default, and the coding style encourages it.
> Plus, having reliability and coming back up if it goes down doesn't solve the root problem, which is that there's a bug in the code, or some sort of unhandled edge case. 99.9999999% uptime doesn't help if your code doesn't work right.
Yeah, but you don't have to wake up at 4am to fix the problem if the system stays up and still works (even if a bit slower, or with some feature degraded). I've seen systems that had parts crashing and restarting for weeks and months. Yeah, it's bad that nobody noticed, but it was a secondary service that didn't affect most users. Taking the whole backend down with an exception or a segfault because of it would not have made sense.
> Don't we have modern frameworks and systems now that can tolerate individual failures instead of completely dying just because someone tried to POST some bad data?
Yeah, we do. Nothing Elixir does is magic or absolutely impossible in other systems. You can spawn an OS process to handle a request. You can have a load balancer or something similar check the health of backend nodes and redirect traffic. Or you can wrap everything in exception handlers and try to restart, and so on. It is kind of doable. But now there are multiple libraries, frameworks, and services to maintain and look after, and maybe the thread that just crashed made a mess on the shared heap, so restarting doesn't really fix the problem. Doable, but more awkward.
> Plus, having reliability and coming back up if it goes down doesn't solve the root problem,
Obviously, nothing will write the code to fix the problems except the programmer. But is it worth completely crashing the system because of a bug in one thread, or because some unimportant component failed? It is a bit like a human dropping dead the first time they scratch a finger. Yeah, they can't write as fast for a while, but they can still largely function.
I mentioned Rust in another comment. Perhaps you'd prefer to have the compiler check correctness; that's very reasonable, and something like Rust or Haskell might be what you'd like to use to ensure you have fewer bugs when the system runs. Higher levels of proof of correctness are also possible; that is done for avionics software, crypto, and other life-critical systems. So it is a continuum. But there are usually dollar signs attached to it.
I think you are misunderstanding something. "Let it crash" doesn't mean crash on everything. It's about handling the _unexpected_.
Someone posting bad data may or may not be expected. The primitives are such that you can _choose_ how much or how little supervision you want, for a given complexity or maturity of a project, or based on what you've learned as users put the system through the wringer. This is a _choice_ you make as an engineer, and OTP makes it easier to make that tradeoff. Other systems and frameworks do not necessarily let you make such a tradeoff as gracefully.
This not only allows the software to scale gracefully with the number of users, it also allows the software to evolve gracefully in complexity and maturity.
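For example (a sketch; Critical and Noncritical are hypothetical names), the restart policy is set per child and the crash budget per supervisor, so tightening or loosening supervision as the project matures is a local, mechanical change:

```elixir
defmodule Critical do
  use GenServer, restart: :permanent   # always brought back if it dies
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
  def init(opts), do: {:ok, opts}
end

defmodule Noncritical do
  use GenServer, restart: :temporary   # crash once, stay down, log it, move on
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
  def init(opts), do: {:ok, opts}
end

# The other dial: how many crashes in a time window this supervisor absorbs
# before it gives up and escalates the failure to its own parent.
Supervisor.start_link([Critical, Noncritical],
  strategy: :one_for_one,
  max_restarts: 5,
  max_seconds: 10
)
```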
The supervisor is what decides whether or not to restart the worker. If you have a process that saves files to local disk and the disk is full, causing the worker to crash, the supervisor can recognize that reason and conclude that the worker process should not be restarted. Additionally, the supervisor can take other actions, like informing the rest of the system and/or potentially even spawning a different worker process that stores files to a slower, more expensive cloud storage option.
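The usual way to express that in OTP (a sketch; LocalDiskWriter is hypothetical and the cloud-storage fallback is left out) is to mark the worker :transient and have it stop deliberately with a reason it knows is not worth a restart, rather than crash-loop into a full disk:

```elixir
defmodule LocalDiskWriter do
  # :transient = restarted after crashes, left down after a deliberate shutdown.
  use GenServer, restart: :transient

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def init(opts), do: {:ok, opts}

  def handle_cast({:save, path, data}, state) do
    case File.write(path, data) do
      :ok ->
        {:noreply, state}

      # Disk full: stop with {:shutdown, reason}. The supervisor treats this as
      # a normal exit for a :transient child and does not restart it; whoever
      # monitors it can react, e.g. by starting a cloud-storage worker instead.
      {:error, :enospc} ->
        {:stop, {:shutdown, :disk_full}, state}

      # Anything else is unexpected: crash and let the supervisor restart us.
      {:error, reason} ->
        raise "write failed: #{inspect(reason)}"
    end
  end
end

Supervisor.start_link([LocalDiskWriter], strategy: :one_for_one)
```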