In Erlang, fault-tolerant behavior is not built in either; you only get the tools to build it. You still have to design the right supervision tree, the dependencies between processes, and the links, make sure you handle process termination messages correctly, and get many other details right.
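To make "you still have to build it" concrete, here is a rough sketch of one such decision: how many restarts to tolerate before giving up. It is a loose Python analogy of a supervisor's restart limit, not how OTP implements it; `supervise`, `RestartLimitExceeded`, and the parameter names are made up for illustration.

```python
import time
from typing import Callable, List


class RestartLimitExceeded(Exception):
    """Raised when the worker crashes too often within the time window."""


def supervise(
    worker: Callable[[], None],
    max_restarts: int,
    window_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run `worker`, restarting it whenever it raises.

    Returns the total number of restarts once the worker exits normally.
    Gives up if more than `max_restarts` crashes fall inside the window.
    """
    recent_crashes: List[float] = []  # crash timestamps still inside the window
    total_restarts = 0
    while True:
        try:
            worker()
            return total_restarts
        except Exception as exc:
            now = clock()
            # Forget crashes that are older than the window.
            recent_crashes = [t for t in recent_crashes if now - t <= window_seconds]
            recent_crashes.append(now)
            if len(recent_crashes) > max_restarts:
                # Too many crashes too quickly: stop restarting and escalate.
                raise RestartLimitExceeded(
                    f"{len(recent_crashes)} crashes within {window_seconds}s"
                ) from exc
            total_restarts += 1
```

Even in this toy version the limit and the window are choices the author has to make per worker, which is the kind of detail meant above.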
In my experience, in non-Erlang setups you have to check the response status whenever you make requests to other services and add some code to handle it, so failure handling is not really a complete afterthought. The only difference I see is that Erlang handles failure in real time, but the same can be done with a periodic task that queries the important services. And implementing it outside of Erlang gives more flexibility (think of Erlang cluster size and network limitations).
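As a rough sketch of the periodic-task approach, assuming each service can be probed by a function that returns an HTTP-like status code (the `HealthMonitor` class, its field names, and the thresholds are all hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthMonitor:
    """Tracks consecutive probe failures per service."""

    # Maps a service name to a callable returning an HTTP-like status code.
    probes: Dict[str, Callable[[], int]]
    # Consecutive failed polls before a service counts as down (assumed value).
    max_failures: int = 3
    failures: Dict[str, int] = field(default_factory=dict)

    def poll_once(self) -> Dict[str, str]:
        """Run every probe once and return each service's current state."""
        states: Dict[str, str] = {}
        for name, probe in self.probes.items():
            try:
                ok = 200 <= probe() < 300
            except Exception:
                # A raised error (timeout, refused connection) counts as a failure.
                ok = False
            # A success resets the counter; a failure increments it.
            self.failures[name] = 0 if ok else self.failures.get(name, 0) + 1
            if self.failures[name] == 0:
                states[name] = "up"
            elif self.failures[name] < self.max_failures:
                states[name] = "degraded"
            else:
                states[name] = "down"
        return states
```

A scheduler (cron job, timer thread, and so on) would call `poll_once()` at a fixed interval and act on the result. Note that failures are only noticed at the next poll, which is the real-time difference mentioned above.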
Our team worked on an Elixir app for a couple of years before splitting off the game logic into .NET. Scaling the .NET server was a very different beast:
- It wasn't designed to crash on failure. It uses thread pools with no supervision trees. We had to add liveness probes to check whether it is alive. I've only ever had to use readiness checks for Elixir.
- No REPL. With a REPL in production, we can debug things live and even try patches to see if they work. Can't do that with .NET. That also contributes to reliability.
Now, cluster size does matter. Erlang and the BEAM were designed for vertical scaling, so you can minimize cluster size by biasing towards vertical scaling. That's what we do on our systems. There's a way to do that with Kubernetes so that we scale vertically during our daily traffic cycle.
At some point, though, you start looking at a partial clustering topology for the BEAM, or at using one of the many process registries that are better suited to dynamic membership. (The one bitwalker wrote comes to mind.)
That's wrong. Fault tolerance is basically the default. Yes, you have to build a supervision tree, but unless you're writing a one-off script, you have to build it anyway to do anything.
Well, you're confirming what I said: you have to build it. It's not as if Erlang programs automatically never fail and always handle problems correctly as required.