I went from a company that used Elixir on the backend to one that uses Node.js.
I went in neutral on Node.js, having never really used it much.
The projects I worked on were backend data pipelines that did not even process that much data. And yet somehow, it was incredibly difficult to isolate the main bug. Along the way, I found out all sorts of things about Node.js, and when I compare it with Elixir/Erlang/OTP, I came to the conclusion that Node.js is unreliable by design.
Don't get me wrong. I've done a lot of Ruby work before, and I've messed with Python. Many current-generation language platforms are struggling with building reliable distributed systems, things that the BEAM VM and OTP platform had already figured out.
Elixir never performs all that well in microbenchmarks. Yet every time I've seen Elixir/Erlang projects compared to more standard Node, Python, or even C# projects, the Elixir one generally has way better performance and feels much faster even under load.
Personally I think much of it is due to async being the predominant model in Node and Python. Async seems much harder than actors or even threads when it comes to debugging performance issues. Sure, async feels easier at first. But it lets small overheads add up, and that makes problems very difficult to debug and track down. It makes profiling harder, etc.
In BEAM, every actor has its own queue, so it's trivial to inspect and analyze performance bottlenecks. Async, by contrast, puts everything into one giant processing queue, and every async function call adds extra overhead. It all adds up.
This has to do with how async works without preemption and resource limits.
There's a counter-intuitive thing when trying to balance load across resources: applying resource limits helps the system run better overall.
One example: when scaling a web app, there comes a point when scaling up the database doesn't seem to help. So we're tempted to increase the connection pool, because that looks like the bottleneck. But increasing the pool can make the overall system perform worse, because oftentimes it is slow, poorly performing queries that are clogging up the system.
Another example: one of the systems I worked on had over 250 Node runtimes running on a single, large server. It used PM2 and did not apply cgroups to limit CPU resources. The whole system was a hog, and I temporarily fixed it by consolidating things to run on about 50 Node runtimes.
When I moved them over to Kubernetes, I also applied CPU resource limits, each runtime in its own pod. I set the limits based on what I measured when they were all running on PM2 ... but the same code running on Kubernetes used 10x less CPU overall. Why? Because the async code was not allowed to just grab as much CPU as it could for as long as it could, and the kernel scheduler was able to schedule fairly. That allowed the entire system to run with fewer resources overall.
There's probably some math with which folks who know Operations Research can prove all this.
> When I moved them over to Kubernetes, I also applied CPU resource limits, each runtime in its own pod. I set the limits based on what I measured when they were all running on PM2 ... but the same code running on Kubernetes used 10x less CPU overall. Why? Because the async code was not allowed to just grab as much CPU as it could for as long as it could, and the kernel scheduler was able to schedule fairly. That allowed the entire system to run with fewer resources overall.
As someone who has advocated against Kubernetes CPU limits everywhere I've worked, I'm really struggling to see how they helped you here. The code used 10x less CPU with CPU limits, with no adverse effects? Where were all those CPU cycles going before?
> The code used 10x less CPU with CPU limits, with no adverse effects?
The normal situation is that defective requests get much larger latency, while the correct requests run much faster.
It's a problem in the cases where the first set isn't actually defective. But it normally takes a reevaluation of the entire thing to solve those, and the non-limited situation isn't any good either.
> Async by contrast puts everything into one giant processing queue
How can you make performance claims while getting the details completely wrong?
Neither .NET's nor Rust's Tokio async implementations work this way. They use all available cores (unless overridden) and implement a work-stealing thread pool. .NET in addition uses hill-climbing and a cooperative blocking detection mechanism to quickly adapt to workloads and ensure optimal throughput. All that while spending 0.1x the CPU on computation compared to BEAM, and having a much lower memory footprint. You cannot compare Erlang/Elixir with top-of-the-line compiled languages.
That sounds about right for .NET. One of the Elixir projects I worked on lived alongside a C# .NET app, the latter being a game server backend. The guy who architected and implemented it made it so that large numbers of people could interact in real time without having to shard. It is pretty amazing stuff in my book.
On the other hand, I have yet to have to implement a liveness probe with an Elixir app, and I've had to do that with .NET because it can and does freeze. That game server also didn't make use of all the available cores as well as the Elixir app did. We also couldn't attach a REPL directly to the .NET app, though we certainly tried.
I would be curious to see if Rust works out better in production.
I think you read my reply incorrectly. Also, would you attach “repl” to your C++/Rust (or, God forbid, Go) application?
Sigh. I swear, the affliction of failing to understand the underlying concepts upon which a technology A or B is built is a plague upon our industry. Instead, everything clearly must fit into the concepts limited to whatever “mother tongue” language a particular developer has mastered.
> I swear, the affliction of failing to understand the underlying concepts upon which a technology A or B is built is a plague upon our industry. Instead, everything clearly must fit into the concepts limited to whatever “mother tongue” language a particular developer has mastered.
Ironic, since any time you post about a programming language it's to inform that C# does it better.
Not just here; someone with your nick also whined that the creator of C# made a technically deficient decision when choosing Go over C# to implement TypeScript.
It's hard for a rational person to believe that someone would make the argument that the creator of the language must have made a mistake just because he reached for (in his words) a more appropriate language in that context.
You have a blind spot when it comes to C#. You also probably already know it.
> Not just here; someone with your nick also whined that the creator of C# made a technically deficient decision when choosing Go over C# to implement TypeScript.
You know you could have just linked the reply instead? It states "C#, F# or Rust". But that wouldn't sound that nice, would it? I use and enjoy multiple programming languages and it helps me in day-to-day tasks greatly. It does not prevent me from seeing how .NET has flaws, but holistically it is way less bad than most other options on the market, including Erlang, Go, C or what have you.
> It's hard for a rational person to believe that someone would make the argument that the creator of the language must have made a mistake just because he reached for (in his words) a more appropriate language in that context.
So appeal to authority trumps observable consequences, technical limitations, and arguments made about lackluster technical vision at Microsoft? Interesting. No, I think it is the kind of people who refuse to engage with the subject on its own merits that are the problem, relegating all the argumentation to the powers that be. Even in a team environment, sure, it is easier to say "team/person X made choice Y," but you could also, if the situation warrants it, expand on why you think this way, and if you can't, maybe you shouldn't be making a statement?
So no, "TypeScript, including Anders Hejlsberg, choosing Go as the language to port TS compiler to" does not suddenly make pigs fly, if anything, but being seen as an endorsement from key C# figure is certainly a bad look.
> So appeal to authority trumps observable consequences, technical limitations, and arguments made about lackluster technical vision at Microsoft?
Your argument is that you have a better grasp of "technical limitations" than Anders Hejlsberg?
You'll forgive the rest of us for not buying that; he has proven his chops, you haven't, especially as the argument (quite a thorough explanation of the context) from the typescript team is a lot more convincing than anything we've seen from you (a few nebulous phrases about technical superiority).
> if anything, being seen as an endorsement from a key C# figure is certainly a bad look.
Yeah, well, the team made their decision with no regard to optics. That lends more weight to their decision, not less.
> Neither .NET's nor Rust's Tokio async implementations work this way.
Well that’s great. I didn’t mention Rust in that list because it does seem to perform well. Its async is also known to be much more difficult to program.
> and having much lower memory footprint. You cannot compare Erlang/Elixir with top of the line compiled languages.
And yet I do and have. Despite all the cool tech for C# and .NET, I’ve seen simple C# web apps struggle to even run on Raspberry Pis for IoT projects while Elixir ones run very well.
Also note Elixir is a compiled language, and BEAM has a JIT nowadays too.
I did hesitate to add C# to that list because it is an impressive language and can perform well. I also know the least about its async.
Nothing you said really counters that async as a general paradigm is more likely to lead to worse performance. It’s still more difficult to profile and tune than other techniques even with M:N schedulers. Look at the sibling post talking about resource allocation.
Even for Rust, there was an HN post recently where they got a Rust service to run a fair bit faster than their initial Golang implementation. After months of extra work, that is. They mentioned that Golang’s programming model made it much easier to write fairly performant networking code. Since Go doesn’t use async, it seems reasonable to assume goroutines are easier to profile and track than async, even if I lack knowledge of Go’s implementation details on the matter. Now, I am assuming their Rust implementation used async, but I don’t know for sure.
> Also note Elixir is a compiled language and BEAM has JIT nowadays too.
Let's see it perform faster than Python first :)
Also, if the target is supported, .NET is going to unconditionally perform faster than Elixir. This is trivially provable.
> Nothing you said really counters that async as a general paradigm is more likely to lead to worse performance. It’s still more difficult to profile and tune than other techniques even with M:N schedulers. Look at the sibling post talking about resource allocation.
Can you provide any reference to support this claim as far as actually good implementations go? Because so far it looks like vibe-based reasoning with zero knowledge to substantiate the opinion presented as fact.
That's not surprising however - Erlang and Elixir as languages tend to leave their heavy users with big knowledge and understanding gaps, and their communities are rather dogmatic about BEAM being the best thing since sliced bread. Lack of critical thinking leads to such a sorry place.
> Can you provide any reference to support this claim as far as actually good implementations go?
Ah yes, now to the No True Scotsman fallacy. Async only works well when it’s “properly implemented,” which apparently means only .NET.
Even some .NET folks prefer the actor model for concurrent programming:
> Orleans is the most underrated technology out there. Not only does it power many Azure products and services, it is also the design basis for Microsoft Service Fabric actors, which also power many Azure products. Virtual actors are the perfect solution for today’s distributed systems.
> In my experience Orleans was able to handle insane write load (our storage/persistence provider went to a queue instead of direct, it was eventually consistent) so we were able to process millions of requests without breaking a sweat. Perhaps others would want more durability, we opted for this as the data was also in a time series database before Orleans saw it.
Ironically, what got me into Elixir was learning about Orleans and how successful it was in scaling Xbox services.
> Because so far it looks like vibe-based reasoning with zero knowledge to substantiate the opinion presented as fact.
Aside from personal experience and years of writing and deploying performance sensitive IoT apps?
Well quick googling shows quite a few posts detailing async issues:
> What tools and techniques might be suited for this kind of analysis? I took a quick glance at a flamegraph but it seems like I would need a relatively deep understanding of the async runtime internals since most of what I see looks like implementation details.
> Reading a 1GB file in 100-byte chunks leads to at least 10,000,000 IOs through three async call layers. The problem becomes catastrophic since these functions are essentially language-level abstractions of callbacks, lacking optimizations that come with their async nature. However, we can manually implement optimizations to alleviate this issue.
> I’m not going to say all async frameworks are definitely slower than threads. What I can say confidently is that asyncio isn’t faster, and it’s more efficient only for huge numbers of mostly idle connections. And only for that.
Do you realize that actor model and virtual/green threads/stackful coroutines vs stackless coroutines / async/await and similar are orthogonal concepts?
Also picking asyncio from Python. Lol. You can't be serious, can you?
The only impression I get is that most Elixir/Erlang practitioners simply have very ossified perceptions and deep biases that prevent them from evaluating implementation/design choices fairly and reaching balanced conclusions on where their capabilities lie. A very far cry from the link salad you posted, which does not answer my question, e.g. the issues with .NET and Rust async implementations performance-wise.
It's impossible to have a conversation with someone deeply committed to their bias and unwilling to accept that BEAM is not the shining paragon of concurrent and multi-threaded runtimes it once was.
Starting with the most general: Node.js suffers in the same way that other async systems do -- the lack of preemption means that certain async tasks can starve other async tasks. You can see this in GUI desktop apps when the GUI freezes because it wasn't written in a way that takes that into account.
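Roughly, the failure mode looks like this (a minimal illustrative sketch in TypeScript; the routes and numbers are made up): one CPU-bound handler with no await points monopolizes the single event loop, so even trivial requests stall behind it.

```typescript
// Minimal illustrative sketch (hypothetical routes): a CPU-bound handler
// with no await points monopolizes Node's single event loop.
import { createServer } from "node:http";

const server = createServer((req, res) => {
  if (req.url === "/busy") {
    // Synchronous loop: nothing can preempt it, so timers, I/O
    // completions, and every other request wait until it finishes.
    let sum = 0;
    for (let i = 0; i < 2_000_000_000; i++) sum += i;
    res.end(`done: ${sum}\n`);
  } else {
    // This handler is instant, yet it still stalls behind /busy
    // because both share the same event loop.
    res.end("hello\n");
  }
});

server.listen(3000);
```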
In other words, the runtime feature that Node.js is most proud of and markets to the world as its main advantage does not scale well in a reliable way.
The BEAM runtime has preemption and will degrade in performance much more gracefully. In most situations, because of preemption (and hot code reloading), you still have a chance of attaching a REPL to the live runtime while under load. That allows someone to understand the live environment and maybe even hot-patch the live code until the real fix can run through the continuous delivery system.
I'm not going to go into the bad JavaScript syntax bloopers that still haunt us and are only partially mitigated by TypeScript; that is documented in "JavaScript: The Good Parts". Or how the "async" keyword colors function calls, forcing everything in a call chain to also be async, or forcing you to use the older callbacks. Most people I talk to who love TypeScript don't consider those to be issues.
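A tiny sketch of that coloring effect, with hypothetical function names: once one call in the chain is async, every caller must either become async itself or fall back to callback-style chaining.

```typescript
// Hypothetical names, purely to illustrate "function coloring".
async function fetchUser(id: string): Promise<string> {
  return `user-${id}`; // stand-in for a real network call
}

async function buildGreeting(id: string): Promise<string> {
  // Awaiting forces buildGreeting itself to become async...
  const user = await fetchUser(id);
  return `hello, ${user}`;
}

function logGreeting(id: string): void {
  // ...and a synchronous caller can't await at all; it can only hand
  // off a callback and lose the ability to handle the result inline.
  buildGreeting(id).then((msg) => console.log(msg));
}

logGreeting("42");
```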
The _main_ problems are:
1. Async tasks can easily get orphaned in Node.js. This doesn't happen when using OTP on BEAM, because you typically start a gen_server (or a gen_*) under a supervisor. Even processes that are not supervised can be tracked. Because pids (identifiers to processes) are first-class primitives, you can always access the scheduler, which will tell you _all_ of the running processes. If you were to attach a Node.js REPL, you can't really tell what's running. This is because there is no encapsulation of the task, no way to track when something went async, no way to send control messages to those async tasks.
2. Because async tasks are easily orphaned, errors that get thrown easily get lost (see the sketch after this list). The response I get from people who love TypeScript on Node.js tells me that is what the linter is for. That is, we're going to use an external tool to enforce that all errors get handled, rather than having the design of the language and the runtime handle the error. In the BEAM runtime, an unhandled error within a process crashes that process without crashing anything else; processes that are monitoring the crashed process get notified by the runtime that it has crashed. The engineer can then define the logic for handling that crash (retry? restart? throw an error?).
3. The gen_server behavior in OTP defines ways to send control messages. This allows more nuanced approaches to managing subsystems than just restarting when things crash.
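To make points 1 and 2 concrete, here is a minimal Node.js sketch (the function names are hypothetical): a fire-and-forget promise that nothing supervises, whose failure never reaches the caller.

```typescript
// Hypothetical example: an async task is started without await or
// .catch(), so nothing holds a handle to it and nothing supervises it.
async function syncToWarehouse(batch: number[]): Promise<void> {
  throw new Error(`sync failed for batch of ${batch.length}`);
}

function handleRequest(): void {
  // Fire-and-forget: the caller returns "successfully" while the task
  // is now orphaned and its error never propagates back.
  syncToWarehouse([1, 2, 3]);
}

// The closest thing to a supervisor is a process-wide hook. Without it,
// depending on the Node version and flags, the rejection is either
// silently dropped, logged as a warning, or crashes the whole process;
// either way there is no per-task identity to inspect, monitor, or restart.
process.on("unhandledRejection", (reason) => {
  console.error("orphaned async task failed:", reason);
});

handleRequest();
```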
I'm pretty much at the point where I would not really want to work on deploying Node.js on the backend. I don't see how something like Deno would fix anything. TypeScript is incapable of fixing this, because these are design flaws in the runtime itself.
Just to further hammer point 2 and how it's a problem in the real world: Express, probably the go-to server library for close to a decade, has only within the last couple of months sorted out not completely swallowing, by default, any error that happens in async middleware. And only because some new people came in to finally fix it! It's absolutely insane how long that took and how easy it was to get stung by that issue.
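For illustration, a rough sketch of what that looked like under Express 4-style behavior (the routes are made up): a rejection inside an async handler never reaches the error middleware unless you catch it and call next(err) yourself, whereas Express 5 finally forwards rejected handler promises there automatically.

```typescript
// Illustrative sketch of the Express 4-era failure mode (hypothetical routes).
import express from "express";

const app = express();

// The rejection here never reaches the error middleware on Express 4;
// it becomes an unhandled promise rejection and the request just hangs.
app.get("/report", async (_req, _res) => {
  throw new Error("report generation failed");
});

// The old workaround: catch manually and forward with next(err).
app.get("/report-safe", async (_req, _res, next) => {
  try {
    throw new Error("report generation failed");
  } catch (err) {
    next(err);
  }
});

// Error-handling middleware. Express 5 finally forwards rejected
// async handler promises here automatically.
app.use(
  (err: Error, _req: express.Request, res: express.Response,
   _next: express.NextFunction) => {
    res.status(500).send(err.message);
  }
);

app.listen(3000);
```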
The problem with Node is observability. They've optimized away observability to the point where it's hard to find performance problems compared to the JVM or BEAM.
I have been looking for an Erlang thing akin to Apache Airflow or Argo Workflows. Something that allows me to define a DAG of processes, so that they run one after the other. How would you implement something like that?