> Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads such that one can easily use hundreds of thousands of OS threads without negatively impacting performance
I actually can't imagine how that would ever be accomplished at the OS level. The fact that each thread needs its own stack is an inherent limiter for efficiency, as switching stacks leads to cache misses. Asynchronous I/O has an edge because it only stores exactly as much state as it needs for its continuation, and multiple tasks can have their state in the same CPU cache line. The OS doesn't know nearly enough about your program to optimize the stack contents to only contain the state you need for the remainder of the thread.
But at the programming language level the compiler does have insight into the dependencies of your continuation, so it can build a closure that has only what it needs to have. You still have asynchronous I/O at the core but the language creates an abstraction that behaves like a synchronous threaded model, as seen in C#, Kotlin, etc. This doesn't come without challenges. For example, in Kotlin the debugger is unable to show contents of variables that are not needed further down in the code because they have already been removed from the underlying closure. But I'm sure they are solvable.
> But at the programming language level the compiler does have insight into the dependencies of your continuation
This is really the key point - coupled with the fact that certain I/O operations are just inherently asynchronous.
The TX/RX queues in NICs are an async, message passing interface - regardless of whether you're polling descriptors or receiving completion interrupts.
So really, async I/O is the natural abstraction for networking.
Anything that happens far enough from the CPU is async, and here far probably means 10cm thanks to the speed of light not being fast enough, even in vacuum.
So, any computation that spans a machine the size of our hands needs async unless you are willing to drop the clocks to push "far" a bit further away (and bring in power, heat and noise with it).
Another cut off could be 3 cm away: the RAM. If data needs to go on the heap, be shared, one can consider the truth lives farther away than 3cm and thus has async/impure effects.
But the same way the CPU works very hard to hide that away from you (reordering, caches, etc), green threads could do the same with barely any performance hit.
Which is fine until we decide that actually, we _would_ like that performance back, and now we re-implement async flows at the language level again, realise what an advantage it gets us, and then that trickles into other languages as they also seek those advantages.
Async or not, things happening far away can completely prevent you from making progress, in which case you are better off doing something else in the meantime - like simply blocking and letting the OS schedule a thread from a different process, since even the cost of swapping threads and trashing the caches and other CPU state is minimal when nothing else can be done.
The optimization begins when, instead of a single thing preventing you from making progress, it's a set of things, so you can parallelize the tasks and handle them in the seemingly random order they happen to finish. At that point the question is: how is this not an event loop running async tasks?
It’s been said that the only sync I/O behaviour in software is the interrupt handler in the top half of your kernel (or equivalent). Everything else, at every other layer, however you contextualise it or write the API, is in actuality some variant of polling.
> certain I/O operations are just inherently asynchronous.
That's technically not true.
The fact that it's inherently async is an implementation detail. You either have blocking sync or non-blocking async. The implementation could be synchronous if the blocking didn't cause overhead, and that was the proposed idea here - at least as far as I interpreted it.
No, the hardware is frequently inherently asynchronous. You write some memory and then the hardware consumes the prepared data asynchronously, in parallel, until it informs you in some manner that the operation is complete (usually either an asynchronous interrupt, or an asynchronous write to a location you are polling). You can do whatever you want after preparing the data without waiting for completion. That is an inherently asynchronous hardware interface.
The software interfaces built on top of the inherently asynchronous hardware interface can either preserve or change that nature. That is an implementation detail.
Hardware is event-driven, not asynchronous (the event-driven paradigm is an asynchronous paradigm, but I assume here you mean asynchronous as in async/await).
While that is technically true, it's also missing the point entirely. That's why I said you can have either blocking sync or non-blocking async.
The article explicitly talks about the API provided by the OS. As a matter of fact, they're even more specific talking about spawning os threads vs async non-blocking file access.
At this level, the async is an implementation detail.
I guess your comment confirms that you didn't read the article, and neither did the people down voting me.
As usual on HN: lots of people suffering from the Dunning-Kruger effect.
I guess anything is an implementation detail depending on what level we're talking about, but each layer is constrained by the "previous" layer in some ways, and we're working within those constraints.
Sometimes we have the luxury of not caring about this, and we can let the language runtime pick what it wants to do, and abstract it away for us. But sometimes you need to care, because you're constrained by the existing environment/conditions the code runs in, and need specific control over e.g. timings.
I think in an ideal world where we can start from scratch and aren't constrained by existing tech stacks/layers/hardware, maybe the OS and compilers (and runtimes) could integrate more tightly when it comes to threading, stack, and memory management. It would be a radically different architecture, though (and I dunno which layers we'd need to invalidate).
No, it shows you did not understand the context or the point of the comment you originally responded to.
The author wants both synchronous and asynchronous modes, but they complain it is painful to provide asynchronous modes using synchronous primitives due to limitations of OS handling for such cases.
dist1ll was pointing out how the lower abstraction level, the hardware, actually presents an asynchronous interface. As such, the higher abstraction level the author of the article was complaining about, the OS API (which is a software implementation detail), lacks hardware sympathy.
The implication being that instead of the hardware presenting an asynchronous API, that the OS transduces into a synchronous API, that the author must transduce back into an asynchronous API; it might be simpler if the OS just directly presents the underlying asynchronous API and the author can transduce that into a synchronous API where needed. That involves fewer layers, fewer conversions, more hardware sympathy, and sidesteps the OS limitations preventing simple implementation of asynchronous modes using synchronous primitives.
Basically, if you have one wrong, you should not make a second wrong to make a right. You should just drop the first wrong. That is not obvious if you do not realize you are starting from or assuming one wrong.
Thank you, this was exactly my point. If efficiency is important, you want a very thin abstraction over hardware (in other words: libraries over frameworks).
This lack of hardware sympathy is actually something operating systems have been addressing in recent times. E.g. another top-level comment mentioned things like AF_XDP from the Linux kernel. I'm guessing similar ideas exist for modern, high-speed non-volatile storage (thinking of SPDK).
The approach the author takes with their language is just threads, but scheduled in userland. This model allows a decoupling of the performance characteristics of runtime threads from OS threads - which can sometimes be beneficial - but essentially, the programming model is fundamentally still synchronous.
Asynchronous programming with async/await is about revealing the time dimension of execution as a first class concept. This allows more opportunities for composition.
Take cancellation for example: cancelling tasks under the synchronous programming model requires passing a context object through every part of your code that might call down into an IO operation. This context object is checked for cancellation at each point a task might block, and checked when a blocking operation is interrupted.
Timeouts are even trickier to do in this model, especially if your underlying IO only allows you to set per-operation timeouts and you're trying to expose a deadline-style interface instead.
Under the asynchronous model, both timeouts and cancellation simply compose. You take a future representing the work you're doing, and spawn a new future that completes after sleeping for some duration, or spawn a new future that waits on a cancel channel. Then you just race these futures. Take whichever completes first and cancel the other.
Having done a lot of programming under both paradigms, the synchronous model is so much more clunky and error-prone to work with and involves a lot of tedious manual work, like passing context objects around, that simply disappears under the asynchronous model.
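For concreteness, the context-passing pattern described above looks roughly like this in Go, whose context.Context is essentially this pattern baked into the standard library (a minimal sketch; handleRequest, fetchUser and the query are made up for illustration):

    package example

    import (
        "context"
        "database/sql"
        "time"
    )

    // handleRequest sits at the top of the call chain: it creates the timeout
    // and must hand ctx to everything below it that might block.
    func handleRequest(db *sql.DB, id int64) (string, error) {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        return fetchUser(ctx, db, id)
    }

    // fetchUser is the blocking-style IO call; ctx has to be threaded through
    // every intermediate layer for cancellation to reach it.
    func fetchUser(ctx context.Context, db *sql.DB, id int64) (string, error) {
        if err := ctx.Err(); err != nil { // already cancelled? bail out early
            return "", err
        }
        var name string
        // QueryRowContext returns early if ctx is cancelled or its deadline
        // expires while the query is blocked.
        err := db.QueryRowContext(ctx, "SELECT name FROM users WHERE id = ?", id).Scan(&name)
        return name, err
    }

Every function between the top of the stack and the IO call has to grow a ctx parameter, which is exactly the tedium being described.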
One of the other aspects of this is that implementing a synchronous model on top of asynchronous primitives is absolutely trivial. You just wait until the asynchronous operation completes. Any program designed for asynchronous execution can be trivially retrofitted for synchronous execution.
In contrast, implementing an asynchronous model on top of synchronous primitives is extremely challenging, requiring the current mess of complex implementations such as state machine rewriting and thread pools. Furthermore, retrofitting a program using synchronous execution to use asynchronous execution is a tremendous amount of work, as you need to do it bottom up to maintain compatibility during the transition.
> Furthermore, retrofitting a program using synchronous execution to use asynchronous execution is a tremendous amount of work as you need to do it bottom up to maintain compatibility during the transition.
Can attest to that. I was part of the team that rewrote Firefox to be fully async. Took us years and we could not maintain compatibility with XUL add-ons (async was not the only reason for this, but that's where the writing on the wall started to become visible).
It's also very easy to implement a future/promise style API over a blocking IO primitive, as long as you have cheap threads: you spawn a thread that executes the blocking operation (with cancelation and timeout support as needed) and sets the future's result once the operation completes, or sets an error state. It's really not such a huge problem.
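A minimal Go sketch of that idea, with a goroutine as the cheap thread and a channel standing in for the future (readFileAsync and the path are made up):

    package main

    import "os"

    // result is a tiny future: the worker completes it exactly once and the
    // caller blocks on the channel only when it actually needs the value.
    type result struct {
        data []byte
        err  error
    }

    // readFileAsync wraps a plain blocking call in a goroutine and hands back
    // a channel that acts as the future's completion.
    func readFileAsync(path string) <-chan result {
        ch := make(chan result, 1) // buffered so the worker never waits on delivery
        go func() {
            data, err := os.ReadFile(path) // ordinary blocking IO
            ch <- result{data, err}
        }()
        return ch
    }

    func main() {
        fut := readFileAsync("/etc/hostname")
        // ... do other work here ...
        r := <-fut // the "await": block only at the point the result is needed
        _ = r
    }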
I will also note that most async runtimes include much more complex program rewrites and implicit state machines than thread based models. Java style or Go style green threads are much simpler frameworks than C#'s whole async task machinery, or even than Rust's Tokio.
And any program written as a series of threads running blocking operations with proper synchronization is also pretty easy to convert to an async model. The difficulty is taking a single-threaded program and making it run in an async model, but that is a completely different discussion.
However, I do agree that ultimately you do need the OS to provide async IO primitives to have efficient IO at the application level. Since OS threads can't scale to the required level, even the green threads + blocking IO approach is only realistically implementable with async IO from the OS level. This could change if the OS actually implemented a green threads runtime for blocking operations, but that might still have other inefficiencies related to costs of crossing security boundaries.
Yes, if you have cheap threads at the required scale. But that is the entire problem: as you also attest, you agree with the author and me that OS threads do not scale to the required level.
Asynchronous, non-blocking primitives do scale to the required level, demonstrate greater hardware sympathy, and can easily and practically be used for the other half of the equation, blocking I/O, at the required scale.
Asynchronous primitives robustly solve the entire problem space, whereas synchronous primitives suffer in high concurrency cases.
The only reason to prefer only exposing/implementing synchronous primitives in the API is implementation complexity. But at the OS layer you are abstracting hardware interfaces that almost always present asynchronous interfaces. Thus, exposing asynchronous APIs is usually not very hard, whereas exposing synchronous APIs is actually a mismatch that requires smoothing over (though to be fair not very much since, as I mentioned previously, implementing a synchronous operation in terms of asynchronous primitives is quite easy).
Though in truth I think we largely agree anyways. I was just presenting a more complete explanation.
> Asynchronous programming with async/await is about revealing the time dimension of execution as a first class concept
People are more likely to assume their code is fast enough and not worry about the execution time of synchronous data processing, then spend weeks investigating why the p99 latency is 5 seconds with clusters of spikes.
Async IO is almost entirely about efficiency. It's telling the OS that you can manage context switches better than it. Usually this means you're making a tradeoff for throughput over latency. That tradeoff for efficiency is fine, but it needs to be conscious, and most of the time, you actually want lower latency.
Are you sure that the developer is the best at determining these context switches? I mean, for a low-level language like Rust, sure. But for higher-level programming, e.g. some CRUD backend, should the developer really care about all that added complexity, when the runtime knows just as much, if not more? Like, it's a DB call? Then just use the async primitive of the OS in the background and schedule another job in its place, until it "returns". I don't gain anything from manually adding points where this could happen.
I think the Java virtual thread model is the ideal choice for higher level programming for this reason. Async actually imposes a much stricter order of execution than necessary.
Same as when we used to write asm by hand, but then the generated code became good enough, or even better than hand-written asm.
I predict the async trend will fade as hardware and software improve. Synchronous programming is higher level than using async, and higher level always prevails given enough time, as management always wants to hire the cheapest devs for the task.
Not sure why the downvotes. Async programming is harder than sync, as one needs to know not only one's own code, but also all the dependencies. Since the benefits of async are in many scenarios limited[1] I'd expect the simpler abstractions to win.
I am a CTO at a large company and I routinely experience tech leads who don't understand what happens under the hood of an async event loop and act surprised when weird and hard-to-debug p99 issues occur in prod (because some obscure dependency of a dependency does sync file access for something trivial).
Abstractions win, people are lazy, and most new developers lack understanding of lower-level concepts such as interrupts or polling, nor can they predict what code may be CPU-bound and unsafe in an async codebase. In a few years, explicit async will be seen as C is seen today - a low-level skill.
[1] if your service handles, say, 500qps on average the difference between async and threaded sync might be just 1-2 extra nodes. Does not register on the infra spend.
Cancellation isn't possible in general. For example, if you've kicked off expensive work on another thread or passed a pointer to your future to the kernel via io_uring, you must add a layer of indirection that's quite similar to a cancelation context. You can't just accept that the expensive work you don't care about will keep happening or resume a future that has been dropped when the cqe entry bearing a pointer to it arrives. The cancellation facility provided by io_uring that guarantees that you won't receive such a thing does so by blocking your thread for a while, which is undesirable.
With both synchronous and asynchronous flows, if you want to support cancelation from a high level (e.g. the user can click Cancel in the UI), you need to pass some kind of context from there down to each and every operation that needs to be cancellable. Whether that is done by passing around context objects from the UI down to IO operations, or by ensuring all functions called from the UI down return Task<T> objects, the problem is the same. The context object approach even has the advantage that it also allows you to pass other application-specific things, such as passing progress information up from the bottom of the stack to the UI, or logging ids etc, that the generic Task object won't have.
Also, deadline-style contexts aren't as hard as you make them out to be: you keep track of the remaining time, and pass that as a timeout to every blocking operation, then subtract the actual time taken and pass the remaining time to the next blocking task etc. Or, you can do the exact same thing as the async case: you spawn two threads, one handling the blocking operations, the other waiting for a timeout, and both sharing a cancelation context. Whichever finishes first cancels the other.
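Roughly, in Go (a minimal sketch; blockingOp and runWithDeadline are hypothetical names standing in for an IO primitive that only accepts per-operation timeouts):

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // blockingOp stands in for any IO primitive that only accepts a
    // per-operation (relative) timeout.
    type blockingOp func(timeout time.Duration) error

    // runWithDeadline exposes a deadline-style interface on top of
    // per-operation timeouts: track the remaining time and hand it to
    // each blocking call in turn.
    func runWithDeadline(ops []blockingOp, overall time.Duration) error {
        deadline := time.Now().Add(overall)
        for _, op := range ops {
            remaining := time.Until(deadline)
            if remaining <= 0 {
                return errors.New("deadline exceeded before all operations finished")
            }
            if err := op(remaining); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        // A stand-in op that just sleeps; a real op would respect the timeout it is given.
        sleepOp := func(_ time.Duration) error { time.Sleep(50 * time.Millisecond); return nil }
        // With an 80ms budget the third call no longer fits and is refused.
        err := runWithDeadline([]blockingOp{sleepOp, sleepOp, sleepOp}, 80*time.Millisecond)
        fmt.Println(err)
    }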
The difficulty of doing cancellations for most real operations is anyway going to be much, much higher than these small differences. The real difficulty of cancellations lies in undoing already finished parts of atomic operations that you completed. The effort to do that is going to dominate the effort to pass down the context object.
> Under the asynchronous model, both timeouts and cancellation simply compose. You take a future representing the work you're doing, and spawn a new future that completes after sleeping for some duration, or spawn a new future that waits on a cancel channel. Then you just race these futures. Take whichever completes first and cancel the other.
That only works when what you're trying to do has no side effect. Consider what happens when you need to cancel a write to a file or a stream. Did you write everything? Something? Nothing? What's the state of the file/stream at this point?
Unfortunately, this is intractable: you'll need the underlying system to let you know, which means you will have to wait for it to return. Therefore, if these operations should have a deadline, you'll need to be able to communicate that to the kernel.
> Take cancellation for example: cancelling tasks under the synchronous programming model requires passing a context object through every part of your code that might call down into an IO operation.
Does it? Wouldn’t you just kill the thread in the synchronous model?
Anything that makes cancellation decisions from the outside of a black box is unsound. You are asking the programmer of the thing being cancelled to program in such a way, that the impact on the world outside abides by all invariants no matter where the cancellation happens.
For threads, that's an impossible ask. Thread cancellation is a big no-no. Just never do that. The sibling's mentions of leaks and deadlocks are just examples of things that can go wrong.
Task cancellation in async has the same issues. Less so, maybe much less so, because every space between two awaits acts as a no-cancellation-zone, but it's still a very suspect thing to do.
You'd interrupt it using something like Java's interrupt model. That doesn't require any context object (the Thread is itself the context) and works correctly with (synchronous) I/O operations.
The big problem with InterruptedException is that it's checked, and developers often don't know what to do with it so tend to swallow it or retry. There isn't necessarily a solid discipline about how to handle interruption in every library. But that is of course an orthogonal problem that you'd have with any sufficiently pervasive cancellation scheme.
You can probably make this work if and only if the thread is shared-nothing. If you share any data structure with another thread then you have the possibility of leaving it in an invalid state when killed.
This also requires any "channel" primitives you use for inter-thread communication to be tolerant of either thread being killed at any instruction boundary, which is hard to design.
Killing threads uncooperatively wrecks all your other concurrency primitives.
Not a theoretical consideration, for example we recently discovered that with GRPC Java if you kill a thread while it is making a request, it will leave the channel object in an unusable state for all other threads.
These are just assumptions on your part - the synchronous/threading model doesn’t have to be that primitive, the Thread itself can take on the semantics of cancellation/timeouts just fine. While there are some ergonomic warts in java’s case, it does show that something like interrupts can work reasonably well (with a sufficiently good error handling system, e.g. exceptions).
With “structured concurrency”/nurseries it can be much more readable than manually checking interruptions, and you can just do something like fire a bunch of requests with a given timeline, and join them at the end of the “block”.
I'd argue that few usages of async are motivated this way. In Rust land, it's efficiency and in JS land it's the browser scripting language legacy.
> cancelling tasks under the synchronous programming model requires passing a context object through every part of your code that might call down into an IO operation.
This is true for some but not all implementations. See eg Erlang or Unix processes (and maybe cancellation in pthreads?).
To be more precise, in JS land, we introduced async not directly because of scripting but because of backwards compatibility - prior to async/Promise, the JS + DOM semantics were specified with a single thread of execution in mind, with complex dependencies in both directions (e.g. some DOM operations can cause sync reflow while handling an event, which is... bad) and run-to-completion.
Promise made it easier to:
- cut monolithic chunks of code-that-needs-to-be-executed-to-completion-before-updating-the-screen into something that didn't cause jank;
- introduce background I/O.
async/await made Promise more readable.
(yes, that's for Promise and async/await on the browser, async callbacks have a different history on Node)
I am not sure I buy the underlying idea behind this piece, that somehow a lot of money/time has been invested into asynchronous IO at the expense of thread performance (creation time, context switch time, scheduler efficiency, etc.).
First, significant work has been done in the kernel in that area simply because any gains there massively impact application performance and energy efficiency, two things the big kernel sponsors deeply care about.
Second, asynchronous IO in the kernel has actually been underinvested for years. Async disk IO did not exist at all for years until AIO came to be. And even that was a half-baked, awful API no one wanted to use except for some database people who needed it badly enough to be willing to put up with it. It's a somewhat recent development that really fast, genuinely async IO has taken center stage through io_uring and the likes of AF_XDP.
Making OS threads run more efficiently amounts to faking async IO (disk/network/whatever leaves the computer's case) as sync operations in a more efficient way. But why do that in the first place if the program can handle async operations in the first place? Just letting userland programs do their own business would be a better decision.
Somewhat controversial take: the current threads implementation is usually already performant enough for most use cases. The actual reason why we don't use them to handle more than a few thousand concurrent operations is that, at least in Linux, threads are scheduled and treated very similarly to processes. E.g. if a single process with 3000 threads gets bottlenecked on some syscall, etc, your system load average will become 3000, and it will essentially lead to no other processes being able to run well on the same machine.
Another issue with threads performance is that they are visible to most system tools like `ps`, and thus having too many threads starts to affect operations _outside_ the kernel, e.g. many monitoring tools, etc.
So that's the main reason why user-space scheduling became so popular: it hides the "threads" from the system, allowing for processes to be scheduled more fairly (preventing stuff like reaching LA 3000 when writing to 3000 parallel connections), and not affecting performance of the system infrastructure around the kernel.
BTW the thread stacks, like everything else in Linux, are allocated lazily, so if you only use something like 4KB of stack in a thread it won't lead to an RSS of the full 8MB. It will contribute to VMEM, but not RSS.
> your system load average will become 3000, and it will essentially lead to no other processes being able to run well on the same machine.
That doesn't sound right. I mean, there are many schedulers available, but I was under the impression that most have a separate blocked queue. (Or more specifically, anything blocked will not be in the runnable pool - those will be dequeued.) I.e. anything waiting on a syscall will be mostly ignored.
(Please correct me if I'm misunderstanding the usual behaviour here)
> More specifically, what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn't need asynchronous IO in the first place?
This is still living in an antiquated world where IO was infrequent and contained enough that one blocking call per thread still made you reasonable forward progress. When you're making three separate calls and correlating the data between them, having the entire thread blocked for each call is still problematic.
Linux can handle far more threads than Windows and it still employs io_uring. Why do you suppose that is?
One little yellow box about it is not enough to defend the thesis of this article.
Linux and Windows are limited by the commit limit when it comes to threads. Linux has a default thread stack size of 8MB vs. NT's 1MB, in theory meaning Linux will run out of allocation space much quicker.
But in the end, both are limited by available memory.
> Now imagine a parallel universe where instead of focusing on making asynchronous IO work
Funny choice of words. In the JVM world, Ron Pressler's first foray into fibers, Quasar, came out of a company named "Parallel Universe". It worked with a Java agent manipulating bytecode. Then Ron went to Oracle and now we have Loom, aka a virtual thread unmounted at each async IO request.
Java's Loom is not even mentioned in the article. I wonder, for a cofounder: does "parallel universe" appear in another foundational paper calling for a lightweight thread abstraction?
I don't quite agree with this piece, as it is comparing apples and oranges.
What you want is patterns for having safety, efficiency and maintainability for concurrent and parallelized processing.
One early pattern for doing that was codified as POSIX threads - continue the blocking processing patterns of POSIX so that you can have multiple parallelizable streams of execution with primitives to protect against simultaneous use of shared resources and data.
io_uring is not such a pattern. It is a kernel API. You can try to use it directly, but you can also use it as one component in userland thread systems, in actor systems, in structured concurrency systems, etc.
So the author is seemingly comparing the shipped pattern (threads) vs direct manipulation, and complaining that the direct manipulation isn't as safe or maintainable. It wasn't meant to be.
To put a different spin on what others are saying, asynchronous IO is a different programming model for concurrency that's actually more ergonomic and easier to get right for an average developer (which includes great developers on their not-highly-focused days).
Dealing with raciness, deadlocks and starvation is simply hard, especially when you are focused on solving a different but also hard business problem.
That's also why RDBMSes had and continue to have such a success: they hide this complexity behind a few common patterns and a simple language.
Now, I do agree that languages that suffer from the "color of your functions" problems didn't get it right (Python, for instance). But ultimately, this is an easier mental model, and it's been present since the dawn of purely functional languages (nothing stops a Lisp implementation from doing async IO, and it might only be non-obvious how to do "cancellation" while "gather" is natural too)
Async/await is a language semantics thing. It's not really relevant whether there's a "real" OS thread under the hood, some language level green thread system, or just the current process blocking on something - the syntax exists because sometimes you don't want to block on things that take a long time semantically - I.e. you want the next line of code to run immediately.
You could absolutely write a language where the blocking on long running tasks was implicit and instead there was a keyword for when you don't want to block, but the programmer doesn't really need to care about the underlying threading system.
That's something like what Go does. Goroutines are "green threads" - they can be preempted. There's a CPU scheduler in user space. Go tries to provide "async" performance, and goroutines have minimal state. This seems to work well for the web server case.
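For example, in Go the blocking-by-default/explicit-concurrency split looks roughly like this (a sketch; the URLs are placeholders):

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        // Blocking is the default: the calling goroutine just waits here,
        // and the runtime is free to run other goroutines on this thread.
        resp, err := http.Get("https://example.com")
        if err == nil {
            resp.Body.Close()
        }

        // The "don't block me" case is the explicit keyword: go.
        done := make(chan struct{})
        go func() {
            r, err := http.Get("https://example.com/other")
            if err == nil {
                r.Body.Close()
            }
            close(done)
        }()

        // ... do other work here ...
        <-done // join once the result is actually needed
        fmt.Println("both requests finished")
    }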
Pure "Async" means your application is now in the CPU dispatching business. This works well only if
your application is totally I/O bound and has no substantial compute sections. Outside of that use case, it may be a huge mismatch to the problem. Worst case tends to be programs where almost all the time, something is fast, but sometimes it takes a lot of compute. Then all those quick async tasks get stuck behind the compute-bound operation. Web browsers struggle with that class of problems.
Async I/O tends to be over-used today because Javascript works that way. Many programmers came up from Javascript land. That's the only way they can conceive concurrency. It's a simple, clean model - no need for locks.
Threading is hard. Especially in languages that don't provide much help with locking. Even then, you have lock order problems, deadlocks, starvation, futex congestion...
Advanced JS is a pain for that reason. Any interaction with a "slow" API, even doing cryptographic operations (super common inside libraries), can introduce points at which literally anything can change and in particular a point where new UI events can be triggered. So it's like concurrency but without any of the tools to manage it: you call a function, and by the time it returns arbitrary unrelated stuff may have executed.
I've encountered quite a few programmers over the years who think the absence of tools like locking or concurrent data structures is a feature, that they don't need them because async is simpler. Wrong. They're not unneeded, they're just missing, like many other basic APIs you'd expect that are missing inside browsers.
In standard GUI programming you're in control of the event loop and can decide whether to block it or not. This is a powerful tool for correctness. If you need to do a slow IO in response to a button being clicked you can just do it. The user may see the button freeze in the depressed state for a moment if their filesystem is being slow, for example, but they won't suffer data corruption or correctness issues, just a freeze. If you don't want the UI to freeze you can kick off a separate thread and then use mutexes or actor messaging to implement coordination, doing the extra work that surfaces the possible interactions and makes you work out what should happen.
In the browser environment there's none of that. You have to find other ways to disable the event loop, like by disabling all the UI the user could interact with once an interaction starts, or - more commonly - just ignore race conditions and let the app break if the user does something unexpected. It's partly for this reason that web apps always seem so fragile.
And don't get me started on the situation w.r.t. database concurrency ... how many developers really understand DB locking and tx isolation levels?
    foo.a = a;
    foo.b = await b(); // while waiting for b's completion, something happens that changes the value of `a`, making `foo` invalid
(or more complex variants of this)
In a multi-threaded model, this would be classified as a race condition on `a`. That's why Rust has RefCell for r/w data sharing in single-threaded code.
I don't remember the exact details, but I was hit by this many times when refactoring e.g. db access or file access in the (JS) code of Firefox. This can happen without async/await, without Promise and even without an event loop, you just need callbacks. But of course, having an event loop, Promise and async/await give you way more opportunities to hit such an issue.
Having race conditions in JS is more about questionable programming practices, though.
Instead of writing the results of operations into separate variables and aggregating them later, you write them into the same variable in an unspecified order and pray it works correctly. There won't be memory corruption or anything, but the results you get won't be correct either.
This type of problem is probably what Rust tries to address (Rust will probably tell you to fxxk off, because write permission shouldn't be granted in two places at the same time). But unfortunately there isn't a Rust for JS, so the only thing you can do is take care of it yourself.
Async/await and threading+blocking are ultimately duals of each other. You can express the same semantics with one model or the other, regardless of the underlying implementation of either. You could in fact implement multi-threading and blocking on top of a task-based API if you wanted to - e.g. you could implement a Java-style Threads API in JS if you really wanted to (of course, code would still run single threaded, like in old times with single-CPU systems).
To me the article reads as if the programming language author wants to push a difficult problem out of his language without deeper analysis. As if it would be easier if it was somebody else's problem.
Asynchronous IO is just simply how the world works. Instead, the idea that changes happen only during CPU computation is the mistake. Your disk drive/network card exists in parallel to your CPU and can process stuff concurrently. Your CPU very likely has a DMA engine that works in parallel without consuming a hardware thread.
The "world" these days works by ringbuffers and interrupts, neither of which looks like async/await. Async/await is a chosen abstraction over what hardware/kernel interfaces really look like.
Well, async/await in languages is just promises/futures made to look like regular flow control via state machines. And promises are controlled via callbacks, which are called from some event.
It's essentially the same thing as when you get an interrupt from your microcontroller's DMA engine that the ring buffer needs to be refilled, just with 2 layers of abstractions on top to make it look nicer.
> Asynchronous IO is just simply how the world works.
How the world works looks more like message passing. What abstraction you construct on top of that is a choice. Async/await is not necessarily the global maximum.
Many synchronous I/O operations under the hood are just async I/O + blocking waits, at least that's the case with Windows. Why? Because all I/O is inherently async. Even polling I/O requires timed waits which also makes it async.
That said, I like the async programming model in general, not just for I/O. It makes modeling your software as separately flowing operations that need to be synchronized occasionally quite easy. Some tasks need to run in parallel? Then you just wait for them later.
I also like the channel concept of Golang and D in the same manner, but I heard it brought up some problems that the async/await model didn't have. Can't remember what it was now. Maybe they are more susceptible to race conditions? Not sure.
> "Not only would this offer an easier mental model for developers..."
Translation: "I find async i/o confusing and all developers are like me".
This argument has been going on for over 20 years at this point. There are some people who think having pools of threads polling is a natural way of thinking about IO. They keep waiting for the day this becomes an efficient way to do IO.
I would concur, the author also mentions that file IO isn't async; this makes me conclude that the author hasn't grasped how much the Linux kernel has moved on since in the mid 2000s.
I suspect that the author has an incomplete view of the current kernel / userland interface as well as the inner workings of how a kernel actually "does what it does".
The Linux kernel isn't "primarily" sync; the kernel itself has been almost entirely async except for some small parts, like the bottom half of interrupt handlers which can't be pre-empted. Even in the early days of Linux the scheduler was pre-emptive and used "wait channels" to create a synchronous interface for the asynchronous kernel. In the early days of Linux, if a process was in kernel mode then it couldn't be pre-empted until it returned to user mode, but this is no longer the case, and with PREEMPT_RT this is even the case for RT processes/kernel tasks.
Now, the POSIX API is largely synchronous, but this is mainly because of the history of UNIX (and Multics before it).
That particular paragraph can be phrased better so I'll adjust that. What I meant to say is that you can't handle it asynchronously like you can with sockets, i.e. polling it for readiness. This is because for file IO, reads and writes are always reported as being available, making epoll/kqueue/etc effectively useless.
With io_uring, you don't poll a file descriptor for readiness and then attempt a non-blocking operation and hope it works. You submit a request and later get a response.
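Shape-wise the difference looks roughly like this toy Go model of a completion-based interface, with channels standing in for the submission and completion queues (this is not the real io_uring API):

    package main

    import "os"

    // A toy completion-based interface: submit a request, reap the finished
    // result later - there is no "is the descriptor ready yet?" polling step.
    type request struct {
        path string
    }

    type completion struct {
        req  request
        data []byte
        err  error
    }

    func submit(sq chan<- request, req request) { sq <- req } // queue work, return immediately

    func worker(sq <-chan request, cq chan<- completion) {
        for req := range sq {
            data, err := os.ReadFile(req.path) // stand-in for the kernel doing the IO
            cq <- completion{req, data, err}
        }
    }

    func main() {
        sq := make(chan request, 64)    // "submission queue"
        cq := make(chan completion, 64) // "completion queue"
        go worker(sq, cq)

        submit(sq, request{"/etc/hostname"})
        c := <-cq // reap the completion whenever it shows up
        _ = c
    }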
To be fair to the author, io_uring has not made its way very far into userspace libraries yet. Even if the language you use has async primitives, and it’s on paper a great fit for io_uring, there’s a slim to zero chance the underlying libraries you’re using are actually using it yet. (Unless you’ve gone out of your way to target io_uring and write your own code for it.)
Yeah, it's a huge challenge. A big part of the overhead of read(2) is that historically each read required a buffer allocated to it before you start the slow request, and the way languages implemented non-blocking I/O followed that API. io_uring avoids the problem with buffer pools, but that completely changes the I/O API that most languages have settled on.
io_uring can poll filesystem descriptors for readiness; epoll, select, etc come from the networking world, and aio, etc come from the disk world, and as a result both have their various blinkers / limitations.
And I guess I could have phrased it better when I mentioned newer Linux developments, as io_uring does a far more comprehensive job of handling not just filesystem descriptors and network descriptors, but also other types of character devices.
There are programs where async IO is great, but in my experience it stops being useful as your code “does more stuff”.
The few large scale async systems I've worked with end up with functions taking too long, so you use the ability to spin off functions into threadpools, then async wait for their return, at which point you often end up with the worst of both threads and async.
> File IO is perhaps the best example of this (at least on Linux). To handle such cases, languages must provide some sort of alternative strategy such as performing the work in a dedicated pool of OS threads.
AIO has existed for a long time. A lot longer than io_uring.
I think the thing that the author misses here is that the majority of IO that happens is actually interrupt driven in the first place, so async io is always going to be the more efficient approach.
The author also misses that scheduling threads efficiently from a kernel context is really hard. Async io also confers a benefit in terms of “data scheduling.” This is more relevant for workloads like memcached.
> Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads such that one can easily use hundreds of thousands of OS threads without negatively impacting performance
Isn't that why async I/O was created in the first place?
> Just use 100 000 threads and let the OS handle it.
How does the OS handle it? How does the OS know whether to give it CPU time or not?
I was expecting something from the OP (like a new networking or multi-threading primitive) but I have a feeling he lacks an understanding of how networking and async I/O works.
FTA: “Need to call a C function that may block the calling thread? Just run it on a separate thread, instead of having to rely on some sort of mechanism provided by the IO runtime/language to deal with blocking C function calls.”
And then? How do you know when your call completed without “some sort of mechanism provided by the IO runtime/language”? Yes, you periodically ask the OS whether that thread completed, but that doesn’t come for free and is far from elegant.
There are solutions. The cheapest, resource-wise, are I/O completion callbacks. That’s what ”System” had on the original Mac in 1984, and there likely were even smaller systems before that had them.
Easier for programmers would be something like what we now have with async/await.
It might not be the best option, but AFAICT, this article doesn’t propose a better one. Yes, firing off threads is easy, but getting the parts together the moment they’re all available isn’t.
I think you missed the point: io_uring is the newer API. Microsoft saw the new async IO API developed for Linux, and made a version for Windows. If what Windows has had all along is so great, why did they bother cloning io_uring when io_uring came along? It definitely isn't because there's any significant body of software relying on io_uring yet.
My theory: people like to extol the virtues of NT's IOCP as "checking the box" for being async, but seldom bother to assess whether the decades-old async API actually offers good performance in comparison to more recent alternatives. But some parts of Microsoft understand quite well that they have pervasive issues with IO performance: replacing WSL1 with WSL2, introducing DirectStorage and a partial copy of io_uring, the Dev Drive feature in Windows 11—all changes that are driven by the understanding that the status quo isn't all that great.
> I think you missed the point: io_uring is the newer API.
GP was pointing out that io_uring came after RIO; RIO has similar capabilities to io_uring for the network space. GP did not "miss the point".
> Microsoft understand quite well that they have pervasive issues with IO performance
This has nothing to do with IOCP but rather the file system filters. This is why removal of file system filters from "DevDrive" (a ReFS volume) improves performance vs a ReFS volume with file system filters in place.
> If what Windows has had all along is so great, why did they bother cloning io_uring when io_uring came along?
IoRing eliminates the need to copy buffers between kernel and user space [0][1].
With regards to performance, this is a bench between a traditional multithreaded UDP server and an RIO IOCP UDP server. The gains are significant. [2] And further comparisons... [3]
> seldom bother to assess whether the decades-old async API actually offers good performance in comparison to more recent alternatives
There are no complete alternatives [r]. Nor is Linux fully asynchronous throughout the kernel.
> introducing DirectStorage
Not really relevant to the context. DirectStorage is a method to deliver data from nVME storage with minimal CPU overhead and/or compressed [4]. Such a system would improve performance on any OS and indeed, we can see that on PS OS (PS5, FreeBSD-based (probably)). Again, limited in scope.
I'm honestly not sure what you're attempting to argue. IoRing brings improvements, yes, but isn't a replacement of IOCP. We should see improvements in any mainstream kernel as a good thing. It's progress. And hopefully the Linux kernel gets to full async I/O at some point in the future.
Yes, it is nearly a 1:1 copy of io_uring, but unlike io_uring which applies to a single function, all I/O in the NT kernel is asynchronous. IOCP acts on file, network, mail slot, pipes, etc.
IoRing/io_uring is for files only. RegisteredIO is network only.
While Windows userland itself may be a 'mess' and archaic in its own way (among the other anti-consumer bits), the NT kernel is technically quite advanced, not only for its time, but at the present time.
David Cutler's team got the kernel architecture correct the first time. Linux still has a ways to catch up in certain aspects.
Correct me if I'm wrong, but Microsoft's DirectStorage seems to me something like what the author is writing about. It lets you do e.g. massively parallel NVMe file IO ops on lots of small files from the GPU itself. This avoids the delay of the path through the CPU and any extra threads/saturation of the CPU, and even lets you do e.g. decompression of game assets on the GPU itself, thereby saving even more CPU. This demo benchmark shows DEFLATE going from 1 GB/s on CPU to 7 GB/s on GPU: https://github.com/microsoft/DirectStorage/tree/main/Samples...
I see, I had this notion that DirectStorage worked by somehow allowing the GPU cores to host multiple parallel threads each doing full requests on their own, but on further research I was mistaken - requests are still CPU-submitted
I recently sought to check if DirectStorage used PCIe P2P to transfer data from the SSD to the GPU memory. It definitely could use it but I didn't see any evidence that it did.
There are fundamental reasons for OS threads being slow, mostly to do with processor design. Changing the silicon would be hundreds of times more expensive than solving the problem in software in user mode.
This is a billion-dollar solution to a hundred-billion dollar problem.
There is a history of trying to support context switching in CPU hardware. That was a hot idea decades ago. Intel had call gates and some other stuff. Some of the RISC machines had hardware to spill all the registers to memory in background. Some early machines, from the days where CPUs were slower than memory, just had a context pointer in hardware. Change that and you're running a different thread.
None of this helped all that much. Vanilla hardware won out.
This is to some extent a consequence of C winning. C likes a single flat address space.
One interesting angle is Mill (note, vaporware). Their "portal" calls between security boundaries are essentially the same as their EBB function calls.
C allows casting to (void *). That implies all pointers are in the same address space. Segmented architectures, where that's not true, have mostly died out. Burroughs used to build segmented address space machines where an address was a path, something like programname/scopename/variablename[subscript]. IBM's System 38 had an "object" architecture. Both worked fine, but were not very C-compatible.
My interpretation of what the author wants is essentially lightweight threads in the kernel, standardised à la POSIX, that every proglang could use as a primitive.
That'd be sweet if this were a well-understood problem. Unfortunately, we're still finding the sweet spot between I/O-bound vs CPU-bound tasks, "everything is a file" clashing with async network APIs and mostly sync file APIs, and sending that research to the kernel would mean having improvements widely distributed in 5 years or more, and would set back the industry decades, if not centuries. We learned this much already with the history of TCP and the decision of keeping QUIC in userspace.
Well, the runtime (instead of the OS) knows better and can e.g. allocate part of the call stack on the heap itself, like how Java’s virtual threads do.
What if the author’s proposed solution is the billion dollar mistake?
IMO the best programming paradigms are when the abstractions are close to the hardware.
Instead of pretending to have unlimited cores, what if, as part of the runtime, we are given exactly one thread per core? As the programmer we are responsible for utilizing all the cores and passing data around.
It is then up to the operating system to switch entire sets of cores over different processes.
This removes the footgun of a process overloading a computer with too many threads. Programmers need to consider how to best distribute work over a finite number of cores.
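A rough Go sketch of that shape; note that Go does not actually pin goroutines to cores, so this only approximates the model, and the per-item work is a placeholder:

    package main

    import (
        "runtime"
        "sync"
    )

    func main() {
        cores := runtime.NumCPU()
        work := make(chan int, 1024)
        var wg sync.WaitGroup

        // Exactly one worker per core; the program decides how work is spread
        // across them instead of pretending cores are unlimited.
        for i := 0; i < cores; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for item := range work {
                    _ = item * item // placeholder for real per-item processing
                }
            }()
        }

        for i := 0; i < 10000; i++ {
            work <- i
        }
        close(work)
        wg.Wait()
    }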
I think async vs threads is about a completely different trade-off. Nowadays all operating systems do preemptive scheduling, but green threads (and all async, by extension) use cooperative scheduling.
I believe most of the discussion here is actually about the pros and cons of those scheduling models.
One exception is, I think, the cancellation model, but Rust is the only one I'm aware of that does it that way; all other runtimes will happily run your green thread until it finishes or cancels itself, similarly to what you do with synchronous code.
Computers are inherently asynchronous, we just plaster over that with synchronous interfaces.
We put a lot of effort into maintaining these synchronous facades - from superscalar CPUs translating assembly instructions into “actual” instructions and speculatively executing them in parallel to prevent stalls, to the kernel with preemptive scheduling, threads and their IO interfaces, right up to user-space and the APIs they provide on top of all this.
Surely there has to be a better way? It seems ridiculous.
I think comparing a programming pattern (threads) to kernel async APIs is a questionable comparison.
The point of kernel async APIs is not about letting programmers write system calls directly. It's about exposing the actual async operations under the hood (it could be disk, network, anything outside of the computer case).
Those actions are never meant to be interleaved with CPU computation, because they usually come with ms-level delays (which could be millions of CPU ticks). The kernel fakes these into sync calls by pausing everything, but that isn't always the best idea.
Letting userland programs decide what they want to do during the delay would be a far better idea, even if they eventually just reinvent blocking IO calls. They can still decide which operations are more relevant to them, instead of letting the kernel guess.
> the cost to start threads is lower, context switches are cheaper, etc.
Physics would have a word with this one. We are already pushing limits of what is possible with latency between cores vs overall system performance. There isn't an order of magnitude improvement hiding in there anywhere without some FTL communication breakthrough. In theory, yes we could sweep this problem under the rug of magical, almost-free threads. But these don't really exist.
I think the best case for performance is to accumulate mini batches of pending IO somewhere and then handle them all at once. IO models based upon ring buffers are probably getting close to the theoretical ideal when considering how our CPUs work internally (cache coherency, pipelining, etc).
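A minimal sketch of that mini-batching idea in Go (batchWrites, the batch size and the flush callback are all made up for illustration):

    package example

    import "time"

    // batchWrites accumulates pending items and flushes them together, either
    // when the batch is full or when a small time window elapses.
    func batchWrites(in <-chan []byte, flush func(batch [][]byte)) {
        const maxBatch = 64
        ticker := time.NewTicker(time.Millisecond)
        defer ticker.Stop()

        var batch [][]byte
        for {
            select {
            case item, ok := <-in:
                if !ok { // input closed: flush whatever is left and stop
                    if len(batch) > 0 {
                        flush(batch)
                    }
                    return
                }
                batch = append(batch, item)
                if len(batch) >= maxBatch {
                    flush(batch)
                    batch = nil
                }
            case <-ticker.C: // time window elapsed: flush the partial batch
                if len(batch) > 0 {
                    flush(batch)
                    batch = nil
                }
            }
        }
    }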
Is this actually proven though? I've seen other people argue that threads are already as fast as they can be, yet nobody is able to actually substantiate that claim.
It must happen at the language level when it comes to execution-context knowledge: what context to compile out (stackless), or what context to serialize to the heap (stackful). A programming language will always know much more about the program than the OS.
If I'm not mistaken, in Erlang the programmer provides the exact context to serialize: the actor.
This abstraction that the heap is somehow different from the stack hurts understanding; they're just regions of memory.
Go stacks are allocated areas of memory like anything else. When a green thread is suspended, its stack is not "serialized to the heap". Switching green threads is mostly a question of setting the stack pointer to point to the new green thread's stack.
What makes threads "green" is that they are not supposed to call blocking syscalls (they can queue work in a separate blocking threadpool and call the scheduler to suspend themselves). Go has a signal-based mechanism for non-cooperative scheduling of green threads, so green threads are not even required to cooperatively yield to the scheduler.
The author appears to contradict the very point they argue by presenting languages such as Go, Erlang, or their own toy language. These languages hide the async/await constructs that are present in languages like Rust, Swift or TypeScript. The former languages and runtimes have no function colouring problems when working within their own SDKs. There are trade-offs, and these "async" languages tend to be a bit more awkward when interacting with OS frameworks.
"Not every IO operation can be performed asynchronously though. File IO is perhaps the best example of this (at least on Linux). To handle such cases, languages must provide some sort of alternative strategy such as performing the work in a dedicated pool of OS threads."
File IO can absolutely be async on Linux [1], and this has been supported since version 2.5 or something, provided that the device supports it (which they all have since about 2000). io_uring was the syscall interface introduced in 5.1 to improve async file I/O performance, so it is relevant in that the previous way to do it wasn't great.
In-kernel file or block I/O is async if the driver/file implemented the async operations. If not, it's blocking.
io_uring is a relatively new userspace syscall API to minimize the number of syscalls used for I/O, and to provide a common mechanism for doing many kinds of operations. Previously, large numbers of syscalls could become a bottleneck for busy applications, and some kinds of operations did not have a usable non-blocking interface.
Async IO is hilarious in a universe where Hewitt's actors or even Hoare's CSP exist.
They are a subpar technique that works well with languages with semantics from the 1970s that do not have communication primitives, in the age of multicore and the Internet.
The saddest thing is the most hyped language of the decade went all in with this miserable idea, and turned me completely off the ecosystem.
I think CSP and Actors are overkill for nearly all programs. A simpler async system can do the job and introduce much less cognitive overhead. If really necessary, there are still special libraries in most ecosystems like Rx, Akka, ZIO, Hopac, lwt...
The 1990s called. They want their threads vs events debates back.
Today's processors are fast enough to serve many useful workloads with a single core. The benefit of the async abstraction outweighs the performance benefit in the majority of cases.
And debugging multithreaded code is way harder than async code, mainly if it's the kind of program that needs stepping into.
Async IO is not only about creating sockets and spawning threads. The idea of async IO is that the world is not controlled by your CPU. There are network, storage, and sound devices that might and will take time to produce the result, and the CPU has to wait for it.
I feel like there is a big misunderstanding about what async IO is and what problem it solves.
The post seems to be assuming that multi-threaded code is easy to build and maintain. From my experience it is horrible; every new thread means going from n bugs to n^n bugs. As a programmer I prefer* async constructs in languages, and do not want to spin up and manage threads and all the state synchronisation that involves.
It's not clear that context switches can be made sufficiently cheap on fast CPUs without disabling mitigations for side-channel attacks. So the idea of making OS threads comparably performant to goroutines, rust async, or any implementation of cooperative multithreading seems impractical.
In several projects I switched from async code to threads and mpsc queues. Timers, data streams, external communications all run in their threads. They pass their messages to the main thread's queue. The entire thing suddenly became much easier to reason about and read.
No, it isn't, the author is just confused, as is usually the case.
Worked great in C# since its introduction for task interleaving, composition, cancellation (worse languages call it structured concurrency) and switching to privileged context (UI thread), and will work even better in .NET 10.
I thought this article was going to talk about async IO's potential to cause insidious bugs that you don't notice until they're crashing jet planes. That's definitely a topic worthy of a blog post.
With async IO you can do stuff concurrently without doing it in parallel which is a desirable thing for many workloads. With only threads you‘d need sync primitives all over the place.
This is extremely myopic. There is not a 1:1 correspondence between using asynchronous IO and using one thread per file. Asynchronous IO lets you dodge thread-safety mechanisms like semaphores.
Not all of us are trying to write a webapp or whatever, some of us just need to load a lot of data from several descriptors without serializing all the blocking operations.
>Not every IO operation can be performed asynchronously though. File IO is perhaps the best example of this (at least on Linux). To handle such cases, languages must provide some sort of alternative strategy such as performing the work in a dedicated pool of OS threads.
Uhhh, this is just wrong: file IO can definitely be done asynchronously, on Linux, and without language support.
Synchronous IO has always been more efficient. Anyone that thought otherwise doesn't understand how complicated context switches are in CPUs. The benefit of async io has always been handling tons of idle connections.
> Anyone that thought otherwise doesn't understand how complicated context switches are in CPUs
That's not true. I understand how context switches work down to TSS records, but can't immediately see why nonblocking should be less efficient. Is it due to for-rw vs for-poll-rw? Shouldn't kqueue/IOCP solve that?
You can perform all the I/O you want with 0 context switches today in Linux. Why would performing more context switches (e.g. to call read(2)) speed things up?
You cannot reason synchronously about things that are asynchronous. Unless you want to just block on IO. Which is clearly not great as a user experience or very efficient.
> Languages such as Go and Erlang bake support for asynchronous IO directly into the language, while others such as Rust rely on third-party libraries such as Tokio.
This is so wrong. Go and Erlang have message passing, not async. Message passing is its own thing; it should not be mixed with threading or async.
Go offers channels in the standard library, which you can use to do message passing, certainly (so does Rust). But I/O through the stdlib is absolutely asynchronous. I/O routines that would block, like sending and receiving from the network, are all magic async switch points where your goroutine can be replaced on the current OS thread with another. When epoll (or whatever) signals readiness on the blocked descriptor you get switched back on. As described in the previous post, the asynchronous I/O is hidden from you as an implementation detail of the language but it's absolutely there.
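Concretely, code like the following reads as blocking, while the runtime parks each goroutine on the poller and reuses the OS thread for others (a sketch; the address and request are placeholders):

    package main

    import (
        "bufio"
        "net"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 10000; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                conn, err := net.Dial("tcp", "example.com:80")
                if err != nil {
                    return
                }
                defer conn.Close()
                conn.Write([]byte("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n"))
                // This "blocking" read parks only the goroutine; the OS thread
                // moves on to run others until the poller reports readiness.
                bufio.NewReader(conn).ReadString('\n')
            }()
        }
        wg.Wait()
    }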