"Because synchronization is almost entirely avoided, this strategy can be very fast. However, it’s not a silver bullet. Unless the workload is entirely uniform, some processors will become idle while other processors are under load, resulting in resource underutilization. This happens because tasks are pinned to a specific processor. When a bunch of tasks on a single processor are scheduled in a batch, that single processor is responsible for working through the spike even if other procesors are idle."
The trade-off here is high performance over uniform utilization, and that is a better goal than loading cores evenly. All it asks of you is a bit of thought about your load distribution, and you have plenty of thought budget left over because you no longer have to think about synchronization. Stealing work from other queues is not going to be faster: the gains get eaten by synchronization and lost locality, and it would force you to synchronize everywhere. Not a good direction to pursue.
Other parts of the Rust ecosystem have had great luck using work-stealing queues. There's a talk about how Rayon did this[1] and a good article about using it in Stylo[2], a new CSS engine written by Mozilla.
The thesis of Practical Parallel Rendering was that any useful task distribution strategy requires a work-stealing mechanism, because you leave a serious amount of responsiveness on the table if you don't.
With an infinite queue, the same number of tasks per second gets completed either way, but the delay until the last task you care about finishes can be substantial: eight 10 ms tasks pinned to one core finish after 80 ms, while spread across four cores they finish after 20 ms.
As far as I understood (though I have to admit I didn't understand much), you can use different schedulers.
I haven't had a closer look at Rust in general, async, or Tokio, but I suspect you can use the different schedulers in the same codebase. Can someone comment on this?
If that were the case, one might use a different runtime for more uniform workloads to maximize throughput, while using Tokio for general-purpose async work.
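(For reference, Tokio 1.x itself ships two scheduler flavors, a multi-threaded work-stealing one and a single-threaded current_thread one, and you can build both in the same binary. A rough sketch, assuming the tokio::runtime::Builder API:)

    use tokio::runtime::Builder;

    fn main() {
        // Work-stealing, multi-threaded scheduler for general-purpose async work.
        let general = Builder::new_multi_thread()
            .worker_threads(4)
            .enable_all()
            .build()
            .unwrap();

        // Single-threaded scheduler, e.g. for a uniform, latency-sensitive workload.
        let dedicated = Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();

        general.block_on(async {
            // spawn general-purpose tasks here
        });

        dedicated.block_on(async {
            // run the uniform workload here
        });
    }

(Futures that use Tokio's IO types generally need a Tokio reactor behind them, so mixing in a non-Tokio executor is easiest for futures that don't touch Tokio-specific IO.)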
After all this time, I'm amazed at how many people still think that cooperative multitasking is always the best strategy.
Cooperative multitasking works great as long as everyone is cooperating. Guess what? Cooperation is a voluntary act. If people are ignorant of their impact or min-maxing between five different 'top priority issues', they aren't going to stay on top of it at all times and in all things.
The cost of freedom is eternal vigilance, and the quarterly roadmap doesn't list it. Which is why we use preemption most of the time. It reduces the blast radius at the cost of some efficiency.
All this is just the preemptive versus cooperative argument wearing new hats.
Preemption and cooperation are unrelated to any of it; you can have affinity, or the lack of it, with either. This is about a misunderstanding of trade-offs, as if work stealing could somehow fix the slow, broken shared-memory multithreading model. It can't; the only way to fix it is to get away from it.
Amortized-cost algorithms are everywhere, including (or perhaps especially) in concurrency-management code. Is there a memory allocator in any threaded programming language today that doesn't use some variant of thread-local allocation? Getting two or three orders of magnitude out of a solution is nothing to sniff at.
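A toy sketch of that pattern (purely illustrative, not how any real allocator is laid out): each thread keeps a small local free list, so the common case touches no lock at all, and the shared pool is only locked once per batch.

    use std::cell::RefCell;
    use std::sync::Mutex;

    type Block = Box<[u8; 64]>;

    // Shared pool, touched rarely.
    static GLOBAL_POOL: Mutex<Vec<Block>> = Mutex::new(Vec::new());

    const BATCH: usize = 32;

    thread_local! {
        // Per-thread cache: most allocations and frees never take the lock.
        static LOCAL_CACHE: RefCell<Vec<Block>> = RefCell::new(Vec::new());
    }

    fn alloc_block() -> Block {
        LOCAL_CACHE.with(|cache| {
            let mut cache = cache.borrow_mut();
            if let Some(block) = cache.pop() {
                return block; // fast path: no synchronization at all
            }
            // Slow path: refill a whole batch under a single lock acquisition.
            let mut pool = GLOBAL_POOL.lock().unwrap();
            let keep = pool.len().saturating_sub(BATCH);
            cache.extend(pool.drain(keep..));
            drop(pool);
            cache.pop().unwrap_or_else(|| Box::new([0u8; 64]))
        })
    }

    fn free_block(block: Block) {
        LOCAL_CACHE.with(|cache| {
            let mut cache = cache.borrow_mut();
            cache.push(block);
            // Spill back to the shared pool in batches: one lock per BATCH frees.
            if cache.len() > 2 * BATCH {
                let spill: Vec<Block> = cache.drain(..BATCH).collect();
                GLOBAL_POOL.lock().unwrap().extend(spill);
            }
        })
    }

    fn main() {
        let block = alloc_block();
        free_block(block);
    }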
These queues each execute tasks sequentially, and the library provides for multiple queues to run in parallel. All of these scheduling problems and patterns begin to look fundamentally similar once you stop getting hung up on the implementation details. Separate queues have worse 'behavior' than shared queues, which have worse behavior than sequential operation. Except for the extremely real and very large problem of utilization. Avoiding utilization problems by forcing that mental work onto the callers is exactly the argument of cooperative multitasking: "you could make it work, if you were as smart as we are."
Transferring a block of tasks in one go reduces the use of concurrency primitives. But now your worst-case behavior is worse. Solution? Provide a way to avoid the worst case. Even if it's more expensive, it runs infrequently, so the cost is still amortized.
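A toy sketch of the batching idea (illustrative, not Tokio's actual local/injector queues): overflow and refill both move half the queue per lock acquisition, so the lock cost is amortized across many tasks.

    use std::collections::VecDeque;
    use std::sync::Mutex;

    type Task = Box<dyn FnOnce() + Send>;

    // Shared overflow queue, touched only in batches.
    static GLOBAL: Mutex<Vec<Task>> = Mutex::new(Vec::new());

    const LOCAL_CAP: usize = 256;

    struct Worker {
        local: VecDeque<Task>,
    }

    impl Worker {
        // When the local queue fills up, hand off half of it in one lock
        // acquisition instead of paying for the lock on every push.
        fn push(&mut self, task: Task) {
            if self.local.len() == LOCAL_CAP {
                let batch: Vec<Task> = self.local.drain(..LOCAL_CAP / 2).collect();
                GLOBAL.lock().unwrap().extend(batch);
            }
            self.local.push_back(task);
        }

        // When the local queue runs dry, pull a batch back: again one lock,
        // amortized over every task in the batch.
        fn next_task(&mut self) -> Option<Task> {
            if let Some(task) = self.local.pop_front() {
                return Some(task);
            }
            let mut global = GLOBAL.lock().unwrap();
            let keep = global.len().saturating_sub(LOCAL_CAP / 2);
            self.local.extend(global.drain(keep..));
            drop(global);
            self.local.pop_front()
        }
    }

    fn main() {
        let mut worker = Worker { local: VecDeque::new() };
        worker.push(Box::new(|| println!("hello")));
        while let Some(task) = worker.next_task() {
            task();
        }
    }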
Usually these libraries strongly recommend that your jobs work on independent data, so I don't buy the shared-memory argument. There are in fact tasking libraries in other ecosystems, like Node, that push tasks off into a separate process and still run into the same class of concerns.
You are arguing that if they put more thought into the usage, the library wouldn't need to do the work.
That's not how software works. Blaming others for not thinking hard enough is what fragile smart people do to prop up their egos, instead of getting on with the hard work of providing real solutions. Essentially all of these problems were identified in the '70s, and yet there's still plenty of R&D money in this space because this shit is hard. Being flip about it is not helpful.
It sounds like you might be arguing the relative merits of preemptive multitasking in some generic case rather than the specific domain tokio operates in.
Tokio is a lot about IO. The resources the scheduler is waiting for will almost exclusively be network and file operations.
There is a risk that a task does some long-running CPU work on the back of a future, or uses synchronous network/file operations. Maybe a preemptive scheduler could help there, but I'm not convinced by your argument that Tokio needs to optimize for this.
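For that case the usual advice is to move the blocking or CPU-heavy piece off the scheduler's worker threads yourself, e.g. with tokio::task::spawn_blocking. A small sketch (the helper functions are stand-ins just to make it self-contained):

    use tokio::task;

    #[tokio::main]
    async fn main() {
        // Cheap, IO-bound work stays on the async workers.
        let body = fetch_body().await;

        // CPU-heavy or synchronous work gets moved to the blocking pool so it
        // can't hold up the other tasks scheduled on this worker.
        let digest = task::spawn_blocking(move || expensive_hash(&body))
            .await
            .unwrap();

        println!("digest = {digest}");
    }

    // Stand-ins so the sketch compiles on its own; not real APIs.
    async fn fetch_body() -> Vec<u8> {
        vec![0u8; 1024]
    }

    fn expensive_hash(data: &[u8]) -> u64 {
        data.iter().map(|&b| b as u64).sum()
    }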
"Because synchronization is almost entirely avoided, this strategy can be very fast. However, it’s not a silver bullet. Unless the workload is entirely uniform, some processors will become idle while other processors are under load, resulting in resource underutilization. This happens because tasks are pinned to a specific processor. When a bunch of tasks on a single processor are scheduled in a batch, that single processor is responsible for working through the spike even if other procesors are idle."
The trade off of this is high performance over uniform utilization, which is a better goal than loading cores evenly. And all you have to do to make use of it is a bit of thought about your load distribution, but you also have plenty of thought budget here by not having to think about synchronization. Stealing work from other queues is not going to be faster, it's going to be wasted on synchronization and non-locality overhead and would require you to do synchronization everywhere, not a good direction to pursue.