Using io_uring for network I/O (redhat.com)
347 points by mfiguiere on April 12, 2023 | 138 comments



I’m no kernel hacker, but it seems like io_uring is almost being undersold despite the hype. “Async IO” seems like an understatement. It’s (to my meager knowledge) a general purpose syscall batching interface. There could even be more interesting use-cases (like IPC?) in the future.

The genius (and in hindsight almost obvious) idea behind io_uring seems too good to ignore, which makes me hopeful about cross-platform adoption in the long run. That would enable huge simplifications in language runtimes and low-level frameworks, lowering the barrier of entry to high performance IO across the board.


Weirdly, the low-barrier batched syscalls & eBPF user-programmable-kernel stuff sort of feel like a comeback-ish story for un-monolithic kernels/OSes.

After a decade+ of the Linus/Tanenbaum debate feeling de jure settled. The debate being anchored to either/or certainly didn't help, confined the dimensionality of what we might consider, eliminated vaster exploration of the possible. But the idea that it's all in the kernel seems improbable from the modern outlook (with some of the GPL copyleft awesomeness as a notable sacrifice, which has yet to really manifest, but IMO hugely looms).

Again and again I keep thinking we mistake popularity & rapid adoption for success. There are so many brilliant, wonderful possibilities. People & attitudes coalesce to the negative enormously fast, with much more emotion than belief spreads. Finding ways to give tech a long time to slowly grow & quietly build & thrive is a huge, huge challenge. io_uring is definitely a nice example of a small conceptual extension of what exists that can snowball & grow & metastasize until it far outshines everything we did before.

Now if only Deno, Node & others would start getting more of this shipped! That said, slow & right is what I'm defending here, soooo, OK.


This really has nothing to do with the mono vs. micro kernel argument. Also, microkernels were very popular in the research community when Linux was in its infancy; Mach and Minix are both examples of this. Linux did win on merit; future microkernel work forked off, and past microkernels are only vestigially micro.

Edit: Not sure why this is being downvoted, micro kernels are the antithesis of what eBPF is trying to accomplish. I say this as a university researcher that has built and worked on many kernels with different tradeoffs. If you disagree please explain in a comment.

Microkernels increase security and decrease complexity by running a very small resource manager at the processor's highest privilege level. eBPF runs code at the highest privilege level by actually extending the kernel; it supplants doing kernel work in userspace. It greatly increases kernel complexity with a new subsystem and decreases security in order to allow extensibility in the face of a hard-to-change interface (syscalls in a mature operating system are difficult to change).


One of the bigger shortcomings of microkernels is the IPC delay. io_uring amortizes those delays, making it less painful to implement a chatty protocol, like the conversation between ukernel processes.

EBNF is sending code instead of data, which is another common way to solve IPC problems, but with different trade offs. Being harder to prove is an important one.


Presumably you mean eBPF, not EBNF.


I appreciate what you're saying, but I feel like your argument (including downthread) is something like saying "fuel injectors have nothing to do with carburetors" or "vacuum tubes have nothing to do with transistors". It's true, but the user may end up solving similar problems with them.


Heeyyyy much more cleanly said than my attempts. Shit yeah nicely made.


> eBPF runs code at the highest privilege level by actually extending the kernel

Yeah this was my thought when reading the above argument too.

Maybe they're thinking that in future, more and more kernel functionality will be moved to eBPF programs, and eBPF has a restricted execution environment? I can follow both lines of logic.


Too easy a dismissal; it doesn't reach to explore. This very much enables a much more interesting future of smaller interoperating pieces.

For example, the DPDK work created a small alternative micro-monolith for high-speed networking. Conventionally, someone needing to make a new fast protocol like QUIC or something further afield might have been split between building a similar application-owning hardware view, building a kernel module, or facing extreme performance loss. io_uring opens up much greater fields of extensibility by removing huge barriers, allowing work to spread out far wider than it could have before.

I agree that on the surface this seems dissimilar. But architecturally, what is enabled is a far greater range of considerations for how we might interlace responsibilities on systems. io_uring on the surface is for apps to communicate with the kernel, but if systems like FUSE or uhid or other userland devices also prosper, it will disrupt the monolith's lock.


Isn't DPDK sort of an exception that proves the rule? Almost nothing gets implemented in DPDK precisely because pulling the TCP/IP stack out of the kernel is a huge logistical pain in the butt for development, and far more energy is put into making the kernel's TCP/IP interface more useful for developers.

(I'm not pushing back on you, just exploring the space, so to speak.)

I'm not sure I see how eBPF is moving us away from monolithic OS's; the whole point of eBPF is running more stuff inside the monolithic kernel. :)


Good point about eBPF. :) It is running in kernel space, but it's not code from the kernel, it's user code, which expands the model; that was kind of my thought.

DPDK did seem to be limited to pretty niche/exceptional applications. There's like a really fast Open vSwitch backend (and some other very network-centric projects). Practically a partial-unikernel system... trying to almost never use the kernel, just stay in userland with your own network stack forever. Just... in Linux.


It's still running code inside of the kernel. The point of a microkernel is to run as little in ring 0 as possible and have a very thin resource manager that can talk to system servers. The point of eBPF is to move even more stuff dynamically into the kernel to avoid expanding the kernel's static interfaces (syscalls).


Which brings us back to my previous point,

> The debate being anchored to either/or certainly didn't help, confined the dimensionality of what we might consider, eliminated vaster exploration of the possible.

eBPF doesn't seem to fit squarely in either mold. It has some characteristics of monolithic kernels, in that code runs in ring 0. But now it's not the kernel's own code that's running there. It's user code. Which doesn't fit either pattern.

I do think io_uring could potentially spawn more microkernel-y outcomes. I don't know how exactly eBPF comes into the picture, but it's another tool. Just to whip up a nonsense example, maybe we use eBPF to build a secure network switch in ring 0 that io_uring-based processes communicate with each other through. io_uring is still kernel-based, it isn't processes directly communicating with each other (not sure how many microkernels could do that), but again my point is less super-specifically about microkernels & more that this is a rebalancing possibility, where kernel code might become less the focus of the system, where other non-kernel-code platform substrates have a higher chance of emerging/performing well. There's more possible than monolithic chunks of code allowed. What new dimensions we might find out there excites me.


I think you really need to understand that the term microkernel refers to a very specific thing. Please go read the Wikipedia definition, because it does a good job of summarizing it. A microkernel handles only the basic kernel primitives at the highest privilege level and, critically, provides an RPC-like communication mechanism for authenticating and delegating everything else.

eBPF does not do that, it does not enable doing that. It does the inverse. It adds code to the kernel in the exact same way a kernel module adds code to the kernel, but in very specific places to allow extending the kernel at points of policy. It does not run kernel code in userspace. It does not run userspace code in the kernel. It loads code from userspace into the kernel and runs it as kernel code.


No, sorry, I am pretty well versed I think.

On the topic of eBPF, you keep talking about eBPF adding code to the kernel. Yes: that is NOT the microkernel model. But you keep not acknowledging that it's not code shipped by the kernel. If we make the kernel programmable, yes, it means running some code in the kernel. But tons and tons of people are using this to make really, really exceptional userland tools that take over kernel-like responsibilities, often running faster than the kernel, or with more/newer/interesting features the kernel never could offer.

Adding kernel user-programmability lets us move stuff out of the kernel. And that keeps being what we see happening.

It does allow other models too! It allows a whole raft of different things to happen. Some of the eBPF uses are to push more responsibility into the kernel, to make more stuff happen in ring 0, or to take what would have been userland & make it ring 0. It's a very general capability that breaks us out of the low-dimensionality world we've lived in. It makes much more exploration possible. Some eBPF possibilities will in some ways resemble microkernel-y things. Some will look exactly the opposite.

Right now there's no clear plan for moving most driver code into userland (the microkernel model), but I also would not rule it out, & I think there are a number of clearly visible shifts already well under way in that direction (and my tea leaves tell me increasingly so).

An example of this is some of the FUSE + eBPF work, ExtFUSE. Filesystems in userland is a recognizably microkernel-y idea, I hope we can kind of agree: let's have a driver for filesystems that runs in userland. But typically FUSE has a pretty big performance hit that can make it unviable, just too much cost of syscall'ing to the kernel. Well, what did some intrepid blokes do? They wrote an eBPF-powered FUSE system that hyper-optimized the IPC layer, ignoring a lot of the normal communication bottlenecks & inventing their own new eBPF-powered control/communication system. That let them both improve speed and grow a raft of new features for their userland driver, some of which are utterly unique and new. It allows custom permission checks, differing caching strategies, and I/O redirection to underlying FSes. All from userland. So, userland greatly, greatly grew & assumed many roles the kernel did (& some it couldn't do), by having a more malleable eBPF interface. (But yes, that still required talking to the kernel/having some new code in the kernel.) https://extfuse.github.io/

It's a bit less microkernel-y an example, but it's certainly an amazing story watching networking grow via eBPF. Yes, the kernel is doing the work, but it's such a different situation that it's not not kernel code: it's user code, running in ring 0. We can focus on that code. But notably, that code, while heavily trafficked, is often much smaller than the kernel code it replaces. The simple data-plane eBPF code replaces dozens of fairly complex kernel modules making a lot of the decisions, figuring out how to prioritize or open tunnels or do any of a dozen other deeds. And that eBPF code is all built by a new Master Control Process, a ring -1: the userland process that wrote & injected it all. This userland code doesn't have privileged access to other processes (so still a microkernel-y win!), but it sure as heck has a very powerful position over the kernel, is calling the shots/making the decisions, & programming the (again, much smaller) kernel according to its userland desires. Is there still a kernel? Yes. But it's clear the power dynamic is greatly different in this case. There are so many more isolation barriers & complex who-controls-who power relationships here that all seem so much more interesting than the monolithic kernel world, so many specific assignments of responsibilities, done with so much more isolation than before. Seems like a win. I'd still be more hesitant to push this as an overtly microkernel-y case, whereas ExtFUSE I think is more obviously in-the-mold, but reducing kernel responsibilities and giving them to userland at such huge scale, with such critical success, should make most microkernelites cheer & feel vindicated. I hope so, even if it's not exactly 100% the purism they'd hoped for.

We can keep trying to say: but oh! Gotcha! There's stuff happening in the kernel. I think the topic deserves a much broader view than that. We probably are not going to get rid of the kernel (maybe Fuchsia's Zircon microkernel, maybe Genode, maybe whoever will some day make a real break-out win though! Nothing's decided!). Giving processes much faster ways to batch-talk to others/the kernel (io_uring) and giving processes the ability to program/modify the kernel & inject some tendrils there are, I would agree, not inherently micro-kernel-y. But they open up so many possible spaces that before had felt settled. And in quite a number of examples we're seeing core responsibilities move from kernel to userland. The idea that a more IPC-centric, network-of-processes, userland-driver world emerges seems much more possible to me, thanks to io_uring and eBPF.


Microkernel-ish behavior would be features that allow for drivers/modules in userspace, which Linux does provide (and I do agree that FUSE is an example of this). eBPF is not sort of, or kind of, or remotely related to microkernels, and I'm not being pedantic. It really, really isn't related to microkernels, because it is doing the exact opposite of what a microkernel does. Pluggable policy in kernelspace is not the same as running something in userspace that implements a system function like a filesystem, a networking stack, etc., which would be analogous to FUSE, DPDK, etc.

eBPF is no different from runtime-loadable kernel modules, which have never made Linux a microkernel. Modules allow for runtime extension, but they are still running in the kernel. The point of eBPF is that kernel modules are dangerous and error-prone to write, and eBPF allows non-kernel programmers to write reasonably safe modules that can also be ported to different kernels via CO-RE. They are pluggable policy points or probes.

The FUSE developers just figured out the same thing microkernel developers figured out in the late '80s/early '90s: performance is much better if you move more processing into the kernel.


I don't feel like you read my post at all or are replying to anything I've said. You're repeating the same assertions without moving the argument along or talking details. Rather than recognize some similarity, rather than taking any kind of parallax view, rather than making friends, I feel like you've adopted a combative and quite limited stance, and I think it's a shame you're not willing to chalk up some wins where a monolithic kernel has loosened up & allowed some new microkernel or microkernel-ish offloading or sharing of responsibilities.

I provided two pretty amazing examples already of how really amazing new userland capabilities were made possible by having a configurable kernel. You keep ignoring that eBPF may be code running in the kernel, but it also enables not running a dump-truck load of kernel modules that a user would otherwise need, while being more flexible, while letting a non-privileged isolated process either have control (ring -1), as the networking Calico/Cilium example shows off, or even more radically move the entire task into userland (ExtFUSE) by using eBPF as an I/O port with the kernel to let the external process do more of the tasks than it could have before.

You seem to be very rooted in a very precise & narrow-minded view of what's afoot here. And fixated on any kernel code being disqualifying, even though the examples keep being of less & less kernel code running. I think I've really given you a ton of great evidence already, & tried to help you break free of such a narrow conception. My examples really show how eBPF has helped move a ton of work from kernel into userland. I don't think it should be so hard to communicate this, but I'll say it yet again:

programmable kernels sometimes will be programmed to do less in the kernel. That's what we've seen.

(But they can also be used to add more to the kernel too. To repeat again: eBPF is not inherently MK nor inherently monolithic; it depends on what it's used for!)

You talk about FUSE's downside, again boosterizing microkernels, but again, it's like you haven't read my post: where I mention how eBPF was used to greatly speed up FUSE & to add brand new, novel capabilities neither the kernel nor FUSE could do before, to enhance & extend FUSE. That's what I keep trying to emphasize. If you have a fixed, all-encompassing monolithic kernel, the boundaries are fixed. By introducing eBPF programmability, there's far more flexibility to calve off tasks, to pull them out of the kernel. eBPF is sometimes just a communication tool, not a processing tool, to get the data for the task out of the kernel, into userland, and that 110% qualifies as a microkernel-y thing.

Again, to your points, I said I'd be less interested in trying to claim the Cilium/Calico model is microkernel-ish, because as you say, the data plane remains in the kernel (although I think it's not hard to recognize that even though it's a different architectural pattern than MK, it's still a revolution in moving huge responsibilities out of the kernel & into userland, & getting the lion's share of MK benefits). But the ExtFUSE model shows that that's not the only thing one can do with eBPF (allow a "policy agent", which still seems like a colossal MK-ish win to me!); it shows how it can also help to move work itself outside the kernel in new, novel ways.

Rather than focusing on eBPF being something "in the kernel", I think everyone needs to step back & re-ask themselves what eBPF is. It's a malleability system. It's a way to make the kernel programmable. As I've said again and again, that isn't inherently microkernel-y. But in a huge number of examples, users do exactly that: take things the kernel would be doing and make a small eBPF shim that replaces complex kernel code with a small port, which sends the task out to userland, where userland does the work. This is quite obviously microkernel-y.


You’re arguing about score and the person you are replying to is not talking about points for or against different approaches.

The person you are replying to is being very specific about ebpf not being a microkernel approach. By definition it is the opposite.

You’re looking to debate someone on the pros/cons of dogs and cats, but you’re doing it by insisting that a golden retriever is actually a cat. Then when people call you out saying it’s very much not a cat, you lash out about the great benefits of a golden retriever as if that refutes something.


I'm kinda with rektide on this. It is true that eBPF is not itself a microkernel approach. But nobody is asserting that it is. The claim is that eBPF is a building block that allows emulating some microkernel characteristics on a monolithic kernel.

The final result is not what anyone would describe as a microkernel. But it does allow implementing things on top of a different API boundary. And microkernel vs monolithic kernel is all about where the API boundary lies. (And I don't mean the ring 0 boundary! If ring 0 is the only relevant definitional characteristic of "microkernel-ish" for you, then I agree to disagree.)

Said another way: if you consider driver-on-API-implemented-with-eBPF-on-monolithic-kernel, then looking from the bottom it will look nothing like a microkernel. Looking from the top, it will.

(rektide is not insisting that a golden retriever is actually a cat, they're saying that a golden retriever can keep the mice under control similar to how a cat might.)

But at this point, I doubt it really matters who agrees with what perspective!


As I said, I have done research work and written a PhD thesis on operating systems, and I've worked on several production kernels including Linux, and I find it hard to fathom this hill you all are trying to die on. You can't just redefine important concepts however you like, as the arguer is trying to do. Names mean things. If you want to pretend that x is y and blue is red, then we simply can't discuss the topic, because you're not able to agree on the basic particles the community uses to have discussions. It's just contrarianism.

Microkernels are defined by privilege levels. It's not debatable. You wouldn't say that something is a true microkernel because it's modular. We could have more interesting discussions about the actual topic if you and the original poster could accept common definitions like any sane technologist.


I agree that eBPF running on a monolithic kernel is not a microkernel. I agree that microkernels are defined by privilege levels. I would not say that something is a true microkernel simply because it is modular. I would agree that I am not a fully sane technologist.

I disagree that it is impossible to emulate any characteristics of running on a microkernel when one is, in fact, running on a monolithic kernel.


Using one single narrow criterion to completely disregard any & all other similarity is a shame & a sham. This is a huge disservice & actively harmful to understanding the world.

I've said again and again "microkernel-y", "microkernel-ish". But every single time I try to build a bridge & explain how it's similar but different, you burn it down. I think you have been poisoned by being too close to the subject matter & lack objective sensibility about it, are unable to adequately step back to actually understand. And you don't seem to have any interest in learning or trying to see, which is a crying shame.

I feel like you, as a deep practitioner/academic of this area, should be best able to help explain relationships & similarities, to see connections. But you focus only on making distance & setting things apart. That's just not good enough. It's not a sufficient viewpoint.


> Using one single narrow criterion to completely disregard any & all other similarity is a shame & a sham. This is a huge disservice & actively harmful to understanding the world.

Just pick a different fucking word. Why do you want to use the term microkernel?

It’s like insisting that a rock is actually an airplane because they both fly through the air. It’s a pointless redefinition, and everyone qualified to talk about things that fly won’t call it an airplane and will be confused about what you’re talking about.


It's not a redefinition when you add "-like" at the end. If I say a particular airplane flies like a rock, nobody needs to clarify that airplanes are not rocks.

But please refer back to the original comment. rektide said it was "un-monolithic", which it is. A virtual machine is less monolithic than baked-in code.

Softirq is the one who brought up microkernels specifically. Softirq is the one that repeatedly steered the conversation back to monokernel versus microkernel.

If you put words in someone's mouth repeatedly until they start using them, you lose the right to complain that they're using the wrong words!

"Why do you want to use the term microkernel?" is the exact opposite of what happened.


Not sure that is the right analogy. rektide is arguing on a conceptual level, while softirq is arguing on an academic level with a very strict definition.

If I had to use your analogy: the new breed of golden retriever somehow has many of the same looks as a cat and acts like a cat, but it is biologically still defined as a dog.


I've said repeatedly I think eBPF is neutral, not inherently one way or another.

But it certainly creates a lot of new possibilities for microkernel-like things (and other things) to be built.


It’s worth looking at the work that Martin Thompson has done with similar techniques in the JVM space with Aeron and Disruptor:

https://aeron.io/

https://lmax-exchange.github.io/disruptor/


It seems to be the same evolution that we've seen in 3D rendering APIs (or more generally: GPU APIs) over the last two decades, and roughly for the same reasons (but it goes beyond just reducing syscall overhead, it's also establishing a pipeline for sending batches of work prepared by the CPU for 'decoupled' execution on the GPU, e.g. trading latency for better throughput).


Yep, command queues are basically how drivers have worked for decades now anyway. It just happens to be more widely exposed in user-facing APIs now. If you can call them user-facing APIs. They are hard to use and get right, and nobody writes directly against them. Instead it's best done through a higher-level API (e.g. an asynchronous system call API) or even higher level, through not even knowing about it (e.g. modern game engines).

Modern caches are heavily pipelined in order to saturate links, and, as you probably know, games are 1-2 frames behind but do tens of thousands of GPU operations per frame. I guess we can even add modern CPUs to the list? E.g. certain operations cause a pipeline flush, just like some rendering operations can under certain circumstances.

I just found this article about the graphics pipeline from 2011: https://fgiesen.wordpress.com/2011/07/02/a-trip-through-the-...


I am a kernel hacker, and I have worked with io_uring, and I can safely judge that it is very good -- but the main issue is that it represents a totally different approach to I/O (and syscalls generally, which are just I/O in other words), which is going to take the ecosystem a while to reform around. Note that the sample code in the OP's article is much more complex than the traditional approach. It's also very Linux-specific, so any software which takes advantage of it will be less portable or will have to write multiple I/O backends. It's also nontrivial to understand and use effectively, so adding good io_uring support to a project is an effort.


My hope is that Linux ports of kqueue or libdispatch can use io_uring where possible and we would automatically get the benefit, but I'm not familiar enough with the guts of these projects to know how timely or even feasible that is.


interesting whitequark thread on that idea: https://twitter.com/whitequark/status/1521146654535172100


What's old becomes new again. I worked on some Windows drivers that batched calls into the kernel because of the overhead "back in the day". At the time, we were reworking the drivers to not do batching anymore because the overhead of syscalls had dropped (as part of a larger cleanup of our driver design, not a reason in and of itself).


> What's old becomes new again.

The polled mode of io_uring resembles I/O on CDC mainframes, except that the ‘kernel’ ran on dedicated coprocessors (which incidentally ran multiple tasks in a way similar to what is now called hyperthreading).


Before that, in the '60s, I/O command batching was (like most things) done by IBM: https://en.wikipedia.org/wiki/Channel_I/O#Channel_program

These days Linux supports VFIO pass-through for native IBM/360 channel programs, if you happen to be running Linux on an IBM mainframe: https://docs.kernel.org/6.1/s390/vfio-ccw.html


io_uring is not about batching. io_uring is about a new concurrency model based on ring buffers.


If it batches syscalls, could libc use it in place of the normal syscall method? You’d have to kind of deliberately nerf it by waiting on calls in spite of its async nature, but this might still bring a boost due to reduced kernel/user mode switching and other overhead.


It absolutely is a batch interface for syscalls, with the ability to sort of create little I/O-specific programs.


You can't leave me hanging!! What's the genius idea?


Imagine you have a piece of software that runs in an event loop (as many things do). On each loop, queue up all system calls you'd like to perform. At the end of the loop, do one syscall to execute the batch. At the start of the loop, check if anything has completed and continue the operation.

If you're processing a set of sockets and on any given loop N are ready, then with epoll you do N+1 syscalls. With io_uring you do 1. It's independent of N.
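For concreteness, here is a minimal sketch of that loop using liburing (my illustration, not anyone's production code; it assumes liburing is installed, omits error handling and socket setup, and a real loop would track which sockets already have a recv in flight instead of blindly re-queueing):

    #include <liburing.h>

    void event_loop(struct io_uring *ring, int *socks, int nsocks,
                    char bufs[][4096])
    {
        for (;;) {
            /* Queue one recv per socket -- no syscalls happen here. */
            for (int i = 0; i < nsocks; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                io_uring_prep_recv(sqe, socks[i], bufs[i], 4096, 0);
                io_uring_sqe_set_data64(sqe, i); /* tag the completion */
            }

            /* One syscall submits the whole batch and waits for >= 1. */
            io_uring_submit_and_wait(ring, 1);

            /* Drain all completions -- zero further syscalls. */
            struct io_uring_cqe *cqe;
            unsigned head, seen = 0;
            io_uring_for_each_cqe(ring, head, cqe) {
                /* cqe->user_data says which socket, cqe->res is
                 * the recv result; process the data here. */
                seen++;
            }
            io_uring_cq_advance(ring, seen);
        }
    }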


And the potential impact is huge! Not only are individual syscalls expensive on their own, and increasingly so (AFAIK) with Spectre and related security issues. You also need a thread to execute the call, which means several KB of memory and compounded context switches, and (often overlooked) creating and destroying threads also comes with syscall overhead.

Now, we’ve had epoll etc., so it’s not novel in that respect. However, what’s truly novel is that it’s universal across syscalls, which makes it almost mechanical to port to a new platform. A lot of intricate questions of high-level API design simply go away and become simpler data-layout questions. (I’m sure there are little devils hiding in the details, but still.)


You could use the shared ring buffers scheme to replace essentially all syscalls and similar things in any context. Think gpu drivers, memory controllers, etc. It could be a universal interface for all communication that has to go through some kind of expensive security barrier.


With cross-core interrupts and user-mode interrupt handlers (as in some new Intel CPUs), you could even do something without polling (interrupt for submission) where the core the user-mode code is running on _never_ context switches (except, obviously, for scheduling) and you just have a dedicated kernel core or cores off doing kernel things.


You can also use umwait on the next completion entry in the ring


Yup, though that means you're wasting that core's compute; something with green threads, where the language runtime does a cross-core interrupt to submit a syscall and then continues executing other green threads until it gets a user interrupt for syscall completion, would be pretty neat.

(ETA: and indeed looks like they're considering support for that in io_uring! https://lwn.net/Articles/869140/)


“General purpose syscall batching”, which drastically cuts down on context switches


One interesting thing is that io_uring can operate in different modes. One of the modes enables kernel-side polling, so that when you put data into the buffer, the kernel will pull the data out itself to do IO. That means from the application side, you can perform IO without any system calls.
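A rough sketch of setting up that mode with liburing (my illustration; the SQPOLL flag and idle timeout are real parameters, the queue depth of 256 is an arbitrary choice, and note that older kernels restrict SQPOLL to privileged processes):

    #include <liburing.h>

    static int setup_sqpoll_ring(struct io_uring *ring)
    {
        struct io_uring_params params = {0};

        /* Ask the kernel for a polling thread that picks up
         * submissions on its own; it sleeps after 2000 ms idle. */
        params.flags = IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;

        /* With SQPOLL active, io_uring_submit() normally just
         * bumps the shared SQ tail -- no syscall. liburing only
         * calls io_uring_enter() to wake the poller when the
         * kernel thread has gone idle. */
        return io_uring_queue_init_params(256, ring, &params);
    }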

Our general take was also that it has a lot of potential, but it's relatively low-level, so most mainstream programmers aren't going to pay attention to it. Hence, it'll be a while before it permeates various ecosystems.

For those of you that like to listen on the way to work, we cover io_uring on our podcast, The Technium.

https://www.youtube.com/watch?v=Ebpnd7rPpdI https://open.spotify.com/episode/3MG2FmpE3NP7AK7zqQFArE?si=s...


A few months back I tried using io_uring for some performance-critical network I/O and found it was slower than epoll. A bit sad, because epoll has a notoriously janky API and the io_uring API is much nicer. (This part is also sad for me as a Unix/Linux fanboy, because io_uring’s API is very similar to what Windows was doing 20 years ago.)


I've spent essentially the last year trying to find the best way to use io_uring for networking inside the NVMe-oF target in SPDK. Many of my initial attempts were also slower than our heavily optimized epoll version. But now I feel like I'm getting somewhere and I'm starting to see the big gains. I plan to blog a bit about the optimal way to use it, but the key concepts (see the sketch after the list) seem to be:

1) create one io_uring per thread (much like you'd create one epoll grp)

2) use the provided buffer mechanism to post a pool of large buffers to an io_uring. Bonus points for the newer ring based version.

3) keep a large (128k) async multishot recv posted to every socket in your set always

4) as recvs complete, append the next "segment" of the stream to a per-socket list.

5) parse the protocol stream. As you make it through each segment, return it to the pool*

6) aggressively batch data to be sent. You can only have one outstanding at a time per socket, so make it a big vectored write. Writes are only outstanding until they're confirmed queued in your local kernel, so it is a fairly short time until you can submit more, but it's worth batching into a single larger operation.

* If you need part of the stream to live for an extended period of time, as we do for the payloads in NVMe-oF, build scatter-gather lists that point into the segments of the stream and then maintain reference counts on the segments. Return a segment to the pool when its count drops to zero.
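A rough, untested sketch of what steps 2 and 3 look like with liburing (the buffer count, size, and group id here are arbitrary choices, and errors are unchecked):

    #include <liburing.h>
    #include <stdlib.h>

    #define GROUP_ID 0
    #define NBUFS    64
    #define BUF_SZ   (128 * 1024)

    /* Step 2: hand a pool of large buffers to the kernel via the
     * newer ring-based provided-buffer mechanism. */
    static struct io_uring_buf_ring *post_buffers(struct io_uring *ring)
    {
        int err;
        struct io_uring_buf_ring *br =
            io_uring_setup_buf_ring(ring, NBUFS, GROUP_ID, 0, &err);

        for (int i = 0; i < NBUFS; i++)
            io_uring_buf_ring_add(br, malloc(BUF_SZ), BUF_SZ, i,
                                  io_uring_buf_ring_mask(NBUFS), i);
        io_uring_buf_ring_advance(br, NBUFS);
        return br;
    }

    /* Step 3: keep one multishot recv armed on the socket. Each
     * completion consumes a buffer from the group (its id comes
     * back in cqe->flags >> IORING_CQE_BUFFER_SHIFT) and, as long
     * as IORING_CQE_F_MORE is set, the recv stays posted. */
    static void arm_recv(struct io_uring *ring, int sock)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv_multishot(sqe, sock, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = GROUP_ID;
    }

Once you're done parsing a segment (step 5), return its buffer to the pool with io_uring_buf_ring_add + io_uring_buf_ring_advance again.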

Everyone knows the best way to use epoll at this point. Few of us have really figured out io_uring. But that doesn't mean it is slower.


> Few of us have really figured out io_uring. But that doesn't mean it is slower.

seastar.io is a high level framework that I believe has "figured out" io_uring, with additional caveats the framework imposes (which is honestly freeing).

Additionally, the Rust equivalent: https://github.com/DataDog/glommio


Seastar uses DPDK, and SPDK uses it too. The OP is a maintainer on that.


For network, that is; I believe io_uring is used for disk access though.


It's also worth noting that io_uring has had at most 10-15 engineer-years' worth of performance tuning vs. the many (?) hundreds of years that epoll has received. I work with Jens, Pavel, and others and can confidently say that low-queue-depth perf parity with epoll is an important goal of the effort.

As an aside, it's great to see high praise from an spdk maintainer. One of the big reasons for doing io_uring in the first place was that it was impossible to compete in terms of performance with total bypass unless you changed the syscall approach.


I'd be very interested to read that blog post. Besides your tips for maximum performance, I'm curious about the minimum you have to do to get a significant improvement. I can easily imagine someone basically using it to poll for readiness like epoll and being disappointed. But if that's enough to benefit, I'd be surprised and intrigued. More likely you need to actually use it to enqueue the op, but folks have struggled with ownership. Is doing that in a not-quite-optimal way (extra copies on the user side) enough? Or do you need to optimize those away? Do you need to do the buffer pooling and/or multishot stuff?


Do fixed buffers help for network I/O? In August 2022 @axboe said "No benefits for fixed buffers with sockets right now, this will change at some point."


To address my own silly questions, yes, one should use the new fixed buffers described in this document: https://github.com/axboe/liburing/wiki/io_uring-and-networki...


I had a similar result with storage I/O to NVMe SSDs. io_uring was slightly slower than my optimised Linux thread pool at 4k random-access I/O at about 2.5M IOPS in my benchmarks, and this despite the syscall overhead in the thread pool version being measurable.

io_uring was only a little slower, and there are some advantages to io_uring with regard to adaptive performance (because Linux doesn't expose some information to userspace that's useful for this, so userspace has to estimate with lag - see Go's scheduler), but I was hoping it would be significantly faster. Then again it was good to have an alternative to validate the thread pool design.


IOCP was indeed ahead of its time and should have been copied by Linux a long time ago.


IOCP certainly was ahead of its time, but it only does the completion batching, not the submission batching. io_uring is significantly better than anything available on Windows right now.


> io_uring is significantly better than anything available on Windows right now.

Windows 11 copied io_uring almost 1:1 - https://windows-internals.com/ioring-vs-io_uring-a-compariso...


It comes with its own set of challenges. In the integration I've seen, it basically meant that all the latency in the system went into the io_uring_enter() call, which then blocked for far longer than any other individual IO operation we've ever seen. Your application might prefer to pause 50 times for 20us (+ syscall overhead) in an event loop iteration instead of a single time for 1ms (+ less syscall overhead), because the latter means some IO will just sit around for 1ms and go totally unhandled.

The only way to avoid big latencies on io_uring_enter is to use the submission queue polling mechanism with a background kernel thread, which has its own set of pros and cons.


This sounds abnormal, are you using io_uring_enter in a way that asks it not to return without any cqes?

I don't have much of a feel for this because I am on the "never calling io_uring_enter" plan but I expect I would have found it alarming if it took 1ms while I was using it


For many syscalls, the primary overhead is the transition itself, not the work the kernel does. So doing 50 operations one by one may take, say, 10x as much time as a single call to io_uring_enter for the same work. It really shouldn't be just moving latency around unless you are doing very large data copies (or similar) out of the kernel such that syscall overhead becomes mostly irrelevant. If syscall overhead is irrelevant in your app and you aren't doing an actual asynchronous kernel operation, then you may as well use the regular syscall interface.

There are certainly applications that don't benefit from io_uring, but I suspect these are not the norm.


You need to measure it for your application. A lot of people think "syscalls are expensive" because that's been repeated for years, but often the cost is actually in the syscall's implementation, not the transition overhead.

E.g. a UDP send syscall will do a whole lot of route lookups, iptables rule evaluations, potential eBPF program evaluations, copying data into other packets, splitting packets, etc. I measured this to be far more than 10x the raw syscall overhead. But your mileage may vary depending on which calls you use.

As for the applications: these lessons were collected in a CDN data plane. There are hardly any applications out there that are more async-IO intense.


They are the norm if you're Google; most others are low-load.


Winsock RIO, available since 2012, would beg to differ. It only applies to network I/O, but I suspect IOCP is more than sufficient for files.


Sure, io_uring is better. IOCP still captured the biggest benefit, completion triggering.


Without the submission batching you lose your system call reduction and that is far and away the biggest benefit.


The way I see it, a bigger benefit is avoiding readiness-triggered IO. That can add a variable number of syscalls, wake up threads, etc.

Of course, batching is also great.


Still waiting on io_uring in the Go stdlib: https://github.com/golang/go/issues/31908

"Unplanned"


New queuing mechanisms are generally much bigger shocks than may meet the eye, and it's unusual to see existing systems and frameworks be able to take full advantage of them immediately. In this case, the Go runtime is playing the part of a "framework"; it has ideas about how system calls work and those ideas come from a pre-io_uring world.

IIRC, the same thing happened back when epoll was new. You couldn't just swap out your select-based framework to epoll with no changes and just magically get the new performance; it required a lot of work and years for things to finally get polished up.

It always seems so easy at the high level, but these abstractions are at the very foundation of the system and the slightest change ripples everywhere. The link you show provides a great demonstration of that; the first paragraph is an optimistic "it seems feasible", then there's literally pages of "umm, well, if you think about this it's a bit more complicated than it seems" and cross-platform issues and upgrades to the kernel mechanism that solved other people's problems maybe solving these problems too and details details details.

That's not to say it won't ever happen. Like I said, epoll and other similar things all went through the same basic process, so it can be done, has been done, and will be done again (if not in Go, then in other places). It's just that it's all so much harder than it looks from 30,000 feet in the air.


Likewise for Mio (which underlies Tokio) for Rust. The odds of that happening look better, but it still seems far away.


While Mio will probably not implement uring in its current design, there's https://github.com/tokio-rs/tokio-uring if you want to use io_uring in Rust.

It's still in development, but the Tokio team seems intent on getting good io_uring support at least!

As the README states, the Rust implementation requires a kernel newer than the one that shipped with Ubuntu 20.04 so I think it'll be a while before we'll see significant development among major libraries.


Very interesting! Thanks for sharing.


We've been able to do a lot with this:

https://github.com/dshulyak/uring


why do you need io_uring from Go? It has a nicer interface than epoll but you’re not coding to its interface.


I don’t know enough to say for sure in this case, but if there’s one issue I have with Go performance, it’s the excessive syscall overhead for standard networking stuff. If you do high-bandwidth or high-concurrency work, you often waste a large fraction of time in syscalls, which in theory could be reduced greatly.

For instance, socket deadlines involve, IIRC, 2 syscalls per timeout extension, which is sold as best practice for network timeouts.


why would one use socket deadlines in Go? In Go you would typically just use Context.Deadline or a select and timer mechanism to implement timeouts - and the runtime manages those via timers integrated into the runtime. Socket timeouts are for blocking IO, which Go doesn't use.


io_uring is staggeringly fast, and still seeing massive improvements regularly. Their work in related subsystems is also yielding some fantastic gains:

https://www.phoronix.com/news/Linux-Pipes-IOCB_NOWAIT


Jens Axboe's Twitter is another great place to follow for improvements.

https://twitter.com/axboe


libuv -> linux: introduce io_uring support

https://github.com/libuv/libuv/pull/3952


Hats off for posting this 2 hours after it dropped!

I've been tracking the nest of issues with anticipation! This wasn't linked to https://github.com/libuv/libuv/pull/1947 when it posted, so I didn't see it. Very glad you linked it, thanks!

Oh wowwwww.... 608 lines. It's tiny! The last go at adding io_uring to libuv was 24k lines. https://github.com/libuv/libuv/pull/2322


Finally something happens


io_uring (and eBPF, for that matter) are still super concerning attack vectors, and we've already seen eBPF being used in the wild by bad actors. kernel_lockdown explicitly disables eBPF and other methods of kernel injection and is turned on by secure boot. In io_uring's case sharing a userspace mapping with kernel space is simply always going to be dangerous. End of story.

The jury is also out as to what the performance benefits of io_uring are and whether the benefits outweigh the risks versus sticking to what's available. I've seen a lot of good results, but I've also seen bad results, and I haven't seen a lot of real world data at scale yet.


The io_uring API is fairly narrow. It's basically the same interface as the kernel uses to talk to network cards and modern disks. Those are untrusted just like userspace is. You can look at the packet format here (section 4.1):

https://kernel.dk/io_uring.pdf

It's not hard to parse, and nowhere near the level of exposure eBPF creates.
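For a sense of scale, here's an abridged sketch of a submission queue entry (field names follow the kernel's struct io_uring_sqe, but several union'd and padding fields are omitted, so treat this as illustrative rather than the real layout):

    #include <linux/types.h>

    struct sqe_sketch {        /* abridged io_uring_sqe */
        __u8  opcode;          /* operation, e.g. IORING_OP_RECV */
        __u8  flags;           /* IOSQE_* modifier flags */
        __u16 ioprio;          /* priority */
        __s32 fd;              /* target file descriptor */
        __u64 off;             /* file offset (unioned in reality) */
        __u64 addr;            /* buffer address */
        __u32 len;             /* buffer length */
        __u32 op_flags;        /* per-op flags, e.g. recv flags */
        __u64 user_data;       /* echoed back in the completion */
    };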

If you don't trust the hardware MMU to allow the kernel to safely read buffers from userspace, then there's really no way to perform I/O in the first place. (write() already does this, for example).


The kernel talks to NICs in kernel mode and can actually segregate a device's view of memory via an IOMMU. While there's some overlap in potential vulnerabilities, bad hardware/firmware is a different vector than userland having a shared mapping active to use in exploits that read arbitrary kernel data.

io_uring is also very complex. It's now its own subsystem, has its own worker pool, and even the dance of the rings themselves, moving pointers around and using data structures that must be manipulated from both sides, is not simple and thus probably not that secure.


> In io_uring's case sharing a userspace mapping with kernel space is simply always going to be dangerous.

Out of curiosity, have you ever used a syscall which writes or reads a userspace buffer?


It's not the same thing. io_uring remaps a contiguous chunk of pages into both the kernel and userspace VAS. For read/write, the data is copied from userspace into kernelspace.


You're swapping terms around to draw distinctions where there are no real differences. There's no such thing as a "kernel page" and a "userspace page". There's only pages of memory which are mapped into one or both. All pages accessible to userspace are also mapped in the kernel. That means that the ring buffers live in perfectly ordinary, mapped-in-both-places userspace pages. There is zero basis in fact for your claims that these are "kernel pages" which have been mapped to userspace. You have taken the exact same phenomenon, pages accessible to both userspace and the kernel, and called it by a new scary name "kernel pages accessible by userspace", instead of the other safe and ordinary name, "userspace pages". Now, the addresses used for the kernel mapping may be different than normal, but (a) that is completely inconsequential to security and (b) as you point out yourself, kernel address space is inaccessible to userspace so it could not possibly matter where the kernel maps the shared pages.

Now it is in general a real security bug when the kernel operates multiple times on data mapped into userspace instead of taking a one-time copy of that data and using this copy for multiple operations. However, the whole point of a ringbuffer is that it specifically operates in such a shared environment. Moreover it will be just as necessary for the kernel to perform the snapshot copy out of an SQE (and other shared structures) before beginning a sequence of operations on them. The only difference with io_uring is when is that copy performed: at the time of a syscall or during other operations. That too is completely inconsequential to security.

To sum up: It is correct to state that the kernel must be careful with shared mappings. It also would have been correct to state, had you reached this far, that io_uring is moving the userspace-copy boundary "deeper" into the kernel rather than isolating it at the syscall layer. It is, however, incorrect to state that io_uring contains a new, more dangerous kind of shared mapping. It is incorrect to state that the shared mapping used by io_uring is itself in any way a threat to kernel security.


Let's go over what I said:

io_uring remaps a contiguous chunk of pages into both the kernel and userspace VAS.

This is a true statement: io_uring makes a compound page and calls remap into the userspace VAS in mmap, and I did not say that the pages were kernel or userspace pages. However, you've said "userspace pages" in your own argument, which by your own admonition is incorrect. You are correct in saying that pages are just pages, because a page is just a chunk of physical addresses assigned to a PFN and has no meaning in userspace or kernel space without a VMA.

There is a difference between kernel and userspace mappings, and mapping a userspace virtual address to point to direct-mapped kernel addresses that the kernel is manipulating is dangerous; many CVEs have taken advantage of these types of command buffers on other kernels.


Does the distinction between sharing VA mappings and copying buffers to/from kernel matter from a security perspective? (I assume it does, but I don't know why.)


Yes: you're looking at kernel pages through userspace virtual memory mappings, which isn't the case with copy_to_user. There you're just copying data from a userspace page to a kernel page, but only in kernel mode. You don't get to "see" kernel pages, and in fact post-Spectre/Meltdown the kernel is unmapped in userspace.


I don’t think it is fair to call eBPF an attack vector. It does happen to currently be used by exploit developers to pivot a preliminary bug into code execution, but that’s just because it’s convenient low-hanging fruit. If you take it away authors will move on to other techniques.


I learned about io_uring from King Protty in his super informative presentation: https://youtu.be/Ul8OO4vQMTw


Does anyone here know how to use this with Java? Does Java use io_uring by default?


io_uring will indeed be a perfect fit for Java: no need to pay a supplemental JNI access cost for each IO, as all you need is a memory barrier to read or write the shared queues, which can be properly implemented in pure Java. We're not there yet, but here is where you can look:

- For network I/O, Netty has an incubating transport that is promising [1].

- For disk I/O, JDK's Loom project [2] has mentioned its plan to rely on io_uring on Linux [3], but there's no ETA AFAIK.

[1] https://github.com/netty/netty-incubator-transport-io_uring

[2] https://openjdk.org/projects/loom/

[3] https://cr.openjdk.org/~rpressler/loom/loom/sol1_part1.html


I don't believe that Java currently uses it by default, although I've seen discussion about possibly using it in the future.

However, you might find these links useful:

    - I built an ultra high performance HTTP server in Java, powered by io_uring - https://old.reddit.com/r/java/comments/12f2h79/i_built_an_ultra_high_performance_http_server_in/
    - server is here - https://github.com/bbeaupain/hella-http
    - based on these bindings - https://github.com/bbeaupain/nio_uring
I wouldn't be surprised if there are others.


No languages use io_uring by default; it's pretty much a work in progress at best.


The last strace output (where the author mentions logging) made me think: are there any logging frameworks that use io_uring? Seems like a potentially good way to lower logging overhead without adding a lot of buffering to the logging framework.


Author here.

Good point. It should be easy enough for me to change my test code to use io_uring for logging as well. It will be interesting to see the results.
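As a rough sketch (a hypothetical helper, not my actual test code; the line buffer has to stay alive until the completion arrives), the core of it could be as small as:

    #include <liburing.h>
    #include <string.h>

    static void log_async(struct io_uring *ring, int logfd,
                          const char *line)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        /* Offset -1 writes at the current file position, which
         * suits an append-mode log fd. */
        io_uring_prep_write(sqe, logfd, line, strlen(line), -1);
        io_uring_submit(ring); /* or batch with other pending ops */
    }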


There is still no good support for io_uring in Rust. AFAIK none of the existing solutions fully support multishot ops, zero-copy ops with delayed buffer release, sending messages to another ring with IORING_OP_MSG_RING, or convenient op linking.


Because memory is logically shared between the kernel and the application, managing the ownership of memory in safe Rust is tricky! So unfortunately it's not as plug-and-play as one would hope.


The Tokio discord seems to have a channel for it. That might be a place to ask. I was just going to wait to see how they do with it.


I hope there is (or will be) a way to use io_uring for the sorts of purposes that the syscall boundary is currently used for. The example I have in mind is https://rr-project.org/ which uses the syscall boundary to isolate nondeterminism in order to implement record/replay debugging. I'm not sure how it can accomplish the same thing with io_uring; it would need to be informed of all updates to the shared pages or something?


I will add this to my awesome collection: https://github.com/espoal/awesome-iouring


My initial impression of io_uring a few years ago was that it involves a lot of pointer manipulation in userspace, leaving a lot of room for alignment and overflow bugs.

...or maybe I'm just being paranoid.


io_uring is really amazing work. I've followed its development on Twitter. I am in awe of Jens Axboe; I find him very easy to talk to and his work really amazing. If you don't know, he's also the man behind the fio disk benchmark utility.

Do any Linux/Unix wizards here have any inkling of if or when common tools or subsystems of the Unix/Linux stack might integrate io_uring and move away from older async frameworks?


The real killer will be when it's added as an option to iperf3.


HN’s title “fixer” takes another victim. “Why you should use io_uring for network I/O” is not semantically the same as “Use io_uring for network I/O”. You can’t just swap one out for the other.


Title fixer here. "Why you should" is a linkbait trope. We remove those. I don't see how it changed the meaning, and certainly not significantly.


Of course it changes the meaning! Titles are incredibly dense with information, and are probably orders of magnitude more important in shaping how the article is read and commented on than any sentence in the actual article. Even subtle changes will have a major impact. That's kind of why clickbait is a thing in the first place :)

"Use" is imperative; the title is outright commanding the reader to do this, with the implication that it's simply always the right choice with no tradeoffs. Since very few things in engineering come without tradeoffs, it primes the reader into a combative mood. "You don't get to tell me what to do, Red Hat! For my program, code simplicity and portability matter more than performance, we're not going to use io_uring. You're stupid for suggesting otherwise, and I'm not reading your stupid article."

A "why you should use ..." title has at least some nuance, despite being a clickbait trope. The article will be explaining the benefits of using io_uring for network IO will be useful, and the situations where those benefits actually materialize. But despite being an advocacy piece, it'll also tell you why you should not use it. Either explicitly, or by omission when the stated benefits are irrelevant to you.

The other obvious variants that would work for this article:

1. "Using ..." would promise a tutorial, a formal case study, or an informal account of somebody using it. The bulk of this article is a tutorial, so it'd be pretty reasonable.

2. "Benefits of ..." would be a straight up advocacy piece just like "use" but since it's not written as imperative, it wouldn't antagonize the reader as much.

3. "When should you use ..." would be a neutral review of the pros and cons, with emphasis on the scenarios where the tradeoffs favor each of the alternatives


I was originally going to say "using" and ironically didn't because "use" seemed closer to the original meaning. Silly me.


The title "fixing" definitely causes problems, or at least really awkward language occasionally "Stop-Motion Movies Are Animated at Aardman | WIRED" https://news.ycombinator.com/item?id=35428625

Keeping the submitted titles as-is by default and rewording with human review would be a better policy


We can't human-review everything. There's too much.

Of course the software gets it wrong sometimes (I've fixed that case now; thanks). But it does more good than harm. The trouble is that everyone takes for granted all the cases where it and/or mods get a change right. If you only count the bad cases (and you can't count what you don't notice!), any approach will seem dumb. What on earth is wrong with those admins!


Why aren't your user activity based bad content detectors sufficient? I'd rather have linkbait deranked than disguised.

In this case the title wasn't linkbait. If it was, rewriting the title without rewriting the article wouldn't fix anything for me.


I'm not sure what you mean by "user activity based bad content detectors" but if we didn't edit titles, the threads would fill up with complaints about titles—vastly more than they already do. There would be other negative differences too, such as a front page filled with sensationalism. Moreover, these effects aren't static—they produce feedback loops. The way HN edits titles, which has been the case since the beginning, is fundamental to how the site works and what it is.

"Why you should" is a linkbait trope, sorry. The use of the pronoun "you" in titles is usually already a linkbait trope because it grabs the user's attention ("hey—you!") in a way that has nothing to do with the content. Headline writers figured this out a long time ago. I'm not saying the OP did it deliberately.


> I'm not sure what you mean by "user activity based bad content detectors"

I meant that if an article leads to a poor quality discussion you already have ways to detect that and derank it after the discussion starts going bad.

> "Why you should" is a linkbait trope, sorry.

"Why you should be using io_uring" communicates to me that there will be an argument for why it's better. "Using io_uring" suggests a more neutral article.

> but if we didn't edit titles, the threads would fill up with complaints about titles

In this case I'm assuming the complaints would be that you shouldn't use io_uring because it's bad. Editing the title here only fixed the issue for people who didn't read the article then.

> The way HN edits titles, which has been the case since the beginning, is fundamental to how the site works and what it is.

Fair enough. Thanks for taking the time to explain.


Thanks for the kind reply!

That title edit was obviously not a splash-free dive so you guys clearly have a point.


Thanks for changing it. I don't know why people are giving you such pushback; personally, I think the internet could use a bit less hype in titles.

Side note: I'd pick "Using" instead of "Use".


That crossed my mind and ironically I decided to keep "use" because it felt closer to the author's original intent.

I've put an "ing" up there now. Thanks!


FWIW, the submitter of a post can edit the title even after the “fixer” runs.


I've actually edited the title back to the original title after the "fixer" changed it. But it's been later "fixed" again. Looks like there's sometimes a second pass of fixing after which edits are no longer possible.


That relies on them spotting it, caring about it, knowing that they can edit it, and not being worried about undoing an automod operation.


Yeah — I’m not suggesting it’s perfect. Just highlighting what’s possible, since I didn’t discover that capability myself until like 3 weeks ago..!


Interesting that you call them "fixers", but the point of "fixing" is that the meaning is not the same and they want to change it, right? If they were the same, why would they have the incentive to "fix" it?


It's not intended to leave it having the exact same meaning.


I guess the difference is that there’s an implied alternative?


In practice it's fast but you better hope you've validated it on a stable image with a consistent BOM. If that doesn't match your environment, beware!


BOM?


Or, don't, and have your apps still work just fine (and be portable)


Shouldn't you be able to hide that behind a library?


You could say that about any software at all. The point is, you don't need to use io_uring. Almost nobody does.


Ideally support gets built into languages & users just get a free performance boost without even thinking about it. Wouldn't that be great?

Bun has io_uring support. Deno issue: https://github.com/denoland/deno/issues/16232 . Node.js issue with prototype: https://github.com/libuv/libuv/issues/1947 .

Getting good seems interesting & compelling to me. I'm not quite sure where your dislike of this better way of doing things stems from? Even if it does mean changing how we program a bit, to actually go about adopting a better way.


> Ideally support gets built into languages [...]

Depends on how narrow you want to make your definitions of 'language'. If a library can handle things ergonomically and fast, no need to bake it into the language itself.

Eg Perl has regular expressions built in, but Python handles them via a library (arguably it's via the standard library).

More drastically: C has a few loops built in, and Haskell handles arbitrary loops via a library and you can build your own new constructs, too. See https://hackage.haskell.org/package/monad-loops-0.4.3/docs/C...

(Incidentally, that's why Haskell is one of the best imperative languages.)


What languages can we cite that have a stdlib that doesn't include either file IO or network IO (or, if I really want to make the point: syscalls)?

I can go first, actually, and get one answer out of the way: JavaScript. But it did evolve network IO, and modern JS runtimes almost all include some file/storage IO.

I'm down with a big-language/small-language discussion, but it seems basically beside the point here. I said language, but whether it's the language or libraries doesn't seem materially interesting or useful, or likely to alter the basic lay of the discussion of what consuming io_uring may well be like for most (likely: abstracted).


No, you can't necessarily say that about any software at all.

E.g. the style differences between blocking and non-blocking IO are big enough that it's hard to hide them behind a library.


And so the software that wants faster IO and is prepared to make trade-offs should just do what, ignore any potential advantages?


Not everybody writes "apps". Some write real software that has heavy I/O requirements, e.g. databases.



