I think this might be the best explanation I've read of why io_uring should be better than epoll since it effectively collapses the 'tell me when this is ready' with the 'do action' part. That was the really enlightening part for me.
I have to say though, the name io_uring seems unfortunate, and I think the author touches on this in the article... the name is really an implementation detail, but io_uring's true purpose is to be a generic asynchronous syscall facility that is currently tailored towards I/O. syscall_queue or async_queue or something else...? A descriptive API name rather than an implementation detail would probably go a long way in helping the feature be easier to understand. Even Windows' IOCP seems infinitely better named than 'uring'.
I'm still confused coz this is exactly what I always thought the difference between epoll and select was.
"what if, instead of the kernel telling us when something is ready for an action to be taken so that we can take it, we tell the kernel what action to we want to take, and it will do it when the conditions become right."
The difference between select and epoll was that select would keep checking in until the conditions were right while epoll would send you a message. That was gamechanging.
- I'm not really sure why this is seen as such a fundamental change. It's changed from the kernel triggering a callback to... a callback.
epoll: tell me when any of these descriptors are ready, then I'll issue another syscall to actually read from that descriptor into a buffer.
io_uring: when any of these descriptors are ready, read into any one of these buffers I've preallocated for you, then let me know when it is done.
Instead of waking up a process just so it can do the work of calling back into the kernel to have the kernel fill a buffer, io_uring skips that extra syscall altogether.
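Roughly, in liburing terms (just a sketch to illustrate the model; the socket fd and the buffer size here are made up):

/* Sketch only: assumes liburing (link with -luring) and an already-connected
 * socket sockfd. The kernel fills buf, and only then do we get woken up. */
#include <liburing.h>

static int read_once(int sockfd)
{
    struct io_uring ring;
    static char buf[4096];               /* preallocated buffer for the kernel to fill */

    io_uring_queue_init(8, &ring, 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);              /* one syscall: hand the request to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);      /* sleep until the read has actually completed */
    int n = cqe->res;                    /* bytes received, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return n;
}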
Taking things to the next level, io_uring allows you to chain operations together. You can tell it to read from one socket and write the results into a different socket or directly to a file, and it can do that without waking your process pointlessly at any intermediate stage.
A nearby comment also mentioned opening files, and that's cool too. You could issue an entire command sequence to io_uring, then your program can work on other stuff and check on it later, or just go to sleep until everything is done. You could tell the kernel that you want it to open a connection, write a particular buffer that you prepared for it into that connection, then open a specific file on disk, read the response into that file, close the file, then send a prepared buffer as a response to the connection, close the connection, then let you know that it is all done. You just have to prepare two buffers on the frontend, issue the commands (which could require either 1 or 0 syscalls, depending on how you're using io_uring), then do whatever you want.
You can even have numerous command sequences under kernel control in parallel, you don't have to issue them one at a time and wait on them to finish before you can issue the next one.
With epoll, you have to do every individual step along the way yourself, which involves syscalls, context switches, and potentially more code complexity. Then you realize that epoll doesn't even support file I/O, so you have to mix multiple approaches together to even approximate what io_uring is doing.
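For a flavor of how the chaining is expressed, here's a sketch based on the usual linked write-then-fsync example (not the full socket-to-file pipeline described above): each submission queue entry just gets a link flag.

/* Sketch: assumes liburing and an already-open fd. IOSQE_IO_LINK ties an SQE
 * to the one queued after it. */
#include <liburing.h>
#include <string.h>

static void write_then_fsync(struct io_uring *ring, int fd, const char *msg)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, msg, strlen(msg), 0);
    sqe->flags |= IOSQE_IO_LINK;         /* the fsync won't start until the write completes */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, 0);

    io_uring_submit(ring);               /* both operations queued with a single syscall */
    /* collect the two CQEs whenever it's convenient, or sleep until they arrive */
}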
(Note: I've been looking for an excuse to use io_uring, so I've read a ton about it, but I don't have any practical experience with it yet. But everything I wrote above should be accurate.)
If you're looking for an excuse to work on io_uring, please consider helping get it implemented and tested in your favorite event loop or I/O abstraction library. Here's some open issues and PRs:
Oh, trust me… that Go issue is top of mind for me. I have the fifth comment on that issue, along with several other comments in there, and I’d love to implement it… I’m just not familiar enough with working on Go runtime internals, and motivation for volunteer work has been hard to come by for the past couple of years.
Haha nice, I just noticed that :) I think supporting someone else to help work on it and even just offering to help test and review a PR is a great and useful thing to do.
Being able to open files with io_uring is important because there is no other way to do it without an unpredictable delay. Some systems like Erlang end up using separate OS threads just to be able to open files without blocking the main interpreter thread.
That's not true. There is io_submit on Linux, which is used by some database systems. It's hard to use and requires O_DIRECT, which means you can't use the kernel's page cache, and writes and reads have to be aligned and a multiple of the alignment size (sector? page?).
AIO has been around forever and it gives a crappy way to do async i/o on disk files that are already open, but it doesn't give a way to asynchronously open a file. If you issue an open(2) call, your thread blocks until the open finishes, which can involve disk and network waits or whatever. To avoid the block, you have to put the open in another thread or process, or use io_uring, as far as I can tell.
It is terrible, but it's worth pointing out that there has been an API for real async file IO for a while. I believe it works with ext4 as well, although maybe there are edge cases. io_uring is a big improvement.
Threads are the simplest way but I think you can also use ancillary messages on unix domain sockets, to open the files in a separate process and then send the file descriptors across.
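For reference, the fd-passing trick is the standard SCM_RIGHTS dance with sendmsg, nothing io_uring-specific (a sketch; the socket is assumed to be a unix domain socket pair with a helper process on the other end):

/* A helper process opens the file, then ships the descriptor back over a unix
 * domain socket as ancillary data. The receiver ends up with its own duplicate
 * of the fd. */
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

static int send_fd(int sock, int fd)
{
    char dummy = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0);
}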
There's one more aspect that I don't think gets enough consideration - you can batch many operations on different file descriptors into a single syscall. For example, if epoll is tracking N sockets and tells you M need to be read, you'd make M + 1 syscalls to do it. With io_uring, you'd make 1 or 2 depending on the mode. Note how it is entirely independent of either N or M.
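A sketch of what that batching looks like with liburing (fds, bufs and the buffer size are placeholders; handling of io_uring_get_sqe() returning NULL on a full queue is omitted):

/* Queue one recv per ready socket, then hand them all to the kernel with a
 * single syscall. */
#include <liburing.h>

static void submit_reads(struct io_uring *ring, int *fds, char (*bufs)[4096], int m)
{
    for (int i = 0; i < m; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, fds[i], bufs[i], sizeof(bufs[i]), 0);
        io_uring_sqe_set_data(sqe, &fds[i]);   /* so each completion can be matched up later */
    }
    io_uring_submit(ring);                     /* one syscall, no matter how large m is */
}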
What you're describing sounds awesome, I hadn't thought about being able to string syscall commands together like that. I wonder how well that will work in practice? Is there a way to be notified if one of the commands in the sequence fails like for instance the buffer wasn't large enough to write all the incoming data into?
> Only members inside the chain are serialized. A chain of SQEs will be broken, if any request in that chain ends in error. io_uring considers any unexpected result an error. This means that, eg, a short read will also terminate the remainder of the chain. If a chain of SQE links is broken, the remaining unstarted part of the chain will be terminated and completed with -ECANCELED as the error code.
So it sounds like you would need to decide what your strategy is. It sounds like you can inspect the step in the sequence that had the error, learn what the error was, and decide whether you want to re-issue the command that failed along with the remainder of the sequence. For a short read, you should still have access to the bytes that were read, so you're not losing information due to the error.
There is an alternative "hardlink" concept that will continue the command sequence even in the presence of an error in the previous step, like a short read, as long as the previous step was correctly submitted.
Error handling gets in the way of some of the fun, as usual, but it is important to think about.
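A rough sketch of what inspecting a broken chain might look like (assuming only this one chain is in flight, so its completions come back in order):

/* cqe->res carries each operation's result, or a negative errno.
 * -ECANCELED marks the steps that never ran because an earlier link failed. */
#include <liburing.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

static void check_chain(struct io_uring *ring, unsigned nsteps)
{
    for (unsigned i = 0; i < nsteps; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);

        if (cqe->res == -ECANCELED)
            printf("step %u: cancelled, an earlier link failed\n", i);
        else if (cqe->res < 0)
            printf("step %u: broke the chain: %s\n", i, strerror(-cqe->res));
        /* a short read also counts as "unexpected" and breaks the chain,
           but the bytes that were read are still in your buffer */

        io_uring_cqe_seen(ring, cqe);
    }
}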
I'm looking at the evolution in the chaining capabilities of io_uring. Right now it's a bit basic but I'm guessing in 5 or 6 kernel versions people will have built a micro kernel or a web server just by chaining things in io_uring and maybe some custom chaining/decision blocks in ebpf :-)
> The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. "What to do next" could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.
Yes exactly what I had in mind. I'm also thinking of a particular chain of syscalls [0][1][2][3] (send netlink message, setsockopt, ioctls, getsockopts, reads, then setsockopt, then send netlink message) grouped so as to be done in one sequence without ever surfacing up to userland (just fill those here buffers, who's a good boy!). So now I'm missing ioctls and getsockopts but all in good time!
BPF is going to change so many things... At the moment I'm having lots of trouble with the tooling but hey, let's just write BPF bytecode by hand or with a macro-asm. Reduce the ambitions...
Also wondering whether we should rethink language runtimes for this. Like write everything in SPARK (so all specs are checked), target bpf bytecode through gnatllvm. OK you've written the equivalent of a cuda kernel or tbb::flow block. Now for the chaining y'all have this toolbox of task-chainers (barriers, priority queues, routers...) and you'll never even enter userland? I'm thinking /many/ programs could be described as such.
epoll is based on a "readiness" model (i.e. it tells you when you can start I/O). io_uring is based on a "completion" model (i.e. it tells you when I/O has finished). The latter is like Windows IOCP, where the C stands for Completion. Readiness models are rather useless for a local disk because, unlike with a socket, the disk is more or less always ready to receive a command.
The difference between select and epoll is just that select sets a bitmap of file descriptors, so if you are listening on 5000 sockets you then have to scan the resulting bitmap (80 or so 64-bit ints) to find the active ones, while epoll gives you an event-like structure telling you something has happened on file descriptor N, which you can then go and check without having to scan a bitmap. That is more efficient if you have a lot (thousands) of active fd's. I don't remember for sure but select may actually have a limit of 1024 fd's or something like that. So epoll was preferred over select for high levels of concurrency, before io_uring arrived. But select worked OK for the purpose and epoll was simply a refinement, not a game changer. Scanning a few hundred words of memory is not that slow, as it will all be cached anyway.
select, poll, epoll, are all the same model of blocking and signalling for readiness.
The problem with the former occurs with large lists of file descriptors. Calling from user to kernel, the kernel needs to copy and examine N file descriptors. When user mode comes back, it needs to scan its list of file descriptors to see what changed. That's two O(n) scans at every syscall, one kernel side, one user side, even if only one file descriptor (or none) has an event.
epoll and kqueue make it so that the kernel persists the list of interesting file descriptors between calls, and only returns back what has actually changed, without either side needing to scan an entire list.
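The contrast in code, roughly (handle(), the interest set, epfd and maxfd are stand-ins for whatever the application already tracks):

#include <sys/select.h>
#include <sys/epoll.h>

/* select: the kernel scans the whole set on every call, and so do we. */
static void wait_with_select(fd_set interest, int maxfd, void (*handle)(int))
{
    fd_set rfds = interest;                    /* copied in and rescanned every call */
    select(maxfd + 1, &rfds, NULL, NULL, NULL);
    for (int fd = 0; fd <= maxfd; fd++)        /* O(n) scan in user space too */
        if (FD_ISSET(fd, &rfds))
            handle(fd);
}

/* epoll: the interest list lives in the kernel; only ready fds come back. */
static void wait_with_epoll(int epfd, void (*handle)(int))
{
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, -1);
    for (int i = 0; i < n; i++)                /* O(ready), not O(tracked) */
        handle(events[i].data.fd);
}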
By contrast, the high level programming model of io_uring seems pretty similar to POSIX AIO or Windows async I/O [away from readiness and more towards "actually do the thing"], but with the innovation being a new data structure that allows reduction in syscall overhead.
Separate from the complexity highlighted here, there are a number of operations for which no readiness mechanism is available at all - file open and file stat, for example. These syscalls can block the calling thread for a significant time, for example when operating against NFS.
DNS resolution is also fiddly to do in a robust, nonblocking, async manner. An in-kernel DNS resolution can block its thread for several minutes.
With io_uring, there will be no need to use multithreading or process pools to mitigate these situations. io_uring enables a pure-async programming style that has never before been practical on a mainstream OS.
io_uring can in theory be extended to cover any syscall (though it hasn't been yet). I don't believe epoll can do things like stat, opening files, closing files, or syncing, though.
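liburing already has prep helpers for several of these; for example, an open and a stat can be queued without blocking the calling thread (a sketch, error handling omitted; the path is just an example):

/* Queue an openat and a statx on the same ring. Assumes liburing. */
#include <liburing.h>
#include <fcntl.h>
#include <linux/stat.h>

static void queue_open_and_stat(struct io_uring *ring, struct statx *stx)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_openat(sqe, AT_FDCWD, "/etc/hostname", O_RDONLY, 0);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_statx(sqe, AT_FDCWD, "/etc/hostname", 0, STATX_SIZE, stx);

    io_uring_submit(ring);   /* this thread never blocks on the open or the stat,
                                even if the path is on a slow NFS mount */
}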
W/r/t a more descriptive name, I disagree—though until fairly recently I would have agreed.
I would guess that the desire for something more "descriptive" reflects the fact that you are not in the weeds with io_uring (et al), and as such a name that's tied to specifics of the terrain (io, urings) feels esoteric and unfamiliar.
However, to anyone who is an immediate consumer of io_uring or its compatriots, "io" obviously implies "syscall", but is better than "syscall" because it's more specific; since io_uring doesn't do anything other than I/O-related syscalls (there are other kinds of syscalls), naming it "syscall_" would make it harder for its immediate audience to remember what it does.
Similarly, "uring" will be familiar to most of the immediate audience, and is better than "queue", because it also communicates some specific features (or performance characteristics? idk, I'm also not in the weeds) of the API that the more generic "_queue" would not.
So, while I agree that the name is mildly inscrutable to us distant onlookers, I think it's the right name, and indeed reflects a wise pattern in naming concepts in complex systems. The less ambiguity you introduce at each layer of indirection or reference, the better.
I recently did a toy project that has some files named in a similar fashion: `docker/cmd`, which is what the CMD directive in my Dockerfile points at, and `systemd/start`, which is what the ExecStart line of my systemd service file points at. They're mildly inscrutable if you're unfamiliar with either docker or systemd, as they don't really say much about what they do, but this is a naming pattern that I can port to just about any project, and at the same time stop spending energy remembering a unique name for the app's entry point, or the systemd script.
Some abstract observations:
- naming for grokkability-at-first-glance is at odds with naming for utility-over-time; the former is necessarily more ambiguous
- naming for utility over time seems like obviously the better default naming strategy; find a nice spot in your readme for onboarding metaphors and make sure the primary consumers of your name don't have to work harder than necessary to make sense of it
- if you find a name inscrutable, perhaps you're just missing some context
The io specificity is expected to be a temporary situation, and that part of the name may end up being an anachronism in a few years once a usefully large subset of another category of syscalls has been added to io_uring. The shared ring buffers aspect definitely is an implementation detail, but one that does explain why performance is better than other async IO methods (and also avoids misleading people into thinking that it has something to do with the async/await paradigm).
If the BSDs hadn't already claimed the name, it would probably have been fine to call this kqueue or something like that.
Thanks for your detailed thoughts on this, I am definitely not involved at all and just onlooking from the sidelines. The name does initially seem quite esoteric but I can understand why it was picked. Thinking about it more 'io' rather than 'syscall' does make sense, and Windows also does use IO in IOCP.
Now I wonder if my idea of having a muxcall or a batchcall as I thought about it a few years ago is something similar to io_uring, but on a lesser scale and without eBPF goodies.
My idea was to have a syscall like this:
struct batchvec {
    unsigned long batchv_callnr;
    unsigned long long batchv_argmask;
};

asmlinkage long sys_batchcall(struct batchvec *batchv, int batchvcnt,
                              long args[16], unsigned flags);
You were supposed to pass in batchv a sequence of system call numbers, each with a little mapping to the arguments you provided in args. batchv_argmask is a long long (a 64-bit type) divided into 4-bit fields, and every field can address a long from the args table. AFAIR Linux syscalls take up to 6 arguments; 6 fields for arguments and one for the return value gives 7 fields, i.e. 28 bits, and now I don't remember why I thought I needed a long long.
It would go like this pseudo code:
int i = 0;
for (; i < batchvcnt; i++) {
    /* argmask[k] here is shorthand for "the k-th 4-bit field of
       batchv_argmask", i.e. an index into the args table */
    args[batchv[i].argmask[6]] =
        sys_call_table[batchv[i].callnr](args[batchv[i].argmask[0]],
                                         args[batchv[i].argmask[1]],
                                         args[batchv[i].argmask[2]],
                                         args[batchv[i].argmask[3]],
                                         args[batchv[i].argmask[4]],
                                         args[batchv[i].argmask[5]]);
    if (args[batchv[i].argmask[6]] < 0) {
        break;
    }
}
return i;
It would return the number of successfully run syscalls and stop on the first failed one. The user would have to pick the error code out of the args table.
I would be interested to know why it wouldn't work.
I started implementing it against Linux 4.15.12 but never got around to testing it. I have some code, though I don't believe it's the last version of the attempt.
I'm aware of readv/writev. This was supposed to cover much more, because it works with arbitrary syscalls. Open a file and, if successful, read from it - all with a single syscall. Another example: socket, setsockopt, bind, listen, accept in a single syscall.
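To make the calling convention concrete, here's how I imagined an open-then-read sequence might be encoded (using the struct batchvec above; the packing macro and the slot assignments in args are just illustrative guesses):

/* Hypothetical usage of the proposed batchcall: open a file and, if that
 * succeeds, read from it - one trip into the kernel. */
#include <fcntl.h>
#include <sys/syscall.h>

/* pack seven 4-bit indices into args[]: six arguments plus the return slot */
#define BV_MASK(a0, a1, a2, a3, a4, a5, ret) \
    ((unsigned long long)(a0) | (a1) << 4 | (a2) << 8 | (a3) << 12 | \
     (a4) << 16 | (a5) << 20 | (unsigned long long)(ret) << 24)

void open_then_read(void)
{
    static char buf[4096];

    long args[16] = {
        [0] = (long)"/etc/hostname",  /* openat: pathname                 */
        [1] = O_RDONLY,               /* openat: flags                    */
        [2] = AT_FDCWD,               /* openat: dirfd                    */
        [3] = (long)buf,              /* read: buffer                     */
        [4] = sizeof(buf),            /* read: count                      */
        /* args[5] receives openat's return value (the fd), and read then
           picks it up as its first argument; args[6] gets the byte count
           returned by read */
        [7] = 0,                      /* filler for unused argument slots */
    };

    struct batchvec bv[] = {
        { SYS_openat, BV_MASK(2, 0, 1, 7, 7, 7, 5) },
        { SYS_read,   BV_MASK(5, 3, 4, 7, 7, 7, 6) },
    };

    /* batchcall(bv, 2, args, 0) would return 2 on full success;
       not actually callable without the kernel patch */
    (void)bv; (void)args;
}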
I've seen mixed results so far. In theory it should perform better than epoll, but I'm not sure it's quite there yet. The maintainer of uWebSockets tried it with an earlier version and it was slower.
Where it really shines is disk IO because we don't have an epoll equivalent there. I imagine it would also be great at network requests that go to or from disk in a simple way because you can chain the syscalls in theory.
The main benefit to me has been in cases that previously required a thread pool on top of epoll; it should be safe to get rid of the thread pool now and use only io_uring. Socket I/O of course doesn't really need a thread pool in a lot of cases, but disk I/O does.
io_uring does win on more recent versions, but it's not like it blows epoll out of the water - it's an incremental improvement.
Specifically for websockets where you have a lot of idle connections, you don't want a bunch of buffers waiting for reads for long periods of time - this is why Go doesn't perform well for websocket servers as calling Read() on a socket requires tying up both a buffer and a goroutine stack.
I haven't looked into how registering buffers works with io_uring, if it's 1-to-1 mapping or if you can pass a pool of N buffers that can be used for reads on M sockets where M > N. The details matter for that specific case.
Again where io_uring really shines is file IO because there are no good solutions there currently.