Io_uring is not an event system (despairlabs.com)
281 points by ot on June 17, 2021 | 134 comments



I think this might be the best explanation I've read of why io_uring should be better than epoll since it effectively collapses the 'tell me when this is ready' with the 'do action' part. That was the really enlightening part for me.

I have to say though, the name io_uring seems unfortunate, and I think the author touches on this in the article... the name is really an implementation detail, but io_uring's true purpose is a generic asynchronous syscall facility that is currently tailored towards I/O. syscall_queue or async_queue or something else...? A descriptive API name rather than an implementation detail would probably go a long way in helping the feature be easier to understand. Even Windows' IOCP seems infinitely better named than 'uring'.


I'm still confused coz this is exactly what I always thought the difference between epoll and select was.

"what if, instead of the kernel telling us when something is ready for an action to be taken so that we can take it, we tell the kernel what action to we want to take, and it will do it when the conditions become right."

The difference between select and epoll was that select would keep checking in until the conditions were right while epoll would send you a message. That was gamechanging.

- I'm not really sure why this is seen as such a fundamental change. It's changed from the kernel triggering a callback to... a callback.


epoll: tell me when any of these descriptors are ready, then I'll issue another syscall to actually read from that descriptor into a buffer.

io_uring: when any of these descriptors are ready, read into any one of these buffers I've preallocated for you, then let me know when it is done.

Instead of waking up a process just so it can do the work of calling back into the kernel to have the kernel fill a buffer, io_uring skips that extra syscall altogether.

Taking things to the next level, io_uring allows you to chain operations together. You can tell it to read from one socket and write the results into a different socket or directly to a file, and it can do that without waking your process pointlessly at any intermediate stage.

A nearby comment also mentioned opening files, and that's cool too. You could issue an entire command sequence to io_uring, then your program can work on other stuff and check on it later, or just go to sleep until everything is done. You could tell the kernel that you want it to open a connection, write a particular buffer that you prepared for it into that connection, then open a specific file on disk, read the response into that file, close the file, then send a prepared buffer as a response to the connection, close the connection, then let you know that it is all done. You just have to prepare two buffers on the frontend, issue the commands (which could require either 1 or 0 syscalls, depending on how you're using io_uring), then do whatever you want.

You can even have numerous command sequences under kernel control in parallel, you don't have to issue them one at a time and wait on them to finish before you can issue the next one.

With epoll, you have to do every individual step along the way yourself, which involves syscalls, context switches, and potentially more code complexity. Then you realize that epoll doesn't even support file I/O, so you have to mix multiple approaches together to even approximate what io_uring is doing.

(Note: I've been looking for an excuse to use io_uring, so I've read a ton about it, but I don't have any practical experience with it yet. But everything I wrote above should be accurate.)
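
To make the chaining concrete, here's roughly what a two-step "read then write" chain looks like with liburing (a sketch I haven't run; the fds, buffer and length are made-up placeholders):

  #include <liburing.h>

  /* Sketch: read from src_fd, then write what was read to dst_fd, as one
     linked submission. The kernel only starts the write after the read
     completes; a short or failed read breaks the chain. */
  int copy_once(int src_fd, int dst_fd, char *buf, unsigned len)
  {
      struct io_uring ring;
      struct io_uring_sqe *sqe;
      struct io_uring_cqe *cqe;

      io_uring_queue_init(8, &ring, 0);

      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_read(sqe, src_fd, buf, len, 0);
      io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);   /* chain to the next SQE */

      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_write(sqe, dst_fd, buf, len, 0);

      io_uring_submit(&ring);                       /* one syscall for both ops */

      for (int i = 0; i < 2; i++) {                 /* reap both completions */
          io_uring_wait_cqe(&ring, &cqe);
          io_uring_cqe_seen(&ring, cqe);
      }
      io_uring_queue_exit(&ring);
      return 0;
  }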


If you're looking for an excuse to work on io_uring, please consider helping get it implemented and tested in your favorite event loop or I/O abstraction library. Here's some open issues and PRs:

https://github.com/golang/go/issues/31908

https://github.com/libuv/libuv/pull/2322

https://github.com/tokio-rs/mio/issues/923

https://gitlab.gnome.org/GNOME/glib/-/issues/2084

https://github.com/libevent/libevent/issues/1019


io_uring has been a game-changer for Samba IO speed.

Check out Stefan Metzmacher's talk at SambaXP 2021 (online event) for details:

https://www.youtube.com/watch?v=eYxp8yJHpik


The performance comparisons start here https://youtu.be/eYxp8yJHpik?t=1421

Looks like the bandwidth went from 3.8 GB/s to 22 GB/s, with the client being the bottleneck.


Oh, trust me… that Go issue is top of mind for me. I have the fifth comment on that issue, along with several other comments in there, and I’d love to implement it… I’m just not familiar enough with working on Go runtime internals, and motivation for volunteer work has been hard to come by for the past couple of years.

Maybe someday I’ll get it done :)


Haha nice, I just noticed that :) I think supporting someone else to help work on it and even just offering to help test and review a PR is a great and useful thing to do.


Being able to open files with io_uring is important because there is no other way to do it without an unpredictable delay. Some systems like Erlang end up using separate OS threads just to be able to open files without blocking the main interpreter thread.


All async systems use threads for files. There is no other option. Until io_uring.


That's not true. There is io_submit on Linux, which is used by some database systems. It's hard to use and requires O_DIRECT, which means you can't use the kernel file buffer, and writes and reads have to be aligned and a multiple of the alignment size (sector? page?).

io_uring is a huge improvement here.


AIO has been around forever and it gives a crappy way to do async i/o on disk files that are already open, but it doesn't give a way to asynchronously open a file. If you issue an open(2) call, your thread blocks until the open finishes, which can involve disk and network waits or whatever. To avoid the block, you have to put the open in another thread or process, or use io_uring, as far as I can tell.


io_submit is so terrible and seldom used that you are basically nitpicking for even bringing it up. It only actually works on XFS.


It is terrible, but it's worth pointing out that there has been an API for real async file IO for a while. I believe it works with ext4 as well, although maybe there are edge cases. io_uring is a big improvement.


Threads are the simplest way but I think you can also use ancillary messages on unix domain sockets, to open the files in a separate process and then send the file descriptors across.
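
For reference, the descriptor-passing half of that looks roughly like this (untested sketch; send_fd is just an illustrative name). The helper process opens the file, then ships the open fd back over a Unix domain socket, and the receiver pulls it out of the ancillary data with recvmsg():

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send an already-open file descriptor over a Unix domain socket
     using SCM_RIGHTS ancillary data. */
  int send_fd(int sock, int fd)
  {
      char dummy = 'x';
      struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
      char cbuf[CMSG_SPACE(sizeof(int))];
      struct msghdr msg = { 0 };

      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = cbuf;
      msg.msg_controllen = sizeof(cbuf);

      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0);
  }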


There's one more aspect that I don't think gets enough consideration - you can batch many operations on different file descriptors into a single syscall. For example, if epoll is tracking N sockets and tells you M need to be read, you'd make M + 1 syscalls to do it. With io_uring, you'd make 1 or 2 depending on the mode. Note how it is entirely independent of either N or M.
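
In liburing terms, that batching is just "queue M SQEs, submit once" (untested sketch; the fd and buffer arrays are placeholders):

  #include <liburing.h>

  /* Queue a receive for each of the m sockets that need reading, then
     submit them all with a single syscall. */
  void submit_reads(struct io_uring *ring, int *fds, char **bufs,
                    unsigned len, int m)
  {
      for (int i = 0; i < m; i++) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          io_uring_prep_recv(sqe, fds[i], bufs[i], len, 0);
          io_uring_sqe_set_user_data(sqe, i);   /* to match completions later */
      }
      io_uring_submit(ring);                    /* one syscall, independent of m */
  }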


What you're describing sounds awesome, I hadn't thought about being able to string syscall commands together like that. I wonder how well that will work in practice? Is there a way to be notified if one of the commands in the sequence fails like for instance the buffer wasn't large enough to write all the incoming data into?


According to a relevant manpage[0]:

> Only members inside the chain are serialized. A chain of SQEs will be broken, if any request in that chain ends in error. io_uring considers any unexpected result an error. This means that, eg, a short read will also terminate the remainder of the chain. If a chain of SQE links is broken, the remaining unstarted part of the chain will be terminated and completed with -ECANCELED as the error code.

So it sounds like you would need to decide what your strategy is. It sounds like you can inspect the step in the sequence that had the error, learn what the error was, and decide whether you want to re-issue the command that failed along with the remainder of the sequence. For a short read, you should still have access to the bytes that were read, so you're not losing information due to the error.

There is an alternative "hardlink" concept that will continue the command sequence even in the presence of an error in the previous step, like a short read, as long as the previous step was correctly submitted.

Error handling gets in the way of some of the fun, as usual, but it is important to think about.

[0]: https://manpages.debian.org/unstable/liburing-dev/io_uring_e...
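
So the completion side ends up looking something like this (untested sketch, assuming a ring whose chain has already been submitted):

  /* Drain whatever completions are available and classify them. */
  struct io_uring_cqe *cqe;
  while (io_uring_peek_cqe(&ring, &cqe) == 0) {
      if (cqe->res == -ECANCELED) {
          /* this step never ran: an earlier link in the chain failed */
      } else if (cqe->res < 0) {
          /* this step ran and failed; cqe->res is the negated errno */
      } else {
          /* success; for reads/writes cqe->res is the byte count
             (a short read still terminates the rest of the chain) */
      }
      io_uring_cqe_seen(&ring, cqe);
  }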


I'm looking at the evolution in the chaining capabilities of io_uring. Right now it's a bit basic but I'm guessing in 5 or 6 kernel versions people will have built a micro kernel or a web server just by chaining things in io_uring and maybe some custom chaining/decision blocks in ebpf :-)


BPF, you say? https://lwn.net/Articles/847951/

> The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. "What to do next" could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.


Yes exactly what I had in mind. I'm also thinking of a particular chain of syscalls [0][1][2][3] (send netlink message, setsockopt, ioctls, getsockopts, reads, then setsockopt, then send netlink message) grouped so as to be done in one sequence without ever surfacing up to userland (just fill those here buffers, who's a good boy!). So now I'm missing ioctls and getsockopts but all in good time!

[0] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[1] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[2] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[3] https://www.infradead.org/~tgr/libnl/doc/api/group__qdisc__p...


BPF is going to change so many things... At the moment I'm having lots of trouble with the tooling but hey, let's just write BPF bytecode by hand or with a macro-asm. Reduce the ambitions...


Also wondering whether we should rethink language runtimes for this. Like write everything in SPARK (so all specs are checked), target bpf bytecode through gnatllvm. OK you've written the equivalent of a cuda kernel or tbb::flow block. Now for the chaining y'all have this toolbox of task-chainers (barriers, priority queues, routers...) and you'll never even enter userland? I'm thinking /many/ programs could be described as such.


One obvious limit is that usually chaining the open op is not very straightforward as following commands tend to need the fd.

But to answer your actual question, you can link requests and either abort or continue on failure.


Yes, check the documentation for the IOSQE_IO_LINK flag to see exactly how this works.


This makes it much clearer. Thanks!


epoll is based on a "readiness" model (i.e. it tells you when you can start I/O). io_uring is based on a "completion" model (i.e. it tells you when I/O is finished). The latter is like Windows IOCP, where the C stands for Completion. Readiness models are rather useless for a local disk because, unlike with a socket, the disk is more or less always ready to receive a command.


The difference between select and epoll is just that select sets a bitmap of file descriptors, so if you are listening on 5000 sockets, you then have to scan the resulting bitmap (80 or so 64-bit ints) to find the active ones, while epoll gives you an event-like structure telling you something has happened on file descriptor N, which you can then go and check, without having to scan a bitmap. So that is more efficient if you have a lot (thousands) of active fd's. I don't remember for sure but select may actually have a limit of 1024 fd's or something like that. So epoll was preferred over select for high levels of concurrency, before io_uring arrived. But select worked ok for the purpose and epoll was simply a refinement, not a game changer. Scanning a few hundred words of memory is not that slow, as it will all be cached anyway.
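
Roughly, the difference in what userspace does after each wakeup (sketch; handle(), interest_set, maxfd and epfd stand in for whatever your program has):

  #include <sys/select.h>
  #include <sys/epoll.h>

  void handle(int fd);   /* placeholder for your per-fd work */

  /* select: the kernel copies and scans the whole fd set on every call,
     and userspace has to scan it again afterwards. */
  void wait_with_select(fd_set interest_set, int maxfd)
  {
      fd_set rfds = interest_set;               /* must be rebuilt every call */
      select(maxfd + 1, &rfds, NULL, NULL, NULL);
      for (int fd = 0; fd <= maxfd; fd++)
          if (FD_ISSET(fd, &rfds))
              handle(fd);                       /* O(maxfd) scan, even for one event */
  }

  /* epoll: the kernel keeps the interest list between calls and returns
     only the fds that actually have events. */
  void wait_with_epoll(int epfd)
  {
      struct epoll_event evs[64];
      int n = epoll_wait(epfd, evs, 64, -1);
      for (int i = 0; i < n; i++)
          handle(evs[i].data.fd);               /* proportional to ready fds only */
  }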


select, poll, epoll, are all the same model of blocking and signalling for readiness.

The problem with the former occurs with large lists of file descriptors. Calling from user to kernel, the kernel needs to copy and examine N file descriptors. When user mode comes back, it needs to scan its list of file descriptors to see what changed. That's 2 O(n) scans at every syscall, one kernel side, one user side, even if only zero or one file descriptors has an event.

epoll and kqueue make it so that the kernel persists the list of interesting file descriptors between calls, and only returns back what has actually changed, without either side needing to scan an entire list.

By contrast, the high level programming model of io_uring seems pretty similar to POSIX AIO or Windows async I/O [away from readiness and more towards "actually do the thing"], but with the innovation being a new data structure that allows reduction in syscall overhead.


Separate from the complexity highlighted here, there are a number of operations where there is no readiness mechanism available. For example: file open, file stat. These syscalls can cause significant thread blocks, for example when operating against NFS.

DNS resolution is also fiddly to do in a robust, nonblocking, async manner. A blocking DNS lookup can tie up its thread for several minutes.

With io_uring, there will be no need to use multithreading or process pools to mitigate these situations. io_uring enables a pure-async programming style that has never before been practical on a mainstream OS.


io_uring can in theory be extended to support any syscall (though it hasn't been yet). I don't believe epoll can do things like stat, opening files, closing files, or syncing though.


W/r/t a more descriptive name, I disagree—though until fairly recently I would have agreed.

I would guess that the desire for something more "descriptive" reflects the fact that you are not in the weeds with io_uring (et al), and as such a name that's tied to specifics of the terrain (io, urings) feels esoteric and unfamiliar.

However, to anyone who is an immediate consumer of io_uring or its compatriots, "io" obviously implies "syscall", but is better than "syscall", because it's more specific; since io_uring doesn't do anything other than io-related syscalls (there are other kinds of syscalls), naming it "syscall_" would make it harder for its immediate audience to remember what it does.

Similarly, "uring" will be familiar to most of the immediate audience, and is better than "queue", because it also communicates some specific features (or performance characteristics? idk, I'm also not in the weeds) of the API that the more generic "_queue" would not.

So, while I agree that the name is mildly inscrutable to us distant onlookers, I think it's the right name, and indeed reflects a wise pattern in naming concepts in complex systems. The less ambiguity you introduce at each layer of indirection or reference, the better.

I recently did a toy project that has some files named in a similar fashion: `docker/cmd`, which is what the CMD directive in my Dockerfile points at, and `systemd/start`, which is what the ExecStart line of my systemd service file points at. They're mildly inscrutable if you're unfamiliar with either docker or systemd, as they don't really say much about what they do, but this is a naming pattern that I can port to just about any project, and at the same time stop spending energy remembering a unique name for the app's entry point, or the systemd script.

Some abstract observations:

  - naming for grokkability-at-first-glance is at odds with naming for utility-over-time; the former is necessarily more ambiguous
  - naming for utility over time seems like obviously the better default naming strategy; find a nice spot in your readme for onboarding metaphors and make sure the primary consumers of your name don't have to work harder than necessary to make sense of it
  - if you find a name inscrutable, perhaps you're just missing some context


The io specificity is expected to be a temporary situation, and that part of the name may end up being an anachronism in a few years once a usefully large subset of another category of syscalls has been added to io_uring. The shared ring buffers aspect definitely is an implementation detail, but one that does explain why performance is better than other async IO methods (and also avoids misleading people into thinking that it has something to do with the async/await paradigm).

If the BSDs hadn't already claimed the name, it would probably have been fine to call this kqueue or something like that.


Thanks for your detailed thoughts on this, I am definitely not involved at all and just onlooking from the sidelines. The name does initially seem quite esoteric but I can understand why it was picked. Thinking about it more 'io' rather than 'syscall' does make sense, and Windows also does use IO in IOCP.


Now I wonder if my idea of having a muxcall or a batchcall as I thought about it a few years ago is something similar to io_uring, but on a lesser scale and without eBPF goodies.

My idea was to have a syscall like this:

  struct batchvec {
   unsigned long batchv_callnr;
   unsigned long long batchv_argmask;
  };
  asmlinkage long sys_batchcall(struct batchvec *batchv, int batchvcnt,
     long args[16], unsigned flags);
You were supposed to give it, in a batchvec, a sequence of system call numbers plus a little mapping to the arguments you provided in args. batchv_argmask is a long long (a 64-bit type); the mask is divided into 4-bit fields, and every field can address a long from the args table. AFAIR Linux syscalls have up to 6 arguments. 6 fields for arguments and one for the return value gives 7 fields, i.e. 28 bits, and now I don't remember why I thought I needed a long long.

It would go like this pseudo code:

  int i = 0;
  for (; i < batchvcnt; i++) {
    /* argmask[k] below stands for "the k-th 4-bit field of batchv_argmask",
       i.e. an index into the args table; field 6 selects the return-value slot. */
    args[batchv[i].argmask[6]] =
        sys_call_table[batchv[i].callnr](args[batchv[i].argmask[0]],
                                         args[batchv[i].argmask[1]],
                                         args[batchv[i].argmask[2]],
                                         args[batchv[i].argmask[3]],
                                         args[batchv[i].argmask[4]],
                                         args[batchv[i].argmask[5]]);
    if (args[batchv[i].argmask[6]] < 0) {
      break;
    }
  }
  return i;
It would return the number of successfully run syscalls and stop on the first failed one. The user would have to pick up the error code out of the args table.

I would be interested to know why it wouldn't work.

I started implementing it against Linux 4.15.12, but never went to test it. I have some code, but I don't believe it is my last version of the attempt.


FWIW, POSIX specifies readv and writev already; you may want to take a look at those.

It's not really the same, though; readv and writev are still synchronous APIs, they just do more at once.


I'm aware of readv/writev. This was supposed to cover much more, because it works with arbitrary syscalls. Open a file and if successful read from it - all with a single syscall. Another example: socket, setsockopt, bind, listen, accept in a single syscall.


I am hoping io_uring goes this way (maybe just call it uring). It would make people design APIs with a thought to how they could be async'd.


> io_uring's true purpose is a generic asynchronous syscall facility

Exactly. It's such an awesome design. I wonder how many system calls will be supported in the future.


Well, they have space for 255 ops :)


I've seen mixed results so far. In theory it should perform better than epoll, but I'm not sure it's quite there yet. The maintainer of uWebSockets tried it with an earlier version and it was slower.

Where it really shines is disk IO because we don't have an epoll equivalent there. I imagine it would also be great at network requests that go to or from disk in a simple way because you can chain the syscalls in theory.


The main benefit to me has been in cases that previously required a thread pool on top of epoll, it should be safe to get rid of the thread pool now and only use io_uring. Socket I/O of course doesn't really need a thread pool in a lot of cases, but disk I/O does.


I dug up the thread with the benchmarks: https://github.com/axboe/liburing/issues/189

io_uring does win on more recent versions, but it's not like it blows epoll out of the water - it's an incremental improvement.

Specifically for websockets where you have a lot of idle connections, you don't want a bunch of buffers waiting for reads for long periods of time - this is why Go doesn't perform well for websocket servers as calling Read() on a socket requires tying up both a buffer and a goroutine stack.

I haven't looked into how registering buffers works with io_uring, if it's 1-to-1 mapping or if you can pass a pool of N buffers that can be used for reads on M sockets where M > N. The details matter for that specific case.

Again where io_uring really shines is file IO because there are no good solutions there currently.


syscall_uring would be my preference.


Something I've been thinking about that maybe the HN hivemind can help with; I know enough about io_uring and GPU each to be dangerous.

The roundtrip for command buffer submission in GPU is huge by my estimation, around 100µs. On a 10TFLOPS card, which is nowhere near the top of the line, that's 1 billion operations. I don't know exactly where all the time is going, but suspect it's a bunch of process and kernel transitions between the application, the userland driver, and the kernel driver.

My understanding is that games mostly work around this by batching up a lot of work (many dozens of draw calls, for example) in one submission. But it's still a problem if CPU readback is part of the workload.

So my question is: can a technique like io_uring be used here, to keep the GPU pipeline full and only take expensive transitions when absolutely needed? I suspect the programming model will be different and in some cases harder, but that's already part of the territory with GPU.


The communication between the host and the GPU already works on a ring buffer on any modern GPU I believe.

It’s why graphics APIs are asynchronous until you synchronize f.e. by flipping a frame buffer or reading something back.

APIs like Vulkan are very explicit about this and have fences and semaphores. Older APIs will just block if you do something that requires blocking.


What's funny about io_uring is that the blocking syscall interface was always something critics of Unix pointed out as being a major shortcoming. We had OSes that did this kind of thing way back in the 1980s and 1990s, but Unix with its simplicity and generalism and free implementations took over.

Now Unix is finally, in 2021, getting a syscall queue construct where I can interact with the kernel asynchronously.


Maybe Linux will get scheduler activations in the near future, another OS feature from the 90s that ended up in Solaris and more or less nowhere else. "Let my user space thread scheduler do its work!"


We talked about adding that to Linux in the 90s too. A simple, small scheduler-hook system call that would allow userspace to cover different asynchronous I/O scheduling cases efficiently.

The sort of thing Go and Rust runtimes try to approximate in a hackish way nowadays. They would both be improved by an appropriate scheduler-activation hook.

Back then the idea didn't gain support. It needed a champion, and nobody cared enough. It seemed unnecessary, complicated. What was done instead seemed to be driven by interests that focused on one kind of task or another, e.g. networking or databases.

It doesn't help that the understandings many people have of performance around asynchronous I/O, stackless and stackful coroutines, userspace-kernel interactions, CPU-hardware interactions and so on are not particularly deep. For example I've met a few people who argued that "async-await" is the modern and faster alternative to threads in every scenario, except for needing N threads to use N CPU cores. But that is far from correct. Stackful coroutines doing blocking I/O with complex logic (such as filesystems) are lighter than async-await coroutines doing the same thing, and "heavy" fair scheduling can improve throughput and latency statistics over naive queueing.

It's exciting to see efficient userspace-kernel I/O scheduling getting attention, and getting better over the years. Kudos to the implementors.

But it's also kind of depressing that things that were on the table 20-25 years ago take this long to be evaluated. It's almost as if economics and personal situations governs progress much more than knowledge and ideas...


Actually, I think the biggest obstacle is that as cool as scheduler activations are, it turns out that not many applications are really in a position to benefit from them. The ones that can found other ways ("workarounds") to address the fact that the kernel scheduler can't know which user space thread to run. They did so because it was important to them.


    It's almost as if economics and personal situations 
    governs progress much more than knowledge and ideas...
That has always been the case and will probably always be the case.


There are already plans for a new futex-based swap_to primitive, for improving userland thread scheduling capabilities. There was some work done on it last year, but it was rejected on LKML. At this rate, it looks like it will not move forward until the new futex2 syscall is in place, since the original API is showing its age.

So, it will probably happen Soon™, but you're probably still ~2 years out before you can reliably depend on it, I'd say.


Scheduler activations don't require swap_to.

The kernel wakes up the user space scheduler when it decides to put the process onto a cpu. The user space scheduler decides which user space thread executes in the kernel thread context that it runs in, and does a user space thread switch (not a full context switch) to it. It's a combination of kernel threads and user space (aka "green") threads.


It might! I'm not sure if it's an exact fit or not, but the User Managed Concurrency Groups work[1] Google is trying to upstream with their Fibers userland-scheduling library sounds like it could be a match, and perhaps it could get the upstreaming it's seeking.

[1] https://www.phoronix.com/scan.php?page=news_item&px=Google-F...


I think some of the *BSDs have (or had) it. Linux almost got it at the turn of the millennium, with the Next Generation Posix Threading project, but then the much simpler and faster NPTL won.


Solaris had this over a decade ago now with event ports which are basically a variant of Windows IOCP:

https://web.archive.org/web/20110719052845/http://developers...

So at least one UNIX system had them a while ago.


People keep saying this but IME that's not how the Solaris Event Ports API works at all. The semantics of Solaris Event Ports is nearly identical to both epoll+family and kqueue. And like with both those others (ignoring io_uring), I/O completion is done using the POSIX AIO interface, which signals completion through the Event Port descriptor.

I've written at least two wrapper libraries for I/O readiness, POSIX signal, file event, and user-triggered event polling that encompass epoll, kqueue, and Solaris Event Ports. Supporting all three is relatively trivial from an API perspective because they work so similarly. In fact, notably all three let you poll on the epoll, kqueue, or Event Port descriptor itself. So you can have event queue trees, which is very handy when writing composable libraries.


I remember IPX had a similar communication model. To read from the network, you post an array of buffer pointers to the IPX driver and can continue to do whatever. When the buffers are filled, the driver calls your completion function.


Do you have some references?


io_uring is Linux specific. BSD offers kqueue instead, introduced in 2000.

I believe both are limited to specific operations, mostly IO, and aren't fully general asynchronous syscall interfaces.


The entire point of the article is that it is not just for IO.


> The entire point of the article is that it is not just for IO

I went through the list of operations supported by `io_uring_enter`. Almost all of them are for IO, the remainder (NOP, timeout, madvise) are useful for supporting IO, though madvise might have some non-IO uses as well. While io_uring could form the basis of a generic async syscall interface in the future, in its current state it most certainly is not.

The article mostly talks about io_uring enabling completion based IO instead of readiness based IO.

AFAIK kqueue also supports completion based IO using aio_read/aio_write together with sigevent.


> AFAIK kqueue also supports completion based IO using aio_read/aio_write together with sigevent.

If you can point to a practical example of a program doing it this way and seeing a performance benefit, I would be curious to see it. I did some googling and didn't really even find any articles mentioning this as possibility.

kqueue is widely considered to be readiness based, just like epoll, not completion based.

What you wrote sounds like an interesting hack, but I'm not sure it counts for much if it is impractical to use.


POSIX AIO uses `struct sigevent` to notify completion. On FreeBSD you can set ->sigev_notify to SIGEV_KEVENT, and ->sigev_notify_kqueue to a kqueue descriptor.

As for your "performance benefit" question and nobody using it over readiness based approaches: Well yes. No idea how it compares with a traditional readiness based socket program. But it's kind of not the point either. AFAIK the scenario where POSIX AIO makes most sense is files on disk.


AFAIK, nginx used this hack for AIO on FreeBSD for many years. It is superior to AIO on Linux. That is why Netflix used nginx together with FreeBSD.

Now nginx can use threads for file IO, and Linux file performance has become OK as well.


I don't know if it offers any practical benefits for sequential/socket IO. But AFAIK it's the way to go if you want to do async random-access/file IO.


It feels like it's been ten years since I fiddled with this on FreeBSD but EVFILT_AIO worked surprisingly well with aio and kqueue on spinning disk.


The OP said mostly IO, not only IO.


There's nothing stopping io_uring from gaining more non-IO operations, and indeed that appears to be the plan.


kqueue is not really a replacement for io_uring. That said, FreeBSD can make a POSIX AIO request signal completion to a kqueue, which is a similar model of async I/O.

But as has been pointed out elsewhere, io_uring is also about making other syscalls asynchronous.


... So it has the potential to make a lot of programs much simpler.

More efficient? Yes. Simpler? Not really. A synchronous program would be simpler, everyone who has done enough of these know it.


I’m guessing they mean that programs which did not want to block on syscalls and had to deploy workarounds can now just… do async syscalls.


Which is every program that wants to be fast, on modern computers. So...


Nonsense.

Let's say your program wants to list a directory, if it has nothing to do during that time then there is no point to using an asynchronous model, that only adds costs and complexity.


Let's say your program wants to list a directory and sort by mtime, or size. You need to stat all those files, which means reading a bunch of inodes. And your directory is on flash, so you'll definitely want to pipeline all those operations.

How do you do that without an async API? Thread pool and synchronous syscalls? That's not simpler.
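
With io_uring you can, at least in principle, queue a statx for every entry and submit them all at once instead of stat()-ing one file at a time (untested sketch; assumes the ring was created with enough entries and that the caller prepared the name and statx arrays):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <liburing.h>
  #include <sys/stat.h>

  /* Queue one statx per file name and submit the whole batch in one syscall. */
  void stat_all(struct io_uring *ring, char **names, struct statx *stx, int n)
  {
      for (int i = 0; i < n; i++) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          io_uring_prep_statx(sqe, AT_FDCWD, names[i], 0, STATX_BASIC_STATS, &stx[i]);
          io_uring_sqe_set_user_data(sqe, i);
      }
      io_uring_submit(ring);   /* the kernel can now work on all n in parallel */
  }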


At no point did I say that these APIs were not useful? I’m literally the person who explained what kind of uses would be simplified by async syscalls.

And you can keep your strawman to yourself. I objected to the statement that it would make all programs faster / simpler, the argument that it would make some simpler or faster is not something I ever argued against.


> I objected to the statement that it would make all programs faster / simpler

And I pointed out that even the simplest example you could come up with can in fact be made faster (than implementing sequentially) or simpler (than implementing with threads, which require synchronization) with an async API. So I don't see the straw man.

Pretty much anything that needs to interact with the kernel can benefit from this.


> And I pointed out that even the simplest example you could come up with can in fact be made faster (than implementing sequentially) or simpler (than implementing with threads, which require synchronization) with an async API.

No, you pointed out that a different example could be made faster (and almost certainly at the cost of simplicity, the mention of which you carefully avoided).

> So I don't see the straw man.

That doesn't surprise me.


True. But as soon as your program wants to list two directories and knows both names ahead of time, you have an opportunity to fire off both operations for the kernel to work on simultaneously.

And even if your program doesn't have any opportunity for doing IO in parallel, being able to chain a sequence of IO operations together and issue them with at most one syscall may still get you improved latency.


Yes and even clustering often-used-together syscalls... An interesting 2010 thesis https://os.itec.kit.edu/deutsch/2211.php and https://www2.cs.arizona.edu/~debray/Publications/multi-call.... for something called 'multi-calls'

Interesting times.


My read of that paragraph was that they meant existing asynchronous programs can be simplified, due to the need for fewer workarounds for the Linux I/O layer (e.g. thread pools to make disk operations appear asynchronous are no longer necessary). And I agree with that; asynchronous I/O had a lot of pitfalls on Linux until io_uring came around, making things much worse than strictly necessary.

In general I totally agree that a synchronous program will be way simpler than an equivalent asynchronous one, though.


You can use io_uring as a synchronous API too. Put a bunch of commands on the submission queue (think of it as a submission "buffer"), call io_uring_enter() with min_complete == number of commands and once the syscall returns extract results from the completion buffer^H^H^Hqueue. Voila, a perfectly synchronous batch syscall interface.

You can even choose between executing them sequentially and aborting on the first error or trying to complete as many as possible.
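
In liburing that pattern is basically a one-liner via io_uring_submit_and_wait (sketch; assumes `queued` SQEs were already prepped on `ring`):

  /* Submit everything queued and block until all of it has completed:
     effectively a synchronous batch syscall. */
  io_uring_submit_and_wait(&ring, queued);

  struct io_uring_cqe *cqe;
  for (unsigned i = 0; i < queued; i++) {
      io_uring_peek_cqe(&ring, &cqe);   /* completions are already there */
      /* ... inspect cqe->res ... */
      io_uring_cqe_seen(&ring, cqe);
  }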


IIRC the author of io_uring wants to support using it with any system call, not just the current supported subset. Not sure how these efforts are going.


Supporting exec sounds like it could be a security issue. I also have a hard time imagining why on earth you would call exec that way.


> Supporting exec sounds like it could be a security issue.

Why would it be any more of an issue than calling a blocking exec?

> why on earth you would call exec that way

The same reason as why you would want to do anything else with io_uring? In an async runtime you have to delegate blocking calls to a thread pool. Much nicer if the runtime can use the same execution system as for other IO.


If you can emulate blocking calls in user space for backward compatibility, that is probably best. But I wonder about process scheduling. Blocked threads don't get rescheduled until the kernel unblocks them. Can you tell the kernel to block a thread until an io_uring state change occurs?


> Can you tell the kernel to block a thread until an io_uring state change occurs?

The same syscall used to inform the kernel that you're submitting N new operations can also specify that it should block waiting for M completions. One of those completions could be for a timeout command that was previously submitted.


>Why would it be any more of an issue than calling a blocking exec?

Now you turn everything doing fast I/O into something that's essentially an easier rwx situation WRT memory corruption.

>Much nicer

In other words: the runtime should handle it because the kernel doesn't need to.


I really don't follow your thought process here. Can you expand on your objections?

io_uring can really be thought of as a generic asynchronous syscall interface.

It uses a kernel thread pool to run operations. If something is blocking on a kernel-level it can just be run as a blocking operation on that thread pool.


I think the concern is that if you can manage an exploit that injects data into an io_uring submission queue, you may be able to plant an exec call or similar. But such an exploit would need to rely on a well-formed malicious SQE being submitted to the kernel without being overwritten by the SQE the application intended to submit, which probably requires also gaining control over the application's pointers for where in the queue it will write the next SQE it constructs.


Meh, I'm a bit skeptical. Memory corruption of the ring buffer could make it possible to inject exec syscalls. So, like any other memory corruption? Also, io_uring already supports open/read/write which is often enough to convert to execution.


Say more about this? You're probably right but I haven't thought about it before and I'd be happy to have you do that thinking for me. :)


Start a user-provided compression background task at file close? Think tcpdump with io_uring: 'take this here buffer you've just read and send it to disk without bothering userland.'


I found this presentation to be quite informative https://www.youtube.com/watch?v=-5T4Cjw46ys


> io_uring is not an event system at all. io_uring is actually a generic asynchronous syscall facility.

I would call it a message passing system, not an asynchronous syscall facility. A syscall, even one that doesn't block indefinitely, transfers control to the kernel. io_uring, once setup, doesn't. Now that we have multiple cores, there is no reason to context switch if the kernel can handle your request on some other core on which it is perhaps already running.


It becomes a question of how expensive the message passing is between cores versus the context switch overhead - especially in a world where switching privilege levels has been made expensive by multiple categories of attack against processor speculation.

Is there a middle ground where io_uring doesn't require the mitigations but other syscalls do?


This is how I think about it as well.

Which makes me wonder if Mach wasn't right all along.


I've been intending to try process-to-process message passing using a lock-free data structure very much like io_uring where the processes share pages (one page read-write, the other page readonly, and inverse for the other process) on hardware with L2 tightly-integrated memory (SiFive FU740). The result being zero-copy message passing at L2 cache speeds. [for an embedded realtime custom OS, not for linux]


Currently serving Bandwidth Restricted page - 451.

Cached version of the write up https://archive.is/VgHkW



It seems that their server is rewriting the 451 error to a 403, which caused Archive.org to drop the page from its archives. Unfortunate...


I'm curious how this handles the case where the calling program dies or wants to cancel the request before it's actually happened.


Death: the same way it handles a program that dies with an open socket and unread data arriving. It's just part of the overall resource set of the process and has to be cleaned up when the process goes away.


The do_exit() function in kernel/exit.c is responsible for general cleanup on a process [1]. Whether a process dies gracefully or abruptly, the kernel calls do_exit() to clean up all the resources owned by the process, like opened files or acquired locks. I would imagine the io_uring related stuff is cleaned up there as well.

[1] https://elixir.bootlin.com/linux/v5.13-rc6/source/kernel/exi...

Edit: I just looked up at the latest version of the source [1]. Yes, it does clean up io_uring related files.


I don't know the answer but I would assume if you have submitted an operation to the kernel, you should assume it's in an indeterminate state until you get the result. If the program dies then the call may complete or not.

For cancellation there is an API. Example call: https://github.com/axboe/liburing/blob/c4c280f31b0e05a1ea792...


So Linux people find that the model Windows IOCP uses is better?


Yes, and now things are going full circle and Windows is adopting the ring model too.

https://windows-internals.com/i-o-rings-when-one-i-o-operati...


huh, interesting. As far as I can tell the main advantage this has over IOCP is that you can get one completion for multiple read requests.

Looks like they took a lot of the concepts from Winsock RIO and applied them to file I/O. Which is fascinating because with network traffic you can't predict packet boundaries and thus your I/O rate can be unpredictable. RIO helps you get the notification rate under control, which can help if your packet rate is very high.

With files, I would think you can control the rate at which you request data, as well as the memory you allocate for it.

The other thing it saves just like RIO is the overhead of locking/unlocking buffers by preregistering them. Is that the main reason for this API then ?

I would be very interested to hear from people who have actually run into limits with overlapped file reads and are therefore excited about IoRings


As a Linux person: Yes, and I have thought so for a long time.

This is why many of us are excited about io_uring.


I think it's long been understood that it's better for disk I/O. I'm not sure the consensus is as clear for sockets.


Yes.

Unfortunately Rust went the exact other way.


Now if they could allocate the memory for you when needed rather than having a bunch of buffers allocated when they aren't yet needed.


That's what IOSQE_BUFFER_SELECT is for.
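
For anyone who hasn't seen it: you hand the kernel a pool of buffers under a group id, then issue reads with no buffer attached and let it pick one only when data actually arrives (untested sketch; pool, BUF_SIZE, NBUFS and sockfd are placeholders):

  /* Register NBUFS buffers of BUF_SIZE bytes under buffer group 1. */
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_provide_buffers(sqe, pool, BUF_SIZE, NBUFS, 1, 0);
  io_uring_submit(&ring);   /* (reap its completion before relying on it) */

  /* Issue a read with no buffer: the kernel picks one from group 1 when data arrives. */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_recv(sqe, sockfd, NULL, BUF_SIZE, 0);
  sqe->flags |= IOSQE_BUFFER_SELECT;
  sqe->buf_group = 1;
  io_uring_submit(&ring);

  /* On completion, cqe->flags >> IORING_CQE_BUFFER_SHIFT tells you which
     buffer was used, so it can be handed back to the pool afterwards. */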


Awesome. I didn't see that when people were describing how it works. Certainly better than the Windows version.


If my process has async/await and uses epoll, wouldn't this have about the same performance as io_uring?

E.g. with io_uring, the event triggers the kernel to read into a buffer.

With async/await epoll wakes up my process which does the syscall to read the file.

In both cases you still need to read from the device and get the data to the user process?


Imagine your process instead of getting woken up to make a syscall gets woken up with a pointer to a filled data buffer.


But some process (kernel or user space) still needs to spend the computers finite resources to read it?

If the work happens in the kernel process or user process it still costs the same?


The transition from userspace to the kernel and back takes a similar amount of time to actually doing a small read from the fastest SSDs, or issuing a write to most relatively fast SSDs. So avoiding one extra syscall is a meaningful performance improvement even if you can't always eliminate a memcpy operation.


That sounds unlikely. syscall hardware overhead (entry + exit) on a modern x86 is only on the order of 100 ns which is approximately main memory access latency. I am not familiar with Linux internals or what the fastest SSDs are capable of these days, but I am fairly sure that for your statement to be true Linux would need to be adding 1 to 2 orders of magnitude in software overhead. This occurs in the context switch pathway due to scheduling decisions, but it is fairly unlikely it occurs in the syscall pathway unless they are doing something horribly wrong.


My understanding is that's the naive measurement of the cost of just the syscall operation (i.e. measuring from issue until the kernel is executing). Does this actually account for the performance loss of cache inefficiency? If I'm not mistaken, at a minimum the CPU needs to flush various caches to enter the kernel, fill them up as the kernel is executing, & then repopulate them when executing back in userspace. In that case (even if it's not a full flush), you have a hard-to-measure slowdown on the code processing the request in the kernel & in userspace after the syscall, because the locality assumptions that caches rely on are invalidated. With an io_uring model, since there are no context switches, temporal & spatial locality should provide an outsized benefit beyond just removing the syscall itself.

Additionally, as noted elsewhere, you can chain syscalls pretty deeply so that the entire operation occurs in the kernel & never schedules your process. This also benefits spatial & temporal locality AND removes the cost of needing to schedule the process in the first place.


I overstated things. Sorry.

A syscall that does literally nothing can still have a latency as observed from userspace that is vastly faster than the hardware latency of a fast SSD, though Spectre and friends have slowed this down a bit.

However, when comparing the latency as observed from userspace of syscalls that actually do significant work and cause stuff to happen, such as shepherding an IO request through the various layers to get down to the actually submitting commands to the hardware, then you usually do start talking about latencies that are on a similar order of magnitude to fast SSDs.

In particular, if you're trying to minimize IO latency, the kernel can no longer afford to put a CPU core to sleep or switch to run another task while waiting for the IRQ signalling that the SSD is done. The interrupt latency followed by the context switch back to the kernel thread that handles the IO completion is a substantial delay compared to spinning in that thread until the SSD responds.


Yeah, I would agree that is a fair take on standard kernel syscall latency. My original point was just clarifying that batching the syscalls only saves the overhead, not the shepherding and actual work of the syscall, and that overhead is actually much smaller than people generally believe.

An interesting thing about your second point is that those overhead costs are almost entirely driven by the software scheduler implementation. The actual hardware costs are quite low, and in theory it should be possible to switch meaningfully faster if you could just determine what to run next quickly, as discussed here [1].

[1] https://lkml.org/lkml/2020/7/22/1202


Yes, the syscall work still happens and the raw work has not changed. The overhead has dropped dramatically though since you've eliminated at least 2 context switches per data fill cycle(and probably more).


The answer is "go ahead and benchmark". There are some theoretical advantages to uring, like having to do less syscalls and have less userspace <-> kernelspace transitions.

In practice implementation differences can offset that advantage, or it just might not make any difference for the application at all since its not the hotspot.

I think for socket IO a variety of people did some synthetic benchmarks for epoll vs uring, and got all kinds of results from either one being a bit faster to both being roughly the same.


It also depends heavily on your kernel version. There are really wide swings.


Not necessarily, as you can use io_uring in a zero-copy way, avoiding copies of network/disk data from kernel space to user space.


Isn't locking a problem with io_uring? Won't you block the kernel when it's trying to do the event completion stuff and the completion work tries to take a lock? Or is the completion stuff done entirely in user-space and blocking is not a problem? Maybe I need to read up on this a bit more...


The submissions and completions are stored in a lock-free ring buffer, hence the name "uring."


lock-free ring buffers still have locks for the case when they are full, so it would be interesting to see how the kernel behaves when you never read from the completion ring.


As long as command submission is only permitted (or considered valid) when the submitter ensures there is sufficient space in the completion ring for the replies, writing to the completion ring can't legitimately block, and the completion writer can assume this.

This asymmetry is because the command submitter is able to block and is in control. It even works with elastic ringbuffers that can be expanded as needed. All memory allocation can be on the command submitter side.

This trick works in other domains, not just kernels. For example request-reply ringbuffer pairs can be used to communicate with interrupt handlers, in OSes and also in bare metal embedded systems, and the interrupt handler does not need to block when writing to the reply ringbuffer. It's a nice way to queue things without locking issues on the interrupt side.

Similarly with signal handlers and attached devices (e.g. NVMe storage, GPUs, network cards): those can always write to the reply ringbuffer too.


You will get an error for that upon submit. The application can then buffer the submissions somewhere else and wait for a few completions to finish.


Blocking is not a problem. However, futex ops are not supported yet so how exactly any lock op would work is not determined yet.


Very very interesting.


To me this sounds like more kernel, I was hoping for less or no kernel instead.

On my live machines the kernel is taking 30% CPU copying memory.

I'm waiting for userspace networking... file IO can be offset with async-to-async parallelism.



