Io_uring (nick-black.com)
289 points by sohkamyung on May 22, 2023 | 70 comments



One interesting thing is that io_uring can operate in different modes. One of the modes enables kernel-side polling, so that when you put data into the buffer, the kernel will pull the data out itself to do IO. That means from the application side, you can perform IO without any system calls.
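A minimal sketch of what turning that mode on looks like through liburing (the flag and field names below are from the io_uring_setup docs; the queue depth and idle timeout are arbitrary):

    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_params params = {0};

        /* Ask the kernel to spawn a polling thread for the submission queue.
           With SQPOLL, queued SQEs are picked up without io_uring_enter(),
           as long as the kernel thread hasn't gone idle. */
        params.flags = IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;   /* ms before the SQ thread sleeps */

        int ret = io_uring_queue_init_params(8, &ring, &params);
        if (ret < 0) {
            fprintf(stderr, "queue_init: %s\n", strerror(-ret));
            return 1;
        }

        /* ... prepare SQEs as usual; submission needs no syscall while
           the SQ thread is awake ... */

        io_uring_queue_exit(&ring);
        return 0;
    }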

Our general take was also that it has a lot of potential, but it's low-level enough that most mainstream programmers aren't going to pay attention to it. Hence, it'll be a while before it permeates through various ecosystems.

For those of you that like to listen on the way to work, we cover io_uring on our podcast, The Technium.

https://www.youtube.com/watch?v=Ebpnd7rPpdI

https://open.spotify.com/episode/3MG2FmpE3NP7AK7zqQFArE?si=s...


> One interesting thing is that io_uring can operate in different modes. One of the modes enables kernel-side polling [...]

On a related note, I recently saw this presentation[1] where they show some benchmarks of the various modes.

One gotcha of sorts, though obvious when you think about it, is that the kernel-side polling mode requires a free CPU core for the polling thread. Meaning you'll get very poor performance if you're not leaving enough CPU for the kernel to do its polling.

https://www.youtube.com/watch?v=5jKKVdJJqKY


> Our general take was also that it has a lot of potential, but is relatively low level that most mainstream programmers aren't going to pay attention to it. Hence, it'll be a while before it permeates through various ecosystems.

Isn't that what libraries are for?


> Hence, it'll be a while before it permeates through various ecosystems.

this may take a while as it's a completely different IO model

it took us 30-odd years to get from select/epoll to async/coroutines being popular


Windows has been using this mechanism for well over a decade now. It’s called IOCP (IO completion ports).


IOCP is syscall-free only on one side, i.e. completion. io_uring can be syscall-free on the submission side as well.


> well over a decade now.

3 decades.


and Linux has had POSIX async IO since 2003

(which no-one uses either because the API doesn't compose well onto existing application structures)


If you had mentioned Solaris, I’d have agreed.

But POSIX async I/O (the aio_* functions) in Linux is basically worthless performance-wise AFAIU, because Glibc implements it in userspace by spawning threads to do standard sync I/O. Now Linux also has non-POSIX async I/O (the io_* functions), but it’s very situational because it works only if you bypass the cache (O_DIRECT) and can still randomly block on metadata operations (so can Win32, to be fair).

There’s select/poll/epoll with O_NONBLOCK of course, which is what people normally use, but those do not really work with files on disk (neither do their WinSock equivalents). Hell, signal-driven IO (O_ASYNC) exists; I’ve used it to make a single-threaded emulator (CPU-bound unlike a network server) interact with the terminal.

But asynchronous I/O of normal, cached files is only possible on Linux through the use of io_uring, as far as I’ve been able to figure out.

That said, I’ve read people here saying[1] that overlapped I/O on Windows also works by scheduling operations on a thread pool, even referencing KB articles[2]. This does not mesh with everything I’ve read about I/O in the NT kernel, which is supposed to be natively async to the point where the I/O request datastructure (the IRP) has what’s essentially an emulated call stack inside of it, in order to allow the I/O subsystem to juggle continuations. What am I missing? Does the Win32 subsystem need to dumb things down that much even inside its own implementation?

(Windows 8 also introduced a ringbuffer-based, no-syscalls thing called Registered I/O that looks very much like io_uring.)

[1] https://news.ycombinator.com/item?id=11867351

[2] https://support.microsoft.com/kb/156932


> That said, I’ve read people here saying[1] that overlapped I/O on Windows also works by scheduling operations on a thread pool

The _kernel_ thread pool. Eventually, most work has to be done in an actual thread, after all.

> [2] https://support.microsoft.com/kb/156932

It's a bit misleading. What they mean is that some operations can act as barriers for further operations. E.g. async calls to ReadFile won't run until the call to WriteFile finishes (if it's writing past the end of the file).


Why do you think you can't use epoll with disk files on Linux?


Per open(2) [1], you can’t really ask the kernel to not block on regular files:

> O_NONBLOCK [...] has no effect for regular files and will (briefly) block when device activity is required, regardless of whether O_NONBLOCK is set. [...] O_NONBLOCK semantics might eventually be implemented[.]

I’m actually not sure if the reported readiness for them is of any use, but the documentation for select(2) [2] doesn’t give me a lot of hope:

> A file descriptor is ready for writing if a write operation will not block. However, even if a file descriptor indicates as writable, a large write may still block.

This for data operations; if you want open() itself to avoid spelunking through NFS or spinning up optical drives or whatnot, before io_uring you simply had no way to tell that to the kernel—you call open*() or perhaps creat(), which must give you a fd, thus must block until they can do so.

(As far as I’ve seen, tutorial documentation usually rounds this down to “you can’t do nonblocking I/O on disk files”.)

[1] https://man7.org/linux/man-pages/man2/open.2.html

[2] https://man7.org/linux/man-pages/man2/select.2.html
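For completeness, this is roughly what queueing the open() itself looks like once you do have io_uring (an untested liburing sketch; the helper name is mine):

    #include <fcntl.h>
    #include <liburing.h>

    /* Queue the open() instead of calling it synchronously. The new fd
       comes back later in the completion entry's res field. */
    int queue_open(struct io_uring *ring, const char *path)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            return -1;
        io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
        io_uring_sqe_set_data(sqe, (void *)path);  /* to tell completions apart */
        return io_uring_submit(ring);
    }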


You use io_uring to spin optical drives?


Yeah, but everything on Windows uses IOCP.


I am looking forward to io_uring support in libuv



As long as you are willing to accept the latency and/or busy looping that the kernel needs in order to poll.


Does this imply that the caller has to poll the result buffer to know when the kernel has processed the initial data?


Sort of. In the strictest sense, yes - the cq is a ring buffer (implemented with fancy atomic stuff), so you have to check if there is a completion on the queue before you read the entry. However, this doesn't need a syscall to do the polling; if more completions come in while you're processing, they will be available to you.

There's also a syscall (io_uring_enter) that will do a context switch and wake you up when completions are available (it's a complicated syscall, that has a lot of knobs and switches and levers - just be ready for a LOT of information if you go read the man page).
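Roughly, the two styles look like this with liburing (a sketch; only the io_uring_* calls are real, the surrounding function names are made up):

    #include <liburing.h>

    /* Polling style: walk the shared-memory CQ ring, no syscall involved. */
    void drain_completions(struct io_uring *ring)
    {
        struct io_uring_cqe *cqe;
        unsigned head, count = 0;

        io_uring_for_each_cqe(ring, head, cqe) {
            /* handle cqe->res / cqe->user_data here */
            count++;
        }
        io_uring_cq_advance(ring, count);
    }

    /* Blocking style: sleep in the kernel until a completion is available. */
    int wait_for_completion(struct io_uring *ring)
    {
        struct io_uring_cqe *cqe;
        int ret = io_uring_wait_cqe(ring, &cqe);  /* io_uring_enter under the hood */
        if (ret < 0)
            return ret;
        /* handle cqe */
        io_uring_cqe_seen(ring, cqe);
        return 0;
    }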


Can you use the kernel polling mode w/o running as root?


The already insanely fast (probably the fastest) http/websocket server uWebSockets is seeing a 40% improvement with io_uring in initial tests!

https://github.com/uNetworking/uWebSockets/issues/1603#issue...


While impressive, that is a project that is basically perfectly suited to gain the most from this kind of optimization. The fact that it was already incredibly fast means there is less low-hanging fruit left for performance, and other projects would be less likely to see wins as big (they're more likely to already have bigger bottlenecks elsewhere).


That’s really promising! uSockets/uWebSockets is an almost perfect use case for io_uring. That library is also insanely performant and well written. I ran some benchmarks mostly for low memory overhead and it’s much better than even Go std, nginx+nchan etc etc.


I recently tinkered around with io_uring for a presentation. Here is a toy echo server I made with it: https://gist.github.com/martinjacobd/be50a93744b94749339afe8... (non io_uring example for comparison: https://gist.github.com/martinjacobd/feea261d2fafe5e7332e37d...)

A big todo I have for this is to modify it to use multishot_accept, but I haven't been able to get a recent enough kernel configured properly to do it. (Only >6.0 kernels support multishot accept).

(edit) you need to s/multishot_accept/recv_multishot/g above :)
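For reference, multishot accept ends up looking something like this through liburing (my own untested sketch, not the gist above; multishot accept needs kernel 5.19+, recv multishot needs 6.0+):

    #include <liburing.h>
    #include <stddef.h>

    /* One SQE that keeps producing a CQE per accepted connection until it is
       cancelled or fails; cqe->res holds the new fd on each completion. */
    void queue_multishot_accept(struct io_uring *ring, int listen_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
        io_uring_sqe_set_data(sqe, NULL);  /* tag however your event loop expects */
    }

    /* When reaping: if IORING_CQE_F_MORE is not set in cqe->flags, the
       multishot request has terminated and must be re-armed. */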


I did an echo server with multi-shot here (using the io_uring rust bindings) https://github.com/thinkharderdev/io-uring/blob/ring-mapped-.... My biggest issue with io_uring is figuring out what is available in which kernel version :)


I meant recv_multishot not multishot_accept. The one I linked does use multishot_accept. Just a think-o. Looks like your version uses both though, so that's cool.


this might help: https://sourcedigger.io


Oh that's awesome! Thanks!


I might have to use your code again; I used your just-in-time compilation code to execute generated machine code.

Thank you so much Martin Jacob!


Absolutely my pleasure. I tracked down the problem you commented about: I used a later version of liburing than ships with Ubuntu, so if you want to compile this as-is, you'll probably have to compile and install liburing from source. Glad you enjoyed it!


I'll be sure to follow everything you work on, because you work on self-contained, easy-to-understand code that shows APIs effectively.

The things you work on are foundations to extremely useful systems.

(My JIT compiler is based on Martin Jacob's JIT code https://github.com/samsquire/compiler )

I would love to change my epoll server to use liburing, and if I feel confident enough I'll try to integrate your code into it.

https://github.com/samsquire/epoll-server


Could you also do a UDP-based server? It’d be interesting.


I don't see why not. You'd open the socket on line 124 as a SOCK_DGRAM, 0 rather than what it is now for TCP. For an echo server, for instance, you wouldn't listen on the socket or accept requests; instead, you'd use io_uring_prep_recvmsg and use the prepended msghdr to send the identical packet back out with io_uring_prep_sendmsg.

At least, that's my untested understanding.
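Something along these lines, I'd guess (also untested; the struct and helper names are made up):

    #include <liburing.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    struct udp_req {
        struct msghdr msg;
        struct iovec iov;
        struct sockaddr_in peer;
        char buf[2048];
    };

    /* Arm a recvmsg; the kernel fills in the peer address for us. */
    void queue_recv(struct io_uring *ring, int sock, struct udp_req *req)
    {
        memset(&req->msg, 0, sizeof(req->msg));
        req->iov.iov_base = req->buf;
        req->iov.iov_len = sizeof(req->buf);
        req->msg.msg_name = &req->peer;
        req->msg.msg_namelen = sizeof(req->peer);
        req->msg.msg_iov = &req->iov;
        req->msg.msg_iovlen = 1;

        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recvmsg(sqe, sock, &req->msg, 0);
        io_uring_sqe_set_data(sqe, req);
    }

    /* On completion (cqe->res = bytes received), send the same datagram back. */
    void queue_echo(struct io_uring *ring, int sock, struct udp_req *req, int nbytes)
    {
        req->iov.iov_len = nbytes;  /* echo exactly what we got */
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_sendmsg(sqe, sock, &req->msg, 0);
        io_uring_sqe_set_data(sqe, req);
    }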


Got a huge "is that who I think it is?" moment when I saw the HN homepage this morning. The author is a well-known legend in the Atlanta tech scene, especially amongst Georgia Tech alumni in their 30s and 40s. His stories and rants on Facebook are legendary.


Lots of other entertaining entries on that wiki if you look for them.

“this is the wiki of nick black (aka dank), located at 33°46′44.4"N, 84°23'2.4"W (33.779, 85.384) in the heart of midtown atlanta. dankwiki's rollin' wit' you, though I make no guarantees of its correctness, relevance, nor timeliness. track changes using the recent changes page. I've revived DANKBLOG, this wiki and grad school having not satisfied ye olde furor scribendi.

hack the planet! don't mistake my kindness for weakness.

i primarily write to force my own understanding, and remember things (a few entries are actually semi-authoritative). i'm just a disreputable Mariner on your way to the Wedding. if you derive use from this wiki, consider yourself lucky, and please get confirmation before relying on my writeups to perform surgery, design planes, determine whether a graph G is an Aanderaa–Rosenberg scorpion, or feed your pet rhinoceros. do not proceed if allergic to linux, postmodern literature, nuclear physics, or cartoonish supervillainy. “


whoa holdonaminute, dank is dan kegel's nick…


also dan kaminsky's and to a lesser degree dan knapp's. i was put on the spot as a sixteen year old Georgia Tech freshman acquiring his college of computing account, and between a love of good weed and childish stupidity, i chose "dank@". twenty-five years later, i'm not about to change it. no insults are meant to any of the aforementioned hax0rs.


The broader world probably knows him best for the terminal handling library Notcurses[1] and a lot of telling terminal emulator authors to get their shit together.

I’ve had his grad-school project libtorque[2] (HotPar ’10), an event-handling and scheduling library, on my to-read list for years, but I can’t seem to figure out how it accomplishes the interesting things it does.

[1] https://nick-black.com/dankwiki/index.php/Notcurses, https://github.com/dankamongmen/notcurses/

[2] https://nick-black.com/dankwiki/index.php/Libtorque


Once upon a time, I was sitting in my shack in Home Park when I got a random call from an independent tech recruiter. During the discussion, I mentioned my involvement in the Linux Users Group at Georgia Tech, at which point suddenly we were discussing Nick Black's life story, which was kind of weird (I guess I was mostly listening and not talking). I don't know why the conversation took that turn, I didn't initiate the topic change. We never actually talked about the job the recruiter was calling me about, and I never heard from the recruiter again after that. To this day I think I still don't really understand what happened.


Ah Home Park. Part of the circle of life for tech students.

1. enroll.

2. get tired of dorm life.

3. move to home park because it's likely all you can afford.

4. get broken into 5-10 times.

5. move away


The trick was to have a broken window pane that you patched with duct tape and cardboard.

Not one break in!

Figured no one thought we were worth it. Although we were two houses across the street from Kool Korner, so maybe that was the "nicer" part of Home Park?


Maybe it's cleaned up since the early 00's. We were over near Rocky Mountain.

I should have been more precise re: break-ins; I mostly meant car break-ins. No idea what the home rate was, but every other week there were 1-2 cars missing windows around Lynch.


Oof. Hemphill was pretty rough, pre-Antico metastasis cleaning up the far end.

The west & north sides of GT are unrecognizable to me now though -- condos and yuppie shops.


This made me chuckle, as part of the cohort described above. I read your comment, thought to myself "what on earth is he talking about? there's nobody on my radar who might fit that bill", checked the name, and immediately thought "oh yeah, that guy." :-) I'm not on Facebook, so I haven't seen any of those stories though.


Quite a formidable trivia player as well.


can't believe we're this far into a thread about Nick without someone mentioning Tab.


He's also famous for smoking Newports outside the NYC Google office.


That's sort of a wall of text.

I'd be curious to see

1. A Hello World, absolute minimum example

and

2. Doing something it's designed to do well, cut down to as small an example as possible.

Sheez, downvoters, I'm curious about it, and want to see an example. You don't learn to drive a car by reading the engine specs, either.


You can find some hello-world-level examples here: https://github.com/axboe/liburing/tree/master/examples.

Right now, what it can do really well is non-blocking file IO. My (limited) understanding is that, as of now, the benefits of io_uring over epoll for network IO are a bit more ambiguous. That said, io_uring is adding new features (already available in the Linux 6 kernel) that are really promising. See https://github.com/axboe/liburing/wiki/io_uring-and-networki....
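For the "absolute minimum example" request upthread, something like this is about as small as it gets with liburing (my own sketch, not one of the linked examples):

    /* Minimal io_uring example: read the start of a file asynchronously. */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        char buf[256];

        if (io_uring_queue_init(4, &ring, 0) < 0)
            return 1;

        int fd = open("/etc/hostname", O_RDONLY);  /* any readable file will do */
        if (fd < 0)
            return 1;

        /* Queue one read at offset 0 ... */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);          /* one syscall to submit */

        /* ... and wait for its completion. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);  /* one syscall to reap */
        if (cqe->res > 0)
            printf("read %d bytes: %.*s", cqe->res, cqe->res, buf);
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }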


You can find some "hello-world" style examples here, along with fantastic tutorial materials:

https://unixism.net/loti/tutorial/index.html


Testing at ScyllaDB showed that io_uring could be >3x the throughput of posix-aio, and 12% faster than linux-aio. This was back in 2020, so I am sure that there's been a lot of work done since then.

https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-wi...


Modern hardware does already work like that.

Look at AMD gpu command buffers, and xHCI (and NVMe).

But among those "ring buffers", which one is the best? (from a consumer/producer concurrent-access standpoint, on a modern CPU ofc)

If the commands are not convoluted, the programming is soooo much simpler and cleaner.


The article mentions the names for the queues are based on the NVMe standard.


huh, ofc I only have experience with ring buffers from AMD gpus and USB xHCI controllers, and have only heard that NVMe has ring buffers.

It seems USB xHCI is the most concurrency-friendly and hardware-friendly (IOMMU and non-IOMMU), as it supports "sparse ring buffers" (non-contiguous in bus address space). AMD gpu ring buffers are atomic read/write pointers (command ring buffers and "interrupt"/completion ring buffers) with doorbells to notify pointer updates.

I should have a look at Linux io_uring to see how far they went and which model they chose.

I wonder how modern hardware could use anything else than command ring buffers.


I've tinkered around with io_uring on and off for the last couple years. But I think it's really becoming quite cool (not that it wasn't cool before... :)). This was a really interesting post on what's new https://github.com/axboe/liburing/wiki/io_uring-and-networki.... The combination of ring-mapped buffers and multi-shot operations has some really interesting applications for high-performance networking. Hoping over the next year or two we can start to see really bleeding edge networking perf without having to resort to using DPDK :)
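To make that combination concrete, here is a rough sketch of provided (ring-mapped) buffers plus multishot recv through liburing. The sizes and group id are arbitrary, and the helpers need a recent liburing (2.4+) and kernel (6.0+ for multishot recv):

    #include <liburing.h>
    #include <stdlib.h>

    #define BUFS   64
    #define BUF_SZ 4096
    #define BGID   0          /* buffer group id, arbitrary */

    /* Register a ring of buffers the kernel picks from on each completion. */
    static struct io_uring_buf_ring *setup_buffers(struct io_uring *ring, char **pool)
    {
        int ret;
        struct io_uring_buf_ring *br =
            io_uring_setup_buf_ring(ring, BUFS, BGID, 0, &ret);
        if (!br)
            return NULL;

        *pool = malloc(BUFS * BUF_SZ);
        for (int i = 0; i < BUFS; i++)
            io_uring_buf_ring_add(br, *pool + i * BUF_SZ, BUF_SZ, i,
                                  io_uring_buf_ring_mask(BUFS), i);
        io_uring_buf_ring_advance(br, BUFS);
        return br;
    }

    /* One SQE, many completions: each CQE carries the chosen buffer id in
       cqe->flags (when IORING_CQE_F_BUFFER is set). */
    static void queue_multishot_recv(struct io_uring *ring, int sock)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv_multishot(sqe, sock, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
    }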


I’ve achieved great preliminary results with this in https://github.com/greatest-ape/aquatic#io_uring


Really cool. I've spent a lot of time digging through your code in aquatic :) It's a really cool project!


Shameless plug: I just finished 30+ hours of research into io_uring and posted a more high-level summary here: https://unzip.dev/0x013-io_uring/

Including a tl;dr, pros, cons, tools & players, and a forecast.

The creator of io_uring, Jens Axboe, was super nice and also reviewed it (I'm still a bit in shock).


Great summary! thanks for sharing!


Happy to help!


Very nicely done! I like this format


Thanks tomgs! I have this format for every new dev trend I find, keeps it structured.


Would be great if this was integrated into runtimes like tokio. I can totally see graphs of promise chains being translated into io_uring submissions that encode the underlying data dependencies, saving lots of context-switch overhead.

Pretty cool to see how far IO APIs have come with this.
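io_uring already has a primitive for exactly that kind of dependency: linked SQEs. A C sketch with liburing (the runtimes would presumably emit something equivalent; the helper name is mine):

    #include <liburing.h>

    /* Chain a read into a write: the write only starts once the read completes
       successfully, without the application being woken up in between. */
    void queue_copy(struct io_uring *ring, int in_fd, int out_fd,
                    char *buf, unsigned len)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, in_fd, buf, len, 0);
        sqe->flags |= IOSQE_IO_LINK;  /* next SQE depends on this one */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, out_fd, buf, len, 0);

        /* Note: a plain link can't pass the read's result into the write, so
           the write length must be known up front - one reason generalising
           this for arbitrary promise chains is tricky. */
        io_uring_submit(ring);
    }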


There already is a tokio_uring. I'm still a bit dubious about how much help it is for application-level stuff outside of normal file operations. E.g. the data-dependency stuff is a very granular scheduling concern that's hard to generalise in an easy-to-use way. But it's definitely real and worth watching.


As fantastic as tokio-uring is, it only uses the uring interface for local files, and not for the network access parts.

Glommio is a thread-per-core async executor/framework that supports io_uring however.


Maintainer of tokio-uring here. It uses io_uring for the network APIs as well.


Oh rad, my bad!

I read a while back that it only did fs; that was clearly incorrect/outdated/etc.


io_uring is the single coolest thing to come out of facebook


This is the future of linux syscalls. Get on board with this or get left behind.


I'm not sure if you're being sarcastic or not. This is a specialized case for improving performance. 99% of applications you use every day will not benefit from io_uring in any measurable way.





