Efficient IO in Linux with io_uring [pdf] (2019) (kernel.dk)
151 points by gurjeet on Oct 16, 2022 | hide | past | favorite | 32 comments



For people who are curious before/after reading it: this is from 2019, and there have been quite a few updates to the io_uring interface in Linux since then. Here's a changelog of liburing, the de facto library for io_uring: https://github.com/axboe/liburing/compare/liburing-0.4...lib...

Actually, io_uring has received around 500 commits just in 2022. Here's a list of interesting feature additions from this year:

- https://www.phoronix.com/news/Linux-LPC2022-io_uring_spawn

- https://www.phoronix.com/news/Linux-520-XFS-uring-Async-Buff

- https://www.phoronix.com/news/Linux-5.20-IO_uring-ZC-Send

- https://www.phoronix.com/news/Linux-5.19-IO_uring

An io_uring-like API was also recently adopted by Windows. Lots of fun.


That last line is interesting to me. One of the (admittedly few) things Windows had over Linux, when it came to writing extremely high-performance software in an easy way, was IOCP (I/O completion ports); epoll on Linux was awkward to use and hard to get right.

I've never managed to do a comparison of io_uring vs IOCP, but I'd guess there was some benefit here; I left that job before io_uring was mature.

Would anyone be willing to share a comparison of the two systems?


Windows did already have async ("overlapped") IO, and a completion aggregator (IOCP) kind of like io_uring. What Windows didn't have, and the reason they're now adding their own IORing, is the ability to submit batches of operations in a single system call. Batching operations to reduce system calls on the submission side is one of the most important features of io_uring.

The Windows IORing is only storage today, but hopefully becomes a generic system for making batched, async system calls just like on Linux.


Windows already had a uring-like API for sockets called Registered I/O (RIO). It was specifically designed to overcome IOCP limitations.

Edit: you can read about it by searching for the original RIO presentation, but the gist is that at very high packet rates, the completion signaling and the locking/unlocking of buffers for I/O cause a lot of overhead, and RIO mitigates that by preregistering buffers.


I think that highlights one of the issues with Linux being essentially just another clone of Unix, and thus being constrained by its limitations - both technically and in the mindset of its developers.

Whereas NT was a fresh new design, by seasoned operating systems designers and implementers, who had seen the shortcomings of the Unix model (and others) and not only knew what to do differently to improve on these, but actually wanted to and were permitted to as well.


Windows NT was not a fresh new design but a direct evolution of VMS, which had its own history independent of Unix. That it steered clear of Unix's shortcomings is the result of a design process that was much less napkin-based and more "engineered", for better and worse.


Nice, it seems a lot of improvements are going into recent kernels. It would be nice if it were fully adopted by Python's asyncio and JS (the PR for Node has been open for 4 years already; it seems stuck there).


I did this years before io_uring, circa 2006. Working on a Linux-based networking startup called Zeugma Systems, I implemented a kernel-based logging system for a multi-process, multi-node distributed application. I started on Linux 2.6.14. When that startup folded, I think we were on 2.6.27.

The logging system was implemented in a module which was inserted into the kernel and then used the calling thread to run a service. The module provided a device and some ioctls. Processes used the ioctls to attach circular buffers to the device, which was mapped into kernel space using get_user_pages.

Processes would just place messages into their circular buffers and update an index variable in the buffer header. The kernel would automatically pick up the message, without any system call. There was a wakeup ioctl to poke the kernel thread, which was used upon hitting a high water mark (buffer getting near full). This is the basic intuition behind io_uring.

The kernel thread collected messages from multiple buffers and sent them to several destinations (files and sockets).

I do not have most of this code, but some of it survived, including a kernel mutex and condition variable library featuring a function that lets you give up a mutex to wait on a condition variable, while also polling any mixture of kernel file and socket handles, with a timeout. This function was at the core of the kernel thread's loop.

The nice thing was that when processes crashed, their buffer would not go away immediately. Of course their own address space would be gone, but the kernel's mapping of the shared buffer would gracefully persist until the thread emptied the buffer; everything put into the buffer before the crash was safe. Empty buffers belonging to processes that had died would then be cleaned away.

I had a utility program that would list the buffers and the PIDs of their processes, and provide stats, like outstanding bytes and whether the process was still alive.

(The one inefficiency in logging was that log messages need time stamps. Depending on where you get a time stamp from, that requires a trip to the kernel. I can't remember what I did about that.)

A bit of a difficulty in the whole approach is that I wasn't getting a linear mapping of the user pages in the kernel. So I wrote the grotty C code (kernel side) to pull the messages correctly from a scrambled buffer whose pages are out of order, without making an extra copy.


I find myself wondering what a truly high performance WebSocket server would look like, and if it would require loading a custom kernel module. Consider the worst-case, 1-msg-in N-msg-out for N "connections" (aka fan-out ratio). My understanding is that at the physical level subnets time slice a shared, serialized medium. The units there are a network frame (either wifi or ethernet). These frames are organized into IP and then TCP and finally give your process a "connection" from which data comes and into which data goes. WebSockets, to me, simply make the TCP socket abstraction accessible to browsers, with some extra setup cost but no runtime cost.

To be honest, ordinary "naive" programming methods are good enough to run a sizable single node WebSocket server. I'd be curious how much performance you can get out of a single server or, perhaps more broadly useful, a single core in a single Linux VPS, using different languages and relatively esoteric techniques like this.


And I thought I was original for coming up with this scheme after I heard about vdsos, a year or two before io_uring became a thing :)


3D APIs used the same idea for a long time, probably going all the way back to the mid-90's with D3D2 'execute buffers' (GL display lists are even older but just similar, they're expected to be recorded once and executed many times instead of being rebuilt each frame).


And io_uring itself was more directly inspired by NVMe and RDMA, which of course work with these same queues as GFX cards. The original io_uring patch compares itself to SPDK, whose premise is "what if we expose an abstraction for a hardware queue per thread to an application " - basically the same programming model as io_uring. And SPDK was just taking techniques from networking (DPDK) and applying them to storage.

source - I helped create SPDK.


Thanks for SPDK, it's valuable and has a wide range of applications.


IBM System/360 machines had programmable IO Channel processors back in the 1960's. Everything new is old.

Nor is io_uring the first or only asynchronous I/O API available to Linux userspace. It's just the best.


> Processes would just place messages into their circular buffers and update an index variable in the buffer header.

Did you implement a write barrier on this to ensure the compiler or CPU doesn’t run these two writes out of order?


I must have had barriers in there. I knew what they were. I had worked on glibc threads some five years before that (under the maintenance of Ulrich Drepper at the time), using lock-free algorithms, where we had operations like "compare and swap with acquire semantics" (meaning doing the right kind of barrier for acquiring a mutex).

The hardware this was running on supported cache-coherent NUMA, but we didn't use it in that mode; the blades that had multiple nodes ran multiple Linux instances that didn't share memory. Reordering effects at the hardware level weren't observed in the separate mode; if one core wrote to location A and then to B, another core would not see the B update before A. Under ccNUMA, had we used it, I believe that would have been an issue.

Compiler reordering is always a threat, so you need at least a compiler barrier like __asm__ __volatile__("" ::: "memory"). In the user-space infrastructure of that project, I made a library of atomic operations; there must have been barrier macros in there. I can't imagine providing atomic primitives without barriers.


On page 5:

> To find the index of an event, the application must mask the current tail index with the size mask of the ring. This commonly looks something like the below:

  unsigned head;
  head = cqring->head;
  read_barrier();
  if (head != cqring->tail) {
    struct io_uring_cqe *cqe;
    unsigned index;
    index = head & (cqring->mask);
    cqe = &cqring->cqes[index];
    /* process completed cqe here */
    ...
    /* we've now consumed this entry */
    head++;
  }
  cqring->head = head;
  write_barrier();
Am I misunderstanding, or is "current tail index" supposed to be "current head index?"


Jens Axboe has been working on this for a long time. I'll stand behind anything he works on =)


Glad io_uring is gaining traction.

If you plan on using it, or just want to learn more about it, check this out:

https://github.com/espoal/awesome-iouring


Hopefully in a few years most people will be using it and have no idea due to framework adoption


needs a [2019]


Needs asynchronous libs to adopt it


It's adopted by Ruby: <https://github.com/socketry/io-event>, which is used by <https://github.com/socketry/async>, which is part of the Ruby 3+ Fiber Scheduler for lightweight concurrency. It shows promising performance.


Interesting, but I've read about some benchmarks showing the performance improvement over epoll is not actually reproducible, and that epoll is still faster.


There was some very excellent work done in Node.js's libuv, but it never got across the line & has sat around getting more out of date for a while.

https://github.com/libuv/libuv/issues/1947#issuecomment-4852...


4 years already..


Which in turn need that Linuxes that they support also support io_uring. This world turns slowly.


It's been supported for a long time now, right? Most major server distros should have it by now.


It is also improving a lot over time. What little support there was back in Linux 5.3 is not sufficient.


It's used by Eio in OCaml for the Linux backend: https://github.com/ocaml-multicore/eio.


Thanks. It's a shame to see the pull request waiting to be merged. It would be nice if Python's asyncio adopted it too.


just need some conditionals and branch operations and then we'll be able to write full software in io_uring :P



