I did this years before io_uring, circa 2006. Working on a Linux-based networkin...

javajosh · on Oct 16, 2022

I find myself wondering what a truly high performance WebSocket server would look like, and if it would require loading a custom kernel module. Consider the worst-case, 1-msg-in N-msg-out for N "connections" (aka fan-out ratio). My understanding is that at the physical level subnets time slice a shared, serialized medium. The units there are a network frame (either wifi or ethernet). These frames are organized into IP and then TCP and finally give your process a "connection" from which data comes and into which data goes. WebSockets, to me, simply make the TCP socket abstraction accessible to browsers, with some extra setup cost but no runtime cost.

To be honest, ordinary "naive" programming methods are good enough to run a sizable single node WebSocket server. I'd be curious how much performance you can get out of a single server or, perhaps more broadly useful, a single core in a single Linux VPS, using different languages and relatively esoteric techniques like this.

moonchild · on Oct 16, 2022

And I thought I was original for coming up with this scheme after I heard about vdsos, a year or two before io_uring became a thing :)

flohofwoe · on Oct 16, 2022

3D APIs used the same idea for a long time, probably going all the way back to the mid-90's with D3D2 'execute buffers' (GL display lists are even older but just similar, they're expected to be recorded once and executed many times instead of being rebuilt each frame).

benlwalker · on Oct 16, 2022

And io_uring itself was more directly inspired by NVMe and RDMA, which of course work with these same queues as GFX cards. The original io_uring patch compares itself to SPDK, whose premise is "what if we expose an abstraction for a hardware queue per thread to an application " - basically the same programming model as io_uring. And SPDK was just taking techniques from networking (DPDK) and applying them to storage.

source - I helped create SPDK.

gavinray · on Oct 16, 2022

Thanks for SPDK, it's valuable and has a wide range of applications.

ajross · on Oct 16, 2022

IBM System/360 machines had programmable IO Channel processors back in the 1960's. Everything new is old.

Nor is io_uring the first or only asynchronous I/O API available to Linux userspace. It's just the best.

hinkley · on Oct 16, 2022

> Processes would just place messages into their circular buffers and update an index variable in the buffer header.

Did you implement a write barrier on this to ensure the compiler or CPU doesn’t run these two writes out of order?

kazinator · on Oct 16, 2022

I must have had barriers in there. I knew what they are. I had worked on glibc threads some five years before that (under the maintenance of Ulrich Drepper at the time), using lock-free algorithms, where we had operation like "compare and swap with acquire semantics" (meaning doing the right kind of barrier for acquiring a mutex).

The hardware this was running on supported cache-coherent NUMA, but we didn't use it in that mode; the blades that had multiple nodes ran multiple Linux instances that didn't share memory. Reordering effects at the hardware level weren't observed in the separate mode; if one core wrote to a location A and then B another core would not see the B update before A. Under ccNUMA, had we used it, I believe that would have been an issue.

Compiler reordering is always a threat, so you need at least that __volatile__ __asm__(": : memory"). In the user space infrastructure of that project, I made a library of atomic operations; there must have been barrier macros in there; I can't imagine providing atomic primitives without barriers.