> through io_uring instead of having to use different async options or threadpools for everything
io_uring is a threadpool, and to get performance above epoll, it probably needs to expose the pools and the scheduling choices. eBPF for scheduling? All the simplicity of io_uring basically goes out the window with that. I added ZEROCOPY to an io_uring instance and it wasn't a fun experience.
It is a very strange interface for networking code. Sending is already mostly async for TCP, and for UDP it is almost useless (especially without send/recv mmsg calls: you have to make all the packet ordering explicit when you create the requests, which is a complete pain in the ass). On the receiving end you have to keep a read request outstanding even though you have no idea how much you are going to read (reminds me of the HTTP async hack). The mmap packet interface is better there, just dropping packets off in your address space. For the most part read is a useless call for high performance stuff, since you're always reading and buffering. Reading UDP also sucks since there is no mmsg call, and you feel like you're picking up thousands of pennies off the ground because the asshole wouldn't give you a $10 bill.
It doesn't compose well, it's hard to debug, etc. There are other interfaces (besides the old, decrepit 200+ verb RDMA/RoCE/iWARP crapfest coming out of the latest standards running jokes). Do something besides BSD sockets on the high end. Anybody remember SysV IPC? That might be interesting to try again. BSD networking is not the be-all and end-all of networking APIs.
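For anyone who hasn't used the mmsg calls being missed here, this is roughly what the "$10 bill" looks like with plain sockets: one `sendmmsg()` and one `recvmmsg()` move a whole batch of datagrams. It's a minimal sketch over loopback; the function name, `BATCH`, and `PKT_SIZE` are made up for illustration, and it says nothing about what io_uring does or doesn't offer.

```c
#define _GNU_SOURCE            /* for sendmmsg/recvmmsg and struct mmsghdr */
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define BATCH 8                /* hypothetical batch size */
#define PKT_SIZE 64            /* hypothetical payload size */

/* Sends BATCH datagrams over loopback with a single sendmmsg() call and
 * collects them with a single recvmmsg() call. Returns the number of
 * datagrams received, or -1 on error. */
int mmsg_loopback_roundtrip(void)
{
    char out[BATCH][PKT_SIZE], in[BATCH][PKT_SIZE];
    struct iovec oiov[BATCH], iiov[BATCH];
    struct mmsghdr omsg[BATCH], imsg[BATCH];
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct timeval tmo = { .tv_sec = 1, .tv_usec = 0 };
    int got = -1;

    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;         /* let the kernel pick a free port */

    if (rx >= 0 && tx >= 0 &&
        bind(rx, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        getsockname(rx, (struct sockaddr *)&addr, &alen) == 0 &&
        connect(tx, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        /* bound the wait so a lost packet cannot hang the receive */
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof tmo) == 0) {

        memset(omsg, 0, sizeof omsg);
        memset(imsg, 0, sizeof imsg);
        for (int i = 0; i < BATCH; i++) {
            memset(out[i], 'a' + i, PKT_SIZE);
            oiov[i].iov_base = out[i];
            oiov[i].iov_len = PKT_SIZE;
            omsg[i].msg_hdr.msg_iov = &oiov[i];
            omsg[i].msg_hdr.msg_iovlen = 1;
            iiov[i].iov_base = in[i];
            iiov[i].iov_len = PKT_SIZE;
            imsg[i].msg_hdr.msg_iov = &iiov[i];
            imsg[i].msg_hdr.msg_iovlen = 1;
        }

        /* one syscall queues the whole batch, in array order */
        int sent = sendmmsg(tx, omsg, BATCH, 0);
        if (sent == BATCH) {
            /* one syscall drains up to BATCH datagrams */
            got = recvmmsg(rx, imsg, BATCH, 0, NULL);
            assert(got <= BATCH);
        }
    }
    if (rx >= 0) close(rx);
    if (tx >= 0) close(tx);
    return got;
}
```

After the call, `imsg[i].msg_len` holds the length of each received datagram, so the per-packet bookkeeping stays in one array instead of one request per packet.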
If you are serious about network throughput, you are doing kernel bypass. Packets show up in your process memory in a ring buffer via DMA, asynchronously, with your code just polling for updates to a head index also updated via DMA from the network hardware. Packets are sent by appending them to another ring buffer and then writing to a device register also mapped.
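To make the polling model concrete, here's a toy single-producer/single-consumer ring in plain C11. It's only a sketch: `ring_produce` stands in for the NIC's DMA writes, the struct layout and names are invented, and a real driver would deal with descriptors, device registers, and cache details that are skipped here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8u          /* power of two, so indices wrap with a mask */
#define SLOT_BYTES 256u        /* hypothetical max packet size */

struct rx_ring {
    _Atomic uint32_t head;     /* advanced by the producer (the "NIC") */
    _Atomic uint32_t tail;     /* advanced by the consumer (your code) */
    uint16_t len[RING_SLOTS];
    uint8_t data[RING_SLOTS][SLOT_BYTES];
};

/* Stand-in for the device: write the packet into the next slot, then
 * publish it by advancing head. Returns 1 on success, 0 if the ring is full. */
int ring_produce(struct rx_ring *r, const void *pkt, uint16_t n)
{
    assert(n <= SLOT_BYTES);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return 0;              /* full: a real NIC would drop the packet */
    uint32_t slot = head & (RING_SLOTS - 1);
    memcpy(r->data[slot], pkt, n);
    r->len[slot] = n;
    /* release: the payload must be visible before the new head is */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer: poll the head index; if it moved, copy one packet out.
 * Returns the number of bytes copied, or -1 if nothing has arrived. */
int ring_poll(struct rx_ring *r, void *out, size_t cap)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return -1;             /* no new packets, keep polling */
    uint32_t slot = tail & (RING_SLOTS - 1);
    size_t n = r->len[slot];
    if (n > cap)
        n = cap;               /* truncate to the caller's buffer */
    memcpy(out, r->data[slot], n);
    /* release: hand the slot back to the producer */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return (int)n;
}
```

The receive path has no syscall and no interrupt: the consumer just spins on `ring_poll` and notices when `head` moves.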
There isn't a significant difference between this and io_uring, both in terms of performance and API.
With io_uring the kernel is doing that for you (you have control of the affinity and how aggressively the kernel thread polls), and copying it to another ring buffer, which is then picked up by userland.
You still get sub-microsecond latency even with this two-level approach, and the advantage is that it works on every NIC that Linux has a driver for.
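A rough sketch of the affinity and polling knobs, assuming what's meant is the SQPOLL setup flags: `IORING_SETUP_SQPOLL` starts a kernel thread that polls the submission queue, `IORING_SETUP_SQ_AFF` plus `sq_thread_cpu` pins it, and `sq_thread_idle` is how many milliseconds it spins before going to sleep. This uses the raw `io_uring_setup` syscall rather than liburing; the helper names are made up.

```c
#define _GNU_SOURCE            /* for syscall() */
#include <assert.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill io_uring_params so the kernel runs a submission-queue polling
 * thread pinned to `cpu` that goes to sleep after `idle_ms` milliseconds
 * without new submissions. */
struct io_uring_params sqpoll_params(unsigned cpu, unsigned idle_ms)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof p);
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = cpu;       /* affinity of the polling thread */
    p.sq_thread_idle = idle_ms;  /* how long it spins before sleeping */
    return p;
}

/* Create a ring with those settings. Returns the ring fd, or -1 with
 * errno set (for example if io_uring is disabled on the system). */
int setup_sqpoll_ring(unsigned entries, unsigned cpu, unsigned idle_ms)
{
    assert(entries > 0);
    struct io_uring_params p = sqpoll_params(cpu, idle_ms);
    return (int)syscall(__NR_io_uring_setup, entries, &p);
}
```

A larger `sq_thread_idle` keeps the thread hot at the cost of burning a core; a smaller one saves CPU but means more wakeups through `io_uring_enter`.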
There is a significant difference for UDP. If you profile any UDP-based application with high pps or bps, you will notice that a lot of time is spent in the kernel's network stack: resolving routes, applying iptables rules, evaluating BPF programs, copying data, and ultimately handing the packet over to the driver. On something like a QUIC server, this can easily account for 30% of the time, or even more than 50% if optimizations like GSO (generic segmentation offload) are not used.
io_uring does pretty much nothing to help reduce this number. It reduces the syscall overhead, but that actually isn't very high in those applications. What it also does is move the cost and latency of the system calls from the point where the IO is attempted to the point where `io_uring_enter` is called, which essentially batches all IO. For some applications that might be useful to reduce overhead. For others it just means `io_uring_enter` becomes an extremely high latency operation which stalls the event loop. This symptom can be avoided by using kernel-side polling for IO operations, which no longer requires `io_uring_enter`. But because of the polling overhead, this too is only viable in certain cases.
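For reference, UDP GSO on Linux is a socket option: set `UDP_SEGMENT` to the desired datagram size and one large send gets split into datagrams of that size (the last one may be shorter), so the stack is traversed once per batch instead of once per packet. A minimal sketch, with invented helper names; the fallback `#define`s are there in case older libc headers lack the constants.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103        /* value from linux/udp.h */
#endif
#ifndef SOL_UDP
#define SOL_UDP 17             /* same as IPPROTO_UDP */
#endif

/* Number of datagrams produced when `len` bytes are sent on a socket
 * whose segment size is `seg`. */
size_t gso_datagram_count(size_t len, size_t seg)
{
    assert(seg > 0);
    return (len + seg - 1) / seg;   /* ceiling division */
}

/* Enable GSO on a UDP socket: every later send is split into datagrams
 * of `seg_size` payload bytes. Returns 0 on success, -1 with errno set
 * (for example on kernels without UDP_SEGMENT support). */
int enable_udp_gso(int fd, int seg_size)
{
    return setsockopt(fd, SOL_UDP, UDP_SEGMENT, &seg_size, sizeof seg_size);
}
```

For example, with a segment size of 1200, a single 4800-byte `send()` goes out as four datagrams.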
Kernel bypass (AF_XDP/dpdk/etc) will directly avoid the 30% overhead, at the cost of a reduced amount of tooling and observability.
For TCP the story might be slightly different, since the kernel overhead there is usually lower due to more offloads being available. But I think even there, there hasn't been a lot of proof that this actually provides a much different performance profile than existing nonblocking APIs.
First, if you're calling io_uring_enter, you're probably not using io_uring in a way that truly minimizes latency. Kernel-side polling is needed for more than just avoiding that call; see the latest work on how it integrates with the NAPI driver infrastructure.
Second, there should be no spurious copies if you set it up right; it is designed with zero-copy capabilities in mind.
Third, whether or not it evaluates eBPF or wastes other time in the kernel stack is a matter of configuration and tuning.
Fourth, there are several articles that already demonstrate that it achieves performance close to that of DPDK, with a much easier and more flexible setup.