> through io_uring instead of having to use different async options or threadpoo...

ncmncm · on July 30, 2022

If you are serious about network throughput, you are doing kernel bypass. Packets show up in your process memory in a ring buffer via DMA, asynchronously, with your code just polling for updates to a head index also updated via DMA from the network hardware. Packets are sent by appending them to another ring buffer and then writing to a device register also mapped.

mgaunard · on July 30, 2022

There isn't a significant difference between this and io_uring, both in terms of performance and API.

With io_uring the kernel is doing that for you (you have control of the affinity and how aggressively the kernel thread polls), and copying it to another ring buffer, which is then picked up by userland.

You still get sub-microsecond latency even with this two-level approach, and the advantage is that it works on every NIC that linux has a driver for.

Matthias247 · on July 30, 2022

There is a significant amount of difference for UDP. If you go ahead and profile any UDP based high pps or bps application, you will notice that a lot of time is spent in the kernels network stack - resolving routes, applying iptable rules, evaluating bpf scripts, copying data, and ultimately handing it over to the driver. On something like a QUIC server - this can easily account for 30% time - or even > 50% if optimizations like GSO (generic segmentation offload) are not used.

io_uring does pretty much nothing to help reducing this number. It reduces the syscall overhead, but that one actually isn't that high in those applications. What it also will do is moving the actual cost and latency of system calls from when the actual IO is attempted to the time when `io_uring_enter` is called, which essentially batches all IO. For some applications that might be useful to reduce overhead - for others it will just mean `io_uring_enter` becomes an extremely high latency operation which stalls the eventloop. This symptom can be avoided by using kernel-side polling for IO operations which doesn't require `io_uring_enter` anymore. But due to polling overhead this will only be a viable way for a certain set of operations too.

Kernel bypass (AF_XDP/dpdk/etc) will directly avoid the 30% overhead, at the cost of a reduced amount of tooling and observability.

For TCP the story might be slightly different, since the kernel overhead there is usually lower due to more offloads being available. But I think even there, there hasn't been a lot of proof that this actually provides a much different performance profile than existing nonblocking APIs.

mgaunard · on July 30, 2022

first, if you're using io_uring_enter, you're probably not using io_uring in any way that truly minimizes latency. Kernel-side polling is required not just for that, see the latest work on how it integrates with the NAPI driver infrastructure.

second, there should be no spurious copies if you set it up right; it is designed with zero-copy capabilities in mind.

third, whether or not it evalutes ebpf or wastes other time in the kernel stack is a matter of configuration and tuning.

fourth, there are several articles that already demonstrate that it achieves performance close to that of DPDK, with a much easier and more flexible setup.