
My understanding is that's the naive measurement of just the syscall operation itself (i.e. the time from issuing the call to the kernel executing it). Does that actually account for the performance lost to cache inefficiency? If I'm not mistaken, at a minimum the CPU has to flush or evict various caches on entry to the kernel, the kernel fills them with its own working set while it runs, and userspace then has to repopulate them after the syscall returns. In that case (even if it's not a full flush), you get a hard-to-measure slowdown in both the kernel code handling the request & the userspace code after the syscall, because the locality assumptions the caches rely on have been invalidated. With an io_uring model, since there are no context switches, temporal & spatial locality should provide an outsized benefit beyond just removing the syscall itself.
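
To be concrete about what I mean by the naive measurement: something like the rough sketch below (assuming Linux/glibc) only captures the direct round trip of a cheap syscall in a tight loop. It says nothing about the cache/TLB misses your own code takes after each return, which is exactly the part that's hard to measure.

  /* Rough sketch: times only the direct syscall round trip.
     Uses syscall(SYS_getpid) so libc can't short-circuit it.
     The indirect cost (cache/TLB misses in your code afterwards)
     doesn't show up here at all. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  int main(void) {
      const long iters = 1000000;
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < iters; i++)
          syscall(SYS_getpid);            /* cheap, near-no-op syscall */
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
      printf("~%.0f ns per syscall (direct cost only)\n", ns / iters);
      return 0;
  }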

Additionally, as noted elsewhere, you can chain operations pretty deeply (linked SQEs) so that the entire sequence completes in the kernel & never has to wake and schedule your process in between. That also helps spatial & temporal locality AND removes the cost of scheduling the process in the first place.
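
For anyone who hasn't played with it, here's a minimal sketch of that chaining using liburing's IOSQE_IO_LINK flag (assuming liburing is installed; infd/outfd and copy_block are just illustrative names): a read and a write go in with one submission, and the kernel only starts the write after the read completes.

  /* Sketch of a linked read->write chain with liburing.
     Both SQEs are submitted together; the kernel runs the write
     only once the linked read succeeds. Real code would size the
     write from the read's result and handle short reads. */
  #include <liburing.h>
  #include <stdio.h>

  static int copy_block(int infd, int outfd, size_t len) {
      struct io_uring ring;
      char buf[4096];
      if (len > sizeof(buf)) len = sizeof(buf);

      if (io_uring_queue_init(8, &ring, 0) < 0)
          return -1;

      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
      io_uring_prep_read(sqe, infd, buf, len, 0);
      sqe->flags |= IOSQE_IO_LINK;              /* chain to the next SQE */

      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_write(sqe, outfd, buf, len, 0);

      io_uring_submit(&ring);                   /* one submission, two ops */

      for (int i = 0; i < 2; i++) {             /* reap both completions */
          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(&ring, &cqe);
          if (cqe->res < 0)
              fprintf(stderr, "op %d failed: %d\n", i, cqe->res);
          io_uring_cqe_seen(&ring, cqe);
      }
      io_uring_queue_exit(&ring);
      return 0;
  }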
