Completion latency is one thing, bandwidth would be another. There's apparently a whole world of Alveo SmartNIC's and related FPGA platforms, and it can totally get in nanosecond range for whatever nails that may fit the compute-in-network hammer, even if bound by latency of the consuming system / RDMA interface. Also: https://github.com/corundum/corundum is really popular with the Chinese!