I doubt IP/TCP offloading even makes that big of a difference, if SIMD (SSE2+ or...

maxhou · on Oct 18, 2015

You forgot the cost of memory access.

Offloading the L3 checksum is useless because the IP header is small and the kernel has to read/write all its fields anyway.

The L4 checksum covers TCP/UDP packet data, which the kernel can avoid touching if necessary.
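
To make the size difference concrete, here is a minimal Python sketch of the RFC 1071 Internet checksum that IP, TCP and UDP all use. The function name `internet_checksum` and the sample header bytes are only illustrative, and the TCP pseudo-header is left out. The point is just that the work scales with the bytes covered: 20 bytes for a typical IP header, versus the whole payload for L4.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Illustrative 20-byte IPv4 header with its checksum field zeroed.
ip_header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
header_sum = internet_checksum(ip_header)            # touches 20 bytes
payload_sum = internet_checksum(bytes(1460))         # touches 1460 bytes
```

A software L4 checksum has to pull every payload byte through the CPU like this, which is exactly the memory traffic the NIC offload avoids.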

When a TCP sender uses sendfile(), the kernel does a DMA read from storage into a page if the data is not already in memory (in the so-called page cache), and just asks the network card to send this page, prepended with an ETH/IP/TCP header. That only works if the NIC can checksum the TCP packet content and update the header.
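
From user space, that path looks roughly like the sketch below. It uses Python's `socket.sendfile()`, which relies on `os.sendfile()` where the platform provides it and otherwise falls back to plain `send()`. The helper name `send_file_zero_copy` is made up for illustration, and nothing in this code can tell you whether the NIC is actually doing the checksum; that is decided by the driver and kernel.

```python
import socket


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Send the whole file at `path` over a connected stream socket.

    The payload is never read into a user-space buffer here; the kernel
    moves it from the page cache to the socket when sendfile is available.
    Returns the number of bytes sent.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

The socket has to be a connected, blocking SOCK_STREAM socket for `socket.sendfile()` to work.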

If the network card can do TCP segmentation offload, the kernel does not have to repeat this operation for each 1500-byte packet: it can fetch a large amount of data from disk, and the NIC will split the data into smaller packets by itself.
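
As a rough picture of the splitting the NIC takes over, here is a toy Python version. The name `segment` and the default MSS of 1460 bytes are assumptions (a 1500-byte MTU minus 20-byte IP and 20-byte TCP headers, no options); a real NIC also rewrites sequence numbers, lengths and checksums in each header, which is not modeled here.

```python
def segment(payload: bytes, mss: int = 1460):
    """Split `payload` into (sequence_offset, chunk) pairs of at most `mss` bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [(off, payload[off:off + mss]) for off in range(0, len(payload), mss)]


# One 64 KiB buffer handed down in a single call becomes many wire-sized packets.
segments = segment(bytes(64 * 1024))
packet_count = len(segments)
```

Without the offload, the kernel runs the equivalent of this loop itself and builds a header for every chunk; with it, the kernel passes one large buffer plus a header template.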

matheweis · on Oct 18, 2015

In the benchmarking I had done, the [non-offloaded] bandwidth peaked at around 2.5-3 Gbit/s. Could have been trouble with the drivers, or a naive implementation, or any of a number of things. Didn't dig into it too deeply at the time as the offloading drivers worked fine.