I doubt IP/TCP offloading even makes that big of a difference, if SIMD (SSE2+ or AVX2+) can be used. A single CPU core is probably capable of TCP checksumming more than 100 Gbps.
Of course it's a completely another story without SIMD. A naive traditional checksum loop with a register dependency stall is just not going to be fast.
The L3 layer checksum is useless because IP packet is small and the kernel has to read/write all the fields anyway.
The L4 checksum covers TCP/UDP packet data, which the kernel can avoid touching if necessary.
When a TCP sender uses sendfile(), the kernel does a DMA read from storage to a page if the data is not already in memory (in the so called page cache), and just ask the network card to send this page, prepended with a ETH/IP/TCP header. That only works if the NIC can checksum the TCP packet content and update the header.
If the network card can do TCP segmentation offload, the kernel does not have to repeat this operation for each 1500 bytes packets, it can fetch a large amount of data from disk, and the NIC will split the data in smaller packets by itself.
The benchmarking I had done had the [non-offloaded] bandwidth peak out around 2.5-3Gbit/s. Could have been trouble with the drivers, or a naive implementation, or any of a number of things. Didn't dig into it too deeply at the time as the offloading drivers worked fine.
Of course it's a completely another story without SIMD. A naive traditional checksum loop with a register dependency stall is just not going to be fast.