Well, it doesn't apply to OP's question, which is why I didn't mention it directly, but I know for sure that most enterprise-class virtualization software doesn't pass the offload acceleration through to the VMs. (Yes, there is SR-IOV, but then you lose hot migration and other HA features. SolarFlare had it a while back, but it no longer works.)
I didn't mean to suggest that it's hard to find, just that a host of things need to be in place before you'll get the expected speeds. This isn't a knock on the tech or any manufacturer; it's just not mature enough yet to be like 1 Gbit, where you plug it in and almost everything starts running at 100 MByte/s.
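For reference, 100 MByte/s is close to the practical ceiling of gigabit Ethernet. A back-of-the-envelope sketch, assuming a 1500-byte MTU, standard Ethernet framing overhead (preamble/SFD, header, FCS, inter-frame gap) and IPv4/TCP headers without options; the function name is hypothetical:

```python
# Standard per-frame sizes in bytes (IPv4 and TCP headers without options).
PREAMBLE_AND_SFD = 8
ETH_HEADER = 14
ETH_FCS = 4
INTERFRAME_GAP = 12
IPV4_HEADER = 20
TCP_HEADER = 20


def tcp_goodput_mbyte_s(link_bit_s: float, mtu: int = 1500, tcp_option_bytes: int = 0) -> float:
    """Upper bound on TCP payload throughput (MByte/s) with full-size frames."""
    # Bytes that occupy the wire for one full-size frame.
    wire_bytes_per_frame = PREAMBLE_AND_SFD + ETH_HEADER + mtu + ETH_FCS + INTERFRAME_GAP
    # Bytes of that frame that are actual TCP payload.
    payload_per_frame = mtu - IPV4_HEADER - TCP_HEADER - tcp_option_bytes
    return link_bit_s / 8 * payload_per_frame / wire_bytes_per_frame / 1e6
```

For a 1 Gbit/s link this gives roughly 118.7 MByte/s, or about 117.7 MByte/s with the 12-byte TCP timestamp option, so 100 MByte/s is comfortably reachable without any special tuning.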
I doubt IP/TCP offloading even makes that big of a difference if SIMD (SSE2+ or AVX2+) can be used. A single CPU core is probably capable of TCP checksumming at more than 100 Gbps.
Of course, it's a completely different story without SIMD. A naive traditional checksum loop, where every addition has to wait for the previous one in the same register, is just not going to be fast.
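To make the SIMD point concrete, here is a minimal Python sketch of the RFC 1071 Internet checksum written two ways: a word-at-a-time loop that folds the carry on every step (one serial dependency chain), and a version that adds 64-bit chunks into several independent accumulators and folds only once at the end. Python itself won't show the speed difference; the point is that both produce the same result, which is the property that lets SSE2/AVX2 code sum many words in parallel. The function names and the choice of four lanes are illustrative assumptions.

```python
import struct


def csum_naive(data: bytes) -> int:
    """RFC 1071 Internet checksum, one 16-bit word at a time.

    Every iteration needs the previous value of `s` (add, then fold the
    carry back in), so in compiled code this is one long dependency chain.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd length with a trailing zero byte
    s = 0
    for (word,) in struct.iter_unpack("!H", data):
        s += word
        s = (s & 0xFFFF) + (s >> 16)  # end-around carry on every step
    return ~s & 0xFFFF


def csum_wide(data: bytes, lanes: int = 4) -> int:
    """Same checksum, summing 64-bit big-endian chunks into independent lanes.

    Ones' complement addition is associative and commutative, and 2**16 is
    congruent to 1 modulo 0xFFFF, so wide partial sums can be folded down
    to 16 bits once, at the very end.
    """
    data += b"\x00" * ((-len(data)) % 8)  # zero padding does not change the sum
    acc = [0] * lanes
    for i, (chunk,) in enumerate(struct.iter_unpack("!Q", data)):
        acc[i % lanes] += chunk  # lanes do not depend on each other
    s = sum(acc)
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF
```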
Offloading the L3 checksum is nearly useless because the IP header is small and the kernel has to read/write all of its fields anyway.
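As an illustration of how little work the L3 checksum is, here is a sketch of the IPv4 header checksum: for a header without options it is the ones' complement sum of just ten 16-bit words. The helper names are hypothetical; the test header is a frequently used worked example whose correct checksum is 0xB861.

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """Ones' complement checksum over an IPv4 header.

    With the checksum field (bytes 10-11) zeroed, this returns the value to
    store; over a received header it returns 0 if the header is intact.
    """
    if len(header) % 2:
        raise ValueError("an IPv4 header is always a multiple of 4 bytes long")
    s = sum(word for (word,) in struct.iter_unpack("!H", header))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)  # fold carries back into 16 bits
    return ~s & 0xFFFF


def set_ipv4_checksum(header: bytearray) -> None:
    """Zero the checksum field, then fill in the freshly computed value."""
    header[10:12] = b"\x00\x00"
    header[10:12] = struct.pack("!H", ipv4_header_checksum(bytes(header)))
```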
The L4 checksum covers TCP/UDP packet data, which the kernel can avoid touching if necessary.
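A sketch of why the kernel can skip the payload: it only sums the small IPv4 pseudo-header, stores that folded (uncomplemented) partial sum in the TCP checksum field, and leaves the rest to the NIC, which sums the whole segment and writes the complement. This roughly mirrors how Linux's partial checksum offload is commonly described, but `kernel_prepare` and `nic_finish` are hypothetical simulations, not kernel or driver APIs. The sketch assumes IPv4 and a TCP checksum at offset 16 of the TCP header.

```python
import socket
import struct


def fold(s: int) -> int:
    """Fold a wide ones' complement sum down to 16 bits."""
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s


def ones_sum(data: bytes) -> int:
    """Unfolded sum of big-endian 16-bit words (odd length padded with 0)."""
    if len(data) % 2:
        data += b"\x00"
    return sum(word for (word,) in struct.iter_unpack("!H", data))


def pseudo_header(src: str, dst: str, proto: int, length: int) -> bytes:
    """IPv4 pseudo-header: source, destination, zero, protocol, L4 length."""
    return socket.inet_aton(src) + socket.inet_aton(dst) + struct.pack("!BBH", 0, proto, length)


def kernel_prepare(segment: bytearray, src: str, dst: str, csum_offset: int = 16) -> None:
    """Hypothetical kernel side: touch only the 12-byte pseudo-header."""
    partial = fold(ones_sum(pseudo_header(src, dst, socket.IPPROTO_TCP, len(segment))))
    segment[csum_offset:csum_offset + 2] = struct.pack("!H", partial)


def nic_finish(segment: bytearray, csum_offset: int = 16) -> None:
    """Simulated NIC side: sum the whole segment (including the partial
    value already in the checksum field) and store the complement."""
    csum = ~fold(ones_sum(bytes(segment))) & 0xFFFF
    segment[csum_offset:csum_offset + 2] = struct.pack("!H", csum)


def software_checksum(segment: bytes, src: str, dst: str, csum_offset: int = 16) -> int:
    """Reference: the CPU computes the full checksum over everything."""
    zeroed = bytearray(segment)
    zeroed[csum_offset:csum_offset + 2] = b"\x00\x00"
    total = ones_sum(pseudo_header(src, dst, socket.IPPROTO_TCP, len(segment))) + ones_sum(bytes(zeroed))
    return ~fold(total) & 0xFFFF
```

Because ones' complement addition is order-independent, splitting the work between the two sides gives exactly the same checksum as computing it all on the CPU.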
When a TCP sender uses sendfile(), the kernel DMAs the data from storage into a page if it is not already in memory (in the so-called page cache), and just asks the network card to send this page, prepended with an ETH/IP/TCP header. That only works if the NIC can checksum the TCP payload and update the header.
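On the application side this path is just a system call. A minimal sketch using Python's `os.sendfile()`, which wraps sendfile() on Linux and other Unix systems (`socket.socket.sendfile()` is the higher-level wrapper that falls back to plain `send()` where sendfile() is unavailable). The helper name is hypothetical; whether the NIC then transmits straight from the page cache depends on its offload support, as described above.

```python
import os
import socket


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket using sendfile().

    The data moves from the file to the socket inside the kernel; it is
    never copied into a user-space buffer by this process.
    """
    sent_total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent_total < size:
            # sendfile() may send less than requested, so loop until done.
            sent = os.sendfile(sock.fileno(), f.fileno(), sent_total, size - sent_total)
            if sent == 0:  # EOF: the file shrank while we were sending
                break
            sent_total += sent
    return sent_total
```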
If the network card can do TCP segmentation offload, the kernel does not have to repeat this operation for every 1500-byte packet: it can fetch a large amount of data from disk, and the NIC will split it into smaller packets by itself.
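A simplified sketch of the splitting a TSO-capable NIC does: one large send becomes MSS-sized segments whose sequence numbers advance by the byte offset. It assumes an MSS of 1460 bytes (1500-byte MTU, IPv4 and TCP headers without options); `tso_segment` is a hypothetical name. Real hardware also rewrites per-segment IP lengths and IDs, adjusts TCP flags, and computes each segment's checksum, all of which this leaves out.

```python
from typing import List, Tuple


def tso_segment(payload: bytes, first_seq: int, mss: int = 1460) -> List[Tuple[int, bytes]]:
    """Split one large TCP send into (sequence number, payload) segments.

    Each segment's sequence number is the first sequence number plus the
    offset of its slice, modulo 2**32 because TCP sequence space wraps.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [((first_seq + off) % 2**32, payload[off:off + mss])
            for off in range(0, len(payload), mss)]
```

With these assumptions, a single 64 KiB buffer handed to the NIC turns into 45 segments on the wire, instead of 45 separate trips through the kernel's TCP output path.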
The benchmarking I had done had the non-offloaded bandwidth peak at around 2.5-3 Gbit/s. It could have been trouble with the drivers, a naive implementation, or any number of other things. I didn't dig into it too deeply at the time, as the offloading drivers worked fine.
Simple checksum computation and/or verification: indeed, most cards can do that (sometimes with restrictions: not for IPv6, not for VLAN...).
The other kind of offloading the kernel can use is TCP Segmentation Offload (TSO), which is much more complex to implement in hardware, and you won't find it on cheap NICs (like Realtek).
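On Linux, the feature list reported by `ethtool -k <interface>` is one way to see which of these a given card and driver actually offer, including restrictions such as IPv6 checksumming being unavailable. A minimal parser sketch, assuming the usual `name: on|off [fixed]` line format (sub-features are tab-indented); `parse_ethtool_features` and `Feature` are hypothetical names, and the example does not run ethtool itself.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    enabled: bool
    fixed: bool  # "[fixed]": the driver/hardware does not allow changing it


def parse_ethtool_features(text: str) -> Dict[str, Feature]:
    """Parse `ethtool -k` style output into {feature name: Feature}."""
    features: Dict[str, Feature] = {}
    for line in text.splitlines():
        # Sub-features are indented; "Features for eth0:" has no ": " and is skipped.
        name, sep, rest = line.strip().partition(": ")
        if not sep or not rest:
            continue
        value = rest.split()[0]
        if value not in ("on", "off"):
            continue
        features[name] = Feature(value == "on", "[fixed]" in rest)
    return features
```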
Any modern adapters you can think of that don't support TCP & UDP offloading (+ARP, etc.)? As far as I know, all of them support it.