
That sounds like an FTP-like measurement of throughput, and yeah, what you said will work just fine for that.

Netflix connections are typically about 1 Mbit/sec each (older apps open ~4 connections per video for reasons that are no longer valid, but the apps aren't all updated).

So to fill a 100 Gbit pipe they have 100,000 connections running at the same time, which makes filling that pipe super, super impressive.
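
Back of the envelope, for anyone following along:

    100 Gbit/s ÷ ~1 Mbit/s per connection ≈ 100,000 concurrent connections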




In our case we were doing a fair amount of data manipulation, so it wasn't strictly a case of pushing the data through, although we had higher bandwidth per stream.

But there are a bunch of different ways to solve the problems. I guess how impressive it is depends on how they have gone about solving their particular case. There are a fair number of network accelerators that offload individual stream-level management to little cores running on the network adapter itself. Cavium, EzChip, and now even companies like Mellanox are playing in this space: https://www.enterprisetech.com/2017/10/04/mellanox-etherneta....

So I'm not sure the impressive part is necessarily the stream counts so much as what they must be doing to "align" them (for lack of a better term), i.e. the trade-offs between keeping a few seconds of a video stream in RAM vs. sourcing it from disk/wherever, so that multiple users' streams are aligned and you avoid hitting a secondary storage medium. In Netflix's case I suspect that requiring fairly large buffers on the endpoint allows them to get away with a much lower QoS metric on any given stream.
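
To picture that trade-off, here is a toy sketch in C of the idea (names, chunk sizes, and cache shape are all made up for illustration, not anything Netflix actually does): a second viewer asking for the same chunk shortly after the first gets it from RAM instead of disk.

    /* Toy direct-mapped chunk cache; collisions just evict, error
     * handling omitted. Illustration only. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK_BYTES (2u * 1024 * 1024)   /* assumed chunk size */
    #define CACHE_SLOTS 4096u                /* ~8 GB of chunks kept in RAM */

    struct slot {
        uint64_t key;
        uint8_t *data;                       /* NULL = empty slot */
    };

    static struct slot cache[CACHE_SLOTS];

    /* Stand-in for the slow path (disk/origin fetch). */
    static void fetch_from_disk(uint64_t key, uint8_t *buf)
    {
        memset(buf, (int)(key & 0xff), CHUNK_BYTES);
    }

    /* Returns the chunk for (title, index); it comes straight from RAM
     * whenever two viewers' streams are "aligned" closely enough to share it. */
    const uint8_t *get_chunk(uint32_t title_id, uint32_t chunk_index)
    {
        uint64_t key = ((uint64_t)title_id << 32) | chunk_index;
        struct slot *s = &cache[key % CACHE_SLOTS];

        if (s->data == NULL || s->key != key) {   /* miss: hit the slow path */
            if (s->data == NULL)
                s->data = malloc(CHUNK_BYTES);
            fetch_from_disk(key, s->data);
            s->key = key;
        }
        return s->data;                           /* hit: served from RAM */
    }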

Put another way, at least the few times I've watched Netflix's bandwidth usage, it seems to be bursty: it blasts data at a few tens of MB/s, then sits idle for a few seconds while the stream plays, and then you get another chunk.
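
Rough duty-cycle math (numbers made up purely for illustration): if a stream averages ~4 Mbit/s and the client pulls ~25 MB per burst, then

    25 MB ≈ 200 Mbit ≈ 200 / 4 ≈ 50 seconds of playback per burst,

so even if the burst itself runs at 100+ Mbit/s, the connection is idle the vast majority of the time.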


Randall Stewart at Netflix did a new TCP implementation that helps quite a bit. And he did this really cool thing for the naysayers: he made it possible to have multiple TCP stacks running in FreeBSD at the same time. I believe the default is that you get the original stack, you can ask for his stack, and he also did a super simple TCP stack just to show how small a TCP stack could be.
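
If you want to poke at the multi-stack bit yourself, the mechanism (as I understand it; double-check the struct and option against netinet/tcp.h on your system) is a per-socket option that selects a stack by name:

    /* Rough sketch, treat the details as assumptions. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>

    static int use_alternate_stack(int sock, const char *name)
    {
        struct tcp_function_set tfs;

        memset(&tfs, 0, sizeof(tfs));
        strlcpy(tfs.function_set_name, name, sizeof(tfs.function_set_name));

        /* e.g. name = "rack"; the stacks actually loaded are listed by
         * `sysctl net.inet.tcp.functions_available`. */
        return setsockopt(sock, IPPROTO_TCP, TCP_FUNCTION_BLK,
                          &tfs, sizeof(tfs));
    }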

They are using either Chelsio or Mellanox cards, and they use the offload, but they are doing TLS with the Xeon CPUs. So they are getting 100 Gbit while touching every byte.
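
Just to make "touching every byte" concrete, the per-chunk CPU work is roughly this kind of thing. This is a generic userland OpenSSL AES-GCM sketch, not Netflix's actual in-kernel TLS path:

    #include <openssl/evp.h>

    /* Encrypt `len` bytes of `in` into `out` with AES-128-GCM; returns the
     * ciphertext length or -1. key = 16 bytes, iv = 12 bytes, tag = 16 bytes. */
    int encrypt_record(const unsigned char *key, const unsigned char *iv,
                       const unsigned char *in, int len,
                       unsigned char *out, unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int outlen = 0, tmplen = 0, ok;

        ok = ctx
          && EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), NULL, key, iv)
          && EVP_EncryptUpdate(ctx, out, &outlen, in, len)   /* every byte */
          && EVP_EncryptFinal_ex(ctx, out + outlen, &tmplen)
          && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);

        EVP_CIPHER_CTX_free(ctx);
        return ok ? outlen + tmplen : -1;
    }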

And don't underestimate how hard it is to do 100,000 TCP connections. When I was at SGI we had a bunch of big SMP machines (I think they were 12-CPU Challenges) that someone was using to serve up web pages (AOL? It was someone big). Modems brought those machines to their knees. You would think that would be easy, but it was not. A single fast stream (or a small number of them) is easy; a boatload of slow streams is hard. Think about it: if you have a TCP stack that gets a request and then nothing, you have all the overhead of finding that socket and doing that work, then nothing. It's way easier to have a stream of packets all for one socket.
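
For a feel of what a boatload of slow streams looks like from the server side, here is a bare-bones kqueue loop (illustrative only, error handling stripped). Every wakeup is one small read on some random socket, so the per-socket lookup and bookkeeping never gets amortized the way it does on one fast stream:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 1024

    void serve(int listen_fd)
    {
        int kq = kqueue();
        struct kevent ev, events[MAX_EVENTS];

        EV_SET(&ev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &ev, 1, NULL, 0, NULL);

        for (;;) {
            int n = kevent(kq, NULL, 0, events, MAX_EVENTS, NULL);
            for (int i = 0; i < n; i++) {
                int fd = (int)events[i].ident;
                if (fd == listen_fd) {
                    int c = accept(listen_fd, NULL, NULL);
                    EV_SET(&ev, c, EVFILT_READ, EV_ADD, 0, 0, NULL);
                    kevent(kq, &ev, 1, NULL, 0, NULL);
                } else {
                    /* one tiny request, then the socket goes quiet again */
                    char buf[512];
                    ssize_t r = read(fd, buf, sizeof(buf));
                    if (r <= 0) { close(fd); continue; }
                    /* handle_request(fd, buf, r);  -- hypothetical handler */
                }
            }
        }
    }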

It's that sort of stuff they worked on, so far as I can tell. Your caching idea is nice, but the cache hit rate is very, very low. They did way more work in the sendfile area, managing the page cache. Did you read Drew's post? It's worth a read for sure.
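
For reference, the sendfile() path being discussed looks roughly like this on FreeBSD (signature per the man page; offsets and sizes here are made up). The point is that the payload goes from the page cache to the socket without ever being copied into userland:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_chunk(int file_fd, int sock_fd, off_t off, size_t len)
    {
        off_t sent = 0;

        /* FreeBSD: sendfile(fd, s, offset, nbytes, hdtr, &sbytes, flags) */
        if (sendfile(file_fd, sock_fd, off, len, NULL, &sent, 0) == -1)
            return -1;   /* may be a short send on non-blocking sockets */

        return (int)sent;
    }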


I didn't mean to minimize the difficulties of maintaining that many TCP connections (much less getting useful work out of them). I read the original article when it was on HN, but must have mentally thrown most of it away due to the FreeBSD bias. So I just reread it, and the fact that they are getting those numbers while using much of the OS's buffer management and Nginx is impressive by itself. But their difficulties sort of play into my original assumptions: basically, if you want cutting-edge I/O perf, you're better off dumping most general-purpose OS I/O stacks unless you want to spend a lot of time re-engineering them to work around bottlenecks.

sendfile() is good, but the general concept tends to waste far too much time doing filesystem traversals, buffer management, DMA scatter-gather lists, and a bunch of other crap that gets in the way of getting a blob of data from the disk, encrypting it, and passing it off to a send offload to handle breaking it up and applying the TCP/IP headers/checksums. Frankly, the minimum MSS size is something that IPv6 should have fixed, given that no one is on 9600 bps modems, but didn't.

Good for them for realizing that modern machines have a little less than a GB/s of bandwidth per PCIe lane per direction, and memory bandwidth to match. If you don't mess up the CPU side of things, you can even touch all that data once or twice and still maintain pretty amazing I/O numbers.
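
To put numbers on that (PCIe 3.0): a lane runs at 8 GT/s with 128b/130b encoding, i.e. a bit under 1 GB/s per direction, so

    x16 slot ≈ 15.75 GB/s per direction, while 100 Gbit/s ≈ 12.5 GB/s,

which is why a single x16 slot can feed a 100G NIC with a little headroom to spare.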

EDIT: Also, in the case of x86 NUMA, you _REALLY_ want to make sure that the NVMe/source disk, the memory buffer you're writing to, and the network adapter are on the same node as the core doing the encryption/etc. That is pretty easy if the "application" controls buffer allocation/pooling, but much harder with a general-purpose OS, which will fragment the memory pools.
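
A minimal sketch of that affinity idea (shown with Linux's libnuma just to illustrate the principle; FreeBSD has cpuset/domainset equivalents, and the interface name and buffer size here are hypothetical):

    /* Keep the crypto/copy thread and its buffers on the NIC's NUMA node. */
    #include <numa.h>        /* link with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>

    static int nic_numa_node(const char *ifname)
    {
        char path[256];
        int node = -1;
        snprintf(path, sizeof(path),
                 "/sys/class/net/%s/device/numa_node", ifname);
        FILE *f = fopen(path, "r");
        if (f) { fscanf(f, "%d", &node); fclose(f); }
        return node;
    }

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        int node = nic_numa_node("mlx0");             /* hypothetical NIC name */
        if (node < 0) node = 0;

        numa_run_on_node(node);                       /* pin this thread's CPUs */
        void *buf = numa_alloc_onnode(1 << 21, node); /* 2 MB buffer on the node */

        /* ... read from the local NVMe, encrypt, hand to the NIC ... */

        numa_free(buf, 1 << 21);
        return 0;
    }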


We'll make it work on FreeBSD




