
These are the slides from my EuroBSDCon presentation. AMA



I eventually figured it out. But I would suggest maybe giving a brief 1-slide thingy on "lagg", and link aggregation? (Maybe it was clear from the presentation, but I only see the slides here soooooo...)

I'm not the best at network infrastructure, though I'm more familiar with NUMA stuff. So I was trying to figure out how you only got one IP address on each box despite having 4 ports across 2 NICs.

I assume some Linux / Windows devops people are just not as familiar with FreeBSD tools like that!

EDIT: Now that I think of it: maybe a few slides on how link-aggregation across NICs / NUMA could be elaborated upon further? I'm frankly not sure if my personal understanding is correct. I'm imagining how TCP-connections are fragmented into IP-packets, and how those packets may traverse your network, and how they get to which NUMA node... and it seems really more complex to me than your slides indicate? Maybe this subject will take more than just one slide?


Thanks for the feedback. I was hoping that the other words on the slide (LACP and bonding) would give enough context.

I'm afraid that my presentation didn't really have room to dive much into LACP. I briefly said something like the following when giving that slide:

Basically, the LACP link partner (router) hashes traffic consistently across the multiple links in the LACP bundle, using a hash of its choosing (typically an N-tuple, involving IP address and TCP port). Once it has selected a link for that connection, that connection will always land on that link on ingress (unless the LACP bundle changes in terms of links coming and going). We're free to choose whatever egress NIC we want (it does not need to be the same NIC the connection entered on). The issue is that there is no way for us to tell the router to move the TCP connection from one NIC to another (well, there is in theory, but our routers can't do it).

I hope that helps


> We're free to choose whatever egress NIC we want

Wait, I got lost again... You say you can "output on any egress NIC". So all four egress NICs have access to the TLS encryption keys and are cooperating through the FreeBSD kernel to get this information?

Is there some kind of load-balancing you're doing on the machine? Trying to see which NIC has the least amount of traffic and routing to the least utilized NIC?


LACP (which is a control protocol) is a standard feature supported forever on switches. To put it simply, both the server and the switch see a single (fake) port that has physical Ethernet members, and a hashing algorithm puts the traffic on a physical Ethernet link based on the selected hash. The underlying sw/hw picks which physical link to put the packet on. The input to the hash used to pick the link can be src/dst MAC, port, or IP address. LACP handles the negotiation of the aggregation between the ends and also signals a link failure ("hey man, something broke, we have one less link now").

For any given single flow it will hash to the same link. So, for example, in a 4x10G LAG (also called a port-channel in networking speak), the max bandwidth for a single flow would be 10G, the max of a single member. In an ideal world the hashing would be perfectly balanced; however, it is possible to have a set of flows all hash to the same link. Hope that helps.
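
To make the hashing part concrete, here's a toy sketch (not any real switch's algorithm; the tuple fields and the mixing function are purely illustrative) of how a layer-3/4 hash maps a flow onto one member of a 4-link LAG:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical 5-tuple identifying one TCP flow. */
    struct flow {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    /* Toy hash: real switches use vendor-specific (often hardware) hashes,
     * but the principle is the same -- mix the tuple fields, then take the
     * result modulo the number of member links in the LAG. */
    static unsigned pick_link(const struct flow *f, unsigned n_links)
    {
        uint32_t h = f->src_ip ^ f->dst_ip ^ f->proto;
        h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
        h ^= h >> 16;
        h *= 0x45d9f3bU;        /* cheap integer mixing step */
        h ^= h >> 16;
        return h % n_links;     /* same flow always -> same member link */
    }

    int main(void)
    {
        struct flow f = { 0x0a000001, 0x0a000002, 49152, 443, 6 };
        /* In a 4x10G LAG, every packet of this flow lands on one 10G member,
         * so a single flow can never exceed one member's bandwidth. */
        printf("flow hashes to member link %u of 4\n", pick_link(&f, 4));
        return 0;
    }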


That's an excellent overview. I think I got the gist now.

There are all sorts of subtle details, though those are probably just "implementation details" of this system. How and where do worker threads spawn? Clearly sendfile / kTLS have great synergy, etc.

It's a lot of detail, and an impressive result for sure. I probably don't have the time to study this on my own, so of course I've got lots and lots of questions. This discussion has been very helpful.

------

I think some of the "missing picture" is the interaction of sendfile / kTLS. It makes sense, but just studying these slides has solidified a lot for me as well: https://papers.freebsd.org/2019/EuroBSDCon/shwartsman_gallat...

Adding the NUMA things "on top" of sendfile/kTLS is clearly another issue. The hashing of TCP/port information into particular links is absolutely important, because the "physical location" of the ports matters.

I think I have the gist at this point. But that's a lot of moving parts here. And the whole NUMA-fabric being the bottleneck just ups the complexity of this "simple" TLS stream going on...

EDIT: I guess some other bottleneck exists for Intel/Ampere's chips? There's no NUMA in those. Very curious.

----

Rereading the "Disk centric siloing" slides later on actually answers a lot of my questions. I think my mental model was disk-centric siloing and I just didn't realize it. Those slides describe exactly how I thought this "should" have worked, but it seems like that strategy was shown to be inferior to the strategy discussed in the bulk of this presentation.

Hmmm, so my last "criticism" of this excellent presentation: maybe an early slide that lays out the strategies you tried (disk siloing, network siloing, software kTLS, and hardware kTLS offload)?

Just one slide at the beginning saying "I tried many architectures" would remind the audience that many seemingly good solutions exist. Something I personally forgot in this discussion thread.


The whole paper is discussing bottlenecks, path optimisation between resources, and the impact of those on overall throughput. It's not a simple load-balancing question being answered.


That's in terms of LACP in general, not for HW TLS. For HW TLS, the keys and crypto state are NIC-specific.


If we are using LACP between the router and the server, it means we create a single logical link between them. We can use as many physical links as are supported by both the server and the router. The server and router will treat them as a single link. Thus the ingress and egress of a packet don't really matter.



What's happening at the switch(es) the NICs are connected into?


Although we're free to choose the egress port in LACP, it's still wise to maintain some kind of client or flow affinity, to avoid inadvertent packet reordering.


LACP/link aggregation are IEEE Ethernet standard concepts supported by nearly every hardware or software network stack. https://en.wikipedia.org/wiki/Link_aggregation


It’s an IEEE standard not a FreeBSD thing.

https://en.m.wikipedia.org/wiki/Link_aggregation


I build this stuff, so it's very cool to read this; I can't really be public about my own work. Are you using completely custom firmware on your Mellanoxes? Do you have plans for NVMe-oF? I've had so many issues with kernels/firmware scaling this stuff that we've got a team of kernel devs now. Also, how stable are these servers? Do you feel like they're teetering at the edge of reliability? I think once we've ripped out all the OEM firmware we'll be in a much better place.

Are you running anything custom in your Mellanoxes? DPDK stuff?


No, nothing custom, we don't have access to the source.

We like to run each server as an independent entity, so we don't run NVMe-oF.

They're pretty much rock solid.

If you'd like to discuss more, ping me via email. Use my last name at gmail . com.


It is hard to decipher your message, but just to clarify: firmware doesn't process packets (data plane), it only manages and configures hardware (control plane). And no, you definitely won't be in a "much better place" by "ripping it out", because modern NICs have very complex firmware with hundreds (if not thousands) of man-years spent implementing and optimizing it.


Bud, I work on this stuff. I know all about Cavium and Mellanox firmware and its issues, specifically with things like the math/integers used to determine packet flow (which is used to charge customers) having issues on specific versions that have been internally patched. Do you think I just randomly typed this? It's even worse now that a huge chunk of their firmware teams have been lost in the shuffle of all of the competitors being bought and brought under one umbrella. Do you recall a bug on AWS years ago where they were incorrectly charging customers for bandwidth usage? Everyone uses these types of NICs; tons of PaaS/SaaS corps had that problem.

What an obnoxious, pedantic and naive response.

"And no, you definitely won't be in "much better place" by "ripping it off" because modern NICs have very complex firmware with hundreds (if not thousands) man/years spent implementing and optimizing it."

What do you think I just explained? I literally write this stuff. Modern NIC firmware is still written in C and still has bugs. Do you seriously think drivers/firmware aren't going to have bugs? You just said yourself that they're high complexity. I can't believe I'm even needing to explain this. You've clearly never been a network engineer; do you have any idea how many bugs are in Juniper, Cisco, Palo, etc.?

If you don't work on bare metal architecture/distributed systems, move along. This isn't sysadmin talk. Almost nothing used is stock; even k8s gets forked due to bugs. You can't resell PaaS with bugs that conflict with customer billing. NFLX isn't reselling bandwidth, so they likely don't encounter these issues; they're using something like Cedexis to force their CDN providers at the edge to compete with one another down to the lowest cost/reliability, and they're liable for things like this and can be sued or suffer losses based on their contract. They (CDN customers) are acutely aware of when billing doesn't match up with realized BW usage. They'll drop the traffic to your CDN and split more of it over to a "better" CDN until those issues are mitigated - and while you're not getting that traffic you're not getting that customer's expected monthly payment... because now they're buying less bandwidth from you because Cedexis tells them that you're less performant/reliable.

ANY large customer buying bandwidth from a CDN does this; none of this is specific to NFLX, whom I know nothing about beyond "this is how reselling PaaS works."

I bet the next response is "no way NFLX uses a CDN"... LOL. They all do, my friend: HBO, Paramount, Disney, etc. They aren't in the biz of edge caching.


You really overreacted here.


How would you characterize the economics of this vs. alternative solutions? Achieving 400Gb/s is certainly a remarkable achievement, but is it the lowest price-per-Gb/s solution compared to alternatives? (Even multiple servers.)


I'm not on the hardware team, so I don't have the cost breakdown. But my understanding is that flash storage is the most expensive line item, and it matters little whether it's consolidated into one box or spread over a rack; you still have to pay for it. By serving more from fewer boxes, you can reduce component duplication (cases, mobos, RAM, PSUs) and, more importantly, the power & cooling required.

The real risk is that we introduce a huge blast radius if one of these machines goes down.


I almost fell for the hype of PCIe Gen4 after reading https://news.ycombinator.com/item?id=25956670, and it is quite interesting that PCIe Gen3 NVMe drives can still do the job here. What would be the worst-case disk I/O throughput while serving 400Gb/s?


If you look at just the pci-e lanes and ignore everything else, the NICs are x16 (gen4) and there's two of them. The NVMes are x4 (gen3) and there are 18 of them. Since gen4 is about twice the bandwidth of gen3, it's 32 lanes of gen4 NIC vs about 36 lanes of gen4 equivalent NVMe.

If we're only worried about throughput, and everything works out with queueing, there's no need for gen4 NVMes because the storage has more bandwidth than the network. That doesn't mean gen4 is only hype; if my math is right, you need gen4x16 to have enough bandwidth to run a dual 100G ethernet at line rate, and you could use fewer gen4 storage devices if reducing device count were useful. I think for Netflix, they'd like more storage, so given the storage that fits in their systems, there's no need for gen4 storage; gen4 would probably make sense for their 800Gbps prototype though.

In terms of disk I/O, either in the thread or the slides, drewg123 mentioned only about 10% of requests were served from page cache, leaving 90% served from disk, so that would make worst case look something like 45GB/sec (switching to bytes cause that's how storage throughput is usually measured). From previous discussions and presentations, Netflix doesn't do bulk cache updates during peak times, so they won't have a lot of reads at the same time as a lot of writes.
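
As a rough back-of-the-envelope check on those numbers (the per-lane figures are approximations, ~1 GB/s usable for Gen3 and ~2 GB/s for Gen4):

    #include <stdio.h>

    int main(void)
    {
        /* Approximate usable per-lane PCIe bandwidth, in GB/s. */
        const double gen3_lane = 1.0;    /* ~0.985 GB/s */
        const double gen4_lane = 2.0;    /* ~1.97 GB/s  */

        double nic_bw  = 2 * 16 * gen4_lane;   /* two x16 Gen4 NICs       */
        double nvme_bw = 18 * 4 * gen3_lane;   /* eighteen x4 Gen3 drives */

        printf("NIC PCIe bandwidth:   ~%.0f GB/s (~%.0f Gb/s)\n", nic_bw, nic_bw * 8);
        printf("NVMe PCIe bandwidth:  ~%.0f GB/s (~%.0f Gb/s)\n", nvme_bw, nvme_bw * 8);

        /* Dual 100GbE needs 200 Gb/s = 25 GB/s; one Gen4 x16 slot gives ~32 GB/s. */
        printf("Gen4 x16 slot:        ~%.0f GB/s vs 25 GB/s for 2x100GbE\n",
               16 * gen4_lane);

        /* If ~90% of the 400 Gb/s is served from disk rather than page cache: */
        printf("Worst-case disk read: ~%.0f GB/s\n", 400.0 * 0.9 / 8.0);
        return 0;
    }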


Thanks for the numbers. Perhaps hype is not the right word. It is just interesting to see that some older hardware can still be used to achieve the state of the art performance, as the bottleneck may lie elsewhere.


It's always balancing bottlenecks. Here, the bottleneck is memory bandwidth, limiting to (more or less) 32 lanes of network; the platform has 128 lanes, so using more lanes than needed at a slower rate works and saves a bit of cost (probably). Their Intel Ice Lake test machine only had 64 lanes, which is also a bottleneck, so they used Gen4 NVMe to get the needed storage bandwidth into the lanes available.


Thank you for this link, super useful for my next project.


Ouch. I guess there is always a trade off to be made…


One of the factors is that ISPs that install these boxes have limited space and many companies wanting to place hardware there. If Netflix can push more traffic in less space, that makes it a better option for ISPs that want to reduce external traffic, and thus more likely to be installed, benefiting their users.

If what I found is current, Netflix has 1U and 2U appliances, but FB does a 1U switch + groups of 4x 2U servers, so starting at 9U; but I was looking at a 2016 FB PDF that someone uploaded, so they may have changed their deployments since then. 2U vs 9U can make a big difference.


I doubt that the 32-core EPYC they focused on is even the most economical solution in this situation.

If they're really RAM-bandwidth constrained, then the 24-core 74F3 (which still has all 256 MB of L3 cache) or even the 8-core 72F3 may be better.


L3 won’t matter.

The more compute clusters, the more PCIe lanes in EPYC, and the SSD lanes go direct per 8 cores.


I recall when I was at Intel in the 90s and 3MB of L3 was a big deal.


The 8 core still has 128 lanes and 8 channels of memory, though.


Particularly, kTLS is a solution that fits here, on a single-box, but I wonder how things would look if the high-perf storage boxes sent video unencrypted and there was a second box that dealt only with the TLS. We'd have to know how many streams 400Gb/s represents though, and have a far more detailed picture of Netflix's TLS needs/usage patterns.


A proxy for TLS wouldn't help with this load.

That proxy would still need to do kTLS to reduce the required memory bandwidth to something the system can manage, and then you're at roughly the same place. The storage nodes would likely still have kTLS capable NICs because those are good 2x100G NICs anyway. It would be easier to NUMA align the load though, so there might be some benefit there. With the right control software, the proxy could pick an origin connection that was NUMA aligned with the client connection on the proxy and the storage on the origin. That's almost definitely not worth doubling the node count for though, even if proxy nodes don't need storage so they're probably significantly less expensive than flash storage nodes.


Could Netflix replace TLS with some in-house alternative that would push more processing to the client? Something that pre-encrypts the content on disk before sending, eliminating some of the TLS processing requirements?


If you pre-encrypt contents on disk, it wouldn't be a per-user unique key.


The content is already encrypted anyway for DRM.

I'd assume TLS is used in large part for privacy reasons (so ISPs can't sell info on what shows are popular)


This. Can someone add to this? It's crucial.


This level of architecture management on big server CPUs is amazing! I occasionally handle problems like this on a small scale, like minimizing wake time and peripheral power management on an 8 bit microcontroller, but there the entire scope is digestible once you get into it, and the kernel is custom-designed for the application.

However, in my case, and I expect in yours, requirements engineering is the place where you can make the greatest improvements. For example, I can save a few cycles and a few microwatts by sequencing my interrupts optimally or moving some of the algorithm to a look-up table, but if I can, say, establish that an LED indicator flash that might need to be 2x as bright but only lasts for a couple milliseconds every second is as visible as a 500ms LED on/off blink cycle, that's a 100x power savings that I can't hope to reach with micro-optimizations.

What are your application-level teams doing to reduce the data requirements? General-purpose NUMA fabrics are needed to move data in arbitrary ways between disc/memory/NICs, but your needs aren't arbitrary - you basically only require a pipeline from disc to memory to the NIC. Do you, for example, keep the first few seconds of all your content cached in memory, because users usually start at the start of a stream rather than a few minutes in? Alternatively, if 1000 people all start the same episode of Stranger Things within the same minute, can you add queues at the external endpoints or time shift them all together so it only requires one disk read for those thousand users?


> Alternatively, if 1000 people all start the same episode of Stranger Things within the same minute

It would be fascinating to hear from Netflix on some serious details of the usage patterns they see and particular optimizations that they do for that, but I doubt there's so much they can do given the size of the streams, the 'randomness' of what people watch and when they watch it, and the fact that the linked slides say the servers have 18x2TB NVMe drives and 256GB of RAM each.

I wouldn't be surprised if the Netflix logo opener exists once on disk instead of being the first N seconds of every file though.


In previous talks Netflix has mentioned that, due to serving so many thousands of people from each box, they basically do zero caching in memory; all of the system memory is needed for buffers that are en route to users, and they purposely avoid keeping any buffer cache beyond what is needed for sendfile().


Hey Drew, thanks for taking the time.

What would you rate the relative complexity of working with the NIC offloading vs the more traditional optimizations in the rest of the deck? Have you compared other NIC vendors before, or has Mellanox been the go-to that's always done what you've needed?


I wish the video was online... We tried another vendor's NIC (don't want to name and shame). That NIC did kTLS offload before Mellanox. However, they could not retain TLS crypto state in the middle of a record. That meant we had to coerce TCP into trying really hard to send at TLS record boundaries. Doing this caused really poor QoE metrics (increased rebuffers, etc.), and we were unable to move forward with them.


Is there any way we can see a video of the presentation? I'm extremely interested.


The videos should appear on the conference's youtube channel in a few weeks: https://www.youtube.com/eurobsdcon


Is there something I as a person can contribute to make the videos available sooner?

I have video editing skills and have also done some subtitling of videos as a pastime. I have all of the software necessary to perform both of these tasks and would be willing to do so free of charge.


Was FreeBSD your first choice? Or did you try with Linux first? What were the numbers for Linux-based solution, if there was one?


FreeBSD was selected at the outset of the Open Connect CDN (~2012 or so).

We did a bake off a few years ago, and FOR THIS WORKLOAD FreeBSD outperformed Linux. I don't want to get into an OS war, that's not productive.


It's important to consider that we've poured man-years into this workload on FreeBSD. Just off the top of my head, we've worked on in-house, and/or contributed to or funded, or encouraged vendors to pursue:

- async sendfile (so sendfile does not block, and you don't need thread pools or AIO)
- RACK and BBR TCP in FreeBSD (for good QoE)
- kTLS (so you can keep using sendfile with TLS; saves ~60% CPU over reading data into userspace and encrypting there)
- NUMA
- kTLS offload (to save memory bandwidth by moving crypto to the NIC)

Not to mention tons of VM system and scheduler improvements which have been motivated by our workload.

FreeBSD itself has improved tremendously over the last few releases in terms of scalability


> FreeBSD itself has improved tremendously over the last few releases in terms of scalability

True. FreeBSD (or its variants) has always been a better performer than Linux in the server segment. Before Linux became popular (mostly due to better hardware support), xBSD servers were famous for their low maintenance and high uptime (and still are). This archived page of NetCraft statistics ( https://web.archive.org/web/20040615000000*/http://uptime.ne... ) provides an interesting glimpse into internet history: how, 10+ years back, the top 50 servers with the highest uptimes were often xBSD servers, and how Windows and Linux servers slowly replaced xBSD.

(Here's an old HN discussion about a FreeBSD server that ran for 18 years - https://news.ycombinator.com/item?id=10951220 ).


A lot of this is in Linux now, right? I am asking for a personal opinion and not necessarily a "why-don't-you-move-to-Linux" question.

Genuinely curious where you see the state of the art when it comes to Linux.


Yes. I ran it for a bake-off ~2 years ago. At the time, the code in Linux was pretty raw, and I had to fix a bug in their SW kTLS that caused data corruption that was visible to clients. So I worry that it was not in frequent use at the time, though it may be now.

My understanding is that they don't do zero-copy inline kTLS, but I could be wrong about that.


Thank you for pushing kTLS!


Was licensing also a contributing factor, or was that irrelevant for you?


My understanding is that licensing did factor into the decision. However, I didn't join Netflix until after the decision had been made.


I recall reading that Netflix chose FreeBSD a decade ago because asynchronous disk I/O on Linux was (and still is?) broken and/or limited to fixed block offsets. So nginx just works better on FreeBSD versus Linux for serving static files from spinning rust or SSD.


This used to be the case, but with io_uring, Linux has very much non-broken buffered async I/O. (Windows has copied io_uring pretty much verbatim now, but that's a different story.)


Could you expand more on the Windows io_uring bit please?

I have run Debian-based Linux my entire life and recently moved circumstantially to Windows. I have no idea how its kernel model works, and I find io_uring exciting.

Wasn't aware of any adoption of io_uring ideas in Windows land, sounds interesting


Windows has had “IO completion ports” since the 1990s which work well and are high performance async for disk/network/other IO operations.


This isn't the same as the old Windows async I/O. ptrwis' links are what I thought of (and it's essentially a 1:1 copy of io_uring, as I understand it).



By how much?


As an infrastructure engineer, these numbers are absolutely mind blowing to me!

Not sure if it’s ok to ask.. how many servers like this one does it take to serve the US clients?


I don't have the answer, and even if I did, I'm not sure I'd be allowed to tell you :)

But note that these are flash servers; they serve the most popular content we have. We also have "storage" servers with huge numbers of spinning drives that serve the longer tail. They are constrained by spinning rust speeds, and can't serve this fast.


I found somewhere that Netflix has ~74 million US/canada subscribers. If we guesstimate half of those might be on at peak time, that's 37 million users. At 400k connections/server that's only 85 servers to serve the connections, so I think the determining factor is the distribution of content people are watching.


What led you to investigate PCIe relaxed ordering? Can you suggest a book or other resource to learn more about PCIe performance?


To be honest, it was mostly the suggestion from AMD.

At the time, AMD was the only Gen4 PCIe available, and it was hard to determine if the Mellanox NIC or the AMD PCIe root was the limiting factor. When AMD suggested Relaxed Ordering, that brought its importance to mind.


How do you benchmark this? Do you use real-life traffic, or have a fleet of TLS clients? If you have a custom test suite, are the clients homogeneous? How many machines do you need? Do the clients use kTLS?


We test on production traffic. We don't have a testbench.

This is problematic, because sometimes results are not reproducible.

Eg, if I test on the day of a new release of a popular title, we might be serving a lot of it cached from RAM, so that cuts down memory bandwidth requirements and leads to an overly rosy picture of performance. I try to account for this in my testing.


Test in prod FTW!

However, how do you test the saturation point when dealing with production traffic? Won't you have to run your resources underprovisioned in order to achieve saturation? Doesn't that degrade the quality of service?

Or are these special non-ISP Netflix Open Connect instances that are specifically meant to be used for saturation testing, with the rest of the load spilling back to EC2?


We have servers in IX locations where there is a lot of traffic. It's not my area of expertise (being a kernel hacker, not a network architect), but our CDN load does not spill back to EC2.

The biggest impact I have to QoE is when I crash a box, but clients are architected to be resilient against that.


Thanks, it's an interesting set of tradeoffs.


Does FreeBSD `sendfile` avoid a context switch from userspace to kernelspace as well, or is it only zero-copy? I've worked with 100Gbps NICs and had to end up using both a userspace network stack and a userspace storage driver on Linux to avoid the context switch and ensure zero-copy.

Also, have you looked into offloading more of the processing to an FPGA card instead?


There is no context switch; like most system calls, sendfile runs in the context of the thread making the syscall.

FreeBSD has "async sendfile", which means that it does not block waiting for the data to be read from disk. Rather, the pages that have been allocated to hold the data are staged in the socket buffer and attached to mbufs marked "not ready". When the data arrives, the disk interrupt thread makes a callback which marks the mbufs "ready", and pokes the TCP stack to tell them they are ready to send.

This avoids the need to have many threads parked, waiting on disk io to complete.
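
For context, this is roughly what the call looks like from userspace: a minimal sketch using FreeBSD's sendfile(2) signature (the helper name and the error handling are mine, and SF_NOCACHE is the flag mentioned later in this thread):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Minimal sketch: queue `len` bytes of an open file onto a connected
     * TCP socket. With FreeBSD's async sendfile, the pages are staged in
     * the socket buffer as "not ready" mbufs and the call returns without
     * blocking on the disk read. */
    static int send_chunk(int filefd, int sock, off_t off, size_t len)
    {
        off_t sent = 0;

        /* SF_NOCACHE asks the kernel to drop the pages from the page
         * cache once they have been sent (used for unpopular content). */
        if (sendfile(filefd, sock, off, len, NULL, &sent, SF_NOCACHE) == -1)
            return -1;
        return 0;
    }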


To be clear, was FreeBSD used because of historical reasons or because similar performance can't be/harder to achieve on Linux?


I mean, most CDNs and FAANGs run on Linux. I think in that case it's kTLS that makes a big difference; the rest, not so much.


Async sendfile is also an advantage for FreeBSD; it is specific to FreeBSD. It allows an nginx worker to send from a file that's cold on disk without blocking, and without resorting to thread pools with a thread per file.

The gist is that the sendfile() call stages the pages waiting to be read in the socket buffer, and marks the mbufs with M_NOTREADY (so they cannot be sent by TCP). When the disk read completes, a sendfile callback happens in the context of the disk ithread. This clears the M_NOTREADY flag and tells TCP they are ready to be sent. See https://www.nginx.com/blog/nginx-and-netflix-contribute-new-...


Is sendfile() with splice and io_uring similar? I know that this is very experimental on Linux.

The overall idea is to copy bytes from disk to the socket with almost no allocation and without blocking; that's the idea, right?


Maybe. A few things in io_uring are implemented by letting a kernel task/thread block on doing the actual work. Which calls those are seems to change in every version, and might be tricky to find out without reading the kernel code.


I'd imagine scaling, licensing and overhead all had something to do with it, too.


Interesting. Are there any benchmarks that you would recommend to look at, regarding FreeBSD vs Linux networking performance?


For anyone interested, here are benchmarks from late 2018 comparing Fedora and FreeBSD performance: https://matteocroce.medium.com/linux-and-freebsd-networking-...


Why does the author put so much effort into testing VMs? Bare-metal installations aren't even tried, so the article won't represent a more typical setup (unless you want to run in the cloud, in which case it would make sense to test in the cloud).


If given the choice I'd never run anything on bare metal again. Let's say we have some service we want to run on a bare-metal server. For not very much more hardware money, amortized, I can set up two or three VMs to run the same service, duplicated. Then if any subset of that metal goes bad, a replica/duplicate is already ready to go. There's no network overhead, etc.

I've been doing this for stuff like SMTPE authority servers and ntpd and things that absolutely cannot go down, for over a decade.


That doesn't really matter for benchmarking purposes. I think the parent comment was emphasizing that syscall benchmarks don't make sense when you're running through a hypervisor, since you're running subtly different instructions than would be run on a bare-metal or provisioned server.


Because the cards were in PCI passthrough, so the performance was exactly the same as a physical system.


The author mentions as much, saying that there was some indirection, at least in the interrupts.

There are also VirtIO drivers involved, and according to the article, they had an effect too.


It’s probably worth noting that there have been huge scalability improvements - including introduction of epochs (+/- RCU) - in FreeBSD over the last few years, for both networking and VFS.


Nginx and OpenSSL are open source. Give it a try and reproduce their results with Linux ;-).


IMO the question was reasonable, whereas the answers like yours have always sounded to me like "fuck you."


It was done and tested more than once. As I recall, it took quite a bit to get Linux to perform at the level BSD was performing at, given (a) this use case and (b) the years of investment Netflix had already put into the BSD systems.

So, could Linux be tweaked and made as performant for _this_ use case? I expect so. The question to be answered is _why_.


sendfile + kTLS. I'm unaware of an in-kernel TLS implementation for Linux. Is there one?


Yes, Linux has kTLS. When I tried to use it, it was horribly broken, so my fear is that it's not well used/tested, but it exists. Mellanox, for example, developed their inline hardware kTLS offload on Linux.
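
For anyone curious what the Linux side looks like, the software kTLS TX path is enabled roughly like this (a sketch based on the kernel's TLS documentation; the key/IV/sequence material comes out of the userspace TLS handshake, and the SOL_TLS/TCP_ULP constants may need to be defined by hand on older libcs):

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <string.h>

    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif

    /* After the TLS handshake finishes in userspace, hand the negotiated
     * AES-128-GCM state to the kernel so plain write()/sendfile() on the
     * socket produce TLS records. Sketch for TLS 1.2 TX only. */
    static int enable_ktls_tx(int sock,
                              const unsigned char *key, const unsigned char *iv,
                              const unsigned char *salt, const unsigned char *rec_seq)
    {
        struct tls12_crypto_info_aes_gcm_128 ci;

        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* Attach the kernel TLS ULP to the TCP socket, then install TX state. */
        if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) == -1)
            return -1;
        if (setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci)) == -1)
            return -1;
        return 0;
    }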


> When I tried to use it, it was horribly broken, so my fear is that it's not well used/tested, but it exists.

Do you have any additional references around this? I'm aware that most rarely used functionality is often broken, and I therefore usually don't recommend people use it, but I would like to learn about kTLS in particular. I think OpenSSL 3 has now added support for it in userspace on Linux. But there are also the kernel components as well as drivers - all of them could have their own set of issues.


I recall that simple transmits from offset 0..N in a file worked. But range requests of the form N..N+2MB led to corrupt data. It's been 2+ years, and I heard it was later fixed in Linux.


I've used sendfile + kTLS on Linux for a similar use case. It worked fine from the start, was broken in two (?) kernel releases for some use cases, and now works fine again from what I can tell. This is software kTLS, though; I haven't tried hardware (not least because software easily saturates 40 Gbit/sec, and I just don't have that level of traffic).


I once recommended a switch and router upgrade to allow for more, new WAPs for an office that was increasingly becoming dependent on laptops and video conferencing. I went with brand new kit, like just released earlier in the year because I'd heard good things about the traffic shaping, etc.

Well, the printers wouldn't pair with the new APs, certain laptops with fruit logos would intermittently drop connection, and so on.

I probably will never use that brand again, even though they escalated and promised patches quickly - within 6 hours they had found the issue and were working on fixing it - but the damage to my reputation was already done.

Since then I've always demanded to be able to test any new idea/kit/service for at least a week or two just to see if I can break it.


Interesting. I had imagined the range handling is purely handled by the reading side of things, and wouldn't care how the sink is implemented (kTLS, TLS, a pipe, etc). So I assumed the offset should be invisible for kTLS, which just sees a stream of data as usual.



1. Could GPU acceleration help at all?

2. When serving video, do you use floating point operations at all? Could this workload run on a hypothetical CPU with no floating point units?

3. How many of these hardware platforms do you guys own? 10k?100k?


1) No. Well, potentially as a crypto accelerator, but QAT and Chelsio T6 are less power hungry and more available. GPUs are so expensive/unavailable now that leveraging them in creative ways makes less sense than just using a NIC like the CX6-Dx, which has crypto as a low cost feature.

2) These are just static files, all encoding is done before it hits the CDN.


I wonder if one creative (and probably stupid) way to leverage GPUs might be just as an additional RAM buffer to get more RAM bandwidth.

Rather than DMA from storage to system RAM, and then from system RAM to the NIC, you could conceivably DMA to GPU RAM and then to the NIC for a subset of sends. Not all of the sends, cause of PCIe bandwidth limits. OTOH, DDR5 is coming soon and is supposed to bring double the bandwidth and double the fun.


The videos are precomputed. So no GPU required to stream


How exactly does 'Constrained to use 1 IP address per host' help eliminate cross-NUMA transfers?


If we could use more than 1 IP, then we could treat 1 400Gb box as 4 100Gb boxes. That could lead to the "perfect case" every time, since connections would always stay local to the numa node where content is present.


I wonder if you could steer clients away from connections where the NIC and storage nodes are mismatched.

Something like close the connection after N requests/N minutes if the nodes are mismatched, but leave it open indefinitely if they match.

There's of course a lot of ways for that to not be very helpful. You'd still have only a 25% chance of getting a port number that hashes to the right node the next time (assuming the TCP port number is involved at all; if it's just src and dest IPs, then client connections from the same IP would always hash the same, and that's probably a big portion of your clients), and if establishing connections is expensive enough (or clients aren't good at it) then that's a negative. Also, if a stream's files don't tend to stay on the same node, then churning connections to get to the right node doesn't help if the next segment is on a different node. I'm sure there are other scenarios too.

I know some other CDN appliance setups do use multiple IPs, so you probably could get more, but it would add administrative stress.


Is there a reason you can’t?

In a past life we broke LAGs up to use different subnets per port to prevent traffic crossing the NUMA bridge.

I'm sure there are good reasons you didn't take this approach; it'd be interesting to hear them.


There is no kTLS for IPv6? IPv6 space is abundant and most mobiles in USA/Canada have IPv6. Won't that solve the problem?


IPv4 and IPv6 can both use kTLS. We offer service via V6, but most clients connect via IPv4. It differs by region, and even time of day, but IPv4 is still the vast majority of traffic.


I've had to blackhole Netflix IPv6 ranges on my router because Netflix would identify my IPv6 connection as being a "VPN" even though it's not.


If you could email me your ranges, I can try to look into it internally. Use my last name at gmail.com (or at freebsd.org)


I'm not GP; Hurricane Electric IPv6 always had Netflix thinking I was on a VPN, but now I have real IPv6 through the same ISP. I just pay more money, so Netflix doesn't complain anymore.


Are you referring to HE's tunnel broker service?

If so, then yeah, that's a VPN.


> Mellanox ConnectX-6 Dx - Support for NIC kTLS offload

Wild, didn't know nVidia was side-eyeing such far-apart but still parallel channels for their ?GPUs?.

Was this all achievable using nVidia's APIs out-of-the-box, or did the firmware/driver require some in-house engineering :)


Mellanox was bought by nVidia 2 years ago, so while it's technically accurate to say it's an nVidia card, that elides their history. Mellanox has been selling networking cards to the supercomputing market since 1999. Netflix absolutely had to do some tuning of various counters/queues/other settings in order to optimize for their workload and get the level of performance they're reporting here, but Mellanox sells NICs with firmware/drivers that work out-of-the-box.


The architecture slides don't show any in-memory read caching of data? I guess there is at least some, but would it be at the disk side or the NIC side? I guess sendfile without direct IO would read from a cache.


Caching is left off for simplicity.

We keep track of popular titles, and try to cache them in RAM, using the normal page cache LRU mechanism. Other titles are marked with SF_NOCACHE and are discarded from RAM ASAP.


How much data ends up being served from RAM? I had the impression that it was negligible and that the page cache was mostly used for file metadata and infrequently accessed data.


It depends. Normally about 10-ish percent. I've seen well over that in the past for super popular titles on their release date.


in which node would that page cache be allocated? In the one where the disk is attached, or where the data is used? Or is this more or less undefined or up to the OS?


This is gone over in the talk. We allocate the page locally to where the data is used. The idea is that we'd prefer the NVME drive to eat any latency for the NUMA bus transfer, and not have the CPU (SW TLS) or NIC (inline HW TLS) stall waiting for a transfer.


This may be a naive question, but data is sent at 400Gb/s to the NIC, right? If so, is it fair to assume that data is actually sent/received at a similar rate?

I ask since I was curious why you guys opted not to bypass sendfile(2). I suppose it wouldn't matter in the event that the client is some viewer, as opposed to another internal machine.


We actually try really, really hard not to blast 400Gb/s at a single client. The 400Gb/s is in aggregate.

Our transport team is working on packet pacing, or really packet spreading, so that any bursts we send are small enough to avoid being dropped by the client, or an intermediary (cable modem, router, etc).


Have you done any work to see whether the NIC hardware packet pacing mechanisms could improve QoE by reducing bursts?


Well it's not 1 client. It's thousands of viewers and streams. An individual stream will have whatever the maximum 4k bandwidth for Netflix is.


What architectures are you guys running FreeBSD on?

Would these techniques be applicable on arm64 and/or riscv64?


Slide 5: What is the difference between "mem bw" and "networking units"?


Networking tends to use bits per second instead of bytes per second, so in order to more easily compare the memory bandwidth to the rest of the values used in the presentation, the presenter multiplied the B/s values by 8 to get the corresponding b/s values.


Oh, networking units use "bits" instead of "bytes".



