The CPU Cost of Networking on a Linux Host

ra1n85 · on May 15, 2020

For many years, the most popular routing platforms (i.e., boxes built by Cisco) performed IP packet forwarding and management functions on the same processor (often a RISC architecture). In cases where packet rate was high, it was possible for devices to become unresponsive or lose critical protocols responsible for sharing routing information.

In the last 15 years there has been a hard move away from these architectures. Almost no packets are forwarded by the same processors running management and control plane functions anymore. This is mainly because the required traffic rates today need dedicated silicon purpose built for the task (the Broadcom Tomahawk3 can do 12.8 Terabits/sec above a relatively small packet size).

I don't know how things will shake out for the Linux world and x86 packet forwarding given the trend and lack of real performance in the kernel. Right now, your best bet when it comes to Linux and high network throughput/packet processing requirements is to just bypass the kernel entirely with DPDK, a "smart" NIC, or XDP.

rstuart4133 · on May 16, 2020

> I don't know how things will shake out for the Linux world and x86 packet forwarding given the trend and lack of real performance in the kernel.

That's pretty much settled isn't it? It was settled in the same way the "it got too much for the general purpose CPU" problem always gets settled - you do it in custom silicon, and define a standardised interface.

That's pretty much what happened with 3D graphics where the standardised interface was OpenGL (but now seems to be up in the air). It's also pretty much what's happened in AI with standardised libraries interfacing to custom hardware, and it's what happened with networking.

For networking, the interface standard is OpenFlow. So, if you think it's possible you will need to handle links of about 1Gbs or over in the future, you do your networking using an OpenFlow implementation like Faucet. If it's not much above 100Mb/s the Linux kernel module that implements OpenFlow, called openvswitch, will be fine. Otherwise use some custom hardware.

Openvswitch has been around since 2009 - so it's not exactly a new thing.

jes · on May 16, 2020

I was intrigued by your comment about the Tomahawk3. Turns out there is a Tomahawk4 implemented in 7nm which has twice the performance. Thank you for tipping me off to these devices:

https://www.broadcom.com/company/news/product-releases/52756

ra1n85 · on May 16, 2020

Yep, though it’s not commercially available quite yet. Likely only shipping samples to switch manufacturers at this time.

diegocg · on May 15, 2020

These is the most up-to-date article I've seen on these matters https://lwn.net/Articles/787754/ https://lwn.net/Articles/629155/

EvanAnderson · on May 15, 2020

I wonder if there's an economic argument, at useful scales, to using FPGA's in general purpose servers to accelerate network performance. The purpose-built ASICs would win on cost-per-unit every time, but the FPGA would have some adaptability to new protocols or algorithms that the ASIC wouldn't.

wmf · on May 15, 2020

Yes, Azure uses FPGAs for networking. Programmable NPUs like Netronome or Pensando are a middle ground that are more cost-efficient than FPGAs for most needs.

karatekidd32v · on May 16, 2020

There's some cool tech going on at barefoot (https://barefootnetworks.com/)

I don't know much about the technical details, but the pitch I've heard is that it gives you ASIC level performance with more flexibility to reprogram the chip (not full FPGA).

ra1n85 · on May 16, 2020

Yeah, it's an interesting approach. They're basically allowing you to define packet processing with P4 on their Tofino family of chipsets: https://p4.org/

That said, there's only so much you can do in a chip before considerable tradeoffs are going to be made. They're not going to offer the same level of flexibility you get out of a general purpose CPU, but may not have same the restrictions of most fixed pipeline chips - their product sits somewhere in the middle. Also, P4 seems to sit in a space complex enough to make it unreasonable for most network shops - it's not for your average enterprise or service provider network.

yummypaint · on May 15, 2020

Ive been waiting for mainstream workstation motherboards with this capability for years. Presumably a pci card is how this would be handled in practice for now. My naive take is that toolchains are still too convoluted and bogged down with licensing schemes to deliver the kind of real-time highly integrated adaptability this would require for individual users. Would be great if the barrier to entry has in fact dropped enough that university-sized networks could consider them.

bogomipz · on May 15, 2020

>"In the last 15 years there has been a hard move away from these architectures."

It's been longer than that for router vendors. The move to hardware based forwarding was in the 90s.

tw04 · on May 15, 2020

That's not entirely accurate - Cisco open sourced VPP which they use in many products:

https://blogs.cisco.com/sp/a-bigger-helping-of-internet-plea...

https://docs.fd.io/vpp/16.09/

While they still obviously make custom ASICs, they've moved towards more software where possible.

Hikikomori · on May 15, 2020

Pretty much everything above their small medium business ISR routers use ASIC/NPU for all traffic possible. Which routers are you referring to?

ra1n85 · on May 15, 2020

Sorry, what's not accurate here?

gnufx · on May 16, 2020

Are those kernel bypasses better than the Linux-based RDMA we've had for ages (on appropriate fabrics)?

wmf · on May 16, 2020

One problem with RDMA is that it's not compatible with existing protocols.

gnufx · on May 17, 2020

It's existed for a fair while as Infiniband doing Linux by-pass; the comment didn't say TCP/IP. (The general technique pre-dates Linux, of course.)

dahfizz · on May 15, 2020

This is one of the reasons I have a strong aversion to "cloud" technologies like docker and kubernetes. You take networking, something with decades of development and hardware support, and you put it all in the CPU.

To be clear, Linux has a very robust networking stack. But it will never come close to the natting and routing performance of an actual router.

And so we develop things like DPDK to spend even more CPU just to keep things usable, but it still feels like a big step backwards.

A typical k8s deployment runs in containers that are in VMs. So each packet you want to send to or from a container needs to touch a cpu and traverse a networking stack six times. That's dumb.

dan_quixote · on May 15, 2020

> That's dumb.

Is it? It's certainly inefficient compared to dedicated hardware. But so is anything relying on a CPU - we could just use ASICs for everything. But then every logical change requires weeks/months/years of development and manufacturing.

The goal of k8s, VMs, etc is flexibility. I can set up a 100-node k8s cluster with less-than-perfectly-efficient networking stack in mere minutes. Good luck matching that with dedicated hardware.

dahfizz · on May 15, 2020

> But so is anything relying on a CPU - we could just use ASICs for everything. But then every logical change requires weeks/months/years of development and manufacturing.

Right, except the asics already exist and you're actively choosing the less efficient, more expensive option.

> The goal of k8s, VMs, etc is flexibility.

I don't think this is necessarily bad, as long as you understand the tradeoff. You're choosing a fundamentally slower architecture to make management easier. It's a choice of prioritizing the developer experience over the user experience.

oneplane · on May 15, 2020

In this case it's not even user experience vs. developer experience as you can have both; it's just the cost for performance gets higher as the efficiency decreases.

On the other hand: the cost for development goes down if you don't need to pay for extra steps taken by an extra person. If you take a 10-step process that is run by 5 people and reduce it to a 5-step process run by 3, you have 2 more people to do other stuff, or roles that you don't have to create/fill in the first place.

CountSessine · on May 16, 2020

Right, except the asics already exist and you're actively choosing the less efficient, more expensive option.

It’s hard to imagine any dev task being more expensive than organizing and programming tables into a trident or tomahawk chip using the broadcom’s SDK.

gnufx · on May 16, 2020

Experimentally, you don't need luck to set up ~200 dedicated hardware nodes in minutes in an HPC cluster, for instance. (They do have to be connected up, with known or discoverable MAC addresses.)

supahfly_remix · on May 15, 2020

For vCPUs on the same physical multicore CPU, packets are sent by shared memory constructs which is very fast. It's effectively sending a pointer to the memory. I have seen transfers like this up to 80 Gbps, much faster than the physical NIC.

dahfizz · on May 15, 2020

The transitions from host -> VM is not slow, it's the fact that the packet now has to touch the VM's CPU and go through it's whole routing and natting tables again. And then again for the container.

The transition from physical host memory -> vnic is analogous to just sending the packet over a wire. That's not the slow part of modern networks.

wmf · on May 15, 2020

This is partly true and partly not. All the physical routing and switching is still there and it's more efficient than ever. What we've done is added more functionality. Additional layers of abstraction like VMs or containers are generally going to have some performance cost.

It's possible to optimize basic packet forwarding in a hypervisor/containervisor using leaner options like ipvlan or even SR-IOV. If you replace hardware firewalls with something like k8s network policy then you are technically replacing hardware with software but it isn't that slow if configured right (hello Cilium) and you're probably implementing a more sophisticated policy anyway.

Klinky · on May 15, 2020

Do most scenarios require that pedal-to-the-metal level of performance? A lot of applications are CPU-bound either within the application server or a dependent service(database). Your payloads/throughput may actually be rather low, and the inefficiency may be a tiny percentage of CPU usage compared to the application itself.

If you need nonstop CDN levels of bandwidth, then you're probably going to go with a setup that has fewer layers, but CRUD applications don't exactly need terabits of bandwidth.

dahfizz · on May 16, 2020

As the article explains, it's not all about throughput or latency (though those do suffer), it's also about how much CPU time you spend because of the network architecture. If you're running a cpu bound app, all the more reason to avoid sending 6+ interupts per packet.

onei · on May 15, 2020

How did you get to six? Is that just sending a single packet without dealing with responses? Never thought about this before so I'm curious.

dahfizz · on May 15, 2020

Source container -> VM -> host -> host -> VM -> destination container

otterley · on May 16, 2020

This hasn't been the case for years. See, e.g., SR-IOV, which allows modern PCIe network adapters to completely bypass a host virtual switch and present itself directly to a VM.

https://www.intel.com/content/dam/doc/application-note/pci-s... is a good overview.

Also, at least on AWS, you can directly attach VPC network interfaces to containers, which (among other things) obviates the need for software bridging, veth pairs, etc.

Matthias247 · on May 15, 2020

Correct me if I'm wrong, but I don't think the container transition introduces any additional copies and hops. Containers are not VMs. They just use isolation that the kernel enforces. Whether you send a packet from inside or outside a container to the network should be exactly the same amount of work unless you opt-in to run some extra software-level NAT step.

shanemhansen · on May 15, 2020

You're wrong and right, because "container" is ill-defined. If you run your container in the host network namespace, then there's zero overhead. It's more common to have your container's "nic" be one end of a veth pair. That's not free.

Finally many platforms like docker, in the past, will spin up userspace proxies (docker-proxy) which is even more expensive.

dahfizz · on May 16, 2020

You are correct that containers can be implemented that way, but docker containers aren't. They use NATing / bridging / veth pairs, which is much more usable but also much slower.

deeblering4 · on May 15, 2020

This is one of the benefits to running containers on bare metal.

sleepydog · on May 15, 2020

It's not as bad as it sounds as a few of those transitions can be implemented via zero-copy APIs, so you're really just passing around pointers to shared memory.

dahfizz · on May 15, 2020

The transitions happen efficiently, but the packet still has to spend cpu time traversing the network stack at all 6 hops. So the packet moves all the way through the routing / natting tables of the host, and then very quickly ends up in the VM. Then has to traverse the whole routing / natting tables of the VM.....

toast0 · on May 15, 2020

If there's NAT involved (as was posited in the start of the thread), then you can't just copy, you also need to modify.

nickpsecurity · on May 15, 2020

Given clouds are like mainframes reinvented, you might like reading on how the mainframes handled I/O:

https://en.wikipedia.org/wiki/Channel_I/O

I always thought PC's could be setup to do something similar. The cores used for that could even be small and simple compared to main cores. Folks in early 2001 could be Bittorrenting their Linux distros or whatever they use it for with no user-visible lag. I'd like at least one each for user-input devices, graphics, storage, and networking. Other good reasons to isolate them a bit from each other.

There's already ARM SoC's in embedded that have a good core for apps and weak one for I/O. Recently, the RISC-V chip with the minion cores. If embedded can do it, then there seems to be no technical limitation holding back desktops. Just marketing, backward compatibility, etc.

heinrichhartman · on May 15, 2020

Interesting. I always suspected something like this was going on. Virtual networking / "SDN" just sounded slow to me.

Do you know a good source, to learn about the lower-level Linux/container/docker/k8s networking?

cheeseprocedure · on May 16, 2020

packagecloud's in-depth writeups on the Linux networking stack are a great read:

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tunin...

https://blog.packagecloud.io/eng/2017/02/06/monitoring-tunin...

The illustrated guide to receiving data is also solid:

https://blog.packagecloud.io/eng/2016/10/11/monitoring-tunin...

zamadatix · on May 15, 2020

I got the most value poking around with network namespaces manually via the "ip" command. There are also true VRFs in Linux these days which are much lighter but not nearly as common, most everything is a namespace with either bridging or NAT and some fancy configuration interface to drive that state.

neurostimulant · on May 16, 2020

I think I read somewhere a while ago that kubernetes cluster will start experiencing big slowdown at around 500 pods due to overhead of its internal networking components. Not sure if it's still true now. I'm going to deploy a cluster that might contains more than 500 pods soon so I'm not really eager to find out the hard way.

Aaronstotle · on May 15, 2020

I'm curious, what makes the natting/routing performance much better on a router? Is it that the router is an ASIC designed only to process packets vs a cpu that can execute any code?

rayiner · on May 16, 2020

That's not really the whole explanation. There are a couple of things in play. First, while Linux can perform routing, it's mainly intended to be an end host. So for example software like VPP that is designed with the assumption that it can use polling and burn up entire CPU cores to optimize packet forwarding can perform 10x faster on the same hardware.

Second, router ASICs can be designed to sped transistors on a predictable packet flow path, instead of intelligence like out-of-order execution to make general purpose code fast. For example, routers have big expensive content-addressable memories called TCAMs that are used to store things like routing tables and ACLs. The ASIC, moreover, can implement a highly tuned pipeline designed around the latencies in the underlying memories. E.g. you get a packet, grab the destination address, look up the next hop in the routing table, etc. Each step takes a predictable amount of time that you can account for and optimize.

Third, parallelism is much cheaper in hardware than in general purpose CPUs. It takes a lot more transistors to be able to execute a second general-purpose instruction stream than to have a single-purpose circuit that does some work in parallel with something else.

dahfizz · on May 15, 2020

> Is it that the router is an ASIC designed only to process packets vs a cpu that can execute any code?

Yes. Most routers + switches are "line rate" meaning the packets go through the switch at the speed of electricity, as if the switch wasn't there are all.

dsr_ · on May 15, 2020

In particular, a modern high-end router has a CPU (which might be an x86 of some kind) which handles administrative and auxiliary tasks (NTP, ssh, DHCP reflection, whatever you want) while making one or more dedicated ASICs handle packets. When you manage the router, you're actually talking to the supervisory CPU.

marsdepinski · on May 16, 2020

Good short article. CPU power saving has an effect on interrupt handling and becomes a factor at high packet rates. Turning off p-states will improve performance.

z3t4 · on May 15, 2020

What happens when you "netperf TCP_STREAM from 2 sources"?