Why we use the Linux kernel's TCP stack (cloudflare.com)
254 points by majke on July 11, 2016 | 51 comments



I implemented a highly scalable user-space TCP stack as part of my Master's thesis [1] last year.

One doesn't use a user-space network stack because the Linux network stack is slow (it's fast), but because it doesn't scale well to a high number of CPUs (more than 8 cores) [2]. This is because the kernel suffers from lock contention when accessing the table containing the socket descriptors.

A user-space stack can be significantly faster when the application layer is really simple (e.g. doing some very simple filtering or routing) and when it does not share any mutable state. As soon as your application layer starts sharing mutable state between connections (e.g. like a database), you'll start having contention issues similar to those experienced by the kernel, and you won't gain anything from using a user-space stack. Very often, applications that could benefit from such a stack can also be scaled easily across multiple machines, and it's usually easier to keep using the Linux stack and add more servers.

--

[1] https://github.com/RaphaelJ/rusty

[2] https://github.com/RaphaelJ/rusty/blob/master/doc/img/perfor...


> This is because the kernel suffers from some lock contention when accessing the table containing the socket descriptors.

This would indicate the problem is with packet delivery to the application. From my experience, even packet delivery to the "filter" iptables chain is "slow".

But let's assume you are right: can you elaborate? Do you think SO_REUSEPORT on TCP sockets can solve the contention of accept()?

https://lwn.net/Articles/542629/

There are some initiatives to improve SO_REUSEPORT CPU affinity, hopefully making it even faster.
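For reference, the pattern that article describes boils down to each worker opening its own listening socket with SO_REUSEPORT set, so the kernel spreads incoming connections across the per-worker accept queues instead of every worker contending on one. A minimal sketch in C (error handling trimmed, port number made up):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Each worker process runs this: its own listener on the same port,
     * made possible by SO_REUSEPORT (Linux >= 3.9). The kernel then
     * load-balances new connections across the per-worker accept queues. */
    static int make_listener(uint16_t port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);
        return fd;
    }

    int main(void) {
        int fd = make_listener(8080);          /* hypothetical port */
        for (;;) {
            int c = accept(fd, NULL, NULL);    /* no shared accept queue */
            /* ... handle the connection ... */
            close(c);
        }
    }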

Update: I misread. Ok, so "table containing the sockets", but this is just a large hash table, nothing too fancy... aRFS for greater locality?


A single socket can be shared by multiple cores. That means the kernel must both protect the socket descriptor from concurrent writes and cannot enforce that a TCP connection is handled by one particular core (CPU affinity).


> A single socket can be shared by multiple cores

Absolutely, it can. But everybody sane avoids that, pinning worker processes to specific CPUs and not sharing sockets between them. The rule of thumb is that spinlocks on the hot path of socket access become a bunch of no-ops if there is no lock contention.
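For what it's worth, the pinning half of that is only a couple of syscalls. A rough sketch of pinning the current worker process to one core (the core number is a hypothetical choice you'd make per worker):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to a single CPU so its sockets are only
     * ever touched from that core (pair this with one listener per
     * worker, e.g. via SO_REUSEPORT as discussed elsewhere in the thread). */
    static int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this process */
    }

    int main(void) {
        if (pin_to_cpu(2) != 0) {   /* hypothetical core number */
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run this worker's accept/event loop here ... */
        return 0;
    }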


I can't see why pinning is such an obvious choice. The kernel's scheduler may decide that CPU 1 should be woken to handle some new packets, because CPU 2 is busy. If you've pinned the socket to CPU 2, you may be losing out.

I get that there are trade-offs between the two modes: pinning can provide better cache usage, you can avoid some locks (but indirectly make the kernel do the work for you) and so on, but I don't see how you can confidently state that pinning is the 'sane' choice.

In an ideal world, the kernel has a better overview of the network state and CPU state, and therefore is best positioned to decide which CPU should handle each packet.


The kernel scheduler has no idea which application thread handles which socket. You can formulate an application-level plan and then enforce your will with socket and CPU pinning.


Correct, but the difficulty comes if your application must share the machine with any other application, even short-lived ones. That, I think, is what joosters was alluding to. If the assumption that your application is the only consumer of system resources is broken, then you may see pathological scheduling behavior.


You can set isolcpus to earmark some cores for your application, and let the kernel manage the rest.
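If I remember the mechanics correctly, that looks roughly like this: reserve the cores on the kernel command line and then explicitly place your workers on them, since the scheduler will no longer put ordinary tasks there on its own (core numbers and binary name are made up):

    # kernel command line (in the bootloader config): reserve cores 2-7
    isolcpus=2-7

    # after boot, pin the application onto the reserved cores
    taskset -c 2 ./worker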


Tangential, but what are some good resources for understanding the limitations of Linux on larger systems (8-socket, multi-TB, multi-10gig, NUMA, etc.)? Over the years I've found that the trivial questions have saturated the internet and made it very hard to GoogleShoot complicated problems.


Cavium allows you to run multiple instances of the Linux kernel, one pinned to each core of their network processors (you get a window of shared memory between them). It would be interesting to try this on Intel.


Physical limitations of bandwidth on packets per second: https://www.cisco.com/c/en/us/about/security-center/network-...

Why the Linux kernel has a hard time processing more than 1-2M packets per core per second, and patches/improvements for the kernel: https://lwn.net/Articles/629155/

CloudFlare's kernel bypass blog post: https://blog.cloudflare.com/kernel-bypass/

A paper from NTop on doing 10G line rate packet processing, the limitations, and then-current options: http://luca.ntop.org/10g.pdf

NetOptimizer kernel dev blog, where they've maxed out the throughput of a 10G link using the kernel's stack, and details on latency, theoretical maximums and how to test: https://netoptimizer.blogspot.com/search/label/10G

--

The real answer to "Why do we use the Linux kernel's TCP stack?" is that operating systems are designed to help users and programs. They are not designed to be a custom tailored highest-performance cure-all for the highest possible theoretical computing throughput. Using one tcp stack helps users and programs more than each user or program using its own unique stack.


| The real answer ... is that operating systems are designed to help users and programs. They are not designed to be a custom tailored highest-performance cure-all ... Using one tcp stack helps users and programs more...

I don't use that definition of operating systems. I use the unix virtual machine because it's ubiquitous and provides the same large core set of defined functionality across many scales and platforms, and it is a particularly genius virtual machine; coding to other virtual machines is more difficult and terribly parochial, decreasing the usefulness of the work.

Much work is done at many layers to the internals of the unix virtual machine to increase its ubiquity, scope, scale, and performance. The more the internal design of the consistent virtual machine can support a custom tailored highest-performance cure-all for the highest possible theoretical computing throughput, the more ubiquitous and useful that virtual machine will be.


There is no such thing. (Was this an attempt at a troll?)


There is no such thing as what? I didn't posit the existence of something that does not exist.

I disagreed with your POV because I can hardly imagine a scenario where I would recommend dropping an OS like Linux. This article and discussion are about understanding what a particular performance problem in Linux is about, and (many eyes) perhaps people will suggest plausible solutions.

Your post is somewhat dismissive of the effort, to my ear, and it stems from your reductionist view of an OS as just some more software in addition to the software required for a project app.


There is no such thing as a "unix virtual machine". Unless, perhaps, you include Solaris Containers, since Solaris is technically a Unix operating system. I don't know what other "virtual machines" you discuss, nor what features or implementation details you're generalizing about, but your main assertion is nonsense and the rest is fluff.

As for your assertion that the easier it is to design a highest-performance TCP/IP stack, the more ubiquitous it will be: that is also wrong. Cisco and other vendors all have userspace plug-in frameworks for stacks that reach the highest packet-analysis-per-core performance in the industry, and they sure as shit aren't ubiquitous.

An OS is basically fancy glue to help programs work together to make the user's life easier. This has always been the case, because people are pretty universally annoyed by having to feed 1000 custom punch cards to a mainframe every time they want to run a program.


I'm using a different (and I believe more accurate) definition of virtual machine, and I'm using it because I think it offers more insight.

There are many possible chips you might have running inside your workstation, and none of them are the hardware they pretend to be, they are many varieties of microcoded superscalars that emulate the functionality of an amd64 architecture. They are virtual amd64s. On top of those you run OS binaries tuned for your hardware that present an API consistent with Linux (or Windows, depending on the software layers you run). If you had a hardware implementation of the Unix/Linux API, then that would be a Linux machine, but otherwise you are running a Linux virtual machine.

It's the Linux physical machine that does not exist. Linux virtual machines abound. (and many of them run on processors other than amd64s)

EDIT: I'm not offered a reply link to my repliers, so I'll edit answer in here instead.

I use that definition of virtual machine because it's the one true definition; to use the "naive" definition is to be wrong. I thought I was making that apparent in my description. There is no actual hardware amd64, there are only microcoded emulators of the architecture. So right off the bat, you see that if we code in assembly language, we are coding to a virtual machine, not an actual machine.

And just as you can think of mathematics as nothing more than the manipulation of symbols, when we write software, we are arranging symbols to code a virtual machine; in assembly language; in C; in Haskell; etc. Many (most?) C language implementations have symbolic references to the operating system. But inside that black box, we know is actually a virtual implementation on top of another virtual implementation.

An analogy would be, if you learn to drive an automobile, you learn to drive all of them because they use the same arrangement (more or less) of controls. They are different physical hardware manifestations (some gas, some electric, some diesel) of the same virtual machine (steering wheel, go pedal, stop pedal).

It was taught to me as a more useful definition, and I embraced it. I think using virtual machine to refer only to products from VMware is more problematic definitionally.

And in particular I used it in this thread to respond to the post at the very top that seemed to suggest that all software is just software as if we can just discard the operating system when it doesn't do what we want.

This is all what they teach at MIT in the computer science curriculum, BTW, nothing weird or cultish, or DOWNVOTEY about it.


What's the point of that definition?


I replied above because reply here wasn't an option after I got downvoted.


> They are not designed to be a custom tailored highest-performance cure-all for the highest possible theoretical computing throughput.

I beg to differ vehemently, as the FireEngine TCP/IP stack in illumos was designed to be the highest possible performance cure-all for the highest possible throughput. I've posted the links above in another entry. That GNU/Linux's TCP/IP stack is hitting its limits does not mean that nobody else is capable of designing a high-performance TCP/IP stack, and indeed, I have been able to max out a 1 Gbit connection running Solaris 10 on a measly Dell R910. If the network administrator hadn't come running to "turn it off! Turn the damn thing off!", I would have maxed out a trunked 40 Gbit link too. I sure taught that guy a lesson that day; never again have I heard a peep about "not ever seeing anyone being able to max out a 1 Gbit connection with a single machine".

Only Solaris / illumos' FireEngine makes it possible.


Help me understand this. How is maxing out a 1Gbps connection with a 4U, 4 socket (so presumably 24-32 core) Xeon server supposed to be impressive?


Considering I did it seven years ago, I think it's impressive in that context.


I didn't say anything was impossible. I said the Linux Kernel, and operating systems in general, have TCP/IP stacks not designed to facilitate the highest theoretical possible throughput.

And you mention "maxing out" a link. I bet you're talking about maximum throughput. That requires the maximum possible frame size. I, and the article, are talking about maximum frame rate, which requires the minimum possible frame size. The maximum rate results in about 18x more packets per second. (This is for a typical MTU of 1500; let's not get into jumbo frames...)
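For anyone who wants to check the ~18x figure, here's the back-of-the-envelope arithmetic, counting the 7-byte preamble, 1-byte SFD and 12-byte inter-frame gap as wire overhead on top of each frame:

    #include <stdio.h>

    int main(void) {
        /* On-wire size = Ethernet frame + 7 B preamble + 1 B SFD + 12 B IFG */
        const double link_bps  = 1e9;        /* 1 Gbit/s                     */
        const double min_frame = 64 + 20;    /*   84 B: minimum-size frames  */
        const double max_frame = 1518 + 20;  /* 1538 B: full 1500-byte MTU   */

        double pps_min_frames = link_bps / (min_frame * 8);  /* ~1.488 Mpps */
        double pps_max_frames = link_bps / (max_frame * 8);  /* ~81 kpps    */

        printf("min-size frames: %.0f pps\n", pps_min_frames);
        printf("max-size frames: %.0f pps\n", pps_max_frames);
        printf("ratio: ~%.1fx\n", pps_min_frames / pps_max_frames);  /* ~18.3 */
        return 0;
    }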

And you talk about 1Gbps. For a 10Gbps link, which is the most likely use case in the article, it results in 10x more packets per second, which requires a comparable scale in computing resources. But the stack is only designed to scale to the more likely use case of 1Gbps links; after 8 or so cores, performance can drop off precipitously.

So to hit the maximum packets per second of a 10Gbps link, you would need to handle 180x more packets per second than the max throughput of a 1Gbps link. Most stacks are not designed or performance tested with that in mind.


Solaris 10 can drive a 10Gb link at 7.3Gbps (limited by PCI-X bandwidth) using 2x2.2Ghz opteron CPUs utilized at less than 50%

https://sunaytripathi.wordpress.com/2010/03/25/solaris-10-ne..., page nine.

Now PCIe is multiple times faster than the old PCI-X, and in addition to being several times faster than the old Opteron 939 and 940 series, modern Intel-based systems have 80 CPUs or more.


You are still quoting throughput when I'm quoting frame/packet rate. On top of that, the line you have in italics doesn't show up on that page (it shows up on two random websites that provide no detail as to that claim), I have no idea what "page nine" on my resolution monitor is compared to yours, and you're trying to suggest something about the PCI bus being a factor (which it isn't; the limiting factor is cycles per packet). So I don't think you understand what's going on.


My apologies, wrong link. This is hard to do on a mobile telephone.

http://www.baylisa.org/library/slides/2005/august2005.pdf, page nine, and in more detail further in the document.


Let's call that 3.3 Gbps/GHz (7.3 Gbps over roughly 2.2 GHz of effective CPU, given the sub-50% utilization of two 2.2 GHz cores). Linux can now drive 25 Gbps with a single 4 GHz core, so that's over 6 Gbps/GHz. Given the number of years involved, it's hard to tell how much improvement is due to hardware and how much to software. But I'm not blown away.


>With this scale of attack the Linux kernel is not enough for us. We must work around it...we added a partial kernel bypass feature to Netmap: that's described in this blog post. With this technique we can offload our anti-DDoS iptables to a very fast userspace process.

What do they mean by "very fast userspace process"? If you're doing the same thing the kernel would be doing, the userspace process should be strictly slower due to context switches. What costs are they saving on here?


The "very fast userspace process" has direct access to hardware NIC RX queue and is doing busy polling. It uses 100% CPU all the time. The process is faster then kernel because:

- it does less, since it implements only a subset of iptables

- is single threaded, no locks

- doesn't implement TCP

- it's small, no iTLB misses

- the working set size is small; we only deal with a couple of packets at a time

- no memory allocations on the hot path (no skb)

- it does busy polling, saving the X µs needed for an interrupt context switch
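For illustration only (this is a sketch of the general netmap receive pattern, not CloudFlare's actual code; the interface name is made up), such a busy-polling RX loop looks roughly like this:

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <sys/ioctl.h>
    #include <stdio.h>

    int main(void) {
        /* Attach directly to the NIC's RX rings; "netmap:eth0" is a
         * made-up interface name. Requires the netmap kernel module. */
        struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
        if (d == NULL) {
            perror("nm_open");
            return 1;
        }

        struct nm_pkthdr hdr;
        unsigned char *buf;

        for (;;) {                              /* busy loop: 100% of one core */
            ioctl(d->fd, NIOCRXSYNC, NULL);     /* sync RX rings with the NIC  */
            while ((buf = nm_nextpkt(d, &hdr)) != NULL) {
                /* parse headers in buf (hdr.len bytes), apply drop rules ...  */
            }
        }

        nm_close(d);   /* not reached */
        return 0;
    }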


It seems like the main thing making that process fast is the fact that it's doing polling. But Linux does polling too, when there are enough packets coming in. Likewise, if you really are dropping packets very early in the networking stack, you wouldn't reach the TCP layer and would have a small working set in the kernel too.

So, why not just implement this very fast userspace process as a kernel patch? That seems much easier...


> If you're doing the same thing the kernel would be doing, the userspace process should be strictly slower due to context switches.

Actually, you cut down on context switches because you no longer ship packets between kernel and user space. This allows you to do all kinds of intelligent filtering and other header processing that can vastly improve latency in gateway systems (proxies, routers, VM hosts, etc.)


Saving an interrupt per packet. It's far fewer context switches to process packets in a busy loop.


The argument about not being able to run SSH on the server seemed a bit weak; surely you could just stick two NICs in there, one for management and one for the user-space stuff?


I think that's less an argument, and more of an example. You are correct, that is certainly something one can do. But I think the related argument is that now your system configurations are more complicated; you have tied your hardware and software together, for example. For some that may not be possible, and even if it is, some may not want to give up the abstraction that the kernel provides.


> With this scale of attack the Linux kernel is not enough for us. We must work around it.

...Or you could just use an operating system substrate based on illumos which utilizes the FireEngine, like for instance SmartOS, instead of having to invent workarounds or use one's own TCP/IP stack implementations:

http://www.baylisa.org/library/slides/2005/august2005.pdf

https://sunaytripathi.wordpress.com/2010/03/25/solaris-10-ne...

This technology has been available for over ten years now, designed from the ground up to scale across multiple hardware threads for high performance, by the experts in the problem domain.


Quit spamming about stupid SmartOS; you try to shoehorn it into every topic. You're like a Mormon Missionary for SmartOS and it is super annoying.


I found his comment relevant and interesting enough, and judging by his posting history SmartOS is far from the only thing he comments about. It certainly added more to the discussion than yours did.


An incredibly large percentage of Annatar's posting history is a misunderstanding of something about GNU/Linux, followed by a pitch about SmartOS. It's not the only thing they talk about, but those are the only posts that stick in my mind. While I find the history of free operating systems fascinating, it's quite dismissive to pretend that all possible problems that GNU/Linux faces today were solved "10+ years ago by experts in the problem domain".


Misunderstanding? I develop on Linux day in and day out. Care to qualify that assertion?

> it's quite dismissive to pretend that all possible problems that GNU/Linux faces today were solved "10+ years ago by experts in the problem domain".

Sunay was one of the principal kernel engineers of FireEngine, so yeah, I think he is the expert in the problem domain, having invented parallel enqueuing, or what he terms "fanout"; and Radia Perlman, who I believe collaborated with him on it, invented the spanning tree protocol. If that doesn't make them the subject matter experts in the TCP/IP stack domain, then I have nothing more to add. And yes, some of the problems GNU/Linux is hitting today were solved on Solaris more than ten years ago, others more than twenty. Solaris had large enterprises as paying customers throughout the nineties of the past century, and those customers both demanded and paid huge sums of money to have these types of problems solved, so in some cases illumos has up to 25 years of a head start; and by the time GNU/Linux catches up, illumos will already be ahead, as its development is not standing still and it has professional kernel engineers working on the code base.


> Care to qualify that assertion?

The most recent example I can think of is you posting about containers on GNU/Linux[1], claiming that they were implemented primarily using cgroups (and that the main purpose was resource restrictions). That is not true, and hasn't been true for a long time (if ever). Yes, the very first upstream "container" primitive was cgroups -- but that was very quickly replaced with namespaces and cgroups took on the resource restriction role. What most people call "containers" was always about virtualization (ie isolation), and the isolation primitive in the Linux kernel is namespaces.

There are almost certainly more examples, but I don't feel like going through any more of your comment history at the moment.

> And yes, some or the problems GNU/Linux is hitting today have been solved on Solaris more than ten, others more than twenty years ago.

Believe it or not, constraints have changed in the past 20 years. I'm not saying that illumos doesn't have awesome technology (it does), but it is not a panacea. I get it, you're an advocate for alternative free operating systems. Good for you. Solaris does have a 25-year head start -- on solving problems that are 25 years old. Modern computing has many more problems that weren't even conceived of 25 years ago (cloud and distributed computing being the main ones, as well as embedded devices, where Solaris can't hold a candle to GNU/Linux). So it's very dismissive to claim that Solaris has solved all problems that may face GNU/Linux. Both operating systems have problems they need to fix.

> and it has professional kernel engineers working on the code base

So does Linux, I'm missing your point here.

[1] https://news.ycombinator.com/item?id=11944847


> What most people call "containers" was always about virtualization (ie isolation), and the isolation primitive in the Linux kernel is namespaces.

There is no isolation with cgroups in Linux, that is the crux of the matter:

https://www.youtube.com/watch?v=coFIEH3vXPw

Since containers in Solaris existed before cgroups and before the entire Linux hype, and you specifically address my "misunderstanding" (of hype), you compel me to correct you on terminology:

Containers are resource constraints, while technologies like LXC and OpenVZ provide the lightweight virtualization and isolation, a very important distinction (full virtualization is achieved via Xen and KVM on GNU/Linux). Conceptually, as a resource constraint, containers are in that sense the same in Solaris as they are in Linux, with vastly different mechanism implementations, but neither provides isolation.

Again, and I corrected you on this before (this happens to be my problem domain), what you think of as containers are lightweight virtual machines, as zones in Solaris and LXC / OpenVZ in Linux, and equating cgroups and namespaces with a lightweight virtual machine technology is conflating two different things.

If you should have the inclination to point out my other "misunderstandings" of Linux, an operating system I very heavily use, develop on, and engineer for, I would be interested to learn of them.

> So does Linux, I'm missing your point here.

If they exist, I have not heard of them, read about them, or met them yet; at any rate, since Linux has so many architectural and performance problems, again I am compelled to conclude that those "Linux kernel engineers" are not of the same caliber as the ones working on BSD and illumos kernels. That an operating system, after almost twenty years of massive investment and literally armies of programmers still cannot get basic things like startup (init.d/systemd/other variants of startup), shutdown (trying to flush a filesystem buffer to an unmounted filesystem), or even TCP/IP performance right tells me it is missing kernel engineers. Enthusiasts and volunteers tinkering with the kernel do not professional kernel engineers make, as is evident by this entire topic of whether to bypass the kernel's TCP/IP stack with one's own implementation, because the stack cannot deliver sufficient performance. That is what one can call damning evidence, no matter how one slices or dices it.


> There is no isolation with cgroups in Linux
> containers are resource constraints

I'm going to say this one more time:

Linux containers use namespaces as the primary isolation mechanism -- NOT cgroups. You can create containers without cgroups. This happens to be my problem space too, and you're not helping by spreading ignorance.
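To make that concrete, here is a minimal sketch (assuming a reasonably modern kernel and root privileges) of a process isolated purely with namespaces via clone(2); no cgroup is created or touched anywhere:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* A "container" built from namespaces alone: new UTS, PID and mount
     * namespaces, no cgroups involved. Needs root (or CAP_SYS_ADMIN). */
    static int child(void *arg) {
        (void)arg;
        sethostname("inside", 6);            /* visible only in the new UTS ns */
        printf("pid as seen inside: %d\n", getpid());   /* prints 1 */
        execlp("sh", "sh", (char *)NULL);    /* a shell in the isolated view   */
        return 1;
    }

    int main(void) {
        static char stack[1024 * 1024];
        int flags = CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;

        pid_t pid = clone(child, stack + sizeof(stack), flags, NULL);
        if (pid < 0) {
            perror("clone");
            exit(1);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }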

> equating cgroups and namespaces with a lightweight virtual machine technology is conflating two different things.

Finally you mention namespaces. Who mentioned "lightweight virtual machines"? Namespaces are just tags for a process that are used to scope operations to provide isolation. Cgroups are different tags used to provide resource constraints. Just because people use containers in that way at the moment doesn't make the underlying technology just about that.

> an operating system I very heavily use, develop on, and engineer for, I would be interested to learn of them.

Arrogance is not an endearing quality.

> If they exist, I have not heard of them

We can play that game all day. I don't care who you have and haven't heard of, Linux has talented kernel engineers as evidenced by the fact that Linux is widely used for production deployments. You might not agree with what has been built, but you can't deny that it does exist and is being used to power production systems. Please calm down on the saltiness, sodium is bad for your health.


No, I'm just sick and tired of Linux and want my favorite OS to finally hit the mainstream, so there would be some job opportunities. (Linux became popular the same way, for those of you with a short memory.) Now that Linux is finally hitting scale, people are getting busted by the shoddy programming, hence discussions about in- or out of kernel TCP/IP stack, which is preposterous, since the OS is supposed to provide an interface to the hardware. And I make no apologies for being a SmartOS advocate, just to set the record straight.


It would be great if you could provide specific reasons why FireEngine is able to avoid the overheads which other projects avoid via kernel bypass.


The main reason is that packets are put into queues bound to hardware threads (VCPUs), which the illumos kernel treats as processors. Sunay, the principal author of FireEngine, explains it in detail in the second link I cited. Long story short, on illumos-based systems network performance scales linearly with available processors, and on systems where the NIC's PHY would be faster, a kernel tunable, ip_squeue_fanout (editable via /etc/system), enables one to change the packet-processing methodology. One of the techniques that enables FireEngine to provide high network performance is eschewing mutex locks in favor of multiple parallel queues and parallel queue drainage. Should I also explain how mutex locks function, and why they are detrimental to performance as opposed to enqueueing?
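For anyone curious, that tunable goes in /etc/system like other Solaris kernel parameters; if I'm remembering the module prefix correctly, it looks like this (takes effect after a reboot):

    * /etc/system
    set ip:ip_squeue_fanout=1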


This sounds pretty similar to RSS/RPS/RFS in Linux. Granted, that functionality was added more recently and may have even been inspired/copied from Solaris for all I know, but the past is a sunk cost. None of your "back in the day" comments convince me that Solaris today has any advantages over Linux today.


DragonflyBSD

I really wonder how DragonflyBSD compares given that they have a lockless network stack implemented in the kernel.

https://www.dragonflybsd.org/~aggelos/netmp-paper.pdf



DPDK and netmap are really only for applications with cooperating I/O processes. This is because the queue of received packets is shared between all processes and any of them can delete any packet.

It may not be good for CloudFlare hosting multiple web servers on the same host, but it could be good for a database or cache server, usually run on a LAN with 10 Gbit/s network cards.


Don't modern high-performance network cards have multiple TX/RX queues which are virtualizable via the IOMMU?

That's a genuine question BTW, I've only a bit of experience with userspace networking with fully cooperating processes.


SR-IOV is good for actual virtualization, but it's pretty clumsy for trying to create isolation within a single VM. For example:

- Every VF you create using SR-IOV will need to have a distinct MAC (and thus, in practice, a different IP). But what you'd usually want for this use case is to use the same IP for all apps and do the split by destination port.

- Another consequence of the previous point is that all apps would need to include their own support for ARP, DHCP, etc. Doing it centralized doesn't really work.

- No promiscuous mode (at least on Intel NICs); you only get traffic directed to one specific MAC address.

Now, if you didn't try to use the virtualization support but just use the separate RX/TX queues, with something like the flow director for deciding what traffic gets sent to which queue, you'd get rid of the above problems. But then you end up with the issue that DPDK makes it very hard to have separate applications access the same NIC, even on different queues.


Use VPP for namespace-specific userland applications? https://wiki.fd.io/view/VPP/What_is_VPP%3F



