One Million Concurrent TCP connections

mrb · on Sept 23, 2011

I remember reading around 2002-2004 about a sysadmin managing a very large "supernode" p2p server who was able to fine-tune its Linux kernel, and to recompile an optimized version of the p2p app (to allocate data structures as small as possible for each client) to support up to one million concurrent TCP connections. It wasn't a test system, it was a production server routinely reaching this many connection at its daily peak.

If it was possible in 2002-2004, I am not impressed that it is still possible in 2011.

One of the optimizations was to reduce the per-connection TCP buffers (net.ipv4.tcp_{mem,rmem,wmem}) to only allocate one physical memory page (4kB) per client, so that one million concurrent TCP connections would only need 4GB RAM. His machine had barely more than 4GB RAM (6 or 8? can't remember), which was a lot of RAM at the time.

I cannot find a link to my story though...

gtani · on Sept 23, 2011

(Not your link but) some pointers on tuning erlang/OTP servers:

http://www.erlang-consulting.com/thesis/tcp_optimisation/tcp...

http://www.trapexit.org/Building_a_Non-blocking_TCP_server_u...

http://groups.google.com/group/erlang-programming/browse_thr...

(this erl mailing list thread is pretty typical, if you put up code, describe your app, hardware, network, database/external dependencies, etc, you'll get a ton of good advice about killing off bottlenecks. Another example

http://groups.google.com/group/erlang-programming/browse_frm...

mmaunder · on Sept 23, 2011

Agree with this being underwhelming. Maybe they'll hire someone who understands that Erlang is using kqueue to make this possible.

Running netstat|grep like this on a high concurrency server takes a long time to run. I've never found a faster way to get real-time stats on our busy servers and would be interested if anyone else has.

thomasknowles · on Sept 23, 2011

Actually I would run ss over the netstat for such high concurrency connection as netstat hits /proc/net/tcp and it's quite slow. However, you need to have the tcp_diag module loaded otherwise it falls back to /proc/net/tcp.

X-Istence · on Sept 23, 2011

You may not have read the entire article, but he clearly explains that their tech stack is FreeBSD + Erlang not Linux + Erlang.

There is no /proc/net/tcp on FreeBSD. Hell, there is no /proc unless it is specifically mounted by the administrator, but tools certainly don't use it to get data.

thomasknowles · on Sept 24, 2011

I read it, I just presume people most users are linux based in here.

derobert · on Sept 23, 2011

You might be able to parse /proc/net/tcp yourself faster, especially for just counting by state.

X-Istence · on Sept 23, 2011

/proc/net/tcp doesn't exist on FreeBSD...

adient · on Sept 23, 2011

You may have tried but generally add -n to netstat to speed it up by just displaying IPs instead of trying to resolve them. I've never seen a system with so many concurrent connections hands on so there could be some other bottleneck.

jen_h · on Sept 23, 2011

netstat -s |grep "connections established" is pretty fast and does the job if you're not filtering for certain ports or IPs. But running plain old netstat on a system with so many connections is a really bad idea. I almost fired myself once over a crufty cron job I left running...

tworats · on Sept 23, 2011

The WhatsApp guys are very sharp ex-Yahoo guys who've had tremendous experience with scaling systems. Rick Reed is fairly legendary. Yahoo was a long time FreeBSD shop, so it's not surprising they went with that.

I hope they publish how they did it - in fact let me drop them an email and see if I can convince them to do so.

cperciva · on Sept 23, 2011

Yahoo was a long time FreeBSD shop

FWIW, Yahoo still uses FreeBSD extensively.

lenn0x · on Sept 23, 2011

It's been done before. Here is an article using Erlang and Linux.

http://www.metabrew.com/article/a-million-user-comet-applica...

Part 3 is my favorite. http://www.metabrew.com/article/a-million-user-comet-applica...

jen_h · on Sept 23, 2011

This article series is kind of a Bible to me. It's not going to solve all of your problems, for sure, depending on yer stack, and setting yourself up for a forkbomb isn't the wisest in all situations, but it's got a lot of good advice & is pretty good about providing you a "Okay, tweak these parameters and then try to break it" baseline.

indygreg2 · on Sept 23, 2011

Technical details would be interesting. Until then, here's Urban Airship's post from last year on 500k connections on Linux:

http://urbanairship.com/blog/2010/09/29/linux-kernel-tuning-...

jsr · on Sept 23, 2011

OK, you hooked me with the title. But "FreeBSD + Erlang" was kind of a dissatisfying reason for how you achieved it. Would love to hear more details! How far we've come since http://www.kegel.com/c10k.html

getsat · on Sept 23, 2011

They could have done it using C, Ruby, Python, or any other language. kqueue is what makes FreeBSD (and OSX) awesome at concurrency.

http://en.wikipedia.org/wiki/Kqueue

silentbicycle · on Sept 23, 2011

How does kqueue compare to epoll on Linux? I've written C code using kqueue on OpenBSD and OS X, but have only used epoll via libev (and not at especially high load). I thought the big change came from trading level- for edge-triggered nonblocking IO, but maybe the kqueue implementation is superior for sockets somehow?

The main advantage Erlang has over C/Python/Ruby/etc. is that asynchronous IO is the default throughout all its libraries, and it has a novel technique for handling errors. Its asynchronous design is ultimately about fault tolerance, not raw speed. Also, it can automatically and intelligently handle a lot of asynchronous control flow that node.js makes you manage by hand (which is so 70s!).

You can make event-driven asynchronous systems pretty smoothly in languages with first class coroutines/continuations (like Lua and Scheme), but most libraries aren't written with that use case in mind. Erlang's pervasive immutability also makes actual parallelism easier.

With that many connections, another big issue is space usage -- keeping buffers, object overhead, etc. low per connection. Some languages fare far, far better than others here.

asomiv · on Sept 23, 2011

Yes I would say kqueue, the interface, is superior to epoll. Kqueue allows one to batch modify watcher states and to retrieve watcher states in a single system call. With epoll, you have to call a system call for every modification. Kqueue also allows one to watch for things like filesystem changes and process state changes, epoll is limited to socket/pipe I/O only. It's a shame that Linux doesn't support kqueue.

But as awesome as kqueue is, OS X apparently broke it: http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#OS_X_AN...

gorset · on Sept 23, 2011

I fully agree that kqueue is awesome, but what specifically is broken on OSX? I've used in extensively on that platform, and haven't run into any showstoppers.

asomiv · on Sept 23, 2011

You can't poll on TTYs for example. I remember this bug from 2004 and apparently they still haven't fixed it.

gorset · on Sept 23, 2011

Yeah, the lack of support for TTYs can be annoying when writing a terminal application (a workaround for some cases is to use pipes), but it hardly qualifies as a significant problem for writing network applications.

Here's a more interesting bug in OSX: kqueue will sometimes return the wrong number for the listen backlog for a socket under high load.

Flow · on Sept 23, 2011

IIRC, kqueue can be told to read the whole http-request before letting client know there's data to read.

rdtsc · on Sept 23, 2011

How does it work? Do users provide a buffer and the kernel fills the buffer with data and notifies the user when ready?

That is more akin to AIO Linux system, then? Otherwise, epoll/poll/select just notifies users when data is available but the actual copy is done by the user. Surprisingly this can make a huge difference when streaming large amounts of data.

http://blog.lighttpd.net/articles/2006/11/12/lighty-1-5-0-an...

We have argued here before and I have gotten downvoted into oblivion for being pedantic and distinguishing between asynchronous IO and non-blocking IO but it looks like that extra user-space memcpy can make a huge difference.

Flow · on Sept 23, 2011

I can't find anything about this now, just spent a good 20 minutes searching for it. I guess keywords kqueue, buffer request http are too generic in some sense. :-/

Anyway, the idea was to avoid context switches by waiting/parsing in kernel-side till there was enough data for the client to do something else that just another gimme_more_data()-call back to the kernel.

It could even be applied to other methods than kqueue, so perhaps I remember a bit wrong that this was just for kqueues.

gorset · on Sept 23, 2011

That's actually accept_filter(9). The man page for freebsd has a interesting info:

    The accept filter concept was pioneered by David Filo at Yahoo! and
    refined to be a loadable module system by Alfred Perlstein.

The closest you can get by using kqueue is to set a low water mark, so that a read event is only returned when there's enough data ready.

Flow · on Sept 23, 2011

Ah, that's it! Thanks for finding it.

asomiv · on Sept 23, 2011

The kqueue interface is awesome, but unfortunately kqueue is bugged on OS X: http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#OS_X_AN... (as are poll() and select()!)

Hoff · on Sept 23, 2011

There might well be errors here, but what those errors might be is not stated.

From the cited article on ports of the libev event loop library: "The whole thing is a bug if you ask me - basically any system interface you touch is broken, whether it is locales, poll, kqueue or even the OpenGL drivers." with no particular details on what is broken in Mac OS X.

Issues with porting to AIX, Solaris and Windows are also discussed in that article, and with reports of errors, though with no specific details provided for those platforms.

Without error details, there is also insufficient information around whether alternatives or workarounds or fixes might exist, or whether there were bug reports and reproducers logged that would allow the vendors to address the (unspecified) errors.

Rickasaurus · on Sept 23, 2011

This may be a dumb question (I'm not a networks guy) but how do you maintain so many connections with just 65535 ports? Can you have more than one connection per port?

caf · on Sept 23, 2011

The server generally only ever uses one port, no matter how many clients are connected. It is the tuple of (client IP, client port, server IP, server port) that must be unique for each TCP connection - so the limit of 65535 ports is only relevant for how many connections a single client can make to a single server.

nathanappere · on Sept 23, 2011

I believe this is incorrect. The server usually listen on one port, but everytime it does an accept, a different random port is used, and the client start talking to the server on that new port.

caf · on Sept 23, 2011

This is a surprisingly common misconception. When you accept, you get a new socket. but it is on the same local port. You can readily see this by running 'netstat' on a busy server.

ay · on Sept 23, 2011

Another way to see that there are no magic "dynamically selected" potrs is to run on the server "tcpdump src x.x.x.x or dst x.x.x.x" where the x.x.x.x is your client address - and check the outputs - the packets will have ports in them.

Alternatively one can read RFC793 or, better, Stevens' "TCP/IP Illustrated".

nathanappere · on Sept 23, 2011

Just checked it and you're right. I believed I had encoutered the behaviour I described but I do not remember in which context. Anyway glad I learned something!

dsl · on Sept 23, 2011

pavelkaroukin · on Sept 23, 2011

I believe this comes from how FTP protocol works which do it to some extent and the reason why it do not work trough firewalls well without using passive mode.

tybris · on Sept 23, 2011

By that logic, you would have to open every single port on your firewall if you wanted to set up a web server. Fortunately, your OS just looks at the client IP/port to distinguish multiple TCP streams.

richardw · on Sept 23, 2011

You can have multiple connections on the same port from the same client. E.g. all connections via HTTP commonly go over the same port and a single browser usually has 2-4 simultaneous connections to the server. They don't use multiple ports to achieve this.

caf · on Sept 23, 2011

They use multiple ports on the client side to achieve this - the "client port" part of the tuple varies.

pacala · on Sept 23, 2011

There is one unique connection per (address, port) pair.

huhtenberg · on Sept 23, 2011

Actually, the quintet of [protocol, src_ip, src_port, dst_ip, dst_port] is what's unique (protocol being TCP or UDP)

pacala · on Sept 23, 2011

Good point. Each unique client (address, port) can get a separate connection on a given/fixed server (address, port).

Rickasaurus · on Sept 23, 2011

Ahh I see, you just give one box a ton of IPs. Thanks :)

kornholi · on Sept 23, 2011

No, the pair uses the client IP which means the client can have as many connections to your server as number of ephemeral ports allowed. There is no limit on connections except ram AFAIK.

stonemetal · on Sept 23, 2011

Not sure why you got down voted, last time this topic came up the author was using EC2 instances for test clients, it took them 17 or so to get the number of connections to their server that they wanted. When the server IP, server port part of the 4 tuple is constant, it takes quite a few client IPs to turn 64K ports into a million.

chubs · on Sept 23, 2011

They beat me to it! I've only gotten to 500k on EC2, however i believe there's some trickery in their firewalls / NAT which is holding me back... If anyone's interested in the gory details, see: http://splinter.com.au/tag/comet

ericholscher · on Sept 23, 2011

Curious about what you've seen from the EC2 networking gear that's holding you back? Firewall not letting more connections through?

chubs · on Sept 23, 2011

Well, at the moment i can't pin it down to EC2, but it's the only thing i can imagine it'd be. The network/cpu/memory usage is all healthy, and there's nothing in the kernel log, so that's what i'm guessing is the cause. Although i may be wrong.

csarva · on Sept 23, 2011

We've observed the bottleneck to be an upper limit on packets/sec for a given instance type. On an m1.large this is about 100k/sec. I believe it's due to the virtual NIC just not being fast enough to handle high traffic loads.

The rightscale folks found the same thing:

http://blog.rightscale.com/2010/04/01/benchmarking-load-bala...

huhtenberg · on Sept 23, 2011

In absolute numbers - wow, that's impressive.

It was several years ago, but I've done my share of high-concurrency stuff under Linux and the highest I got to was about 200K connections - at which point the single-threaded server bottlenecked at its disk I/O.

The main issue is not the actual connection count, it's what the per-socket OS overhead is (so not exhaust non-swappable kernel memory), how many sockets are concurrently active (have an inbound or outbound data queued) and if the application can handle all the events that epoll/kqueue report. This is not a rocket science by any means, and the kernel is relatively easy to fine-tune even when the actual load is present.

cnlwsu · on Sept 23, 2011

I would be curious about the hardware used, it can make it much more or less impressive. I have done a test with 1 million concurrent tcp connections using java (mina) on an Ubuntu system... but it had 64 gb of ram. It kept running for weeks under the load which I felt pretty good about.

getsat · on Sept 23, 2011

What is the kernel structure overhead (in bytes) per connection on FreeBSD?

imsy · on Sept 23, 2011

Good work. On Imsy (www.imsy.com) we hope to achieve the same with Node.js on EC2. Presently the numbers are smaller (in 100K range).. but looks like it will scale smoothly till 1 million.

henry501 · on Sept 23, 2011

It's hard to say that something is going to scale smoothly up an order of magnitude until you're there. 900K requests is a lot of room for things to go wrong.

nivertech · on Sept 25, 2011

Node has memory limit of about 140K connections. I was able to accept and actually do something usable with 1M on EC2 and 3M on physical server (Erlang and Ubuntu/CentOS)

kitsune_ · on Sept 23, 2011

How much ram does this single server have, what about its cores?

jen_h · on Sept 23, 2011

A. That is AWESOME. Props, guys! B. Just how long did it take for that netstat to return? ;)

spokengent · on Sept 23, 2011

FWIW, netstat isn't fun to use. conntrack is better, or cat /proc/net/ip_conntrack

Luyt · on Sept 23, 2011

They're on FreeBSD. conntrack is a Linux utility.

spokengent · on Sept 23, 2011

ah ok I see...

KonradKlause · on Sept 23, 2011

conntrack is not netstat. I requires netfilter's connection tracking feature.

Using conntrack on a 1MC system will waste even more kernel memory!

gabi38 · on Sept 23, 2011

How would kqueue compare to Windows's IO completion ports in terms of performance?

flzz · on Sept 23, 2011

Just because you can doesn't mean you should.

nivertech · on Sept 25, 2011

Not many can, but some fortunate enough to actually need this stuff.