C10K 2012: Erlang Wins, Go a close second. Java, Haskell, Node & Python fail (github.com/ericmoritz)
123 points by nirvana on June 16, 2012 | 61 comments



This doesn't seem to investigate tuning at all.

I see the Tornado script forks processes, for example, but it doesn't experiment with different values (or PyPy?). The Java code appears to be run with default VM settings, which pretty much always deserves some tweaks to run in a server environment.

Those are just the two I'm most familiar with. The rest seem to have similar faults. Plus, it would be a more interesting test if the server had to perform some kind of work. Simply echoing the request is not a typical usage and could really bias the results in favor of setups that would fail on a real project.

A better title might be, "Erlang wins unrealistic test over other VMs in their default configuration."


I contributed the tornado script to this project, but its results have yet to be published (or if they have, I've missed them). I haven't tested it on AWS myself, just on a physical desktop machine. I got great results there, but they're so good that of course it isn't a relevant comparison.

I benchmarked the ws4py code from the test on that machine, and it was much slower than the tornado code - which is no surprise, since it ran in a single process and thus on a single CPU.

BTW, the default number of processes when you fork is the number of CPUs, which sounded like a fair choice to me (not knowing the target machine), so I left it as is.


It's perfectly valid to say "this is what you get out of the box" which is how a large proportion of people are going to be running their servers. Admittedly, not a lot of people are going to be both running a server in a default configuration and having to deal with C10K, but slashdotting can happen to anyone.


This doesn't seem to investigate tuning at all.

This is always the response to any benchmark where one's pet technologies don't win. If you have a magic quadrant of tuning, put it forth. If not, you have said nothing that counters the results.


It's a common response to off-the-cuff benchmarks, yes, but imo a deserved one. Most peer-reviewed, published benchmarks go to considerably more effort to make sure they're comparing like to like, using settings that would be considered sensible by users of each of the technologies under consideration, etc., etc. In part, that's because you wouldn't be able to get your benchmark paper accepted to a reputable journal if you didn't do that.

Basically, benchmarking is hard, and if you threw together a quick benchmark over a weekend, there is a good chance it might not be representative.


If there are sensible defaults and a technology doesn't use them "by default", then that is a failing of the technology, not the benchmark.


Arguably, though, what makes a default "sensible" depends on how the technology is being used. I'm speculating here, but it could be that there are settings that greatly increase performance with < 1000 connections, but greatly degrade performance with > 3000 connections. In that case, would it not be a sensible default to start out with the fast, lower scale option?


Is using PyPy a reasonable default? Who's there to make that choice? I suppose the answer is "no", even though it would make sense to use it when benchmarking.


He doesn't have to, it's enough just to point out gaping flaws in the 'study' to call it into question. The onus is on the study's author to make it comprehensive and thorough, not the critics.

This one in particular is clearly incomplete, especially when things like Haskell, Java, and Python are represented by a single framework, and not even the best or most optimized ones. For example, why not use Yesod/Warp, BlueEyes, and Tornado on PyPy, respectively, for those instead?

Also, the submission title is editorializing flame bait. The study's author didn't make any claims whatsoever about 'Haskell', 'Java', and 'Python', etc. but rather about Snap, Webbit, and ws4py.


> The onus is on the study's author to make it comprehensive and thorough, not the critics.

It's always possible for the critics to go out there and do something better, rather than limiting themselves to pointing out flaws.


The problem is that it takes effort to make good comparative benchmarks — for a platform you're not totally immersed in, you'd ideally speak with the community about your methods. Furthermore, good benchmarks require real investigation explaining the differences; otherwise you could be measuring something silly. And you should graph some of the stats by time.

Maybe in an ideal world, quality benchmarks would be commonplace. But in this world, doing good benchmarks in response to everyone's bad benchmarks is a heavy burden.


Writing good benchmarks is hard. Spotting bad ones is easier. I personally am glad that it's this way. Otherwise, we'd be taking bunk benchmarks at face value because we can't tell for ourselves.


So what you're really saying is that no one is 'allowed' to make a critical remark unless they are willing to stand up and do better themselves?


Problem of induction - for theories that cannot be logically or mathematically proven, and can only be confirmed by empirical evidence, no amount of confirming evidence can prove the theory true, but one single refuting observation can disprove it.

So maybe the study's author only cared about webbit, ws4py, and Snap, and that was the implicit constraint of his test. If so, that's fine, and the submitter mangled the author's intent with an overly-editorialized, sensationalized title.

But if the author was actually using webbit, ws4py, and Snap as proxies for all of Java, Python, and Haskell, then no one should be surprised that it gets quickly shot down. There's value in critics debunking BS quickly, even if they don't provide a better alternative. Absence of bad knowledge is better than presence of it.


Doing both is also possible...


I don't see why it's not the critics' onus to produce better benchmarks.


That's like someone posting a bad argument on here, someone else saying it's a bad argument, and then the first person saying it's not a bad argument because the critic hasn't supplied a better argument to replace it.

Nobody has to accept this benchmark as indicative of anything if they don't find it robust enough.


The burden of proof is always on the one making the claim, not on the one refuting it. See Russell's Teapot, et al.


It's not even a benchmark of languages. So I have no idea why it's the languages listed in the title.

It's a benchmark of specific websocket implementations, run with default configurations, which happen to be in different languages.

Drawing any conclusions about the merits of each language based on the results would be very foolish.


Sure, but language shootouts are more commonly about the code run than the language itself.

I really like Go and can believe that Erlang performs well. For the Java example... wtf is "webbit"? I've programmed Java for 10 years and have never heard of it. Why doesn't he just find a way to run it on Jetty or Tomcat like everyone else?

I'd believe, for an example like this, that native languages can outperform Java, because it's basically a ton of no-op web requests. But Java underperforming Node and frickin Python? Python's a great language but famously slow. No way is this study legit.


Webbit seems like a thin wrapper around Netty to make Netty do websockets. Unless Webbit grossly screws things up, it seems like a reasonable choice, because all the work falls to Netty, a well-known and much-used library.


But Java did better than Node and Python.


Indeed. It seems pretty clear to me that it's framed as a benchmark of non-tuned, idiomatic performance.

Maybe it doesn't call it out explicitly - but in my view that is a reasonable enough test.

I'm pretty sure all the platforms could support an FFI binding or equivalent to an optimized epoll C implementation - or hours of tuning.


It seems more about running on an Amazon instance and getting throttled. Haskell got every successful result back in less than a second, Erlang took 7 seconds, and Go took 49(!) seconds to get an answer back.

I'm really impressed that erlang didn't drop a single connection. I suspect if you ran the same test with just erlang 5 times, you'd see it start to behave like go.


> I suspect if you ran the same test with just erlang 5 times, you'd see it start to behave like go.

I don't think it would. Erlang was designed for exactly this kind of thing and has many, many years on the competition (Go, in this case). I'm sure Go will get there, but by then, with further threading improvements coming in R16, Erlang will have gotten better too.


For Haskell, there is no doubt it could do better. The culprit is probably the relatively new Websocket server. For evidence, I cite http://www.yesodweb.com/blog/2011/03/preliminary-warp-cross-... But "Not bad for 3 lines of code" is a pretty fair sentiment.

Erlang and Go doing great is no surprise. Java didn't do too well, but the implementation didn't really use the most performant tools available.

The real loser here is Node.js. Nearly as long as the Java example, but the worst performer in the real metrics. So much for "making concurrency easy."


Is this a joke? Where is a simple epoll-based C server for a baseline comparison? :)


Yeah, I'd love to see him enter a hellepoll-based one.


My conclusion is that he managed to write the fastest code in the language he works with most.


Benchmarking anything on EC2 instances will yield statistically suspect results. You know nothing about the other workloads on the box, the underlying hardware, or the network connectivity.


Doesn't mean it can't be statistically valid; the noise just means you need to run more tests.


That assumes the noise is uncorrelated. It's not hard to imagine scenarios where it's correlated.


This test is lacking some due diligence that I feel skews the results significantly (and, I assume, there are more issues I haven't even noticed in the Python/Java code). (While m1.medium only has one CPU, the network IO cost still exists.)

Erlang is automatically threaded by nature, so it has some inherent scaling built in (the code benefits from threading as long as it runs correctly).

The Go code spins up goroutines inside ListenAndServe, thus gaining the benefit of splitting up the IO.

The Haskell code has a thread specifically for garbage collection, thus utilizing the second core.

The Node.js code could be using cluster (or threads/fibers more directly) but isn't, for some reason. It also seems this was using the websocket npm package (some unknown code running on the stack!). For a valid test, the websocket code should be written in JavaScript directly in the test itself.

Edit: I looked into it, and the Go code does use ListenAndServe, which creates goroutines on the fly, but the m1.medium is bound to 1 CPU (it still benefits from the concurrency due to IO).
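
To illustrate the pattern described above (a sketch only, not the benchmark's actual websocket code): Go's standard library hands each accepted connection to its own goroutine, so a blocked read on one client never stalls the others, even on a single CPU. A plain TCP echo server shows the shape of it:

    // Minimal sketch of Go's goroutine-per-connection pattern:
    // a plain TCP echo server, not the benchmark's websocket code.
    package main

    import (
        "io"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            // One goroutine per connection: blocking IO on one
            // client is interleaved with all the others.
            go func(c net.Conn) {
                defer c.Close()
                io.Copy(c, c) // echo bytes straight back
            }(conn)
        }
    }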


  > The Go code is set to spin up threads to accommodate the number of CPUs available on the system (2 logical cores on m1.medium).
m1.medium is a single virtual CPU (it's a VM, so no 'cores' to speak of). c1.medium has 2.


Corrected in my post to reflect that - it was kind of difficult to find a real answer when I was looking that up.



Try again and increase GOMAXPROCS to 10-20. See for example here: https://groups.google.com/d/msg/golang-nuts/fjBft6qeMo0/kYSi...
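
(For context: in the Go releases of that era GOMAXPROCS defaulted to 1, so it has to be raised explicitly. A sketch of the two usual ways to do it, with 16 as an arbitrary example value:)

    // Sketch: raising GOMAXPROCS. 16 is only an example value.
    // From the shell:   GOMAXPROCS=16 ./server
    // Or from inside the program:
    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        prev := runtime.GOMAXPROCS(16) // sets the new value, returns the old one
        fmt.Println("GOMAXPROCS:", prev, "->", runtime.GOMAXPROCS(0)) // 0 = query only
    }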


I think he risks dropped connections because the listen queue can overflow.

He doesn't mention increasing somaxconn, which means it probably has its default of 128 (I don't think increasing tcp_max_syn_backlog is needed since he's using syncookies). When creating a new connection every 1ms, it only takes 128ms for the backlog to fill up, which is not that improbable with GC and JIT pauses. Java should be faster than Erlang most of the time, but the pauses can kill you if you don't handle them gracefully.

If the benchmark had been about "normal" http requests, I would suggest putting HAProxy in front with a reasonable maxconn - then HAProxy will hold on to the connections until the application starts accepting again.
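
To sanity-check the limit described above before a run, you can read it straight out of /proc (a Go sketch, Linux only; raising the limit itself is a matter of sysctl -w net.core.somaxconn=<n>):

    // Sketch: print the kernel's accept-queue cap before benchmarking.
    // Linux only; 128 is the common default discussed above.
    package main

    import (
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        b, err := os.ReadFile("/proc/sys/net/core/somaxconn")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("net.core.somaxconn =", strings.TrimSpace(string(b)))
    }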


The title doesn't agree with the article... the article says Java beats Go, orders Java above it in the ranking chart, and the raw numbers say Java dropped fewer connections...


But it returned only half as many messages as Go.


I am not competent to judge the tests Eric used. But the obvious reply to those who complain about his test for some language is to fix it and come back with an improvement that you feel better represents that language. It's all on GitHub, so fork it and send a pull request.

To say it bluntly, "put your money where your mouth is".


Really. Bad. Title.


There are at least two better HTTP servers for Python: uWSGI and gevent. Tornado is known to be slower: http://nichol.as/benchmark-of-python-web-servers . More specifically, uWSGI has been shown to respond in under 25ms with 15k+ concurrent connections.

Throw uWSGI behind Nginx (with its new websocket support), tune it a bit, and I wouldn't be surprised to see it "pass" and perhaps even be competitive.


The numbers are confusing and don't seem to add up at all. The "stat definitions" section describes data that isn't available in the table below it.


Part of the problem is that EC2 is such a whacky environment. Notice that Erlang's median connection time was tiny, but its average was blown out by huge outliers.

Read the timing numbers with a reasonable portion of salt. It's EC2 we're talking about here.


Yeah I wince a bit at any benchmarking being run in a shared environment like EC2, where other activity outside of your VM can affect performance.


Hmm, why would he run a benchmark on a shared environment? Didn't he have a home computer to run this on?


Which is more repeatable but less representative. IMO a non-shared EC2 instance is probably the way to benchmark (if you can drive enough load to saturate it).


I agree, but does EC2 have non-shared instances?


On EC2 there are two ways to get a dedicated machine:

- Using VPC you can specify that your instance should run on dedicated hardware [1]

- Cluster Compute instances [2] most likely (see [3]) run on dedicated hardware too

[1] http://aws.amazon.com/dedicated-instances/

[2] http://aws.amazon.com/ec2/#instance

[3] https://forums.aws.amazon.com/message.jspa?messageID=238197#...


Hmm. Some cloud provider ought to provide (for a premium price) guaranteed-identical non-shared configurations... including small groups of machines with uncontended, identical cross-connects.

(I know it's then close to 'dedicated' hosting, but people running benchmarks also want the quick setup and discard of virtualized instances. This would be a hybrid offering that ensures their cloud services is always chosen for such benchmarking comparisons. Of course this offering would not be good for cross-cloud comparisons, because it's not representative of their usual offerings.)


> guaranteed-identical non-shared configurations... including small groups of machines with uncontended, identical cross-connects.

That's called EC2 cluster compute.


I totally agree, but I'm not sure of any other places that offer such flexibility in pricing and hardware.

If you were to conduct a test like this, where would you go for more reliable performance from hardware?


I think EC2 is great. It's just whacky.


This is true, but it's still interesting if you're intending to deploy onto EC2.


I agree, this is really a test of various websocket implementations. (which is still cool)


People get so butthurt when they don't see "<stuff I chose to work with> wins this benchmark".

Any framework, any language, hell, any hardware is either generalized to deal with many problems or specifically designed to battle one.

You could surely write a fast C program that does exactly this benchmark well - what would that prove? Nothing at all.

Why is it so hard to accept that some tools can provide good-enough results without much tweaking, while other tools may provide better results with more time spent achieving them?


Could anyone explain to me why some of the rows don't add up to 10k attempted connections?

For example, looking at the raw data for Haskell and Java, adding up connections, disconnects, crashes, and timeouts doesn't give anywhere near 10k. What happened to the rest of the connections? Were they simply not attempted? Was the port closed? Or am I just missing a relevant field in the output? It's not clear to me from the description.


Any reason you haven't tested the 'ws' node version I submitted? I think it should work fine...


This appears to be pointing to the old results. I don't think the author of the benchmark has rerun them yet.


I'm not the author of the benchmarks. I merely submitted the results to hacker news because I found them interesting.



