
This doesn't seem to investigate tuning at all.

This is always the response to any benchmark where one's pet technologies don't win. If you have a magic quadrant of tuning, put it forth. If not, you have said nothing that counters the results.




It's a common response to off-the-cuff benchmarks, yes, but imo a deserved one. Most peer-reviewed, published benchmarks go to considerably more effort to make sure they're comparing like to like, using settings that would be considered sensible by users of each of the technologies under consideration, etc., etc. In part, that's because you wouldn't be able to get your benchmark paper accepted to a reputable journal if you didn't do that.

Basically, benchmarking is hard, and if you threw together a quick benchmark over a weekend, there is a good chance it might not be representative.


If there are sensible defaults and a technology doesn't use them "by default", then that is a failing of the technology, not the benchmark.


Arguably, though, what makes a default "sensible" depends on how the technology is being used. I'm speculating here, but it could be that there are settings that greatly increase performance with < 1000 connections but greatly degrade performance with > 3000 connections. In that case, would it not be a sensible default to start out with the fast, lower-scale option?


Is using PyPy a reasonable default? Who's there to make that choice? I suppose the answer is "no", even though it makes sense to use it when benchmarking.


He doesn't have to; it's enough just to point out gaping flaws in the 'study' to call it into question. The onus is on the study's author to make it comprehensive and thorough, not on the critics.

This one in particular is clearly incomplete, especially when languages like Haskell, Java, and Python are each represented by a single framework, and not even the best or most optimized one. For example, why not use Yesod/Warp, BlueEyes, and Tornado on PyPy, respectively, instead?
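For illustration, here is roughly what the suggested Tornado-on-PyPy alternative might look like: a minimal websocket echo handler (the handler name, route, and port are made up for this sketch, not taken from the benchmark):

    import tornado.ioloop
    import tornado.web
    import tornado.websocket

    class EchoHandler(tornado.websocket.WebSocketHandler):
        # Echo every websocket message straight back to the client.
        def on_message(self, message):
            self.write_message(message)

    if __name__ == "__main__":
        # Runs the same way under CPython or PyPy (e.g. `pypy echo_server.py`).
        app = tornado.web.Application([(r"/ws", EchoHandler)])
        app.listen(8888)
        tornado.ioloop.IOLoop.instance().start()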

Also, the submission title is editorializing flame bait. The study's author didn't make any claims whatsoever about 'Haskell', 'Java', 'Python', etc., but rather about Snap, Webbit, and ws4py.


> The onus is on the study's author to make it comprehensive and thorough, not the critics.

It's always possible for the critics to go out there and do something better, rather than limiting themselves to pointing out flaws.


The problem is that it takes effort to make good comparative benchmarks: for a platform you're not totally immersed in, you'd ideally speak with the community about your methods. Furthermore, good benchmarks require real investigation to explain the differences; otherwise you could be measuring something silly. And you should graph some of the stats over time.

Maybe in an ideal world, quality benchmarks would be commonplace. But in this world, doing good benchmarks in response to everyone's bad benchmarks is a heavy burden.


Writing good benchmarks is hard. Spotting bad ones is easier. I personally am glad that it's this way. Otherwise, we'd be taking bunk benchmarks at face value because we can't tell for ourselves.


So what you're really saying is that no one is 'allowed' to make a critical remark unless they are willing to stand up and do better themselves?


This is the problem of induction: for theories that cannot be logically or mathematically proven and can only be confirmed by empirical evidence, no amount of confirming evidence can prove the theory true, but a single refuting observation can disprove it.

So maybe the study's author only cared about webbit, ws4py, and Snap, and that was the implicit constraint of his test. If so, that's fine, and the submitter mangled the author's intent with an overly editorialized, sensationalized title.

But if the author was actually using webbit, ws4py, and Snap as proxies for all of Java, Python, and Haskell, then no one should be surprised that it gets quickly shot down. There's value in critics debunking BS quickly, even if they don't provide a better alternative. The absence of bad knowledge is better than its presence.


Doing both is also possible...


I don't see why it's not the critics' onus to produce better benchmarks.


That's like someone posting a bad argument on here, another person saying it's a bad argument, and then the first insisting it's not a bad argument because the second hasn't supplied a better argument to replace it.

Nobody has to accept this benchmark as indicative of anything if they don't find it robust enough.


The burden of proof is always on the one making the claim, not on the one refuting it. See Russell's Teapot, et al.


It's not even a benchmark of languages, so I have no idea why the title lists languages at all.

It's a benchmark of specific websocket implementations, run with default configurations, which happen to be in different languages.

Drawing any conclusions about the merits of each language based on the results would be very foolish.


Sure, but language shootouts are more commonly about the code being run than the language itself.

I really like Go and can believe that Erlang performs well. For the Java example... wtf is "Webbit"? I've programmed Java for 10 years and never heard of it. Why doesn't he just find a way to run it on Jetty or Tomcat like everyone else?

I'd believe for an example like this that native languages can outperform Java, because it's basically a ton of no-op web requests. But Java underperforming Node and frickin' Python? Python's a great language but famously slow. No way is this study legit.


Webbit seems like a thin wrapper around Netty to make Netty do websockets. Unless Webbit grossly screws things up, it seems like a reasonable choice, because all the work falls to Netty, a well-known and much-used library.


But Java did better than Node and Python.


Indeed. It seems pretty clear to me that it is framed as a benchmark of untuned, idiomatic performance.

Maybe it doesn't call that out explicitly, but in my view that is a reasonable enough test.

I'm pretty sure all the platforms could support an FFI binding or equivalent to an optimized epoll C implementation, or hours of tuning.
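(For what it's worth, the epoll-level event loop being gestured at here looks roughly like the sketch below. It's a minimal Python version using the stdlib's select.epoll wrapper rather than an actual FFI binding, with a made-up port and buffer size, just to show the shape of the thing.)

    import select
    import socket

    # Minimal non-blocking echo server multiplexed with epoll (Linux only).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 9000))
    listener.listen(128)
    listener.setblocking(False)

    epoll = select.epoll()
    epoll.register(listener.fileno(), select.EPOLLIN)
    connections = {}

    while True:
        for fd, events in epoll.poll():
            if fd == listener.fileno():
                # New client: accept it and watch it for readable data.
                conn, _ = listener.accept()
                conn.setblocking(False)
                epoll.register(conn.fileno(), select.EPOLLIN)
                connections[conn.fileno()] = conn
            elif events & select.EPOLLIN:
                conn = connections[fd]
                data = conn.recv(4096)
                if data:
                    conn.send(data)  # echo; real code would buffer partial writes
                else:
                    # Peer closed the connection.
                    epoll.unregister(fd)
                    conn.close()
                    del connections[fd]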


It seems more about running on an Amazon instance and getting throttled. Haskell got every successful result back in less than a second, Erlang took 7 seconds, and Go took 49(!) seconds to get an answer back.

I'm really impressed that Erlang didn't drop a single connection. I suspect if you ran the same test with just Erlang 5 times, you'd see it start to behave like Go.


> I suspect if you ran the same test with just Erlang 5 times, you'd see it start to behave like Go.

I don't think it would. Erlang was designed for exactly this kind of thing and has many, many years on the competition (Go, in this case). I'm sure Go will get there, but by then Erlang will have gotten better too, with further threading improvements coming in R16.



