
The JVM runs every language I care about, in a mature system with a simple, stable interface. It has its problems (startup time), but for the kind of jobs you use Hadoop for it's rarely an issue.

Switching to Docker feels like a real step backwards. Rather than a jar containing strictly-specified bytecode, cluster jobs are going to be random Linux binaries that could do any combination of system calls. You need another layer of build/management system to make your Docker containers usable. Worst of all, rather than defining common shared services for e.g. job scheduling, this is going to encourage users to put any random component in their Docker image. We'll end up with a more fragmented ecosystem, not less.




Hi, JD @ Pachyderm here.

Our feeling isn't that Java is bad; it's that it shouldn't be the only option for big data processing. There are a lot of people who want to do this type of computing, and a lot of them hate Java. We want to build a solution that serves them better, and Docker is an incredibly high-leverage way to do it.

I feel like I don't fully understand your criticism of choosing Docker as a platform.

> cluster jobs are going to be random Linux binaries that could do any combination of system calls

Could you give an example of what could go wrong here? Docker prevents you from talking to the outside world so you can't call out to an outside service in the middle of a job or anything like that.
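
For instance, the job container can be launched with networking disabled entirely. A minimal sketch, not our actual code - the image name is made up:

    import subprocess

    # Hypothetical example: run a job container with no network interfaces
    # at all, so user code can't call external services mid-job.
    subprocess.check_call(
        ["docker", "run", "--rm", "--net=none", "user-job-image"])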

> Worst of all, rather than defining common shared services for e.g. job scheduling

In Pachyderm job scheduling happens outside the user submitted containers. The job scheduler is what actually spins up the containers and pushes data through them. I'm afraid I don't understand what it would even mean for one of those containers to then do job scheduling itself.
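
Roughly, the division of labor looks like this (a minimal sketch, not our implementation; all names and paths are illustrative):

    import subprocess

    # The scheduler lives outside any user container: it shards the input,
    # spins up one container per shard, and collects the results. The user
    # code inside the container never schedules anything itself.
    def run_job(image, shards):
        for i, shard in enumerate(shards):
            subprocess.check_call([
                "docker", "run", "--rm", "--net=none",
                "-v", "%s:/in:ro" % shard,       # this shard's slice of input
                "-v", "/jobs/out/%d:/out" % i,   # per-shard output directory
                image,
            ])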

Hope this clears up our thoughts on this a bit.


The JVM is the best VM out there; it has been developed for a long time by very smart people. I don't see how Docker is better in this case.


Docker isn't a VM? In Docker everything is native code; there is no emulation. It uses OS-level sandboxing primitives to provide protection and isolation. Running an application in Docker gives you the same performance profile as running it on the host, because it is running on the host. So if you write your job in C or Fortran you can beat any Java code on the JVM nearly every time, especially for linear algebra applications.
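
To make the native-code point concrete, a rough illustration (timings will obviously vary by machine):

    import time
    import numpy as np

    # np.dot typically dispatches to a compiled BLAS (C/Fortran). That
    # machine code runs the same inside a Docker container as on the bare
    # host - there's no emulation layer to pay for.
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    t0 = time.time()
    np.dot(a, b)
    print("1000x1000 matmul: %.3fs" % (time.time() - t0))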


But in this use-case, you don't want to pick a VM; you want each task to specify a VM. A docker container is, effectively, an app that bakes in whatever VM it requires.

I can see the argument for managing workload definitions as an {app repo, VM descriptor} tuple—kind of like Heroku app definitions—rather than a single app+VM image blob. But when you're actually deploying them to run on machines, nothing beats baking everything into a reproducible static-binary-equivalent slug.
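
Something like this is what I mean by the tuple vs. the slug (hypothetical pseudo-config; all names are made up):

    # The {app repo, VM descriptor} tuple: app and runtime named separately.
    workload_as_tuple = {
        "app": "git://example.com/wordcount.git#abc123",
        "runtime": "jvm-1.8",
    }

    # The slug: app + VM baked into one immutable image reference.
    workload_as_slug = {
        "image": "registry.example.com/wordcount:1.2.3",
    }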

You don't need to let people run arbitrary docker images as workloads, but the ability to let people pick any VM—rather than just your preferred VM—while having this be just as safe and sandboxed, opens up a lot of flexibility.

(This is the same argument I give for Chromium's PNaCl, by the way: the advantage doesn't come from your ability to run one-off native apps in the browser; the true advantage is in <script>s that can specify exactly the interpreter engine—a PNaCl-binary VM—they need. Nobody seems to see eye-to-eye on this one, either.)


> Our feeling isn't that Java is bad; it's that it shouldn't be the only option for big data processing. There are a lot of people who want to do this type of computing, and a lot of them hate Java. We want to build a solution that serves them better, and Docker is an incredibly high-leverage way to do it.

Thank you.


> Our feeling isn't that Java is bad; it's that it shouldn't be the only option for big data processing. There are a lot of people who want to do this type of computing, and a lot of them hate Java.

JVM != Java. It runs a wide variety of languages, and this is not just theory - you can do Spark in Python right now. "Every language I care about" was perhaps overly flippant, but between Groovy, Clojure and Scala there's something to suit most programming tastes.
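
For example, the canonical word count in PySpark (input path is hypothetical):

    from pyspark import SparkContext

    # Python driving Spark's JVM machinery - no Java written anywhere.
    sc = SparkContext("local", "wordcount")
    counts = (sc.textFile("/data/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))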

Meanwhile I hate Linux, and Docker doesn't have a good answer for that; the suggestions I've seen for developing with Docker on other platforms boil down to "run Linux in a VM".

> Could you give an example of what could go wrong here? Docker prevents you from talking to the outside world so you can't call out to an outside service in the middle of a job or anything like that.

I'm more worried about forward compatibility and standardization - and partly attack surface, I guess. I can run jars from 15 years ago on today's JVM - and not because it bundles an old version. I can link today's programs against 15-year-old libraries and they'll just work; if a library I built against has disappeared from the internet, that's fine.

Meanwhile I can't run Linux binaries from 15 years ago (I did try with some of the Loki games - maybe it's possible, but it's decidedly nontrivial). Even if I found the right .so files, if I wanted to use a library with the ABI of ten years ago in my program, I'd have to rebuild the whole world. I guess I'm meant to rebuild my Docker image, but that doesn't work if the library source is no longer available, and I'm not at all confident that the tooling does fully reproducible builds to the same extent as e.g. Maven (from what I can see, plenty of Dockerfiles use unpinned versions).

Finally, the spec for the JVM is in one place, and it's published; Apache Harmony shows that a ground-up reimplementation is possible (that the TCK isn't open is very unfortunate, but the patents will expire sooner or later). I have a reasonable level of confidence that we'll always be able to run the jars we're producing today, without having to emulate x86s or what have you. Linux doesn't have a spec like that (yes, the C code is available), and I don't think a ground-up reimplementation would be feasible.

> In Pachyderm job scheduling happens outside the user submitted containers. The job scheduler is what actually spins up the containers and pushes data through them. I'm afraid I don't understand what it would even mean for one of those containers to then do job scheduling itself.

Job scheduling was a poor example; I guess I'm thinking more of things like running your own database on the cluster nodes instead of talking to one outside the cluster via a proper interface. (Admittedly that's possible on Hadoop with Derby or H2.)

Thinking about it more I guess it's less the service part and more the interface part that I'm worried about. A docker image could implement critical functionality with shell scripts. I bet they will. I've already spent too much of my life trying to get people to move away from shell scripts into something safer and more maintainable.


> between Groovy, Clojure and Scala there's something to suit most programming tastes

> the spec for the JVM is in one place, and it's published

Between Groovy, Clojure and Scala, there's something to suit most tastes on the scale of how much language specification there is: Scala has a published spec (along with many white papers on the reference implementation's internals), Clojure's spec lives in comments in its source code, and Groovy actively changes the spec between versions, preventing addons and alternative implementations from being built.


Sure, and I'd prefer people used the more specified options. But even if you have a Groovy library that you can't compile under modern Groovy, the bytecode will still run on new JVMs - more than you can say in the Docker case.


PySpark does not run Python code on the JVM. It uses py4j and it is slow as molasses. This has the benefit of supporting native Python libraries like NumPy. If you're not using such libraries you'd be better off using a JVM language, including Jython.
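
Here's the tradeoff in miniature (illustrative sketch; the data is made up):

    import numpy as np
    from pyspark import SparkContext

    # Each record has to be serialized out of the JVM to a Python worker
    # process (the overhead mentioned above), but in exchange the lambda
    # can call straight into NumPy's native code.
    sc = SparkContext("local", "norms")
    vectors = sc.parallelize([np.random.rand(100) for _ in range(1000)])
    print(vectors.map(lambda v: float(np.linalg.norm(v))).take(5))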


I'm not really sure why this post got downvoted, because I think there's some merit to it. In a spherical-cow universe there's an argument for opening things up to all tools, but in practice I think it makes good practices much harder to maintain, and makes jobs much harder to sanity-check for issues and failures at a glance.

Personally, I dig Spark on EMR or Spark on Mesos (with Scala as a domain language), and I'm not sure how this plays with the rest of the world.


I think Spark on YARN on Mesos, cohabiting with Docker on Mesos managed by Twill, is the future.

I just need to take a big breath every time I say it though.


Sounds like someone doesn't want to put in the work to learn Java - which I admit is a stone-cold bitch to work with (and I have done MR with PL/1G and Fortran 77 back in the day) - but you can write Hadoop jobs in Python.



