DARPA gives $3M to Continuum Analytics to improve Python for big data (drdobbs.com)
216 points by astaire on March 27, 2013 | 40 comments



More info from Continuum Analytics:

http://continuum.io/blog/continuum-wins-darpa-xdata-funding

The money is going to develop Blaze and Bokeh:

http://blaze.pydata.org/

https://github.com/ContinuumIO/Bokeh


I think the lack of a ggplot2 equivalent is one of the main things holding people back from fully converting from R to Python - I know it is for me. Once you really understand the ggplot2 syntax and style, making your graphics seems to come naturally. It's fantastic.

I'm extremely excited to see Bokeh develop more.


I've found matplotlib [1] to be an intuitive and customizable way to make beautiful plots with Python.
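
For instance, a styled plot takes only a few lines (a minimal sketch with made-up data):

    import numpy as np
    import matplotlib.pyplot as plt

    # toy data: a noisy sine wave
    x = np.linspace(0, 10, 200)
    y = np.sin(x) + np.random.normal(0, 0.1, x.size)

    plt.plot(x, y, color="steelblue", linewidth=2, label="signal")
    plt.xlabel("time (s)")
    plt.ylabel("amplitude")
    plt.legend()
    plt.savefig("signal.png", dpi=150)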

[1]: http://matplotlib.org/


I see that Blaze is BSD-licensed... but Bokeh has no clear licensing information. Anyone know the intended Bokeh license?

(The Dr. Dobbs article doesn't mention licensing at all; the Continuum release implies the XDATA program has a focus on open source but doesn't specify more about Bokeh.)



Better D3 integration from within Python would be awesome.


Bokeh has three components: BokehJS, which runs in the browser; the Bokeh server; and the Bokeh Python client library.

BokehJS draws plots in the browser with canvas. It maintains a data model of plot objects (plots, renderers of different types, tools) with Backbone.js, and it has a connection to the Bokeh server via web sockets.

The Bokeh server keeps track of plot objects and other models with Redis, storing them persistently in documents.

The Python client connects to the server and either creates new plots or updates existing ones. You can create a plot with Python and get a URL to embed that plot in any web page. This means that your Python code can just be a script that doesn't have to worry about serving up a web page.
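
As a rough sketch of the intended workflow (illustrative only; the client API is still settling, so these names may change):

    import numpy as np
    from bokeh import plotting

    xs = np.random.rand(100)
    ys = np.random.rand(100)

    plotting.output_server("my_document")  # connect to a running Bokeh server
    plotting.scatter(xs, ys)               # plot objects are pushed over the wire
    plotting.show()                        # the plot now lives at an embeddable URL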

We intend to write client libraries in other languages, like Java, R, and Ruby, to name a few.


So... the server would maintain a set of grammar-of-graphics-like objects, which would be replicated as Backbone models? In theory we could use whichever graphics libraries we want on top of the Backbone layer, right?

Bokeh is starting to sound like everything I've ever wanted.


Why, is there ANY D3 integration with Python?



Blaze has the potential to be great. NumPy/SciPy (and previously, Numeric and numarray) are, at least to scientists, some of the most useful libraries in Python. Being able to express computation in a high-level language and have it run as native code, yet still access the data from Python, is often a major productivity enhancer.
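
For example, a single NumPy expression like the following runs its loops in compiled code rather than in the interpreter (a trivial sketch):

    import numpy as np

    a = np.random.rand(10 ** 7)
    # one high-level line; the element-wise loops execute in C
    result = np.sqrt(a) * 2.0 + 1.0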

I'm still a bit skeptical Blaze will catch on. It's ambitious, and the Python numeric community has had to chase each new implementation (with serious backwards-compatibility issues and adoption costs) over the years. I would actually far prefer NumPy and its limited functionality over a much more complex system without any adopters.


Having a bit of a look at the docs [1], I don't really see it as "yet another arrays package". The API exposes generalized table/array operations, and a subset of those features can be implemented using NumPy's ndarrays. If the interface is clean, then I don't think adoption will be an issue: there simply isn't anything else that does this.

You are absolutely right that both of these projects are extremely ambitious, but, given the $3 mil grant, I can't help but be optimistic.

[1] http://blaze.pydata.org/docs/overview.html#blaze-is-a-genera...


DARPA is also working closely with the Blaze team on several projects to enhance its functionality and performance, so it will continue to improve. I'm not aware of any money being directly invested, but the organization has a way of encouraging projects it finds promising.


This is great! Continuum Analytics are doing cool stuff for numeric / big data Python and it's great to see the language getting more and more traction in this domain.


Will this mean better scientific computing with Python?


Depends on what you call scientific computing. Continuum Analytics appears to have a data/stats focus (I now work as a statistician, and people speak well of their products, though I have never used them), and there I believe Python has the possibility to displace R and Octave/SciLab/MatLab as the language for light academic work. But for CFD (Computational Fluid Dynamics) and CEM (Computational Electrodynamics), people generally run on C, C++, and Fortran, or whatever runs numerical linear algebra fastest; I doubt Python will take that post anytime soon.


> for CFD (Computational Fluid Dynamics) and CEM (Computational Electrodynamics), people generally run on C, C++, and Fortran, or whatever runs numerical linear algebra fastest; I doubt Python will take that post anytime soon.

You are welcome to harbor your doubts. :) There will continue to be a place for scientists who want to write software that takes advantage of the intricacies of low-level memory transfer. (Fundamentally, this is the control that C gives you.)

However, as hardware becomes more heterogeneous (Xeon Phi, GPUs, SSD/hybrid drives, Fusion-io, 40Gb/100Gb interconnects, etc.), it's going to get harder and harder for a programmer to acquire the knowledge required to truly optimize data transfer and compute within a single node or across a cluster, and still have any time left to do real science. Newer approaches to software development are needed to maximize the potential of all this great new hardware while minimizing the pain and the level of knowledge required of the programmer trying to use it.

The key saving grace is that most of these hardware innovations are really geared toward data-parallel problems, and it just so happens that parallel data structures are also easy for scientists and analysts to reason about. The goal of Blaze is to extend the conceptual triumphs of FORTRAN, MATLAB, APL, etc. and build an efficient programming model around this data parallelism.

If you look at what's already been achieved with NumbaPro (the Python-to-CUDA compiler) in just a few months' work, I think the future is very promising.
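
To make that concrete, here is a minimal sketch using the open-source Numba JIT (decorator options have varied across versions; NumbaPro layers the CUDA backend on top of this):

    from numba import jit

    @jit  # compiled to native machine code the first time it's called
    def euclidean(x, y):
        total = 0.0
        for i in range(x.shape[0]):
            d = x[i] - y[i]
            total += d * d
        return total ** 0.5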

The underlying premise for all of Continuum's work is: "Better performance through appropriate high-level abstractions".


I work with Python, but mainly for light statistics (NumPy, SciPy, pandas, and some other libs); SAS is used for large datasets (mainly because that's the policy of the bank I work at).

I do not work with CFD anymore; I left the field 5 years ago and returned to Brazil. CUDA/OpenCL was just a promise back then: it was fast, but no one had included the technology in their solvers.

The problem for CFD is that some simulations can take weeks to run; using Python there would just add more weeks because of the language's overhead. If time or energy consumption is a problem, people will just stay with what they have right now. For small stuff people use whatever they want (I used Python back then; these days they use OpenFOAM), but for performance it will be C, Fortran, or C++ compiled with the best optimizing compilers for at least a decade more. In these large problems, data transport and storage was also a big problem.

Generally people use C++ these days; Fortran is only important in old codebases (although Coarray Fortran was a buzzword, just like CUDA, when I left the field).


I doubt that you will find it easy to beat well-written numeric Python code with a naive C, C++, or Fortran approach.

Python is not inherently slower than C, C++ or Fortran. It is just a language after all. If you have a fast implementation of a computational engine, like NumPy, you can reach the speed of C++ in computational tasks. If you have an optimizer that simplifies your math expressions before running them, you can do better than a naive C++ approach. An optimizer and computational engine that optimally offloads work from the CPU to the GPU can give an order-of-magnitude advantage over naive C++/BLAS code.

And to give you a concrete example, there is a nice Python library called Theano that does just that.


>I doubt that you will find it easy to beat well-written numeric Python code with a naive C, C++, or Fortran approach.

Huh? People do it all the time. Naive C/C++ code (as long as it doesn't leak memory or botch the algorithmic complexity) beats Python hands down, and can be more than 10-20 times faster.

>Python is not inherently slower than C, C++ or Fortran. It is just a language after all.

An interpreted language, with a not-that-good interpreter and garbage collector, non-primitive integers, and other such things holding it back.

>If you have a fast implementation of a computational engine, like NumPy, you can reach the speed of C++ in computational tasks.

That's because the "fast implementation" is NOT written in Python.


Let's get down to a concrete example and consider a typical numeric problem, for example estimating the cross-entropy gradient of some function on your data.

With Python (and the Theano computational engine) you can write something along these lines:

    from theano.tensor import dot, exp, grad, log, mean

    def cost(goal, p):
        # cross-entropy between targets and predicted probabilities
        crossEntropy = -goal * log(p) - (1 - goal) * log(1 - p)
        return mean(crossEntropy)

    p = 1 / (1 + exp(-dot(x, w) - b))   # predicted probability
    prediction = p > 0.5                # hard 0/1 prediction
    gradW, gradB = grad(cost(y, p) + (w ** 2).sum(), [w, b])
And then just apply that function to a matrix containing your data. That's it. When you apply it, the expression is interpreted and converted into a computation graph; that graph is optimized, parts of the computation are offloaded to the GPU with memory transfers between host and GPU taken care of, and you get the result back in a user-friendly and efficient NumPy array.

The resulting computation will be nearly optimal, limited by memory bandwidth, CPU-GPU bus bandwidth, and the GPU's FLOPS rate. With luck you can get close to the theoretical maximum of your GPU's floating-point performance. And all in a few lines of Python.

Now consider the same in C++. Yes, it can be done. But there are just no open-source libraries available that can do that. The closest open-source implementation that I know of is gpumatrix, a port of the C++ Eigen library to the GPU, and it doesn't even come close to what is available in Python. So with C++, if you want to match the performance of these few lines of Python code, good luck studying CUDA or OpenCL and implementing the computation engine right the first time.

(disclaimer) I'm not in any way affiliated with the OP, and I actually use (and like) C/C++ a lot.


NumPy uses LAPACK, no?

With vectorization and compiler optimizations*, I doubt pure Python (well, CPython running pure Python) would match Fortran and the state-of-the-art open-source solvers, like LAPACK for dense matrices and SuperLU or PaStiX for sparse ones. It's not so much the language itself; it's mainly that Python has more overhead than C or Fortran.

Maybe, as you said, Python can work well for solving a single linear system using the GPU. I worked with this before GPUs came to be used for that, so I can't comment on this one.

* Generally compilers can optimize mathematical expressions unless they involve complicated pointer arithmetic. This is one reason Fortran is still used in this area: Fortran does not have pointer arithmetic, so the compiler is free to optimize aggressively.


In numeric Python packages, computations are never done inside the CPython runtime; they're done in specialized kernels written in C, Fortran, and sometimes fine-tuned assembly, including the various BLAS/LAPACK flavors. So yes, NumPy can often be competitive with, or outperform, libraries written in lower-level languages.
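
For example (a sketch: np.linalg.solve dispatches to LAPACK, and SciPy's sparse spsolve uses SuperLU by default):

    import numpy as np
    from scipy.sparse import csc_matrix
    from scipy.sparse.linalg import spsolve

    A = np.random.rand(500, 500) + 500 * np.eye(500)  # well-conditioned system
    b = np.random.rand(500)

    x_dense = np.linalg.solve(A, b)       # dense: LAPACK under the hood
    x_sparse = spsolve(csc_matrix(A), b)  # sparse: SuperLU under the hood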


You assume all the CFD guys are writing shitty code. It's not like Python is a prerequisite to writing good code.


Yup, just gonna fire up an interpreted language on the cluster and see how my other buddies at the ole national lab react.


Lots of people do just that, e.g. http://www.training.prace-ri.eu/uploads/tx_pracetmo/pythonHP...

Maybe you should change buddies.


So you still link to functions in compiled-language code to do the actual legwork, and all threading has to be done in C extensions, or am I missing something?


Yes, you are. NumPy is much more than a wrapper around C libraries; if it were just that, it would exist in most languages, and yet few general-purpose programming languages have something like NumPy.

The whole point is that it gives you abstractions on top of those libraries, like broadcasting, ufuncs, fancy indexing, etc...
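
For instance (a quick sketch of broadcasting and boolean fancy indexing):

    import numpy as np

    a = np.arange(12).reshape(3, 4)
    centered = a - a.mean(axis=0)       # broadcasting: (3, 4) minus (4,)
    big_rows = a[a.sum(axis=1) > 20]    # fancy indexing with a boolean mask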


I saw a presentation about Blaze at PyData. The CEO of Continuum is the author of NumPy, and he said that in Blaze they are doing everything he would have done differently in NumPy.

Also, Blaze is attempting to abstract the concept of huge arrays distributed across multiple systems, clusters, clouds, whatever. So imagine having large distributed arrays that you can easily perform computations on. It could be very powerful if it works as planned.
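
In pseudocode, the idea is something like this (purely illustrative; none of these names are real Blaze API, which is still being designed):

    import blaze

    # hypothetical: arrays whose chunks live on several machines
    a = blaze.open("blaze://cluster/measurements")
    b = blaze.open("blaze://cluster/weights")

    # written like a NumPy expression, evaluated where the data lives
    result = (a * b).sum(axis=0)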


"If they can learn an easy language, they won't have to rely on an external software development group to complete their analysis."

Although scientists earn more credit here than business people, allow me to make a skeptical remark:

This is exactly the same reasoning they had in the past regarding SQL: "Business people can't wait for programmers to run their queries; if they can learn an easy language, they won't have to rely on programmers to do that for them..."

In the end, the business people did not do SQL; they just forced developers to do it for them, using SQL.

History repeats itself, perpetuating fallacies.


Many business people do write SQL. Typically in a poorly administered Access database, but they do write SQL.

Scientists also already use several easy languages (R, MATLAB, SAS, and increasingly Python) rather than relying on outside developers.


> In the end, the business people did not do SQL

False. There are literally hundreds of thousands of people using SQL every day to query and analyze their data who would not be able to write a for-loop or tell you what a pointer is. The same goes for R.

Just because the needs of business analytics have grown to the point where developers are needed to build extremely sophisticated things with SQL does not mean that simple SQL does not serve its original purpose beautifully.

The same is true for Python. You can do incredibly sophisticated things with it, but there are many, many scientists and data analysts who use it for simple things that wouldn't quite fit into a SQL mold but also don't require a deep knowledge of CS.


It would be really cool if part of this money went toward funding PyPy's NumPy effort (they still need $15K to be fully funded).


This will surely make me learn Python over Ruby.


The article calls it both a 'donation' and an 'investment' - which is it?


I'd guess 'grant' would be a better term than either for what's happening here. AFAIK, DARPA doesn't take equity, but does make grants based on intended deliverables.

('Investment' was used in the article in the general sense of "an investment in our children's future" or "investing in shared infrastructure", rather than expecting a specific liquid return like cash dividends or an appreciated exit event.)


We are working to build open source software for them that meets their needs. I think they are making an investment in the sense that any customer is invested in the outcome of a project. It wasn't a VC investment.


Since DARPA can't take equity in companies, it doesn't really matter. It's a donation in that the department will get nothing in return; it's an investment in that it will ideally make America a better place. Think of everything DARPA does as both a donation and an investment.


All I see in these libraries are standard data types that are already available in Python. Are these libraries even being developed by mathematicians? It looks like someone hired a bunch of first-year computer science students to churn out wrappers.


There are many new data types introduced by these libraries, distinct from the standard library's types. The N-dimensional array is the most notable one.



