And there is nothing wrong with C++. For linear algebra I use the Armadillo library, and it's really a nice wrapper around LAPACK and BLAS (and fast!). For some reason scientists are somewhat afraid of C++, as if you "have to" prototype in an "easier" language. Sure, you can't use C++ as a calculator the way you can an interpreted language, but I see people getting stuck with their computations in the prototyping language and never bringing them to a faster platform.

Point being: C++ is not hard for scientific calculations.
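
For anyone curious what that looks like in practice, here is a minimal sketch with Armadillo (assuming it is installed and linked against some LAPACK/BLAS; the build line is just one possibility):

    // build (one possibility): g++ -O2 lstsq.cpp -o lstsq -larmadillo
    #include <armadillo>
    #include <iostream>

    int main() {
        arma::arma_rng::set_seed(42);

        // random over-determined system: A is 100x3, b is 100x1
        arma::mat A = arma::randn<arma::mat>(100, 3);
        arma::vec b = arma::randn<arma::vec>(100);

        // least-squares solution, handled by LAPACK under the hood
        arma::vec x = arma::solve(A, b);

        // expressions read much like the underlying math
        arma::vec r = A * x - b;
        std::cout << "||r||_2 = " << arma::norm(r, 2) << std::endl;
        return 0;
    }

The point is just that the code stays close to the math while the heavy lifting is delegated to whichever LAPACK/BLAS you linked.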




I agree with you. However, note that many people who are using Python for writing scientific code make use of C/C++ in one way or the other (aside from NumPy, SciPy, and Theano). For example, many people write the most intensive computations in C/C++/Cython if they call those functions frequently -- Python becomes a wrapper. One example that comes to mind is khmer (https://github.com/dib-lab/khmer).


This is a common refrain: drop down to C, C++, or Fortran for the computation-intensive parts. It works, but only to a degree. The inefficiencies lie in the vectorization semantics of the host language(s), which lead to extra copies and extra levels of indirection. So this dual-language mode of operation typically does not approach what one could have obtained had one disposed of the baggage entirely, except for I/O. Usually, in the quest for better speed, that is what remains as one moves progressively larger portions of the application into the C/C++/Fortran part of the code. A reason I like Julia is that I can largely avoid this dual-language annoyance and enjoy the succinctness of pithy vectorized expressions using https://github.com/lindahua/Devectorize.jl


The performance of C/C++ is terrible compared to GPUs, which are often 50x to 100x faster. The goal has shifted and continues to shift in that direction.

C/C++ is no longer best for speed, not even close.

Now, all that matters is which language has the libraries that make it easiest to get custom code onto the GPU. Python and Lua seem to be winning there, by far.


> Now, all that matters is which language has the libraries that make it easiest to get custom code onto the GPU. Python and Lua seem to be winning there, by far.

This is interesting. How is it possible that Python and Lua have more efficient wrappers around GPU libraries? Also, there are many GPU libraries for C/C++; Armadillo can use NVBLAS as a backend, for example. I'm not sure I get your point about C/C++ being slow.


It's not about wrapping. The real power is in the cross-compilation of expressions and entire complex data pipelines from a simple-as-possible high-level language into GPU code. That's the power at the core of, e.g., Theano.


Compiling high-level instructions to different hardware backends is hardly an exclusive feature of Python and Lua libraries. Google would swamp you with hits if you were to search.


Show me one C/C++ library that competes with Theano or Torch7?

Google / Facebook and many other huge companies are using Theano and Torch7 in production, at scale. The ML industry has been continuously moving in this direction for years now.

On these optimized ML systems, only a tiny fraction of CPU time is spent outside of the GPU. The goal in many of these companies is to migrate all tasks that can be done on GPUs to GPUs, as soon as possible. It's far faster and more cost efficient.


Do you have some sources demonstrating that Google and Facebook are using them in scaled production? My impression was that presently these were more for research and prototyping.

I would have thought that if you were going to run prod systems on the GPU, you would actually write CUDA (C++) or similar to avoid the inefficiency of the abstraction layer.

(also, this comment is bordering on the uncivil).


> show me ...

I can only bring the horse to the water (or Google, as it's sometimes called these days) :)

>Google/Facebook and many other huge companies ... You are totally wrong.

That totally settles it then, thank you; who am I to argue, and surely there are no ML jobs that spend time outside of the GPU.


Did you just compare a language with a piece of hardware? I hope you realize:

(i) how nonsensical such a comparison is

(ii) there are many algorithms for which the GPU offers no speedup at all; in fact, the data transfer can actually hurt. There are instances where using the CPU's SIMD instructions makes more sense than a GPU.


You don't have to use the vectorization though. I prefer to work with Cython and just use pointers. You don't get to write vector + vector --- you have to actually write the loop --- but that's not such a big deal imo.


I am lazy and super prone to making off-by-one errors, so to me that is the _biggest_deal_ (TM) :)


Numba for Python also has a no-copy vectorization mode.


I believe S (ancestor of R) started as C glue at Bell Labs. I've heard it said a few times that R is slow, but that doesn't really make much sense if your bottleneck routines are R calls to C++.


I see this a lot in my lab (days/weeks to run computations and analysis that could take minutes/hours). I used to offer to help the postdocs/PhD students port their code from MATLAB to C/C++, but I've mostly given up on offering unless they ask for help, or unless it seems like fun. I also think it's because they mostly assume they won't work on these things again (which is kind of weird to think about, since they spent most of their life working to get to this point, but that's another conversation).

As an aside, I think Armadillo is pretty great, especially going back and forth with other matrix libraries. It's also a nice wrapper around OpenBLAS, which made not running something on a cluster more bearable.


I saw you mention Armadillo a few times; it seems you are having a lot of fun with it. Another somewhat look-alike is Eigen, which is pretty nice too. These are all modern takes on the original, Blitz++, which is actually more full-featured in what it can represent than Eigen, Armadillo, uBLAS (by Boost), and Blaze: it supports multidimensional arrays as opposed to just (2D) matrices. Where Blitz++ falls short of the competition is full use of SIMD instructions. However, g++ does a plenty good job of vectorization, though the king of the hill is still Intel's compiler.


I am not sure about Intel's compiler, but OpenBLAS is pretty much on par with Intel MKL in most benchmarks I have seen recently, e.g. [1][2].

The thing I like about Armadillo is that you can just pick whichever BLAS implementation is fastest on your platform and run with it, while still staying high-level compared to raw BLAS.

[1] http://gcdart.blogspot.co.uk/2013/06/fast-matrix-multiply-an...

[2] https://github.com/tmolteno/necpp/issues/18
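
To illustrate the backend-swapping point above (a sketch, not a universal recipe -- the link lines and library names below are assumptions that vary by system): the same Armadillo source can be built against whichever BLAS/LAPACK you have installed, since they all export the standard symbols.

    // Same source, different BLAS picked at link time. One common approach is
    // to skip Armadillo's run-time wrapper and link a BLAS/LAPACK directly
    // (illustrative link lines; exact library names vary by system):
    //   g++ -O2 -DARMA_DONT_USE_WRAPPER gemm.cpp -o gemm_openblas -lopenblas
    //   g++ -O2 -DARMA_DONT_USE_WRAPPER gemm.cpp -o gemm_netlib -lblas -llapack
    #include <armadillo>
    #include <chrono>
    #include <iostream>

    int main() {
        const arma::uword n = 2000;
        arma::mat A = arma::randu<arma::mat>(n, n);
        arma::mat B = arma::randu<arma::mat>(n, n);

        auto t0 = std::chrono::steady_clock::now();
        arma::mat C = A * B;   // dispatches to dgemm of whatever BLAS was linked
        auto t1 = std::chrono::steady_clock::now();

        std::cout << "dgemm " << n << "x" << n << ": "
                  << std::chrono::duration<double>(t1 - t0).count()
                  << " s, trace = " << arma::trace(C) << std::endl;
        return 0;
    }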


> you can just pick the BLAS implementation that is the fastest on your platform and run with it

Just in case this is useful info, Eigen and Blaze work the same way; Blitz++ doesn't. Blaze seems to be doing the best in the benchmarks.

> OpenBLAS is pretty much on par with Intel MKL

I won't be surprised. I have seen ATLAS beat MKL on some BLAS functions in my runs. Sparse matrix multiply is particularly badly implemented in MKL (at least it was). One could beat their sparse multiply by using multiple instances of their own sparse matrix-vector multiplies.

But the reason I mentioned ICC is that there is much more to SIMD than just linear algebraic operations. Non-linear operations that come up quite frequently in stats/ML are sin, cos, exp, log, tanh, etc. on vectors. GCC/G++ does a decent job now (the best part is that it emits info on why it wasn't able to vectorize a particular loop; this lets you restructure the loops to aid the compiler), but ICC still rules.
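
As a concrete (hypothetical) example of the kind of feedback being described: GCC's -fopt-info-vec-optimized / -fopt-info-vec-missed reports say which loops were vectorized and why the rest were not, so you can restructure them. Whether the exp call itself gets vectorized depends on flags and on a vector math library such as glibc's libmvec being available, so treat this as a sketch:

    // Ask GCC to report its vectorization decisions:
    //   g++ -O3 -march=native -ffast-math \
    //       -fopt-info-vec-optimized -fopt-info-vec-missed logistic.cpp -c
    #include <cmath>
    #include <cstddef>

    // Element-wise logistic transform: the kind of non-linear "vector op"
    // (exp, log, tanh, ...) that shows up constantly in stats/ML code.
    void logistic(const double* __restrict__ x, double* __restrict__ y,
                  std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            y[i] = 1.0 / (1.0 + std::exp(-x[i]));  // may vectorize via libmvec
        }
    }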


> Just in case this is useful info, Eigen and Blaze work the same way, Blitz++ doesn't.

I was under the impression that Eigen's external BLAS support only works with MKL?

> Enables the use of external BLAS level 2 and 3 routines (currently works with Intel MKL only)

Source: http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html

> But the reason I mentioned ICC is that there is much more to SIMD than just linear algebraic operations.

Indeed. I compiled some stuff with ICC, but it did not give me a tangible improvement over the latest GCCs. But for those projects, linear operations were the vast majority.
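
For what it's worth, per the Eigen page linked above, the MKL backend is switched on with a single preprocessor define; a minimal sketch (assumes MKL is installed and the application is linked against it, e.g. following Intel's link-line advice):

    // Route supported Eigen operations to Intel MKL.
    // The define must appear before any Eigen header is included.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(512, 512);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(512, 512);

        // Dense products like this are among the operations Eigen
        // can forward to MKL's BLAS.
        Eigen::MatrixXd C = A * B;

        std::cout << "C(0,0) = " << C(0, 0) << std::endl;
        return 0;
    }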


It's the standard BLAS API that you are linking to, so the vendor doesn't matter so much (in theory). In practice, the amount of annoyance you have to go through to get it linking correctly can vary a lot. Even MKL did not implement all of BLAS/LAPACK, and that would lead to some linking errors.

Re GCC, the same here. ICC is mighty expensive.


I know that BLAS is a standard API; I switch between BLAS implementations fairly often :). I suspected that Eigen relies on some non-BLAS functions in MKL, since they explicitly state that only MKL works.


> I switch between BLAS implementations fairly often

... that makes us co-sufferers then, or brothers in masochism :) if you will


Not so surprising that OpenBLAS beat MKL on an AMD processor, although it's definitely keeping up.


For machine learning / neural nets in C++ there are also Caffe (for ConvNets) and CNTK (for almost any type of network).


Writing scientific code is hard both in C++ and Python, but reading, manipulating, and visualizing data is way easier in Python. It's also easier to install and use third-party libraries.


It's unimaginable for me to do most of this stuff without a REPL. That's the show-stopper for me with C++.


Maybe there will be a working REPL for C++ in the future:

https://github.com/vgvassilev/cling
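
For reference, cling already gets pretty close; here is an illustrative sketch of a session (prompt and output formatting may differ between versions):

    $ cling
    [cling]$ #include <cmath>
    [cling]$ double x = std::sqrt(2.0);
    [cling]$ x        // an expression without a semicolon echoes its value
    (double) 1.4142136
    [cling]$ .q       // quit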



