SciPy - the embarrassing way to code

chromoose · on Nov 11, 2009

After spending most of the last night trying to model and fit a bunch of data with scipy, my main bottleneck was my understanding; working out what I needed to do. Once that was done, everything could be implemented in a single call. I don't think that's a particularly bad bottleneck at all.

There was some good discussion about this last time this was posted, and also current discussion over on reddit.

http://news.ycombinator.com/item?id=182676 http://www.reddit.com/r/Python/comments/a2xzj/the_embarrassi...

dkarl · on Nov 11, 2009

The problem is that programmers like to program. Make the programming part trivial, and we start to fear that we aren't needed, and that perhaps we could be replaced with domain experts :-O

hyperbovine · on Nov 11, 2009

I had the exact same experience recently while writing a dynamic programming simulation in Python. The naive (and elegant) solution to the problem consists of a 20-line recursive function which took me about an hour to write. Unfortunately function calls are very expensive in Python, and I my sim required a couple billion of them, so I started searching for ways to eliminate the recursion. 3 weeks and a deep foray into Numpy later and I had it.

The second implementation is about the same length and replaces all of the simplicity of the recursive solution with a bunch of five-dimensional arrays. Never before did I have to think as deeply about vectorized functions, array broadcasting, dimension matching, etc. The non-recursive version is about an order of magnitude faster, and what's better, using the numpy operators means nearly all of the processing is offloaded into low-level C routines, so the Python overhead is effectively nil. On the downside the code is extremely difficult to follow now... trying to visualize what's going on in 5 dimensions is a real pain.

DannoHung · on Nov 11, 2009

Might be able to save yourself some trouble if you can generalize it to an n-dimensional set of operations. Then again, that might not work for your situation.

Estragon · on Nov 11, 2009

  trying to visualize what's going on in 5 dimensions is a real pain.

Yes, that's been my main problem with the programming style described in the linked post.

rabidgnat · on Nov 11, 2009

I did some SciPy work in computer vision applications, and it was both a blessing and a curse. About 80% of the code could be reduced into matrix operations, which provided the 'embarrassing' moments that the article mentions. However, I'd hit something like Hysteresis Thresholding in the Canny Edge Detector, which traces through the image looking for faint edges, and spend a day figuring out how to write a Python extension against the Numpy API. This was necessary, because any code that iterated over 1000 x 1000 matrices in Python that I wrote took well over 10 seconds, instead of the < 100 milliseconds it should take.

As a nublet, I probably didn't end up saving much time over C++, but I could see that in the right hands, this tool would be a lifesaver

kurtosis · on Nov 11, 2009

scipy.weave solved a lot of these types of problems for me. It allows you to place inline c++ (with access to the blitz library) it does the trick for those cases where expressing something as a matrix operation would be wasteful.

ramanujan · on Nov 11, 2009

scipy still doesn't have a good submodule for doing manipulations with multivariate gaussians, which is a HUGE part of machine learning applications.

I'm referring to the analog of R's mvtnorm library.

Yes, you can use rpy2. Yes, you can roll your own dmvnorm (but be careful about the degenerate situations with zero eigenvalues that always arise in high dimensional problems with real data). It is significantly more of a pain to roll your own pmvnorm, because you are now talking about efficient computation of high dimensional integrals, which always seems to involve a trip to Numerical Recipes or the like (see the Genz and Miwa references when you do "?pmvnorm" in R).

Basically this is a huge thing to lack.

PS: yes, if you grep the codebase, you can find this: http://www.scipy.org/doc/api_docs/SciPy.stats.kde.gaussian_k...

But you need to write wrappers because it doesn't provide a basic interface to the multivariate gaussian density.

I could go on about other parts of scipy -- it has some advanced stuff but lacks a lot of basics...

SwellJoe · on Nov 11, 2009

The definition of "basics" depends on your domain. For the fluid dynamics and geological data that Enthought (the sponsor and primary developer of SciPy) works with, SciPy and NumPy provides all the basics.

One could argue that if you're going to call something "SciPy", it ought to cover all sorts of scientific computation...but, it is Open Source under a very open license. Contributing is very easy, and the team are very friendly to outsiders sending patches, and it's not hard to no longer be considered an "outsider". So, if you need it and SciPy doesn't provide it, why not develop it and contribute?

What I'm trying to say is that complaining about missing functions you need from an Open Source project just wastes your time and annoys the pig.

ramanujan · on Nov 11, 2009

Hey, without going into details I have contributed a fair amount of code to R.

Scipy (and the whole Python numerical computation stack) has nontrivial competition out there and here's what it needs to do to compete...if its developers care about adoption, which most do.

Definitely not saying scipy == crap, just that for stats/machine learning (which is a big percentage of a lot of applications today) it is not mature.

fhars · on Nov 11, 2009

"It's just a matrix" and all the comments sound like SciPy is some kind of APL for the rest of us, is that impression correct? There you too can achieve incredible conciseness once you manage to formulate you problem using some matrices and vectors with funny dimension. (And you get the benefit that the resulting code doesn't look too simple, so you get something to show off witth in return for your effort.)

gjm11 · on Nov 11, 2009

Depends on what you think of when you read "APL".

1. A language with matrices as a central data structure and a rich variety of operations on them. Yup, SciPy is this, kinda, though not to the same extent as APL.

2. A language with an extremely strange syntax, apparently engineered for terseness on the smallest possible scale. Nope, SciPy is basically just a bunch of Python libraries, and the language syntax is still Python's.

(There's more to SciPy than the matrix operations, but that's what the original poster was mostly talking about and what might prompt comparisons with APL.)

ynniv · on Nov 11, 2009

This is precisely why people are afraid of math - the more work you do, the smaller the end result. E = mc^2 is actually an approximation, but even the full set isn't very complicated, and imagine how difficult it is to get there. The other truth is that less code is generally faster, but not as small or as fast as a better algorithm, which is in turn not as small or as fast as a mathematical equation. The poster talks about moving across this space, which is actually a great accomplishment... but will probably get the best reception among great programmers or mathematicians.

ars · on Nov 11, 2009

E = mc^2 is not an approximation. What are you referring to?

danteembermage · on Nov 11, 2009

They might have been referring to relativistic effects. Wikipedia has E^2 - (pc)^2 = (mc^2)^2

http://en.wikipedia.org/wiki/Mass_in_special_relativity

ynniv · on Nov 11, 2009

danteembermage caught my gist. E = mc^2 is one degenerate case of a slightly more sophisticated equation. But, it's useful for many common situations, and looks great.

mhartl · on Nov 11, 2009

I once spent around ten weeks writing a C program to numerically integrate the Schrödinger equation for an electron in a particular magnetic field. When I started looking at results, they looked eerily like trig functions. A little digging revealed that the problem was stated and solved (exactly) in one of the exercises to Quantum Mechanics by Landau and Lifshitz.

I sheepishly told my advisor at the time what had happened, worried he would think me an idiot, but he was thrilled: "You solved the problem! That's what's important."

buugs · on Nov 11, 2009

I think the author is taking the wrong thing away from this, all those little realisations are things that if he comes across in the future he will have a better understanding of what to do.

In other words rather than become embarrassed after realising a months work boiled down to just a few lines think about how much you learned to get to those few lines.

markerdmann · on Nov 11, 2009

His use of the word "embarrassing" is tongue-in-cheek.

Tichy · on Nov 11, 2009

Just curious, can SciPy also automatically take advantage of cell processors or graphics processors? If it is all just matrix operations, it seems doable?

cdavid · on Nov 11, 2009

It cannot, at least not ATM and not out of the box.

I am by no mean knowledgeable in those areas, but my understanding is that using specialized hardware/aggressively optimized code in numpy (or similar tools) would be very hard, though. One of the problem is that CUDA or multi-core optimized numpy would have a significant overhead (obviously of a totally different nature), and would be justified only for very large problems. Multi-code numpy was tried a few years ago by E. Jones, and the overhead was almost never worths it IIRC.

The current solution is to use those through libraries which implement linear algebra, etc... , which can be implemented on top of those architectures. There is obviously quite a bit of excitement around those technologies, and this was one of the most talked area at las scipy conference. Video are available: http://blog.enthought.com/?p=184. Some of them are related to those