This is a useful walkthrough of optimization in Python.
> The lesson is clear: do not write Python code as you would do in C. Use numpy array operations rather than iterate on arrays. For me it meant a mental shift.
I think the lesson is that "performance-critical software requires attention to very many specific details." Using a garbage-collected language that allocates objects on the heap by default means you'll have to take special steps to change the system's default behavior. You'll see very similar behavior with C++, Java, etc.
> I didn't use Pypy so far because it only supports part of Numpy, and not Pandas, Scipy, and Sklearn. It may be that Pypy supports enough of Numpy to run my closest distance code though. If that's the case, then I'll blog about it soon.
I'll confess, I was eagerly awaiting the PyPy results. IMO, using PyPy gives me the best of both worlds: more of my code can be naive Python, with Python idioms, and still yield sufficient performance. For critical algorithms where I can't hit sufficient performance, it's appropriate to use cffi or something similar to optimize.
> The lesson is clear: do not write Python code as you would do in C. Use numpy array operations rather than iterate on arrays. For me it meant a mental shift.
That's true in C as well. If you're writing performance-critical code, you use things like BLAS and LAPACK (possibly even the optimized, often parallel versions supplied by your compiler vendor).
Thank you, you made PyPy move higher on my to-do list.
You have a good point about gc and memory layout. That's why I like C or C++ when writing high performance software. You can control memory locality more easily than with other languages.
I like Numba, but it's sometimes jarring writing C-like code in a language which almost explicitly advises against it. Once you get into the habit of "FOR LOOP BAD" it's difficult to get out of!
It's interesting to see the effect of Big-O here. Once you introduce a logarithmic solution, it doesn't really matter that you're using Python, because it's asymptotically so much better than the O(n^2) brute-force approach. Nearest neighbour is a particularly nice demonstration of this, because it's an algorithm that's frequently used in real time, and obviously you don't want a 2-second performance hit on every query!
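A quick way to see the gap is a minimal sketch on made-up random data, using scipy's cKDTree as the logarithmic structure (an illustration, not the code from the post):

import numpy as np
from scipy.spatial import cKDTree

# Made-up data: 100,000 reference points and a single query point.
rng = np.random.default_rng(0)
points = rng.random((100_000, 2))
query = rng.random(2)

# O(n) brute force: compute every distance, keep the smallest.
brute_idx = np.argmin(np.linalg.norm(points - query, axis=1))

# O(n log n) to build the tree once, then roughly O(log n) per query.
tree = cKDTree(points)
dist, tree_idx = tree.query(query)

assert brute_idx == tree_idx

The build cost is paid once, so repeated queries, which is the realtime case, get the asymptotic win.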
Note that Numpy has a built-in np.radians(), which is a little cleaner than a one-value lookup table:
In [2]: %timeit np.random.random((10000))*3.14
1000 loops, best of 3: 245 µs per loop
In [3]: %timeit np.radians(np.random.random((10000)))
1000 loops, best of 3: 262 µs per loop
Yeah, I've started using Haskell a lot lately, which doesn't really do for loops per se. It was pretty natural to use `map`, though, because I was plenty used to Numpy style (OK, really it's originally APL style, but Numpy is where I picked it up).
Numba is pretty nifty for things that are unnatural in this style, where an algorithm is better expressed in an index-dependent way, such as a numerical ODE solver. Sometimes you can get around it with tricks like convolution, but not always.
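As a hypothetical minimal example, an explicit-Euler solver for dy/dt = -y is the kind of step-depends-on-previous-step loop that resists vectorization but compiles nicely under Numba:

import numpy as np
from numba import njit

@njit
def euler(y0, dt, n_steps):
    # Each step depends on the previous one, so there is no clean
    # vectorized form; Numba compiles the loop to machine code instead.
    y = np.empty(n_steps + 1)
    y[0] = y0
    for i in range(n_steps):
        y[i + 1] = y[i] + dt * (-y[i])   # dy/dt = -y
    return y

result = euler(1.0, 0.001, 10_000)       # approximates exp(-t) on [0, 10]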
At first glance, it does not seem to provide an object for dealing with a finite set of points. Using it would require first transforming the track into a continuous line, then using project() and interpolate() on each waypoint to find the nearest point on that line. The issue is that this would not give me one of the original points of the track, which is what I needed.
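For reference, a sketch of that project()/interpolate() combination on made-up coordinates; it returns the nearest point on the line, not one of the track's vertices:

from shapely.geometry import LineString, Point

# Made-up track vertices joined into a continuous line.
track = LineString([(0, 0), (2, 1), (5, 3)])
waypoint = Point(3, 0)

# project() returns the distance along the line to the closest point;
# interpolate() turns that distance back into a point on the line.
nearest_on_line = track.interpolate(track.project(waypoint))

# That point generally falls between the original vertices, which is
# exactly the limitation described above.
print(nearest_on_line)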
I'm curious why that was. From the graph showing the RAAM data points, there's a region with large gaps between points, and wouldn't choosing only track points there give you one that's well away from the actual nearest location?
I've written some code to deal with this before (in my case, a distributed timer for cycle races, predicting current rider locations and gaps from sparse user reports). For me, finding the nearest point on the segments of the polyline worked, but choosing only vertices would have made the time estimates terrible. Were the points somehow more fixed than that, e.g. towns/motels on the route?
Good point. I'm not sure off the top of my head if Shapely has a specific function for what you're trying to do. It does have highly-optimized distance functions though. Might be interesting to compare those to your version.
One way I've approached this problem that seems pretty fast is to load a set of points into a PostGIS-enabled database, then use the `ST_Distance` function and order by distance.
[Edit]
Shapely solution:
from shapely.geometry import MultiPoint, Point

points = MultiPoint(((0, 0), (1, 1), (4, 4)))
point = Point((3, 3))
# sort the candidate points by their distance to the query point
sorted(((p, p.distance(point)) for p in points.geoms), key=lambda item: item[1])
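And a hedged sketch of the PostGIS suggestion above, assuming a hypothetical waypoints table with a geom column in SRID 4326 and psycopg2 for the connection:

import psycopg2

lon, lat = -105.0, 40.0   # made-up query location

conn = psycopg2.connect("dbname=tracks")   # hypothetical database
cur = conn.cursor()
cur.execute(
    """
    SELECT id, ST_Distance(geom, ST_SetSRID(ST_MakePoint(%s, %s), 4326)) AS d
    FROM waypoints
    ORDER BY d
    LIMIT 1;
    """,
    (lon, lat),
)
nearest_id, distance = cur.fetchone()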
An alternative solution that I discovered a year ago. My problem was that I had a list of 2D points lying in an image. I wanted to calculate the distance to the nearest point for every pixel - somewhat similar to a Voronoi diagram except I needed the distances as well as the cells.
When I asked on Stack Overflow, having tried KDTree, someone suggested a simple brute-force solution in OpenCL (using the pyopencl library):
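A minimal pyopencl sketch of that brute-force idea (not the original Stack Overflow answer; the image size, point data, and kernel here are made up for illustration):

import numpy as np
import pyopencl as cl

h, w = 512, 512                                    # made-up image size
pts = (np.random.rand(100, 2) * [w, h]).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void nearest(__global const float2 *pts, const int npts,
                      const int width, __global float *out)
{
    int gid = get_global_id(0);
    float x = gid % width;
    float y = gid / width;
    float best = FLT_MAX;
    for (int i = 0; i < npts; i++) {               // brute force over all points
        float dx = x - pts[i].x;
        float dy = y - pts[i].y;
        float d = dx * dx + dy * dy;
        if (d < best) best = d;
    }
    out[gid] = sqrt(best);
}
"""
prg = cl.Program(ctx, kernel_src).build()

mf = cl.mem_flags
pts_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=pts)
out = np.empty(h * w, dtype=np.float32)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

# One work item per pixel; each one scans every point.
prg.nearest(queue, (h * w,), None, pts_buf, np.int32(len(pts)),
            np.int32(w), out_buf)
cl.enqueue_copy(queue, out, out_buf)
dist_map = out.reshape(h, w)                       # per-pixel distance to the nearest point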
More seriously, are you unhappy with the default layout of developerWorks blogs? What change do you suggest? Not that I am in any way responsible for it, but I can certainly pipe your suggestions to the powers that be.
As long as you're taking suggestions: I would like to see bigger fonts, syntax-highlighted code, and maybe more color contrast between the sidebar/top/comments and the main content.
On a retina screen, 12px is practically unreadable, particularly in Helvetica Neue, which is quite a light font. To be honest, IBM's site has had tiny fonts for as long as I can remember, they just become more problematic as resolutions increase over time.
Maybe I should get into the habit of doing that, then. I've only ever felt the need to do it for one site before (Daring Fireball which uses an unbelievable 11px font!), and I've been put off in the past by the inferior behaviour of zoom as opposed to text size (which now seems to not even be available in Chrome). Minimum font size would be a good workaround if it didn't apply to the developer console, which is pretty crazy.
Note that you execute C code anyway when you use the standard Python implementation (CPython). The point is to use the best C code you can get with limited effort.
> The lesson is clear: do not write Python code as you would do in C. Use numpy array operations rather than iterate on arrays. For me it meant a mental shift.
This was actually one of the main motivators for me to learn Julia. No need to vectorize loops. Looping through things is idiomatic Julia code that the compiler knows what to do with.