Even that list is superfluous; the map function returns an iterator, so it's better if you just need one number at a time but possibly not the whole list (e.g. looping through the results until you find a certain value).
Nope, at least not on my computer. The map solution was ~50% slower than the list comprehension when run on a list with 1 million ints, and almost twice as slow when run on a list with 100k ints.
In fact, even using a for loop in the obvious way was slightly faster than map with 1 million elements (about halfway between map and the list comprehension, performance-wise).
map without the final list() call is, however, very slightly faster than the equivalent generator expression.
That is interesting. It's possible that they have done some optimisations on loops since the last time I measured it, but that was back in Python 2.6 (I think), when map was much faster than a loop (no comprehensions back then), and it was particularly obvious at very high numbers of input elements.
Also, from your description I suspect that the list() call over a generator might introduce extra overhead, rather than map itself...
Your last comment seems reasonable. I guess the new rule of thumb should be: map is still fast, but only if you want an iterator. If you need a list, use a list comprehension.
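For anyone who wants to reproduce a comparison like this, here is a rough sketch (the sizes and the `+ 1` operation are placeholders; absolute numbers will vary by machine and Python version):

import timeit

setup = "data = list(range(1_000_000))"

# Materialise the full result two ways and compare.
print(timeit.timeit("[x + 1 for x in data]", setup=setup, number=10))
print(timeit.timeit("list(map(lambda x: x + 1, data))", setup=setup, number=10))
# The lazy version: map() alone builds no list at all.
print(timeit.timeit("map(lambda x: x + 1, data)", setup=setup, number=10))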
Last I checked, accessing global variables is slow in Python because every lookup goes through the module's dict (a hash table). On the other hand, the interpreter is able to optimize access to local variables, which live in fixed slots on the frame.
I suspect wrapping the benchmark code (and all the variables) in a function, and then invoking that once, would substantially improve the raw Python numbers.
By the time I figured this out, I’d already written a pile of python code that was mysteriously >10x slower than it would be in perl 5.
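A sketch of how one might check this (the exact gap will depend on the Python version and the loop body):

import timeit

setup = """
counter = 0

def bump_global():
    global counter
    for _ in range(100_000):
        counter += 1          # LOAD_GLOBAL/STORE_GLOBAL: dict lookups on every iteration

def bump_local():
    n = 0
    for _ in range(100_000):
        n += 1                # LOAD_FAST/STORE_FAST: indexed slots in the frame
    return n
"""

print(timeit.timeit("bump_global()", setup=setup, number=100))
print(timeit.timeit("bump_local()", setup=setup, number=100))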
Let's take a step back from the empirical measurements. It's always good to take measurements when considering performance but in this case I think the concepts are pretty simple. Simple enough that we can analytically predict performance.
Pure python is slow. Dog slow. For the most part, its slowness is due to its power. It's so terribly powerful that very near nothing is determined statically. The next line of code we're about to execute is "i = i + 1". What sort of computation is that in terms of the computer itself -- registers, load/stores with memory? We have no idea: none! Until we get to that line and start to break it down. We can't even guess (or even bound) whether it does one operation or a billion operations. And not just the first time, either. Every time†! i could refer to something other than an integer this time, or you could have patched i's __add__(). Sometimes we capitalize on python's mega-dynamic features but in simpler cases like these loops we don't.
So we should expect to see vast differences between numpy and pure python because numpy's guts are implemented in an extension.
But I'd love it if this article had gone a little deeper. What are some specific details about why we see some differences here? Does one of these versions suffer more heap fragmentation? Are the differences for bigger matrices dominated by cache misses? What's the difference in the disassembly between the for/while/comprehension loops?
† this is technically an implementation detail of CPython IIRC because tracing JIT optimizers like PyPy are somehow able to reason that this can't happen (or trap the cases when it could) and make this go faster when we don't use dynamic features.
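You can watch this with the `dis` module: the bytecode for `i = i + 1` is one generic add opcode whose meaning is only resolved at run time (opcode names differ between CPython versions, e.g. BINARY_ADD vs. BINARY_OP):

import dis

def step(i):
    i = i + 1   # LOAD_FAST, load the constant 1, generic add, STORE_FAST
    return i

# The add opcode dispatches on the runtime types of its operands every time.
dis.dis(step)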
> For the most part, its slowness is due to its power. It's so terribly powerful that very near nothing is determined statically.
No. It's slow because the language is interpreted, not JITed. Powerful, highly-dynamic languages can be quite amenable to JIT compilation, especially if you're not using the features. Shape optimizations (assuming the code in question is going to be used more or less monomorphically) can convert name lookups to structure offsets. You can make code work with dynamic modifications by implementing deoptimization guards.
The CPython implementation explicitly favored simplicity of the interpreter over making it a competitively fast runtime.
I think it's especially odd that they only take the time for the actual addition of the arrays into account.
Initializing the arrays already takes way longer than adding them in a list comprehension. So if you're doing one-off additions of lists, converting them to np arrays is most likely not worth it.
Here are a couple of quick results I got with the same data setup as in the article:
List comprehension with zip: 4.748612998810131e-05s
Just building the np.arrays: 0.00013617933000205086s
Building and adding np.arrays: 0.00013746836000791518s
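A sketch of how such a measurement could look (not the exact code behind the numbers above, and the article's data setup may differ):

import random
import timeit
import numpy as np

x = [random.randrange(100) for _ in range(1000)]
y = [random.randrange(100) for _ in range(1000)]

n = 10_000
# Pure-Python elementwise add.
print(timeit.timeit(lambda: [a + b for a, b in zip(x, y)], number=n) / n)
# Just the list -> ndarray conversion.
print(timeit.timeit(lambda: (np.array(x), np.array(y)), number=n) / n)
# Conversion plus the vectorised add.
print(timeit.timeit(lambda: np.array(x) + np.array(y), number=n) / n)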
Well, given that they didn't even use pythonic constructs, I'm not quite sure what to think of the article:
In [1]: import random
In [2]: r = [random.randrange(100) for _ in range(100000)]
In [3]: x, y = random.sample(r, 1000), random.sample(r, 1000)
In [4]: %timeit z = [x[i] + y[i] for i in range(1000)]
106 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit z = [i + j for i, j in zip(x, y)]
67.3 µs ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(under python 3.6.3)
For those that "don't see it", what I'm seeing is that, instead of looping over a zip of iterables, like they should be doing, they're using an "index range" to access each element by index -- not the best practice, and also results in a noticeable slowdown.
Edit: And for the curious who might think they produce different results:
In [6]: z1 = [x[i] + y[i] for i in range(1000)]
In [7]: z2 = [i + j for i, j in zip(x, y)]
In [8]: z1 == z2
Out[8]: True
Final edit: all in all, I'd say this is a low-effort post, aimed at gathering attention and showing "look how good we are that we know how to speed up Python loops using numpy" (/s)...
The article is interesting but not for the face value.
On the surface, the comparison is rather weak because it is comparing loops in a dynamic language to optimized, low-level vector operations. Which is, you know, "whoa, low-level hand-optimized routines are faster than Python", or, as I said, weak.
The interesting bit is that Python cannot figure out that it could unroll or vectorize these, rather simple, operations. Surely there are languages with optimizers that can figure that out. Is that impossible to do in Python? Is that by design or by accident?
And, as many people said, if you are doing a lot of math in pure Python, you are doing it wrong. There are other, much more useful, ways to use loops. I like Python especially because it beautifully handles those "other cases".
> On the surface, the comparison is rather weak because it is comparing loops in a dynamic language to optimized, low-level vector operations. Which is, you know, "whoa, low-level hand-optimized routines are faster than Python", or, as I said, weak.
I'm not sure where you're trying to go with that, but comparing Python's interpreted, non-optimized speed to low level OMG optimized code is exactly the point of the article.
The comparison is very useful for anyone looking to increase the speed of code they've written in Python. This points out the limits of what can be achieved in pure Python, and where you'd need to look to improve things, even if it means rewriting the whole thing in a different language. From what you've said, it sounds like you were expecting an article on something different.
> The comparison is very useful for anyone looking to increase the speed of code they've written in Python.
Python has numerous mechanisms that are much more efficient than the solution in this article, like the zip, map and filter functions, as well as the excellent built-in libraries like itertools and functools.
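For example (a sketch, not code from the article), elementwise addition of two lists can be written entirely with built-ins:

import operator
from itertools import starmap

x = [1, 2, 3]
y = [10, 20, 30]

z1 = list(map(operator.add, x, y))           # map accepts multiple iterables
z2 = list(starmap(operator.add, zip(x, y)))  # the itertools flavour of the same thing
assert z1 == z2 == [11, 22, 33]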
So a conclusion would be to use numpy instead of for loops for number crunching.
But behold! This does not scale as one could expect.
Turning a slightly more complex elementwise computation into vectorized numpy calls produces a lot of memory overhead. For each vectorized numpy call, which does one basic computation (like adding two operands, negating, taking the sine, ...), you allocate a whole new array. The calls _can_ leverage parallel computation, but you run into cache issues because of the massive memory overhead.
A simple C-loop computing each element completely can be much more memory local and cache friendly. Turning it into a simple chain of numpy operations can destroy that.
I think using Numba to compile a ufunc for numpy can be a better approach to that problem.
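A sketch of what that looks like, assuming Numba is installed (the expression itself is made up for illustration):

import math
import numpy as np
from numba import vectorize

# One compiled ufunc for the whole elementwise expression, so no temporary
# arrays are allocated for the intermediate results.
@vectorize(["float64(float64, float64)"])
def fused(a, b):
    return math.sin(a) + a * b - b

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
out = fused(x, y)   # a single pass over the data, one output allocation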
> Turning a slightly more complex elementwise computation into vectorized numpy calls produces a lot of memory overhead. For each vectorized numpy call, which does one basic computation (like adding two operands, negating, taking the sine, ...), you allocate a whole new array.
I believe this is no longer correct, within a single expression. In the last couple of years things were fixed so that `a = x+y+z`, for instance, will not create an intermediate array.
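And even without relying on that elision, you can manage the buffers yourself via the ufuncs' `out=` parameter (a sketch; whether it actually helps depends on array sizes and cache behaviour):

import numpy as np

x, y, z = (np.random.rand(1_000_000) for _ in range(3))

# a = x + y + z with explicit buffer reuse: one allocation, no extra temporary.
a = np.add(x, y)        # allocates the result array once
np.add(a, z, out=a)     # accumulates into it in place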
def foo(y):
    def bar():
        l = []
        a = l.append          # bind the bound method once, skipping the attribute lookup in the loop
        for x in y:
            a(x + 1)
        return l
    return bar()
Obviously you can't use the `LIST_APPEND` primitive in a for loop. Also, note that CPython doesn't try to preallocate the list. `l = [None] * len(y)` and then iterating over a range might beat a comprehension for long lists.
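The preallocation variant, sketched next to the comprehension (whether it wins will depend on the list length and Python version):

def add_one_prealloc(y):
    out = [None] * len(y)       # preallocate so the list never has to grow
    for i in range(len(y)):
        out[i] = y[i] + 1
    return out

def add_one_comprehension(y):
    return [x + 1 for x in y]   # uses LIST_APPEND under the hood, growing as it goes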
The performance story with numba and numpy is basically that you have a boundary between Python and their internal representation of data, and crossing that boundary is slow.
For instance, in numpy, even if you specify a dtype, `np.array` must check and cast every element to that type. Likewise, `tolist` must construct new python objects for every value in the array.
Once you're dealing with ndarrays, though, operations are fast, since the array has a single type and the code is heavily optimized. But iterating over ndarrays from Python tends to be slower than iterating over native Python containers.
Numba is a mixed bag. It's no worse than numpy at handling ndarrays, and bad with Python containers. It has to inspect the arguments on every call to a JITted function, and that overhead can be far more surprising than numpy's.
But if you can combine many JITted functions, you can combine a lot of logic and completely avoid that introspection in a way you can't with numpy alone. The difficulty is simply that you end up reimplementing many numpy features in an early-binding language.
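Roughly what combining JITted functions looks like (a sketch with made-up function names, assuming Numba is installed):

import numpy as np
from numba import njit

@njit
def clamp(v, lo, hi):
    return min(max(v, lo), hi)

@njit
def clamp_and_sum(values, lo, hi):
    # Calls to clamp() stay inside compiled code; Python objects are only
    # touched when the arguments come in and the final scalar goes back out.
    total = 0.0
    for v in values:
        total += clamp(v, lo, hi)
    return total

data = np.random.rand(1_000_000)
print(clamp_and_sum(data, 0.25, 0.75))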
The 'premature optimization is the root of all evil' quote is from Knuth's 1974 paper "Structured Programming With Go To Statements", p. 268, second column, not "The Art of Computer Programming".
Sometimes optimization is not premature: when you have a known performance problem, have done proper profiling, and see that it's a tight loop (as usual).
Then you can shave off some execution time by using a slightly different construct, though bigger wins usually lie elsewhere.
That’s just optimization. Premature optimization is to use certain constructs that often make code more complex because you believe that they are faster.
If you've got a known performance problem it's not premature. Knuth's quote is the programming version of don't buy clothes for your baby before you've had sex for the first time.
There are some important distinctions though, which get lost if one takes the quote at face value without much context. For example, I think one should strive to avoid O(n^2) loops and similar unless one is absolutely positively certain n will forever be bounded to some sane value.
My point being that O(n^2) code can make the application completely unusable, which makes it something more than a pure optimization problem.
(As an aside, some asexual people, who never had sex and are not interested in ever having sex, are still interested in adopting, and would buy clothes for their baby before having sex for the first time.)
> The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
That was published the same year and month as the first paper, so I don't know which gets priority for any attributions.
You shouldn't use the mean when doing benchmarking. Better to use the median or fastest time. Lots of random things can happen on computers (usually in the OS) that can result in some operation taking 1000x longer.
Interesting point. The %timeit functionality actually used to output the mean of the x fastest runs; they seem to have changed that at some point. The docs still explain the old behavior [1].
I assume that they feel your concerns are addressed because they display the stddev together with the mean, so you can see if there were any extreme outliers.
If you're testing a single-threaded benchmark, then the test statistics aren't going to be meaningfully different in interpretation, especially if you're only asking the question "is A or B faster?" What's more important is that you capture enough runs to characterize the distribution well; if you have that, you'll get meaningful results no matter which statistic you're actually measuring.
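If you do want the median or the minimum instead of the mean, `timeit.repeat` plus the `statistics` module gets you there outside of IPython (a sketch):

import timeit
import statistics

runs = timeit.repeat("sum(range(1000))", repeat=7, number=10_000)
per_loop = [t / 10_000 for t in runs]   # seconds per loop for each run

print("min:   ", min(per_loop))
print("median:", statistics.median(per_loop))
print("mean:  ", statistics.mean(per_loop))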