Hacker News
Loops in Python – Comparison and Performance (duomly.com)
96 points by duomly on Aug 8, 2019 | 42 comments



Note that this article is missing the most pythonic implementation, which is

   z = [a + b for a, b in zip(x, y)]
This is also twice as fast as their fastest pure-python implementation on my machine.


Or even better, and likely faster for huge lists:

    z = list(map(sum, zip(x, y)))
Even that list call is superfluous; the map function returns an iterator, so it's better if you just need one number at a time but possibly not the whole list (e.g. looping through the results until you find a certain value).


> likely faster for huge lists:

Nope, at least not on my computer. The map solution was ~50% slower than the list comprehension when run on a list with 1 million ints, and almost twice as slow when run on a list with 100k ints.

In fact, even using a for loop in the obvious way was slightly faster than map with 1 million elements (about halfway between map and the list comprehension, performance-wise).

Map without the final list() call is, however, very slightly faster than the equivalent generator expression.


That is interesting. It's possible that they have done some optimisations on loops since the last time I measured it, but that was back in Python 2.6 (I think), and map was much faster than a loop (no comprehensions back then), and it was particularly obvious with a very high number of input elements.

Also, from your description I suspect that the list() call over a generator might introduce extra overhead, rather than map itself...


Your last comment seems reasonable. I guess the new rule of thumb should be: map is still fast, but only if you want a lazy iterator. If you need a list, use a list comprehension.
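
Roughly this distinction (a small sketch of my own):

    x, y = [1, 2, 3], [4, 5, 6]
    z_lazy = map(sum, zip(x, y))             # iterator: nothing computed yet
    first = next(z_lazy)                     # 5 -- values produced on demand
    z_list = [a + b for a, b in zip(x, y)]   # materializes the whole list at once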


Thank you, I was just thinking that the author ruined the credibility of the tests by not using pythonic code.


For performance purposes, is there something especially important about zipping the two arrays into one array of tuples before adding them?


Zip does not create an array; it creates an iterator which yields a tuple on each pass.


I seem to remember other analyses showing that the differences are consistent with the number of variable-name lookups.

The while loop looks up i, n, z, x, y in every iteration.

The for loop looks up i, z, x and y in every iteration.

The list comprehension looks up i, x and y in every iteration.

The numpy one doesn't look up anything.

If you check the proportions you'll find this is compatible with this guy's data.
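
For reference, these are the variants in question (my sketch of the article's loops, assuming x and y are equal-length lists and numpy is available):

    x = [1, 2, 3]
    y = [4, 5, 6]
    n = len(x)

    # while loop: looks up i, n, z, x and y on every iteration
    i = 0
    z = []
    while i < n:
        z.append(x[i] + y[i])
        i += 1

    # for loop: looks up i, z, x and y on every iteration
    z = []
    for i in range(n):
        z.append(x[i] + y[i])

    # list comprehension: looks up i, x and y on every iteration
    z = [x[i] + y[i] for i in range(n)]

    # numpy: no per-element name lookups in the interpreter
    import numpy as np
    z = np.array(x) + np.array(y)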


Veteran python hater here.

Last I checked, accessing global variables is slow in python because it goes through a hash table (or something) on every lookup. On the other hand, the interpreter is able to optimize accesses to stuff on the stack.

I suspect wrapping the benchmark code (and all the variables) in a method, and then invoking that once would substantially improve the raw python numbers.
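
Something like this rough sketch (my own setup, not the article's exact code):

    import timeit

    def bench(x, y):
        # x, y, z and i are all locals here, so lookups are LOAD_FAST
        # instead of dict lookups in the module globals
        z = []
        for i in range(len(x)):
            z.append(x[i] + y[i])
        return z

    x = list(range(100_000))
    y = list(range(100_000))
    print(timeit.timeit(lambda: bench(x, y), number=100))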

By the time I figured this out, I’d already written a pile of python code that was mysteriously >10x slower than it would be in perl 5.


> On the other hand, the interpreter is able to optimize accesses to stuff on the stack.

That is correct – there is no hash lookup for local variables and the variables in a list comprehension are local.


I'd also like to see benchmarks where z is pre-allocated to the right size and, instead of append, you're just doing z[i] = x[i] + y[i].
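
i.e. something like this (a hypothetical variant, not one the article tests):

    x = list(range(1000))
    y = list(range(1000))
    n = len(x)
    z = [0] * n                  # pre-allocated to the right size
    for i in range(n):
        z[i] = x[i] + y[i]       # index assignment instead of append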


Let's take a step back from the empirical measurements. It's always good to take measurements when considering performance but in this case I think the concepts are pretty simple. Simple enough that we can analytically predict performance.

Pure python is slow. Dog slow. For the most part, its slowness is due to its power. It's so terribly powerful that very near nothing is determined statically. The next line of code we're about to execute is "i = i + 1". What sort of computation is that in terms of the computer itself -- registers, load/stores with memory? We have no idea: none! Until we get to that line and start to break it down. We can't even guess (or even bound) whether it does one operation or a billion operations. And not just the first time, either. Every time†! i could refer to something other than an integer this time, or you could have patched i's __add__(). Sometimes we capitalize on python's mega-dynamic features but in simpler cases like these loops we don't.

So we should expect to see vast differences between numpy and pure python because numpy's guts are implemented in an extension.

But I'd love it if this article had gone a little deeper. What are some specific details about why we see some differences here? Does one of these versions suffer more heap fragmentation? Are the differences for bigger matrices dominated by cache misses? What's the difference in the disassembly between the for/while/comprehension loops?

† this is technically an implementation detail of CPython, IIRC, because implementations with tracing JITs like PyPy are somehow able to reason that this can't happen (or trap the cases when it could) and make this go faster when we don't use dynamic features.


> For the most part, its slowness is due to its power. It's so terribly powerful that very near nothing is determined statically.

No. It's slow because the language is interpreted, not JITed. Powerful, highly-dynamic languages can be quite amenable to JIT compilation, especially if you're not using the features. Shape optimizations (assuming the code in question is going to be used more or less monomorphically) can convert name lookups to structure offsets. You can make code work with dynamic modifications by implementing deoptimization guards.

The CPython implementation explicitly favored simplicity of the interpreter over making it a competitively fast runtime.


Yes, and I recommend trying numba for a mostly seamless JIT optimization.
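
For example (a minimal sketch, assuming numba is installed; the function and variable names are mine):

    import numpy as np
    from numba import njit

    @njit
    def add_arrays(x, y):
        z = np.empty_like(x)
        for i in range(x.shape[0]):
            z[i] = x[i] + y[i]
        return z

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = add_arrays(x, y)   # first call pays the compile cost, later calls run machine code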


Pitting numpy append against python append is an odd choice for a performance comparison.


I think it's especially odd that they only take the time for the actual addition of the arrays into account.

Initializing the arrays already takes way longer than adding them in a list comprehension. So if you're doing one-off additions of lists, converting them to np arrays is most likely not worth it.

Here are a couple of quick results I got with the same data setup as in the article:

  List comprehension with zip:   4.748612998810131e-05s
  Just building the np.arrays:   0.00013617933000205086s
  Building and adding np.arrays: 0.00013746836000791518s
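
Measured along these lines (my own sketch with timeit; the list size here is a guess, so the absolute numbers will differ):

    import random
    import timeit
    import numpy as np

    x = [random.randrange(100) for _ in range(1000)]
    y = [random.randrange(100) for _ in range(1000)]

    n = 1000
    t_comp  = timeit.timeit(lambda: [a + b for a, b in zip(x, y)], number=n) / n
    t_build = timeit.timeit(lambda: (np.array(x), np.array(y)), number=n) / n
    t_both  = timeit.timeit(lambda: np.array(x) + np.array(y), number=n) / n
    print(t_comp, t_build, t_both)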


Precisely my thought :).

Also, what about the builtin array module? https://docs.python.org/3.7/library/array.html


Well, given that they didn't even use pythonic constructs, I'm not quite sure what to think of the article:

    In [1]: import random
    In [2]: r = [random.randrange(100) for _ in range(100000)]
    In [3]: x, y = random.sample(r, 1000), random.sample(r, 1000)
    In [4]: %timeit z = [x[i] + y[i] for i in range(1000)]
    106 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    In [5]: %timeit z = [i + j for i, j in zip(x, y)]
    67.3 µs ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(under python 3.6.3)

For those that "don't see it": instead of looping over a zip of the iterables, like they should be doing, they're using an "index range" to access each element by index -- not best practice, and it also results in a noticeable slowdown.

Edit: And for the curious who might think they give different results:

    In [6]: z1 = [x[i] + y[i] for i in range(1000)]
    In [7]: z2 = [i + j for i, j in zip(x, y)]
    In [8]: z1 == z2
    Out[8]: True
Final edit: all in all, I'd say this is a low-effort post, aimed at gathering attention and showing "look how good we are that we know how to speed up Python loops using numpy" (/s)...

And I've successfully been python-nerd-sniped.


> builtin array module?

It's not for arithmetic but for packing large amounts of data in memory efficiently. I've never seen it used, since NumPy is usually easier.
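
For example (a quick sketch):

    from array import array

    a = array('d', [1.0, 2.0, 3.0])   # compact, typed storage of C doubles
    b = array('d', [4.0, 5.0, 6.0])
    # no vectorized arithmetic; you still loop in Python
    c = array('d', (ai + bi for ai, bi in zip(a, b)))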


The article is interesting, but not at face value.

On the surface, the comparison is rather weak because it is comparing loops in a dynamic language to optimized, low-level vector operations. Which is, you know, "whoa, low-level hand-optimized routines are faster than Python", or, as I said, weak.

The interesting bit is that Python cannot figure out that it could unroll or vectorize these rather simple operations. Surely there are languages with optimizers that can figure that out. Is that impossible to do in Python? Is that by design or by accident?

And, as many people said, if you are doing a lot of math in pure Python, you are doing it wrong. There are other, much more useful, ways to use loops. I like Python especially because it beautifully handles those "other cases".


that would be https://pypy.org/ I guess


> On the surface, the comparison is rather weak because it is comparing loops in a dynamic language to optimized, low-level vector operations. Which is, you know, "whoa, low-level hand-optimized routines are faster than Python", or, as I said, weak.

I'm not sure where you're trying to go with that, but comparing Python's interpreted, non-optimized speed to low level OMG optimized code is exactly the point of the article.

The comparison is very useful for anyone looking to increase the speed of code they've written in Python. This points out the limits of what can be achieved in pure Python, and where you'd need to look to improve things, even if it means rewriting the whole thing in a different language. From what you've said, it sounds like you were expecting an article on something different.


> The comparison is very useful for anyone looking to increase the speed of code they've written in Python.

Python has numerous mechanisms that are much more efficient than the solution in this article, like the zip, map and filter functions, as well as the excellent built-in libraries like itertools and functools.
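
For the article's element-wise addition, for instance (a sketch):

    from operator import add

    x = [1, 2, 3]
    y = [4, 5, 6]
    z = list(map(add, x, y))    # [5, 7, 9]; map takes multiple iterables, no explicit zip needed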


So a conclusion would be to use numpy instead of for loops for number crunching.

But behold! This does not scale as one could expect.

Turning a little more complex elementwise computation into vectorized numpy calls produces a lot of memory overhead. For each vectorized numpy call, which does a basic computation (like adding two operands, or negating, or taking the sine of, ...), you allocate a whole new array. The calls _can_ leverage parallel computation, but you run into cache issues because of the massive memory overhead.

A simple C loop computing each element completely can be much more memory-local and cache-friendly. Turning it into a simple chain of numpy operations can destroy that.

I think using Numba to compile a ufunc for numpy can be a better approach to that problem.
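
e.g. something like this with numba's vectorize decorator (my sketch; the expression is made up, and it assumes numba is installed):

    import math
    import numpy as np
    from numba import vectorize

    @vectorize(["float64(float64, float64)"])
    def combined(a, b):
        # the whole element-wise expression compiles to a single pass over the data,
        # with no intermediate arrays allocated
        return math.sin(a) + a * b - b

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = combined(x, y)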


> Turning a little more complex elementwise computation into vectorized numpy calls produces a lot of memory overhead. For each vectorized numpy call, which does a basic computation (like adding two operands, or negating, or taking the sine of, ...), you allocate a whole new array.

I believe this is no longer correct, within a single expression. In the last couple of years things were fixed so that `a = x+y+z`, for instance, will not create an intermediate array.


This is also what the 'numexpr' library does.
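
i.e. roughly (a sketch, assuming numexpr is installed):

    import numexpr as ne
    import numpy as np

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = np.random.rand(1_000_000)
    # evaluates the whole expression in blocked chunks, without full-size temporaries
    result = ne.evaluate("x + y + z")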


Isn't using numpy just an example of custom C code for a python bottleneck?


I’d be curious to see map with a lambda compared as well, and also numba used directly rather than through numpy.


In CPython, a list comprehension looks like this internally:

     dis.dis(lambda y: [x + 1 for x in y])
        1    0 LOAD_CONST               1 (<code object <listcomp> at 0x106f441e0, file "<ipython-input-2-1f82f6afa2d5>", line 1>)
              2 LOAD_CONST               2 ('<lambda>.<locals>.<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_FAST                0 (y)
              8 GET_ITER
             10 CALL_FUNCTION            1
             12 RETURN_VALUE

    Disassembly of <code object <listcomp> at 0x106f441e0, file "<ipython-input-2-1f82f6afa2d5>", line 1>:
       1      0 BUILD_LIST               0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                12 (to 18)
              6 STORE_FAST               1 (x)
              8 LOAD_FAST                1 (x)
             10 LOAD_CONST               0 (1)
             12 BINARY_ADD
             14 LIST_APPEND              2
             16 JUMP_ABSOLUTE            4
        >>   18 RETURN_VALUE
You get very similar disassembled code to this:

    def foo(y):
        def bar():
            l = []
            a = l.append
            for x in y:
                a(x + 1)
            return l
        return bar()
Obviously you can't use the `LIST_APPEND` primitive in a for loop. Also, note that CPython doesn't try to preallocate the list. `l = [None] * len(y)` and then iterating over a range might beat a comprehension for long lists.

The performance story with numba and numpy is basically that you have a boundary between Python and their internal representation of data, and crossing that boundary is slow.

For instance, in numpy, even if you specify a dtype, `np.array` must check and cast every element to that type. Likewise, `tolist` must construct new python objects for every value in the array.

Once you're dealing with ndarrays, though, operations are fast as the array has a single type and the code is heavily optimized. But iterating over ndarrays from python tends to be worse than native Python containers.

Numba is a mixed bag. It's no worse than numpy at interpreting ndarrays, and bad with Python containers. It has to inspect arguments to call a JITted function, and this can be far more surprising than numpy.

But if you can combine many JITted functions, you can combine a lot of logic and completely avoid introspection in a way you can't do with numpy alone. The difficulty of doing that is simply that you end up reimplementing many of the numpy features in an early-binding language.


If anyone does this, add some pypy numbers to it ;)


The 'premature optimization is the root of all evil' quote is from Knuth's 1974 paper "Structured Programming With Go To Statements" (p. 268, second column), not "The Art of Computer Programming".


Sometimes optimization is not premature: when you have a known performance problem, have done proper profiling, and see that it's a tight loop (as usual).

Then you can shave off some execution time by using a slightly different construct, though bigger wins usually lie elsewhere.


That’s just optimization. Premature optimization is to use certain constructs that often make code more complex because you believe that they are faster.


If you've got a known performance problem, it's not premature. Knuth's quote is the programming version of "don't buy clothes for your baby before you've had sex for the first time".

There are some important distinctions though, which get lost if one takes the quote at face value without much context. For example, I think one should strive to avoid O(n^2) loops and similar unless one is absolutely, positively certain n will forever be bounded to some sane value.

My point being that O(n^2) code can make the application completely unusable, which makes it something more than a pure optimization problem.


(As an aside, some asexual people, who never had sex and are not interested in ever having sex, are still interested in adopting, and would buy clothes for their baby before having sex for the first time.)


Sure. But that's not premature.

Bear in mind that the given quote is only the most memorable part of Knuth's longer statement:

> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

> Yet we should not pass up our opportunities in that critical 3%.

FWIW, I've read that Knuth attributes it to Hoare, and Hoare attributes it to Knuth.

Looking now, I see that Knuth also wrote a similar comment in "Computer Programming as an Art", at http://faculty.salisbury.edu/~xswang/Research/Papers/turning... :

> The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

That was published the same year and month as the first paper, so I don't know which gets priority for any attributions.


This comparison would benefit from showing numba (https://numba.pydata.org/) results for explicit loops.


You shouldn't use the mean when doing benchmarking. Better to use the median or fastest time. Lots of random things can happen on computers (usually in the OS) that can result in some operation taking 1000x longer.
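
With timeit, for example, you can take the minimum over several repeats (a sketch):

    import timeit

    # run the benchmark several times and report the fastest, which is least
    # affected by whatever else the OS happened to be doing
    times = timeit.repeat("sum(range(1000))", repeat=7, number=10_000)
    print(min(times) / 10_000)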


Interesting point. The %timeit functionality actually used to report the best of several runs rather than the mean; they seem to have changed that at some point. The docs still explain the old behavior [1].

I assume that they feel your concerns are addressed because they display the stddev together with the mean, so you can see if there were any extreme outliers.

[1] https://ipython.org/ipython-doc/dev/interactive/magics.html#...


If you're testing a single-threaded benchmark, then the test statistics aren't going to be meaningfully different in interpretation, especially if you're only asking the question "is A or B faster?" What's more important is that you capture enough runs to characterize the distribution well; if you have that, you'll get meaningful results no matter which statistic you're actually measuring.


TL;DR: Use Numpy! Numpy rocks!



