Python: Faster Way

perlgeek · on Dec 30, 2014

And this, ladies and gentlemen, is how you not make speed comparisons.

#7 surprised me: why would 'if a:' be slower than 'if not a:'? So I ran the code myself, and ran it a few times. And lo and behold, the third digit varies greatly between runs. And the timing between the first and second variant only differ in the third digit.

I ran the first and second variant ten times, and got these values:

first: 0.138 +- 0.05

second: 0.137 +- 0.04

So there is no significant difference in my timings, though individual runs differed as much as 0.132 vs. 0.150.

Though kudos for providing the full code and the raw output. It made quite easy to reproduce!

These are the timings I measured, in octave/matlab syntax:

    t1 = [0.132781028748, 0.140866041183, 0.13587808609, 0.138177871704, 0.137129068375, 0.150414943695, 0.138676166534, 0.137069940567, 0.13484787941, 0.136043071747];
    mean(t1)
    std(t1)

    t2 = [ 0.134783983231, 0.143576860428, 0.138277769089, 0.142880916595, 0.140572071075, 0.136868000031, 0.131128787994, 0.134744167328, 0.132730007172, 0.137500047684];

    mean(t2)
    std(t2)

thomasahle · on Dec 30, 2014

I was going to post this: If you do an average of `N` independent trials, your confidence interval is `O(1/sqrt(N))`.

Hence with the 10^6 trials done in the article, we must expect a couple of points deviation on the third digit.

Matumio · on Dec 30, 2014

Taking the average is a bad idea for performance benchmarks like this running under an operating system. The minimum is much more stable. (Easy to verify experimentally: do 10x10 runs of a short benchmark and compare the variance of 10 minima to the variance of 10 averages.)

stefantalpalaru · on Dec 30, 2014

True, but the minimum might also be dependent on the various caches being primed from previous runs of the same code - a condition that will not appear during normal use.

Matumio · on Dec 30, 2014

Caches being primed is a concern, but not when choosing between minimum and average. Neither is affected much if the first run takes longer than the other nine.

gcr · on Jan 9, 2015

IPython's %timeit magic works by running the function in three loops. It returns the minimum of the three averages. This seems like a good compromise.

perlgeek · on Dec 30, 2014

... and since you don't know how big the constant factor is, you still have to do the experiment a few times.

Also there might be factors (that you are not aware of) that differ between several executions of the same program (maybe address space randomization got extremely lucky or unlucky in one execution? Or your OS has a bug the prioritizes processes differently when the PID is a power of 2? Who knows?), or some background processes (cron?) woke up, so even if your number of in-process repetitions is sufficiently high, you still run the thing a few times and observe if the numbers are stable.

Matumio · on Dec 30, 2014

Don't forget CPU frequency, which may slowly step up during the first few runs.

stefantalpalaru · on Dec 30, 2014

You're not supposed to run benchmarks on anything but the 'performance' CPU governor.

ambrop7 · on Dec 30, 2014

Though, sometimes you want to benchmark with 'ondemand' just to see how the cpu scaling affects your code.

It is known that 'ondemand' may make multi-threaded programs run slower when run on multi-core system. The problem which I have myself observed and reproduced is when two threads are alternately scheduled to run - one thread runs for some time, then wakes up the other and goes to sleep, ad infinitum. With such a workload, the scheduler/governor may enter a situation where the workload moves from one CPU core to the other (*), with no core reaching enough average load to switch it to a higher frequency.

I can't find a reference for this, but it was easy to reproduce by writing a simple program with pthreads doing what is described above. My actual application where I encountered this problem consisted of an event loop thread and a worker thread doing computational tasks (encryption). The most interesting observation was how disabling one of the two CPU cores (yes Linux can do that at runtime) sped up my application and benchmark considerably.

Obi_Juan_Kenobi · on Dec 31, 2014

I don't think it's implied that all of the listed differences are significant. I think it's designed to show you that sometimes alternatives make a difference, and other times they simply do not.

It makes the page a good resource if you're wondering if doing something a certain way has a performance impact. If it's all the same to use {} vs. dict(), why not be in the habit of using the one that's a little faster?

Moreover, the tests are organized in a meaningful way. Tests 13 and 14 are quite interesting when seen right next to each other. Showing just test 13 could be deceptive, leading you to think that operators are faster than methods. Seeing 14 right afterward, however, shows you that isn't the case, and provides some clues as to why each works the way it does. It's just enough of a hint to make you curious about what's actually going on, which I really enjoy.

I don't think the fault is with the author, but with your assumptions about the intent.

You may argue that exposition would improve this, but I think the lack of explanation is not only elegant, but better at providing insight for the reader. I don't think there's a right answer here, though.

perlgeek · on Dec 31, 2014

> I don't think it's implied that all of the listed differences are significant. I think it's designed to show you that sometimes alternatives make a difference, and other times they simply do not.

The heading, "Python: faster way" doesn't suggest that to me.

And when I see comments like https://news.ycombinator.com/item?id=8814309 I feel I'm not the only one who came to conclusion that the first column was meant to be the fastest.

And how should the reader judge the significance of timing differences without knowing any variance or spreading of the measurements?

Maybe some short headings like "80% faster", "no significant difference" would make things clearer. Or color coding.

rhgraysonii · on Dec 30, 2014

A single read from disk destroys all of them

Source: http://www.dis.uniroma1.it/~demetres/didattica/ae/upload/lec... , slide 24

jhgg · on Dec 30, 2014

Examples b and c on test #6 is something you should NEVER do in Python. Using "is" checking for integers is a huge no-no, and will only work for smaller integers (I think between -5 and 256) that are cached by the interpreter. "is" does identity checking, not equality checking.

    >>> a = 500
    >>> b = 500
    >>> a is b
    False
    >>> a == b
    True
    >>> a = 2
    >>> b = 2
    >>> a is b
    True
    >>> a == b
    True

ayrx · on Dec 30, 2014

I honestly do not get the point of this page.

None of the time differences in all the test cases are significant at all. Concerning yourself with this is premature optimization of the highest level especially in a language like Python. One should definitely be more concerned about writing clearer and more idiomatic code.

vegabook · on Dec 30, 2014

I have to agree, though there is some information to be gleaned precisely because this guy has gone very far to show us that there is no point to try to get clever with alternative, non-idiomatic ways of doing things. Mind at ease to use the cleanest idioms.

It reminds me a bit of some of my big data jobs. Sometimes you spend days and days torturing data to find interesting aspects, only to find nothing and end in "failure". Only it isn't failure because you've eliminated a bunch of dead-end avenues that you might otherwise have wondered about, freeing you to pursue more promising uses for your resources.

jarcane · on Dec 30, 2014

This is mainly what I took from it. The faster way here is almost always the most idiomatic way to do it in the examples, and the reasons why are usually pretty obvious.

And while the individual differences are small, in a language already as slow as Python, starting from the faster, idiomatic code from the start is likely to save a lot of headache down the line.

nichochar · on Dec 30, 2014

If anything then he has proven exactly what you're saying, hence that it is OK to being explicit rather than try and be picky about these things, and that code readability can easily be a priority in python.

There are also a few exceptions (the update() method on a dict and list comprehensions).

As a junior python dev < me < senior python dev, this was very cool to read, also very quick to take in, kuddos on the presentation

beaumartinez · on Dec 30, 2014

Agreed, although it's interesting to see the disassemblies.

Python is a language about readability and conciseness first, performance second.

I would have liked to have seen `map` used in the list comp examples, as well as seeing the tests run on Python 3.

romanlevin · on Dec 30, 2014

It's just a reference, it's not making a value judgement about the different ways to evaluate the truthiness of a list or something.

simonh · on Dec 30, 2014

Someone popping up and trotting out the 'no premature optimization' argument every time an article is posted about optimization is getting very tiresome.

Not all optimization is premature. Sometimes you find that you do need to optimize, and then having to wade through a morass of comments complaining about premature optimization to track down actual relevant content and comment is a pain in the arse.

sweezyjeezy · on Dec 30, 2014

But his first point still stands - none of these optimisations are very important.

TickleSteve · on Dec 30, 2014

Two points:

- The time differences displayed here are just not significant on a desktop OS such as Windows or Linux. Maybe Windows decided to swap or collect your email (or more likely virus check) as the test was running? The error bars on these numbers will make them insignificant.

- When using Python, readability>>performance. If you want performance, use C; don't even try to optimise Python, its not worth the effort.

nine_k · on Dec 30, 2014

In Python, it's worth to optimize your algorithm, e.g. replace an O(n²) algorithm with a O(n log n) one, but trying to shave 0.1% of performance in a tight loop is a wrong approach. Instead, pick Numpy / Numba, Shedskin, etc down to plain C.

acveilleux · on Dec 30, 2014

I've written a few specific python apps where optimization at this level was worth contemplating in a couple spots after profiling (basically, tight loops). That said, replacing the algorithm has usually been a much smarter play.

wging · on Dec 30, 2014

There are a few benchmarks where the difference is a factor of two or so, which is actually significant. It does require a careful reading and a healthy degree of skepticism about the close results (the majority) to get something useful from this--could do with a summary.

Tobu · on Dec 30, 2014

Those are the quirks of CPython 2.7. The best optimisation is to run with PyPy — those benchmarks run about 10 times faster.

After that, if you want to optimise, you need an applicative benchmark. A microbenchmark won't work because type specialisation runs for (traces of) the program as a whole.

_glsb · on Dec 30, 2014

This just feels like picking up pennies in front of a steam roller.

jwl · on Dec 30, 2014

Could someone please explain #Test 11. How is using %s instead of %d double as fast even though it is the same thing is being calculated? I always use %d because it seems like the prooper way to do it.

nikital · on Dec 30, 2014

I looked at the source of 2.7:

The function PyString_Format (Objects/stringobject.c) does the formatting for the % operator. For %s it calls _PyObject_Str which in turn calls "str()" on the object. For %d it calls formatint (located in the same file).

The str() implementation for ints is in int_to_decimal_string (Objects/intobject.c) and it's incredibly simple:

    do {
        *--p = '0' + (char)(absn % 10);
        absn /= 10;
    } while (absn);

The code for formatint is way more complex, and it contains two call to the native snprintf:

        PyOS_snprintf(fmt, sizeof(fmt), "%s%%%s.%dl%c",
                      sign, (flags&F_ALT) ? "#" : "",
                      prec, type);
        // ...
        PyOS_snprintf(buf, buflen, fmt, -x);

The native snprintf is heavier because it handles precision, zero-padding and stuff like that.

I believe this is why %d is slower. %s is a straight "devide-by-10-and-subtract" loop while %d are two library calls to the full-blown sprintf. However I didn't actually profile the code because I don't have a debug build, so I might be completely wrong.

Edit: Link to code: https://github.com/python/cpython/blob/a5c7a4257507a77699a1a...

thu · on Dec 30, 2014

I met someone who was specifically using `dict()` instead of `{}` because, he told me, the CPython parser was much slower on the later than on the former and that it was noticeable on larger code base.

(That being said I think using Python should be about writing maintainable code, not about micro-optimization.

I don't claim the purpose of that site is to push you to use those micro-optimizations, even if the title seems to suggest so. Including the disassembly makes for a nice visualization.)

chriswarbo · on Dec 30, 2014

> I met someone who was specifically using `dict()` instead of `{}` because, he told me, the CPython parser was much slower on the later than on the former and that it was noticeable on larger code base.

I've been told by code reviewers to use micro-optimisations like this before.

My default reply is "I think this way is faster, since the shorter code results in fewer cache misses. I'd be happy to change it if you show me some benchmarks."

Of course, the cache may or may not affect anything; the point is to burst their philosophical bubble with an example of why measurements are needed.

There's also the good old fashioned "If we cared that much about performance, we wouldn't be using the branching logic of a templating system in an interpreted language."

Spidler · on Dec 30, 2014

Optimize for readability. Optimize for maintainance. Optimize for consistency. Once those three are done, you can optimize for performance.

jqm · on Dec 30, 2014

But, you can also possibly habitually change a few things in your initial coding syntax that produces less computational effort to begin with....

I thought it was a good article.

chriswarbo · on Dec 30, 2014

1) We don't know which one takes more computational effort, hence the need for realistic benchmarks with decent confidence intervals.

2) If benchmarks show one style to be slightly faster than another, then I will take note. I've already noted that these benchmarks have no associated statistical information; they're just numbers quoted to an arbitrary precision.

3) If I ever find myself needing such slight increases in speed, it's probably a sign that there's a looming disaster, since it means that a) there's a bottleneck in the program, b) I can't improve the algorithm, c) porting that section to native code hasn't been enough.

If it's a case of death by a thousand papercuts, where slight slow-downs throughout an application are actually making it unusable, then I'll roll up my sleeves and patch the interpreter/compiler (this is easier in self-hosting languages, but doable in all).

jqm · on Dec 30, 2014

No need to get all carried away talking about rewriting compilers or pulling out statistical confidence intervals...

If a certain style has a slight performance improvement, and this can be manifested, and that style is equally as readable and maintainable, there is simply no reason not to adopt it as habit. That's all I'm saying.

marcosdumay · on Dec 30, 2014

If those few things do not break redability, maintaince, consistence, or correctness, why not?

So, first you get good enough on the language to know you are not breaking those. Then you don't adopt micro-optimizations because you are already good enough on it to know what they'll break.

Spidler · on Jan 2, 2015

Why? Because most of the ones they show is that a global lookup is more costly than a local one.

Something that may -very- well change quickly. dict() slower than {}, and so on...

jlarocco · on Dec 30, 2014

Meh. My opinion is that it's easier to just change the code. The reality is there's probably no noticeable difference, and if it's important enough to them that they marked it on a code review they might actually try the benchmark, and then you're just wasting everybody's time.

chriswarbo · on Dec 30, 2014

It depends on the team I guess, but in my situation all changes had to be checked in to git, rebased against the live branch, pass code review, have the pass logged against that git commit, have paperwork filed in triplicate indicating the pass, etc. so it was a PITA to restart that process for such insignificant changes.

Much easier to argue with the reviewer; plus, if it stops them doing it again in the future, all the better ;)

It's funny that even with all of this process in place, things broke routinely when trivial syntax errors would slip through. There were no automated tests or monitoring systems, of course. It seemed like the process was to keep developers in check rather than ensure a working product; I think someone watched Superman 3 or Office Space and got paranoid.

Needless to say, I'm not there anymore ;)

chriswarbo · on Dec 30, 2014

    >>> timeit(a, number=1000000)
    ... 0.102125167847

Measurements shouldn't be quoted without some kind of range, confidence interval, error bar, etc.

I've not come across timeit before, but maybe they could learn a thing or two from systems like Criterion http://www.serpentine.com/criterion/

Beltiras · on Dec 30, 2014

I found the most enlightening part being the module 'dis' which I was totally oblivious of, but will partake of often from hereon in.

njharman · on Dec 30, 2014

Profile dammit!

Things that improve by few tenths of a second in a million iterations are NOT WORTH YOUR TIME.

ssebastianj · on Dec 30, 2014

Make It Work -> Make It Right -> Make It Fast

TheLoneWolfling · on Dec 30, 2014

These all seem like simple peephole optimizations to me. So why doesn't the Python compiler (and yes, Python has a compiler) do these optimizations?

(I am aware that some of these are not in fact equivalent, but at least some of them are)

On an unrelated note, the website would be easier to grok if it did diff-like highlighting of changed lines.

bjourne · on Dec 30, 2014

Most of them are not equivalent. dict() is not the same as {} as someone might have overwritten the dict function to do something completely different. Realizing that dict() and {}, in fact, are equivalent in this particular case requires whole program analysis which maybe PyPy is capable of.

TheLoneWolfling · on Dec 30, 2014

At least tests 3, 4, 5, 6, the first 3 cases of test 7, the first and third cases of test 11, 13, 16, 17, 18, 20, 21 cases 1 and 2, 22, the first two cases of 24, 25, 26, 27, 28, 29 tests 2 and 4, 30, and 32 can be optimized without this issue, if not more.

As I said, "some of these are not in fact equivalent, but at least some of them are".

bjourne · on Dec 30, 2014

That's not right, #5 is dependent on what True is defined to:

    True = 0
    def a():
        a = True
        if a:
            return True
        return False
    print a()

There are likely similar subtleties with the other examples, but I don't know the Python spec well enough to see them. You asked why the Python compiler didn't optimize these snippets more and that is the answer.

TheLoneWolfling · on Dec 30, 2014

    >>> True = False
    SyntaxError: can't assign to keyword

...

bjourne · on Dec 30, 2014

The examples are for Python 2, not 3. But it's blatantly obvious that you are more interested in "not being wrong" than discussing Python's compiler so I'll stop here.

ww2 · on Dec 30, 2014

Almost none of them are significant.

vanderZwan · on Dec 30, 2014

I'm more surprised that the disassemblies differ at all with half of these tests.

eveningcoffee · on Dec 30, 2014

I am wondering what is exactly measured here (by timeit). For example for test 5, there is noticeable difference between timings of variants b and c despite them evaluating to the same list of instructions (and I did tests by myself).

im3w1l · on Dec 30, 2014

The first has

    12 COMPARE_OP               8 (is)

while the second has

    12 COMPARE_OP               2 (==)

eveningcoffee · on Jan 2, 2015

You are right! I overlooked this. Thanks!

mcsalgado · on Dec 30, 2014

You mean test 6?

nosefrog · on Dec 30, 2014

For the tests where the speed difference is significant, these are really areas where the python optimizer (I assume the bytecode compiler does some optimization?) needs to improve. Humans shouldn't be concerned with such small differences in the literal representation of code.

im3w1l · on Dec 30, 2014

What's up with test 11, variant 1 vs 3?

sbergot · on Dec 30, 2014

I guess that the float formatter has more work to do that the string one.

nichochar · on Dec 30, 2014

This is so easy and quick to read, yet very interesting. Good job and thanks for making it

jdimov · on Dec 30, 2014

Here are my annotations for how I'm reading these. Feel free to correct or update:

1. Faster to use dict literals than the `dict()` constructor

2. Faster to sort a list in-place than to use `sorted()`

3. Faster to use multiple assignment than multiple individual assignments

4. Faster to evaluate multiple individual conditions than a single multi-comparison condition

5. Faster to use idiomatic implied truth value than to compare to `True`

6. Faster to use `!=` than `is not`

7. Fastest way to check for empty list `a` is the idiomatic `if not a`

8. Fastest way to check that an object is not null is `if a`

9. `enumerate` is faster than doing it yourself

10. Constructing strings with `+=` is faster than string surgery

11. Fastest way to place a number in a string is "%s" (not "%d")

12. For built-in types, using built-in functions (e.g. `len`) is faster than calling methods (e.g. `a.__len__`)

13. For built-in types, using built-in operators (e.g. `+`) is faster than calling methods (e.g. `a.__add__(b)`)

14. But for custom-defined classes, the opposite of #13 is true.

15. Adding up a bunch of numbers using `+` is faster than using `sum`

16. When initializing a large list, it is faster to use list comprehensions than to append each element individually

17. The fastest way to build a large dict is to use a dictionary comprehension

18. Same as #17

19. Not sure what the author is trying to demonstrate here. That `bool(a)` is faster when `a` is an empty list?

20. Faster to use implied truth value of an expression than to compare explicitly to `True` or `False`

21. The fastest way to refer to a list is to use the list literal (duh...)

22. It is faster to perform individual comparisons than to check for membership with `in`

23. Faster to `map` the `str` function to a list of values and join, than to manually build up the string. Also, when doing this, faster to use `xrange` than `range`

24. Faster to unpack than to access each list element individually. Also, as in #1, faster to use dict literal than the `dict()` constructor.

25. Faster to just return the boolean value of an expression than to check for it and then decide what to return.

26. Faster to give explicit values to keyword arguments than to expand a dict with kwargs.

27. To switch the values of `a` and `b`, it is faster to use a third variable than to do it in-place using unpacking.

28. Multiple assignment is faster than unpacking, which is faster than multiple individual assignments.

29. Fastest way to get integer part of a real is to use `math.floor`

30. Similar to #26 - it is faster to give explicit values to positional arguments than to expand a list of args.

31. Faster to access local vars than globals.

32. When trying to safely access a key in a dictionary, it is faster to use the idiomatic `D[k] if k in D` than to use the `get` method.

Tobu · on Dec 30, 2014

With PyPy: - `is` is very slightly faster - '%d' is slightly faster

Yet another caveat: the differences are often less that 1%, statistically not significant.

msellout · on Dec 30, 2014

I'd expand on #9.

It's generally faster and more readable to use an idiomatic Python iteration than to use indexing inside the loop. The difference was small over 5 elements, but `enumerate(a)` will become dramatically faster than `range(len(a))` if you have a large list. A better comparison would be `enumerate` vs `xrange`.

Obi_Juan_Kenobi · on Dec 31, 2014

I'd call any difference less than 1% to be equivalent. Even if it was statistically significant, it's not worth paying attention to.

My interpretation of many of these tests is that there isn't a difference.