Recent Performance Improvements in Function Calls in CPython (codingconfessions.com)
204 points by rbanffy 89 days ago | 167 comments



One of the performance improvements mentioned is "Remove the usage of the C stack in Python to Python calls" [0]. Since Python 3.11, a Python-level function call can be evaluated within the bytecode interpreter loop, no longer requiring a C-level function call.

Interestingly, Lua 5.4 did the opposite. Its implementation introduced C-level function calls for performance reasons [1] (although this change was reverted in 5.4.2 [2]).

[0] https://bugs.python.org/issue45256

[1] https://github.com/lua/lua/commit/196c87c9cecfacf978f37de4ec...

[2] https://github.com/lua/lua/commit/5d8ce05b3f6fad79e37ed21c10...


Can you explain what this means? I know all these words, but it looks like retro-encabulator to me.


At the heart of the Lua and Python interpreters, there's an evaluation function, written in C, that does something like this:

    void evaluate() {
        while (1) {
            instr = get_next_bytecode_instruction();
            switch (instr.type) {
                // dispatch on the opcode and run the instruction
            }
        }
    }
The question is how to run a CALL bytecode instruction. One way is to use the C stack: call evaluate() recursively. The other way is to have the interpreter maintain its own call stack, with the CALL instruction responsible for updating it; that way the interpreter stays in the same while(1) loop the whole time.

Using the C stack can be faster, because you take advantage of the native instruction pointer & return address instead of emulating them in C. But it can also be slower if calling evaluate() is slow.
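
To make the two strategies concrete, here is a toy sketch (in Python, since that is easier to skim than CPython's actual C): run_recursive handles CALL by recursing into the host-language evaluator, while run_flat keeps its own frame stack and never leaves the dispatch loop. The opcodes and frame layout are invented purely for illustration.

    def run_recursive(code):
        # Strategy 1: CALL recurses into the host-language evaluator,
        # so every guest-level call consumes a host-level stack frame.
        stack = []
        for op, *args in code:
            if op == "PUSH":
                stack.append(args[0])
            elif op == "ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "CALL":
                stack.append(run_recursive(args[0]))   # host-level recursion
            elif op == "RET":
                return stack.pop()

    def run_flat(code):
        # Strategy 2: the evaluator keeps its own call stack of frames and
        # stays inside one dispatch loop; CALL just pushes a new frame.
        frames = [[code, 0, []]]                # frame = [code, pc, value stack]
        while frames:
            frame = frames[-1]
            op, *args = frame[0][frame[1]]
            frame[1] += 1
            stack = frame[2]
            if op == "PUSH":
                stack.append(args[0])
            elif op == "ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "CALL":
                frames.append([args[0], 0, []])   # guest-level call, no host recursion
            elif op == "RET":
                result = stack.pop()
                frames.pop()
                if not frames:
                    return result
                frames[-1][2].append(result)

    callee = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("RET",)]
    main = [("CALL", callee), ("PUSH", 10), ("ADD",), ("RET",)]
    assert run_recursive(main) == run_flat(main) == 15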


When one Python function calls another, does that ultimately result in the CPU executing a `call` instruction, or not?

Python 3.11 changed from "yes" to "no" for performance reasons.

Lua, OTOH, changed from "no" to "yes" in 5.4.0, again for performance. It then changed back to "no" in 5.4.2 for other reasons, despite the performance regression.

(In Lua 5.1, it is `coroutine.resume`/`yield`, rather than ordinary call/`return`, that is implemented using C-level call/`return`. That would preclude implementing call/`return` the same way - but since 5.2 the coroutine functions are instead implemented via call/`longjmp`.)


Do you by chance know what LuaJIT does?


In JITted code? Nothing, it’s a pure tracing JIT so all calls are flattened. In the interpreter? We can always check the source[1], and it seems it implements its own stack.

[1] https://repo.or.cz/luajit-2.0.git/blob/HEAD:/src/vm_x64.dasc: look for BC_CALL[2], then ins_call before that, then ins_callt just above it.

[2] see mirror of the old wiki for the semantics: https://chrisfls.github.io/luajit-wiki/Bytecode/#Calls-and-V...


Here we go, another HN roundabout criticism of Python performance.

Here’s a real-world example. I recently did some work implementing a DSP data pipeline. We have a lot of code in Go, which I like generally. I looked at the library ecosystem in Go and there is no sensible set of standard filtering functions in any one library. I needed all the standards for my use case - Butterworth, Chebyshev, etc. - and what I found was that they were all over the place: some libraries had one or another, but none had everything, and they all had different interfaces.

So I could have done it in Go, or I could have kept that part in Python and used SciPy. To me that’s an obvious choice, because I and the business care more about getting something finished and working in a reasonable time, and in any case all the numeric work is in C anyway. In a couple of years, maybe the DSP ecosystem will be better in Go, but right now it’s just not ready. This is the case with most of our algorithm/ML work: the orchestration ends up being in Go, but almost everything scientific ends up in Python as the ecosystem is much, much more mature.
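
For flavor, here is roughly what that SciPy path looks like (a small sketch with made-up parameters, not the parent's actual pipeline): designing a Butterworth low-pass filter with scipy.signal and applying it to a noisy test signal.

    import numpy as np
    from scipy import signal

    # 4th-order Butterworth low-pass filter, 100 Hz cutoff at a 1 kHz sample rate
    sos = signal.butter(4, 100, btype="lowpass", fs=1000, output="sos")

    # Apply it to a noisy 10 Hz sine wave
    t = np.linspace(0, 1, 1000, endpoint=False)
    x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
    filtered = signal.sosfilt(sos, x)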


As a ML/AI guy, Python is, imo, the greatest DSL of all time. You get great access to C/C++/Fortran/CUDA code while getting a very high level language for scripting the glue between processes.


And nowadays, Rust.


That is fair enough, but using Python as glue with existing C libraries is the acknowledged use case for Python. I think it is even admitted in this thread.

But for that use case the claimed performance improvements in pure Python (which I can rarely replicate) are not relevant. They could have left the interpreter as it has been for 20+ years.

For web applications, migrating to Go or PHP should not require much thought apart from overcoming one's inertia. Hopefully the scientific stack will be rewritten in better languages like Go, Lisp, OCaml, etc.


> Hopefully the scientific stack will be rewritten in better languages like Go, Lisp, OCaml, etc.

I wouldn't hold my breath for that. Python can be used interactively, it's relatively mature, the hard plumbing work is done by real computer scientists, and all that's left is to play with the code until the numbers look right for a given piece of research.

It's just a better MATLAB in practice for some disciplines, and none of the languages we have provides this kind of flexibility with that amount of leeway for play. Zip the virtualenv and send over. If it doesn't work just delete the folder and reinstall anaconda...

Nobody is thinking about the performance of their code or the elegance of what they do. They care about the papers, deadlines and research in general. Code quality, general performance and organization can go wherever they like. All the underlying high-performance code is written in C anyway, and only the implementers of that code care about it. It's akin to docker for (data) scientists, mostly.

How do I know? I'm an HPC admin. I see that first hand.


Julia is a REPL-driven interactive system that was specifically syntactically (to its detriment IMO) targeted at Matlab users while Python for science was on the rise. Julia code does usually run faster than Python stuff without a lot of care & feeding for either, but Julia just didn't catch on quickly enough and will probably always be playing catch-up.

Julia is even earliest in the name Ju-Py-te-R (Julia/Python/R), although that now supports like 40 PLs, and "notebook" is probably a reference to the 1991 Mathematica 2.0 Notebook Front End (and obviously it was mostly just a coining/play on words, not "priorities").

Personally, I think the often trivial-to-modest extra effort to work with static types is worth it for correctness, specificity and performance, but 95+% of REPL-oriented systems (by number, not mindshare - more like 99.9+% by mindshare) are lax/dynamic about types by default, except maybe the Haskell Hugs interpreter and a few other exceptions. Cython w/cdef | Common Lisp with (declare) do at least allow gradual typing, and arguably the "type stable/not" flavors of Julia battles are roughly in that camp. Like most "defaults", most people just use the default even if other things are available, though. That's why to me it's worth making users be a little more explicit, like Nim does.

Anyway, I am just discussing alternatives floating about, not trying to be disputatious. Mostly people / scientists use whatever their colleagues do in a network effects way, with ever so occasionally generational shifts. Sometimes research communities are held hostage not just to a specific programming language, but rather to a very specific code base - certain kinds of simulators, reaction kinetics engines, and so forth. (TensorFlow or PyTorch might be a contemporary analogue more HN folk could relate with.)


Common Lisp does, but then again, only druids care about its existence and the tales of strange machines, powerful graphics workstations, that used to be fully written in languages full of parentheses.

To the point that people don't realize notebooks are how those machines' REPLs used to work.


Elegant tools for a more civilised age.


> Hopefully the scientific stack will be rewritten in better languages like Go, Lisp, OCaml, etc.

You are talking about people who still use Fortran… Scientists ain't got any time to migrate to another framework every week.


FORTRAN is actually still developed and wicked fast for the things its users do with it.

It's neither left behind nor obsolete.


Exactly, but it's undeniable that even updated it does have some ergonomics issues. It's just that its users care less about developer ergonomics. AFAIK there were plenty of attempts to build a replacement (e.g. Fortress), but they didn't catch on.


Fortran is in a weird spot. It's incredible for writing the hardest parts of the programs it's used for, but it honestly sucks for doing the "boring" parts. Doing heavy vectorized workloads in Fortran is nice, but doing I/O, organizing code, using any data structure that isn't essentially an array of some sort, etc. all suck.


It’s a great language for calculating things, not so great for doing things.



Alongside COBOL, it’s probably going to live forever.


There is a reason Fortran has survived so long. Writing matrix/vector heavy code in Fortran is (imo) still more pleasant than C, even when accounting for the idiosyncrasies around writing a Fortran program in general. The skill floor is significantly lower for writing very high-performance code.


In a way they would be better off using Fortran instead of Python.

Every scientific result that relies on a snapshot of 50 packages (whose resolution in conda is an NP-hard problem) should be qualified with an added hypothesis:

Given Python-3.x.y and [50 version numbers of packages], we arrive at the following result.

Even in Python itself the random number generator was broken for more than a year by one of the persons who actively mob people for rejecting their patches (in which they might have a monetary interest since they got paid for many of their patches, making it a gigantic conflict of interest).

This invalidated many scientific results from that period.


> This invalidated many scientific results from that period.

Did it invalidate any conclusions, or were the results just corrected?


At least Fortran 2023 is much better than raw Python with CPython.


Are you somehow compiling code with a PDF of the new standard? It’s not available in real compilers.


While not everything might be available yet, I was making the point that this isn't FORTRAN 66 on punch cards.


There is A LOT of FORTRAN 77 code around. I think most of the FORTRAN I’ve seen is 77.


I also see more C89, C++03, Java 8 and C# 7 than I would like to.

Yet that doesn't invalidate what is actually available.


What about Julia? For numeric code, not sure why I'd reach for Go first


I had bad experiences trying to productionise Julia code in the past. I realise the pre-compilation JIT-ing story is now quite different but it left a bad taste in my mouth with the community basically saying that it wasn't a problem when it quite clearly was.



I'm always shocked at how much performance we leave on the table in these places. I did a super quick benchmark in python[0] and go [1] (based on necovek's[2] comment) and ran them both locally. The go implementation runs the entire benchmark quicker than the fastest of the python ones.

The deviation in go's performance is still large, but far less so than Python's. Making the "wrong" choice for a single function call (bearing in mind that this is 10k iterations so we're still in the realm of scales even a moderate app can hit) in python is catastrophic, making the wrong choice for go is a significant slowdown but still 5x faster than doing it in Python. That sort of mental overhead is going to be everywhere, and it certainly doesn't encourage me to want to use python for a project.

[0] https://www.online-python.com/9gcpKLe458 [1] https://go.dev/play/p/zYKE0oZMFF4?v=goprev [2] https://news.ycombinator.com/item?id=41196915


Development speed and conciseness is what sells Python. Something simple like being able to do `breakpoint()` anywhere and get an interactive debugger is unmatched in Go.


Development speed in Go blows Python out of the water when approaching the project as an engineering project, but Python has it beat when approaching the project as a computer science project.

And therein lies the rub for anything that is pushing the modernity of computer science (AI/ML, DSP, etc.) When born in the computer science domain, much of the preexisting work able to be leveraged is going to be built around Python, not Go (or any other language, for that matter), and it takes a lot of work to bring that to another system.

It is pretty clear that just about anything other than Python is used in engineering when the science end is already well explored and the aforementioned work has already been done, but the available libraries are Python's selling feature otherwise.



Go's compile times are comparable to a decent sized python app's startup time.

> Something simple like being able to do `breakpoint()` anywhere and get an interactive debugger is unmatched in Go.

As opposed to using a debugger and setting a break and/or tracepoint, which you can do to a running program?


In a Python debugger, you can trivially augment the state of the program by executing Python code right there while debugging.

I don't remember what you can do with Go's debugger, but generally for compiled programs and GDB, you can do a lot, but you can't really execute snippets of new code in-line.

The simplicity is in the fact that Python debugger syntax is really mostly Python syntax, so you already know that if you've written the Python program you are debugging.
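
A minimal sketch of what that looks like in practice (the function and variable names are made up):

    def total_cost(items):
        subtotal = 0
        for price in items:
            breakpoint()   # drops into pdb right here
            # At the (Pdb) prompt you can inspect and mutate live state with
            # plain Python, e.g. `price = 0` or `subtotal += 100`, then `c` to continue.
            subtotal += price
        return subtotal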


This objection only works in comparisons of Python vs Java, C++ and perhaps Rust (compile times). It doesn't really hold up against more modern languages such as Kotlin or, heck, even Groovy, which is an ancient relic in JVM land.


Yeah, but Go isn't the only alternative, and something like `breakpoint()` anywhere goes back to Smalltalk and Lisp, with performance Python doesn't offer.


Sounds about right, you can get a ~4x speed up by compiling Python with Nuitka or Mypyc.


Prototype in Python so you can get your ideas on the page then rewrite in something else has been my method for a while now.


I rather do it in a language that has JIT in the box, or even AOT as alternative, then I hardly bother with rewrites.

Something I learned by rewriting too much Tcl into C.


CPython deliberately chooses to prioritize simplicity, stability and easy C extensions over raw speed. If you run the benchmark in PyPy (the performance-optimized implementation) it’d be 10X faster. You could argue anything that’s performance bottlenecked shouldn’t be implemented in Python in the first place and therefore C compatibility is critical.


I don't have easy access to pypy so I can't run the benchmark on that I'm afraid. I'd love to see some relative results though.

> You could argue anything that’s performance bottlenecked shouldn’t be implemented in Python in the first place and therefore C compatibility is critical.

Honestly, I think I'm heading towards "python should only be used for glue where performance does not matter". The sheer amount of time lost to waiting for slow runtimes is often brought up with node and electron, but I ran the benchmark in node and it's about halfway between go and python.



Node isn't the slow part of electron. Node is actually pretty darn fast.


Indeed, shipping a full browser is.


Write it in Python, profile it, and move the bottlenecks into C, called with ctypes.
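
A minimal sketch of that ctypes route, assuming a hypothetical libfast.so that exports `long long sum_min(const long long *xs, size_t n, long long cap)`:

    import ctypes

    # Hypothetical shared library, built e.g. with:
    #   gcc -O3 -shared -fPIC -o libfast.so fast.c
    lib = ctypes.CDLL("./libfast.so")
    lib.sum_min.restype = ctypes.c_longlong
    lib.sum_min.argtypes = (ctypes.POINTER(ctypes.c_longlong),
                            ctypes.c_size_t, ctypes.c_longlong)

    def sum_min(values, cap):
        # Copy the Python ints into a C array and hand the hot loop to C
        arr = (ctypes.c_longlong * len(values))(*values)
        return lib.sum_min(arr, len(values), cap)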


If you're going to rewrite large chunks anyway, I'd use Go from the start.


This is something I find quite fascinating. I've seen so much time wasted on Python this way it's incredible.

Insanely smart optimizations done. But then you rewrite the whole thing in vanilla Rust without profiling once and it's still 20x faster, and much more readable than the mix of numpy vectorization and C-routines.

It's kind of knowingly entering a sunk-cost fallacy for convenience. Before the cost is even sunk.

Python is often just the wrong choice. And it's not even that much simpler if you ask me.

In a language where you use value-based errors, have powerful sum types and strong typing guarantees, you can often get there faster, too IMO, because you can avoid a lot of mistakes by using the strictness of the compiler in your favor.


> Insanely smart optimizations done. But then you rewrite the whole thing in vanilla Rust without profiling once and it's still 20x faster, and much more readable than the mix of numpy vectorization and C-routines.

I'd argue that if you can rewrite the whole thing in Rust (or any other language), then it's not really a large project and it doesn't matter what you originally wrote it in.

On multiple occasions I had to port parts of Python projects to C++ for performance reasons, and the amount of code you have to write (and time you need to spend) is mind-blowing. Sometimes a single line of Python would expand to several files of C++, and you always had to redesign it somehow so that memory and lifetime management becomes feasible.

Python is often the right choice IMO, because most of the time the convenience you mention trumps all other concerns, including performance. I bet you would get tons of examples from people if you "Ask HN" about "how many quickly written Python prototypes are still out there years later because their C++ replacement could never be implemented on time?"


I have an example to support: https://github.com/hpc4cmb/toast/pull/380/commits/a38d1d6dbc...

A one-liner in Python is replaced by hundreds of lines of changes to implement in C++. (The one-liner has some boilerplate around it too, but the C++ function itself is longer and the boilerplate around it is even more.)

Edit: 230 additions and 36 deletions in 14 files.


That’s replacing one line which calls a third-party dependency with a manual implementation in native code.


Technically right, but the original code only optionally depends on Numba, and Numpy is ubiquitous in scientific Python.

The problem I was trying to solve in all those boilerplates is basically that Numpy still has overhead, and dropping down to a lower level is trivial if Numba is used, but the package maintainers don’t want to introduce a Numba dependency for various reasons. And the diff is showing the change from Numpy/Numba to C++ with pybind11.


That doesn't change the fact that it's not representative of how many lines something takes in python vs C++. If you replaced it with a call to native that used a third party library it would be back to being one line.


First, no sane mind will think one single example is representative of anything.

Second, if you want to focus on 3rd party or not, sure. But you could also think Numba jit vs C++ and in this case it puts them on a more equal footing to compare them.

As far as Numba is concerned, Numpy is effectively part of its stdlib, since it isn't calling Numpy but JIT-compiling it.

You may say Numba is not Python, true, but you could consider Numba an implementation of a subset of Python.

We can argue about the details but the example is not that long to read. My summary and your summary are both at best distilled through our own point of view a.k.a. biases. Anyone can read the diff itself to judge. That’s why I link to it.


My point is that you replaced calling a third party library from python with an implementation in native. You could easily replace a c++ third party library with a full python implementation of the same thing and come to the opposite conclusion.

The example matters because it’s bad - you’re not comparing like for like.


There's a huge swathe of languages between Python and C++. On threads about Node's performance (which is somewhere in between) people regularly say that it's unfair of developers to solely focus on devex, which is the argument here.

> On multiple occassions I had to port parts of Python projects to C++ for performance reasons, and the amount of code you have to write (and time you need to spend) is mind blowing. Sometimes a single line of Python would expand to several files of C++, and you always had to redesign it somehow so that memory and lifetime management becomes feasible.

Part of that is because it's designed for python. Using this benchmark example, the python code (without imports) is 36 lines and the go code is 70. Except, we're doing a 1:1 copy of python to go here, not writing it like you would write go. Most of the difference is braces, and the test scaffold.

I'd also argue the go code is more readable than the python code, doubly so when it scales.


> it's unfair of developers to solely focus on devex, which is the argument here.

For most companies and applications, developer time is many orders of magnitude more expensive than machine time. I agree that, if you are running applications across thousands of servers, then it's worth writing them in the most performant way possible. If your app runs on half a dozen average instances in a dozen-node K8s cluster, it's often cheaper to add more metal than to rewrite a large app.


There is more to it than just time. If your app is customer facing (or even worse, runs on your customer's own resources), high latency and general slowness is a great way to make your users hate it. That can cost a lot more than few days of developer time for optimization.


There is, of course, a limit. OTOH, if your devex is so terrible you can’t push out improvements at the same cadence of your competitors, you are doomed as well.


No, it’s more expensive to you, and you get to ignore the externalities. It’s like climate change - you don’t care because it doesn’t affect you.

How many hundreds of hours are wasted every day by Jira’s lazy loaded UI? I spend seconds waiting for UI transitions to happen on that app. How much of my life do I spend waiting for applications to load and show a single string that should take 50ms to fetch but instead takes 30 seconds to load a full application?


I don't know, but personally I've spent orders of magnitude more time debugging memory leaks in languages without GC than the time I've wasted waiting for shitty apps to load.

Software bloat is a real thing, I agree, but it's a fact of life and in my opinion not a hill worth dying on. People just don't care.

If you want proof of this, try to find a widely used Jira alternative which doesn't suck and doesn't need a terabyte cluster to run. There isn't one because most people just don't care about it, they care about the bottom line.


> People just don’t care.

Devs don’t care, because they aren’t being forced to care, and they have loaded M3 MBPs, so they don’t notice a difference.

I firmly believe that we’ll see more of a swing back to self-hosting in some form or another, be it colo or just more reliance on VPS, and when we do, suddenly performance will matter. Turns out when you can’t just magically get 20 more cores on a whim, you might care.


Not only that - when your ticket system becomes a Jira because of all the feature creep required, it’ll perform like a Jira.

Perhaps a little better if you can implement more in less crazy ways (it seems to keep a lot of logic in a relational database, which is a big no-no for the long-term evolution of your system).


How many years of our lives would be saved if Jira was rewritten in Verilog and ran on the fastest FPGAs?

Not much. I bet it spends most of the time looking up data on their massive databases, which do the best they can to cope with the data structures Jira uses internally. You could, conceivably, write a competitor from scratch in highly optimised Rust (or C++) and, by the time you have covered all the millions of corner cases that caused Atlassian to turn a ticketing system into Jira, you’ll have a similar mess.


This ad absurdum argument is ridiculous. Nobody is talking about writing web apps in Verilog except you.

I can’t claim to know why Jira is so slow - I don’t work for Atlassian. It was an example of an app I use daily that I spend minutes per day staring at skeletons, loading throbbers and interstitials on.

Core architectural choices like “programming language” set a baseline of performance that you can’t really fix. Python is unbelievably slow and your users pay that cost. There are lots of options in between that offer fast iteration speeds and orders of magnitude faster runtime speeds - go and .net are widely popular, have excellent ecosystems and don’t require you to flash an FPGA.


>For most companies and applications, developer time is many orders of magnitude more expensive than machine time.

For me the context was mostly working in DevOps CI/CD. Slow CI scripts slow down developer feedback and cost machine time as well as developer time.

If your customers are your developers, performance engineering is about saving the expensive developer time.

And if you ask me the effects are non-linear, too.

If your build system takes 1 minute or 10 minutes for a diff build the time lost won't be a mere factor of 10. Any sufficiently long time and your developers will lose focus and their mental model, will start making a coffee, and maybe they might come back later to that work after 3 meetings that got scheduled in between.


> Using this benchmark example, the python code (without imports) is 36 lines and the go code is 70.

Look at the gz values and notice that longer programs might be faster programs:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

~

"How source code size is measured"

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


It all depends on execution time. If you run it just a few times or the total execution time doesn't take days, you can do it faster in Python end-to-end.


> I've seen so much time wasted on Python this way it's incredible.

It sort of goes both ways. You'll see a lot of time wasted on Python when the developers probably could have known it would need better performance. On the flip side, you also see teams choosing a performant language for something which will never need it. One of the reasons I've come to prefer Go over Python is that you sort of get the best of both worlds, with an amazing standard library. The biggest advantage is probably that it's harder to fuck things up with Go. I've seen a lot of Python code where the developer used a list rather than a generator, loops instead of built-in functions, and so on.
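
For example (a trivial illustration of the list-vs-generator point, not taken from any particular codebase):

    # Materializes a 10-million-element list in memory before summing
    total = sum([x * x for x in range(10_000_000)])

    # Generator expression: streams values one at a time, same result, far less memory
    total = sum(x * x for x in range(10_000_000))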

That being said, I think we'll see a lot more Rust/Zig/C/C++ as things like ESG slowly make their way into digitalisation decisions in some organisations. Not that it'll necessarily make things more performant, but it sounds good to say that you're using the climate-friendly programming language. With upper management rarely caring about IT... well...


That reasoning is why I mentioned go - .net is another ecosystem that offers excellent bang for buck.


> But then you rewrite the whole thing in vanilla Rust without profiling once and it's still 20x faster

The main reason Python is used is because developer time is expensive and computer time is cheap. When that balance shifts the other way (i.e. the compute slowdown is more expensive then the dev time), Python is no longer the right choice.

There's no point in developing at half speed or less in Rust compared to Python if your code only runs a few times, or if the 20x performance gains only results in 60 minutes of total compute time saved.


>There's no point in developing at half speed or less in Rust compared to Python

In a lot of cases the Rust development speed is faster IME, because you can really get the correctness on the first go much more often than with Python. Also proc-macros make dealing with serialization and deserialization or CLI parameters a breeze.


> because you can really get the correctness on the first go much more often than with Python

Helpful only when you have a full-ish picture of the problem ahead of you and know what's correct when you start programming.


A lot comes down to familiarity, to be fair. I’m not a SWE by trade, I’m an SRE / DBRE. I can read code in just about any language, but the only one I consider myself really good in is Python.

I could and should learn another (Rust seems very appealing, yes), but at the same time, I find joy in getting Python to be faster than people expect. C being leaned on is cheating a bit, of course, but even within pure Python, there are plenty of ways to gain speedups that aren’t often considered.


You can enforce strict typing by using MyPy and Rust. Those are part of the modern Python development stack. If your program can be rewritten in Rust, you are definitely not doing full-stack web development or data science / machine learning. Rust isn't good at those, since there are no good frameworks for DL/ML stuff - and working with unstructured data, which is very much necessary in those fields, works against Rust.


I meant Ruff in the first line.


I dislike Go as a language for a variety of reasons, but nothing really objective.

I can agree with your sentiment that if performance is important from the outset, that writing everything in a language more performant by default is a smart move.

In a particular case for me, the project evolved organically, and I learned performance bottlenecks along the way. The actual functions I rewrote in C were fairly small.


If you dislike go, then c# is another great alternative. I chose go because it’s fast, simple and the iteration times for go are often quicker than python apps in my experience


I think the safest way to understand Python is that performance is a “nice to have” feature, traded off for legibility and flexibility and even simplicity of the CPython code base (go read it! You barely need to understand C and a lot of it is written in Python!).

Don’t waste time being surprised that you can do better than the default implementation. Just assume that and do so when you’ve measured that it matters.

That being said, (mostly) free performance is always nice. I’m glad they’re working on performance improvements where they can do it without sacrificing much.
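
In that spirit, measuring first is cheap; a minimal cProfile sketch (the profiled function is made up):

    import cProfile
    import pstats

    def hot_path():
        return sum(min(i, 500) for i in range(1_000_000))

    cProfile.run("hot_path()", "profile.out")   # profile and dump stats to a file
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)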


Take a look at Nim.

You get C performance, with the readability of Python.

https://nim-lang.org/


Everyone needs to stop comparing Nim with Python. Nim is closer to Pascal, and the only thing it shares with Python is indentation-based syntax. It doesn't have the expressivity and flexibility of Python. It's a nice language, but there is no good reason to compare it to Python.


I think the comparisons with Python are driven by how divisive it is to have leading whitespace be significant (though Haskell also has it without as much divisive discussion). There was some effort in Nim to have some of the standard library names mirror Python, such as `strutils.startswith(str, "oops")`. That said, aside from a few things like that, it is true that Pascal/Ada, made more terse, are more the inspiration.

I know both languages well and personally I consider Nim to be more expressive and syntactically flexible than Python although it is admittedly a difficult thing to quantify (which "more"/your comment framing implies). As just a few examples to make the abstract more concrete, in Nim you can add user-defined operators. So, with https://github.com/SciNim/Measuremancer I can have let x = a +- b or let y = c ± d construct uncertain numbers and then say y/x to get a ratio with uncertainty propagated. Function call syntax is more flexible (and arises more often than uncertain value algebras), e.g. "f(x)", "x.f", "f x", and even "f: x" all work for any kind of routine call. The need to be specific about let/var constancy & types is slightly more verbose but adds semantic/correctness value (e.g. first use). Just generally Nim feels like "much more choice" than Python to me. In fact, the leading whitespace itself can usually be side-stepped with extra ()s which is not true in Python.

I am speculating, but I think what might make you feel Python is more expressive may come from ecosystem size/support effects such as "3 external libraries exist that do what I want and I only need to learn the names of how to invoke whichever". So, you get to "express" what you "want" with less effort (logic/coding), but that's not really PLang expressivity. Python is more dynamic, but dynamism is a different aspect and often leads to performance woes.


Both are general-purpose, high-level programming languages. Why not compare them? Say I am considering starting a new programming project and need to choose a language. Nim looks fine, so does Python, so which one do I choose? I need to compare those languages to make a motivated choice, right?

Anyhow, what are we allowed to compare with Python?


> Nim looks fine, so does Python, so which one do I chose?

Assuming you need to use libraries for stuff, you choose the language that has the better ecosystem, more users and more StackOverflow support. As a company you choose the language that makes it easier to hire devs in, and move them around from a project to another. Only when execution performance trumps all other concerns you select the faster/stricter language, and only for mature projects not for a POC which might be thrown away soon.

So you use the strict/performant languages for mature, execution intensive code that doesn't need much library support and where you don't need to hire many devs.


I really really want nim to catch up. It's a lovely language. I've used it for small projects and it is great but I couldn't use it for anything complex that requires third-party libraries. The ecosystem is just not there.

I wonder what is stopping community mindshare?


I bet same reason that you just outlined.

(Chicken and the egg problem)


Pretty much. When Python was at this stage, what alternatives did a developer have other than to improve Python/write a library? Nim has to “compete” with a solution already being available.


Perl dominated before Python, and had a massive library of modules via CPAN.

Clearly Python eclipsed Perl. Just say'n.


Python was a curiosity/tiny niche for over a decade from the earliest 0.9 releases I used. I was advocating it to thoroughly deaf ears in the mid-90s with rejoinders of "just use Common Lisp".

Then several things happened. 1) Large businesses like Google openly saying they used Python in the early 2000s helped for things like web / Internet work. 2) Linux distros like Gentoo leaned on it heavily helped push it in systems/sysadmin. 3) the "Numeric -> NumPy/SciPy" transition helped a ton - spurred by things like `f2py` and exorbitant licensing costs of Matlab, the main competitor, which also had a lot of 1970s/early 80s language problems like 1-function per file and scoping craziness and not doing reference semantics for big matrices and junk like that. Then 4) the culture of open source in scientific software helped build out that aspect of the ecosystem as the other, 1/2/3 were growing fast. Then 5 by the late noughties/early teens universities were switching to it as a teaching language and finally 6 ML/data science as a specific area of the NumPy/SciPy world started taking off in hype cycles and even real applied work.

This is just a long-winded (but still woefully incomplete) / historical way of saying "software has network effects" (not just PLs, but OSes and applications as well), of course. As Jack Reacher likes to say "details matter in an investigation". ;-) I don't think Python was purely lucky, exactly, but with one or two those big reinforcement/feedback factors missing, it might not have become so dominant.


Reading this, I was interested to see how big the speed difference between Python and PHP is these days, so I tried it like this:

min.py:

    i = 10_000_000
    r = 0

    while i>0:
        i = i-1
        r += min(i,500)

    print(r)
min.php:

    <?php

    $i = 10_000_000;
    $r = 0;

    while ($i>0) {
        $i = $i-1;
        $r += min($i,500);
    }

    print($r);
The results:

    time python3 min.py
    4999874750
    real    0m2.523s

    time php min.php
    4999874750
    real    0m0.333s
Looks like Python is still 8x slower than PHP. Pretty significant.

I ran it with Python 3.11.2 and PHP 8.2.18


Not to be a stickler, but the only fair benchmark would be to select the most recent stable version of each. Doing so is fairly easy with docker I would think, if you don't wish to modify your system.


I wouldn't say you are a stickler. But the most productive addition to the discussion would be to do the benchmark as you see fit and then show us the results.


Very fair feedback. I'll do so once I'm at a machine!


3.11? The article explains there are improvements on that specific benchmark in newer versions. Not on the order of magnitude that it would catch up here, but still significant.



It's even slower than the Java compiler and JIT combined! Java has never been known to be fast for command-line programs.

My M1 MacBook: 1.716s for min.py AND 0.295s for javac and 0.067s for running the compiled code.

It is a straightforward conversion to java using long for i and r.


I once was using a bunch of handmade python scripts to work with terrain heightmaps and ground terrain textures in an attempt to improve FSX visuals before flight sim 2020 was announced.

I had 12Kx12K pixel images of the ground in spring, and wanted to make some winter versions for fun. Some basic toying around showed you could reasonably resample a "snow" texture into these photos wherever you had pixels that were "mostly green", ie grass or trees.

I wrote this up in python using the simple PIL python library for working with image data. It took 45 minutes per image.

I fired up a text editor and within 30 minutes had a java version of the code, most of that time was spent remembering how to write java.

30 seconds per image.

I don't care what nonsensical "optimizations" I could have done, like oh just pull it into numpy arrays and work there, oh PIL is slow for pixel manipulation so don't do that, oh you could rewrite it like X and it will be faster

I don't care. The simple, naive, amateurish attempt in Java, a language I hadn't touched in 6 years, blew the pants off of the python version, a language I used at my day job. It took zero more effort to write in java than in python. Python is great for doing simple one off scripts, but it is terrible at actual work.

Add to that, the tooling around python is mediocre, and for some reason python programmers don't seem to care, while the tooling around java is so good IDEs basically write all the supposedly awful boilerplate code for you. Stuff that I had to do by hand in python was automated in java.


I had a similar experience when I was in high school and learning Python. I wrote an image->ASCII converter using PIL as you did; one image would take a few hundred ms and it was fine, but as I played with it and added more features/learned more, I tried to do it on video frames extracted with ffmpeg and it would take at least 4x the duration of the video, or even more, to convert it into ASCII video. A few months ago I coded it in Go since I got back into programming, and it was significantly faster (the overall code was also smaller, but maybe that's just because I have a bit more experience now), even though in both cases the implementation was the most naive one I could come up with, without any optimizations.

Edit: I went ahead and tested both; it took me 30 minutes to set them up with identical parameters on 3602 frames (roughly a 1-minute video). Keep in mind that this is just a naive implementation with 2 nested loops iterating in a step of 12 over the pixels of the image, no fancy math or numpy used in the Python version. Image for reference[0]. Also keep in mind that the step and length of the ASCII charmap used affect performance, but here I used a simple one, "_^:-=fg#%@", so it's a bit easier for the Python code. Here are the results on my i7 9th gen laptop:

1 Minute video 720p 60fps

GO - Execution time: 359647 ms, 5.9 minutes ("asciifying the frames") + 30~ seconds ffmpeg video conversion

Python 3.12.3 - Execution time: 0:16:25.002945, minus ~25-30 seconds ffmpeg

[0]: https://i.postimg.cc/CK30z1jt/frame-000039.png [1]: Video using a more detailed ascii charmap(" .,:;i1tfgLCG08@#") in the go version -> https://streamable.com/7h8jc3


A lot of these notes follow a similar pattern: we start with Python, get a working program, and then rewrite it in X with a huge performance gain.

That’s fair. We use the easy language to get a correct program, then we use the “hard” language to get a fast program, now that we know what a correct one looks like.

This is the key here. If you try to start with the lower level language, you might spend a much longer time getting to the correct program.

It’s much easier to optimise a program that’s working than to debug one that isn’t.


Or just use something that offers both from the start.

One day, Python will finally catch up with 1990's Smalltalk, SELF, Lisp, Dylan and Scheme.


What would you recommend that gives both worlds? You can learn most of the syntax of python/go to a usable level in about a day.


Go I agree with; as for Python, having used it on and off since version 1.6, that would be very, very basic knowledge after one day.

Since you mention Go, you're placing dynamic and static languages in the same basket.

Common Lisp, Scheme, Racket, Clojure, Scala, Kotlin, F#, OCaml, Pharo, JavaScript/TypeScript.

There are other candidates, only listing the ones with good out of the box REPL experience, JIT/AOT toolchains in the reference implementation, and not going crazy with their type systems for strong typing.


One of the strengths python has is its ecosystem. All the languages you listed don’t come close to python’s ecosystem except perhaps TypeScript/nodeJS (whose quality of ecosystem is questionable although improving in recent times).


>This is the key here

No, this was very much NOT my point. The java version of my program was just as easy if not easier to write, precisely because I COULD write the naive version and trust that it would work out fine performance wise.

Hell, the java version didn't even require the extra library (Pillow/PIL) that python required!


This is more a measurement of looping, calling the min function and printing than of the language itself. All benchmarks are like that to some degree, of course, but I really don't think this is a good comparison.


Python also uses bigints.


If someone writes this same program in COBOL and gets better performance (and I’m sure it would) would we be tempted to write more COBOL code?

I wouldn’t.


The dynamic aspect of Python is once again probably the major penalty here.

One major difference here is that in Python you can override min(), so the interpreter has to reach the function through an extra layer of lookup on every call, which PHP doesn't need. I bet PyPy is close to PHP here, however.
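
One classic (if ugly) way to see that lookup cost is to bind the builtin to a local name once, e.g. via a default argument, so the loop does a local lookup instead of a global/builtin lookup on every iteration. A sketch, not the parent's code:

    def total(n, _min=min):      # _min is bound once, at function definition time
        r = 0
        i = n
        while i > 0:
            i -= 1
            r += _min(i, 500)    # LOAD_FAST instead of LOAD_GLOBAL each pass
        return r

    print(total(10_000_000))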


You might be right about the min() call being the major penalty.

When I avoid the min() call like this ...

min2.py:

    i = 10_000_000
    r = 0

    while i>0:
        i = i-1
        if i<500: r += i
        else    : r += 500

    print(r)
I get a speedup of about 50%:

    time python3 min2.py
    4999874750
    real    0m1.338s
Still 4x slower than PHP though.

With PyPy I get an amazing speedup of 96%:

    time pypy3 min.py
    4999874750
    real    0m0.088s
4 times faster than the PHP version!

I'm not sure what the state of things is regarding PyPy and running a web application. Doing a bit of googling, it seems that calling C functions from PyPy is slow, so one would have to check if the application uses those (for example to connect to the database).


FWIW, if I place the code in a function, so that variable lookups are constant-time rather than a dictionary lookup in the module globals, then (on my Mac, with Python 3.12) I get a two-fold speedup. That is:

    import time

    def spam():
        t1 = time.time()
        i = 10_000_000
        r = 0

        while i>0:
            i = i-1
            if i<500: r += i
            else    : r += 500

        t2 = time.time()
        print(r)
        print("Time:", t2-t1)

    spam()
reports 0.87 seconds outside a function and 0.43 seconds when inside a function.

(Using the shell's time, of course, also includes the Python startup time. When I use it I get 0.94 and 0.49 seconds, indicating a 0.06s startup penalty.)


Interesting! Same here.

Also with pypy3, putting it in a function about doubles the speed:

    time pypy3 min_in_function.py 
    4999874750
    real    0m0.045s


Back in Python 1.x, function locals used a dictionary, just like module globals. This made it possible to do:

  def f():
    from math import *
    return cos(3.0)
https://peps.python.org/pep-0227/#import-used-in-function-sc... notes "The language reference specifies that import * may only occur in a module scope. (Sec. 6.11) The implementation of C Python has supported import * at the function scope."

Python 2.1 added static nested scopes (see the PEP 227 link), which enabled a much faster locals lookup. With that one case excluded, all of the locals can be determined at parse/byte-compile time, stored in a simple vector, and referenced by offset:

    >>> z = 0
    >>> def f(x):
    ...   y = x + z
    ...   return y * 2
    ...
    >>> import dis
    >>> dis.dis(f)
      1           0 RESUME                   0

      2           2 LOAD_FAST                0 (x)
                  4 LOAD_GLOBAL              0 (z)
                 14 BINARY_OP                0 (+)
                 18 STORE_FAST               1 (y)

      3          20 LOAD_FAST                1 (y)
                 22 LOAD_CONST               1 (2)
                 24 BINARY_OP                5 (*)
                 28 RETURN_VALUE
The "LOAD_FAST" uses offset 0 to store "x" and offset 1 to store "y", while "z" required a LOAD_GLOBAL to find z in the globals() dictionary.


Yes, I stumbled across this quite recently:

https://x.com/marekgibney/status/1817495719520940236

I was surprised that you cannot access dynamically created local variables even though they exist.

Looks like the local dictionary still exists. Just that whether accessing a local variable looks it up in the local or global dictionary is determined at compile time.


See PEP 667, "Consistent views of namespaces" https://peps.python.org/pep-0667/ .

In Python 3.13, quoting https://docs.python.org/dev/whatsnew/3.13.html

> PEP 667: The locals() builtin now has defined semantics when mutating the returned mapping. Python debuggers and similar tools may now more reliably update local variables in optimized scopes even during concurrent code execution.

with more details at https://docs.python.org/dev/whatsnew/3.13.html#whatsnew313-l... , which adds:

> To ensure debuggers and similar tools can reliably update local variables in scopes affected by this change, FrameType.f_locals now returns a write-through proxy to the frame’s local and locally referenced nonlocal variables in these scopes


Interesting. As I understand it, it will not affect my example:

    def f():
        exec('y = 7') # will still create a local variable
        print(locals()['y']) # will still print this local variable
        print(y) # will still try to access a global variable 'y'


That does not work for me in Python 3.10 or 3.12:

  Python 3.12.2  ...
  Type "help", "copyright", "credits" or "license" for more information.
  >>> def f():
  ...         exec('y = 7') # will still create a local variable
  ...         print(locals()['y']) # will still print this local variable
  ...         print(y) # will still try to access a global variable 'y'
  ...
  >>> f()
  7
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<stdin>", line 4, in f
  NameError: name 'y' is not defined
It does print 7 in Python 2.7.


That's what I meant. It tries to access a global variable 'y' and fails because there is none.

See also:

https://x.com/marekgibney/status/1817495719520940236

and

https://x.com/marekgibney/status/1817495721974583569


I see what you mean now.

Messing with locals has been a long-time danger zone. I don't know how it is supposed to interact with exec - another danger zone - and I don't know enough about the changes in 3.13 to know if it's resolved.



I got very little speed-up on pypy from being in a function (0.043 vs 0.033), but I got a 1.8X speed-up for py2 vs py3 (0.7686/.4322 = 1.78). (All programs emit 4999874750 as expected.)

EDIT: This uses a Nim program https://github.com/c-blake/bu/blob/main/doc/ru.md run as `ru -t`, but for the very fast variants you can get a more precise wall time from https://github.com/c-blake/bu/blob/main/doc/tim.md

    pypy3_10-7.3.16_p1 - (first form,2nd form) x same in a func
    TM 0.045421363 wall 0.034517 usr 0.010855 sys 99.9 % 56476 mxRS minCall
    TM 0.043447211 wall 0.029492 usr 0.013763 sys 99.6 % 56124 mxRS ifStmt
    TM 0.033408064 wall 0.020605 usr 0.012758 sys 99.9 % 55960 mxRS minCallInFunc
    TM 0.033115636 wall 0.026059 usr 0.007018 sys 99.9 % 55956 mxRS ifStmtInFunc
    
    python-2.7.18_p16
    TM 1.755861105 wall 1.751740 usr 0.001991 sys 99.9 % 5944  mxRS
    TM 0.940162912 wall 0.935362 usr 0.003988 sys 99.9 % 5916  mxRS
    TM 1.155755936 wall 1.151529 usr 0.002993 sys 99.9 % 5928  mxRS
    TM 0.432215672 wall 0.430851 usr 0.000998 sys 99.9 % 5800  mxRS

    python-3.11.9_p1
    TM 2.678891504 wall 2.675141 usr 0.000994 sys 99.9 % 7992  mxRS
    TM 1.348009364 wall 1.344612 usr 0.001995 sys 99.9 % 7864  mxRS
    TM 2.018986091 wall 2.014702 usr 0.001994 sys 99.9 % 7972  mxRS
    TM 0.768633122 wall 0.766915 usr 0.000997 sys 99.9 % 7980  mxRS
I recall that when Python 3 was first going through its growing pains in the mid noughties there were promises of eventually clawing back performance. This clawback now seems to have been a fantasy (or perhaps they have just piled on so many new features that they clawed back and then regressed?).

Anyway, Nim is even faster than PyPy and uses less memory than either version of CPython:

doIt.nim:

    proc doIt() =
      var i = 10000000
      var r = 0
      while i > 0:
          i = i - 1
          r += min(i, 500)
      echo r
    doIt()
    #TM 0.024735 wall 0.024743 usr 0.000000 sys 100.0% 1.504 mxRM
In terms of "how Python-like is Nim", all I did was change `def` to `proc` and add the 2 `var`s and change `print` -> `echo`. { EDIT: though if you love Py print(), there is always https://github.com/c-blake/cligen/blob/master/cligen/print.n... or some other roll-your-own idea. Then instead of py2/py3 print x,y vs print(x,y) you can actually do either one in Nim since its call syntax is so flexible. }

It is perhaps noteworthy that, if the values are realized in some kind of array rather than generated by a loop, modern CPUs can do this kind of calculation well with SIMD, and compilers like gcc can even recognize the constructs and auto-vectorize. Of course, needing to load data costs memory bandwidth, which may not be great compared to SIMD instruction throughput at scales past 80 MiB as in this problem.


> very little speed-up on pypy from being in a function

I believe PyPy is clever and checks for new or removed elements in module and builtin scope. If they haven't changed then lookup can re-use previously resolved information.

> clawing back performance

On my laptop, appropriately instrumented, I see Python 3.12 is faster than 2.7. (Note that they use different versions of clang):

  % python2.7 t.py
  version: 2.7.17
  module scope: 1.0321419239
  function scope: 0.575540065765

  % python3.12 t.py
  version: 3.12.2
  module scope: 0.9053220748901367
  function scope: 0.44064807891845703
Python is far from speedy, yes. For my critical performance code I use a C extension and compiler intrinsics.


Good data. Thanks. I got a bit of a speed-up on python-3.12.5 - 0.599 down from 0.76863, but still slower than 2.7 on the same (old) i7-6700k CPU. { EDIT: All versions on both CPUs compiled on gcc-13 -O3, FWIW. }

On a different CPU i7-1370P AlderLake P-core (via Linux taskset) I got a 1.5X time ratio (py3.12.5 being 0.371 and py2.7 0.245). Anyway, I should have qualified that it's likely the kind of thing that varies across CPUs. Maybe you are on AMD. And no, I sure don't have access to the CPUs I had back in 2007 anymore. :-) So, there is no danger of this being a very scientific perf shade toss. ;-)

Same pypy3 also showed very minor speed-up for being inside a function. So, I think you are probably right about PyPy's optimization there and maybe it does just vary across PyPy versions. Not sure how @mg several levels up got his big 2X boost, but down at the 10s of usec to 10s of ms scale, just fluctuations in OS scheduler or P-core/E-core kinds of things can create confusion which is part of the motivation for aforementioned https://github.com/c-blake/bu/blob/main/doc/tim.md.

I should perhaps also have mentioned https://github.com/yglukhov/nimpy as a way to write extensions in Nim. Kind of like Cython/SWIG like stuff, but for Nim.


I've tested a particular python webserver in PyPy and saw roughly the same performance as CPython on average. My guess at the time was that PyPy has a bigger overhead for network sockets than CPython. You would likely see a gain from using PyPy if the webserver did heavier processing in pure python than what mine does.


I’m not sure it’s worth investing too much effort in such optimisations unless they can yield perceived performance improvements or cost reductions. For Google, a 0.1% decrease in power consumption pays for itself in hours, but for mere mortals (and regular companies) it often doesn’t.

It’s a fun sport though. I love it.


I just wanted to make sure I wasn't passing up on a free 400% boost or something :-) but it was already running plenty fast, so it was entirely out of curiosity.


For a web application PHP is likely much faster, because it does templating in a sane manner, while you need huge frameworks for Python.


I think you still need to do the templating in PHP, because you usually want to escape HTML chars. And because you want to separate code and layout. Having "Hello <?=$name?>" in the template is usually not what you want. It is also harder to read than "Hello {NAME}".

Let's test templating in Python and PHP!

template.py:

    def main():
        t = 'Number {NR} is cool'
        i = 10_000_000
        r = 0

        while i>0:
            x = t.replace('{NR}', str(i))
            r += len(x)
            i = i - 1

        print(r)

    main()
template.php:

    <?php

    $t = 'Number {NR} is cool';
    $i = 10_000_000;
    $r = 0;

    while ($i>0) {
        $x = str_replace('{NR}', $i, $t);
        $r += strlen($x);
        $i = $i - 1;
    }

    print($r);
Results:

    time python3 template.py 
    218888897
    real    0m2.854s

    time php template.php 
    218888897
    real    0m0.994s

    time pypy3 template.py 
    218888897
    real    0m0.795s
So for this use case, PHP is 3x faster than CPython and PyPy is 20% faster than PHP.


With pypy, the code runs in 0.226s on my computer.

So yes, the comparison of OP is CPython vs PHP.


Yes, it is CPython. Added data on the PyPy performance to this comment now:

https://news.ycombinator.com/item?id=41199431


Smalltalk, SELF and Common Lisp are just as dynamic, with great JIT implementations; their research laid the foundation for all modern JITs in dynamic languages, and one of those implementations grew up to become HotSpot.

Those systems are image based, at any time, you can eval code, break into the debugger, change code and redo the interrupted code, affecting everything that the JIT already improved on the whole image.


I wonder how would simply doing a `return min(heights)` compare to any of the options given?

(It sure doesn't demonstrate the improvements between interpreter versions, but that's the classic, Python way of optimizing: let builtins do all the looping)


I was curious myself, so I ran a quick benchmark:

  import random
  import timeit

  heights = [random.randint(0, 10000)/100 for i in range(10000)]

  def benchmark1(heights):
      smallest = heights[0]
      count = len(heights) - 1
      while count > 0:
          if heights[count] < smallest:
              smallest = heights[count]
          count -= 1
      return smallest


  def benchmark1b(heights):
      a = 1
      b = len(heights) - 1
      min_height = heights[0]
      while a < b:
          if heights[a] < min_height:
              min_height = heights[a]
          a += 1

      return min_height


  def benchmark2(heights):
      smallest = heights[0]
      count = len(heights) - 1
      while count > 0:
          smallest = min(heights[count], smallest)
          count -= 1
      return smallest

  print(timeit.timeit('min(heights)', number=1000, globals={'heights': heights}))
  print(timeit.timeit('benchmark1(heights)', number=1000, globals={'heights': heights, 'benchmark1': benchmark1}))
  print(timeit.timeit('benchmark1b(heights)', number=1000, globals={'heights': heights, 'benchmark1b': benchmark1b}))
  print(timeit.timeit('benchmark2(heights)', number=1000, globals={'heights': heights, 'benchmark2': benchmark2}))

Here are the results in Python 3.11:

  0.04471710091456771
  0.21777329698670655
  0.22779683792032301
  0.6679719020612538
So: using min over the list is ~5x faster; using a single variable and a constant 0 bound is ~5% faster than using two boundary variables; and using min inside the loop instead of the if check is another ~3x slower. The old approach of looking for opportunities to use a builtin instead of looping still likely "wins" in the newer interpreters too, but if someone's got a 3.14 alpha up, I'd love to see the results.

I might install 3.13 to check it out there too.


Now try with a for loop.


For science I added versions of the benchmarks as for loops: it's faster than the while loop versions above, but min(heights) is still way faster:

    def benchmark1for(heights):
        smallest = heights[0]
        for h in heights:
            if h < smallest:
                smallest = h
        return smallest

    def benchmark2for(heights):
        smallest = heights[0]
        for h in heights:
            smallest = min(h, smallest)
        return smallest

Results:

    min(heights) : 0.07920974999433383
    benchmark1   : 0.954568124958314
    benchmark1for: 0.5765543330344371
    benchmark2   : 1.8662503749947064
    benchmark2for: 1.5281645830255002


Good call GP, and thanks for the results!

What was the Python version you used, and what was the CPU?

My runs were on 3.11 on Ryzen 7 Pro 6850U with whatever the TDP envelope is on a Thinkpad T14s.


Python 3.12.1 on an M2 macbook air. For extra fun I also tested doing `sorted(heights)[0]` which is slightly faster than the benchmark2 variants, but slower than the benchmark1 variants.


On the general topic of Python performance, just importing needed modules can take a few seconds. The output of the code

   import time
   start = time.time()
   import os
   import numpy as np
   import pandas as pd
   import xgboost as xgb
   from sklearn.metrics import mean_squared_error
   from scipy.stats import pearsonr
   import matplotlib.pyplot as plt
   print("time elapsed (s):", "%0.3f"%(time.time() - start))
on my Windows machine is

time elapsed (s): 2.630


It can take an arbitrary amount of time. Modules are just code executed top to bottom, and might contain anything beyond mere constants and function declarations.
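For example, a hypothetical module like this makes every first importer pay for whatever runs at its top level:

    # slow_module.py (hypothetical) -- module-level code runs at import time
    import time

    print("building lookup tables...")
    time.sleep(2)                        # every process importing this pays ~2s once

    TABLE = [i * i for i in range(1000)]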


If those libraries are doing something at import time, then it could take any amount of time, TBF.


Not sure why this matters all that much; it's a one-time setup cost (for some fairly large libraries). Python doesn't import everything in your virtual environment at startup, which I think we can agree is the right choice.

I'd further _guess_ that only one or two of those libraries are the actual slowdowns.

See here for how to tell: https://stackoverflow.com/a/51300944/602469
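If I remember right, that boils down to CPython's own `-X importtime` flag, which prints a per-module breakdown of import times to stderr:

    python -X importtime -c "import pandas"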


It matters when you have lots of short-lived processes.


I don't trust anyone who uses camelCase to write Python. Or, unnecessary while loops.

But sometimes you can improve on built-in functions. I found that a custom (but simple) int-to-string function in Go is a bit quicker than strconv.FormatInt() for decimal numbers.

So, there's that.

Python isn't really supposed to be geared towards performance. I like the language, but only see articles like this as resume fodder.


> but only see articles like this as resume fodder.

Not super important, but I think it's worth calling out. Some of us program because it's fun. We dig into problems like this because it scratches an itch. We share in the hopes of interacting with like-minded individuals. Not everything is about resumes.


Indeed. Some of us do it because we stubbornly want to see just how fast we can make a “slow” language go.


Personally, I tend to discount such motivations for a text once that text is interrupted in the middle to sell me something (which a Substack subscription form counts as).


Wow, very unfair. It should not be surprising that people writing Python might actually care about performance, even if it isn't so important that they must write their software in assembly.

The code wasn't the author's, and what's more it's just an example to talk about the bytecode. Everyone knows `min(some_list)` is faster and more Pythonic.

This is a very interesting article and probably exposed a few people to a few interesting Python internals.


I was just referring to the sample code at the top. I should have clarified that and I apologize for the lack of clarity.

But I would like to see the author use a for loop. I'm not sure why anyone would use a while loop if they know the constraints.


So there are only three super-instructions? I wonder if there are plans for more.


They removed most of the superinstructions because they prevented other, more effective optimizations:

https://github.com/python/cpython/issues/105229
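If you're curious which ones survive, disassembling a small function shows them directly (exact output varies by version; if I recall correctly, LOAD_FAST_LOAD_FAST is one of the remaining superinstructions):

    import dis

    def add(a, b):
        return a + b

    # on a 3.13 build this may show LOAD_FAST_LOAD_FAST for loading both arguments
    dis.dis(add)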


I’ll just leave this here:

https://paulgraham.com/avg.html


A few months ago I had to optimize some old Python code and found that my longtime assumptions from Python 1, 2, and early 3.x are no longer true. So I put up some comparisons; check them here:

https://github.com/svilendobrev/transit-python3/blob/master/...

(comment out the import transit.* and the two checks after it, as they are specific to that repo. Takes ~25 seconds to finish.)

The results look like the ones below. Most make sense after thinking about them a bit, but some are weird.

One thing stays axiomatic though: no way of doing something is faster than not doing it at all. Lesson: measure before assuming anything "from-not-so-fresh-experience".

btw, unrelated, probably will be looking for work next month. Have fun.

    $ python timing-probi.py 
    ...
    :::: c_pfx_check ::::
      f_tuple              0.10805996900126047
      f_list               0.10568888399939169
      f_tuple_global       0.10741564899944933
      f_list_global        0.10980218799886643
      f_dict_global        0.09630626599937386
      f_tuple_global20     0.6103107449998788
      f_list_global20      0.6878404369999771
      f_one_by_one         0.05088467200039304
    :::: c_func_glob_vs_staticmethod ::::
      f_glob_func          0.08005491699987033
      f_staticmethd        0.10022392999962904
    :::: c_for_loop_vs_gen_for ::::
      for_loop             0.2255296620005538
      gen_loop             0.29973782500019297
    :::: c_dictget_vs_dict_subscr ::::
      dictgetattr_get      0.05093873599980725
      dictfuncget          0.048424991000501905
      dictin_dictsubscr    0.04722780499832879
      dictsubscr           0.04069488099958107
    :::: c_listget_vs_list_subscr ::::
      listgetattr_get      0.0779018819994235
      listfuncget          0.07271830799982126
      listsubscr           0.057812218999970355
    :::: c_property_vs_funccall ::::
      property             0.08194994600125938
      funccall             0.08422214100028214
    :::: c_a_in_abc_vs_a_eq_b_or ::::
      x_in_str             0.04265176899934886
      x_in_tuple           0.0530087259994616
      x_in_tuple_global    0.05479079300130252
      x_eq_a_or            0.049807468998551485
    :::: c_tuple_vs_slots ::::
      slots                0.17708088200015482
      plain                0.18551878399921407
      tuple                0.07675717399979476
      dict                 0.14878148099887767
      namedtuple           0.2637523979992693
      dataclass            0.18731526199917425
      dataclassfrozen      0.38634534500124573
    :::: c_dictcomp_vs_dict_gen_tuples_vs_loop ::::
      dictcomp             4.476536423999278
      dict_gen_tuples      7.045798945999195
      dict_listcomp        6.461099333000675
      dictloop             4.889642943000581
    :::: c_listcomp_vs_list_gen_vs_loop ::::
      listcomp             1.4269603859993367
      list_gen             2.6340354429994477
      listloop1            1.856299590001072
      listloop2            2.1041324060006446
    :::: c_funccall_args_vs_kargs ::::
      args                 0.14374798999961058
      kargs                0.1689707850000559
      kargs_ignored        0.2658924890001799
      kargs_default        0.26562809300048684


Loops should be avoided in Python. Only constant-time operations should be performed.


AFAIK this is because CPython has to walk up the scopes to find the import on every call in your loop, and this still applies to Python 3, right? You can still use the built-in min; just create a "closer" reference before your loop for the same speedup:

    inline_min = min

    while expr:
        if inline_min(blah):
            ...


That's not how imports work in Python at all.

Imports are very imperative operations: an import first looks for the module in sys.modules; otherwise it executes the module and puts the resulting module object in sys.modules. It then grabs the symbols you asked for and shoves them into the importing module's global dict.

It does have to walk the locals and the globals scopes (there are ONLY exactly two scopes in Python!) to find the function, that's true.
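A quick way to see the sys.modules caching in action (json is just an arbitrary stdlib module here):

    import sys
    import json                     # first import: module code runs, result is cached

    print('json' in sys.modules)    # True
    import json                     # subsequent imports are just a sys.modules lookup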


TIL. Apparently `nonlocal` and closures are implemented with copies, at least for my copy of dis.dis().


> It does have to walk the locals and the globals scopes

right, this is what I meant.


Also I imagine the dict lookup for the module is the slow part? So declaring in local scope just removes a dict lookup?

I am by no means a python expert :) I just use it occasionally.


15+ years ago many people would tell you that you should do “from foo import bar” instead of “import foo” to save the overhead inherent in referencing “bar” as “foo.bar”. Well, almost everything in Python is in the end a dict, and the dict implementation is ridiculously optimized (and in some counter-intuitive ways), so if you care about that overhead you probably should not be using (C)Python in the first place.

This “everything is a dict” is also a part of why removing GIL is not as straightforward as it would seem.
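A rough timeit sketch of that overhead, with math/sqrt chosen arbitrarily:

    import timeit

    # attribute lookup on the module object on every call
    print(timeit.timeit('math.sqrt(42.0)', setup='import math'))
    # plain global lookup of the directly imported name
    print(timeit.timeit('sqrt(42.0)', setup='from math import sqrt'))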


Actually, even on a modern CPU that attribute lookup does add another ~10% penalty. Granted, it's like a 0.1 second difference out of a one-second runtime for 10M iterations, but it's measurable! Although I wouldn't usually be using Python on data sizes like those :)


It makes a slight difference on my system, python 3.12.0

    In [1]: def test_1():
    ...:     i = 10_000_000
    ...:     r = 0
    ...:
    ...:     while i>0:
    ...:         i = i-1
    ...:         r += min(i, 500)
    ...:
    ...:     return r

    In [2]: def test_2():
    ...:     i = 10_000_000
    ...:     r = 0
    ...:     local_min = min
    ...:
    ...:     while i>0:
    ...:         i = i-1
    ...:         r += local_min(i, 500)
    ...:
    ...:     return r

    In [3]: %timeit -n 1 -r 10 test_1()
    2.14 s ± 58.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

    In [4]: %timeit -n 1 -r 10 test_2()
    1.94 s ± 47.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


Benchmark it, it won't be any faster.

Using the if statement: Execution time: 0.00648 seconds

Using the min function directly: Execution time: 0.02298 seconds

Using the min function with an intermediate variable: Execution time: 0.02959 seconds


How about using the `min` builtin directly over the entire list?


Sure, let's benchmark it :)

It is consistently around 8% to 15% faster on 3.10.12 and 3.11 for me. On 3.12.5 (latest) I seem to get the same result.

https://www.online-python.com/B6AgKW5zod

(please copy the code to your local to not ddos this site :D)


I agree with you. It's a weird quirk of Python that localizing a name makes lookups more efficient.



