One of the performance improvements mentioned is "Remove the usage of the C stack in Python to Python calls" [0]. Since Python 3.11, a Python-level function call can be evaluated within the bytecode interpreter loop, no longer requiring a C-level function call.
Interestingly, Lua 5.4 did the opposite. Its implementation introduced C-level function calls for performance reasons [1] (although this change was reverted in 5.4.2 [2]).
At the heart of the Lua and Python interpreters, there's an evaluation function, written in C, that does something like this:
void evaluate() {
    while (1) {
        instr = get_next_bytecode_instruction();
        switch (instr.type) {
            // dispatch on the opcode and run the instruction
        }
    }
}
The question is how to run a CALL bytecode instruction. One way is to use the C stack: call evaluate() recursively. The other way is to have the interpreter maintain its own call stack, with the CALL instruction responsible for updating it. That way the interpreter stays in the same while(1) loop the whole time.
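To make the difference concrete, here is a deliberately tiny toy in Python (this is not CPython's or Lua's actual code; the Frame class and the toy opcode set are invented purely for illustration):

class Frame:
    def __init__(self, code):
        self.code, self.pc = code, 0
    def fetch(self):
        op = self.code[self.pc]
        self.pc += 1
        return op

def eval_recursive(frame, functions):
    # "Use the C stack": a CALL re-enters the evaluator recursively,
    # so every interpreted call consumes a native stack frame.
    while True:
        op = frame.fetch()
        if op == "RETURN":
            return
        eval_recursive(Frame(functions[op]), functions)   # any other opcode is a call

def eval_own_stack(frame, functions):
    # "Own call stack": CALL pushes a frame, RETURN pops one,
    # and the interpreter never leaves this single loop.
    stack = [frame]
    while stack:
        op = stack[-1].fetch()
        if op == "RETURN":
            stack.pop()
        else:
            stack.append(Frame(functions[op]))

functions = {"leaf": ["RETURN"], "main": ["leaf", "leaf", "RETURN"]}
eval_recursive(Frame(functions["main"]), functions)
eval_own_stack(Frame(functions["main"]), functions)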
Using the C stack can be faster, because you take advantage of the native instruction pointer & return address instead of emulating them in C. But it can also be slower if calling evaluate() is slow.
When one Python function calls another, does that ultimately result in the CPU executing a `call` instruction, or not?
Python 3.11 changed from "yes" to "no" for performance reasons.
Lua, OTOH, changed from "no" to "yes" in 5.4.0, again for performance. It then changed back to "no" in 5.4.2 for other reasons, despite the performance regression.
(In Lua 5.1, it is `coroutine.resume`/`yield`, rather than ordinary call/`return`, that is implemented using C-level call/`return`. That would preclude implementing call/`return` the same way, but since 5.2 the coroutine functions are instead implemented via call/`longjmp`.)
In JITted code? Nothing, it’s a pure tracing JIT so all calls are flattened. In the interpreter? We can always check the source[1], and it seems it implements its own stack.
Here we go, another HN roundabout criticism of Python performance.
Here’s a real-world example. I recently did some work implementing a DSP data pipeline. We have a lot of code in Go, which I like generally. I looked at the library ecosystem in Go and there is no sensible set of standard filtering functions in any one library. I needed all the standard ones for my use case (Butterworth, Chebyshev, etc.), and what I found was that they were all over the place: some libraries had one or another, but none had everything, and they all had different interfaces. So I could have done it in Go, or I could have kept that part in Python and used SciPy. To me that’s an obvious choice, because I and the business care more about getting something finished and working in a reasonable time, and in any case all the numeric work is in C anyway. In a couple of years, maybe the DSP ecosystem in Go will be better, but right now it’s just not ready. This is the case with most of our algorithm/ML work. The orchestration ends up being in Go, but almost everything scientific ends up in Python, as the ecosystem is much, much more mature.
Speaking as an ML/AI guy: Python is, imo, the greatest DSL of all time. You get great access to C/C++/Fortran/CUDA code while getting a very high-level language for scripting the glue between processes.
That is fair enough, but using Python as glue with existing C libraries is the acknowledged use case for Python. I think it is even admitted in this thread.
But for that use case the claimed performance improvements in pure Python (which I can rarely replicate) are not relevant. They could leave the interpreter as it was for +20 years.
For web applications, migrating to Go or PHP should not require much thought apart from overcoming one's inertia. Hopefully the scientific stack will be rewritten in better languages like Go, Lisp, OCaml, etc.
> Hopefully the scientific stack will be rewritten in better languages like Go, Lisp, OCaml, etc.
I wouldn't hold my breath on that, because Python can be used interactively and is relatively mature; the hard plumbing work is done by real computer scientists, and all that's left is to play with the code until the numbers look right for a given piece of research.
It's just a better MATLAB in practice for some disciplines, and none of the languages we have provides this kind of flexibility with that amount of leeway for play. Zip the virtualenv and send it over. If it doesn't work, just delete the folder and reinstall Anaconda...
Nobody is thinking about the performance of their code or the elegance of what they do. They care about the papers, deadlines and research in general. Code quality, general performance, and organization can go wherever they like. All the underlying high-performance code is written in C anyway, and only the implementers of that code care about it. It's akin to Docker for (data) scientists, mostly.
How do I know? I'm an HPC admin. I see that first hand.
Julia is a REPL-driven interactive system that was syntactically targeted specifically (to its detriment, IMO) at Matlab users while Python for science was on the rise. Julia code does usually run faster than Python stuff without a lot of care & feeding for either, but Julia just didn't catch on quickly enough and will probably always be playing catch-up.
Julia/Python/R are even right there in the name Ju-Py-te-R, although Jupyter now supports something like 40 languages, and "notebook" is probably a reference to the 1991 Mathematica 2.0 Notebook Front End (and obviously the name was mostly a coining/play on words, not a statement of priorities). Personally, I think the often trivial-to-modest extra effort of working with static types pays off in correctness, specificity and performance, but 95+% of REPL-oriented systems (by number, not mindshare; more like 99.9+% by mindshare) are lax/dynamic about types by default, with maybe the Haskell Hugs interpreter and a few others as exceptions. Cython with cdef, or Common Lisp with (declare), do at least allow gradual typing, and arguably the "type stable or not" flavors of the Julia debates are roughly in that camp. Like most "defaults", though, most people just use the default even if other options are available. That's why, to me, it's worth making users be a little more explicit, as Nim does.
Anyway, I am just discussing alternatives floating about, not trying to be disputatious. Mostly people / scientists use whatever their colleagues do, in a network-effects way, with ever-so-occasional generational shifts. Sometimes research communities are held hostage not just to a specific programming language, but to a very specific code base - certain kinds of simulators, reaction kinetics engines, and so forth. (TensorFlow or PyTorch might be a contemporary analogue more HN folk could relate to.)
Common Lisp does, but then again, only druids care about its existence and the tales of strange machines, powerful graphics workstations, that used to be fully written in languages full of parentheses.
To the point that people don't realize notebooks are how those machines' REPLs used to work.
Exactly, but it's undeniable that even updated it does have some ergonomics issues. It's just that its users care less about developer ergonomics. AFAIK, there were plenty of attempts to build a replacement (e.g. Fortress); they didn't catch on.
Fortran is in a weird spot. It's incredible for writing the hardest parts of the programs it's used for, but it honestly sucks for doing the "boring" parts. Doing heavy vectorized workloads in Fortran is nice, but doing I/O, organizing code, using any data structure that isn't essentially an array of some sort, etc. all suck.
There is a reason Fortran has survived so long. Writing matrix/vector-heavy code in Fortran is (imo) still more pleasant than C, even when accounting for the idiosyncrasies of writing a Fortran program in general. The skill floor is significantly lower for writing very high-performance code.
In a way they would be better off using Fortran instead of Python.
Every scientific result that relies on a snapshot of 50 packages (whose resolution in conda is an NP-hard problem) should be qualified with an added hypothesis:
Given Python-3.x.y and [50 version numbers of packages], we arrive at the following result.
Even in Python itself, the random number generator was broken for more than a year by one of the people who actively mob others for rejecting their patches (in which they might have a monetary interest, since they got paid for many of their patches, making it a gigantic conflict of interest).
This invalidated many scientific results from that period.
I had bad experiences trying to productionise Julia code in the past. I realise the pre-compilation JIT-ing story is now quite different but it left a bad taste in my mouth with the community basically saying that it wasn't a problem when it quite clearly was.
I'm always shocked at how much performance we leave on the table in these places. I did a super quick benchmark in python[0] and go [1] (based on necovek's[2] comment) and ran them both locally. The go implementation runs the entire benchmark quicker than the fastest of the python ones.
The deviation in go's performance is still large, but far less so than Python's. Making the "wrong" choice for a single function call (bearing in mind that this is 10k iterations so we're still in the realm of scales even a moderate app can hit) in python is catastrophic, making the wrong choice for go is a significant slowdown but still 5x faster than doing it in Python. That sort of mental overhead is going to be everywhere, and it certainly doesn't encourage me to want to use python for a project.
Development speed and conciseness is what sells Python. Something simple like being able to do `breakpoint()` anywhere and get an interactive debugger is unmatched in Go.
Development speed in Go blows Python out of the water when approaching the project as an engineering project, but Python has it beat when approaching the project as a computer science project.
And therein lies the rub for anything that is pushing the modernity of computer science (AI/ML, DSP, etc.). When born in the computer science domain, much of the preexisting work able to be leveraged is going to be built around Python, not Go (or any other language, for that matter), and it takes a lot of work to bring that to another system.
It is pretty clear that just about anything other than Python is used in engineering when the science end is already well explored and the aforementioned work has already been done, but the available libraries are Python's selling feature otherwise.
In a Python debugger, you can trivially augment the state of the program by executing Python code right there while debugging.
I don't remember what you can do with Go's debugger, but generally for compiled programs and GDB, you can do a lot, but you can't really execute snippets of new code in-line.
The simplicity is in the fact that Python debugger syntax is really mostly Python syntax, so you already know that if you've written the Python program you are debugging.
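For example, a minimal sketch (the function is made up; the commands at the prompt are standard pdb):

def price_with_tax(prices, rate):
    total = sum(prices)
    breakpoint()   # drops into pdb right here, with prices, rate and total in scope
    return total * (1 + rate)

price_with_tax([10, 20, 30], 0.1)

# At the (Pdb) prompt you can run ordinary Python against the live frame, e.g.:
#   p total                     -> inspect a value
#   total = 999                 -> rebind a local on the fly (reliable as of 3.13, per PEP 667)
#   p [x * 2 for x in prices]   -> evaluate arbitrary new expressions
#   c                           -> continue execution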
This objection only works in comparisons of Python vs Java, C++ and perhaps Rust (compile times). It doesn't really hold up against more modern languages such as Kotlin, or heck, even Groovy, which is an ancient relic in JVM land.
Yeah, but Go isn't the only alternative, and something like `breakpoint()` anywhere goes back to Smalltalk and Lisp, with performance Python doesn't offer.
CPython deliberately chooses to prioritize simplicity, stability and easy C extensions over raw speed. If you run the benchmark in PyPy (the performance-optimized implementation) it’d be 10X faster. You could argue anything that’s performance-bottlenecked shouldn’t be implemented in Python in the first place, and therefore C compatibility is critical.
I don't have easy access to pypy so I can't run the benchmark on that I'm afraid. I'd love to see some relative results though.
> You could argue anything that’s performance bottlenecked shouldn’t be implemented in Python in the first place and therefore C compatibility is critical.
Honestly, I think I'm heading towards "python should only be used for glue where performance does not matter". The sheer amount of time lost to waiting for slow runtimes is often brought up with node and electron, but I ran the benchmark in node and it's about halfway between go and python.
This is something I find quite fascinating. I've seen so much time wasted on Python this way it's incredible.
Insanely smart optimizations done. But then you rewrite the whole thing in vanilla Rust without profiling once and it's still 20x faster, and much more readable than the mix of numpy vectorization and C-routines.
It's kind of knowingly entering a sunk-cost fallacy for convenience. Before the cost is even sunk.
Python is often just the wrong choice. And it's not even that much simpler if you ask me.
In a language where you use value-based errors, have powerful sum types and strong typing guarantees, you can often get there faster, too IMO, because you can avoid a lot of mistakes by using the strictness of the compiler in your favor.
> Insanely smart optimizations done. But then you rewrite the whole thing in vanilla Rust without profiling once and it's still 20x faster, and much more readable than the mix of numpy vectorization and C-routines.
I'd argue that if you can rewrite the whole thing in Rust (or any other language), then it's not really a large project and it doesn't matter what you originally wrote it in.
On multiple occasions I had to port parts of Python projects to C++ for performance reasons, and the amount of code you have to write (and time you need to spend) is mind-blowing. Sometimes a single line of Python would expand to several files of C++, and you always had to redesign it somehow so that memory and lifetime management became feasible.
Python is often the right choice IMO, because most of the time the convenience you mention trumps all other concerns, including performance. I bet you would get tons of examples from people if you "Ask HN" about "how many quickly written Python prototypes are still out there years later because their C++ replacement could never be implemented on time?"
A one-liner in Python is replaced by hundreds of lines of changes to implement in C++. (The one-liner has some boilerplates around it too, but the C++ function itself is longer and the boilerplates around is even more.)
Technically right, but the original code only optionally depends on Numba, and Numpy is ubiquitous in scientific Python.
The problem I was trying to solve in all that boilerplate is basically that Numpy still has overhead, and dropping to a lower level is trivial if Numba is used, but the package maintainers don’t want to introduce a Numba dependency for various reasons. And the diff is showing the change from Numpy/Numba to C++ with pybind11.
That doesn't change the fact that it's not representative of how many lines something takes in python vs C++. If you replaced it with a call to native that used a third party library it would be back to being one line.
First, no sane mind will think one single example will be representative to anything.
Second, if you want to focus on 3rd party or not, sure. But you could also think Numba jit vs C++ and in this case it puts them on a more equal footing to compare them.
As far as Numba is concerned, Numpy is effectively part of its stdlib, since it isn’t calling into Numpy but JIT-compiling it.
You may say Numba is not Python, true, but you could consider Numba an implementation of a subset of Python.
We can argue about the details but the example is not that long to read. My summary and your summary are both at best distilled through our own point of view a.k.a. biases. Anyone can read the diff itself to judge. That’s why I link to it.
My point is that you replaced calling a third party library from python with an implementation in native. You could easily replace a c++ third party library with a full python implementation of the same thing and come to the opposite conclusion.
The example matters because it’s bad - you’re not comparing like for like.
There's a huge swathe of languages between Python and C++. On threads about Node's performance (which is somewhere in between) people regularly say that it's unfair of developers to solely focus on devex, which is the argument here.
> On multiple occassions I had to port parts of Python projects to C++ for performance reasons, and the amount of code you have to write (and time you need to spend) is mind blowing. Sometimes a single line of Python would expand to several files of C++, and you always had to redesign it somehow so that memory and lifetime management becomes feasible.
Part of that is because it's designed for python. Using this benchmark example, the python code (without imports) is 36 lines and the go code is 70. Except, we're doing a 1:1 copy of python to go here, not writing it like you would write go. Most of the difference is braces, and the test scaffold.
I'd also argue the go code is more readable than the python code, doubly so when it scales.
> it's unfair of developers to solely focus on devex, which is the argument here.
For most companies and applications, developer time is many orders of magnitude more expensive than machine time. I agree that, if you are running applications that run across thousands of servers, then it's worth writing them in the most performant way possible. If your app runs on half a dozen average instances in a dozen-node K8s cluster, it's often cheaper to add more metal than to rewrite a large app.
There is more to it than just time. If your app is customer facing (or even worse, runs on your customer's own resources), high latency and general slowness is a great way to make your users hate it. That can cost a lot more than few days of developer time for optimization.
There is, of course, a limit. OTOH, if your devex is so terrible you can’t push out improvements at the same cadence of your competitors, you are doomed as well.
No, it’s more expensive to you, and you get to ignore the externalities. It’s like climate change - you don’t care because it doesn’t affect you.
How many hundreds of hours are wasted every day by Jira’s lazy loaded UI? I spend seconds waiting for UI transitions to happen on that app. How much of my life do I spend waiting for applications to load and show a single string that should take 50ms to fetch but instead takes 30 seconds to load a full application?
I don't know, but personally I've spent orders of magnitude more time debugging memory leaks in languages without GC than the time I've wasted waiting for shitty apps to load.
Software bloat is a real thing, I agree, but it's a fact of life and in my opinion not a hill worth dying on. People just don't care.
If you want proof of this, try to find a widely used Jira alternative which doesn't suck and doesn't need a terabyte cluster to run. There isn't one because most people just don't care about it, they care about the bottom line.
Devs don’t care, because they aren’t being forced to care, and they have loaded M3 MBPs, so they don’t notice a difference.
I firmly believe that we’ll see more of a swing back to self-hosting in some form or another, be it colo or just more reliance on VPS, and when we do, suddenly performance will matter. Turns out when you can’t just magically get 20 more cores on a whim, you might care.
Not only that - when your ticket system becomes a Jira because of all the feature creep required, it’ll perform like a Jira.
Perhaps a little better if you can implement more in less crazy ways (it seems to keep a lot of logic into a relational database, which is a big no-no for the long term evolution of your system).
How many years of our lives would be saved if Jira was rewritten in Verilog and run on the fastest FPGAs?
Not much. I bet it spends most of the time looking up data on their massive databases, which do the best they can to cope with the data structures Jira uses internally. You could, conceivably, write a competitor from scratch in highly optimised Rust (or C++) and, by the time you have covered all the millions of corner cases that caused Atlassian to turn a ticketing system into Jira, you’ll have a similar mess.
This ad absurdum argument is ridiculous. Nobody is talking about writing web apps in Verilog except you.
I can’t claim to know why Jira is so slow - I don’t work for Atlassian. It was an example of an app I use daily on which I spend minutes per day staring at skeletons, loading throbbers and interstitials.
Core architectural choices like “programming language” set a baseline of performance that you can’t really fix. Python is unbelievably slow and your users pay that cost. There are lots of options in between that offer fast iteration speeds and orders of magnitude faster runtime speeds - go and .net are widely popular, have excellent ecosystems and don’t require you to flash an FPGA.
>For most companies and applications, developer time is many orders of magnitude more expensive than machine time.
For me the context was mostly working in DevOps CI/CD. Slow CI scripts slow down developer feedback and cost machine time as well as developer time.
If your customers are your developers, performance engineering is about saving the expensive developer time.
And if you ask me the effects are non-linear, too.
If your build system takes 1 minute or 10 minutes for a diff build the time lost won't be a mere factor of 10. Any sufficiently long time and your developers will lose focus and their mental model, will start making a coffee, and maybe they might come back later to that work after 3 meetings that got scheduled in between.
It all depends on execution time. If you run it just a few times or the total execution time doesn't take days, you can do it faster in Python end-to-end.
> I've seen so much time wasted on Python this way it's incredible.
It sort of goes both ways. You'll see a lot of time wasted on Python when the developers probably could have known it would need better performance. On the flip side, you also see teams choosing a performant language for something which will never need it. One of the reasons I've come to prefer Go over Python is that you sort of get the best of both worlds, with an amazing standard library. The biggest advantage is probably that it's harder to fuck things up with Go. I've seen a lot of Python code where the developer used a list rather than a generator, or loops instead of built-in functions, and so on.
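A small illustration of the list-versus-generator point (the numbers are arbitrary):

# Builds the full 10M-element list in memory before summing:
total = sum([x * x for x in range(10_000_000)])

# Streams values one at a time instead; same result, far less memory:
total = sum(x * x for x in range(10_000_000))

# And a builtin usually beats a hand-written loop in CPython:
smallest = min([5, 3, 8, 1])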
That being said, I think we'll see a lot more Rust/Zig/C/C++ as things like ESG slowly make their way into digitalisation decisions in some organisations. Not that it'll necessarily make things more performant, but it sounds good to say that you're using the climate-friendly programming language. With upper management rarely caring about IT... well...
> But then you rewrite the whole thing in vanilla Rust without profiling once and it's still 20x faster
The main reason Python is used is because developer time is expensive and computer time is cheap. When that balance shifts the other way (i.e. the compute slowdown is more expensive then the dev time), Python is no longer the right choice.
There's no point in developing at half speed or less in Rust compared to Python if your code only runs a few times, or if the 20x performance gains only results in 60 minutes of total compute time saved.
>There's no point in developing at half speed or less in Rust compared to Python
In a lot of cases the Rust development speed is faster IME, because you can really get the correctness on the first go much more often than with Python. Also proc-macros make dealing with serialization and deserialization or CLI parameters a breeze.
A lot comes down to familiarity, to be fair. I’m not a SWE by trade, I’m an SRE / DBRE. I can read code in just about any language, but the only one I consider myself really good in is Python.
I could and should learn another (Rust seems very appealing, yes), but at the same time, I find joy in getting Python to be faster than people expect. C being leaned on is cheating a bit, of course, but even within pure Python, there are plenty of ways to gain speedups that aren’t often considered.
You can enforce strict typing by using MyPy and Rust. Those are part of modern python development stack. If your program can be rewritten in python you are definitely not developing Fullstack Web Development or DataScience / Machine Learning.
Rust isn't good at those, since there are no good frameworks for DL/ML stuff, and working with unstructured data, which is very much necessary in those fields, goes against Rust's grain.
I dislike Go as a language for a variety of reasons, but nothing really objective.
I can agree with your sentiment that if performance is important from the outset, that writing everything in a language more performant by default is a smart move.
In a particular case for me, the project evolved organically, and I learned performance bottlenecks along the way. The actual functions I rewrote in C were fairly small.
If you dislike go, then c# is another great alternative. I chose go because it’s fast, simple and the iteration times for go are often quicker than python apps in my experience
I think the safest way to understand Python is that performance is a “nice to have” feature, traded off for legibility and flexibility and even simplicity of the CPython code base (go read it! You barely need to understand C and a lot of it is written in Python!).
Don’t waste time being surprised that you can do better than the default implementation. Just assume that and do so when you’ve measured that it matters.
That being said, (mostly) free performance is always nice. I’m glad they’re working on performance improvements where they can do it without sacrificing much.
Everyone needs to stop comparing Nim with Python. Nim is closer to Pascal, and the only thing it shares with Python is indentation-based syntax. It doesn't have the expressivity and flexibility of Python. It's a nice language, but there is no good reason to compare it to Python.
I think the comparisons with Python are driven by how divisive it is to have leading whitespace be significant (though Haskell also has it, without as much divisive discussion). There was some effort in Nim to have some of the standard library names mirror Python, such as `strutils.startswith(str, "oops")`. That said, beyond a few things like that, it is true that Pascal/Ada, made more terse, are more of an inspiration.
I know both languages well and personally I consider Nim to be more expressive and syntactically flexible than Python although it is admittedly a difficult thing to quantify (which "more"/your comment framing implies). As just a few examples to make the abstract more concrete, in Nim you can add user-defined operators. So, with https://github.com/SciNim/Measuremancer I can have let x = a +- b or let y = c ± d construct uncertain numbers and then say y/x to get a ratio with uncertainty propagated. Function call syntax is more flexible (and arises more often than uncertain value algebras), e.g. "f(x)", "x.f", "f x", and even "f: x" all work for any kind of routine call. The need to be specific about let/var constancy & types is slightly more verbose but adds semantic/correctness value (e.g. first use). Just generally Nim feels like "much more choice" than Python to me. In fact, the leading whitespace itself can usually be side-stepped with extra ()s which is not true in Python.
I am speculating, but I think what might make you feel Python is more expressive may come from ecosystem size/support effects such as "3 external libraries exist that do what I want and I only need to learn the names of how to invoke whichever". So, you get to "express" what you "want" with less effort (logic/coding), but that's not really PLang expressivity. Python is more dynamic, but dynamism is a different aspect and often leads to performance woes.
Both are general-purpose, high-level programming languages. Why not compare them? Say I am considering starting a new programming project; I need to choose a language. Nim looks fine, so does Python, so which one do I choose? I need to compare those languages to make a motivated choice, right?
Anyhow, what are we allowed to compare with Python?
> Nim looks fine, so does Python, so which one do I choose?
Assuming you need to use libraries for stuff, you choose the language that has the better ecosystem, more users and more StackOverflow support. As a company you choose the language that makes it easier to hire devs in, and move them around from a project to another. Only when execution performance trumps all other concerns you select the faster/stricter language, and only for mature projects not for a POC which might be thrown away soon.
So you use the strict/performant languages for mature, execution intensive code that doesn't need much library support and where you don't need to hire many devs.
I really really want nim to catch up. It's a lovely language. I've used it for small projects and it is great but I couldn't use it for anything complex that requires third-party libraries. The ecosystem is just not there.
Pretty much. When Python was at this stage, what alternatives did a developer have than to improve Python/write a library? Nim has to “compete” with a solution already being available.
Python was a curiosity/tiny niche for over a decade from the earliest 0.9 releases I used. I was advocating it to thoroughly deaf ears in the mid-90s with rejoinders of "just use Common Lisp".
Then several things happened. 1) Large businesses like Google openly saying they used Python in the early 2000s helped for things like web/Internet work. 2) Linux distros like Gentoo leaning on it heavily helped push it in systems/sysadmin. 3) The "Numeric -> NumPy/SciPy" transition helped a ton, spurred by things like `f2py` and the exorbitant licensing costs of Matlab, the main competitor, which also had a lot of 1970s/early-80s language problems like one function per file, scoping craziness, not doing reference semantics for big matrices, and junk like that. Then 4) the culture of open source in scientific software helped build out that aspect of the ecosystem as 1/2/3 were growing fast. Then 5) by the late noughties/early teens universities were switching to it as a teaching language, and finally 6) ML/data science as a specific area of the NumPy/SciPy world started taking off, in hype cycles and even real applied work.
This is just a long-winded (but still woefully incomplete) / historical way of saying "software has network effects" (not just PLs, but OSes and applications as well), of course. As Jack Reacher likes to say "details matter in an investigation". ;-) I don't think Python was purely lucky, exactly, but with one or two those big reinforcement/feedback factors missing, it might not have become so dominant.
Not to be a stickler, but the only fair benchmark would be to select the most recent stable version of each. Doing so is fairly easy with docker I would think, if you don't wish to modify your system.
I wouldn't say you are a stickler. But the most productive addition to the discussion would be to do the benchmark as you see fit and then show us the results.
3.11? The article explains there are improvements on that specific benchmark in newer versions. Not on the order of magnitude that it would catch up here, but still significant.
I once was using a bunch of handmade python scripts to work with terrain heightmaps and ground terrain textures in an attempt to improve FSX visuals before flight sim 2020 was announced.
I had 12Kx12K pixel images of the ground in spring, and wanted to make some winter versions for fun. Some basic toying around showed you could reasonably resample a "snow" texture into these photos wherever you had pixels that were "mostly green", ie grass or trees.
I wrote this up in python using the simple PIL python library for working with image data. It took 45 minutes per image.
I fired up a text editor and within 30 minutes had a java version of the code, most of that time was spent remembering how to write java.
30 seconds per image.
I don't care what nonsensical "optimizations" I could have done, like oh just pull it into numpy arrays and work there, oh PIL is slow for pixel manipulation so don't do that, oh you could rewrite it like X and it will be faster
I don't care. The simple, naive, amateurish attempt in Java, a language I hadn't touched in 6 years, blew the pants off of the python version, a language I used at my day job. It took zero more effort to write in java than in python. Python is great for doing simple one off scripts, but it is terrible at actual work.
Add to that, the tooling around python is mediocre, and for some reason python programmers don't seem to care, while the tooling around java is so good IDEs basically write all the supposedly awful boilerplate code for you. Stuff that I had to do by hand in python was automated in java.
I had a similar experience when I was in high school and learning Python. I wrote an image->ASCII converter using PIL as you did; one image would take a few hundred ms and it was fine, but as I played with it and added more features/learned more, I tried to run it on video frames extracted with ffmpeg, and it would take at least 4x the duration of the video, or even more, to convert it to ASCII video.
A few months ago I coded it in Go, since I got back into programming, and it was significantly faster (the overall code was also smaller, but maybe that's just because I have a bit more experience now), even though in both cases the implementation was the most naive one I could come up with, without any optimizations.
Edit:
I went ahead and tested both; it took me 30 minutes to set them up with identical parameters on 3602 frames (roughly a 1-minute video).
Keep in mind that this is just a naive implementation with 2 nested loops iterating in a step of 12 over the pixels of the image, no fancy math or numpy used in the Python version. Image for reference [0].
Also keep in mind that the step and the length of the ASCII charmap used affect performance, but here I used a simple one, "_^:-=fg#%@", so it's a bit easier for the Python code. Here are the results on my i7 9th-gen laptop:
1 Minute video 720p 60fps
GO - Execution time: 359647 ms, 5.9 minutes ("asciifying the frames") + 30~ seconds ffmpeg video conversion
Python 3.12.3 - Execution time: 0:16:25.002945, minus ~25-30 seconds ffmpeg
A lot of these notes follow a similar pattern: we start with Python, get a working program, and then rewrite it in X with a huge performance gain.
That’s fair. We use the easy language to get a correct program, then we use the “hard” language to get a fast program, now that we know what a correct one looks like.
This is the key here. If you try to start with the lower level language, you might spend a much longer time getting to the correct program.
It’s much easier to optimise a program that’s working than to debug one that isn’t.
There are other candidates, only listing the ones with good out of the box REPL experience, JIT/AOT toolchains in the reference implementation, and not going crazy with their type systems for strong typing.
One of the strengths python has is its ecosystem. All the languages you listed don’t come close to python’s ecosystem except perhaps TypeScript/nodeJS (whose quality of ecosystem is questionable although improving in recent times).
No, this was very much NOT my point. The java version of my program was just as easy if not easier to write, precisely because I COULD write the naive version and trust that it would work out fine performance wise.
Hell, the java version didn't even require the extra library (Pillow/PIL) that python required!
This is more a measurement of looping, calling the min function, and printing than of the language itself. All benchmarks are that to some degree, of course, but I really don't think this is a good comparison.
The dynamic aspect of Python is once again probably the major penalty here.
One major difference here is that in Python you can overwrite min(), so it has to access this function through more abstraction, but not in PHP. I bet PyPy is close to PHP here however.
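For example (contrived on purpose), nothing stops a module from rebinding the name, which is why CPython has to look it up on every call:

import builtins

def min(*args):                # legal: shadows the builtin within this module
    print("intercepted!")
    return builtins.min(*args)

print(min(3, 1))               # so `min` can never be assumed to mean builtins.min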
You might be right about the min() call being the major penalty.
When I avoid the min() call like this ...
min2.py:
i = 10_000_000
r = 0
while i>0:
    i = i-1
    if i<500: r += i
    else:     r += 500
print(r)
I get a speedup of about 50%:
time python3 min2.py
4999874750
real 0m1.338s
Still 4x slower than PHP though.
With PyPy I get an amazing speedup of 96%:
time pypy3 min.py
4999874750
real 0m0.088s
4 times faster than the PHP version!
I'm not sure what the state of things is regarding PyPy and running a web application. Doing a bit of googling, it seems that calling C functions from PyPy is slow, so one would have to check if the application uses those (for example to connect to the database).
FWIW, if I place the code in a function, so that variable lookups are constant-time rather than a dictionary lookup in the module globals, then (on my Mac, with Python 3.12) I get a two-fold speedup. That is:
import time

def spam():
    t1 = time.time()
    i = 10_000_000
    r = 0
    while i>0:
        i = i-1
        if i<500: r += i
        else:     r += 500
    t2 = time.time()
    print(r)
    print("Time:", t2-t1)

spam()
reports 0.87 seconds outside a function and 0.43 seconds when inside a function.
(Using the shell's time, of course, also includes the Python startup time. When I use it I get 0.94 and 0.49 seconds, indicating a 0.06s startup penalty.)
Python 2.1 added static nested scopes (see the PEP 227 link), which enabled a much faster locals lookup. With that one case excluded, all of the locals can be determined at parse/byte-compile time, stored in a simple vector, and referenced by offset:
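Something like this makes the split visible (illustrative only; the exact bytecode varies by Python version):

import dis

g = 0

def f():
    x = 1           # a local: compiled to a fixed slot in the frame's value array
    return x + g    # x becomes LOAD_FAST (indexed), g becomes LOAD_GLOBAL (dict lookup)

dis.dis(f)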
I was surprised that you cannot access dynamically created local variables even though they exist.
Looks like the local dictionary still exists. Just that whether accessing a local variable looks it up in the local or global dictionary is determined at compile time.
> PEP 667: The locals() builtin now has defined semantics when mutating the returned mapping. Python debuggers and similar tools may now more reliably update local variables in optimized scopes even during concurrent code execution.
> To ensure debuggers and similar tools can reliably update local variables in scopes affected by this change, FrameType.f_locals now returns a write-through proxy to the frame’s local and locally referenced nonlocal variables in these scopes
Interesting. As I understand it, it will not affect my example:
def f():
    exec('y = 7')         # will still create a local variable
    print(locals()['y'])  # will still print this local variable
    print(y)              # will still try to access a global variable 'y'
Python 3.12.2 ...
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...     exec('y = 7') # will still create a local variable
...     print(locals()['y']) # will still print this local variable
...     print(y) # will still try to access a global variable 'y'
...
>>> f()
7
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 4, in f
NameError: name 'y' is not defined
Messing with locals has been a long-time danger zone. I don't know how it is supposed to interact with exec - another danger zone - and I don't know enough about the changes in 3.13 to know if it's resolved.
I got very little speed-up on pypy from being in a function (0.043 vs 0.033), but I got a 1.8X speed-up for py2 vs py3 (0.7686/.4322 = 1.78). (All programs emit 4999874750 as expected.)
I recall, when Python 3 was first going through its growing pains in the mid noughties, promises of eventually clawing back performance. This clawback now seems to have been a fantasy (or perhaps they have just piled on so many new features that they clawed back and then regressed?).
Anyway, Nim is even faster than PyPy and uses less memory than either version of CPython:
doIt.nim:
proc doIt() =
  var i = 10000000
  var r = 0
  while i > 0:
    i = i - 1
    r += min(i, 500)
  echo r
doIt()
#TM 0.024735 wall 0.024743 usr 0.000000 sys 100.0% 1.504 mxRM
In terms of "how Python-like is Nim", all I did was change `def` to `proc` and add the 2 `var`s and change `print` -> `echo`. { EDIT: though if you love Py print(), there is always https://github.com/c-blake/cligen/blob/master/cligen/print.n... or some other roll-your-own idea. Then instead of py2/py3 print x,y vs print(x,y) you can actually do either one in Nim since its call syntax is so flexible. }
It is perhaps noteworthy that, if the values are realized in some kind of array rather than generated by a loop, that modern CPUs can do this kind of calculation well with SIMD and compilers like gcc can even recognize the constructs and auto-vectorize. Of course, needing to load data costs memory bandwidth which may not be great compared to SIMD instruction throughput at scales past 80 MiB as in this problem.
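In Python terms, NumPy is the usual way to get that array-at-a-time behaviour; a sketch of the same clamp-and-sum as the loop benchmarks above:

import numpy as np

i = np.arange(10_000_000, dtype=np.int64)   # ~80 MiB of values realized in memory
r = int(np.minimum(i, 500).sum())           # clamp and sum run as vectorized C loops
print(r)                                    # 4999874750, same as the loop versions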
> very little speed-up on pypy from being in a function
I believe PyPy is clever and checks for new or removed elements in module and builtin scope. If they haven't changed then lookup can re-use previously resolved information.
> clawing back performance
On my laptop, appropriately instrumented, I see Python 3.12 is faster than 2.7. (Note that they use different versions of clang):
Good data. Thanks. I got a bit of a speed-up on python-3.12.5 - 0.599 down from 0.76863, but still slower than 2.7 on the same (old) i7-6700k CPU. { EDIT: All versions on both CPUs compiled on gcc-13 -O3, FWIW. }
On a different CPU i7-1370P AlderLake P-core (via Linux taskset) I got a 1.5X time ratio (py3.12.5 being 0.371 and py2.7 0.245). Anyway, I should have qualified that it's likely the kind of thing that varies across CPUs. Maybe you are on AMD. And no, I sure don't have access to the CPUs I had back in 2007 anymore. :-) So, there is no danger of this being a very scientific perf shade toss. ;-)
Same pypy3 also showed very minor speed-up for being inside a function. So, I think you are probably right about PyPy's optimization there and maybe it does just vary across PyPy versions. Not sure how @mg several levels up got his big 2X boost, but down at the 10s of usec to 10s of ms scale, just fluctuations in OS scheduler or P-core/E-core kinds of things can create confusion which is part of the motivation for aforementioned
https://github.com/c-blake/bu/blob/main/doc/tim.md.
I should perhaps also have mentioned https://github.com/yglukhov/nimpy as a way to write extensions in Nim. Kind of like Cython/SWIG like stuff, but for Nim.
I've tested a particular python webserver in PyPy and saw roughly the same performance as CPython on average. My guess at the time was that PyPy has a bigger overhead for network sockets than CPython. You would likely see a gain from using PyPy if the webserver did heavier processing in pure python than what mine does.
I’m not sure it’s worth investing too much effort in such optimisations unless they can yield perceived performance improvements or cost reductions. For Google, a 0.1% decrease in power consumption pays for itself in hours, but for mere mortals (and regular companies) it often doesn’t.
I just wanted to make sure I wasn't passing up on a free 400% boost or something :-) but it was already running plenty fast, so it was entirely out of curiosity.
I think you still need to do the templating in PHP, because you usually want to escape HTML chars. And because you want to separate code and layout. Having "Hello <?=$name?>" in the template is usually not what you want. It is also harder to read than "Hello {NAME}".
Let's test templating in Python and PHP!
template.py:
def main():
    t = 'Number {NR} is cool'
    i = 10_000_000
    r = 0
    while i>0:
        x = t.replace('{NR}', str(i))
        r += len(x)
        i = i - 1
    print(r)

main()
Smalltalk, SELF and Common Lisp are just as dynamic, with great JIT implementations; their research laid the foundation for all modern JITs in dynamic languages, and one of those implementations grew up to become HotSpot.
Those systems are image based, at any time, you can eval code, break into the debugger, change code and redo the interrupted code, affecting everything that the JIT already improved on the whole image.
I wonder how would simply doing a `return min(heights)` compare to any of the options given?
(It sure doesn't demonstrate the improvements between interpreter versions, but that's the classic, Python way of optimizing: let builtins do all the looping)
So, using min over a list is ~5x faster, using a single variable and a constant 0 is ~5% faster than using two for boundaries, and using min inside the loop instead of the if check is another 3 times slower. So the old approach of looking for opportunities to use a builtin instead of looping still likely "wins" in the newer interpreters too, but if someone's got a 3.14 alpha up, I'd love to see the results.
For science I added versions of the benchmarks as for loops: it's faster than the while loop versions above, but min(heights) is still way faster:
def benchmark1for(heights):
    smallest = heights[0]
    for h in heights:
        if h < smallest:
            smallest = h
    return smallest

def benchmark2for(heights):
    smallest = heights[0]
    for h in heights:
        smallest = min(h, smallest)
    return smallest
Python 3.12.1 on an M2 macbook air. For extra fun I also tested doing `sorted(heights)[0]` which is slightly faster than the benchmark2 variants, but slower than the benchmark1 variants.
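For anyone wanting to reproduce it, a minimal harness along these lines works, reusing the two functions above (the list contents here are arbitrary):

import random
import timeit

heights = [random.randint(0, 100_000) for _ in range(1_000_000)]

candidates = {
    "benchmark1for": benchmark1for,
    "benchmark2for": benchmark2for,
    "min(heights)": min,
    "sorted(heights)[0]": lambda h: sorted(h)[0],
}
for name, fn in candidates.items():
    print(name, timeit.timeit(lambda: fn(heights), number=10))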
On the general topic of Python performance, just importing the needed modules can take a few seconds, as the following code measures:
import time
start = time.time()
import os
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
print("time elapsed (s):", "%0.3f"%(time.time() - start))
It can take an arbitrary amount of time. Modules are just code executed top to bottom, and might contain anything beyond mere constants and function declarations.
Not sure why this matters too too much, this is a one-time setup cost (of some significantly large libraries). Python doesn't import everything in your virtual environment at startup, which I think we can agree is the right choice.
I'd further _guess_ that only one or two of those libraries are the actual slowdowns.
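One common mitigation, for what it's worth, is to defer the heavy imports into the function that actually needs them (a sketch; the function name is made up):

def plot_results(values):
    import matplotlib.pyplot as plt   # import cost is paid on first call, not at program startup
    plt.plot(values)
    plt.show()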
I don't trust anyone who uses camelCase to write Python. Or, unnecessary while loops.
But sometimes you can improve on built-in functions. I found that a custom (but simple) int-to-string function in Go is a bit quicker than strconv.FormatInt() for decimal numbers.
So, there's that.
Python isn't really supposed to be geared towards performance. I like the language, but only see articles like this as resume fodder.
> but only see articles like this as resume fodder.
Not super important, but I think it's worth calling out. Some of us program because it's fun. We dig into problems like this because it scratches an itch. We share in the hopes of interacting with like-minded individuals. Not everything is about resumes.
Personally I tend to discount the possibility of such motivations for a text once that text is interrupted in the middle to sell me something (which a Substack form counts as).
Wow, very unfair. It should not be surprising that people writing Python might actually care about performance, even if it isn't so important that they must write their software in assembly.
The code wasn't the author's, and what's more, it's just an example to talk about the bytecode. Everyone knows `min(some_list)` is faster and more Pythonic.
This is a very interesting article and probably exposed a few people to a few interesting Python internals.
A few months ago I had to optimize some old Python code, and found that my longtime assumptions from Python 1, 2 and early 3.x are no longer true. So I put up some comparisons; check them here:
(comment out the import transit.* and the two checks after it as they are specific. Takes ~25 seconds to finish)
Results like below. Most make sense, after thinking deeper about it, but some are weird.
One thing stays axiomatic though: no way of doing something is faster than not doing it at all.
Lesson: measure before assuming anything "from-not-so-fresh-experience".
btw, unrelated, probably will be looking for work next month.
Have fun.
AFAIK this is because CPython has to walk up the scopes to find the name on every call in your loop, and that still applies to Python 3, right? You can still use the built-in min, just create a "closer" reference before your loop for the same speedup:
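i.e. something like this sketch, which is the same trick the test_2 timing further down uses:

def total():
    i = 10_000_000
    r = 0
    _min = min            # bind the builtin to a local once, before the loop
    while i > 0:
        i = i - 1
        r += _min(i, 500) # LOAD_FAST on _min instead of a global-then-builtin lookup
    return r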
Imports are super imperative operations: they first try to find the module in sys.modules, otherwise they execute the module and put the resulting module object in sys.modules. Then the import grabs the symbols you asked for and shoves them into the importing module's global dict.
It does have to look the name up in the enclosing scopes on every call (locals are resolved at compile time, so in practice that means globals and then builtins) to find the function, that's true.
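A quick way to see the sys.modules caching from the REPL (standard library only):

import sys
import math                       # first import: runs the module's init and caches it
print('math' in sys.modules)      # True
import math as m2                 # later imports are just a sys.modules lookup
print(m2 is sys.modules['math'])  # True: the same cached module object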
15+ years ago many people would tell you that you should do “from foo import bar” instead of “import foo” to save the overhead inherent in referencing “bar” as “foo.bar”. Well, as almost everything in Python is in the end a dict, the dict implementation is ridiculously optimized (and in some counter-intuitive ways), so if you care about that overhead you probably should not be using (C)Python in the first place.
This “everything is a dict” is also a part of why removing GIL is not as straightforward as it would seem.
Actually even on a modern CPU that property lookup does add another ~10% penalty. Granted it's like a .1 second difference out of a second runtime for 10m iterations, but it's measurable! Although I wouldn't be using python on those data sizes usually :)
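A sketch of the two spellings being discussed, with math/sqrt as stand-ins:

import math
from math import sqrt

def with_attr(xs):
    return [math.sqrt(x) for x in xs]   # global lookup of math + attribute lookup of sqrt, per iteration

def with_name(xs):
    return [sqrt(x) for x in xs]        # just a global name lookup of sqrt, per iteration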
It makes a slight difference on my system, python 3.12.0
In [1]: def test_1():
   ...:     i = 10_000_000
   ...:     r = 0
   ...:
   ...:     while i>0:
   ...:         i = i-1
   ...:         r += min(i, 500)
   ...:
   ...:     return r

In [2]: def test_2():
   ...:     i = 10_000_000
   ...:     r = 0
   ...:     local_min = min
   ...:
   ...:     while i>0:
   ...:         i = i-1
   ...:         r += local_min(i, 500)
   ...:
   ...:     return r
In [3]: %timeit -n 1 -r 10 test_1()
2.14 s ± 58.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
In [4]: %timeit -n 1 -r 10 test_2()
1.94 s ± 47.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
[0] https://bugs.python.org/issue45256
[1] https://github.com/lua/lua/commit/196c87c9cecfacf978f37de4ec...
[2] https://github.com/lua/lua/commit/5d8ce05b3f6fad79e37ed21c10...