Hi! I'm the author of this article. Thanks for posting it.
The GIL is an old topic, but I was surprised to learn recently that it's much more nuanced than I thought. So, here is my little research.
This article is a part of my series that dives deep into various aspects of the Python language and the CPython interpreter. The topics include: the VM; the compiler; the implementation of built-in types; the import system; async/await. Check out the series if you liked this post: https://tenthousandmeters.com/tag/python-behind-the-scenes
In the opening paragraph you state that the GIL prevents speeding up CPU-intensive code by distributing the work among multiple threads.
My understanding is that distributing work across multiple threads would not speed up CPU-intensive code anyways. In fact it would add overhead due to threading.
There are two things to consider here: wall-clock time and CPU time. Making code faster using multiple threads will increase total CPU time by some amount, but because that work is now distributed between several cores it should actually reduce wall-clock time.
There are many CPU bound tasks which can be made multithreaded and faster, but it does depend on the task and how much extra coordination you’re adding to make it multithreaded.
The author is speaking about the general concept of threading, outside of Python (without using C extensions to help out as discussed in the article). In general, if you don't have a GIL, and you have 2 or more cores then if you run additional threads you will see a speedup for CPU-intensive code. The actual speedup will vary. A sibling comment mentions embarrassingly parallel problems, those are things like ray tracing, where each computation is independent of all the others. In those cases, you get near linear speedup with each additional core and thread. If there is more coordination between the threads (mutexes and semaphores, for instance, controlling access to a shared datum or resource) then you will get a less-than-linear speedup. And if there is too much contention for a common resource, you won't get any speedup and will see some slowdown due to the overhead introduced by threads.
If it was being distributed amongst Python threads (which effectively run on one hardware thread at a time), then CPU performance can't improve since they're just taking turns using the CPU. If you're running on multiple hardware threads (what I assume the author meant), that can improve performance since it distributes work across real threads that run in parallel.
The GIL restricts use of multiple hardware threads.
You can very well parallelise CPU-intensive problems. Look at e.g. "Embarrassingly parallel" on Wikipedia.
Intuitively if you can divide your work into chunks that are large enough, the scheduling overhead becomes negligible.
What are you talking about? Worker thread pools are the most common way to take advantage of multiple cores. Typically you can see speedups (for highly parallel code) of nX for n cores.
Sure, but I think if you are discussing this type of thing in the context of python you have to use the threads/processes terminology to avoid confusion.
The reason why this is true in Python is the GIL. In other languages without a GIL, multiple threads will run on multiple cores and can speed up CPU bound code.
I don't understand why people care about the CPython GIL. For computationally intense stuff (numerics) running in the interpreter, the language is generally 60 times slower than PyPy and often 100 times slower than an equivalent C program. That means if you want performance, you would need to dodge Amdahl's law and have 60-100 processors for a hypothetical GIL-less CPython to match a single threaded program in PyPy or C.
If all it was used for was scripting it would be fine, but what is strange is that Python has emerged as a major language for complex data analysis. So I am sometimes confronted with partial solutions in Python that solve a piece of the puzzle that is purely numerical (so it all runs on numpy etc.), but then extending it to actually solve the real problem ends up being either extremely awkward (the whole architecture getting bent around Python's limitations) or a complete rewrite. It's frustrating that the ecosystem for certain problems is so bent towards a language that is unsuitable for large parts of the job (mainly, I would add, because it sucks the oxygen away from alternatives that would be better in the longer term).
If you're doing complex data analysis in Python, the actual number crunching isn't happening in Python; it's happening in a C extension. Using numpy or TensorFlow, you're really just calling C code.
I think the OP's point was that it forces you to do all the number crunching via those extensions, and if what you want to do doesn't match up with their APIs it's a bit of a pain. Granted, they have fairly complete and flexible APIs but I can still imagine it being very annoying sometimes.
> I think the OP's point was that it forces you to do all the number crunching via those extensions [...]
Almost. If you're using C extensions for performance, those extensions can release the GIL, and then you get parallelism in CPython. Although if you've got a bunch of small steps, Amdahl's law is going to bite you as each step needs to re-acquire the GIL to proceed.
Note that for Numpy, I don't know which functions/methods actually bother to release the GIL, but it's possible in theory.
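Something like this is the shape of the idea (a rough sketch; whether you actually see a speedup depends on whether the particular numpy routine releases the GIL and on the underlying BLAS):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Two largish matrices; the heavy work happens inside the BLAS call.
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

def multiply(_):
    # If the extension releases the GIL during this call, several of
    # these can genuinely run on different cores at the same time.
    return a @ b

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(multiply, range(4)))
```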
Yes, that's exactly what's frustrating about Python.
It's not the performance per se, that doesn't really matter for Python's use cases; it's that the performance characteristics of your program are so erratic that it doesn't feel like the pieces are fitting together. Denotationally equivalent operations having drastically diverging operational interpretations means that changing anything is a nightmare.
If you have an API call that does what you want, Python is basically C++. If you don't, you're paying a 100x penalty.
Most of the times performance doesn't matter, and when it doesn't, life is great. When it does matter, the situation feels worse than it actually is because optimizing Python is such a fight.
In general optimising Python is:
* understanding that pure Python is not a language to write CPU-heavy computational code in it and was never intended as such
* understanding that Python lets you write 3-4 solutions to a problem easily, and one will be slow, two will be OK and one will be optimal - often that involves using generators, collections and sets and is a bit nuanced (a small example follows this list)
* I'd say that optimising Python, compared to something like C or C++, is far easier - I can set up a decent testing and profiling environment for most of my problems easily with pip and ipython, which is borderline simple compared to valgrind
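The promised example, a deliberately toy one: the same membership-counting problem written against a list and against a set. Both are "obvious" solutions, but one is dramatically slower:

```python
import timeit

needles = range(0, 1_000_000, 997)

haystack_list = list(range(1_000_000))  # O(n) per membership test
haystack_set = set(haystack_list)       # O(1) average per membership test

def count_hits(haystack):
    return sum(1 for n in needles if n in haystack)

print("list:", timeit.timeit(lambda: count_hits(haystack_list), number=1))
print("set: ", timeit.timeit(lambda: count_hits(haystack_set), number=1))
```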
I'm not talking about technical difficulties, the Python dev experience is top notch and I have no complaints in that area. What I mean is that it's hard to form a mental model of what's going on because it's as though you were using two different languages: the "fast" API calls and the "slow" munging of the results.
The frustrating part is that the language doesn't seem to "compose" any longer, it doesn't feel uniform: you can't really do things the simple way without paying a hefty performance penalty.
With that said, the #1 reason why any code in any language is slow is "you're using the wrong algorithm", either explicitly or implicitly (wrong data structure, wrong SQL query, wrong caching policy, etc.), and in that sense Python definitely helps you because it makes it easier (or at least less tedious) to use the correct algorithm.
I think that complaint applies to the (vast, essential) ecosystem, not to the language and standard library. AFAIK, there aren't serious performance potholes to hit in the language and standard library any more. You're not paying a huge performance penalty for writing things the simple way, but things are uniformly pretty slow.
It's when you pull in fast libraries that composition breaks down e.g. the many-fold difference between numpy's sum method and summing a numpy array with a python for-loop and an accumulator.
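For instance, something along these lines (the exact ratio varies by machine, but it's typically a couple of orders of magnitude):

```python
import timeit
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

def python_sum(a):
    total = 0.0
    for x in a:        # each element gets boxed into a Python float
        total += x
    return total

print("numpy sum:  ", timeit.timeit(lambda: arr.sum(), number=10))
print("python loop:", timeit.timeit(lambda: python_sum(arr), number=10))
```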
I agree, the standard library doesn't have those problems. As you said though, the ecosystem is kind of the point of using Python in the first place.
Again, I'm not really a fan of Python as a general purpose language for large projects, but it really, really excels at being a kind of "universal glue", and it's why I use it again and again.
In that role, "uniformly slow" is better than "performance rollercoaster".
spaCy might be a good example. It's a natural language processing library that's written in Cython, so yeah, a lot of the hard computation work is essentially just calling into C code.
But it's usually being driven from a Python script that's in charge of loading the data and feeding it into spaCy's processing pipeline. That code is subject to the GIL, and, since it cannot be effectively multithreaded, it tends to become the program's bottleneck.
So what you typically end up doing instead is using multiprocessing to parallelize. But that comes with its own costs. You end up wasting a lot of computrons on inter-process communication.
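In sketch form (the load/process functions here are made-up stand-ins, not spaCy's API; the point is that everything crossing the process boundary gets pickled, which is where the IPC cost comes from):

```python
from multiprocessing import Pool

def load_texts():
    # Stand-in for the real data-loading code.
    return [f"document number {i}" for i in range(10_000)]

def process_document(text):
    # Stand-in for the real per-document pipeline work.
    return len(text.split())

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # chunksize batches documents to amortize the pickling overhead a bit.
        results = pool.map(process_document, load_texts(), chunksize=256)
```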
Modern C has the restrict keyword for that. There isn't any competitive advantage left for Fortran over C or C++, the only reason why Fortran is still part of the modern numeric stack is that BLAS, LAPACK, QUADPACK and friends run the freaking world and nobody is ever going to rewrite them to C without a compelling reason to do so.
Fortran's primary competitive advantage over C and C++ is that it's a conceptually simpler language that's easier for humans to use safely and effectively for numerical computing tasks.
It's not just being tied to BLAS and friends. Fortran lives on in academia, for example, because academics don't necessarily want to put a lot of effort into chasing wild pointers or grokking the standard template library.
For my part, I'm watching LFortran with interest because, when I run up against limitations on what I can do with numpy, I'd much rather turn to Fortran than C, C++, or Rust if at all possible, because Fortran would let me get the job done in less time, and the code would likely be quite a bit more readable.
I didn't downvote you, if that's what you were hinting at, however your phrasing was more likely to be interpreted as "C can't do that" as opposed to "C doesn't default to that".
Note that while BLAS and friends aren't getting rewritten in C, there is an effort underway to write replacements in Julia. The basic reason is that metaprogramming and better optimization frameworks are making it possible to write these at a higher level where you basically specify a cost model, and generate an optimal method based on that. The big advantage is that this works for more than just standard matrix multiplies. The same framework can give you complex matrices, Integer and boolean matrices, matrices with different algebras (eg max-plus).
The initial results are that libraries like LoopVectorization can already generate optimal micro-kernels, and are competitive with MKL (for square matrix-matrix multiplication) up to around size 512. With help on the macro-kernel side from Octavian, Julia is able to outperform MKL for sizes up to 1000 or so (and is about 20% slower for bigger sizes). https://github.com/JuliaLinearAlgebra/Octavian.jl.
Also, when Travis Oliphant was writing numpy he liked the Fortran implementations of a lot of mathematical functions, so it was an "easy" borrow from there instead of reimplementing.
You can't both claim that it is a major language for complex data analysis and that it is "unsuitable for large parts of the job". If it is a major language for complex data analysis it is apparently not unsuitable. It can be not optimal/not the best/bad, but it cannot be unsuitable, because then your first statement isn't true. Which one is it?
It's a great language for exploratory prototyping of complex data analysis and a terrible language for robust implementation of complex data analysis. Well-managed organisations prototype in Python but take the time to rewrite the prototype for production use; poorly-managed ones skip step 2.
Does this seem really inefficient to anyone else? Why is there not a language that can do the prototype and offer a high performance ceiling? Julia seems like a candidate.
There's a language design tradeoff - Python is the result of the designers almost never saying "We should do this to make Python faster, although it would make the language slightly harder to program in."
This even goes as far as choosing not to make improvements that improve performance but make the implementation more complex (for example adding a JIT), which allows Python to continue to add language features.
The language isn't the hard part. One of the more programming-oriented data scientists was actually happy to write Scala to start with, but honestly getting from a prototype to a production-quality implementation is pretty much a rewrite anyway (e.g. you have to consider all the error cases).
It doesn’t work like that in any other industry, why should it work for software? Mechanical engineers don’t put prototypes into production, why should software engineers?
>I recently tried to migrate my R code to Julia. Even though I already knew R data.table is faster than DataFrames.jl, I was totally blown away by how slow Julia is. So I quickly gave up. I think I will have to write unavoidable hard loop in cpp, which I really don't want to do...
They're saying that Python has gained a reputation for being the language for complex data analysis, but in fact it is objectively "unsuitable for large parts of the job".
> it sucks the oxygen away from alternatives that would be better in the longer term
It's always frustrated me that the computing world isn't as it should be. Python as a language seems to have found a sweet spot for accessibility to a lot of people, but CPython the interpreter steals all the air from PyPy, which is a vastly better implementation.
The PyPy FAQ addresses this. To my eyes, it looks like this is partly to provide the same promises CPython does for C extensions, and partly because their transactional memory approach hasn't gotten the attention it would need to succeed. I think both of those show CPython stealing the air from PyPy.
Though it's not only about number crunching. I tried to implement a streaming protocol for certain IP cameras: you can already see the effects of the GIL plus the general slowness when reading many small packets from a network socket in one thread and parsing/processing the packets in another. While Python has been my favorite tool for many years and I was seemingly aware of its downsides, I really didn't expect how badly it performed in this case.
> [...] when reading many small packets from a network socket [...]
Yeah, I should've said it more carefully. That's what I was hinting at with the mention of Amdahl's law. The separate bits of I/O are done in C extensions which can release the GIL, and those can be done in parallel (with each other and your computation), but if each bit of work is small, the synchronization cost (reacquiring the GIL for each packet) becomes relatively large.
I do care about the CPython GIL. I develop a program with a lot of analysis and output to custom formats. The things that can go to the underlying C extension go there, but the development cost is far higher for migrating stuff there. I'd like to be able to parallelize the different sections, but multiprocessing is a mess (only picklable data can be used, no shared state), threading doesn't give me any benefit due to the GIL, and PyPy hasn't worked the last three times I tried. Also, improvements can stack. If I get threading working, I can still move more things to the C extension to improve performance, or finally migrate to PyPy (if they manage to remove the GIL).
Could you share a little more about your experiences with mp? I'm in the other camp - I use multiprocess in several python applications and am more or less happy with it. My real curiosity I suppose is what data you are trying to pass around that isn't a good candidate for mp.
My own use cases normally resemble something like multiple worker-style processes consuming their workloads from Queue objects, and emitting their results to another Queue object. Usually my initial process is responsible for consuming and aggregating those results, but in some cases the initial process does nothing but coordinate the worker and aggregator processes.
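Roughly this shape, with made-up work in place of the real workloads:

```python
from multiprocessing import Process, Queue

def worker(jobs: Queue, results: Queue):
    for item in iter(jobs.get, None):   # None is the shutdown sentinel
        results.put(item * item)        # stand-in for the real work

if __name__ == "__main__":
    jobs, results = Queue(), Queue()
    procs = [Process(target=worker, args=(jobs, results)) for _ in range(4)]
    for p in procs:
        p.start()

    for i in range(100):
        jobs.put(i)
    for _ in procs:
        jobs.put(None)                  # one sentinel per worker

    print(sum(results.get() for _ in range(100)))  # aggregate in the parent
    for p in procs:
        p.join()
```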
My main issue with multiprocessing is that the pickle module is incredibly unhelpful when something can't be pickled. Given that there's quite a lot of context shared between types of analysis, and that context tends to change a lot, using pickle means I'm looking at quite some time debugging every now and then to see what's failing.
Regarding the data itself in this case, there are some objects that are references to in-memory structures of a C extension. Last time I checked I didn't see simple ways to share those structures via pickle. Shared memory was an option, but it required me to implement and change quite a lot of things just to get things working, not to mention that whenever the C extension gets new features I need to invest extra time in making those compatible with pickle. It wasn't worth it, especially knowing that I would still run into problems every now and then due to pickling the context.
> threading doesn't give me any benefit due to the GIL
If your heavy lifting is done in C extensions, and those extensions release the GIL before doing that work, you can still get the benefit of parallelism within CPython.
Part of the heavy lifting is in C extensions, but part is in Python code that's hard to migrate to C, and that's the part that takes more time now after migrating the low hanging fruit. I'd get a bit of benefit, but far from what I'd get if the GIL wasn't a thing.
Depending on the nature of the interpreted Python part of the code, you could get a 60-100 times speedup by using PyPy (no rewriting needed). Removing the GIL in CPython would only beat that if you could get 60-100 times parallelism.
I tried using PyPy several times, but I always run into problems due to the C extension and other libraries I use. Also, last time I checked it wasn't very up-to-date with respect to the mainstream Python version.
In the article, Victor brings up a good point that the GIL can lead to some nasty scheduling with IO. If you have some thread doing computationally expensive stuff and other threads responding to HTTP requests, the computationally intensive thread ends up hogging the CPU.
When a thread has been waiting for IO, you often want to switch to it as soon as the IO is finished in order to keep the system responsive. So the biggest issue with the GIL seems to be more about scheduling and less about actually having true concurrent threads in Python.
In ML you often need to stream data from the training set because you can't hold it all in memory. Many times I noticed that my GPU would sit idle because Python was busy loading more training examples. With proper multithreading you can have one or multiple threads loading data into a queue that feeds the GPU thread. This is something extremely simple to do in any programming language that doesn't have a GIL. And you can't just use a library for this: loading and preprocessing data is very specific to your application, and writing just this part in C completely defeats the purpose of choosing Python in the first place.
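The pattern itself is trivial to write down; the problem is that with the GIL the loader threads only help to the extent that the loading happens in I/O or in extensions that release the GIL. A sketch with made-up load/train functions:

```python
import queue
import threading

batch_queue = queue.Queue(maxsize=8)      # bounded, so the loader can't run ahead forever

def load_example(i):
    # Stand-in for reading and preprocessing one training example.
    return {"id": i}

def loader(num_examples):
    for i in range(num_examples):
        batch_queue.put(load_example(i))  # blocks when the queue is full
    batch_queue.put(None)                 # end-of-data sentinel

def train():
    while (batch := batch_queue.get()) is not None:
        pass                              # stand-in for feeding the GPU

threading.Thread(target=loader, args=(1000,), daemon=True).start()
train()
```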
For number crunching it doesn’t matter, but I imagine it’s a much bigger issue if you want to run some kind of application server/webapp on a big beefy machine with lots of cores. That’s a thing you would want to run in something like Python, and the lack of parallelism is a real issue.
If your application is I/O bound (web app), then you can get parallelism, even with the GIL. Each I/O call can (should) release the GIL before it waits on its network packets.
128 thread processors are already available at the prosumer level (hyperthreading is presumably a good use for python since it does so much waiting with pointer chasing, but even if not there will be 128 core consumer chips before long).
The Tcl thread library allows you to create multiple threads in a single interpreter, and use mutexes, locks, etc. to manage them. In this case of course you only need to initialize your interpreter once.
The library is most used in an "easy" mode however, where each thread starts up with its own interpreter and thus the usual mutex/lock dance is unnecessary. Various forms of message passing can be done among these interpreters, including tools provided by the thread library itself, Tcl's built-in synthetic channel feature which makes communication look like standard file I/O, or, since each interpreter has its own event loop, Tcl's socket I/O features can be used to set up server/client comms.
Functions (procs) and other state are not shared between interpreters. So they do need to be reinitialised in each thread. This is not hard to do though. The messages which can be passed between threads can be arbitrary scripts to be executed in the target interpreter, such as proc definitions.
For anyone looking for something on the same topic but a little bit lighter / more practical, I really enjoyed this talk[1] by Raymond Hettinger at PyBay 2017, where he discussed the GIL as well.
I wouldn't say that ruby has "got past its GIL problems with Ractor" -- you can run very few actually pre-existing libraries with ractors, because it requires no references to global state, and most non-trivial existing code ends up referencing global state (for configuration, caching, etc).
It requires a different approach to writing code than has historically been done in ruby, to scrupulously identify all global state and make it ractor-safe. The ruby ractor implementation doesn't allow existing code to "just work".
Some of these patterns and mitigations and best approaches for cost/benefit are still being worked out.
At this point ractor is early in the experimental stage. It's not like existing applications can currently just "switch to ractor", and problem solved. There is little if any real-world production code running on ractors at present.
But yes, this article's description of the work on "subinterprters" does sound very much like ruby ractors:
> The idea is to have multiple interpreters within the same process. Threads within one interpreter still share the GIL, but multiple interpreters can run parallel. No GIL is needed to synchronize interpreters because they have no common global state and do not share Python objects. All global state is made per-interpreter, and interpreters communicate via message passing only
Yep, that's how ruby ractors work... except that I'm not sure I'd say "all global state is made per-interpreter"; ractors still have global state... they just error if you try to access it in (traditional) unsafe ways! I am curious whether python's implementation has different semantics.
In general, I'd be really excited to see someone write up a comparison of python and ruby approaches here. They are very similar languages in the end, and have a lot to learn from each other, it would be interesting to see how differences in implementation/semantic choices here lead to different practical results. But it seems like few people are "expert" in both languages, or otherwise have the interest, to notice and write the comparison!
Ok, not "solved" in a "everything just works and nobody notices" way, but they have a way forward. The "shared-nothing Actors" model is well demonstrated to work well in Erlang, Elixir etc., so it seems like a reasonable way forward for a language like Ruby or Python.
It's way easier when you bake it into the design of the language from the start, and make it part of the conventions of the ecosystem.
But yeah, "seems like a reasonable way forward" is a far cry from "has got past it"!
I agree it seems like a reasonable way forward, but whether it will actually lead to widespread improvements in actually-existing ruby-as-practiced, and when/how much effort it will take to get there, is I think still uncertain. It may seem a reasonable way forward, but its ultimate practical success is far from certain.
(If this were 20 years ago, and we were designing the ruby language before it got widespread use -- it would be a lot easier, and even more reasonable! matz has said threads are one of the things he regrets most in ruby)
Same for python. It does seem Python is exploring a very similar path, as mentioned in the OP, so apparently some pythonists agree it seems like a reasonable way forward! I will be very interested to compare/contrast the ruby and python approaches, see what benefits/challenges small differences in implementation/semantics might create, or differences in the existing languages/community practices, especially with regard to making actual adoption in the already-existing ruby/python ecosystems/communities more likely/feasible.
As far as I understand, Ruby's Ractors are what Python is now trying to do with subinterpreters (https://lwn.net/Articles/820424/). I mention them in the post.
I just skimmed through your link. Subinterpreters have been in the C API for a long time and never got around the GIL, and reading over the post that you linked — nothing has really changed.
> Ruby seems to have got past its GIL problems with Ractor
Ractor doesn’t eliminate the GVL, it just introduces a scope wider than that covered by the GVL and smaller than a process (but bigger than a thread.)
It enables a model between simple multithreading and multiple OS-level processes. It is heavier parallelism than a thread-safe core language would allow, but should have less compatibility impact on legacy code and performance impact on single-threaded code (plus, its a lot easier to get to starting with an implementation with a GIL/GVL.)
Honestly, not sure why people care so much about the GIL in languages like Ruby, Python or JS.
Let's face it, nowadays when you deploy software it's over multiple machines, nodes, whatever you want to call it. So you're almost always talking about spawning processes. Which any language can do, GIL or not. Removing the GIL to enable parallel processing on a single machine adds more complications than it's worth, especially when any Python workload is either going to be forking C processes or forking Python processes.
Not every language needs the same kind of control over OS threads as C, C++ or Rust.
People don't care as much about removing the GIL in JS because JS never exposed threads. When the time came for spreading compute load to multiple cores JS decided to simply create multiple contexts first with just message passing and then later with shared memory. As a result you can still code JS in the typical thread unaware way but utilize multiple cores without having to go multi-process or multi-machine. For in thread stuff the async model (or historically callbacks) is good enough if you're not compute bound.
In Python threads are exposed but only one runs at a time and you have to go full multiprocessing to use the other cores. This is both confusing and annoying respectively hence the increased noise about the GIL.
We already have 64 core/128 thread prosumer machines and it is only going to be more in the future. Python is so slow that message passing between processes can often be slower than just handling it in the same process.
What if you have a big array of python objects that you want to sort? In other languages it is trivial to get a huge speedup with more cores.
Well, many developers end up using different languages for this reason. E.g. Go and node.js are popular and both do asynchronous stuff effortlessly (even though they are typically single threaded). And if you use Jython (Python on the JVM), you can actually do threads properly as the GIL is not a thing there: https://stackoverflow.com/questions/1120354/does-jython-have... Likewise, IronPython also has no GIL apparently. Both are running quite a bit behind CPython at this point unfortunately.
For properly multithreaded stuff, people use higher-level languages like Erlang, Scala, Java, Kotlin, Rust, C++, etc.
In general batch processing data is both IO and CPU intensive (e.g. parsing, regular expressions) and a common use-case for Python. You could say, data science is actually the primary use case driving its popularity.
Python is effectively single threaded for that stuff. Using multiple processes fixes the problem well enough though. Conveniently the module for using processes looks very similar to the one for using threads, so switching is easy once you figure out why shit doesn't scale at all (been there done that).
But using processes also adds some complexity. The good thing is that it makes the thing the GIL is needed for (sharing data between threads/processes) a lot harder or outright impossible. The bad news is that it's hard. Typically people use queues and databases for this.
ETL systems like Airflow tend to use multiple processes and queues instead of threads, as using Python threads is kind of pointlessly futile (from a making-things-faster point of view). However, with Airflow a pattern that you see a lot is that Python process workers are typically not used; instead it is configured to use one of several cloud-based schedulers (e.g. Kubernetes, ECS, etc.) to schedule work. Basically, Airflow is just used to orchestrate work where the actual work happens elsewhere and may or may not involve Python code. A lot of the performance-critical stuff is native anyway.
> ... Go and node.js are popular and both do asynchronous stuff effortlessly (even though they are single threaded typically)
I know that I'm being very pedantic here, but Go is not a "typically single threaded" language. Go has as fine control over its threads (goroutines) as C does over its threads. Go is very capable of true parallelism.
> Literally, there is an initiative to introduce multiple GILs to CPython. It's called subinterpreters. The idea is to have multiple interpreters within the same process. Threads within one interpreter still share the GIL, but multiple interpreters can run parallel. No GIL is needed to synchronize interpreters because they have no common global state and do not share Python objects.
This sounds very much like Ruby's recently introduced "ractor", no?
I've been bitten by the GIL multiple times since switching to a company writing mostly python. Not that I write that much multithreaded code, but the GIL influences everything so lots of patterns in other languages no longer work.
For instance, a local cache of some values is basically useless when Python is used as a backend. Since there will be multiple processes handling the requests (in the same node), they don't share memory. So basically everything has to be stashed in Redis. Of course, with scale and multiple nodes lots of stuff should be cached in Redis anyway. But not everything makes sense to reach out to an external service for. Like fetching a token I need to use multiple times, instead of having it stored locally in a variable. There are of course ways around this, using some locally shared file or mmap or whatever, but it's a hassle compared to just setting a variable.
Same with background tasks or spinning up extra threads to handle the workload of an incoming request. Can't easily be done, so basically every python project includes celery or something similar. Which works fine, but again, a different and forced way to solve something.
Asyncio solves many of the daily problems I would have used threads for, though. As I often used it other places to fire off some http requests to other services in parallel. But unfortunately async doesn't work properly with everything yet.
So everything has a solution, but you are forced to do it "the python way" and with some hurdles.
It's really a shame you are getting downvoted. You are basically just saying, "Sometimes shared memory model makes a more efficient and simpler solution over enforced actor model. Like in this case." This shouldn't be controversial; engineering choices come with tradeoffs, and the GIL certainly pushes you towards making a specific choice.
Yes, if you are going to be doing caching in python, you are almost certainly going to be caching it in a separate service and serializing across process boundaries. If you are going to be holding onto a connection pool as well (PG bouncer being the most typical). More generally, any sort of shared resource that would require or benefit strongly from shared memory model needs to be pushed into a sidecar service written in a language that actually supports efficient shared memory model. These are precisely the sorts of problems that the GIL causes. There are standard ways to work around it and most python developers can reach for these solutions, but this is not a contradiction to the fact that the GIL has limited your choice in the matter.
> You are basically just saying, "Sometimes shared memory model makes a more efficient and simpler solution over enforced actor model. Like in this case."
Is this actually showing that it's less efficient though? People rarely complain about Erlang doing the same thing. "You have to spin up a proper task queue rather than firing off an unmanaged background thread" does not sound like an unambiguous negative to me.
Well, there are shared ETS in Erlang which we actually use for caching lots of things (e.g., the formatted timestamp with 1 second lifetime -- it's faster to look it up than recalculate in every process, -- or, yes, auth tokens).
> It’s really a shame you are getting down voted. You are basically just saying, “Sometimes shared memory model makes a more efficient and simpler solution over enforced actor model. Like in this case.”
If he was just saying that, he’d probably be downvoted for being true but irrelevant and not meaningfully contributing.
But he’s actually saying that plus “therefore, Python is bad (implying that Python can’t use shared memory between processes).” Which would be a productive, relevant contribution except for the fact that it’s not true.
> More generally, any sort of shared resource that would require or benefit strongly from shared memory model needs to be pushed into a sidecar service written in a language that actually supports efficient shared memory model.
That’s I guess technically true, if you included the stdlib modules which directly support this and are themselves written in (I assume) C. But, “you might have to use the standard library” is…not much of an impediment.
I didn't say python was bad. And I didn't say that it's impossible to share data between processes (I even mentioned some options).
My point was just that the GIL, while not a dealbreaker or anything, actually _does_ affect how a program is written. So it's implicitly noticeable in my day-to-day.
I've been developing python for all manner of use-cases the last 15 years and have not once had the GIL impact me in any meaningful way. I have however been impacted way more by the amount of misconceptions about python that I've had to explain over the years.
That's the point, though: One doesn't see the impact directly. But after 15 years you just do it the python way by default, not considering other ways that stuff could be solved if it weren't for the limitations imposed by the platform.
The python ways work fine, so it's OK even with GIL, but there are other ways as well.
> For instance, a local cache of some values is basically useless when python is used as backend. Since there will be multiple processes handling the requests (in the same node), they don't share memory.
That's like claiming the ability to share JS ArrayBuffer's between workers supplants the need for the ability to share high level objects like a Map.
What I'm getting at is that JS had the exact same problem, and it offers the same ability for sharing blocks of raw memory. But the same pitfalls for any actual structured object.
Instead, we have to serialize/deserialize (creating a copy) from a separate worker, process, or disk source, and its performance is shite for objects of large size.
Message passing is not ideal for large pieces of data. Shared memory is hundreds of times faster.
I see what you mean. Does diskcache still have to do deserialization of the mapped data? Nested objects (large disparate heap pointer graphs) would suffer performance penalties.
I have to ask, what is then the point of using Python as a backend (not questioning for data crunching)?
Funny thing here is that PHP works exactly the same with its share-nothing architecture. You always need to go outside of PHP to handle state between requests.
When I started doing PHP years ago this was one of the first things I stumbled upon, being used to easily sharing state in a backend. Juniors or non-programmers usually don't think about this at all and thus are not limited by it, whereas programmers with experience from another language, but without any deep PHP knowledge, will always trip up on this, finding it awkward and annoying (like I did).
And yes, the share nothing architecture will change how you write programs. Now many years later I'm used to this and don't see it as problem anymore (almost the opposite).
But if you also need to follow the share nothing architecture on Python, what benefits does it give vs PHP? Is it the community only? Because Python is generally slower than PHP.
I don't consider myself a Python developer, but based on my knowledge I have a hard time finding compelling arguments for picking Python as a (web) backend. I'd rather pick PHP (I'm biased), Golang or Java. Node.js vs Python is a tough one; perhaps Node.js because I can write in TypeScript, but Node.js is not a technology I like.
> Edit: One thing that would be nice in PHP is decorators.
15 years I used to say "I wish PHP had feature X just like in language Y". But there's a moment where it's better not to wait and just to use language Y.
asyncio doesn't solve any "problems" of threads in Python. it just provides programming patterns that a lot of people like, which are well suited towards having hundreds or thousands of arbitrarily slow IO bound tasks at once. A threadpool in Python can do the same thing with mostly similar performance. asyncio certainly isn't affecting the GIL issue, asyncio is "GIL-ish" by definition since all CPU work occurs in a single thread.
meaning, if the GIL were removed from Python, the performance of threaded concurrency would soar, asyncio would not move an inch.
It "solves" the problem of me having to do five calls in parallel. They can with asyncio in theory be done at the same time, since python can move on and fire off the next calls while waiting for IO.
Until people discovered the security and program stability issues that come with using them.
It was a cool idea that turned out not to be so great after all.
There is a reason why all major language runtimes are moving into thread agnostic runtimes, while security critical software and plugins are back into processes with OS IPC model.
They do, and in all of them is an anti-pattern to use raw threads instead of the higher level frameworks like goroutines, java.util.concurrent or TPL/DataFlow.
Go in their typical wisdom, doesn't even support thread local storage.
Additionally, Java and .NET have thrown away their security managers, and guess what is the best practice to regain their features in Java and .NET applications?
Sorry, we are discussing shared memory multiprocessing. I'm certainly not advocating unrestricted use of pthread_create, but even more structured way of parallelization still rely on shared memory.
But of course, for security address space separation is pretty much a requirement.
> For instance, a local cache of some values is basically useless when python is used as backend. Since there will be multiple processes handling the requests (in the same node), they don’t share memory.
Obviously, you can choose to do it that way, but there is no reason that multiple processes can’t share memory in Python; there is, in fact, explicit support for this in the standard library.
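For the record, a minimal sketch with the stdlib's multiprocessing.shared_memory (Python 3.8+). It shares raw bytes (or numpy arrays backed by them) without pickling, though as discussed elsewhere in the thread it doesn't give you shared high-level Python objects:

```python
from multiprocessing import Process, shared_memory

def worker(name):
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[0] = 42        # visible to the parent, no pickling involved
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=1024)
    p = Process(target=worker, args=(shm.name,))
    p.start()
    p.join()
    print(shm.buf[0])      # -> 42
    shm.close()
    shm.unlink()
```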
This is perhaps the most difficult aspect about discussions involving the GIL, the level of nitpicking while ignoring the actual point when someone feels that python is being slighted.
> you can choose to do it that way, but there is no reason that multiple processes can’t share memory in Python; there is, in fact, explicit support for this in the standard library.
Yes, and you can also just use threads and directly use shared memory, so you are not literally "forced"; you are forced because the choice comes with a cost you don't want to pay. Depending on the OS for managing shared memory and using an API isn't much different from using a sidecar process and has its own cost over a native shared memory model. In the end, most engineers will choose and have chosen Redis or memcached over the alternatives. The GIL is still a huge factor in these decisions.
If you stare too long into the sun you get blind spots. The GIL is worked around so much people don't even notice it anymore. True, the Python ecosystem delivers a lot of value for a vast amount of programmers. With the lack of true multi threading due to the GIL, however there is a glass ceiling in a wide range of applications and quite a few people got bitten by it. Still, the ecosystem outweighs the performance limits in most cases that I know of.
Your phrasing makes it sound like the GIL is a net-negative, and thats just false.
There is a tiny subset of applications that would benefit from its removal, everyone else would have their performance reduced by ~40% (last attempt to remove it ended with a 40% performance penalty).
the project is generally open to PR to remove it as long as the performance doesn't suffer.
and i'll be honest: if you're writing code thats so performance critical that you're worrying about the GIL... why the hell do you use python in the first place? use at least a compiled language.
There's a big difference between "the GIL is net neutral or positive in terms of performance" and "removing the GIL from Python after 30 years of having it baked in is pretty hard to do cleanly." What you're observing here appears to be the latter.
If lacking a GIL actually incurred a 40% performance penalty in general and not as a consequence of design decisions in Python's internals, we'd expect to see the fastest languages all sporting GILs, and that is really not the case.
It’s unclear what you mean by design decisions of python’s internals.
Certainly python could have instead been a compiled language but then it’d be fundamentally different language. Are you implying that all of Pythons features could be preserved without a GIL and no performance loss if say a full rewrite was possible?
I don't think it's possible without breaking changes.
But if we're willing to accept some changes to the C API, it's at least possible to make the GIL less global.
It really should be obvious that it's possible to have multiple Python interpreters in the same process, where multiple independent threads run Python code in parallel, with occasional calls into a shared data structure implemented as a multi-threaded C extension module.
Currently the GIL makes this impossible: you either can have the multi-threaded C extension module in a single process (but then the Python threads will be limited by Amdahl's law unless you spend >99.9% of CPU time inside the extension module), or you can use multiprocessing, which lets the Python code run in parallel, but then the C extension module will have to figure out how to use shared memory.
This is a massive, massive problem for us. We've already invested almost a year of developer time into changing C++ libraries to tell all those pesky std::strings to allocate their data in a shared memory segment. Even then, we've only managed this for some of our shared data structures so far, so this only allows us to parallelize around half of our Python stuff. So in effect, the GIL is causing massive extra complexity for us, and still prevents us from using more than ~2 CPU cores.
We would have been massively better off if 15 years ago we picked Javascript instead of Python. JS is also "single-threaded" like Python, but the JS interpreter lock is much less of a problem because it's not global.
Now Python currently has a "sub-interpreter" effort underway that refactors the interpreter to avoid all that global state and make it per-interpreter. But I somehow doubt that this approach to removing the GIL will be accepted, because again, it will slow down single-threaded execution: e.g. obtaining the `PyObject*` of `None` is currently simply taking the address of the `_Py_None` global variable. Python will need to split this into one `None` object per interpreter, because (like any Python object) None has a reference count that is mutated all the time. But existing extension modules access the `Py_None` macro -- they don't provide any context about which interpreter's `None` they want! This can be solved with some TLS lookups ("what interpreter is the current thread working with"), but I suspect that will be enough of a slowdown that the whole approach will be rejected (since "no performance loss for single-threaded code" seems to be a requirement).
I think Python will get rid of the GIL eventually, but it will take a Python 4.0 with a bunch of breaking changes for C extensions to do so without significant performance loss.
Right, basically the GIL is a ceiling on what-you-can-do with python.
Python has now passed Java in TIOBE, which means it has a HUGE number of people using it. That represents a tremendous amount of collective investment and work in that ecosystem.
C/C++ obviously provides nigh-unbounded access to the hardware for doing "serious stuff" like databases, high performance IO/big data/etc.
Java has pretty-darn-good abilities to do "serious stuff" with its threading models and optimizing JVM.
But Python is basically capped by the GIL. So while the amount of code written in Python (and Javascript I'd argue) is massive... does any of it substantively push forward the state of the art? Will layers of the OS improve? Will new databases or cache layers or large scale data processing be improved/advanced?
We're basically at the end of Ghz scaling, it's all increasing wafer sizes and transistor counts ... lots of processors. LOTS of processors.
If the #2 software language in TIOBE can't fundamentally properly use the hardware that the next decade will be running on outside of a core or two... that's a problem.
All our devices have been sidestepping this at the consumer level with two or four cores for the last decade, while server counts have been expanding.
Four cores can be ignored at the application level because other cores can do OS maintenance tasks, UI rendering tasks, etc.
But if your 2025 consumer device has 12-32 cores, and your software can only run on one of them... and it's the #2 programming language, and the reason is the GIL and it's baked so deeply and perniciously into your entire software ecosystem...
Then to your point, why the hell will people use python? And then it will tumble to irrelevance.
If your claim is true - that removing the GIL would harm the vast majority of programs - that speaks to some deep-seated quality and architecture issues in CPython. I happen to agree. I love Python, but the limits of its implementation sent me packing to Go years ago and I haven't looked back.
The problem is not python programs, it's existing C language extensions. Removing the GIL while retaining compatibility with the extensions and single threaded performance is difficult.
It speaks to some deep-seated tradeoffs in CPython. I don't know that reference-counting or using the C stack were the right choices, but Python's existence as approximately the only sane language where you can count on every library being available suggests that they're doing something right.
> and i'll be honest: if you're writing code thats so performance critical that you're worrying about the GIL... why the hell do you use python in the first place? use at least a compiled language.
Fair enough, but sometimes your freedom of choice may be limited. We have a large push for Python at our org, and for most people that's probably fine (or even really good), but I'm holding out as long as I can with Matlab and Julia.
On a tangent to this, it's incredible to read so many people being vocal about how Python/any other functioning programming system is "broken", "useless", etc. due to one or another characteristic that apparently means a lot to the person saying it, but means nothing for me and/or for the hundreds of thousands using that language. It's weird that discussion about a programming language needs to resort to hyperboles like that.
Python is great. It works like it always has though. Also, dask can do some wonders for parallel processing even locally.
But I want even more parallelism, even computing, say, 20 plots with seaborn... I'd like it to be parallel and use all my cores, not hang around at 150% CPU usage.
> everyone is doing microservices and serveless, it is really not an issue
If a serverless function needs 1 second of the actual CPU work (i.e. real computation, not CRUD or feeding the GPU), wouldn't it be nice to cut user-visible latency to 250-300 milliseconds?
"Serverless" is about processes, while this article is about threads. Different levels.
Readers, do not listen to this false prophet! Heed not the lies of his silver tongue!
The GIL has brought me great strife and suffering. If you do anything remotely performance-sensitive then Python is pain. Also, Python performance is so cataclysmically bad that when you use Python it has a nasty habit of becoming needlessly performance critical.
if you're doing performance-critical work in Python, you're using the wrong tool and it's been always known and reiterated by Python developers themselves. Python is a glue language at heart. Complaining that it isn't performant is like complaining you can't drill a hole with a knife. Get a different tool and glue your stuff together.
The issue is that a lot of things written in python very quickly becomes performance critical as it is so slow compared to anything around it.
edit: I have ranted more than usual about python slowness in the last few weeks. The fact is that python is an otherwise great language, powerful, expressive but with a gentle learning curve and a great ecosystem, but the slowness of its primary implementation really drags it down (and because of the ecosystem it is hard to switch to a different implementation).
> Python is a glue language at heart. Complaining that it isn't performant is like complaining you can't drill a hole with a knife. Get a different tool and glue your stuff together.
I wish I could get a different tool. Unfortunately machine learning and data science live and breathe Python. It's an absolute disaster and there is no meaningful alternative.
Jupyter Notebooks can also die in a fire. Go ahead and call Python glue and say to use a different tool if you want.
Unfortunately Python is a black hole of inertia. The one and only good thing about Python is that some really great libraries are written in Python. This has resulted in an environment where Python sucks but the cost to move to a language that doesn't suck is astronomical.
Maybe I am complaining I can’t drill a hole with a knife. Call me in 20 years when drills have been invented and knife lobby has been abolished.
I don't think I can agree that the GIL is mostly unnoticeable. I don't do all that much in Python and mostly use it as a scripting language for little personal stuff. I haven't had all that much need to dive into multithreading, but the few times I have, the GIL has always become vital to keep in mind.
If you are CPU-bound in pure Python code you’ve already lost.
However, most of the time you’re doing CPU-heavy tasks in Python you’re likely using libraries that are ultimately implemented in another language for performance. The GIL can be (and is) released in those cases. Not much of a problem there.
It’s also released when you’re waiting for I/O. Not much of a problem there either.
Fear of the GIL have never stopped me from using threading in Python, and it has never been much of a problem. In the rare occasions I’ve noticed, I was doing something wrong, not using the appropriate library, or both. Or it might be the case that Python was just a poor choice for the problem, no matter how flexible it is.
I also wonder what would happen if CPython got rid of the GIL. Python has certain characteristics that make it hard to have a parallel interpreter, and people have tried (and failed) to remove the GIL before.
This is for real, but it doesn't mean you can't use parallelism. Only that you can't use threads for it most of the time.
Parallelism strategies in Python usually revolve around using the multiprocessing module, a C extension (numpy releases the GIL, and hence so does much of the science stack), or a 3rd party tool (dask is pretty popular and allows easy parallelism: https://dask.org/).
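For the dask route, a minimal sketch (assuming dask is installed; the work function is just a placeholder): build a lazy task graph, then evaluate it on a process pool so the GIL doesn't matter:

```python
import dask

@dask.delayed
def crunch(n):
    return sum(i * i for i in range(n))   # stand-in for real CPU-bound work

tasks = [crunch(n) for n in range(1_000, 1_010)]
results = dask.compute(*tasks, scheduler="processes")
print(results)
```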
Even python's parallelism is subpar because objects passed across processes have to be pickled. This both takes time and limits the kinds of objects that can be sent across processes (no lambdas, for instance).
Aren't PHP and JS also single-threaded in the exact same way? Regardless, if you look at the history of Python it was supposed to be a teaching language and to be used for very simple, user-written scripts. Using it for actual production code makes about as much sense as writing production stuff in bash, you can do it but you're just making things harder on yourself for no good reason. Even Ruby is a bit better, but these days there's no reason not to use Go or Rust.
PHP can be built as either thread safe (TS (the term ZTS, Zend Thread Safe, is also used)) or non thread safe (NTS), depending how you deploy PHP (how the webserver interacts with PHP like FastCGI or SAPI, if it spawns processes or threads) you run one or the other, most common is that you use NTS builds on Linux and TS builds on Windows.
PHP does not have a GIL, thus PHP does not suffer from any slowdown when using threads; proper use of threading improves PHP performance without the need to escape to a C library. Internally PHP uses something called the Thread-Safe Resource Manager (TSRM) when built as thread safe. The TSRM architecture was improved with PHP 7, and to my understanding performance was as well.
However, PHP does not ship with any API to handle threads. This is because normally you, the web programmer, don't do any process/thread handling in a web setup; concurrency is handled by the webserver for each request. There do exist third-party extensions that enable a threading API, for use e.g. in a cron job.
> Using it for actual production code makes about as much sense as writing production stuff in bash, you can do it but you're just making things harder on yourself for no good reason. Even Ruby is a bit better, but these days there's no reason not to use Go or Rust
Good luck doing any deep learning in Go or Rust :)
TFA makes it sound like no two Python programs can execute simultaneously on two different cores. I would be very, very surprised if that would hold true for PHP or JS.
I have the feeling what the TFA really means is something different. But I cannot really figure out what.
You can only have a single core running Python code per process.
You can also have a single Python process running multiple threads in parallel, as long as all but one of those threads are currently running C code (numpy, system calls, ...)
But if your program spends more than a tiny fraction of time in interpreted Python code, the GIL will slow you down if you try to use all cores.
You can have multiple processes running Python at the same time -- and indeed, many Python programs work around the GIL by forking sub-processes. This tends to introduce significant extra complexity into the program (and extra memory usage, as the easiest solution is typically to copy data to all processes).
If your inputs are immutable and you don't want to copy them, that works too if you're careful. Copy-on-write means the data won't be duplicated by default. Of course, reference-counting changes mean you will be writing to the inputs' pages by default, but if you're just forking for a computation you can temporarily turn off garbage collection while you diverge. Disabling garbage collection for a short-lived forked process can even make sense on its own terms.
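A hedged sketch of that trick, assuming a platform with fork (so not Windows) and Python 3.7+ for gc.freeze(); refcount updates will still dirty some pages, but the bulk of the data stays shared via copy-on-write:

```python
import gc
import multiprocessing as mp

big_data = list(range(10_000_000))   # built once in the parent

def partial_sum(chunk_id):
    step = len(big_data) // 4
    return sum(big_data[chunk_id * step:(chunk_id + 1) * step])

if __name__ == "__main__":
    gc.freeze()                      # keep the GC from touching (and copying) old pages
    ctx = mp.get_context("fork")     # children inherit big_data without pickling it
    with ctx.Pool(4) as pool:
        print(sum(pool.map(partial_sum, range(4))))
    gc.unfreeze()
```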
> TFA makes it sound like no two Python programs can execute simultaneously on two different cores
It does not. It states that two Python threads can't execute bytecode simultaneously; the implication is that that's within a single process (which is the case), otherwise talking about threads makes no sense.
> I have the feeling what the TFA really means is something different. But I cannot really figure out what.
TFA means exactly what it says. Two threads can’t execute python bytecode in parallel, because there’s a big lock around the main interpreter loop (that is in essence what the GIL is). That is quite obviously within the confines of a single Python interpreter = runtime = process.
What the article means is that a single Python OS process can have multiple OS threads but only one of them can execute Python bytecode at any given time. So if you have a multithreaded Python application (on CPython), only one of those threads will run your code at a time while the others are waiting.
Naturally you can have multiple OS processes of Python that run in parallel. This is how Python web services usually work. Then, of course, you need to share memory explicitly via some mechanism like Redis or SysV shared memory, so it leads to different design considerations.
> The GIL allows only one OS thread to execute Python bytecode at any given time [within each OS process]
Different Python processes can certainly use different CPU cores - the sentence is definitely misleading.
(As others have mentioned, it also only applies to interpreting Python byte code itself, not OS calls or C extensions modules called from that bytecode.)
The problem is that performance is variable. For example, it's not just wrong but meaningless to say "6x slower” because there are cases where Python is either faster or slower than PHP or JavaScript and you need to know what specific things you're comparing it to provide a number.
For example, I've seen people write entire programs in C and, after taking several weeks, realize their code was nowhere near as fast as the Python version because the code was bottlenecked on dictionary operations and Python's stdlib version was far more tuned than the C implementation they'd written (repeat for image handling where they wrote their own decoder, etc.). This is not to say that a C programmer shouldn't be able to beat Python — only that performance is not a boolean trait and it's important to know where the actual work is happening.
The important thing to know is that the GIL can be and is commonly released by C extensions. If your programs tend to be I/O bound (e.g. many web applications) or are using the many extensions which release the GIL as soon as they start working, your performance characteristics can be completely different from someone else's program which is CPU-bound and has different thread interactions.
This is why you can find some people who think the GIL is a huge problem and others who've only rarely encountered it. These days, I would typically move really CPU-bound code into Rust but I would also note that in the web-ish space I've worked in the accuracy rate of someone blaming the GIL for a performance issue has been really low[1] compared to the number of times it was something specific to their algorithm.
1. I'm reminded of the time someone went on a rant about how Python was too slow to use for a multithreaded program working with gzipped JSON files — 3 months and an order of magnitude more lines of Go later, they had less than a 10% win because Python's zlib extension drops the GIL when the C code starts running and was about the same speed as Go's gzip implementation, both running much faster than storage. The actual problem was that the Python code was originally opening and closing file handles on a network filesystem for every operation instead of caching them.
No, because in a webserver environment where you have 4 cores, you should be running 4 instances of your program, load-balanced behind nginx or caddy or w/e.
Also what's your source for the claim that Python is 6x slower than JS or PHP?
Ruby has the same issue. But a "web server environment" scales pretty well "horizontally", meaning in ruby the solution/workaround is generally to run 4 separate web worker processes if you have 4 cores, so they won't be limited by the shared GIL. I am a rubyist and don't know much about python, but I'd assume python web server environments do the same?
Is this as good as a high-performance concurrency platform (like, say, the JVM has become)? Will it lead to absolutely optimal utilization of all cores? No. But the result is not simply "24 times slower", and all cores will be in use.
Performance numbers like "6x slower" are almost always contextual when it comes to the real world; it's not like every single program written in Python takes 6x longer/6x as much CPU as the nearest-equivalent program in PHP. That's not how it works: it will depend on what the program is doing and how it's deployed.
The devil is in the details, especially the "execute Python bytecode" part, and how often that happens. Particularly, it does not happen when you are waiting for some I/O, or if you are inside some C library like numpy. But yeah, Python is not fast; https://www.techempower.com/benchmarks/ has quite a lot of benchmarks for those interested.
Is there any reference to delve into the details of the GIL, multithreading, multiprocessing, and high-performance computing on top of the Python ecosystem? There are some interesting projects such as Numba, Dask, etc., but the resources are a bit scattered.
Rather, don't use threading for the sake of parallelism in Python.
Threads in Python are mostly useful to avoid blocking:
- you need to make I/O calls and don't want to bring in asyncio (see the sketch after this list)
- you need to process things in the background without blocking the main thread (e.g. zipping things while you use a GUI)
- you want to run a long process on a C-extension-based data structure (they release the GIL, so you get the perf boost), such as a numpy array or a pandas DF
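The sketch mentioned above, for the I/O case, using only the stdlib:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = [
    "https://example.com/",
    "https://www.python.org/",
    "https://httpbin.org/get",
]

def fetch(url):
    # The GIL is released while each thread waits on the network,
    # so the downloads overlap.
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=3) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size, "bytes")
```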
Wasn't there a Python project called interpreters that had the aim of allowing different Python interpreters, thus mitigating the issues the GIL causes?
I welcome your feedback and questions, thanks!