Hm, this solution seems very cumbersome and inelegant, not like Python's "batteries included" approach at all. It means that Python will have native threads that behave as expected minus true parallel execution, so you shouldn't use those, even though the interface is fairly simple. Instead, you should learn to use this weird contraption that is neither multiprocessing nor intuitive multithreading, and which comes with a cumbersome interface.
I get that the GIL is a very hard problem to solve, but this solution is so inelegant in my eyes that Python would be better off without it. I'd feel better if this were a hidden implementation detail that could be improved transparently. Just my two cents.
>This means that python will have native threads that behave as expected minus true parallel execution, so you shouldn't use those, even though the interface is fairly simple.
Python already has exactly that, and has had that for ages.
>Instead, you should learn to use this weird contraption that is neither multiprocessing nor intuitive multithreading and comes with a cumbersome interface.
It also comes with performance improvements over multiprocessing, so there's that.
Besides, the "cumbersome interface" is irrelevant, as it would be easy to wrap and forget about, the same way nobody really uses urllib directly.
>And the point is that this should be fixed, instead of adding yet another clunky way of doing the same thing.
Well, unless we have a (1) capable (2) volunteer the point is moot.
Those who actually work on Python development assessed both scenarios and found that adding "yet another clunky way", if not optimal, was a more feasible, easier and better use of their limited time.
>Then why bother having it, if the only reason for it to exist is so that it can be wrapped?
Because lower-level primitives can be used in more useful ways than some constraining higher-level construct, though the latter is still nice to have.
You could implement requests without urllib. You cannot implement C without assembly. That's the distinction between useful layers of abstraction and pointless ones. If a better interface to subinterpreters is bound to be developed, then that should've been the stdlib interface in the first place.
Coming from the JS world, asyncio feels fine to me. I've noticed a lot of Python programmers who have a strong background in C and Python struggle with asyncio. But in their case, I think it's more them disliking the async programming paradigm rather than the specific asyncio API.
Personally, I have JS experience (browser/node.js), as well as asio in C++/Boost, EventMachine in Ruby and Twisted/Tornado in Python. I strongly disagree that people who dislike asyncio only do so because they don't like the event loop. That just sounds like a typical quick dismissal of criticism. No, it's an objectively awful, convoluted API.
But if you think you can write correct asyncio applications of any significant complexity, that's great. Doesn't change the fact that a lot of highly experienced developers struggle to do so.
> But if you think you can write correct asyncio applications of any significant complexity, that's great.
To be fair, I've never written an asyncio application of any significant complexity (at least compared to the async JS application I've written). But nothing jumps out at me as horrific with asyncio. The worst thing is the split between futures and coroutines, but once you realize that there's no real need to use futures any more with asyncio (although you can integrate futures-based libraries if needed), you're good to go.
But fair criticism, I've only written small asyncio applications. The only sizable async-Python application I've made used Tornado and ZeroMQ async APIs (although even that eventually ran on the asyncio event loop and used an asyncio-based library).
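For what it's worth, the "coroutines only, no explicit Futures" style mentioned above can stay quite small. A minimal sketch, using asyncio.sleep as a stand-in for real async I/O:

```python
import asyncio

async def fetch(x):
    # stand-in for a real awaitable I/O call (HTTP request, db query, ...)
    await asyncio.sleep(0)
    return x * 2

async def main():
    # Plain coroutines plus gather: no explicit Future objects anywhere.
    return await asyncio.gather(*(fetch(i) for i in range(3)))

results = asyncio.run(main())
print(results)  # [0, 2, 4]
```

Futures-based libraries can still be bridged in when needed, but nothing here requires touching them directly.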
I completely disagree - Python threads are basically "green threads", so they have their place but aren't related to parallelisation. But true multiprocessing is ugly when you have hundreds of cores, which is where CPUs are going. There is no standard UI convention on most OSes to group those processes per app, in terms of signals or stats or whatever.
So besides the unproven possibility of removing the GIL, subinterpreters are the best way forward, better than threads or the multiprocessing package.
> they have their place but aren't related to parallelisation
You can parallelize all sorts of things with Python threads--just not some things you'd expect to be able to parallelize, due to the GIL. Waiting on or buffering I/O, calling out to compiled code, doing cryptographic operations--all of those can be parallelized, as (in many cases) they entail releasing the GIL.
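A minimal sketch of that: `time.sleep` releases the GIL, just like most blocking I/O calls, so the five waits below overlap rather than run back-to-back (the exact timing is illustrative, not guaranteed):

```python
import threading
import time

def slow_io():
    # time.sleep releases the GIL while blocking, like most real I/O calls
    time.sleep(0.2)

start = time.monotonic()
threads = [threading.Thread(target=slow_io) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
print(round(elapsed, 1))  # roughly 0.2, not 1.0: the five waits overlap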
> But true multiprocessing is ugly when you have hundreds of cores
Why?
> There is no standard UI convention on most OSes to group those processes per app, in terms of signals or stats or whatever.
I have no idea what this means. What do UIs have to do with process groups? Do you know how many processes your Chrome instance is running on the operating system? There are very solid conventions regarding process management, at least on Unix-ish systems: process groups and parent-child relationships are well established and well understood, as is their relationship with signals and signal handling.
That's true, behind the scenes they can be running on multiple CPUs, which is the definition of parallelism.
I think threads are a bad abstraction for doing parallelism personally, though. Programs designed to run in parallel should be deterministic, unless they need concurrency for some other reason. I think trying to shoehorn parallel programming into Python's threads isn't necessarily the best approach.
As far as I'm concerned, if I'm using threads and they happen to get scheduled on to multiple cores, then that's a nice optimization, but isn't necessary for what I use threads for.
If that’s what he’s saying, he’s wrong. Python threads are based on native threads. They do execute in parallel; they just don’t do so very well because of the GIL.
It's not that Python shouldn't have subinterpreters, but that they should only be added once they have a better API - at least as good as multiprocessing.
It's somewhat similar to the GIL removal effort in Ruby [1]
They are isolating the GIL into Guilds there, which are containers for language threads sharing the same GIL. They are providing two primitives for communication between threads in different guilds: send, for immutable data (zero copy), and move, for mutable data (copy). They remove the need for the boilerplate code for marshalling and unmarshalling. However, I bet that there will be some library to hide that code in Python too.
I proposed something similar for Python 9 years ago.[1] Guido didn't like it.
Objects would be either thread-local, shared and locked, or immutable.
Thread-local objects must be totally inaccessible from other threads, and not leakable across thread boundaries, for memory safety. (Python has "thread local" objects now, but it's just naming, and not airtight against leaks. You can assign a thread-local object to a global variable.) Shared and locked objects lock when you enter, unlock when you leave. Objects are thread-local by default, so single-thread programs work as before.
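To illustrate the "not airtight" point with today's threading.local: nothing stops the object itself from leaking across thread boundaries; only its attributes are per-thread. A small sketch:

```python
import threading

local = threading.local()
local.value = "set in main thread"

# Nothing stops the "thread-local" object itself from leaking: it can be
# stored in a global, captured in a closure, passed to another thread, etc.
leaked = local

seen = {}

def worker():
    # The object is fully reachable here; only its *attributes* are
    # per-thread, so the value set in the main thread is simply absent.
    seen["has_value"] = hasattr(leaked, "value")

t = threading.Thread(target=worker)
t.start()
t.join()
print(seen["has_value"])  # False: attribute invisible, but the object leaked fine
```

A memory-safe thread-local object, in the sense proposed above, would have to make the `leaked = local` step itself an error.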
Minimize shared and locked, while using thread-local or immutable objects as much as possible. Locking is needed only for shared and locked objects.
This is almost conventional wisdom today, but 9 years ago, it was too radical.
Retrofitting concurrency is never pretty. But we have to. Individual CPUs are about the same speed per thread that they were a decade ago.
Move is hard without ownership. Rust can do moves because it knows that there's nothing else pointing to the thing being moved. Python doesn't know that.
Yeah, I’d much prefer a traditional subinterpreter or isolate design with shared-nothing semantics to the current "raise an exception if shared non-frozen objects are touched" approach.
It’s much easier to map existing multi process code onto shared nothing sub interpreters.
> This, in turn, means that Python developers can utilize async code, multi-threaded code and never have to worry about acquiring locks on any variables or having processes crash from deadlocks.
Dangerous advice. Whether this is true or not depends on lots of things such as how many and which operations you're doing on those variables.
Sure, CPython might do lots of simple operations atomically, but this is not enough to avoid the need for all locks. Threads can still interleave their execution in many ways.
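For example, even under the GIL a bare `counter += 1` is a read-modify-write spread across several bytecodes, so threads can interleave between the read and the write and increments get lost. A sketch (only the locked variant is deterministic; the unlocked one may silently come up short):

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe(n):
    global counter
    for _ in range(n):
        # compiles to load / add / store: another thread can run between
        # those steps, so some increments can be lost
        counter += 1

def safe(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=safe, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000 with the lock; swap in unsafe() and it may be less
```

So "CPython makes some operations atomic" is true, but it does not generalize to compound operations on shared state.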
The current state of threading and parallel processing in Python is a joke. While they are still clinging to the GIL and single core performance, the rest of the world is moving to 32 core (consumer) CPUs.
Python's performance, in general, is crappy[1] and is beaten even by PHP these days. All the people that suggest relying on multiprocessing probably haven't done anything that's CPU- and memory-intensive, because if you have code that operates on a "world state", each new process will have to copy that from the parent. If the state takes ~10GB, each process will multiply that.
Others keep suggesting Cython. Well, guess what? If I am required to use another programming language to use threads, I might as well go with Go/Rust/Java instead and save the trouble of dabbling with two languages.
So where does that leave (pure-)Python? It can only be used in I/O bound applications where the performance of the VM itself doesn't matter. So it's basically only used by web/desktop applications that CRUD the databases.
It's really amazing that the machine learning community has managed to hack around that with C-based libraries like SciPy and NumPy. However, my suggestion would be to drop the GIL and copy whatever model has been working for Go/Java/C#. If you can't drop the GIL because some esoteric features depend on it, then drop them as well.
Cython is nice, but debugging it requires gdb. For the PyCharm-loving end-users it may be quite cumbersome.
Those recommending multiprocessing have probably never been in that bitter spot where serializing the data takes exactly as long as computing on it.
The consistent requirement has been that Python will drop the GIL for anything that doesn't make single-threaded performance suffer. There has been substantial work to this end but no solution to date has achieved this goal.
> If the state takes ~10GB each process will multiply that.
In POSIX there is such a thing as copy-on-write memory during forks. So if that state is mostly read-only, the additional memory required by each child process should be minimal.
This is essentially the same concurrency model as Workers in JS engines - on the one hand it’s a fairly limiting crutch[1], on the other hand it is harder to create a bunch of different classes of concurrency bugs.
[1] vs fully shared state of C-like, .NET, JVM, etc, etc. Rust’s no-shared-mutable state model allows it to do some fun stuff but python (and JS) don’t really have a strong concept of mutable vs immutable, let alone ownership so I don’t think it would be applicable?
Very limited shared state (only C types, including structs), which in practice means (for non-trivial apps) some form of marshaling from Python classes to the C structs. Boilerplate abounds!
In other words, nothing beyond mmap with simple size calculations, which is what you'd really want to abstract away.
As @maayank has already said, python’s shared state is limited to pure c-structures, rather than actual python objects. Web workers have the same through SharedArrayBuffer (I am unsure though whether the security pains from spectre, etc have allowed them to be turned on again).
SharedArrayBuffer is fundamentally the same as python’s C-only sharing - a bunch of raw bytes not tied to the host environment’s type system.
This is just a way to do the same thing as "multiprocessing", but with less memory usage. You still have multiple Python instances that send messages back and forth.
I wonder if they ever fixed the cPickle bug that broke it if you were using cPickle from multiple threads.
Yeah, it's got some of the same weaknesses as multiprocessing (and several new ones). Conceivably you could provide an API for handing off objects to the other interpreter without copying. I'm imagining an API like:
my_foo = interpreterX.pass_object(my_foo)
(The assignment being required to delete the originating reference from the source interpreter.) The interface would be obligated to check that there are no references that escape to the current interpreter and then my_foo and all referenced objects could be handed off to the other interpreter in whole.
I don't have any intuitions for if that would be cheaper than copying or not, and getting it right is certainly more difficult than serialization. (Because of the complexity, it's not worth having if it isn't cheaper.)
If they live in the same process (address space), then it's a lot harder to correctly manage concurrent memory access (at both low level and high).
It's not a theoretical problem, just a very likely pragmatic observation. (Meaning, we'll see bugs in both the interpreter and the code using this feature.)
Plus if one of the threads crashes, the whole process aborts. (Sure, the interpreter can handle a lot of faults gracefully, but not all.)
The main weakness is the one pas described in the sibling comment. Historically, Python C modules, inside and outside stdlib, have more or less safely been able to assume there was one global interpreter. This means they may have global state (apparently even some stdlib C libraries assume this). With multiple interpreters in the same process address space, any global state in C modules conflicts.
I'm not claiming any originality here — the author of the article recognizes this problem and describes it, starting with:
> Because CPython has been implemented with a single interpreter for so long, many parts of the code base use the “Runtime State” instead of the “Interpreter State”, so if PEP554 were to be merged in its current form there would still be many issues.
Less memory usage, and - hopefully - without all the quirks that crop up with multiprocessing. Off the top of my head: subprocesses don't always want to die along with the main process; error conditions can cause the underlying IPC layer to end up in a permanently stalled state.
It avoids some quirks but also introduces new quirks multiprocessing doesn't have, like broken C modules (including parts of the core interpreter and stdlib) that have global state, rather than per-interpreter state. There's a huge ecosystem of Python libraries in the world and most have been able to more or less ignore the distinction between per-interpreter state and global state prior to this proposal. (Not true if you actually used the C API to embed many interpreters in a process, but most people don't do that.)
I don't think you can do this with traditional dynamic linkers — the separate memory space is not so difficult (requires relocatable libraries, -fPIC, which is usually already enabled on ASLR systems) — but you would want to be able to load the same .so twice without symbol naming conflicts. I don't think most dynamic linkers (ld-linux / rtld-elf) support that. I could be mistaken, I am not very familiar with any implementation.
There is nothing preventing you from adding this support to an existing dynamic linker and using it for your program, though.
> but you would want to be able to load the same .so twice without symbol naming conflicts
This is supported in ld-linux by using the dlmopen() glibc function and distinct namespaces. Loading the same .so file multiple times is one of the use cases explicitly mentioned in the manual page[1].
That requires making the native parts of the stdlib position independent (relocatable). I don't know what obstacles are there currently, but basically you'd need to have a runtime linker that gives a reference to the loaded object/module to access the symbols, but these are usually handled by the C compiler and linker. So it's doable, just a lot of effort.
Theoretically, if this proposal lands, all of the Python stdlib will have to be fixed not to have global state — the main concern would be 3rd party C libraries, I think.
No, Mr. Click-baity-title it’s not. They’re still there just you can use many interpreters now like one would when using the multiprocessing module. I do like the idea of Go-like queues for message passing.
From my limited understanding, I think Eric Snow’s push to use subinterpreters is to move an orchestration layer for multiple Python processes from the service layer to the language layer. It may also modularize Python’s C API scope. It may also be one of the cheapest ways to provide true CPU-bound concurrency in Python, which is important given Python’s limited resources.
Wow, just like perl threads since perl 5.8 (1)
When in doubt, look at the granddaddy of scripting languages; all your trials and tribulations in scripting land have been considered in the past.
Let's all sing 'Living in the Past' by Jethro Tull (2); this one is also good (3)
Tcl has had threads that were subinterpreters for over a decade. I find it quite ironic that Python, it would seem, is reinventing it, only less elegantly.
I'm personally glad that Python is (poorly) copying this feature from Tcl. This means it's closer to the time when JavaScript (poorly) copies it from Python ! ;-)
This sounds like an application (or variation) of the apartment threading model[0]. Given the problem and its description/characteristics (Global Interpreter Lock), this sounds like an elegant approach.
There's nothing wrong with the GIL as long as you know it's there. It makes writing concurrent code in Python semi-magical, and that's a huge benefit. Concurrent != parallel though, so if there's really a need to scale up to multiple cores there's always the option of forking with multiprocessing or "sub interpreters."
I can think of maybe having network code run in its own process and the UI in another. That way there's no risk of bottlenecks slowing down the UI, and transfers are likewise protected. If you look at bottle.py, it seems this approach could add A LOT of performance for managing downloads/uploads if it's done right.
It means you never have to worry about message passing or locks, because they aren't a thing at all. On the other hand, what looks like concurrent code often isn't, and is actually run slower than single process code, also because of the GIL.
That reduces to "you don't have to worry about performance because the GIL doesn't let you write performant code". That's not what I think of as "helpful".
> Another issue is that file handles belong to the process, so if you have a file open for writing in one interpreter, the sub interpreter won’t be able to access the file (without further changes to CPython).
Wouldn't just using CLONE_FILES when forking off interpreters solve this problem?
It could be better phrased: "whilst CPython can be multi-threaded, only 1 thread can be executing Python code at any given time." Other threads can be doing other things at the same time -- just not actively interpreting Python bytecode.
It is because only one thread at a time holds the lock in order to avoid race conditions. The keynote[1] by Raymond Hettinger from PyBay '17 will be a great place to start if you are new to this.
Not all operations are CPU bound. For anything that is IO bound, such as reading a file, db access, network calls, etc, CPython threads work just fine.
Until someone adds a callback from C back into Python and the unexpected GIL contention makes performance dramatically worse even as compared to the single-threaded version.
Concurrency allows multiple threads to interleave with each other. It does not guarantee parallelism (two or more threads executing at the same time). It's similar to multiple threads running on a uniprocessor system, with the difference that I/O can happen in parallel.
I'm not the person you were originally responding to, but I know Python is popular in data science and such where there would be a lot more tied up in pure computation/number crunching.
I know that you already got an answer to this effect, but to provide some more information, most data science workflows in Python rely heavily on calls into numerical libraries (numpy, scipy, pandas, tensorflow, pytorch, matplotlib) that are Python wrappers over compiled binaries (mostly C and Fortran, a not-inconsiderable amount of handwritten Assembly), that have been constructed so that the wrapper safely yields the GIL before invoking the underlying binary. This is all the more important when considering libraries like tensorflow or pytorch that may involve complex long-running interaction with training resources across a network. Control is yielded to allow the interpreter to continue carrying out tasks like displaying the ongoing training progress, or loading training data.
I should have clarified: when I said "doesn't work particularly well", I meant that it's incredibly difficult to saturate, say, a 100-thread box. (Which with Rome isn't crazy!)
Only being able to run a single thread of "logic" basically makes that impossible, as you need to usually do some computation to figure out what bytes to send/receive, do something with the results, and so on.
The program doesn't have to be massively I/O bound for threads to be useful. Even with only a UI thread and a single background thread they're useful for animating a progress indicator in the UI thread while waiting for a slow network request to finish in the background thread, which works just fine in Python.
I know nothing about Python internals, but my understanding from the article is that this "overhead" is about creating a new sub-interpreter (loading modules is particularly slow in Python), not the performance of executing code after it's created.
The article also makes it clear that each sub-interpreter still has its own GIL, but two sub-intepreters can run at the same time without having to care about each other's GILs.
Message passing via serialization/copy is inherently more expensive than just passing a reference between threads. So any benefit vs. threaded Python depends on the ratio of IPC to actual Python bytecodes interpreted.
IPC-heavy programs with low concurrency may suffer worse under this model than threaded with traditional single GIL. As threads approach infinity, though, anything that scales beats the single GIL model.
Other overheads include spinning up a full interpreter state (including object and malloc caches, GC, etc) per sub-interpreter. And there are some modules with process-global semantics, such as signal-handling — it's unclear how that will be coordinated between co-interpreters, if at all.
Exactly. This seems more like a way to advertise that the GIL is gone while not actually having shared memory parallelism via threads.
It reminds me of how Python advertising tricks people into thinking Python has real parallelism by talking about asyncio and `import multiprocessing`; people then actually try those out, only to discover the sad state of affairs.
Are there any overall benchmarks for Python 3.8 yet? I know there are a bunch of performance improvements for calling functions and creating objects, but I have no idea how that translates to real software.
Huh. This sounds a lot like Ruby Guilds. This looks like it will land sooner, though likely in less complete form, as even the prototype Guild implementation has inter-guild communication.
Wouldn't it be good to have Python 4.x next, with all these workarounds cleaned up and only one right, pythonic way for parallel processing? Surely with a bit of backward compatibility sacrificed, like 2 vs 3.
> Surely with a bit of backward compatibility sacrificed, like 2 vs 3.
After 10 years, the 2to3 transition is still ongoing, and lots of companies will be dealing with python2 code for a long, long time. Breaking backward compatibility was a terrible decision. Breaking it again now that most libraries and open source projects have finally moved to python3 would be suicide.
Programming languages should take backward compatibility as one of their most important features, because of the hundreds of millions of lines of code already written and that would need to be changed and validated. This is especially true in a dynamic language where you can't even rely on the compiler to point out the issues.
Getting rid of the GIL would make the 2 to 3 transition look like kindergarten. There's lots and lots of Python code (and C extensions) that makes assumptions enabled by the GIL.
For starters:
If you get rid of CPython's atomic operations on lists and other simple objects, lots of Python code would need more locks than it needs now.
If, on the other hand, you wish to keep those operations atomic, without the GIL that means introducing fine-grained locks on specific objects, which would worsen even single-threaded performance. Imagine taking a pthread mutex each time you append an item to every list or store a number in every variable.
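For instance, today a plain `list.append` from multiple threads is effectively atomic under the GIL (it's a single C-level call), so code like this needs no user-level lock; a GIL-free CPython would have to lock the list internally on every call just to keep it working:

```python
import threading

items = []

def producer(n):
    for i in range(n):
        # list.append is one C-level operation, so the GIL makes it atomic:
        # no user lock needed, and no appends are lost
        items.append(i)

threads = [threading.Thread(target=producer, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(items))  # 40000: every append landed
```

Preserving that behavior without a GIL means paying for per-object locking on every such call, which is exactly the single-threaded slowdown described above.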
Considering this will basically require a total rewrite of the C extension API, the idea that it could be done in a minor release seems shocking (though I haven't had a chance to look yet).
It will probably come to something like that (for the Unicode issue alone) but it would be a disaster. Python has always had an issue with ongoing changes that break existing code. That was the idea behind Py3, that we would do all the breakage at once and be done with it forever. Ironically that made Py2 a much better target for a time as it was not changing very much.
Whoever is downvoting this, please realize this design stinks strongly of something Perl 5 had a long time ago, which almost nothing could make use of due to a variety of limitations.
The question in my mind is how the Python design addresses the problems that made the perl equivalent so unusable. It's only had ~20 years of context to do better, so I'm hopeful
"The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that make them easy to misuse. Few people know how to use them correctly or will be able to provide help.
The use of interpreter-based threads in perl is officially discouraged."