How fast can we make interpreted Python? (phi-node.com)
167 points by iskander on June 26, 2013 | 80 comments



There are a couple of things you want to do (some of which overlap with the article):

1) Use a register-based VM (with a sliding and growing register file) instead of a stack-based VM. In theory you can make a stack-based VM fast with lots of macroinstructions that fuse smaller operations together, but it isn't worth it.

2) Use inline caching for method calls, property accesses, and primitive operations that do type checks. In an interpreter you can modify the instruction stream even on platforms that disallow modification of executable code (a toy sketch follows at the end of this list). I know this isn't the origin of the technique in bytecode interpreters, but here's a paper describing it in case it's not obvious:

http://www.lirmm.fr/~ducour/Doc-objets/ECOOP10/papers/6183/6...

3) Pick your value encoding carefully. You almost always want fast immediate integers. On 64-bit platforms it is quite common these days to repurpose some of the NaN range in IEEE doubles for type tags to enable storing doubles in immediate values. A sketch of this encoding also follows at the end of this list.

4) Write your interpreter in assembly. Compilers generate terrible code for interpreters, even (especially?) with the use of computed goto / labels-as-values extensions. The register allocators of traditional compilers are designed to optimize loops by moving spill code outside of them and to reduce the impact of function calls. They will not be able to realistically allocate registers across different instruction bodies, and they won't be able to make the correct tradeoff about how much work to push into the slow path of instruction bodies.

5) Rearrange your instruction bodies based on execution / transition frequencies to improve instruction cache performance.

6) Pay close attention to the boundaries between your interpreter and the runtime libraries / the FFI. You don't want to take a bigger hit than you need to every time you call out to native code.
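
To make point 2 concrete, here is a toy monomorphic inline cache in Python. All names are invented for illustration; a real interpreter would attach the cache to a call site in the instruction stream and invalidate it on class mutation (and would also handle instance-dict shadowing and __getattr__):

    class InlineCache:
        def __init__(self):
            self.klass = None      # receiver class seen here last time
            self.method = None     # method resolved for that class

    def call_method(ic, obj, name, *args):
        klass = type(obj)
        if ic.klass is not klass:             # slow path: look up and cache
            ic.klass = klass
            ic.method = getattr(klass, name)
        return ic.method(obj, *args)          # fast path: no dict lookups

    ic = InlineCache()
    print(call_method(ic, "hello", "upper"))  # HELLO (miss, then cached)
    print(call_method(ic, "world", "upper"))  # WORLD (hit: same class)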
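
And a sketch of the NaN-boxing from point 3, using Python's struct module purely to show the bit layout (a real VM does this on raw 64-bit words in C, and the tag value here is made up, not any particular VM's encoding):

    import struct

    # Quiet NaNs start at 0x7FF8...; the payload bits below that are
    # free, so slices of the NaN space can be claimed for tagged values.
    TAG_MASK = 0xFFFF000000000000
    TAG_INT = 0x7FF9000000000000   # invented tag for 32-bit integers

    def box_double(d):
        # an ordinary double is stored as its own bit pattern
        return struct.unpack("<Q", struct.pack("<d", d))[0]

    def unbox_double(v):
        return struct.unpack("<d", struct.pack("<Q", v))[0]

    def box_int(i):
        return TAG_INT | (i & 0xFFFFFFFF)   # payload in the low 32 bits

    def is_int(v):
        return (v & TAG_MASK) == TAG_INT

    def unbox_int(v):
        i = v & 0xFFFFFFFF
        return i - (1 << 32) if i & (1 << 31) else i   # sign-extend

    assert unbox_double(box_double(3.14)) == 3.14
    assert is_int(box_int(-7)) and unbox_int(box_int(-7)) == -7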


> 3) Pick your value encoding carefully. You almost always want fast immediate integers. On 64-bit platforms it is quite common these days to repurpose some of the NaN range in IEEE doubles for type tags to enable storing doubles in immediate values.

That technique applies more to JavaScript, which uses doubles as the standard number type, than to Python, which has both integers and floats. Still a good idea to make sure that both native integers and native floats end up as unboxed native types in registers, though.


Thanks for the correction, I wasn't aware that Python uses floats. I would guess that most new languages starting today would use doubles instead of floats.


'''...almost all platforms map Python floats to IEEE-754 “double precision”.''' http://docs.python.org/2/tutorial/floatingpoint.html


Python as a language requires the float type to have at least double precision. This is sadly not documented explicitly, but the big four Python implementations all respect this.


I didn't mean that Python uses "float" instead of "double"; I meant that Python has both integer and floating-point types, not just floating-point types as in JavaScript.


>Rearrange your instruction bodies based on execution / transition frequencies to improve instruction cache performance.

Do you mean...group all the frequent operations together so they overlap on cache lines? It's hard to tell how much this would help; have you tried it?


When I was working on WebKit we would rearrange instruction bodies to influence the generated code based on opcode statistics, but back then the interpreter was using computed goto, so there wasn't quite a direct connection between the placement of the input code and the generated code. It's unlikely that any two instruction implementations will overlap on cache lines, since they are all typically larger than a cache line, but more temporal coherency throughout code execution will improve performance, especially on CPUs with smaller caches.

You can do it automatically by gathering statistics on frequent instruction pairs. In practice greedy algorithms for code scheduling work fairly well, assuming you have meaningful statistics.
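
Such a greedy pass might look like this (a sketch; pair_counts is assumed to come from profiling an instrumented interpreter, and the opcode names are made up):

    from collections import Counter

    def greedy_layout(pair_counts):
        """Order opcode bodies so frequent successors sit adjacent.
        pair_counts maps (opcode_a, opcode_b) -> observed transitions."""
        totals = Counter()
        for (a, b), n in pair_counts.items():
            totals[a] += n
            totals[b] += n
        order, placed = [], set()
        current = totals.most_common(1)[0][0]   # seed with the hottest op
        while current is not None:
            order.append(current)
            placed.add(current)
            # follow the hottest not-yet-placed successor, if any
            succ = {b: n for (a, b), n in pair_counts.items()
                    if a == current and b not in placed}
            if succ:
                current = max(succ, key=succ.get)
            else:   # dead end: restart from the hottest unplaced opcode
                rest = [op for op, _ in totals.most_common()
                        if op not in placed]
                current = rest[0] if rest else None
        return order

    stats = {("LOAD", "ADD"): 90, ("ADD", "STORE"): 70,
             ("STORE", "LOAD"): 60, ("LOAD", "CALL"): 10}
    print(greedy_layout(stats))   # ['LOAD', 'ADD', 'STORE', 'CALL']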


Looks like he's contributed significantly to the WebKit JavaScript interpreter: https://www.webkit.org/blog/196/cameron-zwarich-is-a-webkit-...


This is great, it's taken me months to learn all the things you just listed. Do you know of anywhere where this type of thing is discussed?


I mostly picked it up by working on interpreters and compilers, doing performance analysis on the generated code, trying to find all papers on the subject (some of them good, some of them really bad - you have to filter them yourself), and reading about what other implementations have been doing. It's also important to be able to run experiments quickly; you don't want to have to get every patch production-worthy before doing basic performance analysis.


Aha! So I am on the right track! :)


Language implementation sometimes gets discussed on Lambda the Ultimate. Also, searching for anything and everything Mike Pall has written about VM design is probably worthwhile. Mozilla developers also have some pretty interesting blog posts about TraceMonkey, IonMonkey and all the other monkeys.


I've actually seen most of those, I guess I'm more advanced than I thought :)


You're probably ready to write your own JIT -- I'd recommend making one for Python ;-)

I think after the low-hanging fruit above, there's a lot of weird interpreter/compiler folklore in old usenet posts, Forth VM designs, and random papers (like the Register vs. Stack machine showdown).

There are also some enlightening books like "Lisp in Small Pieces".

I'm at SciPy right now and I was just talking to the Julia developers yesterday about the need for a textbook, website, or wiki to gather all this disparate info in one place.

Want to help us get it started?


JIT? ...as soon as I get over my perfectionism long enough to finish the parser :)

I'd enjoy being on a project like that. Can I send you an email?


Sure, it's alex dot rubinsteyn atop google's excellent email service.


It seems to me over the past ten years I've heard this story so many times: "Python sucks. Let's do the obvious thing that makes it faster." Then, a month or two later, "I did the obvious thing and it's sometimes faster but often slower, net no gain or possible loss." to which the response is obviously "No sale."

I say this merely as an interesting observation. I've come to consider this a de facto counterargument to the claim that languages aren't slow, only implementations are. It may be theoretically true, but in practice, as nice as Python may be to use, it has proved a very difficult language to speed up. (PyPy has taken a very good run at it, but it sure wasn't a case of "I'll just do this easy, obvious thing." PyPy seems to have hit Python's performance with multiple PhD-thesis level attacks, and it's still certainly not C in the general case.)


>It may be theoretically true, but in practice, as nice as Python may be to use, it has proved a very difficult language to speed up.

I disagree with you. Python isn't much harder to speed up than Lua, and in some ways it's better behaved than JavaScript. Still, both of those languages enjoy implementations significantly faster than CPython. Really, it's not the semantics of the language which hold back Python's performance, but rather the fact that extension modules are extremely tightly coupled with a particular interpreter implementation. Lua has a clean interface with C; JavaScript implementations generally force the outside world to use doubly indirect handles on objects. CPython, on the other hand, is shameless in flaunting its internals for the whole world to see.

The only reason that PyPy has taken "multiple PhD-thesis level attacks" to near completion is because their approach is insanely ambitious. They didn't write a JIT. Instead they wrote a toolkit for partially evaluating interpreters on source files, generating native code by tracing an interpreter while it itself runs a program. It's nuts! It's amazing that PyPy works, and the amount of effort is totally unsurprising.

Had they gone a more traditional route, the whole thing could have been done in a year or two. They would, however, still face resistance from a Python community that wants to neither give up nor rewrite their PyObject-laced libraries.


I think you underestimate the complexity of the Python language.

Note that PyPy is not the only project that did that - remember Psyco? There are reasons why after 3 years Armin said "I give up, let's do PyPy". It's not the "well behaved" part (that can be worked around); Python is simply more complex than JavaScript or Lua, and by complex I mean just bigger. All the extension modules that everyone naturally expects to be fast (even just the stdlib), the descriptor protocol, crazy frame access semantics. That does make it very labour intensive to do the right thing. Look what happened to Unladen Swallow - they did not really get anywhere within a year. Several of the PyPy optimizations that took forever to do are really new stuff, whether you do the JIT by hand or generate it automatically.


The last time I talked with the Unladen Swallow guys (a couple years ago), they were pretty clear that one of their main stumbling blocks was supporting the Python/C API, and wanting to have complete compatibility with C extension modules. While we can't really know how hard the task would've been if they'd lifted that requirement - it was baked into their design from an early stage - when I'd floated the idea of doing a from-the-ground-up LLVM-based implementation of Python, they seemed significantly more optimistic. It wouldn't be all that useful for most people, but it would've worked fine for my use case.

Alas, it's doubtful that a Python implementation that sacrifices C extensions would get all that far with mainstream adopters, as so many useful libraries are done as C extensions.


They were optimistic at the beginning too (even with C extensions). How would it fill your use case in a way that PyPy does not?


We were looking for something easily embeddable, but all host modules were provided by the application, so there was no need for outside C extensions. The set of libraries that was importable was restricted, and coding style guides banned advanced language features like metaprogramming, so we could afford to cut corners on corner cases of the language. "Decent" performance (i.e. more like Java than CPython) was a requirement, as was multithreading support and the lack of a GIL, and RAM usage was also at a premium (which was probably the largest argument against PyPy...also, this was a couple years ago, when PyPy was not as mature).


You are wrong.

Python's object model is incredibly rich in ways that JavaScript and Lua don't even come close to touching. Let me list some things that you'll see in Python code that you're not gonna see in JS or Lua (a couple are demonstrated in the sketch after the list):

* Objects that don't extend the object hierarchy (you don't have to extend from `object`)

* Types that don't extend the type hierarchy (Python has full metaclassing)

* Any object can elect to become callable; calls are almost message passes

* Two different levels of message-passing method/attribute rewriting (__getattr__ and __getattribute__)

* Descriptors, such as properties (no, real properties), are baked into the object model

* The table of globals can be altered at any time, frustrating static analysis

* The table of locals can be altered too!

* The table of builtins can be altered!! (Is nothing sacred?)
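
A quick demonstration of two of these, __getattr__ interception and mutable builtins (purely illustrative code, not taken from any interpreter):

    import builtins

    # Attribute access can be intercepted wholesale, and any instance
    # can elect to become callable:
    class Proxy:
        def __getattr__(self, name):       # runs for any missing attribute
            return "<resolved %s at runtime>" % name
        def __call__(self, *args):         # instances are callable too
            return args

    p = Proxy()
    print(p.anything_at_all)   # <resolved anything_at_all at runtime>
    print(p(1, 2))             # (1, 2)

    # Even the builtins table can be altered, defeating static analysis:
    builtins.len = lambda x: 42
    print(len([1, 2, 3]))      # 42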

In addition, PyPy did not start out as a partial evaluator and meta-tracing JIT generator. What you're seeing is the result of about a decade of work and a half-dozen iterations. They started out with something much like the thing that you would expect to see, but just like every other Python JIT project, they learned that Python is complex and difficult to optimize.

So, uh, you're wrong. Sorry.


Is Python more complicated than Lisp or Scheme or Smalltalk or Self? All are dynamically typed and infinitely flexible.


There is some truth here; even the militantly dynamic Common Lisp has a blisteringly fast implementation (SBCL).


Python's bytecode interpreter loop is very tight. As previously mentioned, it's a simple stack-based VM with a small instruction set. The upshot is that it is very short, meaning that the executable is very small, meaning that most of it fits in the CPU cache.

Cache misses are incredibly expensive, and any "obvious optimization" of Python's core will inevitably introduce more of them because the code gets longer. So most such optimizations win big in the area they target, but lose in general performance.

For the same reason, gcc -Os (optimizing for a small binary) is often faster than gcc -O3.

On top of that, there is significant resistance from the devs to complicating the core. They prefer a slower but easier to understand, easier to analyze interpreter over a complex one with harder to predict runtime performance. It's the same story with reference counting versus theoretically superior garbage collection.


If you haven't already, LuaJIT's source code (and Mike Pall when asked) is a treasure trove of speedup ideas.

One idea that stood out to me (and which I first saw in LuaJIT, and as far as I know originated with Pall) is: when rewriting loop code, unroll at least 2 iterations of the loop. (The first executes and conditionally continues into the second; the second loops onto itself). So far, just extra work.

However, any kind of constant folding algorithm is now immediately elevated into a "code hoisting out of loop" algorithm at no extra cost - e.g., SSA form gets that kind of code motion.

I'm not sure Python can make much use of that, because it is nearly impossible to guarantee idempotence of operations - but in case you can somehow make that guarantee, that can be very significant for e.g. function name lookups.

A possible way to use that is to have the loop opcode have two branch targets: "namespaces modified" (which goes to the first iteration, which reloads values) and "namespaces unmodified" (which loops at the 2nd iteration, relying on the constant folding and not looking up in dicts again). This could make calls like "a.b.c.d.e.f" require 0 lookups in most iterations of most loops -- but would also require a global "namespace modified" flag.
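
Hand-peeling the first iteration in Python shows the shape of the transformation (illustrative code; math.sqrt stands in for an invariant lookup the compiler can't otherwise prove safe to hoist):

    import math

    def before(xs):
        total = 0.0
        for x in xs:
            f = math.sqrt       # invariant lookup, redone every iteration
            total += f(x)
        return total

    def after(xs):
        total = 0.0
        it = iter(xs)
        for x in it:
            f = math.sqrt       # first iteration: do the lookup once
            total += f(x)
            break
        for x in it:            # peeled body: `f` is simply reused;
            total += f(x)       # a constant-folder gets this for free
        return total

    print(before([1.0, 4.0, 9.0]), after([1.0, 4.0, 9.0]))  # 6.0 6.0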


We have this optimization in PyPy (so yes, Python can do it). There is even a paper [0]. LuaJIT is a great source of inspiration but Python is just a huge language, which makes it much harder.

http://www.maths.lth.se/matematiklth/vision/publdb/reports/p...


I don't think it's correct to say that CPython is slow. What you can more accurately say about CPython is that the performance is highly variable. Some things are very fast, while others are comparatively slow.

The slow things tend to be the sort of numerical loops that you see in micro-benchmarks. It's no coincidence that the version of Python in the linked article saw its greatest speed up in a numerical loop, but only modest improvement elsewhere. It's exactly this sort of simple repetitive operation where interpreter overhead matters the most.

Language features that encapsulate complex functionality tend to be harder to speed up in CPython because the VM operates at a fairly high level. In effect you're just kicking off a large subroutine that is written in C, and you're really executing native code until that operation is complete. You're not going to improve very much on that no matter how much you try.

What this means is that speed will depend heavily on the type of application program being written, and also on how much the programmer takes advantage of the unique language features. It also makes realistic cross language benchmarks difficult because the right way to do something in Python may not have a direct equivalent in another language. The result tends to be "lowest common denominator" benchmarks, which are exactly the sort of algorithms which CPython does worst at.


It really is a slow interpreter. Python comes with a pystone benchmark, and CPython is invariably the slowest of all interpreters. That being said, if you take a look at its implementation, it's immediately obvious why. The interpreter is basically a simple C switch, with no optimization whatsoever. Simply threading the interpreter would make it about a factor of 2 faster (at least that's what the experts claim you gain by threading).


Pystone isn't a performance benchmark, or at least it isn't a useful one. It's more of a regression test to see if anything has changed between versions. It's not useful as a performance benchmark because it doesn't weight the results according to how much the individual features matter in real life. There are three versions of Python besides CPython that are in commercial use. Two are much slower than CPython (up to three times slower), and PyPy is (currently) faster in some applications and slower in others.

The CPython interpreter is not a simple switch. It uses computed gotos if you compile it with gcc. Microsoft VC doesn't have the language support needed for writing fast interpreters, so the Python source is written in a way that will default to using a switch if you compile it with MS VC. So, on every platform except for one, it's a computed goto.
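
Computed gotos can't be expressed in pure Python, but the flavor of the "threaded code" idea (resolve each opcode to its handler once, ahead of time, so the hot loop does no per-opcode table lookup) can be sketched; every name here is invented for illustration:

    def op_inc(state): state[0] += 1
    def op_dec(state): state[0] -= 1

    HANDLERS = {"INC": op_inc, "DEC": op_dec}

    def run(code):
        # the one-time "threading" pass: opcodes -> handler references
        threaded = [HANDLERS[op] for op in code]
        state = [0]
        for handler in threaded:   # hot loop: one indirect call per opcode
            handler(state)
        return state[0]

    print(run(["INC", "INC", "DEC"]))   # 1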

Modern CPU performance is very negatively affected by branch prediction failure and cache effects. A lot of the existing literature that you may see on interpreter performance is obsolete because it doesn't take those factors into account, but rather assumes that all code paths are equal. Threading worked well with older CPUs, not so well with newer ones.

I am currently working on an interpreter that recognises a subset of Python for use as a library in complex mathematical algorithms. As part of this I have benchmarked multiple different interpreter designs for it and also compared it to native ('C') code. It is possible to get a much faster interpreter, provided you limit it to doing very simple things repetitively. These simple things also happen to be the sorts of things which are popular with benchmark writers (because they're easy to write cross language benchmarks for), but which CPython does not do well in.

A sub-interpreter which targets these types of problems should give improved performance in this area. Rewriting the entire Python interpreter though would probably have little value, as the characteristics of opening a file or doing set operations, or handling exceptions are entirely different from adding two numbers together.

There is no such thing as a single speed "knob" which you can crank up or down to improve performance. There are many, many, features in modern programming languages, all of which have their own characteristics. Picking out a benchmark which happens to exercise one or a few of them will tell you nothing about how a real world application will perform unless it corresponds to the actual bottlenecks in your application. For that, you need to know the application domain and the language inside and out.

One thing about Python developers is that they tend to be very pragmatic. When someone comes to them with an idea, they say "show me the numbers in a real life situation". More often than not, the theoretical advantage of the approach being espoused evaporates when subjected to that type of analysis.


http://hg.python.org/cpython/file/16fe29689f3f/Python/ceval....

looks like a switch to me.

Anyway, I've heard them say the CPython interpreter is kept very simple on purpose, to allow it to function as a standard 'definition' of the language's behaviour. A simple JIT does wonders, as does a less brain-dead GC. Superinstructions, threading, ... are all possible. But you're absolutely right: it's really difficult to predict how much each improvement would contribute.


Have a look at the lines starting at line 821 in the very file you referenced. I have quoted a bit of it here:

"Computed GOTOs, or the-optimization-commonly-but-improperly-known-as-"threaded code" using gcc's labels-as-values extension (...) At the time of this writing, the "threaded code" version is up to 15-20% faster than the normal "switch" version, depending on the compiler and the CPU architecture."

They also have an explanation of the branch prediction effect which I mentioned earlier.

They have both methods (switch and computed goto) since some compilers don't support computed gotos, and some people want to use alternative compilers (e.g. Microsoft VC).

In my own interpreter, I tried both switch and computed gotos, as well as another method called "replicated switch". I auto-generate the interpreter source code (using a simple script) so that I could change methods easily for comparison. In my own testing, computed gotos were about 50% faster than a simple switch, but keep in mind that this was strictly numerical-type code. More complex operations would water that down somewhat, as less of the execution time would be due to dispatch overhead.

Computed gotos aren't really any more complex than a switch once you understand the format, and as I said above you can convert between the two with a simple script. What does get complex is doing Python level static or run time code optimization to try to predict types or remove redundant operations from loops. CPython doesn't do that, while Pypy does this extensively. It's these types of compiler and run-time re-compile optimizations which make the big difference.

Overall, my interpreter is currently about 5.5 times faster than CPython with the specific simple benchmark program I tested. However, keep in mind it only does (and only ever will do) a narrow subset of the full Python language. Performance is never the result of a single technique. It's the result of many small improvements each of which address a specific problem.


So the conclusion really is: CPython is way slower than it should be. Question: if the subset is small, isn't it better to use something like Shed Skin? http://code.google.com/p/shedskin/

I once looked at it, and it does a fairly literal translation. The only problem is that it changes the semantics of the primitive types. For example, a Python integer becomes a C++ int (and overflow semantics change).


As a web developer, I've often heard neckbeards bickering about Python's performance, but haven't had a real point-of-reference to understand how bad it can be until recently.

I've started working on a side project that processes geo data in AppEngine. My dataset includes many long lists of numbers (lats, longs, altitudes, timestamps, etc.). A 700-route dataset is about 25MB in a sqlite database, but trying to access any significant portion of it quickly maxes out the 4GB of RAM available on either of my dev machines (which is more than I could reasonably expect to be provisioned in the cloud). I mentioned this as a potential bug to the relevant Googler at I/O this year and he basically said "that's not us, that's Python."

It's mindboggling how quickly you can burn through your RAM in CPython. Hopefully you can prove something that will eventually make its way back into CPython and lift everyone's boats. Unfortunately, even if Falcon helped on my dev machine, I can't imagine it being taken up on cloud platforms like AppEngine.


If you don't have a lot of experience with Python you may not know how some of the "magic" parts actually work inside. Some of the things you can do however will end up allocating a lot of memory with extra copies of data that you don't need. If you do that a lot, you will chew through lots of memory. There are often two or more ways of accomplishing the same thing, with one way creating duplicates of data, and the other not. There are valid applications for both, and if you're only dealing with small amounts of data then the differences don't really matter.

Where a lot of people who are new to Python run into problems is that the language is deceptively easy. They try writing Python code that's simply a direct analogue of how they would write Java or C#. The resulting code will run, but it will often be slow and a lot more verbose than necessary. Very often the way you would do something in Java or C# is the worst possible way to do it in Python. Conversely, the best way to do it in Python often has no direct analogue in Java or C#. With Python, the learning curve is shallow, but it's very long, and there's lots to learn if you want to reap all the benefits.

Without knowing what your data or algorithms are, it's pretty difficult to give any sensible detailed advice. However, if you are dealing with long "lists" of numbers, perhaps what you really want is long "arrays" of numbers. Lists and arrays are not the same thing in Python.
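
For instance (approximate sizes, 64-bit CPython assumed):

    import array
    import sys

    n = 10**6
    lats_list = [float(i) for i in range(n)]   # list of boxed float objects
    lats_arr = array.array("d", lats_list)     # packed raw C doubles

    # The list stores ~8 bytes of pointer per slot, plus a ~24-byte float
    # object per element elsewhere on the heap; the array stores 8 bytes
    # per element, full stop.
    print(sys.getsizeof(lats_list))            # ~8 MB of pointers alone
    print(sys.getsizeof(lats_arr))             # ~8 MB total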


It's easy to burn through RAM in Python because it's easy to keep unnecessary data around. Iterating through a large dataset is better than storing it all (and all the subsequent 'filtered' data) in memory at once.
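
For example (fetch_rows and transform are stand-ins for whatever produces and processes the dataset):

    def fetch_rows():
        for i in range(10**6):
            yield i

    def transform(row):
        return row * 2

    # materializing keeps the whole list (and any filtered copies) live:
    results = [transform(r) for r in fetch_rows()]

    # streaming keeps only one row live at a time:
    def transformed():
        for r in fetch_rows():
            yield transform(r)

    total = sum(transformed())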


Did the person from Google that you talked to happen to mention an issue number in Python's tracker, or anything of that sort?

Something seems very wrong if every 1 MB of data read from your SQLite database ends up consuming 150+ MB of memory in some way.

Are you able to provide any sample code and a SQLite database that exhibit this problem, so attempts can be made to fix it?


A dataset with "many long lists of numbers" sounds like an ideal use case for NumPy, have you tried using that?


I haven't. I don't have any complex math in mind (yet), just some simple transformations. The problem is that even something as simple as checking a list for potential duplicates becomes really RAM intensive for sufficiently large lists. (I'm not even doing deep equality, just comparing metadata.)

I still have plenty more work to do on the project. I think I'll end up fanning out each list iteration into a series of smaller chunks to keep me from blowing through all the RAM on any one request.


Numpy supports lots of array math, but another way to think of it is as an api for working directly with memory (and values stored as platform types instead of python objects).

(Which you may well realize...)


Or even simply the array module.


GAE allows only pure Python. No binary modules like NumPy.



Python has an array class too:

http://docs.python.org/2/library/array.html


I often find this problem with Python, although usually it is the parsing code (the code that loads all the data up) that actually sets the high-water mark of RAM usage.

For example, I had a 100MB JSON file that I tried to load with the stdlib json library. It quickly used >8GB (my machine's RAM) and started paging, dragging everything to a halt. This is partly because the stdlib JSON parser is written in Python.

Now, if you switch to a small, clever implementation called cjson[1], it can load the whole thing without going above 300-400MB of RAM, and the high-water mark is the data itself at the end. Much better!

So, in summary, make sure that it's the important part of your code that uses all the RAM - and not some "hello world" quality stdlib code that's killing you. If it is, and there isn't a cjson for the job, I've found wrapping C/C++ libraries with Cython[2] a simple way to solve the problem without too much hassle (generally only a couple of days' work at a time if you're tight, and only wrap the functions you actually need to use yourself).

[1] https://pypi.python.org/pypi/python-cjson - although there's a 1.5.1 out there somewhere with a fix for a bug that loses precision on floats...which is the only one I use personally. It's so hard to find that I keep a copy of the source in my Dropbox for when I need it!

[2] http://cython.org/ - although of course actually using cython means you can't take advantage of pypy, IronPython, and other "faster" implementations because you're tied to the cpython C interface forever.


On the contrary, App Engine bears responsibility for the design of db.Model while Python does not.


Can we make function calls cheaper?

From my observations, in pretty much any unoptimized Python (CPython interpreted) code, function calls are nearly always a bottleneck. And speed is directly bound by the number of function calls being performed, not by ponderous data structures.
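
A rough micro-benchmark of that overhead (nothing rigorous; run it as a script so the __main__ import works):

    import timeit

    def add1(x):
        return x + 1

    if __name__ == "__main__":
        # the same arithmetic inline vs. behind a function call
        inline = timeit.timeit("y = x + 1", setup="x = 3",
                               number=10**7)
        called = timeit.timeit("y = add1(3)",
                               setup="from __main__ import add1",
                               number=10**7)
        print(inline, called)   # the call pays frame setup every time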


The ponderousness of these data structures isn't just about memory consumption or having to use boxed numbers. As far as performance goes, PyObjects infect everything in the interpreter. For example, when you're calling a Python function, after a long run-around in ceval, PyObject_Call, the function object's function_call method, you'll finally get back to ceval which creates a frame via a lengthy call to PyFrame_New. The whole process is a mess of allocating, deconstructing, increfing, decrefing, and tag-checking.


The title is very misleading - it should be called "How fast can we make CPython", because it talks about compatibility with PyObject* and stuff.

While I would argue that we can't make interpreted Python particularly fast, the actual topic in question is much harder than that.


While this is an undeniably cool project from the tech side, I think it's usually a better idea to rewrite bottlenecks of the kind that this helps with as a C extension. It's a fairly easy process (MUCH easier than in Java, for example) and only a small amount of code needs to be in C itself, but you can get huge performance increases without changing the Cython environment. I've always thought that one of python/ruby's greatest strengths is the easy C integration.


I think you mean the "CPython environment"? "Cython" is something else.


No mention of the somewhat confusingly named Cython [1]? It addresses the same issues in a different way.

[1] http://en.wikipedia.org/wiki/Cython


Python is typically used to build applications that are IO bound. Squeezing more performance out of the interpreter is not going to translate to any real gains for most Python users these days.


That's circular thinking.

Because Python is slow, Python is not used in scenarios where speed is crucial. That much is true.

However, if Python was faster, it would be used in those scenarios, so more people would be using it for speed critical code so it would provide real gains for a great many Python programmers.

This is exactly what happened with JavaScript: before V8 JavaScript was in exactly the same position as Python. Not many people were writing large programs in JavaScript because JavaScript was too slow. V8 sped up JavaScript 10x+ and people started writing much larger apps that do require that speed. If JavaScript speed suddenly dropped to pre-V8 speeds, we would all find the most popular web apps unusably slow.


It's probably worth noting that Google already attempted a V8-like transformation for Python with Unladen Swallow, and that that attempt mostly failed. Perhaps it was just prioritized differently, and that's why V8 was successful and Unladen Swallow wasn't.


I'm curious, what leads you to the conclusion that 'Python is typically used to build applications that are IO bound'? Also, what exactly are you referring to when you say 'IO bound'?


Python is commonly used in web services, where retrieving information from a database and transmitting the results via slow network connections are what takes up the most time, not processing the data in between with Python.


I don't buy that. I've had enough people relate Python's roots in the sysadmin tool belt and its application in processing large data sets to believe that its intended use case could be that limited.


And you shouldn't :)

Python is used just about everywhere for just about everything (sometimes properly, sometimes poorly). While there are a lot of web sites that run Python, there are also countless other applications that use it that are unrelated to the web. See Scipy as an example.

That said, the earlier poster mentioning that many people are using Python for IO bound processes is not too far a stretch. Why else would Twisted Python exist, and why would the new Tulip async IO stuff be developed?


I'm not denying that IO is important to a certain class of Python programs too; but I still can't see what leads to the conclusion that IO bound programs are the primary use case for Python. There are enough evented and async io libraries being built for just about every platform right now, that doesn't make IO the center of any of them.


And then there is the GIL. So if you want to squeeze more performance out of your CPU-bound task, you need to use the multiprocessing module, because Python threading does not work well for CPU-bound tasks.
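
A minimal sketch of the multiprocessing route for a CPU-bound job (cpu_heavy is just a stand-in for real work):

    from multiprocessing import Pool

    def cpu_heavy(n):
        # stand-in for real CPU-bound work
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with Pool() as pool:                       # one worker per core
            results = pool.map(cpu_heavy, [10**6] * 8)
        print(sum(results))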


Since you are talking about CPU-bound tasks, it is relevant that numpy array operations release the GIL.


True, you would have to write or use a Python extension written in C or C++.


It's also used in CPU intensive applications, but the "standard" numerical libraries (Numpy/SciPy) aren't distributed with the language standard libraries. However, they're FOSS, available pretty much anywhere Python itself is, and anyone doing numerical work with Python will almost certainly use those libraries. They're not distributed with CPython itself in order to avoid being tied to a slower release schedule.


I'm sure this will come as a surprise to everyone using Python's rather well developed scientific computing stack.


You mean the one that's implemented mostly in FORTRAN, with Python as a mere coordinating layer on top?

(And don't get me wrong, it's an effective approach that plays to the strengths of both languages. But it's not doing "heavy lifting" in python)


Why should language choices be guided by the bizarre artificial scenario where other languages do not exist?


What? I never said anything of the sort.


Interesting approach. One criticism: The paper mentions that compile times max out at 1.1ms "for the most complex function" in the benchmark (AES), and therefore it is sufficient to just compile everything. However, those benchmarks seem too small to justify that conclusion.


The answer is very simple:

1) Know what ought to be done - do it and send the patches.

2) Need "speed" - write that part in C.)


    2) Need "speed" - write that part in C.)
That's not so easy. Interfacing Python and C code is also incredibly hard, and no one true way exists.


>That's not so easy. Interfacing Python and C code is also incredibly hard, and no one true way exists.

Can you elaborate on this? I've worked on Python C extensions (just minor updates and fixes, I've never been the one to write significant chunks of one), and it seems like interfacing Python with C is pretty straightforward.


If you use the CPython C API it's easy - but you also bind yourself closely to CPython. If you know performance is going to be important it's probably better to bite the bullet and use PyPy - which means you have to use the somewhat cruder cffi to interface with C code.


I've done it numerous times over the last several years and while it's probably not trivial, it's definitely not hard. If you're looking for a "one true way", it's Python's C API.
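
For what it's worth, the stdlib ctypes module is one fairly low-friction route that avoids writing any C glue at all (a sketch, calling cos() from the C math library on a Unix-like system):

    import ctypes
    import ctypes.util

    # load libm and declare the signature of cos() before calling it
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]

    print(libm.cos(0.0))   # 1.0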


It might be incredibly hard if you do not know C.


This is bikeshedding, but "The only hard problems in computer science are cache invalidation and naming things" - there's already a quite popular, multi-paradigm programming language named Falcon[1].

1. http://www.falconpl.org/


Sorry, but the two hard problems in programming are actually cache invalidation, naming things and off by one errors.


Haha oh my god Terry A. Davis commented on it!

SQUAWK SQUAWK SQUAWK



