Pyston v2: Faster Python (pyston.org)
262 points by kmod on Oct 28, 2020 | 206 comments



I'm not versed in the details of the politics of CPython, but why did this project fork instead of just contributing to CPython? Is CPython really slow in integrating community contributions?

Edit: I read the blog post closer, and found this: "Our plan is to open-source the code in the future, but since compiler projects are expensive and we no longer have benevolent corporate sponsorship, it is currently closed-source while we iron out our business model."


I would say there's a small minority of our changes that we can upstream, and eventually we'd like to upstream them. For the rest, I would say the different set of priorities for the two projects means they probably will stay separate:

- In the blog post I linked to an issue of someone trying to upstream a quickening implementation and meeting resistance due to the added complexity

- CPython prefers portability over performance, and we added a number of big build dependencies that may not work on all the platforms that CPython supports, though from our perspective it works on all the important ones

- CPython has included a number of performance-degrading features over the years and we plan to start cutting out some that are only used for debugging but still hurt release mode. I don't know for sure but I would expect resistance to backing out those changes.

There are more, but the general idea is that the two projects will make different tradeoffs. Maybe there is a world where the code all lives in the same repository but is gated between different behaviors, but the CPython maintainers have already rejected a proposal like that.


As far as I know CPython is quite welcoming.

However, per the blog post, this version of Pyston is closed source. So CPython won't be interested, and many others won't be interested either.


Python is a welcoming community, but they are decidedly not welcoming of contributions that significantly increase the complexity of the CPython reference implementation.

This has been discussed extensively: https://news.ycombinator.com/item?id=11125769


This is why I see little hope for Python, which is to say that while I'm sure it will continue to have a large following for many years a la C, C++, etc, I don't have hope for it being an exciting language or one that is particularly productive. Python already has performance and packaging problems which don't seem to be easily divorced from CPython, since virtually the whole reference implementation is depended upon directly by much of the ecosystem due to the sprawling C-extension interface.

Pypy has done yeoman's work in improving performance while maintaining compatibility with an impressive amount of the ecosystem, and even still there are many important packages which aren't compatible with Pypy and for which Pypy-compatible analogs don't exist (or aren't supported/maintained). For example, the only Pypy-compatible Postgres drivers were unsupported last I checked.

Moreover, the Python community (or at least its leadership) seems to have very little energy around tackling these longstanding problems. Meanwhile, there are many other languages which are not only performant, but which are rapidly encroaching on Python's historically unique(ish) "easiness" (in the sense that Python is considered "easy", which is to say for people who don't have to manage build/deploy/packaging/etc or otherwise have performance issues). Further, many of these languages continue to improve at a remarkable pace, while Python is content to rest on the laurels of its scientific computing mindshare--and given the rather poor nature of the numeric computing package APIs and their somewhat low performance ceiling, I don't expect Python to be so dominant in this domain in another 5-10 years, especially as more companies need to figure out how to productionize scientific workloads.


A different perspective is that Python leadership has been overwhelmed addressing the concerns of the enormous and growing Python community, for whom generally performance is not yet the primary concern - believe it or not. It's probably fair to suggest PSF has stumbled in executing some of their goals, most notably and publicly the transition to v3, but overall it seems like the general Python community is most interested in language and ecosystem features, while performance-critical workloads are already being addressed in a number of projects (not just PyPy, but Numba, Cython, and others).

Data science may be at the heart of Python's strengths, but it's simply not accurate to suggest Python (whoever that is!) is resting on its laurels, as evidenced by the very active, albeit sprawling, package ecosystem. And the CPython devs seem to have found their stride in the 3.x releases.

It's also inaccurate to suggest numeric computing in Python has a "low performance ceiling," considering you can get near-C performance via JIT or AOT using a package like Numba, and in most cases it's not even necessary because of Numpy and the many other highly optimized compute packages that can do most of the heavy lifting.

I think the main draw of Python is not just that the syntax and language features are approachable, but that the package ecosystem is so broad and active, you are likely to make lighter work of the same job done in another language. I think to displace Python you would have to displace the package ecosystem, which seems as big and broad as it's ever been.


I agree with your overall characterization. As someone who works a lot with Python and is very familiar with its package ecosystem, I find the lack of leadership from PSF to be discouraging and upsetting. The past few Python releases have what I would characterize as cosmetic improvements while repeatedly missing opportunities to improve interpreter, packaging, and interface fundamentals. The position that CPython has to be kept simple as a reference implementation is untenable in the absence of an active collaboration or effort to produce a performance oriented implementation.

As far as I'm concerned, CPython as an interpreter technology has not advanced in the past decade - and Python has grown thanks to the data science community efforts and excellent library ecosystem you mention, not due to PSF. PSF can only miss so many opportunities before something comprehensively better starts to eclipse it.


I do get the impression that position is not the overwhelming consensus among CPython core devs (nor does it really make sense, given the huge dependency of many packages on the C API), whether or not it is what PSF may have publicly communicated. There are a few glimmers that interpreter improvements are on the horizon, with some proposals like subinterpreters and GIL mitigation getting a lot of attention, all of which (to my knowledge) are necessary prerequisites for serious, bold performance improvements.

I agree performance is important, and I think we have reason to be optimistic, but with the understanding that that level of improvement, even if currently underway, will just take a lot of effort and time. Meanwhile, as someone fairly new to Python and working on several performance critical pieces, I've been pretty impressed with what you can do with the current compute packages (after taking months to work my way through most of them).


> It's probably fair to suggest PSF has stumbled in executing some of their goals, most notably and publicly the transition to v3

In a recent post linked on HN, Steve Yegge basically nailed it:

> How much new software was written in something other than Python, which might have been written in Python if Guido hadn’t burned everyone’s house down? It’s hard to say, but I can tell you, it hasn’t been good for Python. It’s a huge mess and everyone is miserable.

Note to future language maintainers: don't burn everyone's house down.


I did RTFA, but I haven't been on the Python scene long enough to have a truly educated opinion about the transition to v3. It's not unusual for a language to make breaking changes in early versions. What happens when you realize you need to do it late in the game? Tough call. I think it's probably impossible to know whether the cost of breaking changes in the short term (people leaving Python for Go, etc.) was justified relative to the cost of a crufty and inadequate Python in the future (people leaving Python for Go, etc.). I'm generally sympathetic, but I'm glad I didn't get into Python until the transition was well underway.


Link?



Not just the package ecosystem, but you also have to have a large number of developers and jobs, so you can do hiring. Developing a project in a non-mainstream language means more difficult hiring.


> Meanwhile, there are many other languages which are not only performant, but which are rapidly encroaching on Python's historically unique(ish) "easiness"

What are those languages? I may have a blindspot, but the languages that get enough buzz for me to notice are either not competing with Python in important dimensions (e.g. Rust) or have a narrower focus (e.g. Julia). Elixir maybe? JavaScript and its derivatives? But these lack the scientific programming ecosystem that's helped drive Python recently.

I have no doubt that languages exist that might fit the bill of being both as easy as and more performant than Python -- there are a lot of languages out there. But I'm unaware of any whose mindshare has been sufficiently growing that it threatens Python in the mid-term.

If I'm off base please let me know. I'd love to find a viable competitor to Python that's strictly better than it.


> I'd love to find a viable competitor to Python that's strictly better than it.

I strongly recommend Go as a better Python. Personally, I think it's easier to write than Python (although people who care very little about correctness will be bothered a bit by the type checker), and the tooling is many times better (single-binary deployments, great dependency management, etc are awesome). Also, the performance is about 100-1000 times better for serial execution, and Go's goroutines allow you to take advantage of multiple cores much more easily than with Python.

JavaScript and TypeScript are similarly easy-to-use, performant languages with a better-than-Python tooling story. I've also heard similar things about Elixir, Clojure, and Kotlin.


Go and Python have pretty minimal overlap IMO. If you are using Python for anything other than a server or CLI, Golang is not a very good replacement for Python.

Some of Python's strengths that I work with regularly are dynamism, easy data exploration, visualizations, succinct & customizable syntax, extremely strong data science libraries, REPL/Jupyter, C bindings, an easy-to-use packaging solution via PyPI (bet some people are going to disagree with that one), and quick-to-prototype code. If you really need performance in Python, you can probably get it out of the box (if you can run deep learning with Python, performance isn't a limitation).

I love Go and use it for a bunch of projects, but I've only once wanted to move a project from Python to Go and that was a performance-centric CLI that was only written in Python originally because of how quickly it let us prototype in comparison to Go.


Julia covers your use cases and is overwhelmingly fast.


My gripe with Julia etc. as replacements, is that Python is duct tape. I don't need fast duct tape, I need duct tape that is understood and used by essentially everyone I work with, and that has native, fast handling of large amounts of data (NumPy, Pandas). Good user experience as duct tape.

From my perspective Julia is sacrificing some amount of "duct tape UX" to gain speed, and that's the wrong direction.

Whenever we need more speed, we just pull the slow bits down into compiled languages, and scale them out to many cores with solutions like MPI.

R is another language that is mainly duct tape for stringing pieces of compiled code together. If it had a better UX for developers than Python, I think we would see it dominating much more today, without having any speed advantage.


I agree with you right now, but think that Julia will end up dominating over the long-term, because dropping down into another language absolutely sucks for data scientists without engineering support.

Interestingly, R is probably a better UX for statisticians/data scientists than Python is (almost all the good parts of Numpy/Pandas were in R first), but it really suffers from not being well known by developers.

To be fair to R though, it's much, much easier to deploy than Python, which is a shocking indictment of the current Python packaging ecosystem.


> Whenever we need more speed, we just pull the slow bits down into compiled languages, and scale them out to many cores with solutions like MPI.

This only works sometimes--for problems that allow you to do a relatively large amount of computation in the compiled language to justify the cost of marshalling Python data structures into native data structures. For matrices of scalar values, this works well. For many other problems (consider large graphs of arbitrarily-typed Python objects, or even a dataframe on which you need to invoke a Python callback on each element), it does not. If you rewrite a big enough piece of your Python codebase in the compiled language, then it will work, but now you're maintaining a significant C/C++/etc code base and the bindings and the build/packaging system that knows how to integrate the two on all of your target platforms. Python really doesn't have a good answer for these kinds of problems, and these are by far the more common case (though perhaps not more common in data science specifically).


> From my perspective Julia is sacrificing some amount of "duct tape UX" to gain speed, and that's the wrong direction.

What particular language features of Julia make that trade-off? (Not a rhetorical question, I'm not disagreeing with you, just curious; I'm familiar with Python, not really familiar with Julia.)


Probably the usual complaints about package load and compilation time.

These aren't fundamental issues with the language and will be solved by tiered compilation (there's already an interpreter mode, just have to integrate that with normal use), separate compilation and incremental sysimage creation which works with the package manager.


Possibly - I'm keeping my eye on it. But I can't stand the matlab-esque syntax.


I'll be honest, as someone whose non-Python programming (paltry as it is) is mostly done in statically-typed functional languages, I have a bit of a bias against Go for the whole generics thing. I'll give it a closer look at some point.


I gotta be honest, the lack of generics has never bothered me.

The type-assertion escape hatch has always been largely sufficient for all but the most performance-critical projects (of which I have exactly 1, and it was a side project), and runtime panics due to failed assertions are quite easy to eliminate via wrapper types.

My advice: just go for it. You almost certainly won't miss generics.

(Edit: I've broken my own rule and given advice without first asking what kind of programming you do. My assertion holds for most run-of-the-mill stuff, e.g. writing REST interfaces, network servers, etc. If you're within 2-sigma of the industry, you won't miss generics.)


I understand; lots of people have this sentiment. It’s an inconvenience in many cases, but we’re comparing it against Python, which has no type safety at all much less generics (apart from Mypy, which has many, many other issues).


All dynamic languages are generic by default.


If that's how you define "generic", then Go is also generic by virtue of `interface{}`.


No, because interface{} is an empty type that needs to be type cast to the actual type before use.

Dynamic languages do that implicitly.

Implement max in Go with interface{} without casts and reflection:

    def max(a, b):
        if a >= b:
            return a
        else:
            return b


I'm not aware of any definition of 'generic' that prohibits casting. It certainly seems like a very arbitrary condition. Basically I'm familiar with two definitions:

1. The abstract idea of writing an algorithm that supports a variety of types. This allows for casting, reflection, dynamic typing, etc.

2. The specific idea of a type system that allows for parameterized types (aka "typesafe generics"). This definition excludes casting, reflection, and dynamic typing.

Typically "typesafe generics" is what people talk about when they discuss "generics", but since you chose to pick the "dynamically typed languages are generic" nit, I assumed you were talking about (1).
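
To make the two definitions concrete, here is a minimal Python sketch. The names `max_dynamic`, `max_typed`, and `T` are illustrative and not from the thread. The second function only gains its guarantee when an external checker such as Mypy is run, because Python ignores the annotations at runtime.

```python
from typing import TypeVar


# Definition 1: one algorithm, many types, resolved dynamically at runtime.
def max_dynamic(a, b):
    return a if a >= b else b


# Definition 2: a parameterized type. T is constrained so that a static
# checker can reject mixed calls such as max_typed(1, "x") before running.
T = TypeVar("T", int, float, str)


def max_typed(a: T, b: T) -> T:
    return a if a >= b else b


print(max_dynamic(3, 7))      # ints
print(max_dynamic("a", "b"))  # strings, same code path
print(max_typed(2.5, 1.5))    # annotations are not enforced at runtime
```

Under definition (1) a mixed call like `max_dynamic(1, "x")` is only caught when it runs (Python 3 raises `TypeError` on the comparison); under definition (2) the intent is for a checker to flag it ahead of time.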


> single-binary deployments

More than performance (pure Python can be too slow, but NumPy is usually fast enough), this is what I miss when using Python.

There's currently no good solution for bundling Python code and the interpreter into a single binary. PyInstaller works, but the resulting binaries write lots of files to `/tmp` every time they run, which is a hack that leads to long startup times. PyOxidizer avoids this, but includes the entire standard library in every binary, making them too large by an order of magnitude.


What's Go like for REPL-driven / exploratory development? That's mostly how I use Python.


Go has nothing on Python in this regard. I write Go every day, and come from a Python background. I often describe Go as the strongly-typed, more performant version of Python. I say this mostly because my Go code isn't too dissimilar from my Python code (structure, naming, packages). But I still drop into Python if I want to do something quickly. I don't really know why. Maybe it's the Go tooling, e.g. unused variables cause compilation errors, so things like this slow me down. Or maybe it's because Python just offers _so_ much out of the box, e.g. all the data structures you'll ever need (list, set, dict, tuples), and all small things you take for granted (like writing "is a in list", which would require a function in Go). The REPL is Python's killer feature.


> I often describe Go as the strongly-typed, more performant version of Python.

I've heard Go described this way several times, but I've found it to be a significantly lower-level language than Python.

For example, it's much more verbose. In this recent blog post [0], the author converts some C++ code to Go - and it gets longer. 57 lines of C++ become 65 lines of Go. The same code in Python is about 20 lines.

In particular, the author's Go code requires five lines to do the equivalent of Python's `with open(path) as f` and four lines for the equivalent of `word_array = list(word_counts.items())`:

    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    // [...]

    wordArray := make([]WordCount, 0, len(wordCounts))
    for word, count := range wordCounts {
        wordArray = append(wordArray, WordCount{word: word, count: count})
    }
[0] http://jmoiron.net/blog/cpp-deserves-its-bad-reputation/
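
For comparison, a rough Python sketch of the same two steps. This is not the blog author's code: it assumes the program counts words in a file, `count_words` is a hypothetical name, and `word_counts` / `word_array` simply mirror the identifiers quoted above.

```python
from collections import Counter


def count_words(path):
    """Return (word, count) pairs for the file at path, most frequent first."""
    word_counts = Counter()
    # The context manager closes the file, playing the role of
    # Go's `defer f.Close()`; a failed open raises an exception
    # instead of returning an err value.
    with open(path) as f:
        for line in f:
            word_counts.update(line.split())
    # One expression replaces the make/append loop in the Go snippet.
    word_array = list(word_counts.items())
    word_array.sort(key=lambda pair: pair[1], reverse=True)
    return word_array
```

Calling `count_words("some_file.txt")` returns a list of tuples; the error handling that Go spells out line by line is implicit here in exception propagation, which is part of where the line-count difference comes from.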


Exactly my opinion! I have been writing Python for a living for more than 13 years, and I find Go awfully verbose. It makes simple things feel like a chore. A three-to-one ratio of lines of code sounds about right. That's not a trade-off I can make, no matter how big the performance improvements.


> That's not a trade-off I can make, no matter how big the performance improvements.

This is crazy. Many of those lines are closing brackets or whitespace. But moreover, optimizing for characters or LOC is absurd. Optimize for maintainability or readability, at which point Go is at least as good as Python (I would argue better). Optimize for tooling, especially package management and build tooling--Go is many times better than Python here. Optimize for performance--Go is literally hundreds or thousands of times better here. Optimize for breadth and quality of ecosystem. Optimize for deployment story (single small artifact vs hundreds of megabytes of dependencies). These are the things that matter, not lines of code.


I think Julia is strictly better than Python, both for data oriented applications and for web development.


Julia might be. It's certainly higher performance from what I understand, and it offers some syntactic flexibility that I miss from R when using Python (Python could never have a fully complete dplyr). It seems from afar to be rather more complex than those languages, though.


I actually think that from a programming perspective, Julia is much closer to R than Python, because of the focus on generic functions rather than objects as a means of abstractions.

It's still a little too wild-west for my tastes right now, but I want it to improve as I think it's got real potential.


I agree that packaging and dependency resolution can be a total nightmare at times. Anaconda helps to an extent, but it's far from ideal especially compared to the default options for other languages.

The main problem I have with Python is its maintainability, especially when you have multiple developers working in the same code base. I would never choose a dynamically typed language again to build something that is going to be more than several thousand lines of code, especially because of how it cripples your IDE which is essential for helping junior developers understand the existing codebase. Type hints help to an extent, but they're just a small bandaid over an oozing sore. Languages like Go are significantly better if you're going to build large codebases that need to survive for a long period of time.


100% agree. To a very real extent, Go probably only exists at all because Google poured an infinite amount of effort into Unladen Swallow--a project designed to remove the GIL from Python 2 to make it more usable for concurrent programs and add an LLVM-based JIT compiler to improve performance--and Python's response was to not only not merge it, but to break the entire Python ecosystem for a decade by forking the language... while somehow leaving "concurrency" and "performance" off the list of goals while making their backwards incompatible change (which of course also destroyed all the work on Unladen Swallow); hell: early versions of Python 3 were benchmarking universally slower than Python 2 :/. Google seems to have "gotten the hint", and around the same time (2009) hired Rob Pike to do Go, and has since "moved on" from Python and taken a massive part of the ecosystem of users Python used to have with it (the rest having bailed for node.js; remember when Python was the future of web development, competing with Ruby thanks to Django? those were the days): people still use Python, but it is now an entirely different crop of data scientists and AI people... all of whom are starting to run into the performance issues in even their glue layer, and so if the Swift tensorflow efforts ever work out, Python is done.


Google did not hire Rob Pike with Go in mind. Go did not even come around as an idea until Pike had worked at Google for a while. [1]

Don't forget Rob Pike created Sawzall at Google first (2005). [2]

[1] https://golang.org/doc/faq#history [2] https://research.google/pubs/pub61/


The early Go compiler reused the code generator from Limbo. To the point that many comments referred to Limbo. So while Rob Pike might not have joined Google to work on Go or had Go in mind, the idea and direction for Go was probably preordained.

It doesn't really support or attack your claims. It's just a thing I wanted to share.


Small mistake: there are references to Inferno, the operating system in which Limbo was a key component. The broad idea is the same: Go has a clear heritage in systems that Rob Pike already worked on.


Unladen Swallow was only an internship project.

http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospect...


Didn’t the original Go creators mention on multiple occasions that they intended to build a replacement for C, and only accidentally produced an application language? Assuming that’s true, Go would have happened regardless. Google couldn’t have pulled out of Unladen Swallow before Go was usable, since they could not have known it was actually an alternative at the time.


C++ actually, but most of us did not care.

https://commandcenter.blogspot.com/2012/06/less-is-exponenti...

If this version of generics finally makes it, then I care; otherwise only when dealing with the Docker and Kubernetes ecosystem.


> Google poured an infinite amount of effort into Unladen Swallow

Google most certainly did not.


The glue-layer speedups are really bothering me. I absolutely fail to understand how a data-science-heavy language in 2020 can have such convoluted parallel processing. Dask, Numba, and joblib are all unfinished projects and absolutely not headache-free. CuPy works great, but it is not multi-GPU capable as of today.

And anytime you point this out, people will trot out a toy problem in Cython and try to prove you wrong, which has zero relevance to real world issues. Hell, Cython cannot even compile something as basic as Numpy FFTs, something that is absolutely critical in signal processing.


There's so little truth in this it is hard to know where to start.


As other people have pointed out, almost none of this comment is accurate:

Google put some effort, not that much, into Unladen Swallow, and the people who worked on it admitted that it didn't achieve the goals they set. The Python community was well on their way to creating Python 3 before work started on Unladen Swallow (so it was not in any way "Python's response"). As other comments pointed out, Google did not hire Rob Pike to "do" Go. Go was initially a side project of a few people and had nothing whatsoever to do with Python. The Go team was surprised when Python users started migrating to it. Google never used Python for web development, with one major exception, Youtube, an acquisition. I also don't understand how Google has "moved on" from Python (this seems unlikely) and how what they use Python for internally has anything at all to do with the fate of Django and Python webdev.

I don't mean to pile on, but I kinda have to ask, what were you thinking when you wrote this comment? Do you believe all those statements, or were you taking huge liberties with facts and making guesses to tell a story that supports a statement you wanted to make? You seem like a valued member of the community and this kinda makes me distrust the accuracy of everything else you write. (And to be honest, makes me distrust more of what I read in general, which is probably good but sad.)

A couple references for history/dates:

https://en.wikipedia.org/wiki/CPython#Unladen_Swallow https://www.python.org/dev/peps/pep-3000/


Huh, I was wondering why Unladen Swallow faded rapidly since 2009 and why Python seems secondary at Google compared to Java and Go.

Silence and distance seem to be a common echo of failed projects.


I'd be curious which languages you feel are closing the gap, and which gaps are getting closer by them.

For example, I do a bit of Swift in addition to Python, and sometimes I've heard people try to compare the two. But I vastly prefer Python to Swift when I can afford to.


Scipy has a poor performance ceiling? Numpy has a poor API? Compared to what? Eigen? Whatever the Scala guys use? That sounds kind of silly to me, especially when hardly anyone is actually CPU-bound, anyway.


WRT performance ceiling, I'm mostly talking about things like Pandas which eagerly evaluate and which aren't amenable to a parallel execution model (multiple threads operating on the same data frame with minimal contention).

WRT poor APIs, I'm talking about things like matplotlib or pandas or etc that take a whole slew of arguments and try to guess the caller's intent by inspecting the types of the arguments. The referent isn't "some other scientific computing API" (although I'm sure there are some sane scientific computing APIs), but rather "other APIs in general" since there's nothing inherent to any particular domain that demands this kind of 'magical' API.
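
A toy illustration of what I mean, in plain Python. This is not matplotlib's or pandas' actual implementation; `plot_guessing` and `plot_explicit` are made-up names, and the "plot" just returns a dict so the behaviour is easy to inspect.

```python
def plot_guessing(*args):
    """Hypothetical API that infers meaning from argument count and types."""
    if len(args) == 1:
        # One positional argument: assume it is y, invent x.
        return {"x": list(range(len(args[0]))), "y": args[0], "fmt": None}
    if len(args) == 2 and isinstance(args[1], str):
        # Second argument is a string: treat it as a format, not as data.
        return {"x": list(range(len(args[0]))), "y": args[0], "fmt": args[1]}
    if len(args) == 2:
        # Two non-string arguments: assume (x, y).
        return {"x": args[0], "y": args[1], "fmt": None}
    raise TypeError("unsupported call signature")


def plot_explicit(y, *, x=None, fmt=None):
    """Same behaviour, but the caller states intent with keywords."""
    if x is None:
        x = list(range(len(y)))
    return {"x": x, "y": y, "fmt": fmt}


print(plot_guessing([4, 5, 6], "ro"))
print(plot_explicit([4, 5, 6], fmt="ro"))
```

In the first style, the meaning of the second positional argument flips depending on whether it happens to be a string; in the second, the call site says what each value is.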

WRT 'hardly anyone is CPU bound'--the context is numeric computing; what are people bound by if not CPU? I've seen several projects where web endpoints were timing out while grinding in Pandas, largely because there weren't good options for taking advantage of multiple processors. Based on prototypes I did, I'm confident that other languages could serve those requests in single-digit seconds if not sub-second.


You're getting some pushback, but I tend to agree with you on matplotlib and pandas. Great libraries are designed so that you can get a feel for them and -- with practice -- use them intuitively. Even after years of (admittedly light) use I still find pandas' multi-indexes confusing, and I always have to look up the best of myriad ways to do something in matplotlib. In comparison, R's dplyr and ggplot have stuck with me even ages after giving up day-to-day use of R.


Pandas is really similar to base-R, which accounts for much of the weirdness (but at least R can claim to be copying a language developed around the same time as C).


So who does it right? If all these APIs suck compared to an imaginary perfect library, then that isn’t a useful comparison.

Also, if an endpoint is spending minutes to respond, then I would think actually profiling the application would be a good start. Maybe researching prior art in the problem domain would be good too. If nobody can be bothered to explore the several solutions to distributing pandas computations over multiple cores, like Dask, and get the NPV of just buying more or faster cores, then “Python sucks” isn’t your problem.


That’s quite a rant with a lot of assumptions. Just about every library has a better API than matplotlib or pandas. Requests has a pretty good API IMO. The team who was responsible for the slow endpoint did investigate dask and alternatives, and they probably will end up on something like spark because they didn’t feel like they have better options. Maybe our team is just stupid and Python isn’t for mere mortals, I don’t know, but I do know that these problems don’t exist in other languages.


I would certainly hope a minimal HTTP library would be simpler than a suite of functions to manipulate and plot tabular data.

“My application is slow, the language sucks!” doesn’t indicate a very serious investigation into the problem.


> I would certainly hope a minimal HTTP library would be simpler than a suite of functions to manipulate and plot tabular data.

HTTP is pretty complex, but that's neither here nor there. The relevant bit is that there is no domain for which guessing caller intent based on reflection over argument types is appropriate.

> “My application is slow, the language sucks!” doesn’t indicate a very serious investigation into the problem.

I was pretty explicit above and elsewhere in this thread about why Python's performance is miserable; I'm not sure why you would invoke such a poorly constructed straw man when everyone can look upthread and see my actual arguments.


IMO pandas' lenient inputs are a godsend when you are working with real-world, dirty data regularly. It's my favorite API I've ever worked with because it lets me focus on my high-level tasks and it takes care of the things I don't really care about like whether I am working with a list of dicts or a dict of lists or whatever.

But once you've done the cleaning/exploration, you should move any heavy computing to a high-performance library like numpy.


> there's nothing inherent to any particular domain that demands this kind of 'magical' API

Plotting seems to tend towards magic because plots are basically art, with all the desire for aesthetic customization that applies, and it's a very common task so users also want brevity (magic). The result is a plot() function with a gazillion options hidden behind keyword arguments.

I agree that matplotlib has a sprawling interface, and this can be annoying, but I'm still not sure what "guess the caller's intent by inspecting the types of the arguments" means. Sure, the functions have multiple call signatures, but that's not exactly unusual in libraries or languages. I don't understand the context that brings guesswork into the picture. Skimming the manual—are you using the data keyword argument and hitting the `plot('n', 'o', data=obj)` ambiguity [0]? Or calling plot through `pyplot.plot` &c. (which rely on state) instead of `Axes.plot` &c.?

Asking because if there's an interface trap I'm unaware of I'd like to learn about it before walking into it blindly.

Pandas I sort of agree with; I personally find it harder to remember how to use pandas than dplyr, despite using pandas more often and spending more time reading the pandas documentation. I also find it inconvenient to represent missing values in Pandas (`None` and `NaN` are overloaded, and `None` forces the `object` dtype). But maybe the problem is on my end.

[0] https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.p...


"Based on prototypes I [spent a limited amount of time on and didn't research better methods], I'm confident..."


In the context of pandas, 3 GB of (raw, uncompressed) data could easily require 30 GB of RAM, and that kind of overhead adds up quickly.


Pandas is not some mysterious black box. If you need predictable runtime performance or bounded memory usage, you have to figure it out. Pandas doesn't inherently have a staggering or unpredictable amount of overhead, given that it's a statistical analysis package. There are ways to mitigate Pandas memory usage (10x is a sign that something has gone very horribly wrong), and sometimes Pandas is simply the wrong tool for the job.


10x reflects both experience and expert recommendations. You may recognize the author [1]:

> Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset

[1] https://wesmckinney.com/blog/apache-arrow-pandas-internals/


I don't doubt Wes' upper bound for Pandas OOTB, without optimization. The context was web applications. If you're seeing 10x on a web app, either something is wrong or you probably shouldn't be using Pandas.


I think most of us use Pandas for data exploration and one-offs, so it is completely reasonable to discuss what our likely use of RAM is going to be in this circumstance.

Web application using Pandas and "highly optimized web application" would seem to be nearly disjoint sets...


matplotlib and pandas were designed with the idea of mimicking interfaces more popular than the project (when they were first conceived). The "easy" interface is a large part of why those projects are now more popular than their inspirations.


very true; I found matplotlib very appealing because I didn't have to relearn anything coming from matlab


Ah, that makes sense (I suppose I should have guessed from the name). I never understood matplotlib's popularity, but if it's a matlab clone that makes way more sense.


many scientific computing applications are considered to be bounded by io


rather famously, one needs an intense operation like matrix multiplication to get cpu bound (an operation that has enough arithmetic operations per data element for I/O to not dominate).
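
A rough way to see this is to count arithmetic operations per element moved. The sketch below assumes the textbook O(n^3) matrix product and that each matrix is read or written exactly once; the function names are made up for illustration, and real hardware with caches is messier.

```python
def matmul_intensity(n: int) -> float:
    """Arithmetic operations per element moved for an n x n matrix product.

    Assumes the textbook algorithm: n multiplies and n adds per output
    element (2 * n**3 operations in total), with each of the three
    n x n matrices read or written exactly once (3 * n**2 elements).
    """
    return (2 * n**3) / (3 * n**2)


def vector_add_intensity(n: int) -> float:
    """Arithmetic operations per element moved for c = a + b on length-n vectors.

    One add per output element (n operations) against 3 * n elements moved.
    """
    return n / (3 * n)


# Intensity grows with n for the matrix product but stays flat for the add.
for n in (10, 1_000, 100_000):
    print(n, round(matmul_intensity(n), 2), round(vector_add_intensity(n), 2))
```

Under these assumptions the matrix product does about 2n/3 operations per element while the vector add stays at 1/3 regardless of size, which is why the latter tends to be limited by data movement.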


Numpy has a very poor API compared to Julia, Matlab, Mathematica, R. That’s just me comparing to the ones I know. It’s a mishmash of methods and functions, in-place operations and non-modifying operations, confusing indexing and broadcasting API. There are much better things available for array manipulation.


For all of numpy's faults I find it odd that you choose its indexing and broadcasting API as the thing that's confusing and worse than equivalents in Mathematica and R. I can't speak to Julia or Matlab, but I find working with arrays in the other two is like pulling teeth.


I'm curious what the outcome would be if we ran a poll asking what people find easier. I find numpy's API more intuitive, but I could be a minority.

I will say that the differences between python and R syntax are almost trivial; I've taught classes with python and R example scripts that are almost _exactly_ the same, and run in both languages


I like R, but there is nothing elegant or consistent about its standard library. I don’t think you’re making a good-faith argument.


Don’t just randomly accuse people of making bad faith arguments. I’ve used R a lot and while it has its issues, it has a much better interface than numpy.


Well they're different, right?

R provides an interface to dataframes and matrices, whereas numpy is just for matrices (and their generalisations, arrays). I think the appropriate comparison is between base R and Numpy + pandas. (FWIW, I agree with your major point, but then I learned R first so that may be biasing me).


It’s hardly random. R functions could be noun-adjective or adjective-noun or underscored or camelCase or dotted ... and that’s just naming conventions. If someone, like you, says that Numpy sucks, but R is the real masterpiece, then that’s totally insincere. The good thing about R is that it’s free, the C FFI is OK and R-studio is decent. Its API is a patchwork. It’s cool to say Java sucks or Python sucks on the orange site because they’re popular and if you say something popular sucks, well you must be a pretty cool guy who is smarter than all those rubes out there. The arguments are almost always total nonsense, though.


My work is 100% CPU bound, and I am very angry at Python for wasting so much of my time.


It's a shame how the core maintainers see CPython as a "reference implementation", rather than an opportunity to make a world-class programming language implementation. That the CPython maintainers have, time and time again, decided for a "simple implementation" has pushed away many professional VM engineers and researchers who would be more than willing to help maintain a JIT.

I also will say that a lot of the complexity of making a Python JIT is all the weird edge cases and special interpreter functionality. To a VM engineer, CPython is a very bizarre codebase; all the complexity is tucked away in corners other than its C codebase.

Your complexity has to go somewhere.


>It's a shame how the core maintainers see CPython as a "reference implementation"

Ironically this also makes CPython less interesting as a reference language - which keeps PyPy and co insignificant in adoption...


it's _still_ essentially the same switch(opcode) based interpreter it was 20 years ago. no threading, no super instructions, no jitting, nothing.

https://github.com/python/cpython/blob/master/Python/ceval.c...



ha! you're right. so some optimizations did happen while I wasn't looking ;)


> no threading,

CPython does use threads (on Linux, it spawns one pthread per Python thread). However it also has a lock (the infamous GIL) that prevents those threads from interpreting Python code simultaneously.


Not that. https://en.wikipedia.org/wiki/Threaded_code

It's an interpreter implementation technique.
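
A toy sketch of the difference, as an analogy only. The opcodes and program format here are made up, and Python cannot express the real thing: in C, threaded code typically has each handler jump straight to the next one (for example via computed gotos) instead of returning to a central loop.

```python
# Toy stack machine; the opcodes and program format are hypothetical.
PUSH, ADD, MUL, HALT = range(4)


def run_switch(program):
    """Switch-style dispatch: one central loop decodes every opcode."""
    stack, pc = [], 0
    while True:
        op, arg = program[pc]
        pc += 1
        if op == PUSH:
            stack.append(arg)
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")


def op_push(stack, arg):
    stack.append(arg)


def op_add(stack, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


def op_mul(stack, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)


HANDLERS = {PUSH: op_push, ADD: op_add, MUL: op_mul}


def thread(program):
    """Resolve each opcode to its handler once, before execution.

    Only handles straight-line code that ends in HALT, which is dropped.
    """
    return [(HANDLERS[op], arg) for op, arg in program if op != HALT]


def run_threaded(threaded):
    """Execute pre-resolved handlers; no opcode decoding in the loop."""
    stack = []
    for handler, arg in threaded:
        handler(stack, arg)
    return stack.pop()


# (2 + 3) * 4
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None), (HALT, None)]
print(run_switch(program), run_threaded(thread(program)))
```

The second version decodes each opcode once up front, so the hot loop only calls handlers. That is the part of the idea that carries over; the performance characteristics of a real C interpreter do not.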


That post was 4 years ago. Even though I haven't heard explicitly, certainly some of the recent comments from core developers talk about performance techniques that would definitely make things more complicated.

There's new boss(es) over the Python project, and things might be changing in the near future.


I might have to eat those old words as there is some form of runtime code specialisation being considered for inclusion in CPython. It doesn’t sound like a fully-fledged JIT, but that might be a distinction without a difference.


Link?



Thanks - Mark Shannon is a core committer, right? So hopefully he has standing to get this done.

The plan looks very high level at this point, but it looks like Mark is an expert in interpreter VM and JIT technologies. All I can hope for is that he doesn't get blocked by the "keep cpython simple" obstructionism.


He thinks (they think?) the project will need to be funded to the tune of $2M. That seems like a hefty sum to me, though maybe it's doable.


It's not that hefty once they pitch this improvement to the many wealthy VC-funded companies whose business and data science divisions depend on Python.


I read the thread, it's from 2016. But hasn't numba made some big progress in JITting CPython?


Victor Stinner (Python core developer focused on performance) has a great speech about this question:

There are a lot of reasons actually

- Performance limited by old CPython design. If you fork it you have to deal with all the legacy code.

- CPython is limited to 1 thread because of the GIL.

- Specific memory allocators, C structures, reference counting, specific garbage collector etc.

You can find that video in here: https://youtu.be/TXRPCZ7Nmh4


They intend to monetise it.


I feel like anybody really searching for speed is using something other than Python. I don't use Python for speed, but for ease-of-use.

It took me some clicks to see it's supposed to be a drop-in replacement, so that's good.


> I don't use Python for speed, but for ease-of-use.

Right - but if you can get that with better performance that's good isn't it?

I don't get why people object to performance work on languages not intended for performance.


> I don't get why people object to performance work on languages not intended for performance.

You're right, this objection is not relevant in general.

But in the situation we talk about, having 20% more performance requires you to choose a fork that can come with its own limitations. It's definitely not free.

This trade-off makes the point about the language choice relevant.


> I don't get why people object to performance work on languages not intended for performance

My gut feeling is that Python is just not safe or static enough to ever be worth trying to compete with (say) C++. Python sure is easy but I think the asymptotic cost of using it for a big project (in my hands at least) is just not worth it.

I like Python's syntax quite a bit but I feel bad watching people learn to program using it - partly because it's an oddly low-level language (It's closer to being C than Haskell) and there's no compiler to stop you shooting yourself in the foot (If I write something fundamentally unsound I want to know about it now not when the process has been running all day)


As someone who learned on python (many years ago), I think the balance of being allowed to shoot myself in the foot, while only having to learn complexity when complex concepts came up, was a good combination rather than a bad one. Trying to learn C++ and Java before, having to type everything was an impediment. On python, you find out that types still matter when you try to add a string to an integer, for example. The moment the type matters to you is when you need to dig into those details. I'm not sure I would have the career I do if it hadn't been around to be the one language that fit my 16-year old brain.
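
For anyone who hasn't hit it, the string-plus-integer case looks roughly like this (`try_add` is just a helper made up for the demo):

```python
def try_add(a, b):
    """Return a + b, or the TypeError text if the two types do not mix."""
    try:
        return a + b
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_add(1, 2))          # integer addition
print(try_add("1", "2"))      # string concatenation
print(try_add("1", 2))        # rejected at run time instead of being coerced
print(try_add(int("1"), 2))   # works once the conversion is explicit
```

No type annotations are needed for the first two calls; the type only becomes something you have to think about on the third.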


You don't have to compete with C++. If switching the interpreter allows you to save 20% server costs at ~0 development cost that's a no-brainer. Rewriting your stuff in C++ might allow you to save 95% of your server costs, but development cost is a lot higher.


>I don't get why people object to performance work on languages not intended for performance.

It's simple, really. They're concerned about the trade-offs.


If it's in a fork that you can pretend doesn't exist if you don't want it... what is the trade-off?


A fragmented ecosystem, incompatibilities, etc.

And more importantly, your original question isn't asking about forks, specifically. It was asking why someone might oppose performance work.


> incompatibilities, etc

But it's compatible. It's a drop-in replacement.


Serious question: do you have experience with “drop in replacements” for reference implementations? They’re rarely 100% compatible. Especially when the reference implementation isn’t formally specified.

Moreover, you continue to cherry pick, neglecting the other arguments you’ve been presented. It sounds like you’ve already made up your mind about this.


Other arguments are ‘it’s not worth it’ or ‘it’s not likely to succeed’ but nobody is making you do the work!


Hmm no, they’re really not. I’ll leave you to reread the thread.


> I don't get why people object to performance work on languages not intended for performance.

Presumably because they see it as effort that could be better applied elsewhere.

I don't know that I agree, but I can understand the viewpoint.


People object to changes in requirements and implementations that they don't control and don't perceive as being beneficial to their use case.


But it's a fork. And what do you mean by 'requirements'? It's compatible. Forget it exists if you don't want it.


I use python because of Numpy, Scikit and Tensorflow. I don't know of any other languages with libraries as productive as these, so speeding these apps up is a big win for a lot of people.

Also 20% is huge, I look forward to trying it!


20% faster is nothing. You want at least a magnitude faster to justify the cost and risk of switching.

The Python language is already 30 years old, and it wasn’t even the cutting-edge in imperative language design (<koff>Smalltalk, Lisp</koff>) back then. It’s positively antiquated now.

I’ve never understood this tunnel-vision obsession with endlessly chasing ever-diminishing returns. It’s Zawinski's Law of Software by way of Greenspun’s Tenth Rule, and a fundamental failure of courage.

Learn the lessons from both the good and bad of what’s been done before, and move on. A better long-term answer would be to design a much faster, more efficient language for running Numpy, Scikit, and Tensorflow, then port those libraries over to that. If that language turns out to be good for other things too, then great. If not, let a thousand flowers bloom.

There is a much larger learning opportunity here, to get a whole lot better at migrating extant code bases from an old, popular, dead-ended language to a new, upcoming one. But it’s like finding your way to Carnegie Hall: it takes practise, practise, practise.


Interesting that you are down voted. I agree with you that 20% faster is really nothing in comparison to the many new languages. I was saddened by the path Python3 chose. If compatibility was already broken at the time, they should have designed something with performance in mind from the beginning. The V8 javascript engine was there. We knew things could get >10X faster with JIT. Python is too big to die, but it is an inferior programming language in many ways.


“Interesting that you are down voted.”

Doesn’t bother me; I’ve got points to spare.

What downvoters haven’t got, it seems, is any arguments.


> 20% faster is nothing. You want at least a magnitude faster to justify the cost and risk of switching.

Eh. If newer versions of cpython end up 20% faster, eventually most things will end up running on those newer versions. It may take years for almost everything to drift to newer versions, but there's a noticeable performance benefit we all realize over time.

> Learn the lessons from both the good and bad of what’s been done before, and move on. A better long-term answer would be to design a much faster, more efficient language for running Numpy, Scikit, and Tensorflow, then port those libraries over to that. If that language turns out to be good for other things too, then great. If not, let a thousand flowers bloom.

Python's great at running numpy, scikit, and tensorflow. It's not the bottleneck, because these are largely native libraries-- achieving something "much faster" and "more efficient" for this is doubtful.

The benefits of Python are that it A) has a huge ecosystem, B) it's relatively polished and expedient to write code in, and C) there's a ton of people who know it. These are not easy things to create somewhere else, and there's no guarantee that the set of tradeoffs you choose will leave you in a better place afterwards.

The biggest downside of Python is that if you want to get a lot of concurrency or performance for Python code (instead of things like numpy, scikit, and tensorflow), you get pushed into the edges of the Python ecosystem, where you only get a portion of A's advantage (but still largely realize B & C).


I agree. Almost all arguments in favor of Python are about sunk cost. Not much about the actual language is appealing compared to modern languages.


> Almost all arguments in favor of Python are about sunk cost.

I think you are confusing ecosystem and other established advantages with sunk costs, they are different things. It's true that (from the perspective of the people who built them), those advantages are the products of sunk costs, but the argument is about the ongoing value delivered, not the sunk cost involved in delivering it.

> Not much about the actual language is appealing compared to modern languages.

Even if that was true, many of the actual languages competing with Python are less modern by any measure, and in any case so what? Does it matter when choosing a language if an advantage is produced by the abstract design of a language, its ecosystem, or the peculiarities of the available implementations? Advantages are advantages, value is value.

Sure, if you are considering how to promote a “modern” language against Python, it's important to distinguish whether your current barrier is the design of your language or Python’s ecosystem to know how to direct your efforts, but if you aren't a tool evangelist and instead are choosing a language for a project, I don't see that it matters why Python is a net advantage, as long as it is.


“the argument is about the ongoing value delivered”

Which is an admirable sentiment… but the title of this thread is not “Python: still doing useful work” but “Python: now 20% faster”, and being ridiculously self-congratulatory about this when the correct response is to laugh at the silly pointless frivolity of it.

Trying to make Python fast is a fool’s errand, because Python is slow by design.

A useful argument would be that Python is faster overall at solving various real-world problems than current alternatives; but that’s not the popular argument being made, because the population fixates on minutiae instead of overall perspective.


> Which is an admirable sentiment… but the title of this thread is not “Python: still doing useful work” but “Python: now 20% faster”

No, it's actually, “Pyston v2: 20% faster Python". But...so what?

> and being ridiculously self-congratulatory about this when the correct response is to laugh at the silly pointless frivolity of it

For the same reasons Python is often a valuable choice, a faster Python is a valuable option.

> Trying to make Python fast is a fool’s errand, because Python is slow by design.

Python is not slow by design, though it's slow because it's not fast by design. But that doesn't mean it's not useful to have a faster Python, only that there are likely to be limits and trade-offs involved in doing that.

> A useful argument would be that Python is faster overall at solving various real-world problems than current alternatives; but that’s not the popular argument being made

Yes, actually, it is. It's not the message of the Pyston v2 release blog entry, but then that blog entry isn't making an argument in the debate that you seem to want everything to be about.


“Python is not slow by design”

Everything in Python is late-bound, untyped, and mutable by default. Straight off the bat that’s three key design decisions that make for a slow interpreter. Defining numbers as heap objects, implementing structures as hash tables, and poor parallelization (<koff>GIL, dumb POSIX threading</koff>) are three more.
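
A small illustration of a few of those properties on CPython. The names `greet`, `caller`, and `Point` are made up for the demo, and the reported object size is implementation-specific.

```python
import sys


def greet():
    return "hello"


def caller():
    # `greet` is looked up by name each time caller() runs (late binding).
    return greet()


before = caller()


def greet():  # rebinding the name changes what caller() ends up calling
    return "goodbye"


after = caller()


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
p.z = 3          # instances accept new attributes at run time
attrs = vars(p)  # per-instance attributes are stored in a dict

print(before, after)
print(type(attrs).__name__, sorted(attrs))
print(isinstance(1, object), sys.getsizeof(1))  # even a small int is a full object
```

Each of these lookups is something an interpreter has to redo, or prove unnecessary, on every execution, which is where much of the optimization effort goes.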

Sure, you can throw huge amounts of brains and resources at code analysis, opportunistic JIT, and the rest a-la V8, but at some point you have to say “Is this an effective use of those valuable resources?” Especially when every such optimization could potentially break existing, stable user code running in production.

..

But, let’s get back to larger perspective:

Faster overall takes into account not just the time it takes to run a user program, but also the time it takes to learn, implement, debug, and deploy. Only one of these is machine time, which these days is cheap as chips and nearly inexhaustible; all the rest are human labor, which is both expensive and limited.

Which is not to say that Python is faster at all those human tasks than other languages. To determine that would require real-world practical testing, plus a willingness to accept what those tests tell you (which might not be what you wanted to hear). But I’m willing to bet that the time spent on all those manual tasks vastly outweighs the time saved by 20% faster runtime for the vast majority of use cases.

So at some point you have to stop and ask: Are these fundamental changes adding genuine, measurable value for real-world users solving real-world problems? Or is it just code masturbation by basement nerds whose idea of productivity is playing with internal guts in pursuit of some trivial abstract benchmarks?

Because, honestly, “20% faster” is an absolute joke. If I can’t make my program 200% faster just by adding a second hardware box, then I’ll want to know why. And if the answer is no more complex than “because the language isn’t very good at parallelism”, then all other arguments are completely moot.

..

Look, I’ve written slow interpreters. Implemented in Python, no less. A not-very-complex program might take 2 minutes to run. But then, 80% of that painfully long run-time is actually IO-bound operations, and even that is totally irrelevant when those 2 minutes of machine time have replaced 20 minutes of manual work.

That’s 20 minutes of paid human labor, eliminated by a really-slow custom interpreter written in pretty-slow CPython. You can easily put a dollar cost on that human time (salary, etc) and multiply it by the number of work units in a year, and you’ve calculated its real-world benefit.

Let us know when you can calculate the real-world benefit of a 20% quicker proprietary Python-like interpreter that may or may not execute user programs exactly the same as CPython. Otherwise, as I say, anything less than a magnitude’s improvement isn’t even worth getting out of bed for.


>Almost all arguments in favor of Python are about sunk cost. Not much about the actual language is appealing compared to modern languages.

Well, I, for one, use Python because "the actual language is appealing compared to modern languages".


Ah, sunk costs. Where the future goes to die.

And I say this as a 20-year Python user myself, ’cos while it has scratched many itches and continues to do so, I am not the least bit sentimental about it. The best compliment would be to kill it with something far better, that steals all its good parts and replaces the rest.


Well that still goes in the "ease-of-use" bucket in my opinion.

That being said, those libraries are already highly optimized and all the heavy stuff running on C anyways, so making the Python itself faster won't make that much of a difference in those workflows.


Aren't all of those mostly implemented in C or C++ extensions? Not to minimize speeding up the glue code, but I don't know if they will see that large of an improvement.


I'm in the same bucket here. My other ecosystem products are built on top of python, so we make use of pandas, numpy, dask, and our vendors own modules (which have taken them 10+ years to put together with a reasonably full team).


The 20% speed up looked like it was for flask and Django. In their benchmarks PyTorch did not have a speed increase.


Likely because pytorch uses C++ under the hood.


Maybe. One of my projects at work is supporting a legacy ColdFusion website that is nearly 20 years old (well, it was started 20 years ago, but has seen updates since). We'd love to move it to something faster, but it is huge and it would literally take years to rewrite it. Other things take priority so it will probably never be rewritten. I imagine there are Python projects in a similar boat.


Rewriting parts of it will not take years. Most of it can probably be thrown away, replaced by standard components. (only talking from personal experience, sure we had legacy systems running for years).


Unfortunately that wouldn't work with this site. Between the five of us in my group we have nearly 100 years of developer experience. We've talked a lot about if there is any way we could divest of it a little at a time or replace parts of it, but we really can't.


I feel you, but then again, Python applications tend to be deployed anyway, so a speed boost can be welcome for some. Memory usage looks better, too.


>I feel like anybody really searching for speed is using something other than Python.

Well, anybody using Python for all its other benefits would still very much like more speed.


Sure, but there's lots of people already using Python that would probably love some more speed.


It is not about searching for speed. It is about lowering the cost of large Python code base running in production.


Why not make the languages that are easy to use speedy? That way I don't have to search for one or the other.


> A very-low-overhead JIT using DynASM

Interesting. DynASM [1] is the template assembler used in LuaJIT, so it sounds like they might be JIT'ing CPython bytecode. IIRC this is also what the first version of Pyston did. I'm curious how this is working out, both implementation and performance-wise compared to LLVM (used in Pyston v1). That could mean there is a lot of performance still on the table, at least for some kinds of code, but also a big complexity jump to get further gains.

[1] https://luajit.org/dynasm.html


We started with an LLVM-based JIT in Pyston v1. Our experience with the two jits is that it's very nice that DynASM is ~2 orders of magnitude faster, and also that we have not been able to extract enough high-level knowledge to make a powerful JIT like LLVM worth it. We do use LLVM elsewhere in our build process, and we hope to write some future blogposts about all of this.


Wanted to see redistribution rules, and was surprised to see there is no license anywhere for the binaries... The closest thing I found is "copyright" file inside .deb:

    Copyright: 2020 The Pyston Team <support@pyston.org>
    License: Closed source, all rights reserved.

I guess it means no one should be touching the file, as they haven't even granted access to run it.


From the link:

> Our plan is to open-source the code in the future, but since compiler projects are expensive and we no longer have benevolent corporate sponsorship, it is currently closed-source while we iron out our business model.


> I guess it means no one should be touching the file, as they haven't even granted access to run it.

Since when does someone need to explicitly grant you permission to run a program on your own computer?


Running is probably ok, but giving to others? I am not sure.

Those limitations have existed since before the computer era, when copyright law was passed. For example, even if you own a book, there are certain things you cannot do, like duplicate it and sell copies.

In case of software, here is how the law works [0]

> When you make a creative work (which includes code), the work is under exclusive copyright by default. Unless you include a license that specifies otherwise, nobody else can copy, distribute, or modify your work without being at risk of take-downs, shake-downs, or litigation.

There are "fair use" terms, which allow some things without permissions -- but things like "copying the binary to company-internal repo so CI runners can pick it" really need explicit permissions if you want to be above the board.

[0] https://choosealicense.com/no-permission/


Yeah, you won't be able to distribute it, but running it should be fine (unless you want to consider copying it from disk to memory as copying, of course).


Everybody should note that the existence of pyston does not prevent you from making cpython or pypy faster if that's what you want to happen.


I stalked the author's linkedin and notice he has competitive programming experience: https://www.topcoder.com/members/kmod/details/?track=DATA_SC... (and top 15 putnam, ICPC world finals, etc)

I wonder if he would be interested in optimizing for purely algorithmic tasks?

There are a lot of active and successful CPython and PyPy users on https://atcoder.jp/. For example:

https://atcoder.jp/contests/practice2/submissions?f.Task=&f.... (the user "maspy" is rated at 2750 using only cpython!!!)

https://atcoder.jp/contests/practice2/submissions?f.Task=&f.... (though pypy is more practical)

I am linking to atcoder because their testing data is public so you can rerun contestants' solutions using pyston/cpython/pypy for benchmarking purposes: https://www.dropbox.com/sh/arnpe0ef5wds8cv/AAAk_SECQ2Nc6SVGi...
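
A minimal sketch of what such a rerun harness could look like, using only the standard library. The function names, the verdict strings, and whatever interpreter commands you pass in are assumptions for illustration; nothing here comes from AtCoder's tooling, and the wall-clock time includes interpreter startup.

```python
import subprocess
import sys
import time


def time_solution(interpreter, script, stdin_text, timeout=10.0):
    """Run `script` under `interpreter`, feeding it `stdin_text`.

    Returns (stdout, wall-clock seconds). Raises subprocess.TimeoutExpired
    past the limit and subprocess.CalledProcessError on a nonzero exit.
    """
    start = time.perf_counter()
    proc = subprocess.run(
        [interpreter, script],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return proc.stdout, time.perf_counter() - start


def compare(interpreters, script, stdin_text, expected):
    """Map each interpreter command to (verdict, seconds) for one test case."""
    results = {}
    for interpreter in interpreters:
        out, seconds = time_solution(interpreter, script, stdin_text)
        # Compare token by token so trailing whitespace does not matter.
        verdict = "AC" if out.split() == expected.split() else "WA"
        results[interpreter] = (verdict, seconds)
    return results
```

Something like `compare([sys.executable, "pypy3"], "solution.py", "1 2\n", "3\n")` would then give a per-interpreter verdict and time, assuming those commands exist on your PATH and `solution.py` is the submission you downloaded.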

Right now, other than a handful of people who figured out how to make numba's jit work, only pypy is viable for competitive programming. I wonder if you can do better than pypy?

There are also a few red coders on codeforces.com who mostly use pypy (cpython is completely unviable there because numpy and numba are not installed)

https://codeforces.com/submissions/pajenegod

https://codeforces.com/submissions/conqueror_of_tourist

But codeforces' test cases aren't public anyway so it's not as relevant.


All my CodeJam solutions are in Python :)

While we could certainly go in this direction, we're not planning to, because in our experience optimizations for different workloads are largely distinct, and this use case is already handled well by PyPy.


Isn't this use case the scientific computing use case? That's a fairly large part of the ecosystem to give up on!

I think it's still a relatively low effort way (just need to write a scraper) to create a benchmark on a diverse set of algorithmic tasks that have clearcut criteria on AC/TLE/WA. PyPy is often 10x faster than cpython on these problems (and just 2x slower than equivalent C++ solution) so it will be a much nicer headline too if you can achieve similar performances!

Though I can also see how it can be completely irrelevant for server workloads. Pypy's unicode is so slow, some people on codeforces still use pypy2 over pypy3 just to avoid it. And C extensions are so bad on pypy, you can often get better performance on cpython if you need to use numpy.


This is just a comment on my personal use of Python for competitive programming: I've never used numpy for competitive programming or thought that it would be a good tool for that. PyPy seems like a great solution for the highly-numerical algorithms that these contests tend to lead to.

So I would not call this "scientific computing". Personally I consider competitive programming to be its own use case.

And as much as we want to improve scientific computing in Python, it's very hard since the work is done in C. Our current hope is to help mixed workloads, such as doing a decent amount of data-preprocessing in Python before handing off to C code.


"After the project ended, some of us from the team brainstormed how we would do it differently if we were to do it again. In early 2020, enough pieces were in place for us to start a company and work on Pyston full-time."

I didn't know the Pyston team had split off to form their own company! Anyone know who's involved, or if they've raised money for it?


Looks like https://www.linkedin.com/in/kevinmodzelewski/ is the founder, as-of May this year.


It's exciting to see this reborn outside of Dropbox!


20% isn't nothing... but is it really worth switching to a non-standard, closed-source version of the interpreter?


Depends on your scale. If each of your web servers has a hand picked name, definitely not. But if you stopped naming servers a long time ago, and if the pricing structure is favorable, it could mean a huge cost saving without an expensive rewrite.


Electricity costs can far outweigh developer salaries.

In some cases a rewrite may save more money than moving to a 20% faster python.


What does it have to do with naming servers, can you explain? Or do you mean if you have very few servers such that you can name each and every one of them, this wouldn't be worth it?


Pretty much. If you’re spending $1000/month on servers, you’d only save < $200/month and the added complexity probably isn’t worth it. If you’re spending $100000/month on servers, saving ~$20000/month is probably worth it. Where is the tipping point between those two numbers? Depends on context.

Rewriting in a more efficient language might seem even cheaper, but you have to factor in risk and opportunity cost. At some point it does become smarter to rewrite, but your app needs to be pretty simple or your server bill pretty huge before it’s actually the best option.
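To put rough numbers on that tipping point, here is a back-of-the-envelope sketch. It assumes, as the comment above does, that a 20% speedup translates into at most a 20% smaller server bill; `migration_cost` is a hypothetical one-off figure you would have to estimate yourself (testing, deployment changes, licensing).

```python
def monthly_saving(server_bill: float, saving_fraction: float = 0.20) -> float:
    """Upper bound on monthly savings, assuming the bill shrinks by saving_fraction."""
    return server_bill * saving_fraction


def months_to_break_even(server_bill: float,
                         migration_cost: float,
                         saving_fraction: float = 0.20) -> float:
    """Months until a hypothetical one-off migration cost is paid back."""
    saving = monthly_saving(server_bill, saving_fraction)
    if saving <= 0:
        return float("inf")  # no savings means the switch never pays off
    return migration_cost / saving


if __name__ == "__main__":
    # The two server bills from the comment, with an assumed $60,000 migration cost.
    for bill in (1_000, 100_000):
        months = months_to_break_even(bill, migration_cost=60_000)
        print(f"${bill}/month bill: break-even after {months:.1f} months")
```

The same function works for comparing against a rewrite: plug in the rewrite's (much larger) cost and its (hopefully larger) saving fraction.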


Well it's a drop in replacement, so you can always just go back to CPython.


Funny that they chose LuaJIT's DynASM when there's something like TurboFan lying around.

The V8 JavaScript interpreter can dynamically generate machine code targeting the host arch for each bytecode instruction.

You can define the target assembly directly in C++ without resorting to any specific machine code.

Maybe complexity or because they wanted to stick to C?


They replaced DynASM with LLVM in 2014. Back then there was no other good JIT around. LLVM is certainly not good either: it's doable for specific workloads in companies, but not for benchmarks or short scripts.


Would it be easy to apply turbofan to Python? Do you have any details?


How about PyPy?


PyPy is the awesomest thing created since sliced bread, but C extension interoperability is still a source of perf problems in a few scenarios AIUI. Glad to have Pyston in addition to PyPy


They compare PyPy and CPython in their benchmarks.


Quick question - does pyston use the new PEG parser that cpython does? https://www.python.org/dev/peps/pep-0617/

Because that's a long term risk on subtle differences that may crop up.


If anyone is curious, it is available on the Docker Hub but it seems it's a version from 4 years ago based on https://hub.docker.com/r/pyston/pyston/tags.


Whoops, I forgot we had that up. I took it down until we can post an up-to-date image.


Not that it's necessarily a bad thing, but it looks like the Pyston project isn't getting a lot of updates:

https://github.com/pyston/pyston/commits/v2.0


I updated the repo in the middle of this comment thread; it used to show the v1 code, but since that was confusing people, now it's just a stub that redirects you to the blog.


Project says: A faster and highly-compatible implementation of the Python programming language. The code here is out of date, please follow our blog


Where is the code? The repo looks relatively untouched, and the blog directs to this repo for filing issues. https://github.com/pyston/pyston


Pyston v2 is closed source (for now?). From the blog post:

> Our plan is to open-source the code in the future, but since compiler projects are expensive and we no longer have benevolent corporate sponsorship, it is currently closed-source while we iron out our business model.


I think it's private. Looks like a for profit fork.


The latest open source repo, I believe, is at https://github.com/jmgc/pyston (latest Sep 11) 9f672c1bbb75710ac17dd3d9107da05c8e9e8e8f

Maybe someone has something newer.

On Oct 28 kevmod deleted all the history.


There have been many faster-Python threads lately. I'm looking forward to improvements, especially those that reach CPython!

As others have pointed out before, the new load attr opcache is maybe the most interesting recent performance improvement that was merged.


Now we just need a Python implementation re-written in Rust.

(only kinda joking)


https://github.com/RustPython/RustPython RustPython is still in active development, I don't know how compatible they are though


It's really awesome that it's not just a speed boost but a drastic decrease in memory usage too.

Going from a 230mb Flask app down to 55mb is huge if it's really a drop in replacement. If you factor in gunicorn process count, the wins are even higher because if you had 4 gunicorn processes each using 230mb but now they use 55mb, you're really going from 920mb down to 220mb of RAM.

Edit: This isn't true in the end, a brain malfunction mis-read the table thinking PyPy was actually the regular Python interpreter. It would be interesting to see how it compares to the default Python implementation for memory usage tho.


From the benchmark numbers reported in the post, Pyston uses slightly more memory than CPython for the flaskblogging benchmark. Switching to Pyston is only a win in terms of reducing memory consumption if you're using PyPy, and it would be more of a win to switch to CPython.


Wow thanks for the clarification.

I don't know why but I read PyPy 7.3.2 as Python 3.7 in the table. Talk about a brain auto-complete failure haha.


Always go back to Python (and C) when in doubt. The problem is the graphical user interface: Tk, ... or even the Pythonista scene. So diverse and so non-standard.


What does everyone think of just switching to Julia?


My last impression on Pyston was that it got killed by Dropbox. Happy to see community efforts still continue on this project


I read this as Python V2 and was thinking, why would they make V2 faster than V3 :D


There is another way. Since Pyston isn't generally compatible with the original Python, why not use faster languages? Go for network services and console tools. Julia for AI applications, data and computational science.


I thought this project was dead. Nice to see it’s alive.


Just think: if CPython had been GPL-licensed, this would already be open source (and maybe even merged into upstream).


Or, it might never have been written at all.


What is meant by “quickening”?


The paper is gone. Maybe someone has a link.

Looks like inline caching of some specific ops with certain types. Some arith and call ops are optimized this way.

It also looks like dynasm was later replaced by the much slower llvm, so it needs heavy warmup.
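A toy sketch of that idea, for anyone who wants to see the shape of it. This is an illustration only, not Pyston's code or the paper's: the opcodes (`PUSH`, `ADD`, `ADD_INT`) and the `run` function are made up for the example. A generic instruction rewrites itself in place into a type-specialized one after it observes its operand types, and a guard falls back to the generic form when the types change.

```python
def run(code):
    """Interpret a list of [opcode, arg] instructions, quickening ADD in place."""
    stack = []
    for instr in code:
        op = instr[0]
        if op == "PUSH":
            stack.append(instr[1])
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            if type(a) is int and type(b) is int:
                # Quicken: rewrite this instruction so later runs take the fast path.
                instr[0] = "ADD_INT"
            stack.append(a + b)
        elif op == "ADD_INT":
            b = stack.pop()
            a = stack.pop()
            if type(a) is not int or type(b) is not int:
                # Guard failed: de-specialize back to the generic instruction.
                instr[0] = "ADD"
            # In a real VM the int case would skip generic dispatch entirely;
            # here both cases compute the same result, only the opcode differs.
            stack.append(a + b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


if __name__ == "__main__":
    program = [["PUSH", 1], ["PUSH", 2], ["ADD", None]]
    print(run(program), program[2][0])
    print(run(program), program[2][0])
```

In Python itself the "fast path" saves nothing, of course. The point is only the mechanism: the instruction stream is mutable, and specialization is guarded so it can be undone.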


I wonder if it's similar to Python 3.x's opcache for Load global and load attr? It would make a lot of sense to generalize those.


Interesting to see that it significantly outperforms PyPy on some metrics, but wouldn't it have been better to allocate the human resources towards PyPy instead of a duplicated effort?


They are completely different, incompatible approaches. Pypy in particular has a C compatibility issue.


Do you mean the Pylint result? Was this confirmed elsewhere? It's unlikely that an interpreter is faster than a tracing JIT in geomean over a relevant set of benchmarks.


Can we talk about the name??

Like, if you wanted to create a confusing name, I can't think of worse ideas than naming it "v2" for a runtime that is Python 3 compatible, particularly for a project that has traditionally been Python 2 compatible...


Ugh. At some point we need to stop using side-forks/-projects like this because then they become competing standards that pull resources away from the main projects and evolve into their own incompatible beasts. I hope they instead contribute to the main branch, instead of wandering off into NIH land.


That's not how progress happens in practice. First of all, it's not a known fact that a Python JIT implementation with reasonable maintenance cost, functionality, and performance can exist. Pyston is trying to prove that it is possible, but it's not exactly a weekend project.

Even if you solved all the political issues around getting CPython maintainers to accept performance contributions (discussed already in this thread), adding a JIT to the main CPython codebase would surely slow the pace of development of other Python features. If the effort were to fail, then CPython's primary maintainers will have wasted a whole bunch of time coordinating with the JIT effort. I'm personally rooting for Pyston to succeed but it's admittedly an ambitious project.

So forking off an experiment is the right move here. Pyston's existence as a side-project is pulling no resources away from the main project -- but it would if it was trying to send changes upstream.

Hypothetically the Pyston developers could be making non-JIT contributions to CPython instead, but developers aren't interchangeable commodities. Pyston's engineers have expertise in optimizations -- they may not be as skilled or interested in language development.


Disagree. Currently, many people have "Python" projects which only CPython can run, for some reason or other. Having a lot of competing interpreters/compilers helps define what the core of the language actually is, and which of your assumptions are hinged on that single implementation.
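As a small illustration of making such an assumption explicit, the standard library lets code check which implementation it is running on. This is a minimal sketch: the helper names `runtime_name` and `is_cpython` are made up for the example, and what a CPython fork such as Pyston reports here is something you would need to verify rather than assume.

```python
import platform
import sys


def runtime_name() -> str:
    """Human-readable implementation name, e.g. 'CPython' or 'PyPy'."""
    return platform.python_implementation()


def is_cpython() -> bool:
    """True if the interpreter identifies itself as CPython."""
    return sys.implementation.name == "cpython"


if __name__ == "__main__":
    print(f"Running on {runtime_name()}")
    if not is_cpython():
        # A place to skip or adapt code that relies on CPython-only behavior.
        print("Not CPython: CPython-specific assumptions may not hold.")
```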


Sure, but who is to blame? There's no such thing as a "Python Language Specification", and everyone is trying hard to be CPython-compatible (which is ~impossible, as your implementation needs to be compatible with various modules written in C).

If there were a "language spec", competing implementations would make the language stronger, not weaker. Examples include gccgo, the various Java VMs, and C++ compilers.


I'm pretty sure you can find the Python language specification in the form of documentation at python.org. Some things that CPython does are occasionally misconstrued as specification, but the docs usually call that out.


The closest thing to a "language spec" is https://docs.python.org/3/reference/index.html, which still contains a lot of CPython specific/internal implementations (e.g. https://docs.python.org/3/reference/datamodel.html?highlight...).

This, combined with https://www.hyrumslaw.com/, makes this "Python Language Reference" a "CPython Language Reference".


It's an issue. But isn't forking fundamental to the principle of Open Source? The market will choose what it wants, like some Genetic Algorithm where the fittest thrive and the rest go extinct. Else how does a project evolve at all?


The solution to that is not to stick your head in the sand, it's to merge the desired changes.



