One of the annoying things about the x86-64 ABI is that floating-point numbers use vector registers, even if they're only scalars and not vectors. The only way you can tell the difference is the s versus the p in vmulsd/vmulpd.
So the optimized assembly here isn't using any vectorization. Which isn't surprising, since as far as I could tell, the author isn't actually running LLVM's optimizer on the code. Most of the optimizations happen in opt, not in llc, which only does codegen optimizations. Those are largely things like stack frame optimization, instruction scheduling, or some more powerful pattern matching in instruction selection (which, even at -O0, does a limited amount of common subexpression elimination and the like). You might have to add restrict to the pointers (in LLVM terms, noalias on the arguments) to get vectorization to kick in, but the number of values is small enough that some sort of loop versioning would probably kick in anyway.
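For reference, if you drive LLVM from Python via llvmlite (an assumption on my part; the post may just be shelling out to llc on the IR text), the missing module-level optimization step looks roughly like this, where ir_text is a placeholder for whatever IR the generator produces:

    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    mod = llvm.parse_assembly(ir_text)   # ir_text: the generated LLVM IR (placeholder)
    mod.verify()

    # Roughly what `opt -O3` does; llc alone never runs these module-level passes.
    pmb = llvm.create_pass_manager_builder()
    pmb.opt_level = 3
    pmb.loop_vectorize = True
    pmb.slp_vectorize = True
    pm = llvm.create_module_pass_manager()
    pmb.populate(pm)
    pm.run(mod)

    # Only now hand it to codegen (the llc stage).
    tm = llvm.Target.from_default_triple().create_target_machine()
    asm = tm.emit_assembly(mod)

Skip the middle block and you get essentially unoptimized IR fed straight into instruction selection, which seems to be what happened here.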
The benchmarking is also pretty unfair. The C and C++ code have the values copied in from a volatile array before each loop iteration proceeds, which the Python-via-LLVM version doesn't. That doesn't sound like much, but it's 30 million extra guaranteed memory accesses over the life of the program, which comes out to a few milliseconds. (And, gee, the Python code is only faster by a few milliseconds.) A better approach is to move the function to an external file and call it in a tight loop, wrapped by a high-precision clock.
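Something like this is what I mean, assuming the kernel has been compiled into a shared library first (the library name and the no-argument wrapper here are made up for illustration):

    import ctypes, time

    lib = ctypes.CDLL("./solver.so")            # hypothetical compiled output
    lib.run_solver.restype = ctypes.c_double    # hypothetical no-arg wrapper around the kernel

    N = 10_000_000
    start = time.perf_counter()
    for _ in range(N):
        lib.run_solver()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / N * 1e9:.1f} ns per call")

Same harness for every implementation, so the loop and timing overhead cancel out.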
Also, benchmarking a 5×5? In practice, it's the memory access patterns that kill code, so you'll want to use programs large enough to actually cause any expected cache movements at the L2/L3 level. 5×5 is small enough that you could actually stick it all in vector registers if there's a bit of vectorization.
"You don't have to learn some special language to create high-performant applications or libraries."
I like this comment, which seems to suggest that I don't need to learn LLVM IR in order to write a Python class that emits LLVM IR when plain Python code is run through it. I'd suggest that perhaps the reason he's able to do this is exactly because he knows "some special language" quite well.
When statements in the style of "you don't have to learn a new language" are uttered, they usually completely ignore the fact that solving the problem at hand is many times harder than learning some new syntax. Learning C when coming from Python is not hard, because it isn't the language itself that is hard.
To generalize, I have the impression that the general CS/programmer culture values initial ease of use much more than long-term usefulness.
And that's not directed at the younger generation; you will find plenty of old farts that only want to use or learn git when it's been thoroughly lathered in the thick, sugary syrup of a pretty GUI.
Learning C is easy, but getting a piece of C integrated with your Python code isn't. It can be done well, but bad software is littered with things that could have been done well. I recently had a project at work of ripping out a bad C implementation of a high-performance section of code and replacing it with a faster and shorter piece of numpy, so this is definitely a practical issue.
I find the opposite to be true. With numba & Cython, it’s ridiculously simple to target a segment of Python code for compiled specialization, often required for custom pieces of work not directly expressible in NumPy, especially when processing non-ndarray data structures.
I find I am frequently removing sections of inscrutable “vectorized” NumPy code that is inefficient, not because NumPy is inefficient but because conforming the data to a structure amenable to NumPy is too inefficient in the first place, and rewriting in straightforward, simple Cython functions.
>I find I am frequently removing sections of inscrutable “vectorized” NumPy code that is inefficient, not because NumPy is inefficient but because conforming the data to a structure amenable to NumPy is too inefficient in the first place
To generalize, learning C is easy, but doing anything with it is not.
To be clear, this is not meant as a critique of C; on the contrary. Put simply, it is just that making computers do stuff is hard. This is especially true if you want it done in a very particular way, such as in high-performance code. If you cannot already leverage the work of other experts, you need to know about modern optimizing compilers, language runtimes, modern processor architectures, OS/system internals, and so on.
To continue your Numpy example, that framework entombs the know-how and man-years of work of specialists in a certain area. If you need to do simple dense algebra, you will have a hard time beating Numpy/Julia/Eigen/MKL etc.
If your problem goes slightly off-track (e.g. sparse problems), where you cannot easily leverage the hard work and expertise of others, you'd better start rolling up your sleeves and open that main.c.
> values initial easy of use much more than long term usefulness. ... to use or learn git ... (in a) pretty GUI.
With more experience you'll find that certain operations are easier with particular workflows.
Git vs other ... well, Subversion is the standard. I'm not getting paid to move repositories on a whim. I have to make a business case ... etc. Guess what, I've got better things to do (i.e. do things that make money) than chase the newest shiny thing.
Sometimes a multi-hundred line git dump works and sometimes it doesn't. In some workflows the GUI scans better, e.g. picking out the hot spots in functions.
Git CLI is universally agreed to be absolutely terrible, with its manuals actually making it worse. Just flick through one or two blog posts you can find here, many of which already made HN:
- learn a basic workflow and ask/google for help for problems. A gui makes this slightly easier
- learn the underlying model for git. Then you probably still have to google for invocations occasionally but you would have to do the same with a sugary gui. And for complex actions in a gui the answer usually is 'use the cli'
Plus occasionally a gui does something incredibly weird like giving direct access to `git clean --force` like vs code did a while back.
People used to say that about svn, which I always found so odd. Perhaps I was fortunate to start out on a team that actually understood branching, merging, and most importantly, how to resolve merge conflicts?
If you could expand it into a longer answer, I would appreciate that - perhaps there are some pieces of good design (git add -p?) in Git, but overall it's too messy for many.
git add, push, pull, fetch, bisect and rebase, along with git grep and log commands, are very easy to use and exceedingly rarely is anything else needed. Just use a rebase-only workflow and the cli essentially is self explanatory in terms of how your mutations affect the revisions.
Other commands like stash, checkout and branch are equally simple. It’s just a very simple user experience.
I think where people express confusion over git is that they don't expect to interact with revision history, rather just to inflict changes onto a head-only svn-style repo.
People are mistakenly encouraged to bring that mentality to the table when it's not useful. You need to bring a mentality that you're mutating and sculpting branches.
It's like a beginner saying "snowboards are complicated" because they come in expecting it to be like a sled. You hear the same complaints about other systems like darcs and mercurial too, for the same reasons, even though each of those, like git, offers a super easy to use CLI experience.
As much as I find all of Git's magic incantations difficult to master, I have to agree with your comment 100%.
Once you grok rebasing branches, git add, push, pull, bisect, rebase, log, commit --amend are all I have needed to work with Git and the experience has been far better than working with any other version control like SVN or Mercurial.
> Other commands like stash, checkout and branch are equally simple
Ok, so they are simple because you say so? Stash in all but the simplest cases is a sure-fire way to shoot yourself in the foot. Checkout is a weird guy, he does too many different things, and what's up with the double dashes? Super-weird! Also beware if any of your branch names happen to match any of the file names in the repo :O.
> I think where people express confusion over git is that they don’t expect to be interact with revision history
This is one of the things that confuses beginners about git, but it's not a confusion about git cli. You are conflating the criticism of git cli with criticism of git. Git is good, git cli is (notoriously) bad. However, eventually you'll (hopefully) learn the proper incantations and then using git cli will seem simple to you.
I never claimed any of this was anything more than my own technical appraisal / opinion. But to be clear, you’re not offering any counter argument, just unsupported assertions of your different opinion, including things like “what's up with the double dashes?“ which makes it hard for me to believe I could personally find value in your view on it.
> “You are conflating the criticism of git cli with criticism of git.”
Not at all. I’m saying most of the time when people say they are confused by git cli, that it behaves counter to their expectations, they are not actually expressing any design limitation of the cli commands themselves, but instead blaming cli commands for the conceptual confusion they have about git’s model in general.
Well, as long as the operations you want are add/sub/mul/div/array indexing, you don't need to learn LLVM IR—you can just copy the class in this blogpost and write your completely unrelated algorithm in plain Python just as this post does, and it will work.
Basically the author has written a tiny framework for making a reasonable subset of Python a DSL that targets LLVM, which is super cool.
There are a few tools out there that fuse Python and LLVM. What I find interesting about this is the way he did codegen by passing in mock objects that just emit LLVM IR in the simplest possible way, and that it actually works while being incredibly succinct.
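For anyone who wants the flavor of it without reading the post, here's my own stripped-down toy version (not the author's LLVMCode class): overload the arithmetic dunders so that running ordinary Python code over the mock objects records SSA-style instructions instead of computing values.

    class Emit:
        counter = 0
        instructions = []

        def __init__(self, name):
            self.name = name

        @classmethod
        def _op(cls, op, a, b):
            cls.counter += 1
            tmp = Emit(f"%t{cls.counter}")
            cls.instructions.append(f"{tmp.name} = {op} double {a.name}, {b.name}")
            return tmp

        def __add__(self, other): return Emit._op("fadd", self, other)
        def __sub__(self, other): return Emit._op("fsub", self, other)
        def __mul__(self, other): return Emit._op("fmul", self, other)
        def __truediv__(self, other): return Emit._op("fdiv", self, other)

    # Ordinary Python code; run on Emit objects, it records IR-ish text instead of numbers.
    def f(a, b, c):
        return (a + b) * c - a / b

    f(Emit("%a"), Emit("%b"), Emit("%c"))
    print("\n".join(Emit.instructions))

The real thing also has to handle constants, array loads/stores and so on, but that's the whole trick.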
That said, like all things that try to accelerate Python, it only works for a subset of the language (and in this case a very narrow one), and TBH, while this whole "let's accelerate Python" movement is cool, it's really quite awkward to have hidden subsets of a language that are accelerable. I'm not sure, but I think it might be better if there were a second accelerable language that was inlined with clear rules, rather than this stuff that seems like it will always be prone to inconsistency...
...or Julia. I suspect there are tradeoffs there too though... Building a language/runtime for high level experimentation with built in support for performance that rivals custom low level code is hard, the goals themselves are at odds with each other...
"What I find interesting about this is the way he did codegen by passing in mock objects that just emit llvm it in the simplest possible way and that it actually works while being incredibly succinct."
It's interesting, but it doesn't work in general, which is why you don't see the technique in common use. The problem is that it only catches methods called on objects, but it can't catch anything that isn't a method called on an object. Unfortunately, that includes for loops, if statements, some aspects of exceptions, and just generally "all flow control".
So rather than being a powerful and general technique, it's a powerful, but very very specific technique that only works in certain toy problems. Unfortunately, once you leave those toy problems behind, it doesn't even help you solve the problem because the paradigm this forces you to operate under makes easy things easy, but medium things hard and hard things pull-your-hair-out impossible. You're better off writing a real compiler... or doing this in a language that does expose enough of what's going on to make this technique work, although you tend to end up in either a Lisp or Haskell, which come at this from very different angles but both have the capacity to make this work. (Lisp via syntax tree manipulation, Haskell via Free Monads and other similar techniques.)
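Concretely, with this sort of tracer the failure mode looks like this: the if statement forces Python to ask for a concrete boolean right now, so you either crash or silently bake one branch into the generated code.

    class Trace:
        def __init__(self, expr):
            self.expr = expr
        def __gt__(self, other):
            return Trace(f"({self.expr} > {other})")
        def __bool__(self):
            # Python demands a concrete True/False here; there is no hook that
            # lets you emit a conditional branch and keep tracing both sides.
            raise TypeError(f"cannot trace control flow on symbolic value {self.expr}")

    x = Trace("%x")
    if x > 0:              # raises: the branch is invisible to operator overloading
        print("positive")

That's why these tracers stay confined to straight-line numeric code.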
The Unity 3D engine is doing interesting things with this subset idea, where you can write most of your game in regular C#, then if you need a chunk of super fast stuff, you can use a subset of C# that can be easily vectorized, or parallelized, or GPU-ified. You can just do that all inline, no need for separate files or build processes. It is pretty neat and letting them get some crazy performance numbers.
How are they at odds with each other? Julia is more expressive and faster than Python. It supports all the same high-level programming constructs, and better versions of the same (like its lambdas, macros, and multiple dispatch).
The one issue is compilation time, which is much improved and will be further mitigated through more caching and planned multi-level compilation.
I find the opposite to be true. End users don't need to worry about types; interfaces are generic, and multiple dispatch and abstract typing obviate the need for all kinds of weird special methods and type-system stuff.
I can't remember the very specifics, but one particular instance that sticks out is when I was creating some images on the fly for Tobii eye-tracking software. I was using Pillow, I remember... I think it was Pillow that wouldn't work well with PyPy. I'll try to dig out my code in a few days and see what specifically made things not work with PyPy.
Aye.. There are some bugs left to be ironed out in PyPy. But many times you just need it to work and the correctness of what you're calculating isn't too critical. Then it is pretty awesome.
Yeah I like this technique too. To put a name on it, it's a form of metaprogramming that could be called "staged programming" or "multi-stage programming", and there are a bunch of nascent systems that do things like it:
https://scala-lms.github.io/ -- Scala LMS is similar in that you replace type T in your algorithms with Ref<T>, and you get a code generator (compiler) rather than an interpreter.
http://terralang.org/ -- You can write arbitrary Lua to generate Terra code, sort of like he's writing Python to generate LLVM code. I think this is more general and doesn't necessarily work by retaining the structure of say solve_linear_system(). I'm not quite sure. But the problem domains they tackle are very similar to this (numerical code).
TensorFlow also has two stages of computation -- you define a graph of objects in imperative Python, using Python operator overloading in the same way that LLVMCode in this example does. And then later the separate TensorFlow language is evaluated in parallel / distributed on GPUs.
In one case you're defining an SSA graph of ops in LLVM; in the other case you're defining a graph of TensorFlow ops.
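For anyone who hasn't seen the TF 1.x graph style, the two stages look roughly like this (TF 2 needs the compat shims shown here; the exact API isn't the point):

    import tensorflow as tf
    tf.compat.v1.disable_eager_execution()

    # Stage 1: plain Python builds a graph of ops; no arithmetic happens yet.
    a = tf.compat.v1.placeholder(tf.float32, shape=())
    b = tf.compat.v1.placeholder(tf.float32, shape=())
    c = a * b + 3.0   # operator overloading records Mul and Add nodes

    # Stage 2: the TensorFlow runtime evaluates the graph.
    with tf.compat.v1.Session() as sess:
        print(sess.run(c, feed_dict={a: 2.0, b: 4.0}))   # 11.0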
You can also do a lot of this with constexpr / templates in C++, although the syntax is horrible.
As you allude to, these staged programming systems can have TWO languages or one. In the case of Scala and Zig it's one language, but in the case of Lua/Terra it's two. Arguably in the case of C++, it's two languages and not one.
Another thing to search for is "partial evaluation". I still have to understand the difference more precisely, as alluded to in the link above.
At first I was a little critical but I get it a little more now. Of course the real test is trying it, but I'm not working in that area at the moment, though I've used TF before.
What I find interesting is the recurring theme of two languages -- in Swift's case they say you can use a "static subset". I think that means all the imperative control flow, and then static dispatch (as in templates), but NOT dynamic dispatch (as in OOP).
So in this example you have Python/LLVM, then there's Lua/Terra, Python/RPython, Python/TensorFlow, or Swift/static Swift, etc. Except in the latter case I guess it's all mixed together and you don't explicitly have a graph building stage.
- Switching to -O3 for gcc (instead of the author's default -O2), the C version (exp_volatile_a_b) gets ~30% faster, bringing it down to ~50% of the python-generated llvm - it appears to still be doing the full computation at runtime.
- Switching to -O3 for llc doesn't make any difference for the python-generated llvm version.
- With gcc 4.8 -O2 (instead of 7.3 -O2), I get ~0.06s - it looks like gcc 4.8 decides to inline everything, and gcc 7.3 doesn't.
The 'partial interpretation' trick used here has also been used successfully in the database community to accelerate whole queries, etc. Tiark Rompf's group in particular has been pushing this idea to its limit.
I haven't seen this website before and, after taking a cursory look at this and the previous posts mentioned in the article, I'm highly skeptical of the quality of this work:
- In the C article he shows that LAPACK can be massively improved upon. That sounds very impressive, until you realize he hasn't mentioned which LAPACK/BLAS implementation he used. The default download from netlib is the vanilla version and is known to be very slow in general. Any benchmark dissing LAPACK is incomplete without comparison with OpenBLAS and Intel MKL. And instead we see the author pulling all sorts of overengineered tricks to top the performance of a vanilla LAPACK build (I mean there isn't even a mention of whether or not he used optimization flags while compiling LAPACK).
- In this Python article, again, Python comes with the highly developed scipy/numpy stack, which can be coupled with a highly performant OpenBLAS or Intel MKL, but there is absolutely zero mention of the most popular scientific stack in the Python community. Again, instead, we see a round-about and "clever" solution of trying to couple Python with LLVM. (A quick numpy baseline is sketched below.)
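For instance, a few lines like this already hand the 5x5 solve to whatever LAPACK build numpy is linked against, which is the baseline any "beats LAPACK" claim should be measured from:

    import numpy as np

    a = np.random.rand(5, 5)
    b = np.random.rand(5)
    x = np.linalg.solve(a, b)   # calls LAPACK's gesv under the hood
    np.show_config()            # shows which BLAS/LAPACK numpy was built against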
I got LAPACK and BLAS via apt-get on Ubuntu 18.04, so whatever that means. I installed OpenBLAS from source, and it looks like the target detection decided to enable AVX2 (but not AVX512).
So the fully-unrolled version is still faster on the 5x5 matrix, which doesn't seem super surprising to me. I would expect a LAPACK implementation to have some overhead compared to a straight line solution to a problem this small.
Yes.
This is easy to see in Julia, by comparing native arrays with StaticArrays.jl.
Native array operations are linked to OpenBLAS, while many StaticArray operations are unrolled (will have to get to a computer to see if that includes 5x5 inversion).
At the small sizes, BLAS is tens of times slower.
Especially OpenBLAS (MKL does better at small sizes).
There's a lot of overhead from things like determining optimal blocking structures.
The tracing-Python-objects technique is very similar to how pytorch traces models to produce a graph representation that can be executed entirely in C++ for performance reasons.
The pytorch one is a little bit more general since it can trace arbitrary models + have custom high level operations that might not be supported by default. You just pass it a python method and an example input.
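A minimal version of that tracing, for anyone who hasn't seen it (nothing model-specific, just a plain function and an example input):

    import torch

    def f(x):
        return (x * 2.0 + 1.0).sum()

    traced = torch.jit.trace(f, torch.randn(4))   # runs f once and records the ops
    print(traced.graph)                           # the captured graph representation
    print(traced(torch.arange(4.0)))              # executes via the TorchScript runtime

The caveat from upthread still applies, though: Python-level control flow gets baked in at trace time.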
But it shows that you can write programs in Python that outperform their counterparts written in C or C++. You don't have to learn some special language to create high-performant applications or libraries.
... you just have to write bespoke macros to turn python code into LLVM code...
I think this is a false dichotomy. You can make almost any language fast by writing very non-idiomatic code. The advantage of something like Java is that you can write readable, "normal" code that runs at nearly the speed of optimized C.
I see these comparisons all the time and they inevitably jump through a ton of hoops to make a language like Python perform like Java/C#/Go/Rust. IMO the big advantage of these languages is that they're fast for the majority of normal, totally unoptimized code.
I can write REST endpoints in Java/C# that do a hundred thousand requests per second using a mainstream framework with no optimization. That's important in the real world, where I'm usually working on large, ancient projects with somewhat poor code quality.
If you write a simple for loop incrementing `i` then yeah Java runs almost as fast as C. In whole programs where GC unpredictably runs and does all sorts of nasty things like dirtying pages and pressuring the cache, you won't get anything close to C. Citation needed.
This was true until a few years ago. Java 11 has a new ultra-low-pause GC that's almost entirely multi-threaded and mostly predictable.
There's plenty of semi-realistic benchmarks out there. I like TechEmpower the best since they test an entire web framework + database.
Several java implementations are right up there with C, along with .NET Core, Go, and Rust. In the vast majority of cases the extra 20-50% speed of C isn't worth the dangers.
I converted the author's C version of the code to Java and ran it through JMH; on my machine it runs about 30% slower. (This is before he added the metaprogramming that essentially precalculates some of the matrices at compile time.)
That's also without making any attempt at optimizing, basically a straight conversion that took ten minutes.
It's not the speed of C but it's pretty close. And it has memory safety.
Agreed. This is a point I make routinely when discussing the performance of platforms and frameworks: all else being equal, selecting a high-performance platform and framework affords you, as the application developer, the luxury to defer optimization of your code.
It's an inversion of the instinct people have with our culture's commandment about deferring premature optimization. The instinct is that selecting for performance at the start of a project is premature. I argue that selecting a platform is precisely the time to consider performance because it is a decision that is extremely high-friction to change later. Therefore, it's timely, not premature.
By selecting a high-performance platform, you set the performance ceiling and baseline performance of your application high. This gives you the freedom to develop application code in a sub-optimal manner—deferring its optimization until such time that it is necessary, perhaps forever. With a lower-performance platform, you will run into a performance hurdle event that requires optimization effort earlier and the level of effort to raise your performance ceiling will be greater than it would be had you selected a higher-performance platform at the start.
The key, of course, is all else being equal. That's not always the case for all teams, but it is still often the case. The key is knowing when your team can manage to adopt a high-performance platform and being careful about that decision.
"high-performance" is a qualifier worthy of question, though.
If the goal were to write one algorithm to be the fastest implementation, performance would usually equate to having low level control.
If the goal were to do automatic optimization of a wide variety of similar algorithms, then we're in the "actually writing a compiler" territory and then there is little to be gained from defaulting to low-level control.
Meanwhile, I'll just keep using Common Lisp and get the benefits of a higher level language AND have all of my code run 15-20x faster than equivalent Python.
And I'll still be able to manually generate LLVM IR if I really want to (but I don't).
To summarise: Use Python to output LLVM intermediate representation (IR) and then compile the IR.
C was created in the PDP-11 era whereas LLVM IR was designed for more modern processors, so coding in the latter allows for more performance optimisations that C compilers can't always make.
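If you want to play with the idea without hand-writing IR text, llvmlite's IR builder gets you the same kind of output from Python. A tiny sketch of my own, not the author's approach (the post emits the IR text directly):

    from llvmlite import ir

    dbl = ir.DoubleType()
    module = ir.Module(name="demo")
    fn = ir.Function(module, ir.FunctionType(dbl, (dbl, dbl)), name="mul_add")
    a, b = fn.args

    builder = ir.IRBuilder(fn.append_basic_block(name="entry"))
    builder.ret(builder.fadd(builder.fmul(a, b), ir.Constant(dbl, 1.0)))

    print(module)   # textual LLVM IR, ready for opt/llc or an in-process JIT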
I noted at https://news.ycombinator.com/item?id=19014504 that this benchmark is basically a garbage benchmark because it's not properly comparing apples to apples. It doesn't even manage to use LLVM's optimizations on the IR properly, which should raise major red flags about how valid the comparison is.
Besides that, LLVM IR is not all that different from C. LLVM essentially originated as an implementation of the abstract machine described in the C standard, and most of the semantics of LLVM essentially align with the equivalent semantics in C. There are some additions for things that don't exist in C--bitcast, vector types, unwinding/exception handling--and there are knobs to turn down some of the undefinedness in C (e.g., you can choose whether or not integer overflow causes undefined behavior, and you can vary the alignment of load and store operations). But using LLVM IR directly isn't going to expose any extra optimization opportunities that you couldn't have already expressed with C, potentially with some extra gcc-isms (such as vector types).
In other words: Use Python to create a domain-specific compiler that essentially unrolls the entire problem and feeds it into LLVM. LLVM is able to optimize that better than the C or C++.
I don't think so, and even where it would be, GCC's IR can evolve. But GP was talking about the limitations of C, not GCC. C's memory model puts some limits on what a compiler can do that other languages might not. Fortran is usually given as an example of a language that can be faster than C, because the compiler is allowed to do more. Rust might get there as well, at least in very specific circumstances. Given that GCC also has a good Fortran frontend, I'd assume that their IR is already more powerful than what's needed to compile C alone.
Isn't it just that it is hard to prove there's no aliasing in C, which can make it harder to vectorize?
If so, "restrict" should handle that in most cases.
LLVM IR isn’t portable in the general sense, but if you’re careful to maintain specific ABI details it’s possible to have IR that works on multiple platforms.
For anyone interested in doing this kind of stuff in an actually sane manner (i.e. using a high-level dynamic language intermixed with LLVM-compiled code), the sadly now-abandoned Terra project is worth checking out.
Reminds me of the FEniCS[1] project, a framework for solving partial differential equations, which I got introduced to some 10 years ago during my studies.
LLVM wasn't a thing back then so they generated C++ source code which got compiled into a Python module which was loaded back into the running Python script and executed.
The translation pass also allowed for optimizations, for example if one of the inputs was a constant it could call a different library function which handled that special case. The result was a very flexible solver which allowed you to write code that looked almost like the math you were trying to solve, yet could match or beat a handwritten solver.
In my experience, what makes programs slow more often has to do with memory than CPU cycles.
More memory use means more cache misses, and cache misses have a huge impact. Roughly, L2 cache is 10x slower than L1 and RAM is 10x slower than L2.
With C, when you want to allocate some memory, you typically precalculate the size and malloc() just the right amount. In C++, you are more likely to use containers, which are safer and more flexible but tend to introduce some overhead. GC languages like Java have the GC overhead in addition and dynamic languages like Python are the worst because they also need to keep track of object types.
Now there are some applications that are really CPU-limited, like complex calculations. But in that case, chances are that you'll probably want to offload things to the GPU.
Still, I really liked the article. That's another great tool to have when it comes to performance. But it won't change the general case where C > C++ > Java > Python when it comes to performance.
A little caveat when it comes to C vs C++ though. C++ can be faster than C if you use C-like memory management, but that's not the usually recommended way of doing things.
No. Worse, this makes, exactly, the case for Julia[1]. Julia has llvm integrated within it for code gen. So, there's no actual need to start playing with lower level code, unless you really want to.
From a programmer productivity scenario, I'd find it difficult to justify all the extra work and time spent on doing the Python + LLVM version, as compared to something that was far simpler to maintain, and performed almost as well.
Now if he came out with some sort of 2 order of magnitude improvement, that would be a different story. But I don't see Python getting there. Without appealing to something external, which isn't, itself, Python.
It is Python in the sense that the solve_linear_system() function is unmodified. At first it runs with Python objects, and then it runs with LLVMCode objects.
It's a form of metaprogramming which some would call staged programming or multi-stage programming:
The authors of all those systems call it metaprogramming, more or less. Metaprogramming is a very general term, and there are many varieties of it.
Once you get past the syntax and specific technologies, you'll see that many of these systems have a similar structure, and you could probably do a line-by-line port of solve_linear_system() to those systems. I think Scala LMS is probably the closest one.
Metaprogramming has a specific technical meaning that has to do with code that reads or generates code. Try not to parse the word metaprogramming itself.
If the syntax is recognized by the Python REPL... maybe it's an ad-hoc compiler written in Python. Which brings up the question: what is a language?
Many low perf dynamic languages did all sorts of tricks to improve perf. It's even an engineering rule to rewrite things that need more perf.
Similarly you can twist the language if it has metalevel capabilities...
It's a thing that I liked in less common languages: they often enjoy generating low-level code and treat the system as an object (the CPU being a consumer of an instruction stream) and not a divinity.
I agree with the sentiment of the article, nowadays there is almost no reason to not use a high level language for most of the tasks... native code could be just plugged on demand.
That's exactly what Numba[0] does for Python: just a decorator on a function, and magically the function gets native performance by being compiled with LLVM.
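E.g. something like this (my own toy example, not from the article):

    import numpy as np
    from numba import njit

    @njit
    def dot(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * b[i]
        return total

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    dot(x, y)          # first call triggers LLVM compilation for these argument types
    print(dot(x, y))   # subsequent calls run the compiled machine code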
Since the author appears to iterate a calculation on constant a and b one million times, I would expect a perfect compiler to compute x only once at compile time and then also compute sum_x, so that there is not much else left to do.
I love using Python for getting stuff made fast, and then, only if needed, accelerating a few function calls to make it fast. I have used scipy, Cython, and even the C API for this, but this looks like an interesting new take.