Very nice. I just looked at the author's resume.[a] It appears she is still in, or only very recently graduated from, high school. Is that right? Impressive!
How does a high school kid get so interested in (and manage to pull off a great article about) stuff like Intermediate Representations, compiler optimisation techniques, and JITs as different as those of Julia and GraalVM, topics that would easily scare even a seasoned programmer with years of experience and a CompSci degree under their belt?
EDIT: When I was in high school, MS-DOS and Windows viruses were all the rage: self-modifying polymorphic code that injected itself into the RAM of other processes, having no identical strings longer than a few bytes between instances of the virus, etc. Way less useful than JIT stuff, but a comparable level of "depth", and 100% assembly code.
I suppose I can answer this, as a high school kid: because it's interesting! I'm really interested in compilers and interpreters as well, I suppose because I think it's cool how they're so integral to all kinds of programming, and how it takes so much work to build one that can then almost magically understand what you tell it to do. I work on RustPython[0], and even though I've seen and edited the bytecode interpreter loop and the parser and the ast->bytecode compiler and all of the builtin functions and types, it's still hard to fathom that all of that is running under the hood when I type something at the REPL and it just runs.
Entirely off-topic, but the author's name when written without a space (i.e. Carolchen, as it is in the URL) just happens to be the diminutive of Carol in German. Except that Carol is not a very common name here at all, so rare in fact that I've not once met a German Carol. This had me very confused for a moment.
This article is really nice for not making things harder than they needed to be. Trace then substitute the trace - no magic.
It seems like a big part of all these strategies is making sure the checks for the optimizations (checking if they're possible, checking if they're done) don't take more time than the optimizations save.
Or could there be a way to alter the execution graph so that once you add in the optimization, you never have to check if it's there?
> It seems like a big part of all these strategies is making sure the checks for the optimizations (checking if they're possible, checking if they're done) don't take more time than the optimizations save.
You can't really check that, you can only hope your heuristics are right. You don't know in advance how many times a method will be executed and thus how much optimization budget is worth it. If you boot up a Java application that runs for months and handles tons of traffic then you could theoretically justify many minutes of CPU cycles for optimization.
And that's not all that makes JITs complex. Some also do speculative optimizations that require guards and deoptimization points so they can fall back and recompile if those assumptions are violated. If the language has a garbage collector then the GC and JITs have to cooperate for ideal performance.
> Or could there be a way to alter the execution graph so that once you add in the optimization, you never have to check if it's there?
One of the things V8 does is store "field dependencies" -- a list of optimized code attached to fields that is invalidated if that field is modified in a certain way. The optimized code is known to only access that field in a certain way, thus doesn't need to check it, so this only slows down unoptimized paths (and only on their slow paths too, since even unoptimized code has fast paths for known access patterns thanks to inline caches). Comes with a memory/generality cost though, so this isn't used for every type of check.
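Very loosely, a toy model of that dependency mechanism might look like the sketch below (this is only an illustration of the idea, not V8's actual data structures, and all of the class and method names are made up): optimized code registers itself against a field instead of guarding every read, and the unoptimized write path pays for invalidation.

    import java.util.*;

    // Toy model of dependency-based invalidation: compiled code that assumed
    // something about a field is tracked per field, and a "shape-changing"
    // write (taken only on the slow path) invalidates those dependents.
    class DependencyRegistry {
        private final Map<String, List<CompiledCode>> deps = new HashMap<>();

        void registerDependency(String field, CompiledCode code) {
            deps.computeIfAbsent(field, k -> new ArrayList<>()).add(code);
        }

        // Called from the slow, unoptimized write path only.
        void fieldChangedShape(String field) {
            for (CompiledCode code : deps.getOrDefault(field, List.of())) {
                code.invalidate(); // deoptimize: fall back to slower code
            }
            deps.remove(field);
        }
    }

    class CompiledCode {
        private volatile boolean valid = true;
        void invalidate() { valid = false; }
        boolean isValid() { return valid; }
    }

The fast path then reads the field with no check at all; the whole cost has been shifted onto the (rare) invalidating write.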
I just realized that the link color changes every time I load the page–I clicked on the article again and it was different from the last time I looked at it!
I don't think I'm familiar with this (I may have come across it but didn't identify it by name when I worked with Graal?), could you provide a link? :o Thanks!
I have provided links above, there are also a couple of Java Language Summit, Code/Java ONE and Google IO talks about them, but I would need to search for them. Can provide them later.
Thanks, I know of this stuff but I don't think I would've been qualified to form an explanation/overview of them. It would be nice to have these in a post though!
Is anyone aware of any efforts to get clang/rustc to use lli to speed up iteration times? My thought is you lower everything to LLVM bitcode first but skip all the codegen/linking. Then you can codegen or run interpreted. Might help make debug builds much faster to get to execution. You could probably trivially parallelize the background compilation and auto-build on every file change to get the best of all worlds (most codegen is ready, the remainder is executable and will get compiled as needed). If you can do it with live substitution (which I think is what lli does) then you'd progressively get a faster program.
How much time actually goes into instruction selection and register allocation? I thought a lot of the time went into optimization passes at the IR level.
There is also an interesting presentation about how Azul implemented their JIT called Falcon, that is fully based on LLVM. https://www.youtube.com/watch?v=Uqch1rjPls8
Generally LLVM is a nice idea that allows vendors to reuse major components.
Absolutely correct that LLVM is a fair amount slower than C1 or C2. Azul augmented our JITs with a compilation recording & replay mechanism to combat the general start-up problems posed by JITs (including Falcon).
That’s not really accurate. It has strict requirements about stack traversal required to appropriately trace memory roots. This is incidental to JIT vs AOT.
Tl;dr: a JIT is a compiler that optimizes certain parts of the code after the interpreter sees them as hot (e.g. executed many times). Thanks to runtime information it can occasionally exceed the performance of statically compiled languages.
> Thanks to runtime information it can occasionally exceed the performance of statically compiled languages.
Interestingly in 30 years I have not once heard of a case where this theoretical benefit has manifested as a clear advantage in any real world application when looking at the system as a whole... Amdahl's law and all that.
You can always hand tune the 1-10% hotspots for reasonable cost most of the time, and even static tools can do PGO which generally gets you where JIT would anyway.
The way I explain the unreasonable effectiveness of JITs and adaptive compilation (multiple compilers with the advanced ones focusing on the hot spots) is by contrasting constants and variables.
In a long enough time frame nearly everything is variable: over decades even computer architectures and language syntax change. In a short enough time frame nearly everything is constant: in a single cycle a von Neumann computer will only change a single memory location and all the others will have a constant value.
Even though an AOT compiler might have minutes or more to work on a code fragment, the code it generates has to work for a long time and on many different machines. A JIT might have milliseconds or less to work on the same code fragment but what it generates only needs to work on this exact machine and only for minutes or hours. In fact, it might even be wrong and have to be recompiled less than a second after the JIT produced it.
So the code generated by the AOT compiler must treat as variable things that the JIT can pretend are constant (with a hook to recompile if that proves not to be so).
This is slightly related to partial evaluation, which is a key idea in Graal.
JITs vs static compilers on a language made for static compilers is probably a contest that JITs will never win. The keys to making JITs "better" than static compilation are (by my limited knowledge) creating languages that are made to be JIT compiled (like Java) and ensuring that tuning JITs is easier than PGO (though nothing currently in development seems promising to this end).
You do see JITs winning in Java because Java's "everything is virtual" and "most for loops are interface calls to Iterator<T>" make it very hard to statically compile efficiently.
Languages designed for static compilation generally consider virtual dispatch to have a user-visible cost and don't unilaterally make users pay it without asking for it.
> in 30 years I have not once heard of a case where this theoretical benefit has manifested as a clear advantage in any real world application
Today you can make a very direct empirical comparison to see this - using the Graal compiler. This lets you compile exactly the same Java code either ahead-of-time or just-in-time, using the same compiler logic except for the runtime information available when running just-in-time. The just-in-time code is (ignoring startup and warmup time) in my experience always faster, due to the extra runtime information.
Isn't this because Java is designed for JIT compilation, or at least not designed (with the appropriate tweaking knobs) for AOT compilation?
Languages built with AOT compilation in mind (e.g. Rust or Nim) usually give you lots of ways to make choices at compile time and give hints to the AOT compiler that the JIT compiler would instead try to infer at runtime in Java.
But by inferring these things at runtime instead, maybe the JIT approach makes it easier to get fast code in those cases where you (as an application developer) don't want to put a lot of effort into optimization?
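For a concrete Java example of the kind of choice that gets inferred at runtime rather than spelled out in the source, here is a minimal sketch; it assumes HotSpot's usual speculative devirtualization behaviour, and the class names are just for illustration:

    // The call s.area() is a virtual call in the bytecode. If only one
    // Shape implementation has been loaded, the JIT can speculatively
    // devirtualize and inline it in the hot loop, guarded so that loading
    // a second implementation later triggers deoptimization. In an
    // AOT-first language you would typically express that choice yourself,
    // e.g. via generics/monomorphization instead of an interface.
    interface Shape { double area(); }

    final class Circle implements Shape {
        private final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    public class Dispatch {
        static double total(Shape[] shapes) {
            double sum = 0;
            for (Shape s : shapes) {
                sum += s.area();
            }
            return sum;
        }

        public static void main(String[] args) {
            Shape[] shapes = new Shape[1_000_000];
            for (int i = 0; i < shapes.length; i++) shapes[i] = new Circle(i);
            System.out.println(total(shapes));
        }
    }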
Java has had commercial implementations of AOT compilers since the early 2000s.
Most compilers for embedded systems have always offered that option, and as far as enterprise JVMs are concerned, JIT compilers have had the capability to cache JIT code and PGO data between runs.
Both options have now come to OpenJDK, OpenJ9 and Graal.
Android also learned the hard way that changing to pure AOT did not achieve the performance improvements they expected, while on-device compilation meant C++-like compile times whenever it was time to update all apps; hence the multi-tier interpreter/JIT/AOT with PGO introduced in Android 7.
The main problems with AOT compilation plus PGO are that, first of all, one needs a good dataset so that the optimizations are in line with the actual behaviour in production; it still doesn't work across dynamic libraries, so optimizations like devirtualization are not possible; and most of the time the tooling is quite cumbersome to use.
Pretty sure Asymetrix was doing this in the late 90s, 1998 or so. They started out doing educational software and pivoted to ahead-of-time compilation for some weird reason. Here is a link to when it failed in 1999: https://www.cbronline.com/news/sepercede_exits_java_sells_to...
But from a higher level... you could just reimplement the thing in C++ with GCC and the whole thing will probably perform better.
Basically what I'm saying is that I've never seen any substantial rewrite of decent C++ code into Java perform better; even if JITting has benefits on a small scale, it's not substantial enough to overcome the other overheads of managed languages.
True, but part of that difference comes from the “we don't need to design the language for perf, the JIT will close the gap automagically” mindset though.
I don't really think so for Java; most of it was designed before high-performance JIT-based VMs were a thing. In fact I think a lot of the advancements in JITs actually came out of the work that went into making HotSpot fast.
Sure, but one of the vocal pitches for Java in the beginning, and even Smalltalk before that, was: don't worry about the price paid for automatic memory management, bytecode, dynamism... the top men are working on JIT, GC research and other optimizations that would not only close but handily overtake the gap compared to the state-of-the-art statically compiled languages of the time. (Hence exactly why HotSpot was a thing once Java was clearly becoming widespread.)
In reality it was a bit of a mixed bag... and to some of us who remember the hype from 30 years ago it comes across as overpromising and underdelivering.
That isn’t to say that the technology isn’t incredible, I don’t mean to dump on it. But overpromising is sort of the status quo for tech.
>> Interestingly in 30 years I have not once heard of a case where this theoretical benefit has manifested as a clear advantage in any real world application when looking at the system as a whole.
> The just-in-time code is (ignoring startup and warmup time) in my experience always faster, due to the extra runtime information.
I don't think that result is surprising. The issue is that in the real world you can't ignore startup and warm up times.
I've never heard the claim that JIT compiled code is slower than statically compiled code. The issue is whether the extra costs associated with JIT are outweighed by its benefits.
No unfortunately - it's a closed-source Enterprise feature I believe. But the current differential is pretty large and nobody is shouting loud that they can fix it using the PGO that I've heard. And what will the PGO determine that a JIT can't also do?
A number of things actually. For one, most JIT implementations only optimize once, instead of continuously. The result is machine code that is optimized for the sorts of things done at startup, as opposed to steady state operation. For example, I have a Play app that takes a minute to start up. The JIT does a great job of optimizing the code that is called during the setup process, but the API code itself doesn't get much optimization.
With PGO, I can get a more representative profiling dataset, allowing the JIT to see actual production loads instead of startup loads.
And for CLI apps, PGO code starts up fast and never slows down to profile or optimize.
> For one, most JIT implementations only optimize once, instead of continuously. The result is machine code that is optimized for the sorts of things done at startup
Funny you should mention that - the author of this blog post has another post on fixing that problem for one specific (but very practical) case where we want to disregard some profiling information from the startup phase because it pollutes the genuine profiling information.
Most JIT compilers will go after any code that shows up as hot, regardless of when it executes. If your API code is that, even a minute after startup, it should really be getting optimized…
Think of a use case where you have an API whose most common behavior is to do nothing, waiting for a call... but which commonly gets calls that are very compute-intense for short bursts of time.
With the most common type of JITs, which profile once and compile once, I'm going to get code that is optimized for startup and initialization.
If I have an "advanced" JIT, which is constantly deoptimizing and reoptimizing for whatever it sees as the hottest path for some arbitrarily chosen snapshot of time, I'm going to see my compute-intensive code slowed down so that it can optimize and compile it every time that endpoint is called, but then subsequently deoptimized while it is sitting around waiting for something to do, ensuring that I have to go through the same process the next time it is called. You can actually see a lot of situations where this regime could be even worse than a naive startup-based single optimization, which is why it is actually not that common outside of dynamically typed languages.
With PGO, I can select a profile snapshot during a stress test, and get heavily optimized code specifically for the things that actually stress the server. And it will stay optimized for that use case.
The most commonly used JIT in the world is almost certainly the one in the OpenJDK, which is of the advanced kind. And it does not suffer from the problem you are talking about, because it only ever looks at code that is getting executed.
Basically, it will do something like: interpret a function for the first few thousand times it is executed, collecting execution metrics. After a threshold is reached, the next time that function is called, start compiling it, optimizing based on the execution metrics collected earlier. Leave behind a few hooks for future changes. Keep collecting metrics from function execution. If the profile of execution changes significantly, re-compile the function according to the new profile (possibly doing things like un-inlining).
This is perfectly aligned with a startup vs normal production use workflow. The only problems can appear if you have a continually changing pattern of execution through the same function.
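If anyone wants to watch that happen, here is a small sketch (the exact thresholds and tiers depend on the JVM version and flags, so take the numbers as illustrative):

    // Compile and run with:
    //   javac Hot.java && java -XX:+PrintCompilation Hot
    // After a few thousand interpreted executions, hot() shows up in the
    // compilation log, typically first at a lower tier (C1) and later at
    // the optimizing tier (C2) once profiling data has accumulated.
    public class Hot {
        static long hot(long n) {
            long acc = 0;
            for (long i = 0; i < n; i++) acc += i ^ (i << 1);
            return acc;
        }

        public static void main(String[] args) {
            long sink = 0;
            for (int i = 0; i < 100_000; i++) sink += hot(10_000);
            System.out.println(sink);
        }
    }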
HotSpot certainly does better than the vast majority of JITs out there, but it is definitely not perfect. If you have a JIT cache that is too small (really easy to do if you're not aware of the JIT settings or what they do), or your code does most of its heavy lifting in functions that are not called very often (because call frequency is the only heuristic it uses to prioritize optimization), you can easily trap yourself into horribly optimized executables that the JIT doesn't know its way out of.
And since Java 11, OpenJDK has inherited the PGO and JIT cache capabilities of JRockit, but people still use Java 8, so they have no idea what modern Java JITs are actually capable of.
I too keep hearing this repeated but without empirical support. Same goes for the idea that garbage collection can be faster than manually managed memory.
In practice, PGO has been the best possible compilation regime that I have ever found.
It's easy to come up with cases where you can do better than a fairly reasonable programmer: think of a loop that allocates at the top and frees at the bottom with enough stuff happening in between that the allocator doesn't just hand you back the same memory from a cache. Depending on the runtime, it might be better to just bump allocate and garbage collect all that memory at the end. You can find the benchmarks that'll show it, too. In reality, with actual applications garbage collectors have "good enough" performance with many benefits such as correctness and ease-of-use, but manual memory management done by a skilled programmer will outperform it.
> Same goes for the idea that garbage collection can be faster than manually managed memory.
It's really workload dependent and depends a lot on the GC involved (a pretty dumb one like Python's or Go's won't get you anything performance-wise), but a copying collector can achieve allocation way faster than a regular heap allocator (the allocation can be almost as cheap as allocating on the stack). If you can't avoid boxing and your objects aren't all long-lived, you can run circles around a program not using such a GC.
Yes, allocation can incrementally be made cheaper with a copying GC. But you have to realize that it still comes at a cost, and allocation speed isn't always a bottleneck. You have a runtime process intercepting your code state, scanning and analyzing your heap and making a decision to keep, copy, or free. For anything larger than a trivially small object graph, that additional processing can easily overwhelm any incremental allocation speed benefit that you get.
I get that it is theoretically possible for GC to be faster, but I literally have never seen it pan out in practice.
I'm very skeptical that the comparison really makes sense. Programs written in GC'd languages are so different in their allocation behavior from those written in manual-memory languages that I don't think you can meaningfully compare them. And comparing regular C++ with the Boehm GC is probably not that interesting, as Boehm is extremely limited by the semantics of C++.
And whichever way you look at it, the GC doesn't really have to outperform manual memory management to be extremely useful. It just has to be close enough, the correctness guarantees alone make it worth it as long as the performance difference is not too large, in many domains, not to mention the productivity boost and program readability.
> And whichever way you look at it, the GC doesn't really have to outperform manual memory management to be extremely useful. It just has to be close enough, the correctness guarantees alone make it worth it as long as the performance difference is not too large, in many domains, not to mention the productivity boost and program readability.
I agree with all of this. But it is a very different argument than the theoretical argument that garbage collection can be faster than manual memory management.
Rust, for example, is manually managed or compiler managed, depending on how you look at it. But its semantics are extremely crude and naive: allocate on creation, free once an object leaves its current scope. There is no optimization of heap fragmentation, there is no optimization of large bulk frees...the compiler literally just inserts the malloc/free statements for you in a static location that you don't get to choose.
And despite the crudeness of its memory allocation behavior, it runs circles around garbage collected languages all day long. Even fast ones...I have done more than a handful of comparisons between Scala, Java, Go, SML (MLTon), OCaml, F#, and Rust...and Rust always comes out on top. That doesn't mean the performance advantages are worth the extra pain of dealing with a borrow checker...for most of my code, I default to Scala for all the reasons you've mentioned. But I don't delude myself into thinking it is faster.
But Rust is designed to be blazing fast in many aspects, even if it uses a traditional malloc/free.
Most likely if you compare a Rust and Java program that both run a loop creating a million node linked list and then starting over, you'll see that the Java program is much faster. I'll actually try to do that and confirm, but I believe it should be the case.
Even if you were to allocate arrays (Vec) you may see the same.
But, of course, that is not a fair comparison. Rust gives you tools to avoid allocation in the first place. It also optimizes so many things that Java doesn't. Summing up all of the values in an ArrayList<Integer> in Java is going to be so much slower than doing the same in Rust it's not even funny. I'm not even sure that Java implements vectorization yet.
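For a rough sense of just the Java side of that ArrayList<Integer> point, a naive timing sketch (not a proper benchmark - JMH would be the right tool, and the exact numbers will vary):

    import java.util.ArrayList;
    import java.util.List;

    public class SumBoxed {
        public static void main(String[] args) {
            int n = 10_000_000;
            List<Integer> boxed = new ArrayList<>(n);
            int[] primitive = new int[n];
            for (int i = 0; i < n; i++) { boxed.add(i); primitive[i] = i; }

            long t0 = System.nanoTime();
            long s1 = 0;
            for (Integer v : boxed) s1 += v;   // unboxing + pointer chasing
            long t1 = System.nanoTime();

            long s2 = 0;
            for (int v : primitive) s2 += v;   // contiguous ints, vectorization-friendly
            long t2 = System.nanoTime();

            System.out.printf("boxed: %d ms, primitive: %d ms (s1=%d, s2=%d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
        }
    }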
It would be very interesting to see what performance a Rust program would get with a good, concurrent parallel generational copying collector. Perhaps in a few years there will be some interest in that.
What you're really seeing the benefit of here isn't manual memory allocation but stack allocation. Hammering malloc/free in Rust is slower than using even a pretty simple GC which is why there are crates to do arena based allocation.
As a more concrete example splitting strings by copying in Rust and hammering malloc and memcpy is 4-5x slower than using slices in Go and letting the GC deal with keeping things alive.
You might say that's not a fair comparison but few businesses can tolerate the messing around that is writing zero copy Rust.
malloc implementations actually do optimize by delaying the work of free(). They do a lot of complex stuff hidden behind those function calls to improve performance.
> You have a runtime process intercepting your code state, scanning and analyzing your heap and making a decision to keep, copy, or free.
Most generational GCs don't call free (the ones in Java can, but not per object; the reason for doing it is to return memory to the OS). The GC will scan living objects, but only those it is interested in. So if the decision to be made is which newly allocated objects should be copied over to the old gen, the GC will avoid scanning objects known to already be old.
The sentence you wrote made it seem like it scans the entire heap and decides which objects should be kept, which should be copied and which should be freed, but this is far from the case.
Since allocation with a GC is usually much cheaper than in non-GC systems (bump allocator) and since free is rarely, if ever, called, you can in theory get better performance from a GC than in a system with manual memory management. Of course, it all depends on the system.
The systems I write (mostly CRUD web apps) benefit from it, because requests to my service allocate little and run pretty fast, so their objects are unlikely to survive to the next GC, and I mostly pay for bump allocation (which is cheap).
However, while the performance might be better, my systems are prone to unexpected pauses when the GC does run. That pause might be less than the time spent in free in a manually memory-managed system, but of course saving up all that time and spending it in one place does make it more noticeable.
Thankfully, Java's G1 and other collectors have managed to reduce these pauses so they're no longer noticeable for my use. Pauses are less than 100ms (usually much lower) which is about the same variation I get from calling a third-party http service.
Yup, and many GCs have the freedom to move objects around due to language semantics so they can do things like munmap pages if they want to rather than calling free.
> Pauses are less than 100ms (usually much lower) which is about the same variation I get from calling a third-party http service.
Note that a pause for calling a web service is different than a GC pause, because in the former you can do other useful things while you wait.
You're missing the point here: modifying tcmalloc can give you a percentage-point improvement, but a copying collector can speed up allocation by orders of magnitude (the allocation is done in a contiguous memory space, every object adjacent to the next, and allocating becomes just a few asm instructions, like 5 or something).
It's like comparing speed of a bike vs airplane: the brand and the quality of the bike doesn't really matter…
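To make the "few asm instructions" point concrete, here is a toy bump allocator (purely illustrative: a real nursery works on raw memory with per-thread TLABs and is emitted inline by the JIT, not as a Java method over a byte array):

    // Allocation is essentially a bounds check plus a pointer increment.
    final class BumpArena {
        private final byte[] space;
        private int top = 0;

        BumpArena(int capacity) { space = new byte[capacity]; }

        // Returns the offset of the new "object", or -1 if the arena is
        // full - which is where a real copying GC would evacuate the
        // survivors elsewhere and reset top back to 0.
        int allocate(int size) {
            if (top + size > space.length) return -1;
            int obj = top;
            top += size;
            return obj;
        }
    }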
People repeat this a lot but it's not entirely true and it's not even close to 10x faster. More like 10%.
Freelist based allocation is actually very nearly as fast as bump pointer allocation. Like 10% if you compare the allocation function in isolation and 1% if you benchmark the whole object allocation path.
Some older malloc implementations like Hoard have bump-pointer allocation into empty pages, but the newest allocator implementations found that the branch prediction cost of having both paths outweighed the benefits.
Reading PDFs on mobile is annoying so I can't comment on the content of your first link, but your claim that a bump allocator is no more than 10% faster than a free-list-based one is really suspicious:
- first, as you said, people repeat the opposite a lot, including world class GC experts.
- then, unless I'm missing something, that claim sounds equivalent to claiming that heap allocation is no more than 10% more expensive than stack allocation, which is arguably completely false.
That's why I linked the paper from JVM GC researchers with the direct quote on freelist vs bump pointer: "Finding the right fit is about 10% slower [13] than bump-pointer allocation."
I think that’s my point. I get that JIT can produce faster code, I get that java is good enough for most things. But if you’re already in the performance critical regime there are other things you have to do where just having JIT available isn’t some magic bullet.
A possible magic solution is to load JIT state from a previous execution and have good manual-optimization features, similar to PGO. This would give JITs the best of both worlds: being able to reach strong peak performance while also not requiring manual tuning.
Except that PGO is in the box for most production quality JITs, the PGO data is even optimized between executions, while most developers never bother with using the PGO toolchains for languages like C and C++.
Even if they lose in micro-benchmark championships, it hardly matters in most enterprise codebases.
I don’t believe JITs can ever beat AOT compilation. However, if it can achieve parity in the execution of the critical path, that is sufficient to call it a win.
> I don’t believe JITs can ever beat AOT compilation.
In certain cases [1] JIT can beat AOT by a decent margin because it has access to live information that isn't available at compile time. In theory you could AOT compile, profile, and recompile but in practice my understanding is that nothing beats doing it live.
Consider that if the workload suddenly changes in production and your heuristic violates an invariant as a result, you can deoptimize, reprofile, and recompile everything on the fly.
But wait, you say! Can't an AOT compiler do that? But no, the state space is almost certainly going to be too large by many orders of magnitude. So unless your AOT compiler effectively inserts a JIT compiler ...
[1] Sorry, no examples to hand. Nobody knows but I'm actually a dog. Don't trust me!
Or .NET, which entirely lacks an interpreter, and JIT compiles everything that it executes. (Like Julia, if I understand correctly.) JIT != mixed-mode execution.
Some people make a distinction between 'just-in-time compilation' (like .NET - it's normal static compilation, but done just before first execution) and 'dynamic compilation' (like Java, which chooses when to compile, uses dynamic information, may re-compile, etc.)
In the Julia community they frequently use "just-ahead-of-time" for the compiler strategy of inferring all downstream types and methods to dispatch and then compiling them all at once, and "world age" for the period spent purely running a static program, which usually lasts until you go back to the global namespace or need any dynamic feature. Julia does not have an interpreter outside of the debugger.
Julia compiles at runtime, which makes it a JIT, but it does not actually use runtime information to compile (except to know what to compile, and maybe some other things) or have a concept of "hotness".
First of all! Nice post :)
Just some random things:
The first paragraph needs some commas, I think.
In the first code block you probably want to have `print(variable_x)`
[a] https://carolchen.me/