Parallel GCC: a research project aiming to parallelize a real-world compiler

modeless · on Sept 14, 2019

There are so many tools to parallelize compilation, but comparatively little effort has gone into reducing the vast amounts of redundant work that compilers do. A staggeringly large percentage of compilation time is simply wasted duplicating work that was already done hundreds of times over.

Tools like ccache are really primitive compared to what is possible. Change one bit in a header file and you have to recompile your whole project, even if the output ends up being bit-identical.

Zapcc[1] is more advanced, but still far from ideal. What I really want is a compiler that can incrementally update the binary as I type code, and can start from a snapshot that I download from someone else so that nobody ever needs to do a full recompile. It would require a radically different compiler infrastructure, but I don't see any reason why it would be impossible.

[1] https://github.com/yrnkrn/zapcc

fanf2 · on Sept 14, 2019

There is an interview with Anders Hejlsborg https://youtu.be/wSdV1M7n4gQ where he talks about how the Roslyn and TypeScript compilers are designed to support IDEs, with a structure that allows the compiler’s model of the program to be incrementally updated as the code is modified. So there has been some fairly prominent work along the lines you want...

imtringued · on Sept 14, 2019

One can also stop using boost. The last C++ project I worked on spent at least 30% of the compile time parsing boost/filesystem headers. I don't know what type of crackpot designed boost/filesystem but when you trace the includes (and the included includes) then you will see a reasonable amount for your own application. Maybe up to 50 headers that are part of the actual project and then comes a list of hundreds of boost header files.

I then proceeded to simply copy the includes of a random .cpp file and put it into a new .cpp file with nothing but includes. I removed all of them one by one except boost/filesystem. Compile times didn't change and if they did then only by a few milliseconds. A single include for boost/filesystem took 2 seconds to compile. I didn't even instantiate any templates in that simplified .cpp file so in practice the overhead is even higher. Isn't that absolutely insane?

segfaultbuserr · on Sept 15, 2019

> One can also stop using boost. The last C++ project I worked on spent at least 30% of the compile time parsing boost/filesystem headers.

I've seen one experimental IRC bot project that heavily uses template expansion, with the asynchronous model and other components in Boost, as a demonstration of advantages of modern C++. The whole compile requires 2 GiB+ memory, if multiprocess make is used, 4-8 GiB! The scale of the project is nowhere close to Chromium, building the Linux kernel doesn't need much RAM either, it's just a IRC bot! But the hardware requirements for using Boost and C++ template expansion is spectacular!

The actual binary runs well though, and does not need such hardware. And the developer simply saw it as a small price to pay for using these language features.

big_chungus · on Sept 15, 2019

Much of boost's compilation time is a result of being so template-heavy. It's still got to parse and do much of the generic processing, even if you don't actually instantiate a template. Whether boost's style is the pinnacle of elegance or ugly as sin is still somewhat in contention, but no one wants to re-write it. I am some what put off as some one who started with C, but most of the C++ people reference it regularly as an excellent example of code style and structure.

Const-me · on Sept 15, 2019

Microsoft's ATL and WTL are comparable to parts of standard library and boost. They build very fast, despite also based on templates. Often run much faster too, e.g. CAtlMap is typically faster than std::unordered_map by an order of magnitude, due to cache friendliness. The only downside, it's Windows only.

anaphor · on Sept 14, 2019

You need something like http://adapton.org/ for C/C++ compilers really

kachurovskiy · on Sept 14, 2019

First time I spend a day on a partial-compile-only bug I switch if off and it's good old full recompilation from that point :)

slacka · on Sept 14, 2019

Why not just do a full, clean build when you hit strange bug? If your workflow is anything like mine, partial-compile and tools like ccache have saved months of hours I would have wasted waiting for the compiler.

NullPrefix · on Sept 14, 2019

>when you hit strange bug

Any bug is strange until you figure it out.

Animats · on Sept 14, 2019

That's a big problem with build systems based on timestamps, not checksums. Also, unless the build system can see all the inputs to the compilation, there's the potential for trouble.

carlmr · on Sept 14, 2019

That's why Ninja build is so amazing for C/C++. It can use the whole dependencies output from the compiler and knows exactly what to recompile, even if only a header changed.

ori_b · on Sept 14, 2019

...You know, you can do that with plain old make, too?

tempguy9999 · on Sept 14, 2019

That happened to a colleague of mine with the Eiffel compiler (a long time ago). He was pissed off. Ditto full compilation thereafter.

Had something similar in our office with scala & SBT, but SBT is a POS anyway.

the8472 · on Sept 14, 2019

Another waste of resources is that global dead code elimination happens really late (LTO). So functions are compiled and then discarded much later instead of not compiling them in the first place.

snazz · on Sept 14, 2019

I have no idea how they do it, but Common Lisp compilers like SBCL and CCL compile each function incrementally, as you write them, ready to be played with in the REPL. This seems like a complex task in a language that doesn’t have the idea of an “image” that Smalltalk and Lisp do.

ndesaulniers · on Sept 14, 2019

I look forward to this. One that that will be important for reproducible builds is having tests for non determinism. Having nondeterministic code gen in a compiler is a source of frustration and dispair and sucks to debug.

earenndil · on Sept 14, 2019

My understanding from the article is that the code gen will still be deterministic--independent operations will be performed in parallel instead of in sequence, but data dependency will still be respected.

pjc50 · on Sept 14, 2019

The old-school approach to this was "distcc". At the company where we used it for C++ we had a small compile farm and usually did "make -j 50".

ndesaulniers · on Sept 14, 2019

Google has objfs, a authenticed network mounted drive that is an object file cache. Usually most of the tree has already been compiles so just your change needs to get recompiled and relinked.

Further, there's then GOMA which is the distcc equivalent. I frequently build Android with 500 cores in the cloud. For that reason, I don't think Google engineers will ever focus on the compiler speed of llvm. That said, LLD (llvm's linker) is the fastest linker in town by far.

exikyut · on Sept 15, 2019

Nice info. I've wondered for quite a while about the impact of the G level of compile parallelization :). A couple of followup questions, if I may:

- How long do incremental and non-incremental builds take with the build configuration you describe?

- How does core<->engineer allocation work? Is 500 cores an average number? High-priority work presumably commands more access, but I'm curious if allocation is generally team-specific or engineer-specific (eg, more responsibilities = more resources). (I guess the reason I ask this is that 500 cores sounds kind of impressive and I'm going "surely they'll run out of cores if they're handing them out like candy???"), but then thinking about it that's only like 11 computers...)

- I'm curious if you happen to know how Chromium gets built (IIUC Android builds it 2 (or is it 3?) times). It's tricky to distill (presumably/understandably deliberately so) how many cores the CI/trybot/etc infra is using. Overall my curiosity is what the core/time ratios/graphs are, ie how many cores translates to how short of a build.

- As an aside, I vaguely recall absorbing by osmosis several years ago that Chromium takes approximately 15 minutes to build on a Very Cool™ local workstation (HP Z series). Not sure how out of date this info is. Do engineers still do test builds on local hardware (and if yes, how long does this take? probably longer than 500 cores, heh); or is everything done in throwaway cloud instances nowadays?

Thanks in advance for whatever answers you're happy to provide :)

jart · on Sept 15, 2019

Google has unlimited resources. Cores and codebase size don't matter to them. Only concern is what's on the critical path, e.g. final link step. It's very rare for builds to not be incremental. AFAIK JavaScript is the only language that can't be compiled incrementally. Probably the only engineers who would need to run a build solely on local hardware are ones that help support open source tools.

joshuamorton · on Sept 15, 2019

Js can and does get compiled incrementally if people do it correctly. It's fully supported by blaze. It's just that a bunch of old codebases globbed everything, so it appeared no incremental.

jart · on Sept 16, 2019

Is that with the js_binary/library rules or something new? Last time I checked, all the work that goes into compiling the JS sources and minifying them into a single "binary" blob happens in the js_binary definition, and the js_library definitions traditionally served only to construct a total ordering of raw source source file dependencies.

joshuamorton · on Sept 16, 2019

Two partial answers: there's a newer better thing thats now the default and splits the building, but even under old js_library, there were some gains from not globbing everything together...

cromka · on Sept 14, 2019

Interesting. I searched for some more information on that but don't see anything except the GitHub repostoryitself. I wonder how would one go about integrating that into .Net stack.

jingwen · on Sept 14, 2019

Check out Bazel's remote execution feature and Google RBE (remote build execution) [1]. Bazel itself is language agnostic (language support comes in the form of rules).

(disclaimer: I'm an engineer on the Bazel team)

[1] https://docs.bazel.build/versions/master/remote-execution.ht...

bgmeister · on Sept 15, 2019

There's also an opensource goma client + server which proxies to RBE (or potentially other bazel remote execution API implementations). Which is great if you can't or don't want to use the bazel client, but can use a compiler wrapper.

https://chromium.googlesource.com/infra/goma/client/ https://chromium.googlesource.com/infra/goma/server/

Twirrim · on Sept 14, 2019

Distcc is neat, and invaluable, but it's parallelism on a different level to this, which is introducing parallelism at the single-file compilation level. It would be interesting to see the overall performance impact of this in combination with parallelism on a build level. It could even be harmful (but I doubt it) due to resource contention.

saurik · on Sept 14, 2019

That's great for super large builds, but the point of this is to deal with "I have one file I want to compile which is taking 30 seconds to compile, and I only have like three files to compile, so it would be awesome if I could use all of the cores on my laptop to do this compile and get it down to 10 seconds or something".

marmaduke · on Sept 14, 2019

This, parti ularly for template heavy C++.

Older versions of Stan on GCC 4.8 would take 60+ seconds to compile a single C++ file.

stochastic_monk · on Sept 14, 2019

I write template heavy code. I’ve recently refactored template instantiation across many object files. It’s faster, but a lot of work. It’d be great if I didn’t have to.

pstrateman · on Sept 14, 2019

distcc can be useful, but requires that the local build environment matches the remote 100%.

bgmeister · on Sept 15, 2019

Icecc (an old distcc fork) also distributes the toolchain, to avoid that particular problem.

https://github.com/icecc/icecream

IcePic · on Sept 14, 2019

No it doesn't, it sends preprocessed files just to not depend on remote env.

cat199 · on Sept 15, 2019

> it sends preprocessed files just to not depend on remote env.

to then be compiled by the host compiler ..

IcePic · on Sept 15, 2019

Sure, one compiler can't be egcs 2.95 and the other clang 8 (at least not in most cases) but the target compiler system doesn't even have to have include files, libs, libfoo-devel packages installed or they can be, but from old versions. As a proof of concept I once had a linux with gcc4.2.1 make c->.o files that later ran on OpenBSD. Would I trust that resulting binary? Sure not.

But in cases of "lets spin up 10 compile boxes with <same ubuntu within a week of updating>" or have something along the lines of Xcode which allowed you and your colleagues to help out each other with small work units in order to make a single compile run quite much faster, that is a definite possibility.

For that final release build, you might consider running it on one of those 10 VMs in the example above and have 'only' 9 others help out with the sub-parts in order to get some kind of .. guarantee.

If it takes a minute for a full build on one box, getting a hint after 10 seconds that you misspelled something and that it won't ever link decently is worth something too if you value your time as a developer.

Not saying it's perfect, only chipped in on the "must be 100% or it can't ever work", hopefully without needing to go into the "coach says we need to give 110% this game, and 120% if it's the finals" nitpicking.

boring_twenties · on Sept 17, 2019

This doesn't work for C++, as there is no standard ABI and thus no safe way to link object files from different compiler versions into the same binary.

andyayers · on Sept 14, 2019

FWIW, MSVC has had this ability for a few years now. It helps with both normal compilation unit compiles and with link-time code generation. See for instance Bruce Dawson's notes: https://randomascii.wordpress.com/2014/03/22/make-vc-compile... .

nwmcsween · on Sept 14, 2019

OT but why isn't there a compiler that simply does absolutely basic passes to get the IR into a 'normalized' form then apply optimizations based on a 'database' of super optimized 'chunks'? For unoptimized parts run the super optimizer.

exikyut · on Sept 15, 2019

So basically a code lowering engine that works a bit like https://en.wikipedia.org/wiki/Hashlife. Interesting!

(NB. Googling "lowering" turned up the seemingly-random https://news.ycombinator.com/item?id=14422944; that page isn't such a bad set of starting links, so I figured I'd include it.)

Eug894 · on Sept 20, 2019

Please, read this idea regarding use of superoptimizers: https://drive.google.com/file/d/1GSv89tiQmPDcnFEu4n4CqfaJcUJ...

UncleMeat · on Sept 15, 2019

Superoptimization scales to hundreds of assembly instructions. Its not feasible for overall compilation.

NilsIRL · on Sept 14, 2019

The use case is really niche.

The only time it comes in handy is when:

1. You have enormous files AND 2. You have a little number of files.

ris · on Sept 14, 2019

I'm not sure about that. The traditional way of exploiting parallelism for large builds is to work on a per-file basis. Want more parallelism? Fork more jobs. But this can fall apart in a couple of ways:

1. It can get extremely expensive in memory use. Your parallelism limit can be dictated by the memory your machine has. Spending your available cores over fewer simultaneous compilation units could also have benefits arising from data locality, even in situations where a machine's memory headroom isn't a concern.

2. Many large build processes have serialized steps which are currently unable to benefit from any form of parallelism. This is becoming even more the case with the rise of IPO & LTO.

But as ever, we'll probably have to wait to find out how large the practical benefits end up being from this work.

mrich · on Sept 14, 2019

It is quite a common case. First you do your scratch make with 800 jobs which maxes out your machine just from preprocessing :) Thereafter you only edit a small portion of the code and recompile a small number of files, which only uses a few cores. It's especially problematic for C++ with all the template instantiations. Some files in our codebase take 40+ sec to compile.

mxcrossb · on Sept 15, 2019

Agreed. I do know some projects that do a lot of code generation that end up with huge files, though this is probably largely caused by C++’s templates forcing too much into headers (which newer versions fix).

It’d also be nice if parallel make had better load balancing.

rumanator · on Sept 14, 2019

Yes, I agree. In a world where all projects are comprised of a myriad source files where each one takes a fraction of a second to compile, this kind of approach sounds a tad pointless.

loeg · on Sept 14, 2019

Huh, I thought compilation was already quite parallel even on many core machines. I guess that's only true for larger software projects (or ones with many small compilation units, at least). The step that could use some TLC and increased parallelism is linking.

On the other hand, maybe this project will help improve whole-program LTO link time. I see that is mentioned as future work for this effort:

> Parallelize IPA part. This can also improve the time during LTO compilations

Big kudos to Giuliano Belinassi, who seems to be the one driving this effort.

ape4 · on Sept 14, 2019

A tricky job to say the least

The_rationalist · on Sept 14, 2019

Llvm already has one process per compile unit I believe.

ndesaulniers · on Sept 14, 2019

Which is not parallelization. There's always the question of should I parallelize my compilation at the compiler level, or the build level (ie. Multiple translation units in flight). I think there's room for both, so you hopefully can get faster incremental compiles of a few number of TUs, but still have the old data level parallelization of multiple TUs.

blattimwind · on Sept 14, 2019

From my observations multi-threading the linker would be more useful, since for small changes in bigger projects the linking step usually takes a lot longer than compiling a couple units.

kardos · on Sept 14, 2019

In a similar vein, not all units cost the same to compile -- even with many jobs in a make -j situation, toward the end of the build there may be only a few units still in progress. These lengthy 'stragglers' might benefit the most from this parallelism and consequently reduce total compile time. Quite looking forward to this project landing in a stable release.

On the topic of linking-- can you sub in the gold linker for faster linking in your projects?

tacostakohashi · on Sept 14, 2019

There exists a multi-threaded linker called gold.

ndesaulniers · on Sept 15, 2019

LLD (llvm's linker) is also threaded but much faster than gold.

ef004 · on Sept 14, 2019

Ive never found linking to take more than 5 seconds with the standard linker which, but lld exists and its super fast in my experience. But i guess its only needed whon working on huge c++ like firefox or sth. like that.

https://lld.llvm.org/

barrkel · on Sept 14, 2019

Linking is isomorphic to tracing, moving GC. Effectively, the linker relocates objects in memory and writes a freeze-dried "heap" for the OS loader (another linker) to page in later.

Linker slowness ~= GC pause. The cost is approximately proportional to the size of the traced graph.

seamyb88 · on Sept 14, 2019

I also find that the linker is the issue. Yes for performance, but also for testing (to the point of failure) linux's 'out of memory' scenario (try make -j8 on the llvm source, for example).

DannyBee · on Sept 14, 2019

LLVM has actually been working through multithreaded compilation for years.

It actually works just fine (IIRC), but occasionally breaks when people do silly things :)

I believe you just need to enable LLVM_ENABLE_THREADS in the config.

Joky · on Sept 14, 2019

In LLVM, even with LLVM_ENABLE_THREADS, the compilation of a module is still sequential as far as I know.

The LLVM Context is the unit of isolation between threads. For instance the multi-threaded ThinLTO optimizer/codegen will use one LLVM Context per-thread, and so each thread is processing a single Module / TU.

The issue for concurrency is fairly deep in LLVM (use-lists, constant stored uniquely in context, etc.) that would make it really difficult to address intra-module parallelism.

MLIR for example is designed with this in mind and the pass-manager is already multi-threaded at every level of nesting (function passes would runs on two functions of the same Module in parallel). This is causing other complication/inefficiency in the infrastructure, I'm still not sure how much it is a good tradeoff, we'll see...

daddypro · on Sept 16, 2019

Could you elaborate on what MLIR has done differently wrt making it multithread safe and what complications it's causing and what the trade offs are..

Joky · on Sept 22, 2019

Sure: for instance in LLVM global values (variables, functions) have use-lists, so do constants. That means you can find easily every uses of a function: they are chained in a doubly linked list. In MLIR, it isn't the case. We use symbol name to refer to other global object, we lose the ability to easily find the uses of these global objects. On the other hand functions are well isolated and you can run Function passes in a multi-threaded fashion safely.

phkahler · on Sept 14, 2019

I only see parallelism at this level useful for JIT compilation. Interpreted languages, shaders, and similar. Those are good uses.

Not real valuable for building big projects that can build in parallel at the file level already.

jart · on Sept 15, 2019

I agree. At the end of the day, C++ headers are quadratic. Folks who don't respect that fact are going to end up with files that take minutes to build. Threads won't save them. Linearly increasing core usage can't satisfactorily address an exponential problem. It would also mean make -j#cores may need to be adjusted to make -j#cores/#threads to ensure the system doesn't get bombed if a bunch of boost-heavy files get scheduled for compilation at the same time.

If GCC devs try anyway and end up adding mutexes to all their data structures, it could potentially make things slower for everyone. It'd be overkill too. Last time I checked, GCC doesn't even memoize the O(n^2*m) stat system calls it does when processing includes (where m = # of -I / -isystem / -iquote flags). That would be a very simple patch to make things less bad. But it still doesn't solve the problem, which is that programming practices need to change.

dunkelheit · on Sept 14, 2019

So at a project that I was working on one of the compilation units got so unwieldy that the last part of the build process was basically waiting for it alone to compile (it contained a lot of template instantiations). The solution was of course to manually "shard" it into several files even if semantically it didn't make much sense. Now, you might argue that it was our shitty code that caused this but surely this is a fairly simple thing for a compiler to do automatically and not require the user do it.

namibj · on Sept 14, 2019

Even better: throw differential dataflow machinery at all the template handling of C++ code, interprocedural constant peopagation (and similar steps), and of course the linking itself.

It's sad these foundations only seem to exist in rust.

exikyut · on Sept 15, 2019

Hmmm.

I recently learned about c-reduce, which minimizes the size of reproducing C/C++ crash testcases by iteratively permuting the source code and invoking the compiler.

I can imagine a similar tool that takes a set of input file(s), carefully instruments the files somehow to determine the data interactions (a bit like the dataflow analysis mentioned in the sibling comment), and then iterates through different bucket-sorts (automatically invoking the compiler) until it finds some arrangement of locality-optimized input that also happens to compile the fastest.

On the one hand, this process would take hours - but on the other hand it can be lifted out of the compile/test cycle, and run eg overnight instead.

Optimizations might include tracing what you're editing right now and what that depends on, so active work can be relocated to the smallest discrete files possible. The system could just aim to minimize the size of all input files, but weighting what you're currently working on might produce additional speedups, I'm not sure.

In such a model, feeding in something like Boost would result in it eating all the templates that are never referenced.

To me, the biggest problem is that this entire infrastructure would need to understand very large parts of C/C++, and of course would also need to be faster than current infrastructure in order to actually speed anything up. I don't think there are any production-capable research analysis systems out there capable of doing this.

So, the likeliest path forward would be turning LLVM/GCC into something that can a) stay resident in memory (not fundamentally hard, just don't exit() :) ), b) be fed modified source code and accordingly traverse/update its analyses graph(s), and c) (most important) perform (b) efficiently (hah).

One major downside, apart from the total nonsemanticity of the actually-compiled output, would be the introduction of yet another hurdle to jump over to achieve reproducible builds.

I wonder if a design like this could be [part of] an intermediary first stage to getting something like incremental compilation into LLVM/GCC. Ie, it could be a (temporary) binary of its own that would allow for these features to be developed within a production-usable context that doesn't impact the behavior of the compiler itself; and then when it was properly built out, the compiler could be made to more and more progressively depend on it until either a) the compiler itself has the server-mode built in, or b) the changes are so dramatic the server-mode is not needed (unlikely).

I say all the above as a not-compiler person. I have no idea what I'm talking about.

coldtea · on Sept 14, 2019

>This page introduces the Parallel GCC -- a research project aiming to parallelize a real-world compiler. This can be useful in many-core machines where GNU Make itself can not provide enough parallelism

Huh? One wants a fast compiler even for a single compilation, not just for multiple ones that parallel make can handle as implied here...