GCC: Improve on memory cost in coloring pass of register allocator (gcc.gnu.org)
83 points by vmorgulis on Jan 23, 2016 | 40 comments



Summarizing the interesting bit for non-compiler-specialists: this is an improvement to GCC's register allocator which should make compiled-with-GCC programs slightly faster. The SPEC INT benchmark mean score went from 3717 to 3730 (about 0.35%), and SPEC FP from 4752 to 4775 (about 0.48%). That doesn't seem like much, but making a good compiler in practice tends to mean stacking lots and lots of improvements of this magnitude on top of each other.


To add just a bit more:

When a procedure assigns to and reads its local variables, it's preferable to keep them in registers for performance reasons, since memory access (even a cache hit) is much slower by comparison. Unfortunately registers are a limited resource, so we would like to reuse them when a variable, though in scope, isn't "live", i.e. currently in use.

This whole area of optimization is called register allocation, and the optimal solution reduces to the graph coloring problem, which is NP-complete. As a result there are myriad approximation algorithms with knobs to turn, spending a bit more compile time for a bit better allocation. This patch appears to be the turning of one such knob.
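To make that concrete, here is a toy sketch of the idea (illustration only, nothing like GCC's actual IRA code): treat each variable's live range as an interval, add an interference wherever two intervals overlap, then greedily hand out one of K registers, spilling when none is free.

    // Toy register allocation sketch (illustration only, not GCC's IRA):
    // overlapping live ranges interfere; greedily colour with K registers,
    // spilling when no colour is free. Real allocators order by spill cost.
    #include <cstdio>
    #include <vector>

    struct LiveRange { int vreg, start, end; };   // half-open [start, end)

    int main() {
        const int K = 2;                          // pretend we only have 2 registers
        std::vector<LiveRange> lr = {{0,0,10},{1,2,5},{2,4,12},{3,6,8},{4,11,14}};
        int n = lr.size();
        std::vector<int> colour(n, -1);           // -1 = spilled / unassigned
        for (int i = 0; i < n; ++i) {
            std::vector<bool> used(K, false);
            for (int j = 0; j < i; ++j)           // does an already-coloured range overlap?
                if (colour[j] >= 0 && lr[i].start < lr[j].end && lr[j].start < lr[i].end)
                    used[colour[j]] = true;
            for (int c = 0; c < K; ++c)
                if (!used[c]) { colour[i] = c; break; }
            if (colour[i] < 0) std::printf("v%d -> spill\n", lr[i].vreg);
            else               std::printf("v%d -> r%d\n",   lr[i].vreg, colour[i]);
        }
    }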


Graph coloring is a small part of register allocation, and not the hard part. It's also not NP-complete on interference graphs built from SSA form, which most modern compilers (though not GCC) use in the backend. In fact, it's optimally solvable in linear time if you build SSI form, which can itself be done in linear time.
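The key fact behind that (a generic sketch of the textbook result, not compiler code): interference graphs of SSA-form programs are chordal, and a chordal graph can be coloured optimally by running maximum cardinality search and colouring greedily along that order, with no backtracking.

    // Optimal colouring of a chordal graph (which SSA interference graphs are)
    // via maximum cardinality search -- textbook sketch, not GCC/LLVM code.
    #include <vector>

    std::vector<int> colour_chordal(const std::vector<std::vector<int>>& adj) {
        int n = adj.size();
        std::vector<int> weight(n, 0), colour(n, -1);
        std::vector<bool> done(n, false);
        for (int step = 0; step < n; ++step) {
            int v = -1;                            // pick the unvisited vertex with the
            for (int u = 0; u < n; ++u)            // most already-visited neighbours
                if (!done[u] && (v < 0 || weight[u] > weight[v])) v = u;
            done[v] = true;
            for (int w : adj[v]) if (!done[w]) ++weight[w];
            std::vector<bool> used(n + 1, false);  // greedy colour along the MCS order
            for (int w : adj[v]) if (colour[w] >= 0) used[colour[w]] = true;
            int c = 0; while (used[c]) ++c;
            colour[v] = c;                         // optimal for chordal graphs
        }
        return colour;                             // max colour + 1 = registers needed
    }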

The optimal spilling and rematerialization problems have not yet been shown to be solvable in polynomial time on these forms, but folks are working on it.


Also, something being NP-complete does not say much about whether it is hard to do in the real world. Exponential is fine, as long as the constant factor, the value of n, and the exponent aren't too large.

Here, real world cases likely will have only a fairly limited number of registers and only a limited (but potentially very long) length of code to optimize across.

It would be interesting (and quite an accomplishment, even if it doesn't give significant performance gains) to combine this with whole-program optimization (imagine optimizing the calling conventions on a per-function basis, and considering generating code for a function twice, with different calling conventions) and/or with the code that decides what functions to inline, but AFAIK nobody does that.


"Here, real world cases likely will have only a fairly limited number of registers and only a limited (but potentially very long) length of code to optimize across. "

You can do a good job with polynomial-time heuristics (Chaitin's heuristic) for coloring; however, you are wrong about the code. It is often very weird. At the same time, most of the performance issue is not coloring, or trying not to spill at all. The hard problems are deciding where to split live ranges (which introduces copies), where to merge existing live ranges (which removes copies at the cost of more interferences), where to spill, and where to rematerialize.
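To make the coalescing trade-off concrete, here is a sketch of Briggs' conservative test (generic illustration, not GCC's implementation): merging two copy-related live ranges removes the copy, but the merged node inherits both neighbour sets, so you only merge when it is guaranteed to stay colourable.

    // Briggs' conservative coalescing test (sketch, not GCC code): merging
    // live ranges a and b is allowed only if the merged node has fewer than
    // K neighbours of "significant" degree (>= K), so it stays colourable.
    #include <set>
    #include <vector>

    using Graph = std::vector<std::set<int>>;   // interference adjacency sets

    bool briggs_can_coalesce(const Graph& g, int a, int b, int k) {
        std::set<int> merged = g[a];
        merged.insert(g[b].begin(), g[b].end());
        merged.erase(a);
        merged.erase(b);
        int significant = 0;
        for (int nb : merged)
            if ((int)g[nb].size() >= k) ++significant;
        return significant < k;
    }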


The bit about SSI form is news to me. I have more research to do.


See http://web.cs.ucla.edu/~palsberg/paper/PereiraPalsberg08.pdf Look at page 10.

(Fernando was once my intern, back in the day ;P)


I thought L1 cache access cost the same as register access on Intel machines. Is this not true?

Just something I had heard somewhere; I don't have data to back it up.


No, it's not true. Even if L1 accesses were only 1 cycle, they would be slower. The CPU will try to execute instructions out of order and needs to detect dependencies and possible conflicts between instructions. This is easy with registers: if two instructions name the same register, there may be a conflict or dependency, and that can be tested right after decoding the instruction. With memory it's much more complex, because the memory addresses have to be calculated before any comparisons can be made, so it's much harder to reorder memory instructions. There are some exceptions, like push/pop pairs, but in general memory accesses will be significantly slower.


Makes sense, thanks!


I think register access is technically faster than an L1 hit, but due to store forwarding from the write-back cache, this extra latency is usually not visible, so in practice they're the same.


You need two additional (micro)instructions, a store and a load. Store forwarding will save you from going to the cache and back, but it won't be as fast as register forwarding.


Yes. In particular store forwarding is slightly slower [1] than (or, in Skylake, as fast as) a plain L1 load. This still beats waiting for the dependent store to complete, but it is nowhere as fast as using a direct register reference.

[1] i.e. higher latency.


But is 0.5% statistically significant? How do I know it isn't natural variation or measurement error?

He must have the most stable setup in the world if he has confidence in those results.


You know way way more about compilers than I do, but the key is in this:

Furthermore, there exists a path from h to a use of v which does not go through d. For every node p in the loop, since the loop L is a strongly connected component of the CFG, there exists a path, consisting only of nodes of L, from p to h. Concatenating these two paths proves that v is live-in and live-out of p.

I have no idea how to formally prove that assertion (it's logically sound, afaik; someone like Walter Bright will have to jump in, he probably knows GCC inside-out), but if correct, the code change is correct semantically. The computational gain can be trivially exhibited (again, no idea how to 'prove' it) by throwing code at it and seeing the RTL emitted. (I agree it lacks 'rigor' in the academic sense, but the list to which he submitted his code change, and the fact that he's at Xilinx and sent it to his colleagues, make me confident enough that the change is sound.)

Edit: Fair enough. You've convinced me. (Though the SPEC2000 resulting numbers still have me in the 'maybe' category - your argument below is more sound than my ignorance.) Anyone reading this should defer to Danny's posts rather than mine for the time being (after a day or two, read the resulting responses to see the expert analyses on the mailing list). Keeping this post intact solely for continuity for the reader. RE: Walter - I'd assume he'd keep up with the competition and know enough about the RTL as it hasn't changed extensively in the last decade to make a valid opinion w/r/t changes. Either way, gracefully upvoting Danny's two posts ;)


"someone like Walter Bright will have to jump in, he probably knows GCC inside-out"

FWIW, walter has not contributed a single patch to GCC i can find in the past 10 years (It may be longer, i stopped looking) :)

Not that this makes him bad in any way, mind you, he's just not who i'd go to for gcc expertise. That would be someone like Richard Henderson, who is pretty much never talked about, but has touched pretty much all parts of the compiler and always does great work.

Also note that embedded developers in general do not have a good reputation among compiler people for doing "sound work". Again, not that this is a characterization of their engineering skills - They often have very tight deadlines, so the amount of time they can spend figuring out things instead of applying bandaids tends to be pretty limited. So they patch something, submit it, and move on.


"walter has not contributed a single patch to GCC i can find in the past 10 years (It may be longer, i stopped looking) :)"

It's worse than that. Because LLVM and GCC are copy-lefted and his compiler has a proprietary backend, he forces himself not to even look at their code in order to make the lawyers happy.


Haha, I once got sued by a company claiming I stole code from them, and their lawyer got sent packing when it turned out they had stolen the code from me. (I had a registered copyright on it.)

One thing I absolutely adore about Github is the audit trail about who contributed what and when. It's a solid gold defense, and a way to limit the damage if someone does check in bad code.


(LLVM is under a BSD-style license and thus not copylefted, although it does require attribution.)


No, I've never contributed a patch to GCC. No need to look :-)


Walter here. No, I don't know gcc at all.

The frustrating thing about many optimizations is they'll improve some code generation and worsen others, because they all interact with each other. They're like a pharmaceutical - sometimes they cure, but have deleterious side effects. Most of the work in an optimizer is trying to make an optimization work better most of the time and limit the downsides.


I've no idea about the specifics either, but I've done a little bit of compiler optimization in the JIT setting. What you say sounds right, and I'm filing it in the same mental bin as "when you're looking at which parts of the state + variables depend on others, you're most of the time better off reasoning from the initial state and input variables, instead of trying to do clever tricks of also reasoning backward from the output variables and combining them, resulting in worse approximations"

Or maybe it's something else that has to do with loops the way they are written.


SPEC is a geometric mean, so 0.5% is very significant.

(Assuming you are accurately measuring the parts)
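For a rough sense of scale (back-of-the-envelope, not the official SPEC methodology; CINT2000 has 12 component benchmarks): a 0.5% change in the geometric mean means either roughly everything moved by about 0.5%, or one benchmark moved a lot.

    // Back-of-the-envelope: what does +0.5% on a 12-benchmark geometric mean imply?
    #include <cmath>
    #include <cstdio>

    int main() {
        const int n = 12;                       // CINT2000 has 12 component benchmarks
        double target = 1.005;                  // +0.5% on the geometric mean
        double single  = std::pow(target, n);   // if only one benchmark moved: ~ +6.2%
        double uniform = target;                // if every benchmark moved equally: +0.5%
        std::printf("one benchmark alone: +%.1f%%, all benchmarks: +%.1f%%\n",
                    (single - 1) * 100, (uniform - 1) * 100);
    }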


Does using the geometric mean do something to the size of the error? If the geometric mean of the original implementation varies 1% under normal conditions then this is still not a significant result is it? I have benchmarks where my geomean easily varies 1% even in the most stable conditions I know to give it.


It's a benchmark engineered to measure the performance of compilers and CPUs. It's healthy to be skeptical but in general if you get a better score, it's a real improvement. You should be able to safely exclude measurement errors and other mundane issues.

These benchmarks have a slight issue that they might not represent real world performance, but compilers and CPUs tend to be optimized against them - because their timing and measurement is generally more reliable than ad-hoc real world measurement. It might lead to slight "over-optimization" which doesn't reflect real performance but at least it's reliable and repeatable.


To know if it's statistically significant, it's not important whether the benchmark uses the arithmetic or the geometric mean.

Did they repeat the benchmark? How many times? Is that information available? (Perhaps it's available for the master branch and not for the new commit).


"To know if it's statically significant,it's not important if the benchmark use the arithmetic or the geometric mean. "

Of course, but i was answering whether 0.5% matters, not whether it is valid.

This is explicitly why i said "assuming you are accurately measuring the parts".

It seems you are just making a side point for the heck of it, despite completely agreeing with what i said.


I'm not sure what you mean by 'valid' or 'matters' or the difference between them though.

I'm asking if it is more likely that the 0.5% difference is due to the change in the implementation than normal variation in the original implementation.

I don't know how you can say that this is the case based only on the information that it's a geometric mean. You need to know the variation in the reference implementation, and you don't.

The facts that he measured a 0.5% performance increase and that he used a geometric mean are not enough information to say that it's statistically significant.


I understand exactly what you are saying. I'm saying you are being super pedantic here, since it's pretty obvious I meant valid to mean statistically significant, and matters to mean whether improving SPEC by 0.5% translates into a real-world application performance benefit.

I'm not sure why you want to be so pedantic, but it doesn't seem worthwhile to me since nobody was confused.


> it doesn't seem worthwhile to me since nobody was confused.

fwiw I was confused about what you meant by valid and matters


I assume that after years of optimization work you're not going to get more out of a small tweak to one part of the compiler unless you harness new CPU instructions.


Could this also be improved in LLVM? Or is this specific to GCC?


So, this is actually only true for natural loops :)

In particular: "Consideration of only enter_freq is based on the concept that live Out of the entry or header of the Loop is live in and liveout throughout the loop."

This is demonstrably false for loops with >1 exit, or multiple latches.

IE it assumes that there is a single loop exit and latch that everything goes through.

That said, most compilers, including GCC, just don't care about loops with these kinds of weird control flow structures, and either duplicate nodes, or just don't consider them loops (https://gcc.gnu.org/onlinedocs/gccint/Loop-representation.ht...)

More interestingly, it's not clear why the patch should make any difference at all, nor is it explained. It just makes an assertion that it is better to do it this way (" This increases the updated memory most and more chances of reducing the spill and fetch and better assignment.")

(Expanded on this a bit to make it more clear).

When calculating the cost of spills/moves/etc, usually you want to take into account how often a block is executed. This is because you often have multiple places you can put spills/moves/etc to get a register. Multiplying by entry/exit frequencies tells the compiler "If you need a register, it is more expensive to spill something that occurs in a block that happens all the time, than something that occurs in a block that happens infrequently". As an aside, calculating predictions on loops and nested loops statically is non-trivial (it requires estimating the trip count of each loop).
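A minimal sketch of that cost model (hypothetical names, not the actual ira-costs code): every use or def of a pseudo that stays in memory turns into a load or store, weighted by how often its block is predicted to execute.

    // Sketch of frequency-weighted memory cost (hypothetical names, not GCC's
    // ira code): references in hot blocks make a pseudo expensive to spill.
    #include <vector>

    struct Block { double freq; };               // estimated execution frequency
    struct Ref   { int block; bool is_def; };    // one use or def of the pseudo

    double memory_cost(const std::vector<Block>& blocks,
                       const std::vector<Ref>& refs,
                       double load_cost, double store_cost) {
        double cost = 0.0;
        for (const Ref& r : refs)                // if the pseudo lives in memory,
            cost += blocks[r.block].freq         // each def becomes a store and
                  * (r.is_def ? store_cost       // each use becomes a load
                              : load_cost);
        return cost;                             // compared against a register's cost
    }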

In any case, this patch changes the move/spill cost calculation to avoid taking into account the exit frequency of blocks (well, really edges, but let's ignore that) in loops, under the argument that it should be the same as the entry frequency.

The problem with that is exactly the last sentence. It shouldn't matter. This change should, in practice, be a no-op, or at best, is a random cost model change.

From another perspective, it just divides all the previous frequencies of things in loops by 2 compared to what they were before (assuming the exit frequencies are sane. If they are not, that is a problem to be solved instead of this one). There is no given reason this is necessarily a good thing to do :).

Basically, i'm not sure this patch will go in, because it's not clear it's really solving whatever underlying problem exists.

I expect someone to come along and tell them to provide better analysis of where exactly it's helping and what the cost numbers are that this is changing "for the better".


Why do you think this is a no-op? Looking at the diff, the code was summing the entry and exit counts. So, by the same argument, that means certain things were being double counted (considered higher cost) and now aren't.

I assume (like you), that someone will come along and point out that this doesn't really apply to all loops.


It's a no-op in most cases because the cost of spilling a given loop variable is still the same relative to other loop variables. But yes, the only effect it should have is to possibly value loop variables low enough to spill them instead of non-loop variables (note that non-loop variables are calculated using both exit and entry frequencies). That is almost certainly a bad idea in general; execution frequencies are what you should use whether in loops or not. Most likely the real issue is that a bunch of loops are having their execution frequencies (relative to non-loop blocks) overestimated, and in reality the loop doesn't execute much.
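A made-up example of that relative effect: suppose a loop block has enter_freq = exit_freq = 90 and a non-loop block has enter_freq = exit_freq = 10. Before the patch both kinds of allocno are weighted by enter + exit, so 180 for the loop one and 20 for the non-loop one, a 9:1 ratio. After the patch the loop one is weighted by 90 and the non-loop one still by 20, a 4.5:1 ratio. The ordering among loop allocnos is untouched, but every loop allocno now looks half as expensive to spill relative to non-loop allocnos, so marginal decisions can flip toward spilling loop variables.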


Ah, this explains it! I was thinking about the loop as presented and it seemed to boil down to "do your data flow correctly" but the situation is more subtle.


IME, optimal register allocation isn't so important anymore. In normal code, function bodies are so small that all local variables can be kept in registers. I.e., if the number of local variables is small, like 5, then the register allocation is optimal by default. It would be interesting to see before-and-after assembly code from C functions that gcc generates.


At least C++ depends for its performance on very aggressive inlining of small (often one-line) functions.

It is not unlikely that your hundreds of tiny functions become one massive "function" after high level optimization.


For extremely optimistic definitions of "normal". Also, if your "five variables" have non-overlapping live ranges, then you might be able to get it down to only 2 or 3 registers, so I wouldn't say it's "optimal by default".
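A toy illustration of that (contrived code, not from the thread): six values, but never more than two live at once, so two registers suffice even though there are five locals.

    /* Toy example: n plus five locals, but no more than two values are live
       at any point, so two registers are enough for all of them. */
    int f(int n) {
        int a = n + 1;      /* live: n, a              */
        int b = a * a;      /* a dies here; live: n, b */
        int c = b - n;      /* b dies; live: n, c      */
        int d = c + 3;      /* c dies; live: n, d      */
        int e = d ^ n;      /* d and n die; live: e    */
        return e;
    }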


Maybe if you disable inlining and beat your coworkers up when they add more code to a function. Where I live functions with several hundred lines are still common.



