
I don't have a great deal of experience tuning code, so would someone be able to explain how inlining is related to mutation of state? I'm not John Carmack or Bryan O'Sullivan, and I'm not sure I understand how purity would make things better.

We do inline our code in Haskell sometimes, but usually the real gains (in my limited experience, with numerics code) are to be had by unboxing, FWIW.




My interpretation:

Firstly, inlining has nothing to do with state mutation. He just happens to be talking about a codebase that does a lot of state mutation - e.g. a video game. Therefore the functions he's talking about inlining are state-mutating functions.

It also sounds like performance is a secondary, but still important, concern in this post. What he's really getting at is what the typical modular programming style does to our awareness of details, and therefore to our ability to understand and optimize our programs. The commonly held conviction is that smaller functions are better for writing understandable and correct code. He's saying this isn't necessarily the case - especially not when you're trying to optimize code and understand the interrelations between the state it's mutating. Modularity can often produce deep stacks while hiding and scattering state mutations. It can also obscure interrelations you should be aware of. In the kinds of scenarios where correctness, comprehensibility, and performance are paramount, having one big long function might just be better.


As a software automation engineer, I've made this argument in the past with regards to UI/systems test code bases.

Heavy modularization makes a ton of sense with utility code made to create and tear down fixtures, as well as with general utility functions like navigation through a user interface to get to the initial point of testing.

However, in the test, where you need to have complete understanding of the sequence of events in order to keep the test valid, it's much better to inline nearly everything that would affect simulated user or client flow even if that means duplication between tests.

That's exponentially more true when more than one person/team/org/whatever would be maintaining different tests or test areas. The last thing you want to do at that point is share test sequencing code, since it's so easy to subvert the flow in other tests using it by mistake.

It's a hard argument to make, because everyone gets SPOT, yo, Fowler rules, etc. They aren't wrong, but those rules really only apply when the interface is everything and how you fulfill the interface is irrelevant.

For some types of code--frame-accurate games being one, tests being another--the order of events is paramount. IMO, even mature patterns like Selenium's Page Object Model get this one wrong by encouraging test flow code to live in POM methods.

There are absolutely times where optimizing for understandability and being paranoid about implementation changes is the way to go.


>> It also sounds like performance is a secondary, but still important, concern in this post.

Not exactly. He points out that in real time systems, worst case execution time is more important than average execution time. You have to render a video frame in 1/60 of a second or it will need to be displayed twice. In that case, getting the job done faster on average doesn't matter. This can change a bit if something like power consumption becomes relevant - then you've got conflicting requirements.

Real time keeps performance a top priority; we just look at a different metric.


I think the article was about "manual inlining" rather than the kind of automatic inlining that a compiler does. It's less about raw performance (inlined functions have less function call overhead and open opportunities for cross-function optimization) and more about correctness and predictability of runtime. For example, he mentions that he prefers having a dumb and sequential code flow that always runs at 60 FPS over something clever and full of conditionals that runs at 120 FPS most of the time but falls to 30 FPS every once in a while. In addition to the predictable runtime, there are also many bugs that happen when stateful functions are called in the wrong order, and this is less of an issue if you always call them once in the same order. For example, he mentions that he would rather check for `health <= 0 && !killed` at the end of your game loop instead of potentially calling KillPlayer() in 20 different places, which might end up doing slightly different things in each frame.

As for pure code, the reason he has become more bullish about functional programming is that it is, by definition, less susceptible to all these subtle order-of-execution issues. You are free to structure your code in whatever way you like and split it into as many functions as you want, and still have the peace of mind that you will never access an uninitialized variable or use a dead object.


[deleted]


"inlining" means that instead of hand-writing blocks of manually inlined code, you write a function, which can be inlined for you.

That's backwards: Carmack advocates to avoid writing such functions, manually inlining their code into the caller (with the possible exception of pure functions).


In his context, it really has to do with references.

So, if I have a value A, and it's a reference type (either by pointer, or by just the nature of the language), then if I pass that object into a function DoSomething, DoSomething may change A without my knowledge, and cause behaviors further on that I wasn't expecting.

If I inline DoSomething, I see exactly what I'm doing to A, what I'm changing on it.


Inlining doesn't always work in side-effecting languages. E.g. only if `f` is pure are the following two fragments identical.

    let a = f () in a + a

    f () + f ()


Inlining always works, by definition, otherwise your compiler is buggy.

In side-effecting languages, inlining cannot make function calls disappear, unless the compiler can convince itself that it's safe to do so.

If an expression calls f() twice, and f has a side effect, then the inlined version of the code has to do that side effect twice, exactly as if the non-inlined function were called, just without some of the overhead of a function call.

(On a different note, in a language like C, inlining a side effect actually improves safety. If we have f() { global++ } then f() + f() is safe and has the effect of incrementing global twice, whereas global++ + global++ is undefined behavior.)


Sorry, you're using a slightly different definition of inlining than I was. Fortunately, it still demonstrates the same point: you have to be aware of state/side effects in order to do inlining.

In a pure and total language the two fragments are always the same. No further thought needed.


Not really. Inlining can be done by mechanically incorporating the called function into the caller, in a way that respects lexical scopes. This will result in identical semantics. Then the resulting bloated code is optimized: subjected to caching optimizations, common subexpression elimination and whatnot. These have to obey the usual rules for not deleting or reordering side effects. The issue you're referring to boils down to the fact that in an imperative language, we cannot assume that two occurrences of a variable X can be replaced by the same value, because X may have been assigned in between. So compilers have to do flow analysis to discover in what parts of a call graph a given variable has a stable value.

E.g. if we have a silly C function like

    void accumulate(int x, int y)
    {
       global += (x + y);
    }
and then call it in two places:

    accumulate(s, t);
    accumulate(u, v);
the inlining is almost like just macro-expansion of the above into:

    {
      int x = s;  /* "parameter" passing */
      int y = t;
      global += (x + y);
    }

    {
      int x = u;  /* "parameter" passing */
      int y = v;
      global += (x + y);
    }
  
we don't have to care about side effects when we are doing this expansion. That's the job of later optimization.

The later optimization pass can worry about things like whether global is volatile-qualified. If it is, then the best that can be done is:

     global += s + t;
     global += u + v;
these can't be merged or re-ordered. And so this stays faithful to the original function calls. If global isn't volatile, then more can be done:

     global += s + t + u + v;


But that isn't the kind of inlining he is advocating.

He is talking about

    f _ = do_work 12

    let a = do_work 12 in a + a


If the compiler has the full source of f() available at compilation time, it knows whether it can convert between those two. (Though not all compilers are good at knowing the memory-usage implications of doing that kind of commoning operation.)

(And, for that matter, whether it can convert that to "f()<<1".)


Sure, and thus inlining does not always work. To be clear, what I probably should have said was beta reduction, but I wanted to keep the terminology of the OP.

The point being that one has to be aware of state in order to execute an inline/beta reduction. In pure and total[0] code the above can only ever change the execution speed, never the meaning.

[0] And of course "pure and total" just means "pure" since non-termination is an effect.


Does `shl eax, 1` really outperform `add eax, eax`? Although I guess that's a question for `-mtune` to decide.


SHL EAX,1 and ADD EAX,EAX are the same, time-wise. The only difference is that the A flag (auxiliary carry [1]) is undefined for SHL and defined for ADD. SHL EAX,n may be faster than n repeats of ADD EAX,EAX, but there are other ways to do quick multiplication by powers of 2.

[1] The auxiliary carry reflects a borrow from bit 4 to bit 3, or a carry from bit 3 to bit 4. It's used for BCD [2] arithmetic, and there is no direct way to test for it.

[2] Binary Coded Decimal


Shift definitely outperforms "mul"; it may or may not outperform "add", but it'll probably outperform a series of adds to implement a larger power of 2. And if you're using the result as part of an address, you can use [eax*2 + ...] as an operand and the shift will happen as part of addressing.



