A different compilation unit prevents inlining. Function pointers to small functions can be a performance problem for exactly that reason -- except for very specific special cases (when the compiler can actually prove that a function pointer call always ends up calling the same function), function pointer calls cannot be inlined.
The C/C++ compiler abstracts away the difference between an inlined and an actual call. For best results, cross-compilation-unit optimization should be enabled so that inlining can occur across object and library boundaries.
> Interpreters are slow because they chase pointers and otherwise branch in a manner the CPU can't predict. Calling the same function through the same pointer over and over again isn't that.
A lot of function pointer usage is also data dependent. Data dependence and overly complicated dependency chains are what break prediction.
Data-dependent audio pipeline callbacks? In the schemes I'm aware of, they rarely change after setup -- hence my thinking that effectively fixed targets were the representative case. (After an update, pipeline stalls plague you until the predictor catches up.)
Breaking inlining is probably the real cost here. Still, we're talking ~9 cycles per indirect call on this hardware. If you invoke each callback per buffer instead of per sample (which everyone seems to do), you're down to an amortized cost below 1 cycle per sample for a 14-deep filter chain with only 128-sample buffers. The actual work of 14 filters probably swamps that cost.
Edit: A good compiler will completely remove my trivial function if it's inlined into call.cc's code. It has no side effects, and neither does the loop calling it 100,000,000 times. The resulting timings would be meaningless: doing nothing is cheap!
> Still, we're talking ~9 cycles per indirect call on this hardware.
In 9 cycles, you could execute up to 18 SIMD instructions, each operating on 256 bits worth of data.
For scalar code, 9 cycles is up to 30 instructions or so; well-optimized code averages about 15-20.
> If you call each callback per buffer instead of per sample (which everyone seems to do), you're down to an amortized cost below 1 cycle per sample
Function pointers are perfectly OK even in high-performance programming, as long as they're not what burns the cycles. If you can batch the filter, all is good. If you can't, well, you might need to think up some other solution.
Profiling rocks. I simply ran some experimental cases with the same test setup, using dynamic function pointers versus hardcoded code. I quickly noticed that the hardcoded version was over an order of magnitude faster. I figured out a way to narrow the difference and ended up doing dynamic code generation.
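The contrast being measured can be sketched like this (a hypothetical two-stage chain; the actual setup and the code-generation step are not shown):

```cpp
#include <cstddef>

// Dynamic dispatch: one indirect, non-inlinable call per stage.
using Stage = float (*)(float);

float run_dynamic(const Stage* stages, std::size_t n_stages, float x) {
    for (std::size_t i = 0; i < n_stages; ++i)
        x = stages[i](x);
    return x;
}

// Hardcoded chain for one known configuration: everything can inline,
// so the whole chain compiles down to a few instructions.
static float gain(float x)   { return x * 2.0f; }
static float offset(float x) { return x + 1.0f; }

float run_hardcoded(float x) {
    return offset(gain(x));
}
```

Dynamic code generation essentially builds the `run_hardcoded` version at runtime for whatever chain the user configured, recovering the inlined performance without limiting the program to configurations known at compile time.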
> doing nothing is cheap!
Code you don't need to run is the fastest code you can have. Infinitely fast in fact.