Surely it’s possible to build some benchmark to demonstrate the difference, right? Otherwise, what’s the point of making that improvement in the first place?
I think what you’re saying, though, is that benchmarks and microbenchmarks that are cheap to run are valuable, and that in those, instruction counts may be the only way to measure a 5% improvement (you’d have to run the test for a lot longer to prove that a 5% instruction count improvement is a real 1% wall clock improvement and not just noise). Even Criterion gets iffy about small improvements, and it tries to build a statistical model.
> Surely it’s possible to build some benchmark to demonstrate the difference, right? Otherwise, what’s the point of making that improvement in the first place?
No, sometimes the improvement you made is only something like 0.5% faster. It's very difficult to show that it is actually faster with real wall clock measurements, so you have to use a more stable proxy.
What's the point of a 0.5% improvement? Well, not much. But you don't do one, you do 20, and cumulatively your code is about 10% faster.
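To make the cumulative arithmetic concrete, here is a small sketch. It assumes each of the 20 changes independently removes 0.5% of whatever runtime remains; the function name is made up for illustration.

```python
def cumulative_runtime(per_step_gain: float, steps: int) -> float:
    """Fraction of the original runtime left after `steps` gains,
    each removing `per_step_gain` of the runtime that remains."""
    return (1.0 - per_step_gain) ** steps


remaining = cumulative_runtime(0.005, 20)
print(f"runtime remaining: {remaining:.4f}")
print(f"runtime saved:     {1 - remaining:.2%}")
print(f"throughput gain:   {1 / remaining - 1:.2%}")
```

Under that assumption the savings compound to roughly 9.5% less runtime, or roughly 10.5% more throughput, which is where the "about 10%" comes from.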
I really recommend Nicholas Nethercote's blog posts. A good lesson in micro-optimisation (and some macro-optimisation).
> It's very difficult to show that it is actually faster with real wall clock measurements, so you have to use a more stable proxy.
That’s what I’m saying, though. You don’t strictly need a stable proxy. You should be able to quantify the wall clock improvement, but it requires a very long measurement time. For example, a 0.5% improvement amounts to a benchmark that takes 1 day completing about 7 minutes earlier. The reason you use a stable proxy is that the benchmark finishes sooner, which shortens the feedback loop. But relying too much on the proxy can also be harmful, because you can decrease the instruction count and still slow down wall clock time (or vice versa). Wall clock performance is more complex than instruction count: branch prediction, data dependencies, and cache behavior also really matter.
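As a rough illustration of why the measurement has to be so long, the sketch below uses the textbook two-sample normal approximation to estimate how many runs per variant are needed to detect a given relative improvement. The noise levels are assumptions for the example, not measurements, and real timing noise is often not normally distributed, so treat the result as an order of magnitude.

```python
import math
from statistics import NormalDist


def minutes_saved(run_minutes: float, improvement: float) -> float:
    """Wall clock minutes saved on a run of `run_minutes` by a relative improvement."""
    return run_minutes * improvement


def runs_needed(improvement: float, noise_cv: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate runs per variant needed to detect `improvement`.

    `noise_cv` is the run-to-run standard deviation divided by the mean
    (an assumed value here). Uses n = 2 * ((z_alpha + z_power) * cv / delta)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * noise_cv / improvement) ** 2)


print(f"0.5% of one day: {minutes_saved(24 * 60, 0.005):.1f} minutes")
for cv in (0.005, 0.02, 0.05):  # hypothetical noise levels
    print(f"noise {cv:.1%}: about {runs_needed(0.005, cv)} runs per variant")
```

With an assumed 2% run-to-run noise, a 0.5% improvement needs on the order of 250 runs of each version, and the count grows with the square of the noise-to-improvement ratio.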
So if you want to be really diligent with your benchmarks (and you should be when micro-optimizing to this degree), you should validate your assumptions by confirming the impact on wall clock time, as that’s “the thing” you’re actually optimizing, not cycle counts for cycle counts’ sake (the same goes for power or memory usage, if that’s what you’re optimizing). Never forget that a proxy measurement can stop being a good measurement once it becomes the target rather than the thing you actually want to measure.
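One way to do that wall clock confirmation, sketched under stated assumptions: time the two versions in interleaved order so that drift affects both, then bootstrap a confidence interval on the relative difference of medians. This is not how Criterion or any particular tool implements it; the function names and the two workloads are hypothetical.

```python
import random
import statistics
import time
from typing import Callable, List, Tuple


def time_interleaved(baseline: Callable[[], object],
                     candidate: Callable[[], object],
                     rounds: int) -> Tuple[List[float], List[float]]:
    """Time both versions alternately so drift (throttling, background load) hits both."""
    base_times: List[float] = []
    cand_times: List[float] = []
    for _ in range(rounds):
        for fn, sink in ((baseline, base_times), (candidate, cand_times)):
            start = time.perf_counter()
            fn()
            sink.append(time.perf_counter() - start)
    return base_times, cand_times


def bootstrap_speedup_ci(base_times: List[float], cand_times: List[float],
                         n_boot: int = 2000, seed: int = 0,
                         level: float = 0.95) -> Tuple[float, float]:
    """Percentile bootstrap interval for 1 - median(candidate) / median(baseline)."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_boot):
        b = statistics.median(rng.choices(base_times, k=len(base_times)))
        c = statistics.median(rng.choices(cand_times, k=len(cand_times)))
        gains.append(1.0 - c / b)
    gains.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return gains[lo_idx], gains[hi_idx]


# Hypothetical workloads standing in for the "before" and "after" code.
base, cand = time_interleaved(lambda: sum(range(20_000)),
                              lambda: sum(range(19_000)), rounds=30)
low, high = bootstrap_speedup_ci(base, cand)
print(f"relative wall clock gain: between {low:.2%} and {high:.2%}")
```

If the interval includes zero, the wall clock data has not yet confirmed what the proxy suggested, and the options are more rounds or a quieter machine.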