> Benchmarking isn’t a perfect measure of how effective this change is (benchmarking is arguably never perfect, but that’s a different conversation), as the results would be too noisy to observe in a large project.
So... does that not mean that any micro-improvement would also be too small to be significant on any large project?
The goal of the blog post was less to talk about this particular optimization and more to introduce TruffleRuby and, at a high level, what making changes to a programming language implementation looks like. My goal when I started working on the feature was to introduce myself to TruffleRuby; it was actually my first PR. It's not an impressive change, but it touches on a lot of the important aspects of how TruffleRuby is engineered.
I kinda suspect it may be time to restate the problem and revisit it.
A microbenchmark has nearly all of the machine's resources available to itself. In a live scenario, an arbitrary and unknowable mix of other code will be sharing those resources. The assertion is that we cannot model the negative - or positive - effects of that other code on the behavior of the code being benchmarked.
Except we workaday programmers do that in practice all the time. We hear about a performance improvement, we estimate the likelihood of a payoff, then we try it, first in isolation and then in our code. Sometimes the change is bad. Sometimes the change is boring. Occasionally, the change is surprising.[1]
Why couldn't we come up with test fixtures that run some common workloads and stick the code being benchmarked into the middle? You'd have three data sets instead of two in that case: control, A, and B. You'd want to know how (A - control) relates to (B - control) as a fraction.
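Something like this rough sketch of the idea, using Ruby's stdlib Benchmark module. `background_workload`, `variant_a`, and `variant_b` are hypothetical stand-ins here; in practice the workload would be real application code and the variants the two implementations being compared:

```ruby
require 'benchmark'

# Hypothetical stand-ins: background_workload plays the role of the "common
# workload" fixture; variant_a and variant_b are the two versions under test.
def background_workload
  (1..50_000).map { |i| i.to_s.reverse }.sort
end

def variant_a
  (1..10_000).sum { |i| i * i }
end

def variant_b
  (1..10_000).reduce(0) { |acc, i| acc + i * i }
end

# Median wall-clock time over several runs of whatever block is given.
def median_time(runs = 30)
  times = Array.new(runs) { Benchmark.realtime { yield } }
  times.sort[runs / 2]
end

control = median_time { background_workload }
a       = median_time { background_workload; variant_a }
b       = median_time { background_workload; variant_b }

# The quantity of interest: the marginal cost of A on top of the shared
# workload, relative to the marginal cost of B.
puts format('(A - control) / (B - control) = %.2f', (a - control) / (b - control))
```

The point of keeping the control run is exactly that fraction at the end: it compares what each variant adds on top of a realistic workload, rather than how each behaves with the whole machine to itself.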
1. The biggest perf surprise I ever got (aside from the time I predicted the benefit of 4 optimizations in a row with a margin of error of <3%) was a 10x speedup when I expected 2x. We had two methods that got the same data from the database, filtered it, and compared the results. I moved the lookup to the caller and made them both pure functions (which helped with testing them). I don't remember chaining the calls, but even if I did, both methods were roughly the same cost, so that should have been 3-4x. The rest I suspect was memory and CPU pressure.
This particular one is. Maybe we could measure time-to-warmup or hot performance and, with an extremely careful scientific process, say that application X has sped up by Y percent. But that wouldn't be very helpful either, because we already know it's a micro-improvement. Having data on this improvement in the context of compilation speed and machine code output is useful, since it tells us that it's not a micro-un-improvement, and because that was the intermediate goal of this optimization.
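For the time-to-warmup side, here's a rough sketch of what that measurement could look like on a JIT-compiled Ruby such as TruffleRuby. The `work` method and the thresholds are hypothetical and only illustrate the shape of the measurement, not a rigorous methodology:

```ruby
require 'benchmark'

# `work` is a hypothetical stand-in for the code under test.
def work
  (1..20_000).map { |i| i * 3 }.select(&:even?).sum
end

# Time each iteration individually so the warmup curve is visible.
samples = Array.new(500) { Benchmark.realtime { work } }

# Take the last fifth of iterations as the "hot" steady state, and call the
# run warmed up once an iteration first comes within 10% of that time.
hot    = samples.last(samples.size / 5)
steady = hot.sum / hot.size
warmed = samples.index { |t| t <= steady * 1.1 }

puts format('steady-state per-iteration time: %.6fs', steady)
puts format('iterations until warm: %s', warmed)
```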