I suspect it may be time to restate the problem and revisit it.
A microbenchmark has nearly all of the machine's resources to itself. In a live scenario, an arbitrary and unknowable mix of other code shares those resources. The assertion is that we cannot model the negative - or positive - effects of that other code on the behavior of the code being benchmarked.
Except we workaday programmers do that in practice all the time. We hear about a performance improvement, estimate the likelihood of a payoff, then try it, first in isolation and then in our code. Sometimes the change is bad. Sometimes the change is boring. Occasionally, the change is surprising.[1]
Why couldn't we come up with test fixtures that run some common workloads and stick the code being benchmarked into the middle? You'd have three data sets instead of two in that case: control, A, and B. You'd want to know how (A - control) relates to (B - control) as a fraction.
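A minimal sketch of what I mean, in Python. Everything here is hypothetical: common_workload, version_a, and version_b are stand-ins, and a real harness would want process isolation, warmup runs, and more careful statistics. (In CPython the background thread contends for the GIL rather than a core, which is a crude but not entirely unrealistic form of resource sharing.)

    import statistics
    import threading
    import time

    def common_workload(stop):
        # Hypothetical background load: churn memory and CPU the way
        # a busy service might, until asked to stop.
        data = list(range(100_000))
        while not stop.is_set():
            data = [x * 31 % 97 for x in data]

    def timed_under_load(fn, runs=30):
        # Run fn repeatedly while the workload runs in the background;
        # return the median wall-clock time per run.
        stop = threading.Event()
        bg = threading.Thread(target=common_workload, args=(stop,))
        bg.start()
        try:
            samples = []
            for _ in range(runs):
                t0 = time.perf_counter()
                fn()
                samples.append(time.perf_counter() - t0)
        finally:
            stop.set()
            bg.join()
        return statistics.median(samples)

    def noop():        # control: the fixture with nothing in the middle
        pass

    def version_a():   # stand-ins for the two implementations under test
        sum(i * i for i in range(50_000))

    def version_b():
        total = 0
        for i in range(50_000):
            total += i * i

    control = timed_under_load(noop)
    a = timed_under_load(version_a)
    b = timed_under_load(version_b)

    # The fraction described above: the marginal cost of A versus B
    # once the shared background noise (control) is subtracted out.
    print((a - control) / (b - control))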
1. The biggest perf surprise I ever got (aside from the time I predicted the benefit of four optimizations in a row with a margin of error under 3%) was a 10x speedup when I expected 2x. We had two methods that each fetched the same data from the database, filtered it, and compared the results. I moved the lookup to the caller and made both methods pure functions (which also helped with testing them). I don't remember chaining the calls, but even if I had, both methods were roughly the same cost, so that should have been 3-4x. The rest I suspect was reduced memory and CPU pressure.