Obviously N and f are independent (although they might not be: your N may dictate how often a cache miss occurs).

Right, so now let's assume that our O(N^3) algorithm is simply a triple nested loop, and that we have an expensive operation f which, through our choice of data structures, we can place in any of the three loop levels. All other costs are constant.

If f is in the outer loop, it happens N times; in the middle loop, N^2 times; in the inner loop, N^3 times.

So our choice of data structure multiplies the cost of f by anything from N to N^3, which means two implementations can differ by a factor of up to N^2 in how often f runs.
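To make the counting concrete, here is a minimal sketch in Python. The names count_calls and level are made up for illustration, and f is just a counter standing in for the expensive operation; nothing here is taken from a real code base.

```python
def count_calls(n, level):
    """Run a triple nested loop over range(n) and count calls to f.

    `level` is a hypothetical switch: "outer", "middle" or "inner"
    selects which loop the expensive operation f lives in.
    """
    calls = 0

    def f():
        # Stand-in for the expensive operation; here it only counts itself.
        nonlocal calls
        calls += 1

    for i in range(n):
        if level == "outer":
            f()
        for j in range(n):
            if level == "middle":
                f()
            for k in range(n):
                if level == "inner":
                    f()
    return calls


n = 10
for level in ("outer", "middle", "inner"):
    # Prints 10, 100 and 1000 calls respectively for n = 10.
    print(level, count_calls(n, level))
```

The loop structure is identical in all three cases, so the big-O of the surrounding algorithm does not change; only the number of times f is paid for does.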

I was simply answering the question of whether two implementations of the same algorithm can really have such disparate run times, and it's trivially possible.

Another way of saying it is that the mistake need not look expensive on cursory inspection: a simple way to get very different runtimes is through the accumulation of a large number of small costs.

Maybe the important point is to use a profiler.
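As a rough illustration of what that looks like, here is a sketch using Python's standard cProfile and pstats modules. The functions cheap_looking and work are invented for the example; the point is only that a cost which looks trivial on inspection shows up in the call counts.

```python
import cProfile
import io
import pstats


def cheap_looking(x):
    # Hypothetical small cost: harmless once, noticeable when called n*n times.
    return x * x + 1


def work(n):
    # Double loop that pays the small cost on every inner iteration.
    total = 0
    for i in range(n):
        for j in range(n):
            total += cheap_looking(i + j)
    return total


profiler = cProfile.Profile()
profiler.enable()
result = work(200)
profiler.disable()

# Collect the report into a string, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The ncalls column of the report is usually the first thing to look at: a function that does almost nothing but is called tens of thousands of times is exactly the kind of accumulated small cost described above.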

I've recently gained a 60x improvement in real-world run time for what is considered in the literature to be the best known algorithm for a particular problem. The speedup didn't come from improving the algorithm; it came from improving data locality, removing memory allocations, and general modern code hygiene (to be fair, the canonical implementation is ~20 years old).