
Any entry-level optimization course will make you implement something like that: do a matrix+matrix addition, but traverse it in row-major order in one program and in column-major order in the other. If the matrix is big enough, you will easily get a 10x difference. For even larger sizes (once you start thrashing the TLB), you will get an additional factor.
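A minimal sketch of what that exercise looks like, in plain C (the size is just a placeholder; compile with optimizations and time each function separately):

    #include <stddef.h>

    #define N 4096  /* placeholder; one row is 32 KB, already the size of a typical L1 */

    /* Row-major traversal: walks memory contiguously, so the cache and
       prefetcher do their job. */
    void add_row_major(double (*a)[N], double (*b)[N], double (*c)[N]) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[i][j];
    }

    /* Column-major traversal of the same row-major arrays: consecutive
       accesses are N * sizeof(double) bytes apart, so nearly every access
       misses the cache (and, for large N, the TLB). */
    void add_col_major(double (*a)[N], double (*b)[N], double (*c)[N]) {
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                c[i][j] = a[i][j] + b[i][j];
    }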

You might not get up to 60, but this is just a simple matrix addition. Even plain matrix multiplication might be enough to get a 60x difference just by optimizing for the memory hierarchy. I expect that high-performance memory-bound programs are more complex than matrix multiplication, so the parent comment's claim doesn't seem too unlikely to me.
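To make the multiplication point concrete, swapping two loops is usually enough; a rough sketch (same assumptions as above, row-major arrays, size is a placeholder):

    #define M 2048  /* placeholder size */

    /* ijk order: the inner loop walks b down a column, striding M doubles
       per step, so reads of b miss the cache almost every iteration. */
    void matmul_ijk(double (*a)[M], double (*b)[M], double (*c)[M]) {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++) {
                double sum = 0.0;
                for (int k = 0; k < M; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }

    /* ikj order: the inner loop walks b[k] and c[i] contiguously. Same
       arithmetic, same O(M^3), very different memory behaviour.
       (c must be zeroed before calling this one.) */
    void matmul_ikj(double (*a)[M], double (*b)[M], double (*c)[M]) {
        for (int i = 0; i < M; i++)
            for (int k = 0; k < M; k++) {
                double aik = a[i][k];
                for (int j = 0; j < M; j++)
                    c[i][j] += aik * b[k][j];
            }
    }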




Sorry, I didn't mean to question whether it is possible. I just think that a factor of 60 sounds tricky to achieve if we are just talking about RAM (no disk involved).

The first time I encountered this myself was doing texture mapping on the 486 back in the mid 90s. Texels were laid out by row, and the mapper would draw horizontally. Texture maps that fit within the L1 cache would draw fine no matter which rotation they were drawn at.

However, texture maps that were larger than the L1 cache would be drawn fine as long as the texture map wasn't significantly rotated. But if you rotated the polygon by 90 degrees, you'd see a significant drop in frames per second, because you'd skip a whole texture row between drawing each texel, which would effectively cause a lot of cache misses. OK, I didn't manage to explain that well, but I hope it is understandable.
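In rough modern C terms, the two cases looked something like this (names and details are made up, and the perspective math is omitted):

    /* Texture stored row-major: texel (u, v) lives at tex[v * tex_w + u]. */

    /* Unrotated case: u advances along the span, so texel reads are
       contiguous and one cache line serves many texels. */
    void draw_span_unrotated(unsigned char *dst, const unsigned char *tex,
                             int tex_w, int u0, int v0, int len) {
        for (int x = 0; x < len; x++)
            dst[x] = tex[v0 * tex_w + u0 + x];
    }

    /* Rotated ~90 degrees: v advances instead, so every texel read jumps a
       whole texture row (tex_w bytes). Once the texture outgrows L1,
       nearly every read is a cache miss. */
    void draw_span_rotated(unsigned char *dst, const unsigned char *tex,
                           int tex_w, int u0, int v0, int len) {
        for (int x = 0; x < len; x++)
            dst[x] = tex[(v0 + x) * tex_w + u0];
    }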

Obviously, I have encountered this many times since, in many variants (including disk read/write), and I definitely agree that a factor of 10 should be possible.

I guess I am just wondering whether the difference between RAM and L1 has reached a factor of 60, how easy it is to come up with something that misses the cache all the time, or whether caches have become smarter.


Googling puts RAM latency (bypassing the caches) on modern processors at 40-60 cycles.


Exactly... Now, that alone could make it tricky to get a factor of 60 difference.


Write a simplistic O(N^3) algorithm.

When N is 100, all fixed costs in the inner loop are multiplied by 1,000,000, so a nanosecond difference is now a millisecond. It's pretty easy to do worse than that with a boneheaded data layout (row-major vs column-major is a pretty common one).
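For instance, something like this made-up kernel, just to show where the constant lands:

    #define NN 100

    /* Same O(N^3) triple loop either way; the body runs 1,000,000 times. */

    /* Indexing matches the storage order, so the inner loop is unit stride. */
    double sum_good(double a[NN][NN][NN]) {
        double s = 0.0;
        for (int i = 0; i < NN; i++)
            for (int j = 0; j < NN; j++)
                for (int k = 0; k < NN; k++)
                    s += a[i][j][k];
        return s;
    }

    /* Indexing fights the storage order: consecutive inner iterations are
       NN*NN doubles apart, so the extra per-iteration cost (a likely cache
       miss) is paid a million times. A few extra ns per iteration is a few
       extra ms in total. */
    double sum_bad(double a[NN][NN][NN]) {
        double s = 0.0;
        for (int i = 0; i < NN; i++)
            for (int j = 0; j < NN; j++)
                for (int k = 0; k < NN; k++)
                    s += a[k][j][i];
        return s;
    }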


No...

If you use a "bonehead" data structure, each memory reference will force a cache miss, i.e. take q CPU cycles instead of 1 CPU cycle.

If the algorithm has a running time of O(N^3), it means it uses some number of instructions <= c * N^3 to complete. Let's for simplicity's sake say 1 operation is identical to 1 CPU cycle.

Now, if each operation takes q CPU cycles instead of 1 cycle due to a cache miss on every operation, the running time of your algorithm is still bounded by q * (c * N^3) cycles.

For the example with N = 100: the fast version takes c * 1,000,000 cycles to complete, for some value c. For the slow version, as each operation is now q times slower, it takes q * c * 1,000,000 cycles to complete.

I.e. it takes q times as long, not q * q * q.

What you were saying is that if you run an O(N) algorithm on a 100 MHz CPU, it will take 10 times as long as on a 1000 MHz CPU, but an O(N^3) algorithm would take 1000 times as long. And that is obviously not true. Both will take 10 times longer.


No

I'm saying that a poor choice of data structure will have an amplified impact on an O(N^3) algorithm.

I'm giving it as an example of how two different implementations of the same algorithm can have massively different runtimes on the same processor.


If the algorithm has a running time of O(N^3), that implies it takes at most c * N^3 operations to complete, regardless of data structures. That's the definition of asymptotic running time.

If we disregard c, and set N = 100, that is 1,000,000 operations.

We have two situations: 1) either we use a good layout where each operation is fast because each memory access comes from cache, or 2) we use a bad layout where each operation is slow because we have a cache miss and have to access RAM. These access times are independent of the specific algorithm and depend only on the CPU architecture.

Let's say that the fast operations each take 1 second, and the slow operations each take f seconds.

In the fast case, with good memory layout, the algorithm completes in 1,000,000 seconds. In the slow case, with bad memory layout, the algorithm completes in f * 1,000,000 seconds.

The time of each operation does not depend on the size of N. In practice you'll typically still get a lot of cache hits even with a bad memory layout, so the difference may be less than a factor of f for smaller values of N (where a cubic algorithm gets more cache hits). But it will be upper bounded by f; i.e. there is an upper bound on how much slower you can make an algorithm by changing the memory layout.


Obviously N and f are independent (although in practice they might not be; your N may dictate how often a cache miss occurs).

Right, so now let's assume that our O(N^3) algorithm is simply a triple nested loop, and that we have an expensive operation f, but that through our choice of data structures we can choose in which layer of the loop that operation occurs. All other costs are constant.

If f is in the outer loop, it happens N times, in the middle loop N^2, in the inner loop N^3.

So our choice of data structure impacts our running times by any factor from linear through cube of N.
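A contrived sketch of that point, with f as an opaque expensive lookup (all names made up; in real code it's the data structure, not the programmer, that decides where f lands):

    /* Stand-in for the expensive operation f: anything whose cost dwarfs
       the arithmetic in the loop body (a cache-missing lookup, a search,
       a division chain). Purely illustrative; table has 4096 entries. */
    static double f(const double *table, int key) {
        double acc = 0.0;
        for (int step = 0; step < 256; step++)
            acc += table[(key * 31 + step * 97) % 4096];
        return acc;
    }

    /* Layout A lets us pay for f once per i: N calls in total. */
    double run_hoisted(const double *table, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            double fi = f(table, i);              /* f in the outer loop */
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    s += fi + j + k;              /* cheap body */
        }
        return s;
    }

    /* Layout B forces f onto every innermost step: N^3 calls in total.
       Same answer, same O(N^3), wildly different constant. */
    double run_inner(const double *table, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    s += f(table, i) + j + k;     /* f in the inner loop */
        return s;
    }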

I was simply answering the question of whether it is really possible for two implementations of the same algorithm to have such disparate run times, and it's trivially possible.

Another way of saying it is that the mistake need not look expensive on cursory inspection; a simple way to get very different runtimes is through the accumulation of a large number of small costs.

Maybe the important point is to use a profiler.

I've recently gained a 60x improvement in real-world run time of what is considered in the literature to be the best known algorithm for a particular problem. The speed-up didn't come from improving on the algorithm; it came from improving data locality, removing memory allocations, and general modern code hygiene (to be fair, the canonical implementation is ~20 years old).
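Not the actual changes, obviously, but the flavour of the thing (hypothetical names, error handling omitted):

    #include <stdlib.h>

    /* Before: every result is a separate heap allocation on a linked list.
       The allocator is hit n times and later traversals chase pointers all
       over the heap. */
    struct node { double value; struct node *next; };

    struct node *process_old(const double *in, int n) {
        struct node *head = NULL;
        for (int i = 0; i < n; i++) {
            struct node *nd = malloc(sizeof *nd);   /* one allocation per element */
            nd->value = in[i] * 2.0;
            nd->next = head;
            head = nd;
        }
        return head;
    }

    /* After: one contiguous output buffer, allocated once by the caller.
       Same arithmetic, but no allocator traffic, and later passes walk the
       results with unit stride. */
    void process_new(const double *in, int n, double *out) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * 2.0;
    }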



