Let's take a step back from the empirical measurements. It's always good to take measurements when considering performance, but in this case I think the concepts are simple enough that we can predict the performance analytically.
Pure python is slow. Dog slow. For the most part, its slowness is due to its power. It's so terribly powerful that very nearly nothing is determined statically. The next line of code we're about to execute is "i = i + 1". What sort of computation is that in terms of the computer itself -- registers, loads/stores to memory? We have no idea: none! Not until we reach that line and start to break it down. We can't even guess at (or bound) whether it does one operation or a billion operations. And not just the first time, either. Every time†! i could refer to something other than an integer this time around, or the __add__() of its type could have been patched. Sometimes we capitalize on python's mega-dynamic features, but in simple cases like these loops we don't.
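You can see this from the interpreter's point of view with a toy sketch (the `bump` function and `Chatty` class are made-up names for illustration, not anything from the article):

```python
import dis

def bump(i):
    i = i + 1
    return i

# The bytecode for "i = i + 1" is the same no matter what i is; the dispatch
# to int.__add__ (or anything else) happens at run time, every time.
dis.dis(bump)

class Chatty:
    """Hypothetical type whose + does far more than one machine add."""
    def __add__(self, other):
        print("doing a billion 'operations'...")
        return self

bump(3)         # one boxed-integer add
bump(Chatty())  # the exact same bytecode, an arbitrary amount of work
```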
So we should expect to see vast differences between numpy and pure python, because numpy's guts are implemented in a C extension.
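A rough timeit sketch along these lines (my own names and sizes; the actual numbers will vary a lot by machine) typically shows an order of magnitude or two between the boxed per-element loop and the single call into numpy's compiled loop:

```python
import timeit
import numpy as np

n = 1_000_000
xs = list(range(n))
arr = np.arange(n)

# Pure python: re-dispatches the add on boxed ints once per element.
py_time = timeit.timeit(lambda: [x + 1 for x in xs], number=10)

# numpy: one call into compiled C that loops over unboxed machine ints.
np_time = timeit.timeit(lambda: arr + 1, number=10)

print(f"list comprehension: {py_time:.3f}s   numpy: {np_time:.3f}s")
```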
But I'd love it if this article had gone a little deeper. What are the specific reasons for the differences we see here? Does one of these versions suffer more heap fragmentation? Are the differences for bigger matrices dominated by cache misses? What's the difference in the disassembly between the for/while/comprehension loops?
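That last question, at least, is easy to start poking at with the dis module. A quick sketch (toy functions of my own, not the article's code):

```python
import dis

def with_for(xs):
    out = []
    for x in xs:
        out.append(x + 1)
    return out

def with_while(xs):
    out, i = [], 0
    while i < len(xs):
        out.append(xs[i] + 1)
        i += 1
    return out

def with_comprehension(xs):
    return [x + 1 for x in xs]

# Compare the bytecode: the while version pays for an extra comparison,
# an index lookup, and a counter increment on every iteration, all of it
# interpreted bytecode rather than native loop machinery.
for fn in (with_for, with_while, with_comprehension):
    print(f"--- {fn.__name__} ---")
    dis.dis(fn)
```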
† this is technically an implementation detail of CPython, IIRC: tracing JIT optimizers like PyPy can work out that this can't happen (or guard the cases where it could) and make the common path fast when we aren't using the dynamic features.
> For the most part, its slowness is due to its power. It's so terribly powerful that very nearly nothing is determined statically.
No. It's slow because the language is interpreted, not JITed. Powerful, highly dynamic languages can be quite amenable to JIT compilation, especially if you're not actually using the dynamic features. Shape optimizations (assuming the code in question is used more or less monomorphically) can convert name lookups into structure offsets, and you can keep dynamically modified code working by emitting deoptimization guards.
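To make the guard idea concrete, here's a purely conceptual sketch in python itself (no real JIT is built this way; it only shows the shape of the trick):

```python
def generic_add(a, b):
    # The fully dynamic path: type lookup, __add__/__radd__ protocol, etc.
    return a + b

def specialized_add(a, b):
    # Guard: this "trace" was specialized for plain ints, so check that the
    # assumption still holds before taking the fast path.
    if type(a) is int and type(b) is int:
        return a + b              # a real JIT would emit one machine add here
    return generic_add(a, b)      # deoptimize: fall back to the generic path
```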
The CPython implementation explicitly favored simplicity of the interpreter over making it a competitively fast runtime.