So a conclusion would be to use numpy instead of for loops for number crunching.
But behold! This does not scale as one might expect.
Turning a slightly more complex elementwise computation into vectorized numpy calls produces a lot of memory overhead. Each vectorized numpy call does one basic computation (adding two operands, negating, taking the sine, ...) and allocates a whole new array for its result. The calls _can_ leverage parallel computation, but you run into cache issues because of the massive memory overhead.
A simple C loop that computes each element completely can be much more memory-local and cache-friendly. Turning it into a chain of numpy operations can destroy that.
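To make the temporaries concrete, here is a rough sketch (the expression `sin(x) * y + x` is just an arbitrary example): the chained form allocates a fresh array per step, while passing `out=` to the ufuncs reuses one scratch buffer.

```python
import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# Chained ufunc calls: each step can allocate a new temporary array
r = np.sin(x) * y + x  # temporaries for sin(x) and sin(x) * y

# Same computation reusing a single scratch buffer via out=
tmp = np.empty_like(x)
np.sin(x, out=tmp)
np.multiply(tmp, y, out=tmp)
np.add(tmp, x, out=tmp)
```

This still makes three passes over the data, so it fixes the allocations but not the cache behavior of a fused loop.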
I think using Numba to compile ufuncs for numpy can be a better approach to that problem.
> Turning a slightly more complex elementwise computation into vectorized numpy calls produces a lot of memory overhead. Each vectorized numpy call does one basic computation (adding two operands, negating, taking the sine, ...) and allocates a whole new array for its result.
I believe this is no longer correct, within a single expression. In the last couple of years things were fixed so that `a = x+y+z`, for instance, will not create an intermediate array.