"Load a 16-wide vector register with scalars from 16 independent memory addresses, where the addresses themselves are stored in a vector!"
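For readers unfamiliar with gather semantics, here is a minimal scalar sketch of what such an instruction does, one independent load per lane (the names `gather16`, `LANES`, and the 32-bit index type are illustrative assumptions, not a real ISA definition):

```c
#include <stdint.h>

/* Scalar emulation of a 16-wide gather: idx[] plays the role of the
   index vector register, dst[] the destination vector register.
   A hardware gather performs these 16 loads in one instruction. */
enum { LANES = 16 };

void gather16(float dst[LANES], const float *base, const int32_t idx[LANES])
{
    for (int lane = 0; lane < LANES; ++lane)
        dst[lane] = base[idx[lane]];  /* one independent memory read per lane */
}
```

The point of the question below is that each of those 16 loads can hit a different cache line, so the single instruction can hide up to 16 potential cache misses.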
Won't that cause 16 memory reads whose cost far outweighs whatever one gains from a 16-way SIMD ALU operation?
What I wanted when I tried to write highly optimized computer-vision code was to have multiple "cursors" in memory that I would read from and increment; that is, the hardware would prefetch the data and my code could operate as if the data were a stream.
A more interesting worst case would be one where the 16 slots take turns missing the cache, so that almost everything is read from the cache but there is still a cache-miss penalty between every two SIMD instructions. But if you write your code in that style, you get what you deserve :-).
Even some current architectures [edit: x86, for example] have instructions that hint to the memory subsystem that the program intends to stream some part of memory in the near future. I expect these will grow in importance as more programs have to avoid these kinds of problems.
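A minimal sketch of using such hints while streaming, via the GCC/Clang builtin `__builtin_prefetch` (which on x86 typically compiles to a `PREFETCH` instruction); the prefetch distance of 64 elements is an assumed tuning parameter, not a recommendation:

```c
#include <stddef.h>

/* Stream over an array, hinting the prefetcher a fixed distance ahead.
   The third argument 0 marks the data as having low temporal locality,
   i.e. "streamed once, not reused". */
float sum_stream(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            __builtin_prefetch(&data[i + 64], /*rw=*/0, /*locality=*/0);
        sum += data[i];
    }
    return sum;
}
```

In practice the hardware prefetcher already handles a simple linear walk like this well; explicit hints tend to pay off for the multi-cursor or gather-heavy access patterns discussed above, where the stride is irregular.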