"Load a 16-wide vector register with scalars from 16 independent memory addresses, where the addresses themselves are stored in a vector!"
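For readers unfamiliar with gather semantics, here is a minimal scalar sketch of what such an instruction does, one independent load per lane (the names `gather16`, `LANES`, and the 32-bit index type are illustrative assumptions, not a real ISA definition):

```c
#include <stdint.h>

/* Scalar emulation of a 16-wide gather: idx[] plays the role of the
   index vector register, dst[] the destination vector register.
   A hardware gather performs these 16 loads in one instruction. */
enum { LANES = 16 };

void gather16(float dst[LANES], const float *base, const int32_t idx[LANES])
{
    for (int lane = 0; lane < LANES; ++lane)
        dst[lane] = base[idx[lane]];  /* one independent memory read per lane */
}
```

The point of the question below is that each of those 16 loads can hit a different cache line, so the single instruction can hide up to 16 potential cache misses.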
Won't that cause 16 memory reads whose cost far outweighs whatever one gains from a 16-way SIMD ALU operation?
What I wanted when I tried to write highly optimized computer-vision code was to have multiple "cursors" in memory that I would read from and increment; that is, the hardware would prefetch the data and my code could operate as if the data were a stream.
A more interesting worst case would be one where the 16 slots take turns missing the cache, so that almost everything is read from the cache but there is still a cache-miss penalty between every two SIMD instructions. But if you write your code in that style, you get what you deserve :-).
Even some current architectures [edit: x86, for example] have instructions that hint to the memory subsystem that the program intends to stream some part of memory in the near future. I expect these will grow in importance as more programs have to avoid these kinds of problems.
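A minimal sketch of using such hints while streaming, via the GCC/Clang builtin `__builtin_prefetch` (which on x86 typically compiles to a `PREFETCH` instruction); the prefetch distance of 64 elements is an assumed tuning parameter, not a recommendation:

```c
#include <stddef.h>

/* Stream over an array, hinting the prefetcher a fixed distance ahead.
   The third argument 0 marks the data as having low temporal locality,
   i.e. "streamed once, not reused". */
float sum_stream(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            __builtin_prefetch(&data[i + 64], /*rw=*/0, /*locality=*/0);
        sum += data[i];
    }
    return sum;
}
```

In practice the hardware prefetcher already handles a simple linear walk like this well; explicit hints tend to pay off for the multi-cursor or gather-heavy access patterns discussed above, where the stride is irregular.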