I would expect the magic number 3 to be related to the size of a cache line relative to the size of an array element, and to the time spent in each loop iteration. You want to prefetch the next cache line so that it becomes available just as you start processing it, and ideally not a moment earlier, because an early prefetch can evict data from the cache that could still have been used before the requested data is needed.

A robust implementation should figure out the size of a cache line and the load delay, and derive the magic offset from that.
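
As a rough sketch of that derivation (not a tuned implementation), the offset could be computed along these lines. The names `prefetch_distance` and `sum_with_prefetch` are hypothetical, the latency and per-element cost are assumed to be measured or estimated by the caller, and `__builtin_prefetch` assumes GCC or Clang:

```c
#include <stddef.h>

/* Hypothetical helper: how many elements ahead to prefetch.
 * line_bytes          - cache line size in bytes
 * elem_bytes          - size of one array element in bytes
 * load_latency_cycles - assumed/measured memory load latency
 * cycles_per_element  - assumed/measured cost of one loop iteration */
size_t prefetch_distance(size_t line_bytes, size_t elem_bytes,
                         unsigned load_latency_cycles,
                         unsigned cycles_per_element)
{
    size_t elems_per_line = elem_bytes ? line_bytes / elem_bytes : 1;
    if (elems_per_line == 0)
        elems_per_line = 1;          /* element larger than a line */
    if (cycles_per_element == 0)
        cycles_per_element = 1;      /* avoid division by zero */

    /* Elements processed while one load is in flight, rounded up. */
    size_t in_flight = ((size_t)load_latency_cycles + cycles_per_element - 1)
                       / cycles_per_element;

    /* Round up to a whole number of cache lines, at least one. */
    size_t lines = (in_flight + elems_per_line - 1) / elems_per_line;
    if (lines == 0)
        lines = 1;
    return lines * elems_per_line;
}

/* Example loop: sum an array, prefetching `dist` elements ahead. */
long sum_with_prefetch(const int *a, size_t n, size_t dist)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)            /* stay inside the array */
            __builtin_prefetch(&a[i + dist], 0 /* read */, 3 /* keep in cache */);
        sum += a[i];
    }
    return sum;
}
```

On Linux with glibc, the line size can usually be queried with `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)`, though it may return 0 on some systems. The load latency has no portable query and generally has to be measured, so treat the result as a starting point for benchmarking rather than a final answer.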