Others have pointed out limitations on register read bandwidth and other possible hardware-level issues, but I'll also point out that avoiding indexed addressing modes in the generated assembly doesn't look that difficult. If you're doing a big unrolled SIMD kernel, it's probably not a stretch to have the compiler emit a few LEA instructions as well, so that the loop body itself doesn't need indexed addressing. Apparently the compiler authors simply haven't decided that it's important enough to do so yet, but compilers are much easier to change than hardware...
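To make that concrete, here's a rough C sketch of the two loop shapes I have in mind. The function names `sum_indexed` and `sum_bumped` are made up for illustration, and this is only the source-level picture: an optimizing compiler is free to turn either form into the other, so you'd have to check the generated assembly to see which addressing modes you actually get.

```c
#include <stddef.h>

/* Indexed form: each load is naturally expressed as base + i * sizeof(float),
   which on x86 typically maps to an indexed addressing mode
   (something like [rdi + rcx*4]). */
float sum_indexed(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Pointer-bump form: the addresses are advanced separately (an ADD or LEA
   per pointer per iteration), so each load can use a plain [reg] or
   [reg + constant] operand instead of base + index. */
float sum_bumped(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    const float *end = a + n;
    for (; a != end; a++, b++) {
        acc += *a * *b;
    }
    return acc;
}
```

The cost of the second form is the extra pointer updates, but in a heavily unrolled kernel those are amortized over many loads per iteration.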
Or they didn't want to end up producing code whose efficiency fluctuates noticeably, as described later in the article: sometimes, having consistent (read: predictable) computation times is better.