For some parsing, serialization, or filtering tasks, you end up with quite large lookup tables of shuffle masks that would otherwise be costly to compute on the fly, but a real workload only hits a tiny part of the table at a time, so it stays hot in cache during a benchmark in a way it won't in production. (Which, I suppose, fits entirely within his point about benchmarks.)
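
To make the pattern concrete, here's a minimal sketch (mine, not the parent's): a precomputed table of pshufb control masks, indexed by a comparison movemask, that left-packs the lanes passing a filter. This toy version only needs a 16-entry table; the tables people actually ship (indexed by 8-bit or wider masks) run to 256+ entries of 16 bytes each, which is where the cache effect starts to bite. Assumes SSSE3 (compile with -mssse3); build_table and filter_gt are made-up names.

    #include <stdint.h>
    #include <stdio.h>
    #include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

    /* One pshufb control mask per 4-bit selection mask: gathers the
     * selected 32-bit lanes to the front, zeroes the rest (0x80). */
    static uint8_t shuf_table[16][16];

    static void build_table(void) {
        for (int m = 0; m < 16; m++) {
            int j = 0;
            for (int lane = 0; lane < 4; lane++) {
                if (m & (1 << lane)) {
                    for (int b = 0; b < 4; b++)
                        shuf_table[m][j * 4 + b] = (uint8_t)(lane * 4 + b);
                    j++;
                }
            }
            for (; j < 4; j++)             /* unselected tail lanes */
                for (int b = 0; b < 4; b++)
                    shuf_table[m][j * 4 + b] = 0x80;  /* pshufb zeroes these */
        }
    }

    /* Copy elements of in[] greater than thresh into out[], left-packed.
     * n must be a multiple of 4 for this sketch; each iteration stores a
     * full 16-byte vector, so out[] must hold at least n elements. */
    static size_t filter_gt(const int32_t *in, size_t n, int32_t thresh,
                            int32_t *out) {
        __m128i vt = _mm_set1_epi32(thresh);
        size_t kept = 0;
        for (size_t i = 0; i < n; i += 4) {
            __m128i v    = _mm_loadu_si128((const __m128i *)(in + i));
            __m128i gt   = _mm_cmpgt_epi32(v, vt);
            int     mask = _mm_movemask_ps(_mm_castsi128_ps(gt)); /* 4 bits */
            __m128i shuf = _mm_loadu_si128((const __m128i *)shuf_table[mask]);
            _mm_storeu_si128((__m128i *)(out + kept),
                             _mm_shuffle_epi8(v, shuf));  /* left-pack */
            kept += (size_t)__builtin_popcount((unsigned)mask);
        }
        return kept;
    }

    int main(void) {
        int32_t in[8] = {1, 7, 3, 9, 6, 2, 8, 4};
        int32_t out[8];
        build_table();
        size_t k = filter_gt(in, 8, 5, out);
        for (size_t i = 0; i < k; i++)
            printf("%d ", out[i]);   /* prints: 7 9 6 8 */
        putchar('\n');
        return 0;
    }

In a microbenchmark that loops over the same few inputs, only a handful of those table rows get touched and they never leave L1; a real input stream indexes the table with effectively random masks, so the big-table versions of this trick pay cache misses the benchmark never sees.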