> And you're limited to 1 decoded instruction per cycle > when executing instruc...

anon_cownerd · on Oct 10, 2013

In fact, that hasn't been true since the Pentium Pro. Agner Fog's excellent microarchitecture docs [1] indicate that the PPro (a 3-wide machine, and the first out-of-order x86) could find 3 instructions per cycle in a 16-byte fetch chunk, and it likely did so by speculatively starting decode at each byte in parallel.
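To make the "start decode at each byte" idea concrete, here's a rough sketch in Python. It is not how the PPro's decoders are actually wired and it doesn't parse real x86; `toy_length` is a made-up encoding (low 2 bits of the first byte give a length of 1 to 4) and `pick_instructions` is a hypothetical name. The point is only the two-step shape: guess a length at every byte offset independently, then follow the chain from the known start and keep the first few that line up.

```python
def toy_length(chunk: bytes, i: int) -> int:
    # Assumption: a toy ISA, NOT x86. The low 2 bits of the first byte
    # encode the instruction length as 1..4 bytes.
    return (chunk[i] & 0x3) + 1


def pick_instructions(chunk: bytes, start: int = 0, width: int = 3):
    # Step 1: speculatively compute a length at every byte offset.
    # Each entry depends only on its own offset, so hardware could
    # compute all of them at once; most results get thrown away.
    lengths = [toy_length(chunk, i) for i in range(len(chunk))]

    # Step 2: starting from the one offset known to be a real
    # instruction boundary, follow the chain of lengths and keep up to
    # `width` instructions that fit entirely inside the chunk.
    picked = []
    pos = start
    while (len(picked) < width
           and pos < len(chunk)
           and pos + lengths[pos] <= len(chunk)):
        picked.append((pos, lengths[pos]))
        pos += lengths[pos]
    return picked


# A 16-byte fetch chunk: a 3-byte, a 1-byte, then a 2-byte instruction.
chunk = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF] + [0x00] * 10)
print(pick_instructions(chunk))  # [(0, 3), (3, 1), (4, 2)]
```

The speculative lengths computed at offsets 1, 2 and 5 are wasted work here, which is the trade: extra decode effort in exchange for not having to finish one instruction before finding where the next begins.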

[1] http://www.agner.org/optimize/microarchitecture.pdf