I've probably read about it and forgotten, but what on earth was it doing in all those pipeline stages?
Were there a lot of idle/no-op stages for simple instructions? Or did most instructions, even simple ones, actually get some useful work done for each of those 32ish stages?
Mostly, each stage you'd expect in an out-of-order core got its own pipeline slot so the clock could run very fast, and some were split into two stages. For example, there were two stages to calculate the next instruction address (branch prediction), two stages to fetch, a stage to redrive a value from the memory bus, etc. On top of that, the ALUs could only do a 16-bit add in one cycle, so even an addition took at least 3 cycles. The later Pentium 4s took this subdivision even further, with the goal of reaching 4-5 GHz clocks in 2005.
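To make the 16-bit-at-a-time add concrete, here's a rough Python sketch. It's only an illustration of splitting a 32-bit add into narrow steps, not a model of the actual circuit: the exact split (low half, then high half plus carry, then flags) is my assumption, and `staggered_add32` is a made-up name.

```python
MASK16 = 0xFFFF

def staggered_add32(a, b):
    """Add two 32-bit values using only 16-bit additions, in three steps."""
    # Step 1: add the low 16 bits and remember the carry out of bit 15.
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16
    lo &= MASK16

    # Step 2: add the high 16 bits plus the carry from step 1.
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    carry_out = hi >> 16
    hi &= MASK16

    # Step 3: assemble the result and derive the flags (assumed flag set).
    result = (hi << 16) | lo
    flags = {"carry": bool(carry_out), "zero": result == 0}
    return result, flags
```

The point is that step 2 can't finish until step 1 has produced its carry, which is why a narrow, fast adder turns one "simple" add into a short chain of dependent steps.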
The whole chip was hand-laid-out, so it was possible to put pipeline registers in places that would have been weird to do in Verilog.