I've probably read about it and forgotten, but what on earth was it doing in all...

pclmulqdq · on Sept 24, 2024

Here's a good source for the earliest Pentium 4, which had a 20-stage pipeline: https://courses.cs.washington.edu/courses/cse378/10au/lectur...

Mostly, the stages you'd expect from an out-of-order core each got their own pipeline slot so the clock could go super fast, and some were split into two stages. For example, you have two stages to calculate the next instruction address (branch prediction), two stages to fetch, a stage to redrive a value from the memory bus, etc. On top of that, the ALUs could only do a 16-bit add in one cycle, so even an addition took at least 3 cycles. The later Pentium 4s took this subdivision even further with the goal of reaching 4-5 GHz clocks in 2005.
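
To make the 16-bit add point concrete, here is a rough Python sketch of a 32-bit add done in 16-bit slices, one slice per step. It is only a toy model, not the actual circuit: the function name `staggered_add32` is made up, and treating the third step as flag generation is my assumption about where the third cycle goes.

```python
MASK16 = 0xFFFF

def staggered_add32(a, b):
    """Toy model: add two 32-bit values using a 16-bit adder, one slice per step."""
    # Step 1: add the low 16 bits; bit 16 of the partial sum is the carry-out.
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16

    # Step 2: add the high 16 bits plus the carry produced in step 1.
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    result = ((hi & MASK16) << 16) | (lo & MASK16)

    # Step 3 (assumed): derive the flags once the full result is known.
    flags = {"carry": bool(hi >> 16), "zero": result == 0}

    steps = 3
    return result, flags, steps


if __name__ == "__main__":
    result, flags, steps = staggered_add32(0x0000FFFF, 0x00000001)
    print(hex(result), flags, steps)
```

The useful part of this arrangement is that a dependent operation only needs the low half to get started, so the high half and the flags can trail behind it.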

The whole chip was hand-laid-out, so it was possible to put pipeline registers in places that would have been weird to do in Verilog.