
  > And you're limited to 1 decoded instruction per cycle
  > when executing instructions brought into cache for the
  > first time
I don't think that's true for any recent processor. On Sandy Bridge, you can get a throughput of 4 instructions decoded per cycle as long as they average 4 bytes or less in length and are well aligned. The exact details are a bit messy given the different layers, but you can mark up to 6 instructions or 16 bytes per cycle, and you can put 4 into the queue each cycle (or 5 if there is fusion): http://www.realworldtech.com/sandy-bridge/3/
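To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of those limits. It is a toy model, not a description of the real pipeline: the function name and parameters are made up for illustration, and it ignores alignment, the uop cache, and everything downstream of the decoders.

```python
def decode_bound(avg_len_bytes, fetch_bytes=16, predecode_max=6, queue_max=4):
    """Upper bound on instructions decoded per cycle in this toy model.

    fetch_bytes   : bytes that can be marked per cycle (16 in the comment above)
    predecode_max : instructions that can be marked per cycle (6 above)
    queue_max     : instructions queued per cycle (4, or 5 with fusion)
    All names here are hypothetical; only the numbers come from the comment.
    """
    if avg_len_bytes <= 0:
        raise ValueError("average instruction length must be positive")
    # How many average-length instructions fit in one fetch window.
    by_bytes = fetch_bytes / avg_len_bytes
    # The tightest of the three limits wins.
    return min(by_bytes, predecode_max, queue_max)


print(decode_bound(4))               # 4-byte average: byte and queue limits both give 4
print(decode_bound(8))               # 8-byte average: byte-limited
print(decode_bound(3, queue_max=5))  # shorter instructions, with fusion
```

The point is just that with an average length of 4 bytes or less, the 16-byte window stops being the bottleneck and the queue width takes over.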

  > It can hold 1.5 K uOps, which is certainly something but
  > not so big that you won't be missing it regularly.

You'll be missing it frequently, but it's big enough that it should hold almost any tight loop. Intel claims it's equivalent to a 6KB instruction buffer and that it has about an 80% hit rate.




In fact, it hasn't been true since the Pentium Pro. Agner Fog's excellent microarchitecture docs [1] indicate that the PPro (which was a 3-wide machine, and the first out-of-order x86) could find 3 instructions per cycle out of a 16-byte fetch chunk, and likely did so by speculatively starting decode at each byte in parallel.

[1] http://www.agner.org/optimize/microarchitecture.pdf
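For anyone wondering what "starting decode at each byte in parallel" might look like, here is a minimal sketch. It uses a made-up toy encoding (the low nibble of an instruction's first byte is its length), not real x86, and the function name is hypothetical. It only illustrates the idea of computing a speculative length at every offset and then chaining from the known start.

```python
def find_instructions(chunk, start=0, width=3):
    """Pick up to `width` instruction boundaries from one fetch chunk.

    Toy encoding (an assumption, not x86): an instruction's length is the
    low nibble of its first byte, with a minimum of 1.
    """
    # Step 1: guess a length at *every* byte offset. Each guess is
    # independent of the others, so hardware could do them all at once.
    spec_len = [max(b & 0x0F, 1) for b in chunk]

    # Step 2: starting from the known-good offset, follow the chain of
    # lengths and keep only the guesses that land on real boundaries.
    picked = []
    pos = start
    while (len(picked) < width
           and pos < len(chunk)
           and pos + spec_len[pos] <= len(chunk)):
        picked.append((pos, spec_len[pos]))
        pos += spec_len[pos]
    return picked


chunk = bytes([2, 0, 3, 0, 0, 1, 4, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # 16 bytes
print(find_instructions(chunk))  # (offset, length) pairs for a 3-wide machine
```

Most of the speculative lengths are thrown away; the win is that the serial part shrinks to a short selection step instead of a full decode per instruction.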



