Which has to be done after every instruction (http://boston.conman.org/2015/09/0...

colejohnson66 · on Feb 21, 2022

My guess would be a pipelining issue where `INTO` isn't treated as a `Jcc`, but as an `INT` (mainly because it is an interrupt). Agner Fog's instruction tables[0] show that, on the Pentium 4, `Jcc` takes one uop with a reciprocal throughput of 2-4. `INTO`, OTOH, uses four uops when not taken, with a reciprocal throughput of 18! Zen 3 is much better with a reciprocal throughput of 2, but that's still worse than `JO raiseINTO`.

[0]: https://www.agner.org/optimize/instruction_tables.pdf

monocasa · on Feb 21, 2022

It's more complicated than what shows up in microbenchmarks like that. Since the check ends up after pretty much every add, putting `jo` instructions everywhere pollutes your branch predictor, and that can lead to worse overall perf.
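
For anyone who wants to see what "a check after every add" looks like from C, here is a minimal sketch. It assumes GCC or Clang, since `__builtin_add_overflow` is a compiler extension; `checked_add` and `checked_sum` are hypothetical names made up for illustration. On x86 these compilers commonly lower the builtin to an `add` followed by a conditional branch on the overflow flag, which is the `JO` pattern discussed above, but the exact codegen depends on compiler and flags.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: signed add with an explicit overflow check.
 * __builtin_add_overflow (GCC/Clang extension) returns true on overflow
 * and always stores the wrapped result in *out. */
static bool checked_add(int a, int b, int *out)
{
    return !__builtin_add_overflow(a, b, out);
}

/* Hypothetical helper: sum an array, checking after every single add.
 * Each iteration carries its own overflow branch, which is where the
 * "a branch after pretty much every add" cost comes from. */
static bool checked_sum(const int *v, size_t n, int *out)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (!checked_add(acc, v[i], &acc)) {
            return false; /* overflow detected, bail out */
        }
    }
    *out = acc;
    return true;
}
```

A microbenchmark of `checked_sum` alone would not show the predictor effect described above, since it has only one overflow branch site; the concern is about large programs where every arithmetic operation gets its own branch.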