There were CPUs with whole plethora of optional optimizations. For example Cyrix packed their CPUs with goodies, but had no money to test so made it all optional.
L1, Branch Target Buffer, LSSER (load/store reordering), Loop Buffer, Memory Type Range Registers (Write Combining, Cacheability), all controlled using client side software.
Cyrix 5x86 testing of Loop Buffer showed 0.2% average boost and 2.7% maximum observable speed boost.