Because most code is not running anywhere near saturation of the available resources, and the problem is only getting worse as cores get wider. I mean, look at the Zen 5 block diagram - there are 6 ALUs and 4 AGUs on the integer side alone! That's almost two entire Zen 1 cores' worth of execution resources, which is pretty amazing. Very, very little real-world code is going to be able to come anywhere near saturating that every cycle. SMT helps improve the utilization of execution resources that have already been paid for in the core.
I'll give another example from my own experience. I write a lot of code in the computer graphics domain. Some of the more numeric-oriented routines are able to saturate the execution resources, and get approximately 0% speedup from SMT.
Importantly though, there are other routines that make heavy use of lookup tables. Even though the tables reside completely within L1 cache, there are some really long dependency chains where the 3-4 cycle wait for each L1 load chains up and causes really low utilization of the ALUs. Or at least, that's my theory. :) Regardless, in that code SMT provides about a 30% speedup "for free", which is quite impressive.
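To make that pattern concrete, here's a minimal C sketch of the shape I mean (not the actual routine; the table size and fill pattern are made up): each iteration's table index depends on the previous load's result, so the core spends most of its time waiting on L1 load-to-use latency while the ALUs sit mostly idle. A sibling SMT thread running an independent chain can issue its own work into those idle slots.

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024            /* small enough to stay resident in L1 */

static uint32_t table[TABLE_SIZE]; /* hypothetical lookup table */

/* Serial dependent chain: index i+1 is derived from the value loaded at
 * index i, so each load must complete before the next can even start. */
static uint32_t chase_chain(uint32_t start, int steps)
{
    uint32_t idx = start & (TABLE_SIZE - 1);
    for (int i = 0; i < steps; i++) {
        idx = (table[idx] + 1) & (TABLE_SIZE - 1); /* load-to-use dependency */
    }
    return idx;
}

int main(void)
{
    /* Fill the table with a scrambled pattern so the chain wanders around. */
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        table[i] = (i * 2654435761u) & (TABLE_SIZE - 1);

    /* Throughput here is bounded by L1 latency, not by ALU count,
     * which is exactly the gap a second SMT thread can fill. */
    printf("%u\n", chase_chain(0, 1 << 20));
    return 0;
}
```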
For a while I was uncertain whether SMT had a future, but I think for x86 in general it provides some pretty significant gains for a design complexity that has already been 'paid for'.
With the continuous improvement of out-of-order execution, the SMT gains have been diminishing from Zen 1 to Zen 4.
However, you are right that Zen 5, like the Intel Lion Cove core, has a jump in the number of available execution resources, and it is likely that out-of-order execution alone will not be enough to keep them busy.
This may lead to a higher SMT gain on Zen 5, perhaps around 30% on average (up from typically under 20% with Zen 3 or Zen 4), as in the Intel presentation that compared a Lion Cove without SMT to a Lion Cove with SMT. In the context of a hybrid CPU, where MT performance can be provided more efficiently by efficient cores than by SMT, Intel has chosen to omit SMT for better PPA (performance per power and area). In the context of their future server CPUs with big cores, however, they will use cores with SMT (and with wider SIMD execution units, to increase energy efficiency).