It's been 20 years since I optimized assembly code at a high level (think VTune, pipelining, etc.). What kind of workload needs that kind of optimisation nowadays?
Drivers. High-frequency / low-latency trading. Networking/telecoms equipment. Core routines (i.e., if 10% of your workload is spent in a single block of code and it runs at scale, the savings can far outweigh the cost). Cheap electronics produced at scale (if you sell enough of them, saving 1c per unit on hardware outweighs the engineering effort).
>>> unless somehow your workload really, really wants optimized L2 access.