I generally only see value in optimizing application code for the right data structures, batching, and data parallelism. Once you're fine-tuning batch sizes or SIMD ops for a particular platform, you're quickly into the realm of diminishing ROI: the work is fragile against changing hardware and application usage patterns, and dropping an op from 1 microsecond to 10 nanoseconds often just doesn't matter that much.
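To make that concrete, here's a toy sketch (Python with NumPy, purely illustrative) of the kind of batching/data-layout win I mean. The vectorized version is the portable "right data structure" fix, with no per-platform tuning involved:

```python
import time
import numpy as np

# Toy workload: sum of squared differences between two large vectors.
xs = np.random.rand(1_000_000)
ys = np.random.rand(1_000_000)

# Element-at-a-time: pays interpreter overhead on every iteration.
def ssd_loop(xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        d = x - y
        total += d * d
    return total

# Batched: one vectorized pass over contiguous arrays. This is the
# "data structures / batching" level of optimization -- portable across
# hardware, and typically a 10-100x win here without touching SIMD.
def ssd_batched(xs, ys):
    d = xs - ys
    return float(np.dot(d, d))

for fn in (ssd_loop, ssd_batched):
    start = time.perf_counter()
    result = fn(xs, ys)
    print(f"{fn.__name__}: {result:.4f} in {time.perf_counter() - start:.4f}s")
```

Going further than that (hand-tuned intrinsics, cache-line games) is where the fragility kicks in.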
There are cases at the tail where this will make or break a project, but those are pretty rare.