You can still benefit from AVX even for short workloads; AVX instructions don't have any more latency than ordinary instructions.
Running things on some accelerator (GPU, etc.) usually involves writing a specific kernel in a language subset, manually copying data, and generally incurring long latencies. Unless there is a lot of data, it won't be faster.
With AVX, in the best case the compiler can just auto-vectorize some loop, speeding it up several-fold (say 5x) with no added latency and no source code changes.
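As a rough sketch of what "just vectorize some loop" means (this is an illustrative example, not from the original post; the function name and flags are my own), a plain element-wise loop like this is a typical auto-vectorization candidate with something like gcc -O3 -mavx2:

    #include <stddef.h>

    /* With e.g. `gcc -O3 -mavx2` the compiler can emit AVX instructions
     * that process 8 floats per iteration here, with no source changes
     * and no offload or data-copy overhead. */
    void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

The restrict qualifiers just tell the compiler the arrays don't alias, which makes vectorization easier to prove safe.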