Isn't there a sort of obvious "best of both worlds": a vector ISA with a vector length register, plus the promise that a length >= 4 is always supported, and good support for shuffles within each group of 4?
Then you can do whatever you did for games and multimedia in the past, except that you can process N samples/pixels/vectors/whatever at once, where N = vector length / 4, and your code can automatically make use of chips that allow longer vectors without requiring a recompile.
Mind you, I don't know if that's the direction that the RISC-V people are taking. But it seems like a pretty obvious thing to do.
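To make the idea above concrete, here is a minimal sketch in plain C that emulates such a loop. Nothing here is a real ISA or intrinsic: `set_vl` is a hypothetical stand-in for a "set vector length" instruction, `hw_max` is an assumed hardware maximum (in elements, a multiple of 4, at least 4), and the inner loop stands in for a single vector instruction that shuffles within each group of 4 lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a "set vector length" instruction: returns how
   many elements the hardware will process in this iteration. */
static size_t set_vl(size_t remaining, size_t hw_max) {
    return remaining < hw_max ? remaining : hw_max;
}

/* Swap the R and B channels of every RGBA pixel (one byte per channel).
   n_bytes must be a multiple of 4; hw_max is the assumed hardware maximum
   vector length in elements (a multiple of 4, at least 4). */
void rgba_to_bgra(uint8_t *px, size_t n_bytes, size_t hw_max) {
    assert(hw_max >= 4 && hw_max % 4 == 0);
    assert(n_bytes % 4 == 0);
    for (size_t i = 0; i < n_bytes; ) {
        /* vl is always a multiple of 4 because both inputs are. */
        size_t vl = set_vl(n_bytes - i, hw_max);
        /* Emulates ONE vector instruction: a shuffle inside each group of 4. */
        for (size_t g = 0; g < vl; g += 4) {
            uint8_t r = px[i + g];
            px[i + g] = px[i + g + 2];
            px[i + g + 2] = r;
        }
        i += vl;
    }
}
```

The same compiled loop gives identical results whether `hw_max` is 4, 8, or 64; only the number of iterations changes, which is the "no recompile" property being described.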
In my opinion "the best" would be to support only one fixed, largest vector size and emulate legacy instructions for smaller sizes using microcode (this is slow, but requires a minimal transistor count). There is no need to optimize for legacy software; instead it should be recompiled, and the compiler should generate versions of the code for all existing generations of CPUs.
This way the CPU design can be simplified and all the complexity moved into compilers.
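A rough sketch of what "one version of the code per CPU generation" could look like at run time, in plain C. Everything here is illustrative: `add_w4` and `add_w8` stand in for two compiler-generated builds of the same source, and `select_add` takes the vector width as a parameter because how a program would actually query the CPU is left as an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*add_fn)(float *dst, const float *a, const float *b, size_t n);

/* Shared body: process full blocks of `width` elements, then a scalar tail.
   A real compiler would emit one vector instruction per block. */
static void add_blocks(float *dst, const float *a, const float *b,
                       size_t n, size_t width) {
    size_t i = 0;
    for (; i + width <= n; i += width)
        for (size_t k = 0; k < width; k++)
            dst[i + k] = a[i + k] + b[i + k];
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Stand-ins for the per-generation builds (hypothetical names). */
static void add_w4(float *dst, const float *a, const float *b, size_t n) {
    add_blocks(dst, a, b, n, 4);   /* e.g. 128-bit vectors of float */
}
static void add_w8(float *dst, const float *a, const float *b, size_t n) {
    add_blocks(dst, a, b, n, 8);   /* e.g. 256-bit vectors of float */
}

/* Pick a build once at startup. vector_bits is assumed to come from some
   platform-specific CPU query that is not shown here. */
add_fn select_add(unsigned vector_bits) {
    assert(vector_bits >= 128);
    return vector_bits >= 256 ? add_w8 : add_w4;
}
```

The cost is visible even in this toy: every hot function exists once per supported generation, so the binary grows with each generation the compiler targets.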
That would be awful for power consumption, and it only works well if everyone compiles everything from source. Otherwise every binary has awful performance for almost everyone.
One thing I learned from the M1 transition is that people will do it if someone tells them to. I bought a Mac this weekend to do exactly that; lots of users were complaining about the lack of M1 support. Time to add it (in a way that I can test). I have no choice.
Itanium tried to move branch prediction from the CPU to the compiler. Plus, its failure was not necessarily related to the CPU design; it may well have been bad business execution.
The grandparent comment is about a compiler generating multiple copies of the code for different iterations of a CPU architecture, so that later CPUs need not support all legacy instructions natively, or can at least implement them via microcode emulation.