The vector registers are powers of two in size, but both ARM SVE and RISC-V V allow you to conveniently process any vector shorter than the register size -- by setting the VL register in RISC-V or by using a vector mask in SVE (RISC-V also does vector element masking) -- or longer than the vector register (by looping, with the last iteration possibly handling fewer elements than the register holds).
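The looping case can be sketched in plain C. This is only a model, not real vector code: `VLMAX` and `set_vl` are hypothetical stand-ins for the hardware register capacity and for what `vsetvli` does on RISC-V V (the real instruction is allowed to pick the length slightly differently), and the inner `for` loop stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register capacity in elements. On real hardware this is
   implementation-defined and discovered at run time. */
#define VLMAX 8

/* Simplified model of vsetvli: how many elements this pass handles. */
static size_t set_vl(size_t remaining) {
    return remaining < VLMAX ? remaining : VLMAX;
}

/* c[i] = a[i] + b[i] for any n, shorter or longer than VLMAX. */
void vec_add(float *c, const float *a, const float *b, size_t n) {
    assert(n == 0 || (a && b && c));
    while (n > 0) {
        size_t vl = set_vl(n);            /* the last pass may be short */
        for (size_t i = 0; i < vl; i++)   /* models one vector add */
            c[i] = a[i] + b[i];
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
    }
}
```

The same loop handles n = 3 (one short pass) and n = 11 (one full pass, then a pass of 3) without any separate tail code.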
"What if you have a 4-vector and want its dot product?"
Dot product with what?
Of course on RISC-V V you can load two 4-vectors into two vector registers (perhaps wasting most of the register if it has 256 or 512 or 1024 or more bits), multiply corresponding elements, then do an "add reduce" on the products.
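As a rough scalar model of that single-pair approach (the name `dot4` is made up for illustration, and each loop stands in for one vector instruction executed with a vector length of 4):

```c
#include <assert.h>

/* One dot product of two 4-vectors: elementwise multiply, then an
   add-reduction across the four lanes. */
float dot4(const float a[4], const float b[4]) {
    assert(a && b);
    float prod[4];
    for (int i = 0; i < 4; i++)   /* models one vector multiply, vl = 4 */
        prod[i] = a[i] * b[i];
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)   /* models the "add reduce" step */
        sum += prod[i];
    return sum;
}
```

Only four lanes do useful work here, however wide the register is, and the reduction is a cross-lane operation.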
But that's not very interesting. Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors? Load the 0th elements of 4 or 8 or 64 etc. of them into register v0, the 1st elements into v1, 2nd into v2, 3rd into v3. Then the same for the 4-vectors you want to dot product them with into v4, v5, v6, v7. Then do v0 = v0*v4; v0 += v1*v5; v0 += v2*v6; v0 += v3*v7. And then store all the results from v0.
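Here is a minimal scalar sketch of that layout, assuming the 4-vectors are stored as a structure of arrays. The `Vec4Batch` type and `dot4_batch` name are hypothetical; on real hardware each statement in the loop body would be one vector instruction covering many values of `i` at once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure-of-arrays layout: the components of vector i
   are x[i], y[i], z[i], w[i]. */
typedef struct {
    const float *x, *y, *z, *w;
} Vec4Batch;

/* out[i] = dot(a_i, b_i) for i in [0, n). */
void dot4_batch(float *out, Vec4Batch a, Vec4Batch b, size_t n) {
    assert(n == 0 || out);
    for (size_t i = 0; i < n; i++) {
        float acc = a.x[i] * b.x[i];   /* v0  = v0*v4 */
        acc += a.y[i] * b.y[i];        /* v0 += v1*v5 */
        acc += a.z[i] * b.z[i];        /* v0 += v2*v6 */
        acc += a.w[i] * b.w[i];        /* v0 += v3*v7 */
        out[i] = acc;                  /* store results from v0 */
    }
}
```

No cross-lane reduction is needed: every lane computes one complete dot product, so all lanes stay busy regardless of register width.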
People used to short SIMD registers look at vector problems along the wrong axis.
> Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors?
Unfortunately not. I was thinking of a raytracer there, and it doesn’t have any more data available. I could speculate some more data or rewrite the program entirely to get some more, but the OoO and multi-core CPU is a good fit for the simplest pixel at a time approach to raytracing.
In the other case I’d use SIMD for, video codecs, there is definitely not any more data available because it’s a decompression algorithm and so it’s maximally unpredictable what the next compressed bit is going to tell you to do.