Well, for example the Alibaba T-Head Xuantie C910 has 128 bit vector registers but a 256 bit vector ALU, so both LMUL=1 and LMUL=2 instructions execute in a single clock cycle. LMUL=8 instructions take 4 cycles to execute, but that's a 1024 bit vector right there.
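To make the arithmetic behind those cycle counts explicit, here is a rough sketch in C. It takes the figures quoted above at face value (128-bit registers, 256 bits of ALU width per cycle) and ignores pipelining, chaining and memory effects. The function name `vector_op_cycles` is made up for illustration.

```c
/* Rough cycle estimate for one vector instruction.
   A register group holds vlen_bits * lmul bits, and the ALU is assumed
   to consume alu_bits of them per cycle (hypothetical simple model). */
unsigned vector_op_cycles(unsigned vlen_bits, unsigned lmul, unsigned alu_bits)
{
    unsigned group_bits = vlen_bits * lmul;
    return (group_bits + alu_bits - 1) / alu_bits;  /* ceiling division */
}
```

With `vlen_bits = 128` and `alu_bits = 256`, LMUL=1 and LMUL=2 both come out as one cycle and LMUL=8 as four, which matches the numbers above.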
Also, the person who said graphics uses 3 or 4 element vectors so you only need short vector registers didn't understand how a vector processor such as RISC-V will be programmed for that task. Usually you have thousands of XYZ or XYZW points that you want to transform by the same matrix or normalize or whatever. You don't load one point into one vector register. You load all the X coordinates from 4 or 8 or 64 or 256 points into one vector register, all the Y coordinates into another vector register, all the Z coordinates into another one. And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.
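As a sketch of that layout, here is the "same matrix for every point" case in plain C rather than vector intrinsics. The function name and the row-major 3x3 matrix layout are assumptions for illustration. The point is that X, Y and Z live in separate arrays, so each loop iteration corresponds to one vector element and the matrix entries act as scalars.

```c
#include <stddef.h>

/* Transform n points by one 3x3 matrix m (row-major, 9 elements).
   Points are stored as separate X, Y and Z arrays (structure of arrays),
   so every iteration is independent of the others. */
void transform_points_soa(const float m[9],
                          const float *x, const float *y, const float *z,
                          float *ox, float *oy, float *oz, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Read the inputs first so the function also works in place. */
        float xi = x[i], yi = y[i], zi = z[i];
        ox[i] = m[0] * xi + m[1] * yi + m[2] * zi;
        oy[i] = m[3] * xi + m[4] * yi + m[5] * zi;
        oz[i] = m[6] * xi + m[7] * yi + m[8] * zi;
    }
}
```

On a vector machine the loop body would become a handful of vector-times-scalar multiply-adds over as many points as fit in a register.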
> And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.
What if you have a 4-vector and want its dot product? x86 SIMD has that. (It’s slow on AMD but was okay for me.)
Vector instructions could add that too, because they can add whatever they want. I’m just skeptical it would be efficient enough in hardware to be worth using.
ARM is developing SVE but is allowing hardware to require only powers of 2 vector sizes IIRC, which surely limits how you can use it.
The vector registers are powers of two in size, but both ARM SVE and RISC-V V allow you to conveniently process any vector shorter than the register size -- by setting the VL register in RISC-V or by using a vector mask in SVE (RISC-V also does vector element masking) -- or longer than the vector register (by looping, with the last iteration possibly being smaller than the vector size).
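The looping case can be sketched in plain C. Here `VLMAX` and `set_vl` are hypothetical stand-ins for the register capacity and for a RISC-V `vsetvl`-style instruction. Real hardware may choose the granted length differently, so treat this as a simplified model, not the actual ISA behaviour.

```c
#include <stddef.h>

enum { VLMAX = 8 };  /* hypothetical: elements that fit in one vector register */

/* Simplified stand-in for vsetvl: grant the requested element count,
   capped at what one register holds. */
static size_t set_vl(size_t remaining)
{
    return remaining < VLMAX ? remaining : VLMAX;
}

/* c[i] = a[i] + b[i] for any n, processed in register-sized strips.
   Returns the number of strips, i.e. passes through the vector loop. */
size_t add_arrays_stripmined(const float *a, const float *b, float *c, size_t n)
{
    size_t strips = 0;
    for (size_t done = 0; done < n; ) {
        size_t vl = set_vl(n - done);        /* last strip may be shorter */
        for (size_t j = 0; j < vl; j++)      /* one vector add's worth of work */
            c[done + j] = a[done + j] + b[done + j];
        done += vl;
        strips++;
    }
    return strips;
}
```

Note there is no separate scalar tail loop: the final, shorter strip goes through the same code path with a smaller `vl`.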
"What if you have a 4-vector and want its dot product?"
Dot product with what?
Of course on RISC-V V you can load two 4-vectors into two vector registers (perhaps wasting most of the register if it has 256 or 512 or 1024 or more bits), multiply corresponding elements, then do an "add reduce" on the products.
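In scalar C the single-pair version looks roughly like this. The two loops stand in for one vector multiply and one add reduction. The name `dot4_reduce` is made up for illustration.

```c
/* Dot product of one pair of 4-vectors: multiply corresponding elements,
   then sum the products. */
float dot4_reduce(const float a[4], const float b[4])
{
    float prod[4];
    for (int i = 0; i < 4; i++)   /* corresponds to one vector multiply */
        prod[i] = a[i] * b[i];

    float sum = 0.0f;
    for (int i = 0; i < 4; i++)   /* corresponds to the add reduction */
        sum += prod[i];
    return sum;
}
```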
But that's not very interesting. Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors? Load the 0th elements of 4 or 8 or 64 etc into register v0, the 1st elements into v1, 2nd into v2, 3rd into v3. Then the same for the 4-vectors you want to dot product them with into v4, v5, v6, v7. Then do v0 = v0 * v4; v0 += v1 * v5; v0 += v2 * v6; v0 += v3 * v7. And then store all the results from v0.
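Here is that batched scheme sketched in plain C. The arrays `a0..a3` and `b0..b3` play the role of v0..v3 and v4..v7, with element `i` of each array belonging to pair `i`. The function name and argument layout are assumptions for illustration.

```c
#include <stddef.h>

/* n dot products of 4-vector pairs, stored component-wise:
   a0[i]..a3[i] is the i-th left vector, b0[i]..b3[i] the i-th right one. */
void dot4_batch_soa(const float *a0, const float *a1,
                    const float *a2, const float *a3,
                    const float *b0, const float *b1,
                    const float *b2, const float *b3,
                    float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float acc = a0[i] * b0[i];   /* v0  = v0 * v4 */
        acc += a1[i] * b1[i];        /* v0 += v1 * v5 */
        acc += a2[i] * b2[i];        /* v0 += v2 * v6 */
        acc += a3[i] * b3[i];        /* v0 += v3 * v7 */
        out[i] = acc;                /* store results from v0 */
    }
}
```

No reduction across elements is needed: each lane computes its own complete dot product.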
People used to short SIMD registers look at vector problems along the wrong axis.
> Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors?
Unfortunately not. I was thinking of a raytracer there, and it doesn’t have any more data available. I could speculate some more data or rewrite the program entirely to get some more, but the OoO and multi-core CPU is a good fit for the simplest pixel at a time approach to raytracing.
In the other case I’d use SIMD for, video codecs, there is definitely not any more data available because it’s a decompression algorithm and so it’s maximally unpredictable what the next compressed bit is going to tell you to do.
Xuantie 910 has 2x128b ALUs, which is kinda my point that the decoding of RISC-V vector instructions to uops is more complex than it needs to be, because it depends on the value of user-settable registers. XT-910 also highlights that with their speculation of vsetvl, something no other vector ISA has to do to be performant.
While I agree that the RISC-V V extension is questionable, it does not necessarily make decoding into uops more complex.
When the vector register is allocated there is a known vector register size, and we then have 3 ways to go:
1) Producing several uops, one for each cut in the vector register;
2) Producing a single uop with a size field, the vector unit knows how to split the operation;
3) Producing a single size-agnostic uop, the vector unit knows the size from the vector register metadata.
The worst is probably 1), it will stress the ROB, uop queue and the scheduler, and make the decoding way more complex. But solutions 2) and 3) keep the decoding simple and are much more resource efficient.
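A toy model of the difference in uop count, in C. The enum names, the function name and the idea of measuring "cuts" in bits are all assumptions made for illustration. This counts only decoded uops (and hence ROB and scheduler entries), not execution time.

```c
/* The three decode strategies listed above (names are hypothetical). */
enum decode_strategy {
    SPLIT_PER_SLICE          = 1,  /* one uop per cut of the register group */
    SINGLE_UOP_WITH_SIZE     = 2,  /* one uop carrying a size field */
    SINGLE_UOP_SIZE_AGNOSTIC = 3   /* one uop, size read from register metadata */
};

/* Number of uops the decoder emits for one vector instruction that covers
   group_bits of register state, when the datapath works on slice_bits. */
unsigned uops_per_instruction(enum decode_strategy s,
                              unsigned group_bits, unsigned slice_bits)
{
    if (s == SPLIT_PER_SLICE)
        return (group_bits + slice_bits - 1) / slice_bits;  /* ceiling division */
    return 1;  /* strategies 2 and 3 leave the splitting to the vector unit */
}
```

Under this model a 1024-bit register group on a 128-bit datapath costs eight uops with strategy 1 and one uop with strategies 2 and 3.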
One implementation aspect that may be complicated is vector register allocation, which depends on the vector-unit microarchitecture (more specifically, on the vector register file microarchitecture).