
The RISC-V design is the sensible one since it does not hardcode the vector register size; the other designs are idiotic since a new instruction set needs to be created every time the vector register size is increased (MMX->SSE2->AVX2->AVX-512) and you can't use shorter vector registers on lower-power CPUs without reducing application compatibility.

And it doesn't "loop" in one instruction, the vsetvl instruction will set the vector length to the _minimum_ of the vector register size and the amount of data left to process.
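The strip-mining behavior described above can be modeled in plain C. This is a sketch, not real RVV code: `VLMAX` stands in for the hardware vector register length in elements, and each outer iteration models one `vsetvl` returning min(VLMAX, remaining).

```c
#include <stddef.h>

enum { VLMAX = 8 };  /* stand-in for the hardware vector length in elements */

/* Each outer iteration models one vsetvl + one vector add instruction. */
void vec_add(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < VLMAX) ? n - i : VLMAX;  /* what vsetvl returns */
        for (size_t j = 0; j < vl; j++)               /* one vector instruction */
            c[i + j] = a[i + j] + b[i + j];
        i += vl;  /* advance by the granted vl, not by VLMAX */
    }
}
```

The same binary loop works unchanged whether the hardware grants 8, 64, or 256 elements per strip, which is the forward-compatibility point being debated here.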




I have read about ideas to make forward-compatible vector instructions, but personally I don't see much value in such features.

If a new CPU with a different vector size appears, you can just recompile the program and let the compiler use the new instructions. Therefore it doesn't make sense to sacrifice chip area or performance for this feature.

Regarding backward compatibility (running old software on a next-generation CPU), I think that newer CPUs do not need to support obsolete instructions from previous generations. Instead they could just include a decoder that would (using microcode, and therefore slowly) emulate old instructions. This lets users run their package manager or recompile their software and switch to the new instructions. It is totally fine to completely redesign the ISA, provided that there is a minimal converter for old instructions, or at least support for a software-based translator like qemu.


How do you handle LMUL>1 without "looping" across the multiple destination registers?


Well, for example the Alibaba T-Head Xuantie C910 has 128 bit vector registers but a 256 bit vector ALU, so both LMUL=1 and LMUL=2 instructions execute in a single clock cycle. LMUL=8 instructions take 4 cycles to execute, but that's a 1024 bit vector right there.
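The cycle counts quoted above follow from simple division of the effective vector length by the ALU width. A back-of-the-envelope helper (the C910 figures are from the comment; the ceiling-division formula is an assumption about an idealized implementation):

```c
/* Cycles to push one vector op through the ALU, idealized:
   effective length = VLEN * LMUL, processed alu_bits per cycle. */
unsigned vop_cycles(unsigned vlen_bits, unsigned lmul, unsigned alu_bits) {
    unsigned total = vlen_bits * lmul;         /* effective vector length */
    return (total + alu_bits - 1) / alu_bits;  /* ceiling division */
}
```

With 128-bit registers and a 256-bit ALU: LMUL=1 and LMUL=2 both take 1 cycle, and LMUL=8 (1024 bits) takes 4, matching the figures above.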

Also, the person who said graphics uses 3 or 4 element vectors so you only need short vector registers didn't understand how a vector processor such as RISC-V will be programmed for that task. Usually you have thousands of XYZ or XYZW points that you want to transform by the same matrix, or normalize, or whatever. You don't load one point into one vector register. You load all the X coordinates from 4 or 8 or 64 or 256 points into one vector register, all the Y coordinates into another vector register, all the Z coordinates into another one. And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.
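The structure-of-arrays layout described above can be sketched in scalar C. Each array index plays the role of one vector lane; a real vector version would do each inner line on whole registers at once. The data layout, not the arithmetic, is the point.

```c
/* Transform n points by one shared 3x3 matrix, structure-of-arrays style:
   all X coordinates in one array, likewise Y and Z (one "register" each).
   The matrix m (row-major) lives in scalars, shared by every lane. */
void transform_soa(const float m[9],
                   const float *x, const float *y, const float *z,
                   float *ox, float *oy, float *oz, int n) {
    for (int i = 0; i < n; i++) {  /* each i = one vector lane */
        ox[i] = m[0]*x[i] + m[1]*y[i] + m[2]*z[i];
        oy[i] = m[3]*x[i] + m[4]*y[i] + m[5]*z[i];
        oz[i] = m[6]*x[i] + m[7]*y[i] + m[8]*z[i];
    }
}
```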


> And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.

What if you have a 4-vector and want its dot product? x86 SIMD has that. (It’s slow on AMD but was okay for me.)

Vector instructions could add that too, because they can add whatever they want. I’m just suspicious it would actually be efficient in hardware to be worth using.

ARM is developing SVE but is allowing hardware to require only powers of 2 vector sizes IIRC, which surely limits how you can use it.


The vector registers are powers of two in size, but both ARM SVE and RISC-V V let you conveniently process any vector shorter than the register size -- by setting the VL register in RISC-V, or by using a vector mask in SVE (RISC-V also does vector element masking) -- or longer than the vector register, by looping, with the last iteration possibly being smaller than the vector size.

"What if you have a 4-vector and want its dot product?"

Dot product with what?

Of course on RISC-V V you can load two 4-vectors into two vector registers (perhaps wasting most of the register if it has 256 or 512 or 1024 or more bits), multiply corresponding elements, then do an "add reduce" on the products.

But that's not very interesting. Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors? Load the 0th elements of 4 or 8 or 64 etc. into register v0, the 1st elements into v1, 2nd into v2, 3rd into v3. Then the same for the 4-vectors you want to dot product them with into v4, v5, v6, v7. Then do v0=v0*v4; v0+=v1*v5; v0+=v2*v6; v0+=v3*v7. And then store all the results from v0.
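The register sequence above, modeled in scalar C so it can be checked: `x0..x3` hold component i of the first batch of 4-vectors, `y0..y3` of the second, and each array index is one vector lane. A real RVV version would replace the loop body with four vector multiply-accumulate instructions.

```c
/* Batched dot products of n pairs of 4-vectors, structure-of-arrays:
   out[i] = (x0[i],x1[i],x2[i],x3[i]) . (y0[i],y1[i],y2[i],y3[i]) */
void dot4_batch(const float *x0, const float *x1,
                const float *x2, const float *x3,
                const float *y0, const float *y1,
                const float *y2, const float *y3,
                float *out, int n) {
    for (int i = 0; i < n; i++)  /* each i = one vector lane */
        out[i] = x0[i]*y0[i] + x1[i]*y1[i] + x2[i]*y2[i] + x3[i]*y3[i];
}
```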

People used to short SIMD registers look at vector problems along the wrong axis.


> Dot product with what?

Meant to say “two 4-vectors”.

> Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors?

Unfortunately not. I was thinking of a raytracer there, and it doesn’t have any more data available. I could speculate some more data or rewrite the program entirely to get some more, but the OoO and multi-core CPU is a good fit for the simplest pixel at a time approach to raytracing.

In the other case I’d use SIMD for, video codecs, there is definitely not any more data available because it’s a decompression algorithm and so it’s maximally unpredictable what the next compressed bit is going to tell you to do.


Xuantie 910 has 2x128b ALUs, which is kinda my point that the decoding of RISC-V vector instructions to uops is more complex than it needs to be, because it depends on the value of user-settable registers. XT-910 also highlights that with their speculation of vsetvl, something no other vector ISA has to do to be performant.


While I agree that the RISC-V V extension is questionable, it does not necessarily make decoding into uops more complex. When the vector register is allocated there is a known vector register size, and we then have 3 ways to go:

1) Producing several uops, one for each slice of the vector register;

2) Producing a single uop with a size field; the vector unit knows how to split the operation;

3) Producing a single size-agnostic uop; the vector unit learns the size from the vector register metadata.

The worst is probably 1): it will stress the ROB, uop queue, and scheduler, and make decoding far more complex. But solutions 2) and 3) keep the decoding simple and are much more resource efficient.

An implementation aspect that may be complicated is vector register allocation, which depends on the vector-unit microarchitecture (more specifically, on the vector register file microarchitecture).



