Mary Payne designed the VAX floating point POLY instruction at DEC.
But the microcode and hardware floating point implementations did it slightly differently. Then the MicroVAX dropped it, then picked it up again but wrong, then fixed it, then lost it again.
>The Digital Equipment Corporation (DEC) VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single FMA step. This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977.
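For reference, Horner's rule is just a chain of multiply-add steps. A minimal scalar sketch of the computation POLY performs (function and parameter names made up; not the actual VAX semantics, and whether each step is a single fused rounding is exactly the ambiguity mentioned above):

```cpp
#include <cmath>
#include <cstddef>

// p(x) = c[0] + x*(c[1] + x*(c[2] + ...)), evaluated from the highest degree down.
// std::fma makes each multiply-add a single rounding; a non-fused
// implementation would round twice per step.
double poly_horner(double x, const double* c, std::size_t degree) {
    double acc = c[degree];
    for (std::size_t i = degree; i-- > 0; )
        acc = std::fma(acc, x, c[i]);   // acc = acc*x + c[i]
    return acc;
}
```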
Complex instructions are often good ideas. They’re best at combining lots of bit-shifting (hardware is good at that and can factor things out), but they can be good even for memory ops (the HW can optimize them by knowing things like cache line sizes).
They get a bad rap because the only really complex ISA left is x86, and it just had especially bad ideas about which operations to spend its shortest encodings on. Nobody uses BOUND, to the point that some CPUs don’t even include it.
One point against them in SIMD is that there definitely is an instruction explosion there, but I haven’t seen a convincing better idea, and I think the RISC-V people’s vector proposal is bad and shows they have serious issues knowing what they’re talking about.
They want to go back to the 70s (not meant as an insult) and use Cray-style vector instructions instead of SIMD. Vector instructions are kind of like loops in one instruction; you set a vector length register and then all the vector instructions change to run on that much data.
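Concretely the model looks something like this: a rough sketch of a strip-mined loop with the RISC-V vector intrinsics (assuming the __riscv_-prefixed naming used by recent GCC/Clang; function name and details are just illustrative):

```cpp
#include <riscv_vector.h>
#include <cstddef>
#include <cstdint>

// c[i] = a[i] + b[i]. vsetvl returns how many elements this iteration will
// process (at most the hardware's register capacity), so the same binary
// runs unchanged on implementations with different vector lengths.
void add_arrays(const int32_t* a, const int32_t* b, int32_t* c, std::size_t n) {
    while (n > 0) {
        std::size_t vl = __riscv_vsetvl_e32m1(n);        // vl = min(VLMAX, n)
        vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);    // load vl elements
        vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
        vint32m1_t vc = __riscv_vadd_vv_i32m1(va, vb, vl);
        __riscv_vse32_v_i32m1(c, vc, vl);                // store vl elements
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```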
That’s ok for big vectors you’d find in like scientific computations, but I believe it’s bad for anything I’ve ever written in SIMD. Games and multimedia usually use short vectors (say 4 ints) or mixed lengths (2 and 4 at once) and are pretty comfortable with how x86/ARM/PPC work.
Not saying it couldn’t be better, but the RISC-V designers wrote an article about how their approach would be better basically entirely because they thought SIMD adds too many new instructions and isn’t aesthetically pretty enough. Which doesn’t matter.
Also I remember them calling SIMD something weird and politically incorrect in the article but can’t remember what it was…
Isn't there a sort of obvious "best of both worlds" by having a vector instruction ISA with a vector length register, plus the promise that a length >= 4 is always supported and good support for intra-group-of-4 shuffles?
Then you can do whatever you did for games and multimedia in the past, except that you can process N samples/pixels/vectors/whatever at once, where N = vector length / 4, and your code can automatically make use of chips that allow longer vectors without requiring a recompile.
Mind you, I don't know if that's the direction that the RISC-V people are taking. But it seems like a pretty obvious thing to do.
In my opinion "the best" would be to support only one, fixed, largest vector size and emulate legacy instructions for smaller sizes using microcode (this is slow, but requires a minimal transistor count). There is no need to optimize for legacy software; instead it should be recompiled, and the compiler should generate versions of the code for all existing generations of CPUs.
This way the CPU design can be simplified and all the complexity moved into compilers.
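For what it's worth, compilers can already do a limited form of this per function today. A sketch using GCC/Clang's target_clones attribute (x86 feature names and the function name are purely illustrative):

```cpp
#include <cstddef>

// The compiler emits one clone per listed target plus a resolver that picks
// the best supported one at load time, so a single binary carries code
// tuned for several CPU generations.
__attribute__((target_clones("avx512f", "avx2", "sse4.2", "default")))
void scale(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;   // plain loop; each clone gets vectorized differently
}
```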
That would be awful for power consumption and only works well if everyone compiles everything from source. Otherwise every binary has awful performance for almost everyone.
One thing I learned from the M1 transition is that people will do it if someone tells them to. I bought a Mac this weekend to do exactly that; lots of users complaining about lack of M1 support. Time to add it (in a way that I can test). I have no choice.
Itanium tried to move branch prediction from the CPU to the compiler. Plus its failure was not necessarily related to the CPU design; it could well have been bad business execution.
The grandparent comment is about a compiler generating multiple code copies for different CPU architecture iterations, so that later CPUs don't have to support all legacy instructions, or can at least implement them via microcode emulation.
The RISC-V design is the sensible one since it does not hardcode the vector register size; the other designs are idiotic since a new instruction set needs to be created every time the vector register size is increased (MMX->SSE2->AVX2->AVX-512) and you can't use shorter vector registers on lower-power CPUs without reducing application compatibility.
And it doesn't "loop" in one instruction: the vsetvl instruction sets the vector length to the _minimum_ of the vector register size and the amount of data left to process.
I have read about ideas to make forward-compatible vector instructions, but personally I don't see much value in such features.
If a new CPU appears with a different vector size then you can just recompile the program and let the compiler use the new instructions. Therefore it doesn't make sense to sacrifice chip area or performance for this feature.
Regarding backward compatibility (running old software on a next-generation CPU), I think that newer CPUs do not need to support obsolete instructions from previous generations. Instead they could just include a decoder that would (using microcode, and therefore slowly) emulate old instructions. This lets users run their package manager or recompile their software and switch to the new instructions. It is totally fine to completely redesign the ISA provided that there is a minimal converter for old instructions, or at least support for a software-based translator like qemu.
Well, for example the Alibaba T-Head Xuantie C910 has 128 bit vector registers but a 256 bit vector ALU, so both LMUL=1 and LMUL=2 instructions execute in a single clock cycle. LMUL=8 instructions take 4 cycles to execute, but that's a 1024 bit vector right there.
Also, the person who said graphics uses 3 or 4 element vectors so you only need short vector registers didn't understand how a vector processor such as RISC-V will be programmed for that task. Usually you have thousands of XYZ or XYZW points that you want to transform by the same matrix, or normalize, or whatever. You don't load one point into one vector register. You load all the X coordinates from 4 or 8 or 64 or 256 points into one vector register, all the Y coordinates into another vector register, all the Z coordinates into another one. And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.
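A rough sketch of what that looks like with the RISC-V V intrinsics (same naming caveat as the earlier example, and the function name is made up; every point is transformed by one row-major 3x3 matrix passed in as scalars):

```cpp
#include <riscv_vector.h>
#include <cstddef>

// Structure-of-arrays: xs/ys/zs each hold one coordinate for n points.
// Each strip of vl points is transformed in parallel; the matrix elements
// stay in scalar registers.
void transform_points(const float* xs, const float* ys, const float* zs,
                      float* ox, float* oy, float* oz,
                      const float m[9], std::size_t n) {
    while (n > 0) {
        std::size_t vl = __riscv_vsetvl_e32m1(n);
        vfloat32m1_t x = __riscv_vle32_v_f32m1(xs, vl);
        vfloat32m1_t y = __riscv_vle32_v_f32m1(ys, vl);
        vfloat32m1_t z = __riscv_vle32_v_f32m1(zs, vl);

        // x' = m[0]*x + m[1]*y + m[2]*z  (vfmacc: acc += scalar * vector)
        vfloat32m1_t rx = __riscv_vfmul_vf_f32m1(x, m[0], vl);
        rx = __riscv_vfmacc_vf_f32m1(rx, m[1], y, vl);
        rx = __riscv_vfmacc_vf_f32m1(rx, m[2], z, vl);
        // y' = m[3]*x + m[4]*y + m[5]*z
        vfloat32m1_t ry = __riscv_vfmul_vf_f32m1(x, m[3], vl);
        ry = __riscv_vfmacc_vf_f32m1(ry, m[4], y, vl);
        ry = __riscv_vfmacc_vf_f32m1(ry, m[5], z, vl);
        // z' = m[6]*x + m[7]*y + m[8]*z
        vfloat32m1_t rz = __riscv_vfmul_vf_f32m1(x, m[6], vl);
        rz = __riscv_vfmacc_vf_f32m1(rz, m[7], y, vl);
        rz = __riscv_vfmacc_vf_f32m1(rz, m[8], z, vl);

        __riscv_vse32_v_f32m1(ox, rx, vl);
        __riscv_vse32_v_f32m1(oy, ry, vl);
        __riscv_vse32_v_f32m1(oz, rz, vl);
        xs += vl; ys += vl; zs += vl;
        ox += vl; oy += vl; oz += vl;
        n -= vl;
    }
}
```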
> And you put the 9 or 12 elements of the matrices you want to transform them by into other vector registers (or scalar registers if you're transforming everything by the same matrix) and you transform a whole bunch of points in parallel.
What if you have a 4-vector and want its dot product? x86 SIMD has that. (It’s slow on AMD but was okay for me.)
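Something like this, with SSE4.1’s DPPS (minimal sketch, function name made up):

```cpp
#include <immintrin.h>

// Dot product of two 4-float vectors in one instruction. In the 0xF1 mask,
// the high nibble selects all four lanes for the multiply and the low nibble
// writes the sum into lane 0 only.
float dot4(__m128 a, __m128 b) {
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}
```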
Vector instructions could add that too, because they can add whatever they want. I’m just suspicious it would actually be efficient in hardware to be worth using.
ARM is developing SVE but is allowing hardware to require only powers of 2 vector sizes IIRC, which surely limits how you can use it.
The vector registers are powers of two in size, but both ARM SVE and RISC-V V allow you to conveniently process any vector shorter than the register size -- by setting the VL register in RISC-V or by using a vector mask in SVE (RISC-V also does vector element masking) -- or longer than the vector register (by looping, with the last iteration possibly being smaller than the vector size).
"What if you have a 4-vector and want its dot product?"
Dot product with what?
Of course on RISC-V V you can load two 4-vectors into two vector registers (perhaps wasting most of the register if it has 256 or 512 or 1024 or more bits), multiply corresponding elements, then do an "add reduce" on the products.
But that's not very interesting. Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors? Load the 0th elements of 4 or 8 or 64 etc into register v0, the 1st elements into v1, 2nd into v2, 3rd into v3. Then the same for the 4-vectors you want to dot product them with into v4, v5, v6, v7. Then do v0 = v0*v4; v0 += v1*v5; v0 += v2*v6; v0 += v3*v7. And then store all the results from v0.
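In intrinsics form that loop might look roughly like this (same naming caveat as before, names made up, data laid out structure-of-arrays):

```cpp
#include <riscv_vector.h>
#include <cstddef>

// out[i] = dot(a_i, b_i) for n pairs of 4-vectors; a0..a3 hold the four
// components of the left vectors, b0..b3 those of the right ones.
void dot4_many(const float* a0, const float* a1, const float* a2, const float* a3,
               const float* b0, const float* b1, const float* b2, const float* b3,
               float* out, std::size_t n) {
    while (n > 0) {
        std::size_t vl = __riscv_vsetvl_e32m1(n);
        vfloat32m1_t acc = __riscv_vfmul_vv_f32m1(__riscv_vle32_v_f32m1(a0, vl),
                                                  __riscv_vle32_v_f32m1(b0, vl), vl);
        acc = __riscv_vfmacc_vv_f32m1(acc, __riscv_vle32_v_f32m1(a1, vl),
                                           __riscv_vle32_v_f32m1(b1, vl), vl);
        acc = __riscv_vfmacc_vv_f32m1(acc, __riscv_vle32_v_f32m1(a2, vl),
                                           __riscv_vle32_v_f32m1(b2, vl), vl);
        acc = __riscv_vfmacc_vv_f32m1(acc, __riscv_vle32_v_f32m1(a3, vl),
                                           __riscv_vle32_v_f32m1(b3, vl), vl);
        __riscv_vse32_v_f32m1(out, acc, vl);   // vl dot products per iteration
        a0 += vl; a1 += vl; a2 += vl; a3 += vl;
        b0 += vl; b1 += vl; b2 += vl; b3 += vl;
        out += vl; n -= vl;
    }
}
```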
People used to short SIMD registers look at vector problems along the wrong axis.
> Don't you want to do dot products on hundreds or thousands of pairs of 4-vectors?
Unfortunately not. I was thinking of a raytracer there, and it doesn’t have any more data available. I could speculate some more data or rewrite the program entirely to get some more, but the OoO and multi-core CPU is a good fit for the simplest pixel at a time approach to raytracing.
In the other case I’d use SIMD for, video codecs, there is definitely not any more data available because it’s a decompression algorithm and so it’s maximally unpredictable what the next compressed bit is going to tell you to do.
The Xuantie 910 has 2x128b ALUs, which kinda makes my point that the decoding of RISC-V vector instructions into uops is more complex than it needs to be, because it depends on the value of user-settable registers. The XT-910 also highlights that with its speculation of vsetvl, something no other vector ISA has to do to be performant.
While I agree that the RISC-V V extension is questionable, it does not necessarily make decoding into uops more complex.
When the vector register is allocated, the vector register size is known, and there are then 3 ways to go:
1) Produce several uops, one for each slice of the vector register;
2) Produce a single uop with a size field; the vector unit knows how to split the operation;
3) Produce a single size-agnostic uop; the vector unit knows the size from the vector register metadata.
The worst is probably 1): it will stress the ROB, uop queue and the scheduler, and make the decoding way more complex. But solutions 2) and 3) keep the decoding simple and are much more resource-efficient.
An implementation aspect that may be complicated is the vector register allocation, depending on the vector-unit microarchitecture (more specifically, on the vector register file microarchitecture).
To me, the much bigger problem with SIMD is that it is really hard to program for since vector sizes keep getting bigger, and it really doesn't look good at all for big-little designs which seem to be the future. Most compilers still aren't good at generating AVX-512 code, and very few applications use it because it requires optimization for a very small portion of the market. With a variable length vector, everyone gets good code.
Also, I'm not clear why you think riscv style vector instructions will perform worse on 2-4 length vectors.
> Also, I'm not clear why you think riscv style vector instructions will perform worse on 2-4 length vectors.
That assumes there’s no switching cost to changing it and that there won’t be restrictions on the values it can be set to. Weird restrictions on memory loads are pretty traditional for SIMD instructions, after all.
The switching cost is the biggest problem; this is exactly the kind of thing x86 has specified and then found it can’t do without microcoding it or otherwise making it cripplingly inefficient on some HW generations.
> the much bigger problem with SIMD is that it is really hard to program for since vector sizes keep getting bigger
This can be fixed with an abstraction layer such as github.com/google/highway (disclosure: I am the main author). The programming model is variable-length vectors and we can map that both to SSE4/AVX2/AVX-512 and SVE/RVV.
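A rough sketch of the programming model (simplified from the examples in the repo, so names may not match the current API exactly, and real code needs the per-target dispatch boilerplate):

```cpp
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// x[i] = mul[i] * x[i] + add[i], written once against whatever vector length
// the target provides: Lanes(d) is a compile-time constant on SSE4/AVX2/
// AVX-512 and a runtime value on SVE/RVV.
void MulAddLoop(const float* mul, const float* add, size_t size, float* x) {
  const hn::ScalableTag<float> d;
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto m = hn::Load(d, mul + i);
    const auto a = hn::Load(d, add + i);
    hn::Store(hn::MulAdd(m, hn::Load(d, x + i), a), d, x + i);
  }
}
```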
It seems to me that the actual pain (of large numbers of unique instructions in the ISA) is pushed to the CPU decoder, binutils, and abstraction layer.
> doesn't look good at all for big-little designs which seem to be the future
Why not? It's not clear to me why OSes shouldn't provide a "don't migrate me while I'm checking CPU capabilities and then running this SIMD/vector kernel".
> very few applications use it [AVX-512] because it requires optimization for a very small portion of the market.
Maybe so (though increasing), but the cost is also low - we literally just pay with binary size and compile time (for the extra codepath).
Compilers and languages have had a very long time to work on autovectorizing existing code for SIMD (54 years since ILLIAC, 25+ since SPARC VIS and MMX from Intel in 1994/1997). It mostly doesn't work for C/C++. The language has to support it; otherwise optimization opportunities will be constantly blocked by lack of information and/or the restrictively specified semantics of operations.
The potential of transparent SIMD use by compilers has apparently stayed marginal enough that no mainstream application language in all that time has been updated to facilitate it.
Ok maybe it is just a point against really complex instructions. There's clearly an optimum middle ground.