Cray-2 vectorization instruction notes by former Principal Engineer of Cray

fulafel · on Dec 9, 2019

Back when the Cell was around (PS3), there were sometimes discussions about the similarities between it and the Cray: local memory, and a different instruction set for the vector processors vs. the "CPU". I guess the Cray was easier to program because you could still address the shared memory from the vector programs without DMA or other hoops.

Const-me · on Dec 9, 2019

I guess Cell is more efficient due to the DMA. Note the Cray manual says "There is no path between Local Memory and real memory. Vector registers must be used to implement block copies.", i.e. you have to spend CPU cycles copying data.

datenwolf · on Dec 9, 2019

> i.e. you have to spend CPU cycles copying data

Same for x86. The "string" instructions for copying between regions of memory are the closest thing to memory-to-memory DMA, a feature commonplace on microcontrollers.

fulafel · on Dec 11, 2019

The cache is the only[1] core-private memory in x86. If you think about the cache as the local storage, there's automatic memory-to-memory DMA between cache and shared main memory :)

[1] Of course there is some architected program state including the registers, fp stack, flags, and various other processor state that can be saved/loaded but let's keep to byte-addressable storage

Erwin · on Dec 9, 2019

I like the infix notation e.g. "Vi Pvj" the Population count instruction, working on vector j as input and outputting into i the count of bits in each element.

Versus Intel's vpopcntX reg1, reg2 where X determines element size.
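
For anyone unfamiliar with what a per-element population count does, here is a minimal Python sketch. The helper name `vector_popcount` and the `element_bits` parameter are made up for illustration; this only models the semantics, not how either the Cray or Intel hardware implements it.

```python
def vector_popcount(vj, element_bits=64):
    """Return a new vector holding the number of set bits in each element of vj."""
    mask = (1 << element_bits) - 1  # keep only the low element_bits of each value
    return [bin(x & mask).count("1") for x in vj]

# Hypothetical input vector; each output element is the bit count of the input element.
vj = [0b0, 0b1011, 0xFF, (1 << 64) - 1]
vi = vector_popcount(vj)
print(vi)  # [0, 3, 8, 64]
```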

Seems like Cray had several 64*64 = 4096 bit vector registers, but you worked on it only 64 bit at a time, whereas current Intel CPUs have 512-bit vector registers, up from 256-bit for AVX2.

Are those Intel vector register sizes going to increase until they catch up to the old Cray? Or was going up from 256 to 512 bit chosen to fit something else in the CPU architecture, like that you can fill the register in so many clock cycles?

fsfod · on Dec 9, 2019

Some of the design decisions that led to AVX512 are explained in these slides from Tom Forsyth: http://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4%... One of the reasons mentioned is that 512 bits is the same size as a cache line.

tom_mellior · on Dec 9, 2019

> Seems like Cray had several 64*64 = 4096 bit vector registers, but you worked on it only 64 bit at a time

What makes you say that? There seem to be the usual vector-vector instructions:

    161ijk  Vi Vj+Vk

> Are those Intel vector register sizes going to increase

I don't see how they could. The vector size has increased from XMM to YMM to ZMM, there is obviously no more room for expansion ;-)

dr_zoidberg · on Dec 9, 2019

> I don't see how they could. The vector size has increased from XMM to YMM to ZMM, there is obviously no more room for expansion ;-)

One thought comes to mind:

    WMM, standing for "Wider MM register"

But then another one, like they've already done:

    EXMM, EYMM, EZMM -- as in A, AX, EAX, RAX...

(captain obvious: I know it was a joke...)

Edit: formatting

jabl · on Dec 9, 2019

> > but you worked on it only 64 bit at a time

> What makes you say that?

Ye olde Crays used 'vector pipelining', meaning that while vector registers held many elements, there was only one ALU. So a single vector instruction took many cycles to execute. OTOH this enabled the execution units to be well utilized even without a cache, heroic OoO etc.
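
As a toy model of that idea, assuming a single adder that accepts one element pair per cycle and ignoring startup latency and chaining, one `Vi Vj+Vk` instruction keeps the adder busy for 64 cycles. The function name and the cycle accounting below are my own simplification, not something taken from the Cray documentation.

```python
VECTOR_LENGTH = 64  # elements per vector register, as discussed in this thread

def vector_add_single_alu(vj, vk):
    """Model one 'Vi Vj+Vk' instruction on a machine with a single adder.

    Returns the result vector and the number of cycles the adder was busy,
    ignoring startup latency.
    """
    vi = []
    busy_cycles = 0
    for a, b in zip(vj, vk):
        vi.append(a + b)   # one element pair enters the adder per cycle
        busy_cycles += 1
    return vi, busy_cycles

vj = list(range(VECTOR_LENGTH))
vk = [10] * VECTOR_LENGTH
vi, busy_cycles = vector_add_single_alu(vj, vk)
print(busy_cycles)  # 64 cycles of work from a single issued instruction
```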

gpderetta · on Dec 9, 2019

Yes, if I understand correctly, at the time instruction fetch/dispatch was the bottleneck, so vector instructions would keep the execution units busy with data streamed directly from main memory (there was no need for cache because, at least for throughput oriented applications, main memory was not significantly slower than the cpu itself).

jabl · on Dec 9, 2019

Also, Crays of yore used SRAM for main memory. And back then there was also much less of a gap between memory bus speed and cpu speed. This combined with the vector pipelining made caches somewhat unnecessary.

jtlienwis · on Dec 9, 2019

The Cray-2 was all DRAM, except for maybe one of the first to ship, which was SRAM.

dragontamer · on Dec 9, 2019

Note that last-gen AMD Vega64 GPUs would still be "vector-pipelined" across 4-clock ticks.

A 64x wide logical wavefront on a Vega64 would be physically executed by a 16x wide ALU. The ALU would pipeline itself over 4-clock ticks, providing the programmer a logical 64x wide vector.
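
A rough sketch of that mapping in Python, assuming the 64 lanes are simply processed in four consecutive groups of 16 (the function name and the grouping order are illustrative assumptions, not taken from AMD's documentation):

```python
WAVEFRONT_SIZE = 64   # logical lanes seen by the programmer
PHYSICAL_LANES = 16   # lanes the ALU can process per clock tick

def execute_wavefront(op, operands):
    """Apply op to all 64 lanes, 16 at a time, and count the clock ticks used."""
    assert len(operands) == WAVEFRONT_SIZE
    results = []
    ticks = 0
    for start in range(0, WAVEFRONT_SIZE, PHYSICAL_LANES):
        chunk = operands[start:start + PHYSICAL_LANES]
        results.extend(op(x) for x in chunk)  # one 16-wide pass per tick
        ticks += 1
    return results, ticks

results, ticks = execute_wavefront(lambda x: x * 2, list(range(WAVEFRONT_SIZE)))
print(ticks)  # 4
```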

dragontamer · on Dec 9, 2019

> Are those Intel vector register sizes going to increase until they catch up to the old Cray? Or was going up from 256 to 512 bit chosen to fit something else in the CPU architecture, like that you can fill the register in so many clock cycles?

The opposite. GPUs seem to be converging onto 1024-bits wide (32x 32-bits)

GPUs used to be 64x 32-bits wide (2048-bits), but both AMD and NVidia seem to have settled on 32x 32-bits wide (1024-bits).

It seems that at the point of ~1024 bits wide, it's more appropriate to parallelize your processors instead of increasing vector size. Ex: instead of having 32x (Compute Units) 64x (Threads per CU) 32-bit, you should have 64x (Compute Units) 32x (Threads per CU) 32-bit compute units.

The smaller size (32x instead of 64x) makes thread-divergence easier to handle.

--------

AMD Vega64 was logically 64x wide (2048-bits), although it was physically a 16x wide processor (the 16x cores per vALU would repeat themselves for 4 clock cycles. Logically 64x cores, but physically only 16x cores).

By switching to NAVI 32x wide instead, efficiency went up but overall TFlops went down. The AMD 5700 XT is 40x 2x32x 32-bit in organization (40x compute units, 2x 32x SIMDs per compute unit, x32 bits each). Total of 2560 cores.

Vega64 was 64x 4x16x 32-bit in organization, for a total of 4096 SIMD cores.

Vega64 and 5700 XT are roughly the same speed in practice, despite the 5700 XT having only 62% of the cores and fewer TFlops than the Vega64. I guess the CryptoCoin miners preferred the ol' Vega64, but in practice, it's easier to write efficient programs for a narrower SIMD unit.
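
The arithmetic behind those figures, spelled out in Python using only the organization numbers quoted above (the variable names are just for illustration):

```python
# compute units * SIMDs per CU * lanes per SIMD
vega64_lanes = 64 * 4 * 16
navi_5700xt_lanes = 40 * 2 * 32

print(vega64_lanes)                      # 4096
print(navi_5700xt_lanes)                 # 2560
print(navi_5700xt_lanes / vega64_lanes)  # 0.625, i.e. the "62%" above

# Logical vector width in bits: lanes per wavefront * 32 bits per lane
print(64 * 32)  # 2048
print(32 * 32)  # 1024
```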

DmitryOlshansky · on Dec 9, 2019

512 / 8 = 64 bytes which is a cache line size in Intel’s (and AMD’s) CPUs.

acoye · on Dec 9, 2019

I kind of want to run Doom on this too: 167i-k Vi *QVk reciprocal square root approximation

ygra · on Dec 9, 2019

Doom doesn't have arbitrary normals; wasn't that from Quake?

smcl · on Dec 9, 2019

Quake III Arena, iirc

acoye · on Dec 9, 2019

You are correct. I was running low on caffeine. https://en.wikipedia.org/wiki/Fast_inverse_square_root
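
For reference, the trick described on that Wikipedia page can be sketched in Python roughly as follows. This is a transliteration of the well-known 32-bit float bit hack using `struct`, not the Quake III source and not what the Cray instruction does internally; the function name is my own.

```python
import struct

def fast_inverse_sqrt(x):
    """Approximate 1/sqrt(x) for a positive float using 32-bit float bit tricks."""
    # Reinterpret the float's bits as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic constant minus half the bit pattern gives a first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits back as a float.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to refine the guess.
    return y * (1.5 - 0.5 * x * y * y)

approx = fast_inverse_sqrt(4.0)
print(approx)  # close to 0.5, within about 1%
```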

codezero · on Dec 9, 2019

Really looking forward to someone who can comment on this to break it down for those of us who want to know but can't even. :)

dragontamer · on Dec 9, 2019

The early 80s and 90s "vector supercomputers" serve as the basis of modern GPUs. A GPU-programmer can immediately see the similarity of the assembly language here, with modern GPU-assembly languages (AMD RDNA GPUs or NVidia Turing GPUs).

Just as a modern programmer learns about DEC PDP-11 and its influence on the C-programming language, a modern GPU programmer could look at these Cray-notes and learn about the influence of that machine onto the modern GPU.

------------

The SIMD principles on this Cray have found their way to normal CPUs, in the form of AVX or AVX-512 instructions.

acoye · on Dec 9, 2019

Hey look, it had the NSA's magic instruction: 106ij0 Si PSj population count