Yes. If you're a computer science fundamentalist like pixelpoet who wants to sti...

bayindirh · on Sept 15, 2020

If we're going to stick to literal definitions of the things in computer science, we should get the forks out and march to the sotrage manufacturers' HQs to force them to accept 1kB is actually 1024 bytes, not 1000.

From my PoV, any functionally complete computational unit (in the context of the device) can be called a core. Should we say that GPUs has no core because they don't support 3D Now! or SSE or AES instructions?

Consider an FPGA. I can program it to have many small cores which can do a small set of operations or I can program it to be single but more capable core. Which one is a real core then?

paulmd · on Sept 15, 2020

a CU is not a computationally complete unit, it has no independent thread of execution, all cores in a warp follow the same execution path (indeed all execution paths, even if not applicable to that CU).

That is in contrast to your FPGA example. Either one big core or many small cores can all execute their own threads.

It's not that simple either, you've got stuff like SMT and CMT where you have multiple threads executing on a single set of execution resources - but CUs are clearly on the line of "not a self-contained core".

bayindirh · on Sept 15, 2020

Thanks for the clarification. I've read nVidia's architecture back in the day but, it seems I'm pretty rusty on the GPU front.

Is it possible for you to point me to the right direction so, I can read how these things work and bring myself up to speed?

SMT is more like cramming two threads into a core and hoping they don't compete for the same ports/resources in the core. CMT is well... we've seen how that went.

paulmd · on Sept 16, 2020

https://en.wikipedia.org/wiki/Single_instruction,_multiple_t...

https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?med...

The short of it is that GP is right, SIMT is more or less "syntactic sugar" that provides a convenient programming model on top of SIMD. You have a "processor thread" that runs one instruction on an AVX unit with 32 lanes. What they are calling a "CUDA core" or a "thread" is analogous to an AVX lane, the software thread is called a "warp" and is executed using SMT on the actual processor core (the "SM" or "Streaming Multiprocessor"). The SM is designed with a lot of SMT threads (warps) being able to be living on the processor at once (potentially dozens per core), being put to sleep when they need to do long-term data accesses, and then it swaps to some other warp to process while it waits. This covers for the very long latency of GDDR memory accesses.

The distinction between SIMT and SIMD is that basically instead of writing instructions for the high-level AVX unit itself, you write instructions for what you want the AVX lane to be doing and the warp will map that into a control flow for the processor. It's more or less like a pixel shader type language - since that's what it was originally designed for.

In other words, under AVX you would load some data into the registers, then run an AVX mul. Maybe a gather, AVX mul, and then a store.

In SIMT, you would write: outputArr[threadIdx] = a[threadIdx] * b[threadIdx]; or perhaps otherLocalVar = a[threadIdx] * threadLocalVar; The compiler then maps that into loads and stores and allocates registers and schedules ALU operations for you. And of course like any "auto-generator" type thing this is a leaky abstraction, it behooves the programmer to understand the behavior of the underlying processor, since it will faithfully generate code with suboptimal performance.

In particular, in order to handle control flow, basically any time you have a code branch ("if/else" statement, etc), the thread will poll all its lanes. If they all go one way it's good, but if you have them go both ways then it has to run both sides, so it takes twice as long. The warp will turn off the lanes that took branch B (so they just run NOPs) and then it will run Branch A for the first set of the cores. Then it turns off the first set of cores and runs Branch B. This is an artifact of the way the processor is built - it is one thread with an AVX unit, each "CUDA core" has no independent control, it is just an AVX lane. So if you have say 8 different ways through a block of code, and all 8 conditions exist in a given warp, then you have to run it 8 times, reducing your performance to 1/8th. Or potentially exponentially more if there is further branching in subfunctions/etc.

(obviously in some cases you can structure your code so that branching is avoided - for example replacing "if" statements with multiplication by a value, and you just multiply-by-1 the elements where "if" is false, or whatever. But in others you can't avoid branching, and regardless you have to manually provide such optimizations yourself in most cases.)

AMD broadly works the same way but they have their own marketing names, the AVX lane is a "Stream Processor", the "warp" is a "wavefront", and so on.

AVX-512 actually introduces this programming model to the CPU side, where it is called "Opmask Registers". Same idea, there is a flag bit for each lane that you can use to set which lanes an operation will apply to, then you run some control flow on it.

https://en.wikipedia.org/wiki/AVX-512#Opmask_registers

t-writescode · on Sept 16, 2020

1 kB is 1000 bytes, though.

1 kiB, on the other hand, is 1024 :)

bayindirh · on Sept 16, 2020

1kiB (kibibyte) is standardized as 1024 bytes in 1998. Before that it was kilobyte which was used for 1024 bytes [0].

> Prior to the definition of the binary prefixes, the kilobyte generally represented 1024 bytes in most fields of computer science, but was sometimes used to mean exactly one thousand bytes. When describing random access memory, it typically meant 1024 bytes, but when describing disk drive storage, it meant 1000bytes. The errors associated with this ambiguity are relatively small (2.4%).

[0]: https://en.wikipedia.org/wiki/Kibibyte