In a sense, GPUs are only great at matrix-matrix multiplication. For anything else you would only get ~7% of the FLOP/s compared to it (989 vs 67 TFLOP/s for an H100)[1].
Funnily, they're far from being optimal for GEMM ops (especially in terms of power consumption).
For GEMM you need to visit each row/vector n times, so there's a bunch of data reuse going on, which GPUs don't exploit optimally since you can't keep all of that data that close to your processing units. And while the tensor cores kinda implement this, I think they don't quite scale up to a full-sized systolic array, which is what you would want for larger matrix multiplications.
Also just a simpler view: with GPUs, most of the silicon is spent on things that are NOT tensor cores, so just from that you know it's not optimal, I guess.
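To put a rough number on the data-reuse point, here's a back-of-envelope sketch (not tied to any particular GPU) of how the FLOPs-per-byte of an n×n GEMM grows with n:

```python
# Back-of-envelope arithmetic intensity of an n x n fp32 GEMM.
# FLOPs grow as 2*n^3 while the data touched grows as 3*n^2 values,
# so the reuse per value loaded scales roughly linearly with n.
def gemm_intensity(n, bytes_per_elem=4):
    flops = 2 * n**3                          # one multiply + one add per inner-product step
    bytes_moved = 3 * n**2 * bytes_per_elem   # read A, read B, write C (ignoring re-fetches)
    return flops / bytes_moved                # FLOPs per byte

for n in (128, 1024, 8192):
    print(n, round(gemm_intensity(n), 1))
```

That growing reuse is exactly what a systolic array (or a big enough on-chip tile) is built to capture; the question is how much of it the hardware can actually keep close to the ALUs.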
Just referring to that FLOP/s number doesn't really mean much nowadays with tensor-cores and sparsity.
In my eyes the big win of GPUs is that not only are they pretty good at GEMMs, they're also really good at a lot of other easily parallelizable tasks, PLUS they're comparatively easy to program ^^
That link says "* With sparsity". For extremely sparse matrices you can get more than 989 TFLOP/s on a CPU, if we're counting elided operations as FLOPs.
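Just to make "counting elided operations" concrete, a quick illustrative sketch (all numbers here are hypothetical, picked only to show how the accounting inflates):

```python
# "Effective" FLOP/s when skipped (zero) multiply-adds are still counted,
# which is how sparsity-adjusted peak numbers are usually quoted.
n = 1_000_000          # hypothetical n x n matrix
density = 1e-6         # one nonzero per million entries -> extremely sparse
nnz = int(n * n * density)

dense_flops = 2 * n * n    # what a dense matvec would nominally cost
actual_flops = 2 * nnz     # what a sparse matvec really executes

time_s = 1e-3              # hypothetical CPU time for the sparse matvec
print(f"effective: {dense_flops / time_s / 1e12:.0f} TFLOP/s")   # ~2000 "TFLOP/s"
print(f"actual:    {actual_flops / time_s / 1e12:.4f} TFLOP/s")  # ~0.002 TFLOP/s
```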
That change checks out then. They didn't see much need for FP16 outside of that, so they no longer run it at double the FP32 rate outside of tensor cores (unless I'm mixing that up with AMD).
Other forms of sparsity are heavily used at training time now, like block compression in Deepseek.
lol, I hadn't thought about it like that, true. Though of course, I mean compared to CPUs :P
I try and use tensor cores for non-obvious things every now and then. The most promising so far seems to be for linear arithmetic in Datalog, but that's just matrix-vector/gemv
This was just a brief moment of thought over a year ago, but I can try to summarize.
I was thinking about how to unify variables in certain simple Datalog settings. If we think of a clause as a vector of variables, then simple unifications look like just a gather operation. A gather can be thought of as a matrix-vector multiplication, but that's not really useful (performance-wise). But if those variables are also part of a linear equation, then it becomes possibly useful, e.g. for something like `P(x, y) :- E(x, 3x+4y)`
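A tiny sketch of how I read that (my own toy example, not anything from an actual Datalog engine): the clause's variables are a vector, a gather is a 0/1 selection matrix, and the linear term is just another row of that matrix.

```python
import numpy as np

# Variables of the clause as a vector v = [x, y]; hypothetical bindings x=5, y=2.
v = np.array([5.0, 2.0])

# Row 0 is a plain gather of x (a 0/1 selection row);
# row 1 encodes the linear term 3x + 4y from `P(x, y) :- E(x, 3x+4y)`.
M = np.array([
    [1.0, 0.0],   # x
    [3.0, 4.0],   # 3x + 4y
])

print(M @ v)      # -> [ 5. 23.], i.e. the tuple to look up in E
```

Stacking many candidate bindings as columns of a matrix would turn this gemv into a GEMM, which I assume is where the tensor cores would start to pay off.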
I didn't get that at all - to me this looks like a very smart investigation into instruction latency and the precise mechanics of out-of-order execution (no reference made to speculative or branch prediction, though?) without looking at what the instructions do in detail.
GPUs can certainly do bulk integer arithmetic but most use cases prefer FP. Maybe for DSP fixed-point is ideal.
It’s well known that GPUs are good at cryptography: starting with hash functions (e.g. crypto mining), but also zero-knowledge proofs and multi-party computation.
I think you are confusing uniform registers with the uniform keyword in RSL / GLSL / HLSL?
Maybe some vendors have had an equivalent to uniform registers for 20 years, but per the article's references they are new to Nvidia GPUs in Turing (2018).
They are the same thing. The uniform keyword in shading languages is implemented using the uniform registers.
I don't know what Nvidia did in 2018, maybe they opened up access to the uniform registers to CUDA code.
I made Grok research this topic:
> In conclusion, research strongly suggests that the "uniform" keyword in GLSL is implemented in hardware using NVIDIA's "uniform registers," as evidenced by NVIDIA's own documentation on the Turing architecture and historical practices of mapping uniforms to constant registers. While explicit links can be limited due to proprietary details, the combination of technical presentations, community discussions, and historical context supports this connection. The uniform register file, with its capacity and usage in shader instructions, aligns with GLSL's uniform functionality, ensuring efficient data access during shader execution.
While you can use uniform registers to implement the uniform keyword from shading languages, the two are not the same. Uniform registers are not constants, and they are only uniform/shared across one warp. Nvidia architectures before Turing did not have uniform registers.
Edit: learned a bunch, but the "uniform" registers and 64-bit (memory) performance are some easy standouts.