They're giving the lie to the claim that you need bleeding-edge hardware for performance.
5-year-old silicon (14 nm!!) and no HBM.
Their secret sauce seems to be an ahead-of-time compiler that statically lays out the entire computation, enabling zero contention at runtime. Basically, they stamp out all non-determinism.
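To make that concrete, here's a minimal sketch of what "statically laying out the computation" means, with made-up op names and latencies (an illustration of the idea, not Groq's actual compiler): the compiler assigns every op a fixed issue cycle up front, so the hardware just replays a timetable with no runtime arbitration.

    # Toy illustration of fully static scheduling: every op gets a fixed
    # issue cycle at compile time, so there is no runtime arbitration.
    # Op names and latencies are made up; this is not Groq's IR.

    from dataclasses import dataclass

    @dataclass
    class Op:
        name: str
        deps: list     # names of ops whose results this op consumes
        latency: int   # cycles until the result is available

    def schedule(ops):
        """Greedy list scheduler: place each op at the earliest cycle where
        all of its inputs are ready. The result is a complete, deterministic
        timetable -- the 'program' the hardware replays verbatim."""
        ready_at = {}     # op name -> cycle its result becomes available
        timetable = []    # (issue_cycle, op name)
        for op in ops:    # assume ops arrive in dependency order
            issue = max((ready_at[d] for d in op.deps), default=0)
            ready_at[op.name] = issue + op.latency
            timetable.append((issue, op.name))
        return timetable

    prog = [
        Op("load_w", [], 4),
        Op("load_x", [], 4),
        Op("matmul", ["load_w", "load_x"], 8),
        Op("bias",   ["matmul"], 1),
        Op("store",  ["bias"], 4),
    ]

    for cycle, name in schedule(prog):
        print(f"cycle {cycle:3d}: issue {name}")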
It's not really a lie though. They require 20x more chips than Nvidia GPUs (storing all the weights in SRAM instead of HBM is expensive!) for a ~2x speed increase. Overall, power costs end up higher for Groq than for GPUs.
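Back-of-envelope version of that tradeoff, plugging in the 20x/2x figures above plus assumed round numbers (the 230 MB per-chip SRAM capacity and the 8-bit weight format are assumptions for illustration, not measured data):

    # Rough arithmetic behind the "20x more chips for ~2x speed" point.

    params        = 70e9     # e.g. a 70B-parameter model
    bytes_per_w   = 1        # assume 8-bit weights
    sram_per_chip = 230e6    # assumed on-chip SRAM per accelerator, bytes

    chips_for_weights = params * bytes_per_w / sram_per_chip
    print(f"chips just to hold the weights: ~{chips_for_weights:.0f}")

    # Per-chip efficiency vs a GPU baseline, taking the comment's numbers:
    speedup    = 2.0         # ~2x faster end-to-end
    chip_ratio = 20.0        # ~20x more chips
    print(f"throughput per chip vs GPU: ~{speedup / chip_ratio:.2f}x")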
Cost-effective in what sense? Groq doesn't achieve high efficiency, only low latency, and that isn't done in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and at higher precision.
Most important, even ignoring latency, is throughput (tokens) per $$$. And according to their own benchmark [1] (famous last words :)), they're quite cost-efficient.
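For concreteness, one way to frame "tokens per dollar" (a sketch with placeholder parameters, not numbers from the benchmark in [1]):

    # Tokens per dollar, counting amortized hardware plus power only.
    # All inputs are placeholders for illustration.
    def tokens_per_dollar(tokens_per_sec, system_price, amortize_years,
                          watts, price_per_kwh):
        hours = amortize_years * 365 * 24
        capex_per_hr = system_price / hours          # straight-line amortization
        opex_per_hr  = watts / 1000 * price_per_kwh  # electricity
        return tokens_per_sec * 3600 / (capex_per_hr + opex_per_hr)

    # e.g. tokens_per_dollar(500, 20_000, 3, 400, 0.10)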
From skimming the ISCA paper linked below, it seems like they accepted that it's extremely difficult (maybe impossible) to generate high ILP from a VLIW compiler on *complex hardware* (what Itanium tried to do).
So they attacked the italicized portion and simplified the hardware, mostly by eliminating memory-layer non-determinism and making time-synchronized global memory instructions part of the ISA(?).
This apparently reduced the difficulty of the compiler problem to something manageable (but no doubt still "fun")... and voilà, performance.
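A toy rendering of what deterministic memory timing buys you (the latencies and op names here are invented): if an SRAM read always takes exactly N cycles, the compiler can hard-code the cycle on which the consumer fires instead of relying on runtime ready/valid handshaking or tags.

    # Sketch of the "time-synchronized memory" idea: with a fixed, known
    # read latency, the schedule bakes in when the consumer fires.
    # READ_LATENCY and the op strings are invented for illustration.

    READ_LATENCY = 5

    program = [
        (0, "sram_read  addr=0x10 -> stream0"),
        # The compiler, not the hardware, decides when the consumer fires:
        # issue cycle of the read + its fixed latency.
        (0 + READ_LATENCY, "matmul     stream0 -> acc"),
        (0 + READ_LATENCY + 3, "sram_write acc -> addr=0x40"),
    ]

    for cycle in range(12):
        for c, text in program:
            if c == cycle:
                print(f"cycle {cycle:2d}: {text}")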
The problem set Groq is restricted to (known-size tensor manipulation) lends itself to much easier solutions than full-blown ILP. The general problem of compiling arbitrary Turing-complete algorithms to the architecture is NP-hard. Tensor manipulation... that's a different story.
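A concrete way to see why known sizes help (toy example, not Groq's compiler): with shapes fixed at compile time, a matmul unrolls into a finite, branch-free list of multiply-accumulates whose dependence graph is fully known up front, so scheduling never has to reason about data-dependent control flow.

    # With tensor shapes known at compile time, a matmul unrolls into a
    # finite, branch-free list of multiply-accumulates -- a DAG the
    # scheduler can lay out exactly. Real compilers tile rather than
    # fully unroll, but the principle is the same.

    M, K, N = 2, 3, 2      # known at compile time

    instrs = []
    for i in range(M):
        for j in range(N):
            instrs.append(f"zero   C[{i}][{j}]")
            for k in range(K):
                instrs.append(f"madd   C[{i}][{j}] += A[{i}][{k}] * B[{k}][{j}]")

    print(f"{len(instrs)} straight-line instructions, no branches:")
    for ins in instrs:
        print("  " + ins)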
I wonder if the use of eDRAM (https://en.wikipedia.org/wiki/EDRAM), which is essentially DRAM embedded into a chip made on a logic process, would be a good idea here.
eDRAM is essentially a tradeoff between SRAM and DRAM, offering much greater density than SRAM at the cost of somewhat worse throughput and latency.
There were a couple of POWER CPUs that used eDRAM as L3 cache, but it seems to have fallen out of favor.
https://wow.groq.com/isca-2022-paper