
They're giving the lie to the claim that you need bleeding-edge hardware for performance.

5-year-old silicon (14 nm!!) and no HBM.

Their secret sauce seems to be an ahead-of-time compiler that statically lays out the entire computation, enabling zero contention at runtime. Basically, they stamp out all non-determinism.

https://wow.groq.com/isca-2022-paper
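
A minimal sketch of what "statically lays out the entire computation" could look like (my own illustration, not Groq's actual toolchain; the unit names only loosely echo the paper's functional units): the compiler assigns every op a fixed unit and start cycle before the program ever runs, so the runtime just replays the plan with no queues, arbitration, or locks.

    # Hypothetical ahead-of-time scheduler: every op gets a fixed (unit, cycle) slot.
    def schedule(ops, unit_latency):
        """ops: topologically sorted list of (name, unit, deps)."""
        plan = {}        # name -> (unit, start_cycle)
        unit_free = {}   # next free cycle per functional unit
        for name, unit, deps in ops:
            ready = max((plan[d][1] + unit_latency[plan[d][0]] for d in deps), default=0)
            start = max(ready, unit_free.get(unit, 0))
            plan[name] = (unit, start)
            unit_free[unit] = start + unit_latency[unit]
        return plan

    plan = schedule(
        ops=[("load_w", "mem", []), ("matmul", "mxm", ["load_w"]), ("act", "vxm", ["matmul"])],
        unit_latency={"mem": 4, "mxm": 10, "vxm": 2},
    )
    print(plan)  # {'load_w': ('mem', 0), 'matmul': ('mxm', 4), 'act': ('vxm', 14)}

Because every latency is known and fixed, the schedule effectively is the program: there's nothing left to arbitrate at runtime.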




It's not really a lie though. They require 20x more chips than Nvidia GPUs (storing all the weights in SRAM instead of HBM is expensive!) for a ~2x speed increase. Overall, power costs end up higher for Groq than for GPUs.
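
Just working through those numbers (20x chips for ~2x speed, per the parent; revised to 200x in the edit below, and disputed elsewhere in the thread), the power question comes down to per-chip draw:

    # Back-of-envelope only; 20x/2x are the parent's figures, 700 W is a
    # placeholder for a modern datacenter GPU.
    chips_ratio, speed_ratio = 20, 2
    throughput_per_chip_vs_gpu = speed_ratio / chips_ratio        # 0.1x a GPU per chip
    gpu_watts = 700
    breakeven_watts_per_chip = gpu_watts * throughput_per_chip_vs_gpu
    print(breakeven_watts_per_chip)  # 70.0 -> each chip must stay under ~70 W to match GPU energy/token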


Are power and 20x the 14 nm chip capacity actually limiting factors right now?

It's not inconceivable that that's a better trade-off than requiring a leading node and HBM.


Edit: 200x more chips, not 20x.


Source? Or how do you know that?


This was discussed extensively in previous threads, e.g. https://news.ycombinator.com/item?id=39431989


No HBM because they use tons of fast SRAM instead. Isn't that the main driver for performance here?

(the way I understood it, it's still cost-effective at scale because of the throughput increase this brings)
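
Rough intuition for why memory bandwidth is the lever (illustrative numbers only, not vendor specs): batch-1 autoregressive decode streams essentially every weight once per token, so per-user tokens/s is capped at roughly aggregate bandwidth divided by model size in bytes.

    # Illustrative ceilings only -- both bandwidth figures are ballpark placeholders.
    model_bytes = 70e9 * 2   # hypothetical 70B-parameter model at 2 bytes/param

    for label, bw in [("single GPU on HBM (~3 TB/s)", 3e12),
                      ("many-chip SRAM fabric (~100 TB/s aggregate)", 100e12)]:
        print(f"{label}: <= {bw / model_bytes:.0f} tokens/s per user")

Caveat: with large batches a GPU amortizes the weight reads across many users, which is why the cost-per-token question is separate from the latency one.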


> No HBM because they use tons of fast SRAM instead. Isn't that the main driver for performance here?

No doubt fast SRAM helps, but from a computation POV, IMHO it's that they've statically planned the computation and eliminated all locks.

Short explainer here: https://www.youtube.com/watch?v=H77tV1KcWIE (Based on their paper).


Thanks for putting this together! Will give it a watch now


Cost-effective in what sense? Groq doesn't achieve high efficiency, only low latency, and that's not done in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and with higher precision.


The # of chips is not the most important metric.

Most important, even ignoring latency, is throughput (tokens) per $$$. And according to their own benchmark [1] (famous last words :)), they're quite cost efficient.

[1] https://www.semianalysis.com/p/groq-inference-tokenomics-spe...
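
For anyone who wants to sanity-check that kind of claim, the metric is easy to compute once you pick inputs (everything here is a placeholder, not a figure from the linked benchmark):

    # Amortize hardware over its service life, add power, divide by tokens served.
    def cost_per_million_tokens(chips, chip_price, lifetime_s, watts_per_chip,
                                price_per_kwh, tokens_per_s):
        capex_per_s = chips * chip_price / lifetime_s
        power_per_s = chips * watts_per_chip / 1000 * price_per_kwh / 3600
        return (capex_per_s + power_per_s) / tokens_per_s * 1e6

    # Made-up example inputs:
    print(cost_per_million_tokens(chips=8, chip_price=20_000, lifetime_s=4*365*24*3600,
                                  watts_per_chip=400, price_per_kwh=0.10,
                                  tokens_per_s=2_000))   # ~0.68 ($ per million tokens)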


So Itanium and its "sufficiently smart compiler" but functional?


From skimming the link above, it seems like they accepted that it's extremely difficult (maybe impossible) to generate high ILP from a VLIW compiler on *complex hardware* (what Itanium tried to do).

So they attacked the italicized portion and simplified the hardware. Mostly by eliminating memory-layer non-determinism / using time-sync'd global memory instructions as part of the ISA(?).

This apparently reduced the difficulty of the compiler problem to something manageable (but no doubt still "fun")... and voila, performance.
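
To make the "manageable" part concrete: once every latency is a compile-time constant and there are no caches or dynamic hazards, packing ops into VLIW-style bundles is mostly bookkeeping. Toy sketch (mine, not from the paper):

    ISSUE_WIDTH = 2
    LATENCY = {"add": 1, "mul": 3, "load": 4}   # fixed, known latencies

    def pack_bundles(ops):
        """ops: list of (name, kind, deps) -> list of per-cycle bundles."""
        done_at, bundles, pending, cycle = {}, [], list(ops), 0
        while pending:
            bundle = []
            for op in list(pending):
                name, kind, deps = op
                if len(bundle) < ISSUE_WIDTH and all(done_at.get(d, 10**9) <= cycle for d in deps):
                    bundle.append(name)
                    done_at[name] = cycle + LATENCY[kind]
                    pending.remove(op)
            bundles.append(bundle)
            cycle += 1
        return bundles

    print(pack_bundles([("a", "load", []), ("b", "load", []),
                        ("c", "mul", ["a", "b"]), ("d", "add", ["c"])]))
    # [['a', 'b'], [], [], [], ['c'], [], [], ['d']]

What broke this on Itanium-class hardware is that a load's latency wasn't a constant (cache hit vs. miss), so the compiler had to guess; a cache-less, fully deterministic memory system removes the guessing.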


The problem set Groq is restricted to (known-size tensor manipulation) lends itself to much easier solutions than full-blown ILP. The general problem of compiling arbitrary Turing-complete algorithms to the arch is NP-hard. Tensor manipulation... that's a different story.
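
One way to see the difference (purely illustrative): with shapes fixed at compile time, the entire iteration space of an op is enumerable before the program runs, so cycle counts and memory addresses are too. A data-dependent loop has no such enumeration.

    M, K, N = 4, 8, 4   # shapes known at compile time

    # Every multiply-accumulate of an MxK by KxN matmul can be listed (and
    # assigned to hardware) ahead of time:
    mac_schedule = [(i, k, j) for i in range(M) for k in range(K) for j in range(N)]
    print(len(mac_schedule), "MACs known exactly before runtime")   # 128

    # Contrast: `while not converged: ...` -- the trip count depends on data,
    # so no static schedule of this kind exists.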


I wonder if using eDRAM (https://en.wikipedia.org/wiki/EDRAM), which essentially embeds DRAM into a chip made on a logic process, would be a good idea here.

eDRAM is essentially a middle ground between SRAM and DRAM, offering much greater density than SRAM at the cost of somewhat worse throughput and latency.

There were a couple of POWER CPUs that used eDRAM as L3 cache, but it seems to have fallen out of favor.


It fell out of favor because it lost the density advantage in newer processes.



