They're giving the lie to the claim that you need bleeding-edge hardware for performance.
5-year-old silicon (14 nm!!) and no HBM.
Their secret sauce seems to be an ahead-of-time compiler that statically lays out the entire computation, enabling zero contention at runtime. Basically, they stamp out all non-determinism.
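To make that concrete, here's a minimal sketch of what "statically laying out the computation" means, with made-up op names and latencies (an illustration of the idea, not Groq's actual compiler): the compiler assigns every op a fixed issue cycle up front, so the hardware just replays a timetable with no runtime arbitration.

    # Toy illustration of fully static scheduling: every op gets a fixed
    # issue cycle at compile time, so there is no runtime arbitration.
    # Op names and latencies are made up; this is not Groq's IR.

    from dataclasses import dataclass

    @dataclass
    class Op:
        name: str
        deps: list     # names of ops whose results this op consumes
        latency: int   # cycles until the result is available

    def schedule(ops):
        """Greedy list scheduler: place each op at the earliest cycle where
        all of its inputs are ready. The result is a complete, deterministic
        timetable -- the 'program' the hardware replays verbatim."""
        ready_at = {}     # op name -> cycle its result becomes available
        timetable = []    # (issue_cycle, op name)
        for op in ops:    # assume ops arrive in dependency order
            issue = max((ready_at[d] for d in op.deps), default=0)
            ready_at[op.name] = issue + op.latency
            timetable.append((issue, op.name))
        return timetable

    prog = [
        Op("load_w", [], 4),
        Op("load_x", [], 4),
        Op("matmul", ["load_w", "load_x"], 8),
        Op("bias",   ["matmul"], 1),
        Op("store",  ["bias"], 4),
    ]

    for cycle, name in schedule(prog):
        print(f"cycle {cycle:3d}: issue {name}")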
It's not really a lie though. They require 20x more chips than Nvidia GPUs (storing all the weights in SRAM instead of HBM is expensive!) for a ~2x speed increase. Overall, power costs end up higher for Groq than for GPUs.
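Back-of-envelope version of that tradeoff, plugging in the 20x/2x figures above plus assumed round numbers (the 230 MB per-chip SRAM capacity and the 8-bit weight format are assumptions for illustration, not measured data):

    # Rough arithmetic behind the "20x more chips for ~2x speed" point.

    params        = 70e9     # e.g. a 70B-parameter model
    bytes_per_w   = 1        # assume 8-bit weights
    sram_per_chip = 230e6    # assumed on-chip SRAM per accelerator, bytes

    chips_for_weights = params * bytes_per_w / sram_per_chip
    print(f"chips just to hold the weights: ~{chips_for_weights:.0f}")

    # Per-chip efficiency vs a GPU baseline, taking the comment's numbers:
    speedup    = 2.0         # ~2x faster end-to-end
    chip_ratio = 20.0        # ~20x more chips
    print(f"throughput per chip vs GPU: ~{speedup / chip_ratio:.2f}x")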
Cost-effective in what sense? Groq doesn't achieve high efficiency, only low latency, and that isn't done in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and at higher precision.
Most important, even ignoring latency, is throughput (tokens) per $$$. And according to their own benchmark [1] (famous last words :)), they're quite cost-efficient.
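For concreteness, one way to frame "tokens per dollar" (a sketch with placeholder parameters, not numbers from the benchmark in [1]):

    # Tokens per dollar, counting amortized hardware plus power only.
    # All inputs are placeholders for illustration.
    def tokens_per_dollar(tokens_per_sec, system_price, amortize_years,
                          watts, price_per_kwh):
        hours = amortize_years * 365 * 24
        capex_per_hr = system_price / hours          # straight-line amortization
        opex_per_hr  = watts / 1000 * price_per_kwh  # electricity
        return tokens_per_sec * 3600 / (capex_per_hr + opex_per_hr)

    # e.g. tokens_per_dollar(500, 20_000, 3, 400, 0.10)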
From skimming the ISCA paper linked below, it seems like they accepted that it's extremely difficult (maybe impossible) to generate high ILP from a VLIW compiler on *complex hardware* (what Itanium tried to do).
So they attacked the italicized portion and simplified the hardware, mostly by eliminating memory-layer non-determinism and making time-synchronized global memory instructions part of the ISA(?).
This apparently reduced the difficulty of the compiler problem to something manageable (but no doubt still "fun")... and voilà, performance.
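A toy rendering of what deterministic memory timing buys you (the latencies and op names here are invented): if an SRAM read always takes exactly N cycles, the compiler can hard-code the cycle on which the consumer fires instead of relying on runtime ready/valid handshaking or tags.

    # Sketch of the "time-synchronized memory" idea: with a fixed, known
    # read latency, the schedule bakes in when the consumer fires.
    # READ_LATENCY and the op strings are invented for illustration.

    READ_LATENCY = 5

    program = [
        (0, "sram_read  addr=0x10 -> stream0"),
        # The compiler, not the hardware, decides when the consumer fires:
        # issue cycle of the read + its fixed latency.
        (0 + READ_LATENCY, "matmul     stream0 -> acc"),
        (0 + READ_LATENCY + 3, "sram_write acc -> addr=0x40"),
    ]

    for cycle in range(12):
        for c, text in program:
            if c == cycle:
                print(f"cycle {cycle:2d}: {text}")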
The problem set Groq is restricted to (known-size tensor manipulation) lends itself to much easier solutions than full-blown ILP. The general problem of compiling arbitrary Turing-complete algorithms to the architecture is NP-hard. Tensor manipulation... that's a different story.
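A concrete way to see why known sizes help (toy example, not Groq's compiler): with shapes fixed at compile time, a matmul unrolls into a finite, branch-free list of multiply-accumulates whose dependence graph is fully known up front, so scheduling never has to reason about data-dependent control flow.

    # With tensor shapes known at compile time, a matmul unrolls into a
    # finite, branch-free list of multiply-accumulates -- a DAG the
    # scheduler can lay out exactly. Real compilers tile rather than
    # fully unroll, but the principle is the same.

    M, K, N = 2, 3, 2      # known at compile time

    instrs = []
    for i in range(M):
        for j in range(N):
            instrs.append(f"zero   C[{i}][{j}]")
            for k in range(K):
                instrs.append(f"madd   C[{i}][{j}] += A[{i}][{k}] * B[{k}][{j}]")

    print(f"{len(instrs)} straight-line instructions, no branches:")
    for ins in instrs:
        print("  " + ins)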
I wonder if the use of eDRAM (https://en.wikipedia.org/wiki/EDRAM), which is essentially DRAM embedded into a chip made on a logic process, would be a good idea here.
eDRAM is essentially a tradeoff between SRAM and DRAM, offering much greater density than SRAM at the cost of somewhat worse throughput and latency.
There were a couple of POWER CPUs that used eDRAM as L3 cache, but it seems to have fallen out of favor.
https://wow.groq.com/isca-2022-paper