Hacker News

In my experience, the 20x performance number is after taking the DSPs into account.


Just looking at raw FLOPs: the 7nm Xilinx Versal series tops out at 8 TFlops of 32-bit compute (DSP cores only), plus whatever the CPU cores and LUTs can do (but I assume the CPU cores are for management, and the LUTs are for routing rather than dense compute).

In contrast: the NVidia A100 has 19 32-bit TFlops. Higher than the Xilinx chip, but the Xilinx chip is still within an order of magnitude, and still has the benefit of the LUTs.

-----

It should be noted that the Xilinx Versal "AI Engine" is a VLIW SIMD architecture: https://www.xilinx.com/support/documentation/white_papers/wp..., effectively a GPU-like ASIC hardwired into the FPGA.


Raw FLOPs are completely misleading, which is why Nvidia focuses on them as a metric. The GPU can't keep those ops busy, particularly during inference, when most of the data is fresh and caches don't help. That's the roofline model.

In my experience FPGAs beat GPUs for inference, if you have people who can implement good FPGA designs. And inference is more common than training. Much of this comes down to explicit memory management and the larger on-chip memory of FPGAs.
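A minimal sketch of the roofline argument above: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. The A100 peak is from the thread; the bandwidth and intensity figures are illustrative assumptions, not measurements.

```python
def roofline_tflops(peak_tflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable TFLOPs under the roofline model: capped by either
    peak compute or (bandwidth x arithmetic intensity)."""
    # GB/s * FLOPs/byte = GFLOP/s; divide by 1000 for TFLOPs.
    memory_bound = bandwidth_gbs * intensity_flops_per_byte / 1000
    return min(peak_tflops, memory_bound)

# A100-like: 19 FP32 TFLOPs peak, ~1555 GB/s HBM2 bandwidth (assumed).
# Inference with little data reuse has low arithmetic intensity, so the
# chip is bandwidth-bound and peak FLOPs never materialize.
print(roofline_tflops(19.0, 1555.0, 2.0))   # low reuse: ~3.1 TFLOPs attainable
print(roofline_tflops(19.0, 1555.0, 50.0))  # high reuse: compute-bound at 19.0
```

With low reuse the GPU delivers a small fraction of its headline number, which is why comparing raw peaks can mislead.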


Well, my primary point is that the earlier assertion, "GPUs are 20x faster than FPGAs", is nowhere close to the theoretical numbers, let alone reality.

ASICs (in this case: a fully dedicated GPU) obviously win in the situations they were designed for. The A100, and other GPU designs, will probably have higher FLOPs than any FPGA made on the 7nm node.

But not a "lot" more FLOPs, and the additional flexibility of an FPGA could really help in some problems. It really depends on what you're trying to do.

------

At best, a top-of-the-line 7nm GPU has ~2x more FLOPs than a top-of-the-line 7nm FPGA in today's environment. In reality, it all comes down to how the software was written (and FPGAs could absolutely win in the right situation).
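The arithmetic behind the ~2x figure, using only the peak numbers quoted upthread:

```python
# Peak FP32 figures cited earlier in the thread (TFLOPs).
a100_tflops = 19.0    # NVidia A100
versal_tflops = 8.0   # Xilinx Versal, DSP cores only

ratio = a100_tflops / versal_tflops
print(f"{ratio:.1f}x")  # prints 2.4x -- nowhere near the claimed 20x
```

Even ignoring whatever the LUTs contribute, the theoretical gap is roughly 2x, not 20x.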


> GPUs are 20x faster than FPGAs

The original comment by brandmeyer said "ASIC", not "GPU".

Take the same RTL. Synthesize it for ASIC and for FPGA. Observe a 20x difference after normalizing for power, area, and clock speed.


Yes, and FPGAs can be better than GPUs for some applications, and even more power-efficient and cost-effective.


The question is how much of the GPU's 19 TFlops your algorithm actually gets, versus how much of the Versal's 8. I'm sure many algorithms fit GPUs fine, but some don't, and might get more out of an FPGA.



