My day job is ASIC design and we do some prototyping on FPGAs, so the exact same...

brandmeyer · on Oct 28, 2020

I think it depends a great deal on what you're doing. A fully pipelined double-precision floating-point fused multiply-add in FPGA tech will reach well over 500 MHz on current parts, but takes almost 30 cycles of pipelined latency to deliver each result. On the same process node, a well-optimized CPU will run at 6-8x the clock frequency and only require 4 cycles of latency to deliver each result.
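To put those figures in units of time rather than cycles, here is a rough back-of-the-envelope sketch. It assumes exactly 500 MHz and 30 cycles for the FPGA (the comment says "well over" and "almost", so these are round stand-ins) and takes the CPU clock as 6x and 8x that baseline. The helper name `latency_ns` is made up for illustration.

```python
def latency_ns(cycles, clock_mhz):
    """Wall-clock latency in nanoseconds for a result that takes `cycles` at `clock_mhz`."""
    return cycles * 1e3 / clock_mhz

# Assumed round numbers taken from the comment above.
FPGA_CLOCK_MHZ = 500
FPGA_FMA_CYCLES = 30
CPU_FMA_CYCLES = 4

fpga_ns = latency_ns(FPGA_FMA_CYCLES, FPGA_CLOCK_MHZ)
cpu_slow_ns = latency_ns(CPU_FMA_CYCLES, 6 * FPGA_CLOCK_MHZ)  # CPU at 6x the FPGA clock
cpu_fast_ns = latency_ns(CPU_FMA_CYCLES, 8 * FPGA_CLOCK_MHZ)  # CPU at 8x the FPGA clock

print(f"FPGA FMA latency: {fpga_ns:.1f} ns")
print(f"CPU FMA latency:  {cpu_fast_ns:.2f} to {cpu_slow_ns:.2f} ns")
print(f"FPGA / CPU ratio: {fpga_ns / cpu_slow_ns:.0f}x to {fpga_ns / cpu_fast_ns:.0f}x")
```

Under those assumptions the FPGA result arrives after about 60 ns versus roughly 1 to 1.3 ns on the CPU, so the per-result latency gap is far larger than the clock ratio alone suggests.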

Is this flow filled with divide-and-conquer algorithms with very low work per step? Yes. Is that particularly ill-suited to FPGA logic? Yes. Is it unfair to the FPGA? Not in my opinion.

I stand by my claim: If you normalize a general circuit's speed in units of time instead of cycles, then you'll find that ASICs come out much much farther ahead.

tails4e · on Oct 28, 2020

From this [0], it appears the Xilinx floating-point core can run at >600 MHz, and the latency of many operations is just a few cycles. Also, as it's pipelined, the throughput could be one result per clock, depending on how you configure the core. Seems closer to the 5x to me.

[0] https://www.xilinx.com/support/documentation/ip_documentatio...

brandmeyer · on Oct 28, 2020

That chart doesn't show you the result latency, only the maximum achievable frequency. You have to use Vivado to instantiate the core with the specific set of configurable options. When you do that, it will inform you of the result latency: 27-30 cycles for FMA.
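The throughput-versus-latency distinction in this exchange can be sketched with a simple model. It assumes a 600 MHz clock (the ">600 MHz" figure quoted above, rounded down), a 27-cycle FMA latency, and a core configured to accept one new operation per clock. The function names and the 1000-operation workload are hypothetical, chosen only for illustration.

```python
def dependent_chain_ns(n_ops, latency_cycles, clock_mhz):
    """Each operation needs the previous result, so the full latency is paid every time."""
    return n_ops * latency_cycles * 1e3 / clock_mhz

def independent_stream_ns(n_ops, latency_cycles, clock_mhz):
    """Independent operations issued one per clock: pay the latency once, then one result per cycle."""
    return (latency_cycles + n_ops - 1) * 1e3 / clock_mhz

N_OPS = 1000          # hypothetical workload size
LATENCY_CYCLES = 27   # low end of the 27-30 cycle range
CLOCK_MHZ = 600       # assumed from the ">600 MHz" figure

chain_ns = dependent_chain_ns(N_OPS, LATENCY_CYCLES, CLOCK_MHZ)
stream_ns = independent_stream_ns(N_OPS, LATENCY_CYCLES, CLOCK_MHZ)

print(f"{N_OPS} dependent FMAs:   {chain_ns:.0f} ns")
print(f"{N_OPS} independent FMAs: {stream_ns:.0f} ns")
```

In this model the pipelined core looks fast only when there is enough independent work to keep it full. A chain of dependent operations pays the full 27 cycles at every step.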