The performance of soft cores is significantly lower than that of hard cores.
Xilinx already has a RISC soft core in their MicroBlaze architecture, so they don't have a pressing need for a low-power, reasonable-performance RISC soft core. Ref: https://en.wikipedia.org/wiki/MicroBlaze
AMD has high performance CPUs being fabbed by TSMC (same foundry as Xilinx), so (theoretically) AMD CPUs can be grafted onto the Xilinx FPGA as a hard core.
With AMD and the MicroBlaze, they have both the high-performance and low-power ends of the processor spectrum covered, with no need for third-party licensing costs.
FPGAs are to ASICs as interpreted languages are to compiled languages. I don't mean that literally, but I do mean it in the performance sense. At the same process node, an FPGA is over 50x the power and 1/20th the speed of a dedicated ASIC and it isn't getting any better.
Note however that Xilinx has a DSP slice (UltraScale) which is a prefabbed adder/multiplier. That would be PyTorch in the analogy.
FPGA LUTs cannot compete against ASICs, so modern FPGAs ship thousands of dedicated hard multipliers to make up the difference.
The LUTs compete against software, while the dedicated UltraScale DSP48 slices compete against GPUs or tensor cores.
--------
It's not easy, and it's not cheap. But those UltraScale DSP48 units are competitive vs GPUs.
It's still my opinion that GPUs win in most cases, due to being more software-based and easier to understand. It is also cheaper to make a GPU. But I can see the argument for Xilinx FPGAs if the problem is just right...
With one notable detail: It's much easier to stream data through a GPU than through an FPGA. And I say that fully knowing how much of a (relative) PITA it is to stream data through a GPU.
I think it also works better to think of the DSP resources as a big systolic array with lots of local connectivity and memory and only sparse remote connectivity. The SIMD model doesn't really apply.
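For anyone who hasn't run into systolic arrays: here's a toy, cycle-level Python sketch of an output-stationary array doing a matrix multiply, just to illustrate the "only talk to your neighbours" dataflow. It's purely illustrative; nothing here is how any vendor's DSP fabric is actually programmed.

    # Toy cycle-by-cycle simulation of an output-stationary systolic array
    # computing C = A @ B. Each PE only ever reads from its left/top
    # neighbour (or the array edge): lots of local connectivity, no
    # long-distance communication.

    def systolic_matmul(A, B):
        n = len(A)                                   # assume square n x n inputs
        acc    = [[0.0] * n for _ in range(n)]       # per-PE accumulators (stay put)
        a_pipe = [[0.0] * n for _ in range(n)]       # 'a' value registered in each PE
        b_pipe = [[0.0] * n for _ in range(n)]       # 'b' value registered in each PE

        for t in range(3 * n - 2):                   # cycles until the array drains
            new_a = [[0.0] * n for _ in range(n)]
            new_b = [[0.0] * n for _ in range(n)]
            for i in range(n):
                for j in range(n):
                    # 'a' enters row i from the left edge, skewed by i cycles,
                    # then hops one PE to the right per cycle.
                    if j == 0:
                        a_in = A[i][t - i] if 0 <= t - i < n else 0.0
                    else:
                        a_in = a_pipe[i][j - 1]
                    # 'b' enters column j from the top edge, skewed by j cycles,
                    # then hops one PE down per cycle.
                    if i == 0:
                        b_in = B[t - j][j] if 0 <= t - j < n else 0.0
                    else:
                        b_in = b_pipe[i - 1][j]
                    acc[i][j] += a_in * b_in         # multiply-accumulate in place
                    new_a[i][j] = a_in               # forwarded right next cycle
                    new_b[i][j] = b_in               # forwarded down next cycle
            a_pipe, b_pipe = new_a, new_b
        return acc

    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # [[19.0, 22.0], [43.0, 50.0]]

Note that no PE ever addresses memory globally; everything arrives from a neighbour one hop away, which is why the SIMD mental model doesn't fit.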
What's annoying is there's no real reason it has to be harder to stream through an FPGA; it's largely just that the ecosystem and FPGA vendor tooling is so utterly garbage that installing it will probably attract raccoons to your /opt.
> It's much easier to stream data through a GPU than through an FPGA.
That's an interesting assertion. FPGAs are better at moving data from one place to another than any other general-purpose device I can think of.
How many GPUs have Xilinx GTx-class transceivers, for instance? A GPU with JESD204B/C connectivity would be an extremely interesting piece of hardware.
"Easier" as I'm using it is a metric of engineering effort more than throughput. I don't disagree - there's vastly more aggregate bandwidth available, both internally and on external interfaces.
But it's vastly easier to get started with CUDA or OpenCL than it is to get started with a big FPGA.
> But those UltraScale DSP48 units are competitive vs GPUs.
Are they really competitive from a price/performance perspective? Based on my limited understanding, aren't Nvidia GPUs, for example, several times cheaper for similar performance?
Mass-produced commodity processors will always win in price/performance. That's why x86 won, despite "more efficient" machines (Itanium, SPARC, DEC Alpha, PowerPC, etc.) being developed.
One of the few architectures to beat x86 in price/performance was ARM, because ARM aimed at even smaller and cheaper devices than even x86's ambitions. Ultimately, ARM "out-x86'd" the original x86 business strategy.
-------------
GPUs managed to commoditize themselves thanks to the video game market. Most computers have a GPU in them today, if only for video games (iPhone, Snapdragon, normal PCs, and yes, game consoles). That's an opportunity for GPU coders, as well as for supercomputer builders who want a "2nd architecture" better suited to a 2nd set of compute problems.
-----
FPGAs will probably never win in price/performance (unless some "commodity purpose" is discovered, which I find highly unlikely). Where FPGAs win is absolute performance, or performance/watt, on tasks that CPUs or GPUs don't do very well (e.g. BTC mining, deep learning systolic arrays, or... whatever is invented later).
Computers are so cheap that even a $10,000 FPGA may pay for itself in electricity savings over an equivalent GPU across a 3-year deployment. Electricity costs for data centers are pretty huge.
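A back-of-the-envelope version of that electricity argument, where every input (power draw, $/kWh) is an assumption for illustration rather than a measurement:

    # Rough 3-year electricity comparison; all numbers are assumptions.
    gpu_watts  = 300                  # assumed average draw of a datacenter GPU
    fpga_watts = 100                  # assumed draw of an FPGA doing the same job
    price_kwh  = 0.10                 # assumed electricity price, $/kWh
    hours      = 24 * 365 * 3         # three years of continuous operation

    saved_kwh = (gpu_watts - fpga_watts) * hours / 1000.0
    print(f"{saved_kwh:,.0f} kWh saved, about ${saved_kwh * price_kwh:,.0f}")
    # ~5,256 kWh, ~$526 with these inputs.  Scale by the datacenter's PUE
    # (cooling overhead) and by how many GPUs one FPGA card replaces before
    # comparing against the purchase-price difference.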
The ultimate winner is of course, ASICs, a dedicated circuit for whatever you're trying to do. (Ex: Deep Blue's chess ASIC. Or Alexa's ASIC to interpret voice commands). But FPGAs serve as a stepping stone between CPUs and ASICs.
------
If you have a problem that's already served by a commodity processor, then absolutely use a standard computer! FPGAs are for people who have non-standard problems: weird data movement, or compute so DENSE that all those cache layers in the CPU (or GPU) just get in the way.
Those AI accelerators aren't really "full CPUs", since there's no cache coherence, really. They're tiny 32kB memory slabs + decoder + ALUs + networking to connect to the rest of the FPGA.
But it's certainly more advanced than a DSP slice (which is only somewhat more complicated than a multiply-and-add circuit).
-------
I guess you can think of it as a tiny 32kB SRAM + CPU though. But it's still missing a bunch of parts that most people would consider "part of a CPU". Then again, even a GPU provides synchronization functions so its cores can communicate and synchronize with each other.
Just looking at raw FLOPS: the 7nm Xilinx Versal series tops out at 8 TFLOPS of 32-bit compute (DSP cores only), plus whatever the CPU core and LUTs can do (but I assume the CPU core is for management, and the LUTs are for routing rather than dense compute).
In contrast, the Nvidia A100 has 19 TFLOPS of 32-bit compute. Higher than the Xilinx chip, but the Xilinx chip is still within an order of magnitude, and it still has the benefit of the LUTs.
Raw FLOPS is completely misleading, which is why Nvidia focuses on it as a metric. The GPU can't keep those ops active, particularly during inference, when most of the data is fresh so caches don't help. It's the roofline model.
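The roofline bound is easy to state in code. A minimal sketch, using approximate public A100 figures (~19.5 TFLOPS FP32 peak, ~1.5 TB/s HBM bandwidth) and made-up arithmetic intensities just to show where the memory wall bites:

    # Roofline model: attainable FLOP/s is capped either by the compute peak
    # or by memory bandwidth times arithmetic intensity (FLOPs per byte moved).

    def roofline(peak_flops, mem_bw_bytes_per_s, flops_per_byte):
        return min(peak_flops, mem_bw_bytes_per_s * flops_per_byte)

    peak = 19.5e12                    # ~A100 FP32 peak, FLOP/s
    bw   = 1.5e12                     # ~A100 HBM2 bandwidth, bytes/s
    for ai in (0.5, 2, 8, 32):        # hypothetical workload intensities
        print(f"intensity {ai:>4}: {roofline(peak, bw, ai) / 1e12:5.1f} TFLOP/s attainable")
    # Memory-bound kernels (low intensity, e.g. small-batch inference) sit far
    # below the advertised peak, which is the point being made above.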
In my experience FPGA > GPU for inference, if you have people who can implement good FPGA designs. And inference is more common than training. Much of this is due to explicit memory management and the larger amount of on-chip memory on an FPGA.
Well, my primary point is that the earlier assertion, "GPUs are 20x faster than FPGAs," is nowhere close to the theory of operation, let alone reality.
ASICs (in this case, a fully dedicated GPU) obviously win in the situations they are designed for. The A100, and other GPU designs, will probably have higher FLOPS than any FPGA made on the 7nm node.
But not a lot more FLOPS, and the additional flexibility of an FPGA could really help on some problems. It really depends on what you're trying to do.
------
At best, a top-of-the-line 7nm GPU has ~2x the FLOPS of a top-of-the-line 7nm FPGA in today's environment. In reality, it all comes down to how the software is written (and FPGAs could absolutely win in the right situation).
The question is how much your algorithm actually gets out of the GPU's 19 TFLOPS, versus how much out of the Versal's 8. I'm sure many algorithms fit GPUs fine, but some don't, and might get more out of an FPGA.
I agree with the sentiment, but the numbers are off. It's about 10x the power in the worst case (maybe 5x for some DSP-heavy apps) and also around 5 to 10x for speed. An FPGA can easily run at hundreds of MHz, up to 500 MHz with good design pipelining, so a 20x speed advantage would put the ASIC at 10 GHz, which is beyond most ASICs; I think 5x is more reasonable.
My experience is that to get those "high" clock frequencies, the work per cycle has to be extremely small. If you normalize to total circuit delay in units of time, then you still end up many times worse, because you need many extra pipeline cycles to get the Fmax that high.
My day job is ASIC design and we do some prototyping on FPGAs, so the exact same RTL is used as an input. We always benchmark power, performance, etc. between ASIC and FPGA, so this is based on some real designs. A 5x reduction in power is fair for most of what I've seen, and the FPGA is actually better at achieving Fmax than you'd expect: control paths do need a lot more pipelining than on an ASIC, but compute-intensive (DSP) datapaths are pretty good with a few tweaks. I think sometimes people throw code at them, get 100 MHz, and say "well, FPGAs are slow, so it's expected," but in my experience with a little tuning you can get most datapaths to run at 500 MHz. You do pay the power penalty vs a dedicated ASIC, but the performance is very good.
I think it depends a great deal on what you're doing. A fully pipelined double-precision floating-point fused multiply-add in FPGA tech will reach well over 500 MHz on current parts, but takes almost 30 cycles of pipelined latency to deliver each result. On the same process node, a well-optimized CPU will run at 6-8x the clock frequency and only require 4 cycles of latency to deliver each result.
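Putting rough numbers on that comparison (illustrative clocks consistent with the figures above, not a specific device or benchmark):

    # Per-result latency of a dependent FMA chain, in nanoseconds.
    fpga_clk_hz, fpga_fma_cycles = 500e6, 30      # ~500 MHz, ~30-cycle pipelined FMA
    cpu_clk_hz,  cpu_fma_cycles  = 3.5e9, 4       # ~7x the clock, 4-cycle FMA

    fpga_ns = fpga_fma_cycles / fpga_clk_hz * 1e9 # 60 ns per dependent result
    cpu_ns  = cpu_fma_cycles  / cpu_clk_hz  * 1e9 # ~1.1 ns per dependent result
    print(f"FPGA {fpga_ns:.0f} ns vs CPU {cpu_ns:.1f} ns (~{fpga_ns / cpu_ns:.0f}x)")
    # Both can still retire one result per clock per pipeline; the gap shows up
    # in latency for dependent operations, i.e. when you normalize by time
    # rather than by cycles.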
Is this flow filled with divide-and-conquer algorithms with very low work per step? Yes. Is that particularly ill-suited to FPGA logic? Yes. Is it unfair to the FPGA? Not in my opinion.
I stand by my claim: If you normalize a general circuit's speed in units of time instead of cycles, then you'll find that ASICs come out much much farther ahead.
From this [0], it suggests the Xilinx floating-point core can run at >600 MHz, and the latency of many operations is just a few cycles. Also, as it's pipelined, the throughput could mean one result per clock, depending on how you configure the core. Seems closer to the 5x to me.
That chart doesn't show you the result latency, only the maximum achievable frequency. You have to use Vivado to instantiate the core with your specific set of configuration options. When you do that, it will report the result latency: 27-30 cycles for FMA.
Xilinx already has a number of devices with ARM hard cores, though (like the Zynq series). There's no compelling reason for them to switch away from that.