The performance of soft cores is significantly lower than that of hard cores.
Xilinx already has a RISC soft core in their MicroBlaze architecture, so they don't have a pressing need for a low-power, reasonable-performance RISC soft core. Ref: https://en.wikipedia.org/wiki/MicroBlaze
AMD has high performance CPUs being fabbed by TSMC (same foundry as Xilinx), so (theoretically) AMD CPUs can be grafted onto the Xilinx FPGA as a hard core.
With AMD and the MicroBlaze, they have both the high-performance and low-power ends of the processor spectrum covered, with no need for third-party licensing costs.
FPGAs are to ASICs as interpreted languages are to compiled languages. I don't mean that literally, but I do mean it in the performance sense. At the same process node, an FPGA is over 50x the power and 1/20th the speed of a dedicated ASIC and it isn't getting any better.
Note however that Xilinx has a DSP slice (UltraScale) which is a prefabbed adder/multiplier. That would be PyTorch in the analogy.
FPGA LUTs cannot compete against ASICs, so modern FPGAs ship thousands of dedicated hard multipliers to make up the difference.
The LUTs compete against software, while the dedicated UltraScale DSP48 slices compete against GPUs or tensor cores.
--------
It's not easy, and it's not cheap. But those UltraScale DSP48 units are competitive vs GPUs.
It's still my opinion that GPUs win in most cases, due to being more software-based and easier to understand. It is also cheaper to make a GPU. But I can see the argument for Xilinx FPGAs if the problem is just right...
With one notable detail: It's much easier to stream data through a GPU than through an FPGA. And I say that fully knowing how much of a (relative) PITA it is to stream data through a GPU.
I think it also works better to think of the DSP resources as a big systolic array with lots of local connectivity and memory and only sparse remote connectivity. The SIMD model doesn't really apply.
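For anyone who hasn't run into systolic arrays: here's a toy, cycle-level Python sketch of an output-stationary array doing a matrix multiply, just to illustrate the "only talk to your neighbours" dataflow. It's purely illustrative; nothing here is how any vendor's DSP fabric is actually programmed.

    # Toy cycle-by-cycle simulation of an output-stationary systolic array
    # computing C = A @ B. Each PE only ever reads from its left/top
    # neighbour (or the array edge): lots of local connectivity, no
    # long-distance communication.

    def systolic_matmul(A, B):
        n = len(A)                                   # assume square n x n inputs
        acc    = [[0.0] * n for _ in range(n)]       # per-PE accumulators (stay put)
        a_pipe = [[0.0] * n for _ in range(n)]       # 'a' value registered in each PE
        b_pipe = [[0.0] * n for _ in range(n)]       # 'b' value registered in each PE

        for t in range(3 * n - 2):                   # cycles until the array drains
            new_a = [[0.0] * n for _ in range(n)]
            new_b = [[0.0] * n for _ in range(n)]
            for i in range(n):
                for j in range(n):
                    # 'a' enters row i from the left edge, skewed by i cycles,
                    # then hops one PE to the right per cycle.
                    if j == 0:
                        a_in = A[i][t - i] if 0 <= t - i < n else 0.0
                    else:
                        a_in = a_pipe[i][j - 1]
                    # 'b' enters column j from the top edge, skewed by j cycles,
                    # then hops one PE down per cycle.
                    if i == 0:
                        b_in = B[t - j][j] if 0 <= t - j < n else 0.0
                    else:
                        b_in = b_pipe[i - 1][j]
                    acc[i][j] += a_in * b_in         # multiply-accumulate in place
                    new_a[i][j] = a_in               # forwarded right next cycle
                    new_b[i][j] = b_in               # forwarded down next cycle
            a_pipe, b_pipe = new_a, new_b
        return acc

    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # [[19.0, 22.0], [43.0, 50.0]]

Note that no PE ever addresses memory globally; everything arrives from a neighbour one hop away, which is why the SIMD mental model doesn't fit.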
What's annoying is there's no real reason it has to be harder to stream through an FPGA; it's largely just that the ecosystem and FPGA vendor tooling is so utterly garbage that installing it will probably attract raccoons to your /opt.
> It's much easier to stream data through a GPU than through an FPGA.
That's an interesting assertion. FPGAs are better at moving data from one place to another than any other general-purpose device I can think of.
How many GPUs have Xilinx GTx-class transceivers, for instance? A GPU with JESD204B/C connectivity would be an extremely interesting piece of hardware.
"Easier" as I'm using it is a metric of engineering effort more than throughput. I don't disagree - there's vastly more aggregate bandwidth available, both internally and on external interfaces.
But it's vastly easier to get started with CUDA or OpenCL than it is to get started with a big FPGA.
> But those UltraScale DSP48 units are competitive vs GPUs.
Are they really competitive from a price/performance perspective? Based on my limited understanding, aren't Nvidia GPUs, for example, several times cheaper for similar performance?
Mass-produced commodity processors will always win in price/performance. That's why x86 won, despite "more efficient" machines (Itanium, SPARC, DEC Alpha, PowerPC, etc.) being developed.
One of the few architectures to beat x86 in price/performance was ARM, because ARM aimed at even smaller and cheaper devices than even x86's ambitions. Ultimately, ARM "out-x86'd" the original x86 business strategy.
-------------
GPUs managed to commoditize themselves thanks to the video game market. Most computers have a GPU in them today, if only for video games (iPhone, Snapdragon, normal PCs, and yes, game consoles). That's an opportunity for GPU coders, as well as for supercomputer builders who want a "2nd architecture" better suited to a 2nd set of compute problems.
-----
FPGAs will probably never win in price/performance (unless some "commodity purpose" is discovered, which I find highly unlikely). Where FPGAs win is absolute performance, or performance/watt, on tasks that CPUs or GPUs don't do very well (e.g. BTC mining, deep learning systolic arrays, or... whatever is invented later).
Computers are so cheap that even a $10,000 FPGA may pay for itself in electricity savings over an equivalent GPU across a 3-year deployment. Electricity costs for data centers are pretty huge.
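A back-of-the-envelope version of that electricity argument, where every input (power draw, $/kWh) is an assumption for illustration rather than a measurement:

    # Rough 3-year electricity comparison; all numbers are assumptions.
    gpu_watts  = 300                  # assumed average draw of a datacenter GPU
    fpga_watts = 100                  # assumed draw of an FPGA doing the same job
    price_kwh  = 0.10                 # assumed electricity price, $/kWh
    hours      = 24 * 365 * 3         # three years of continuous operation

    saved_kwh = (gpu_watts - fpga_watts) * hours / 1000.0
    print(f"{saved_kwh:,.0f} kWh saved, about ${saved_kwh * price_kwh:,.0f}")
    # ~5,256 kWh, ~$526 with these inputs.  Scale by the datacenter's PUE
    # (cooling overhead) and by how many GPUs one FPGA card replaces before
    # comparing against the purchase-price difference.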
The ultimate winner is of course, ASICs, a dedicated circuit for whatever you're trying to do. (Ex: Deep Blue's chess ASIC. Or Alexa's ASIC to interpret voice commands). But FPGAs serve as a stepping stone between CPUs and ASICs.
------
If you have a problem that's already served by a commodity processor, then absolutely use a standard computer! FPGAs are for people who have non-standard problems: weird data movement, or compute so DENSE that all those cache layers in the CPU (or GPU) just get in the way.
Those AI accelerators aren't really "full CPUs", since there's no cache coherence, really. They're tiny 32kB memory slabs + decoder + ALUs + networking to connect to the rest of the FPGA.
But it's certainly more advanced than a DSP slice (which is only somewhat more complicated than a multiply-and-add circuit).
-------
I guess you can think of it as a tiny 32kB SRAM + CPU though. But it's still missing a bunch of parts that most people would consider "part of a CPU". Then again, even a GPU provides synchronization functions so its cores can communicate and synchronize with each other.
Just looking at raw FLOPS: the 7nm Xilinx Versal series tops out at 8 TFLOPS of 32-bit compute (DSP cores only), plus whatever the CPU core and LUTs can do (but I assume the CPU core is for management, and the LUTs are for routing rather than dense compute).
In contrast, the Nvidia A100 has 19 TFLOPS of 32-bit compute. Higher than the Xilinx chip, but the Xilinx chip is still within an order of magnitude, and it still has the benefit of the LUTs.
Raw FLOPS is completely misleading, which is why Nvidia focuses on it as a metric. The GPU can't keep those ops active, particularly during inference, when most of the data is fresh so caches don't help. It's the roofline model.
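The roofline bound is easy to state in code. A minimal sketch, using approximate public A100 figures (~19.5 TFLOPS FP32 peak, ~1.5 TB/s HBM bandwidth) and made-up arithmetic intensities just to show where the memory wall bites:

    # Roofline model: attainable FLOP/s is capped either by the compute peak
    # or by memory bandwidth times arithmetic intensity (FLOPs per byte moved).

    def roofline(peak_flops, mem_bw_bytes_per_s, flops_per_byte):
        return min(peak_flops, mem_bw_bytes_per_s * flops_per_byte)

    peak = 19.5e12                    # ~A100 FP32 peak, FLOP/s
    bw   = 1.5e12                     # ~A100 HBM2 bandwidth, bytes/s
    for ai in (0.5, 2, 8, 32):        # hypothetical workload intensities
        print(f"intensity {ai:>4}: {roofline(peak, bw, ai) / 1e12:5.1f} TFLOP/s attainable")
    # Memory-bound kernels (low intensity, e.g. small-batch inference) sit far
    # below the advertised peak, which is the point being made above.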
In my experience FPGA > GPU for inference, if you have people who can implement good FPGA designs. And inference is more common than training. Much of this is due to explicit memory management and the larger amount of on-chip memory on an FPGA.
Well, my primary point is that the earlier assertion, "GPUs are 20x faster than FPGAs," is nowhere close to the theory of operation, let alone reality.
ASICs (in this case, a fully dedicated GPU) obviously win in the situations they are designed for. The A100, and other GPU designs, will probably have higher FLOPS than any FPGA made on the 7nm node.
But not a lot more FLOPS, and the additional flexibility of an FPGA could really help on some problems. It really depends on what you're trying to do.
------
At best, a top-of-the-line 7nm GPU has ~2x the FLOPS of a top-of-the-line 7nm FPGA in today's environment. In reality, it all comes down to how the software is written (and FPGAs could absolutely win in the right situation).
The question is how much your algorithm actually gets out of the GPU's 19 TFLOPS, versus how much out of the Versal's 8. I'm sure many algorithms fit GPUs fine, but some don't, and might get more out of an FPGA.
I agree with the sentiment, but the numbers are off. It's about 10x the power in the worst case (maybe 5x for some DSP-heavy apps) and also around 5 to 10x for speed. An FPGA can easily run at hundreds of MHz, up to 500 MHz with good design pipelining, so a 20x speed advantage would put the ASIC at 10 GHz, which is beyond most ASICs; I think 5x is more reasonable.
My experience is that to get those "high" clock frequencies, the work per cycle has to be extremely small. If you normalize to total circuit delay in units of time, then you still end up many times worse, because you need many extra pipeline cycles to get the Fmax that high.
My day job is ASIC design and we do some prototyping on FPGAs, so the exact same RTL is used as an input. We always benchmark power, performance, etc. between ASIC and FPGA, so this is based on some real designs. A 5x reduction in power is fair for most of what I've seen, and the FPGA is actually better at achieving Fmax than you'd expect: control paths do need a lot more pipelining than on an ASIC, but compute-intensive (DSP) datapaths are pretty good with a few tweaks. I think sometimes people throw code at them, get 100 MHz, and say "well, FPGAs are slow, so it's expected," but in my experience with a little tuning you can get most datapaths to run at 500 MHz. You do pay the power penalty vs a dedicated ASIC, but the performance is very good.
I think it depends a great deal on what you're doing. A fully pipelined double-precision floating-point fused multiply-add in FPGA tech will reach well over 500 MHz on current parts, but takes almost 30 cycles of pipelined latency to deliver each result. On the same process node, a well-optimized CPU will run at 6-8x the clock frequency and only require 4 cycles of latency to deliver each result.
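Putting rough numbers on that comparison (illustrative clocks consistent with the figures above, not a specific device or benchmark):

    # Per-result latency of a dependent FMA chain, in nanoseconds.
    fpga_clk_hz, fpga_fma_cycles = 500e6, 30      # ~500 MHz, ~30-cycle pipelined FMA
    cpu_clk_hz,  cpu_fma_cycles  = 3.5e9, 4       # ~7x the clock, 4-cycle FMA

    fpga_ns = fpga_fma_cycles / fpga_clk_hz * 1e9 # 60 ns per dependent result
    cpu_ns  = cpu_fma_cycles  / cpu_clk_hz  * 1e9 # ~1.1 ns per dependent result
    print(f"FPGA {fpga_ns:.0f} ns vs CPU {cpu_ns:.1f} ns (~{fpga_ns / cpu_ns:.0f}x)")
    # Both can still retire one result per clock per pipeline; the gap shows up
    # in latency for dependent operations, i.e. when you normalize by time
    # rather than by cycles.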
Is this flow filled with divide-and-conquer algorithms with very low work per step? Yes. Is that particularly ill-suited to FPGA logic? Yes. Is it unfair to the FPGA? Not in my opinion.
I stand by my claim: If you normalize a general circuit's speed in units of time instead of cycles, then you'll find that ASICs come out much much farther ahead.
From this [0], it suggests the Xilinx floating-point core can run at >600 MHz, and the latency of many operations is just a few cycles. Also, as it's pipelined, the throughput could mean one result per clock, depending on how you configure the core. Seems closer to the 5x to me.
That chart doesn't show you the result latency, only the maximum achievable frequency. You have to use Vivado to instantiate the core with your specific set of configuration options. When you do that, it will report the result latency: 27-30 cycles for FMA.
Xilinx already has a number of devices with ARM hard cores, though (like the Zynq series). There's no compelling reason for them to switch away from that.