I would rather see Processing in Memory (PIM) become mainstream than FPGAs. FPGAs are basically an assembly line that you can change overnight: excellent at one task, and they minimize end-to-end latency, but if it's actually raw performance you're after, you are entirely dependent on the DSP slices.
With PIM your CPU resources grow with the size of your memory. All you have to do is partition your data and write regular C code; the only difference is that it is executed by a processor inside your RAM.
Having more cores is basically the same thing as having more DSP slices. Since those cores are embedded directly inside memory, they have high data locality, which is basically the only other benefit FPGAs have over CPUs (assuming the same number of DSP slices and cores). And it's obviously easier to program than either GPUs or FPGAs.
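For a rough idea of what that programming model looks like, here's a minimal sketch: partition the data, then run the same plain C kernel on each partition. The bank count and the dispatch loop are made up for illustration, and the per-bank launch is just simulated on the host; a real PIM SDK would have its own API for copying data into a bank and launching the kernel on the core inside it.

    /* Sketch of the PIM model described above: partition the data, then
     * run the same ordinary C kernel on every partition. The "banks" are
     * simulated with a plain host-side loop; on real PIM hardware a
     * vendor runtime would dispatch each call to the core sitting inside
     * the corresponding memory bank. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BANKS 4   /* hypothetical: one embedded core per bank */

    /* The per-bank kernel: plain C, nothing FPGA- or GPU-specific. */
    static void scale_partition(int32_t *chunk, size_t n, int32_t factor)
    {
        for (size_t i = 0; i < n; i++)
            chunk[i] *= factor;
    }

    int main(void)
    {
        int32_t data[16];
        for (size_t i = 0; i < 16; i++)
            data[i] = (int32_t)i;

        size_t per_bank = 16 / NUM_BANKS;
        for (size_t b = 0; b < NUM_BANKS; b++)   /* "dispatch" one kernel per bank */
            scale_partition(data + b * per_bank, per_bank, 3);

        for (size_t i = 0; i < 16; i++)
            printf("%d ", data[i]);
        printf("\n");
        return 0;
    }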
You're comparing two completely different paradigms.
FPGAs are not an assembly line at all; the assembly line analogy applies much more closely to a processor's pipeline.
FPGAs are just a massive set of very simple logic units which can be interconnected in many different ways. FPGAs are best used in situations where you want to perform a series of simple operations on a massive incoming dataset, in parallel, especially in real-time situations. Performing domain transforms on data coming in from sensor arrays is one very good application for FPGAs.
I think GP meant it in the sense that reconfiguration time is large. FPGAs cannot be effectively time-division multiplexed, as a full reconfiguration can take up to tens of seconds.
GP is also correct that DSP/SRAM blocks are critical to performance. FPGAs are not very efficient at raw compute if you have to synthesize everything out of LEs (logic elements).
The performance benefit of FPGAs, which PIMs also share (in theory; there aren't any PIMs ready for real-world deployment AFAIK), is that they can leverage much larger memory bandwidths than general-purpose CPUs can. An FPGA might run at a lower clock rate (low 100s of MHz) but be able to operate on several kilobits per clock cycle. This can work really well when paired with off-chip logic to convert high-rate serial interfaces to lower-clock-rate parallel interfaces, then back after the FPGA is done processing.
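To put rough numbers on that (illustrative figures, not from any specific part), the arithmetic looks like this:

    /* Back-of-the-envelope throughput, with made-up but plausible numbers:
     * a fabric clocked at 250 MHz that consumes a 2048-bit-wide word
     * every cycle sustains 250e6 * 2048 bits/s = 512 Gbit/s = 64 GB/s
     * on a single pipelined datapath. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_hz   = 250e6;    /* "low 100s of MHz" */
        const double bits_cycle = 2048.0;   /* "several kilobits per clock cycle" */

        double gbytes_per_s = clock_hz * bits_cycle / 8.0 / 1e9;
        printf("sustained throughput: %.0f GB/s\n", gbytes_per_s);  /* 64 GB/s */
        return 0;
    }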
There is also a lot of work going on in the space of time-division multiplexing FPGAs effectively. The two main approaches are overlay architectures and partial reconfiguration. The former implements another high-level fabric on top of the FPGA which will be less general-purpose, but can be reconfigured faster. The latter is a feature vendors have added to some high-end chips where specific regions of the FPGA can be reconfigured without affecting other regions.
I agree with your statements regarding reconfiguration and TDM, though I still think GP's comment (and, to a lesser extent, yours) is very focused on traditional computing paradigms. FPGAs are much more promising for real-time systems, particularly those with very large incoming datasets to transform or otherwise process in parallel. Thinking about FPGAs in terms of how 'quickly' they process data is really missing the point IMO.
One common and very good application for FPGAs is Active Electronically Scanned Array (AESA) radar, sonar, or camera image processing. You can perform parallel filtering and transforms with various frequency and phase settings, which would be impossible for a similarly sized processor to do.
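To make "various frequency and phase settings" a bit more concrete, here's the per-sample math of a toy narrowband beamformer, written as a sequential C loop for readability; on an FPGA each channel's complex multiply would sit in its own DSP slice and all of them would run every clock. Channel count and steering angle are arbitrary illustration values.

    /* Toy narrowband beamformer: apply a per-element phase weight to
     * each sensor channel and sum. Sequential here; on an FPGA every
     * channel's multiply runs in parallel each clock cycle. */
    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    #define CHANNELS 8

    static double complex beamform_sample(const double complex x[CHANNELS],
                                          double phase_step)
    {
        double complex acc = 0.0;
        for (int c = 0; c < CHANNELS; c++) {
            /* steering weight e^{-j * c * phase_step} for element c */
            double complex w = cexp(-I * c * phase_step);
            acc += w * x[c];    /* one complex MAC per channel per sample */
        }
        return acc;
    }

    int main(void)
    {
        const double phase_step = 0.785398;      /* ~pi/4, arbitrary angle */
        double complex x[CHANNELS];
        for (int c = 0; c < CHANNELS; c++)       /* fake a plane wave from that angle */
            x[c] = cexp(I * c * phase_step);

        double complex y = beamform_sample(x, phase_step);
        printf("|y| = %.1f (coherent sum of %d channels)\n", cabs(y), CHANNELS);
        return 0;   /* build with -lm */
    }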
FPGAs have the potential to revolutionize sensor arrays, by making them much more useful and affordable.
Yes, I agree. "Traditional computing paradigms" are (IMO) not all that interesting as research topics at this point. As far as I know, most of the work in that space is in branch prediction and cache replacement policies.
FPGAs are what you really want when you need to deal with high-resolution data coming in at very high data rates. Often even a very fast general-purpose processor running hand-tuned assembly simply won't have the theoretical memory throughput to process your data without "dropping frames". They also have the benefit of deterministic performance, which modern caching/branch prediction systems can't guarantee (AFAIK; my computer architecture knowledge isn't that cutting edge).
They can also work really well if you have some computation that is so far off the beaten path for general-purpose processors (or so memory-bound) that an FPGA comes out ahead.
There is also some work on sprinkling even more hard logic into FPGA dies, like processors or accelerator cores for various applications. FPGAs are great for implementing the glue logic to move data between those.
I think you touched on one of the biggest things about FPGAs in your comment, which is that they are perfect for computation that does not involve branches. If you've got a lot of data, and you're doing transforms, you usually don't need to branch, so being able to crunch everything through in parallel is a massive benefit.
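A trivial illustration of that kind of branch-free inner loop, in C (sizes and coefficients made up): a fixed-coefficient FIR over a block of samples. There is no data-dependent control flow anywhere, which is exactly why it maps onto a fixed pipeline of DSP slices and can simply stream.

    /* Branch-free 4-tap FIR over a block: every iteration performs the
     * same multiplies and adds regardless of the data values, so the
     * loop can be unrolled across samples into a fixed DSP pipeline.
     * Coefficients and widths are arbitrary illustration values. */
    #include <stddef.h>
    #include <stdint.h>

    #define TAPS 4

    void fir_block(const int16_t *in, int32_t *out, size_t n)
    {
        static const int16_t h[TAPS] = { 1, 3, 3, 1 };  /* fixed coefficients */

        for (size_t i = 0; i + TAPS <= n; i++) {
            int32_t acc = 0;
            for (int t = 0; t < TAPS; t++)       /* no data-dependent branches */
                acc += (int32_t)in[i + t] * h[t];
            out[i] = acc;
        }
    }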
Also agree that additional hard logic or peripherals will be a game-changer for FPGAs, though they would make each design more domain-specific. Alternatively, we may see a shift in how the interconnects are done, which allows for flexible use of these 'modules'. It's also possible that we'll see continual increases in LE counts which make more specialized hardware unnecessary. I don't know which way things will go.
Are FPGAs rewritable at will with almost no degradation (for example a rewrite every minute over many days), or do they suffer the same degradation problems as EEPROMs (like the ones in Arduinos)?
FPGAs use SRAM to store their program, while CPLDs (complex programmable logic devices) use flash. Some clever marketeers here & there will stretch this distinction but it's an established convention. The internal architecture between FPGAs and CPLDs is typically different, based on cost of memory vs. logic and typical use cases. FPGAs tend to be used for higher-capacity computations but require more life support; CPLDs tend to serve smaller, true glue logic applications, where the low config overhead (just apply power) and quicker & simpler power-up is a strong pull.
So CPLDs will have some kind of NVRAM wear-out concern, and this is almost always specified as a number of maximum erase & program cycles.
Though you could presumably just do the same thing with new instructions, i.e. have an instruction for secure zeroing that zeros the data in any memory or cache where it exists, but doesn't cause the zeros to be cached anywhere they weren't already.
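Roughly what you can approximate with the pieces that exist today (x86/glibc specifics; the wrapper name is mine): explicit_bzero() keeps the compiler from optimizing the zeroing away, and _mm_clflush() evicts the zeroed lines from the cache afterwards. The difference from the proposed instruction is that this still drags every line through the cache before flushing it, and it can't reach copies that might live elsewhere, e.g. inside a PIM module.

    /* Approximation of "secure zeroing" with existing primitives:
     * explicit_bzero (glibc >= 2.25) prevents the compiler from eliding
     * the writes, and _mm_clflush (x86 SSE2) writes back and invalidates
     * each cache line so the zeros don't linger in the cache hierarchy.
     * Unlike the hypothetical instruction discussed above, the zeros do
     * pass through the cache first. */
    #define _GNU_SOURCE
    #include <string.h>      /* explicit_bzero */
    #include <stdint.h>
    #include <immintrin.h>   /* _mm_clflush, _mm_mfence */

    void secure_scrub(void *buf, size_t len)
    {
        explicit_bzero(buf, len);
        for (uintptr_t p = (uintptr_t)buf & ~(uintptr_t)63;
             p < (uintptr_t)buf + len; p += 64)   /* 64-byte cache lines */
            _mm_clflush((const void *)p);
        _mm_mfence();                             /* order the flushes */
    }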
The surface area for security vulnerabilities is already impossibly high. Do we really want to add "firmware running on a DIMM exfiltrating key material" to that list?
There are security problems with every architecture. There is no fundamental reason PIM should be less secure than what we do now. This is just fear of the unknown talking.