To reinvent the processor (medium.com/veedrac)
223 points by Veedrac on May 11, 2019 | 87 comments



This talks about two major performance concerns in CPUs: control flow and memory access. The transition between the two was my favorite part of the article. It's both funny and accurate.

"If the previous section gave the impression that, say, control flow is a particularly tough problem, but inventive methods can give powerful and practical tools to tackle it… good, that was the intent. In this section I want you to take that expectation and dash it against the rocks — memory is hard."


In the 90’s there was a group experimenting with putting small processors directly on the memory chips and doing data processing there.


This is also the idea behind the Pixel Planes rendering architecture from the '80s; they called it "logic-enhanced memory chips". It's not a "universal CPU" on the memory chips, but each chip can "process" and store 128 pixels, where the processing is implemented with a hardwired linear expression.

Here's a video on Pixel Planes 4 from 1987:

https://www.youtube.com/watch?v=7mzpZ861wEw


Sounds rather similar to GPU architecture.


Pixel Planes is an ancestor of GPUs, with much the same relationship to them as the titans have to the gods in Greek mythology.


Of course, when you start to look at how much die space is dedicated to SRAM on current CPUs, you begin to realize that, in a sense, we already have this.

Convergent evolution.


Maybe we should just go NUMA and address the on die “cache” directly.


There's still some research on this general idea -- e.g., a fellow grad student's work on in-memory bulk data operations:

- https://archive.org/details/arxiv-1611.09988 (AND/OR/NOT in RAM)

- https://www.pdl.cmu.edu/PDL-FTP/NVM/rowclone_micro13.pdf (cloning data in RAM; full disclosure, I'm a co-author on this one)

And IIRC there's at least one start-up (Emu Technology) that has built actual processing-in-memory (PIM) hardware.


Berkeley IRAM


Still a good way to get your PhD in computer architecture, at least according to a couple of profs I've talked to.


I am reading the comments, and am open to critique or general suggestions. “I found this part confusing” is also very welcome.

This is quite an information dense work, and a little rough around the edges, though I am pretty happy with it overall.


You might also be interested in Sophie Wilson's "Generation renaming" scheme. Unfortunately AFAIK the only open description of it is in the patents, which of course are written in patentese:

https://patentimages.storage.googleapis.com/e2/c4/39/301ea43... https://patentimages.storage.googleapis.com/fd/de/de/7fdbc30...


I almost understand some of these words :) Truly this is fascinating and makes me want to learn about the many things I do not grasp yet. Thank you.

> Register hazards are an unavoidable consequence of using a limited number of names to refer to an unbounded number of source-level values

Yet I seem to recall compiler writers lamenting the paucity of registers too. Perhaps "registers" as a concept is something we should trade in for named memory spaces or something.

Gimme instruction-level access to all of what is now cache (and chuck all the cache-handler logic, btw). Let me page sections of that in and out to main memory or the inter-CPU or I/O buses, but otherwise it's all storage and the real difference is latency.

If we want to burn hardware, there could be special memory areas with graded coherency, or transactional or content-addressable ones.


> Perhaps "registers" itself as a concept is something we should trade in for named memory spaces or something

Please no, however hard you think register renaming is, memory disambiguation is harder.


Hey, I'll argue for self-modifying code if you let me :). Don't let it be ambiguous. I may be trying to express "don't add abstractions, let the hardware limits show," in ignorance of what the hardware limits are.


I'm also a little bit behind on the semantics and understanding.

From my perspective we need to make the CPU more user-friendly first, so I would like a VM (without GC) in silicon, basically removing the absurdities of the memory and error handling that we get now. I still have no clue whether this is possible.

Then I think going back to 8-bits and making a purely parallel multi-core processor with segregated transistor memory and shared DRAM would be an interesting idea, just to see how that would work out with today's lithography and memory bottlenecks.

Of course you would have to code differently, so the programming language would have to change: adding local and remote memory to variables, and explicitly handling concurrency with the remote memory without blocking, somehow. Possibly by copying, which is slow, but that memory is already going to be slow.

The common thread for both of these is: let's make things complex in different ways, separately, to see if we can make life better for programmers, one by removing frustration and worry, the other by keeping energy-efficient scaling and memory speeds going for a tad longer.


I'm only a little way through it so far, so I'll come back with any questions, but I'd like to thank you for sharing such an interesting read!


I like the branch-predict / branch-verify split. It reminds me strongly of the ldrex/strex strategy in ARM for splitting (for example) a compare-exchange in two parts, so that in the non-contended case it basically just flows straight through.

Having recently done a lot of GPU programming, I'm wondering what your thoughts are on a GPU-like approach, with tons of in-order cores. It's hard to program, but maybe we need to figure that out in order to get performance.


> with tons of in-order cores

It's worth noting that GPUs are only "in order" because the computer architecture community has decided that "out of order" means "some variation of Tomasulo." Every GPU architecture in the last decade has relied extensively on executing (and committing) instructions in a single instruction stream out of order.

Doing a bunch of math while waiting for memory operations to complete is fundamental to high performance. Modern GPUs cannot achieve anywhere close to peak performance without extensive ILP - relying on latency hiding via threads is insufficient.


Regardless of Tomasulo, I would expect "out of order" to imply that (at the very least) when you have two independent streams of arithmetic (in the same thread) that depend on different memory loads, the order of execution of the arithmetic depends on which of the loads returns first.

I don't think any GPU does that. You could perhaps argue that GPUs issue instructions in order, but memory instructions can retire out of order wrt arithmetic.

If there is a GPU that's more out of order than that, I'd appreciate some pointers to evidence :)


The GPU model is a proven way to get better performance out of traditional CPUs also. You will get the best performance from modern CPUs if you treat each SIMD lane as a parallel execution on homogeneous data and also execute on multiple cores. See ispc, and Burst Compiler as examples.

I think there is a view that this style of programming only applies to traditionally high-compute areas like games, HPC, rendering, ML, etc., but we've recently seen a lot of core building blocks of "normal" web applications like hash tables (https://code.fb.com/developer-tools/f14/) and JSON parsers (https://github.com/lemire/simdjson) get massive performance gains from SIMD.
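
To make the data-parallel idea concrete, here's a toy sketch in Python with NumPy standing in for real vector intrinsics (simdjson itself is hand-tuned C++; this only shows the shape of the trick, classifying a whole chunk of bytes at once):

    import numpy as np

    def quote_positions(chunk: bytes) -> np.ndarray:
        # Classify every byte of the chunk in one data-parallel pass instead of a
        # byte-at-a-time loop; real SIMD parsers do this with vector compare instructions.
        buf = np.frombuffer(chunk, dtype=np.uint8)
        return np.flatnonzero(buf == ord('"'))

    print(quote_positions(b'{"key": "value"}'))  # indices of every quote character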


Note that SIMD on CPUs is somewhat different from GPU-style "SIMD" (particularly for cases where you want fixed-width SIMD vs. arbitrary vector processing, a la SPMD, ISPC, CUDA, etc.).

JSON parsing, for example, doesn't scale with wide vectors in the way typical graphics applications do. It's just that the traditional model of byte-by-byte parsing is very inefficient; SIMD implementations exploit some of the unused parallelism available in CPUs. It's still very much a latency-bound problem with complex control flow, which is why it wouldn't run well on a GPU, yet runs well with CPU SIMD.


I pretty much agree with you: it's hard to program, but possibly something we just need to do anyway. It's a much more efficient way to build computers, and the relative benefits only look to get bigger... but if I knew how to make it work, I wouldn't be writing about CPUs!


I lean strongly toward your second paragraph. If we can make many core programming easier I think we can dispense with a ton of complexity and just scale out cores with transistor counts.

I think an under researched area is the application of deep learning to compilers. What could be done there for parallelism?

We are still kind of stuck in programming models developed back when core frequencies were scaling geometrically. We hit that wall in the early 2000's and still haven't quite assimilated it.

We are tackling it at the macro scale of multiple discrete systems via modern devops but the systems (e.g. Kubernetes) are baroque, clunky, and hard to manage.


> I think an under researched area is the application of deep learning to compilers. What could be done there for parallelism?

I don't think it's under-researched, because we already know that it's only ever going to do a lousy job. The problem of autoparallelization is mostly stymied by difficulty in computing the legality of transformations, and secondarily by the problem of tuning parameters [1]. We've realized that it's easier to just get the users to tell us about these things, sidestepping the problem for the most part.

> We are still kind of stuck in programming models developed back when core frequencies were scaling geometrically.

The programming model for parallelism is dataflow. We already know that: the tricky thing of parallelism is communication, so annotating all of the edges of communication makes the scheduling problem much, much easier.

[1] Theoretically, machine learning is good at tuning parameters. In practice, you get good results with the dumbest algorithms already, and the available headroom for performance is generally swamped by the fact that small changes such as function size can dramatically alter the performance characteristics of code elsewhere in the application.


Regarding deep learning for compilers, at Vathys we're using deep learning in compilers for deep learning code and it works well, achieving pretty good results even compared to human implementations.

But it's not that much of a gain and it can often get stuck on some code, especially if it's sufficiently new code. Just as surprisingly, genetic algorithms are almost as good but faster and more robust.

Finally, I think people point to examples of much faster hardware accelerated workloads and optimized assembly code that share these characteristics:

1. Are relatively embarrassingly parallel

2. Are relatively control flow light

3. Generally tend to exhibit high compute/bandwidth ratios.

Hence, you see deep learning hardware, bitcoin mining hardware, optimized compression codec assembly, etc. pointed to as examples of acceleration, but this doesn't apply to all workloads, e.g. a web app.


> It's hard to program, but maybe we need to figure that out in order to get performance.

There will always be problems which require latency over throughput. Although the idea has been tried - e.g. Xeon Phi.


The "Branch Vanguard" paper does exactly this.


Thank you, I've edited the article to cite this.


The Mill folks seem nice (at least the ones I've met), but after 14 years with no silicon and nobody poaching their ideas for other designs, is there actually a there there?

Why is there no SiFive for Mill?


SiFive work with the RISC-V architecture, which is completely open, intentionally so.

The Mill folks have been careful to patent everything interesting before publicly talking about it.


patent-driven own goal.


Well, they want to make money. The design is revolutionary enough that if it does make it into production, there's a decent chance of success.


I fear the general difficulty of this means nothing will happen...until something really different happens. Let's get our programs in really high level languages in the meantime that will compile to wildly different architectures.

I've compiled Haskell to CPUs and Haskell to FPGAs, but idioms don't overlap except for the most foundational libraries (Functor, Applicative, Monad). That's still great, but that's nowhere yet near "let me compile my CRUD app to my FPGA." More FRP for the CPU could change that, though. FPGAs can do self-rewriting circuits since they are programmable.


The author states:

>"A Skylake CPU has 348 physical registers per core, split over two different register files."

Just looking at some literature on recent Intel I can only see a fraction of this 348 number. I see the following:

- 16 general-purpose registers
- 6 segment registers
- 1 flags register
- 8 x87 registers
- 16 SSE registers

Could someone explain where this 348 registers figure in the post comes from exactly?


You're confusing instruction-set architecture with micro-architecture.

The x86-64 ISA defines 16 integer ("logical") registers, for example; however, Skylake has (don't quote me on the figure) 168 physical integer registers. For register renaming to work, you need more physical registers than logical registers, because each logical register needs to be mapped to some physical register, possibly multiple times (e.g. for speculation).
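
For intuition, here's a minimal Python sketch of that renaming bookkeeping (purely illustrative; real hardware also maintains a free list in silicon, reclaims registers at retirement, and checkpoints the map for speculation):

    # Map a small set of logical registers onto a larger physical register file,
    # handing out a fresh physical register on every write so in-flight readers
    # of the old value are left undisturbed.
    class RenameTable:
        def __init__(self, n_logical=16, n_physical=168):
            self.free = list(range(n_physical))
            self.table = {r: self.free.pop() for r in range(n_logical)}

        def read(self, reg):
            return self.table[reg]             # where the current value of 'reg' lives

        def write(self, reg):
            self.table[reg] = self.free.pop()  # new destination -> new physical register
            return self.table[reg]

    rt = RenameTable()
    print(rt.read(3), rt.write(3), rt.read(3))  # same logical register, new physical home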


Those are the architectural registers. Physical registers also include those extra ones needed to implement register renaming to eliminate data dependencies. https://en.wikipedia.org/wiki/Register_renaming#Architectura...


Thanks, right I was confusing ISA level registers with those at the implementation/micro arch level. Cheers.


I'm not an expert on this topic, but are there additional registers for the extended instructions like MMX and AVX?


I think the future of the CPU will be FPGAs dedicated to individual pieces of software. Hardware will be so cheap that your desktop, or whatever computer you have, will have lots of FPGAs or programmable logic to run a certain amount of software. You can also allocate some of it for general computing so you can run legacy software.


1. Have you ever tried programming an FPGA?

2. Plenty of workloads don't map that well to an FPGA without modification. Deep learning, molecular dynamics, and other workloads that are relatively embarrassingly parallel, control-flow light, and high in compute/bandwidth ratio map well to specialized hardware, but most real-world code does not fit that bill!


Yes, I have programmed FPGAs. We are reaching the limits of silicon, and the only way to get faster is to run the CPU on something different, or for each piece of software to get implemented in hardware, which is getting cheaper.


The adoption of FPGAs is, IMO, limited by the quantity of people who can write RTL. Lots of people can write assembly, a lot more can write C or C++, and legions of people can write software in higher-level languages. Compared to the number of software engineers, there are extremely few people who can write RTL of moderate complexity, and even for those few, the development process is much slower than for software.

This could change, and I'd love to see that, but without improved "programmability" FPGAs will remain niche. I haven't followed the latest developments very closely, but Xilinx seems to be moving in this direction with Everest through the addition of vector cores.


> This could change, and I'd love to see that, but without improved "programmability" FPGAs will remain niche. I haven't followed the latest developments very closely, but Xilinx seems to be moving in this direction with Everest through the addition of vector cores.

As a disclaimer, I've never personally worked with FPGAs, but a lot of my coworkers have, and I'm mostly parroting their views. And those views are...

FPGAs are a dead-end technology. For a while, they were considered the "obvious" next frontier of HPC, just once we got the programming model sorted out. And they stalled at that phase for over a decade, until CUDA came out and everyone realized just how much better GPGPUs were at providing the necessary HPC speedup while requiring far less development time.

So how do you fix FPGAs? Well, you start substituting actual hardware logic instead of emulating everything with LUTs. And as you do this, you start to end up at coarse-grained reconfigurable arrays instead.


I'd have just stopped at "I have never worked with FPGAs". To say they're a dead end is to totally misunderstand what they're good at and capable of.

Source: Professional FPGA designer.


I work with FPGAs professionally. I'd tend to agree with OP that FPGAs are a dead end, considering that the trend is to bake in as much hard IP as possible and have a dual core CPU on every chip.


I don't know if FPGAs are a dead end. Why did Intel buy Altera, the second-biggest FPGA vendor?


Hard macros (things like BRAM, DSP slices, etc.) are blocks which do a fixed function, are faster than reconfigurable logic, and take up less silicon area: they trade off configurability for speed. Let's explore that.

A multiplier is a multiplier is a multiplier. You can spin your own, but it is so common that every FPGA vendor says, "If you need to multiply, you can use this block." You want to write code that is general:

    A <= B * C + D;
And have your synthesizer go, "This is a multiply accumulate! I can fit that into the following blocks: Look up tables or a DSP slice. I'll use the DSP slice - it is faster and smaller." Note that you didn't directly call the DSP slice, you just said "multiply". It doesn't always work this way, but that's the goal.

Not every FPGA has a dual core cpu - the Virtex-5's and 6's started that trend with a PPC block, and it really hit its stride with the Zynq-7000 (when Xilinx switched to ARM cores and brought the price down significantly). Again, if you're doing things a CPU can manage, you CAN do them with logic, but why not use the embedded core?

The addition of the ARM cores was a brilliant move by Xilinx because a number of embedded systems out there used FPGAs for fast response, high speed interfaces, high speed datapaths, and as glue, but many included a separate processor to handle "housekeeping" tasks.

Xilinx noticed this, and said, "If you want, you can choose the chip that has a processor core in one corner instead of reprogrammable fabric there." Bam, tons of sales - because now instead of two chips, I need one. Integration.

They're furthering their exploration of that interface with HLS and their newer toolsets. The idea is to make the algorithmic division between the two things null - you can seamlessly switch between control and data dominated computation.

But let me take one step back.

What's the difference between a state machine and a counter?

Nothing. They're a cloud of 'next state' logic, a current state, and an output based on the current state and/or the current state + the current inputs (Mealy/Moore/Medvedev). The 'next state' logic happens to also follow the rules of arithmetic, which is what you're interested in when you're using it as a counter, obviously. But it's a state machine.

So what's the difference between a CPU and a state machine?

Again...nothing. A CPU is just a complicated state machine (or a set of interacting state machines).
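
To make that concrete, here's the same point as a toy Python model (the HDL details are deliberately elided; the point is the shape, not the syntax):

    # A counter written explicitly as a state machine: a registered current state
    # plus a cloud of next-state logic. Swap in different next-state logic and the
    # same skeleton is any other FSM, up to and including a CPU.
    def next_state(state):
        return (state + 1) % 16        # the "arithmetic" next-state logic of a counter

    state = 0                          # the registered current state
    for _clock_edge in range(5):
        print(state)                   # Moore-style output: a function of current state only
        state = next_state(state)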

To dislike hard macros which are fast and common is to not grok FPGAs. To say you work with FPGAs professionally but you consider them a dead end, when they're built out of the same logic and blocks that underlies everything that's ever been done with a digital computer is...confused, at best.

We're one step "above" transistors in the abstraction hierarchy. But those transistors would implement things like counters (assuming you're not building an analog computer!), and you'd be right back to an FPGA (well, an ASIC, so YOU get to choose what blocks get included or not!).

A lot of people who design ASICs do it from HDL's - they even use synthesizers! But the synthesizers are targeting a library of parts for a silicon process. The only difference is that the blocks being targeted on an FPGA already exist - you can't move a block over to get better timing like you can with an ASIC, or widen transistor ratios to drive more current, etc - you have to use what's on them.

So...yeah. FPGAs use the fundamentals of all computing directly, they're not a dead end unless we all decide to switch back to analog computations or quantum computers or something, and the hard IP is really important.

Even if spinning an ASIC decreases in price to a few grand (crazy hypothetical), they'll be programmed like you programmed an FPGA - full of the same primordial soup components right above transistors that does all the work in the digital abstraction.


I'm not sure why you felt the need to respond with such a wall of text. It barely has a coherent point and condescends to explain details that are a given to anyone who works with FPGAs.

FPGAs are a dead end, but your comment didn't even address for what goal they are a dead end. If you'd take a look at the context of this thread, you'd realize we're talking about general applications.

---

> To dislike hard macros which are fast and common is to not grok FPGAs. To say you work with FPGAs professionally but you consider them a dead end, when they're built out of the same logic and blocks that underlies everything that's ever been done with a digital computer is...confused, at best.

Or maybe it's understanding the limitations of the technology you work with. I also never mentioned or implied disliking hard IP.

> What's the difference between a state machine and a counter?

> So what's the difference between a CPU and a state machine?

What's the difference between the universe and a state machine? What can be encoded as a state machine is largely irrelevant. I can make a CPU inside of minecraft but that's not very useful.

> FPGAs use the fundamentals of all computing directly

This is simply not true. FPGAs emulate the "fundamentals of all computing". The emulation is not efficient, hence why hard logic is so important.


It is quite easy to get RTL from non-recursive C code.

Unroll your calls, assign a bit to each statement of your program, and create a state machine that computes state-bit transitions and resource changes. And off you go!

You get a TTA that is as flexible as can be.

The optimal RTL is slightly more complex, but also does not require hand written RTL.
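
As a toy illustration of that "one bit per statement" idea (Python standing in for the generated RTL; the program and names are made up):

    # Each statement of a tiny straight-line program becomes one state of a one-hot
    # state machine; stepping the FSM "executes" the program, updating the shared
    # resources (registers/wires) as it goes. Purely illustrative.
    statements = [
        lambda regs: regs.update(a=regs["a"] + 1),
        lambda regs: regs.update(b=regs["a"] * 2),
        lambda regs: regs.update(c=regs["a"] + regs["b"]),
    ]

    regs = {"a": 1, "b": 0, "c": 0}
    for state_bit, stmt in enumerate(statements):   # exactly one active state per cycle
        stmt(regs)
        print(state_bit, regs)                      # watch the FSM step through the program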


There are some contradictions in here. If you map your functionality from software (on a CPU) to an FPGA, your software becomes hardware. An FPGA is hardware. It is retargetable, but it is not programmable in the sense that SW requires. The resources are preallocated, which is the essential property of hardware.

In software you allocate and deallocate memory, which is your dynamic resource. You don't have that in HW. You can implement a memory manager in HW, but your system will not be flexible enough to be soft. Also you can synthesize your CPU on FPGA, but the performance is poor compared to a hard CPU on FPGA.

Currently FPGAs are becoming popular in accelerating software functions to offload the CPU. The most powerful FPGAs have a lot of fixed logic CPUs in them (i.e. hard macros).

You can't just say that I'll just go ahead and run my software functions as hardware. For certain type of workloads you always need software, in practice (not in theory).


Does it matter which workloads always need software? If anything, the networking use-case shows that a workload, any workload, that is ubiquitous & can benefit tremendously from programmable hardware is what matters.

In a shared-memory system, just implementing libc (or the Java VM, or the Erlang VM) on an FPGA might be a win. It has to be enough bang for the buck for FPGA, but not so much that somebody would make fixed-hardware for it.

On that note, haven't networking end-points had fixed hardware also for ages now? Maybe it is inevitable that a successful application of FPGA's breeds interest in fixed-function hardware for it.


> Yes I have programmed FPGAs

And was it a pleasurable experience?

Do you really think people will rewrite all their software for FPGAs?

And what if you're memory or I/O bottlenecked as so many processors are today?


For me it's a pleasurable experience.

> Do you really think people will rewrite all their software for fpgas?

A compiler will do it for them. If technology moves in that direction they don't have any choice anyways.

> And what if you're memory or I/O bottlenecked as so many processors are today?

I/O will never run as fast as the CPU.


> I/O will never run as fast as the CPU.

In the era of kernel bypass and the modern I/O accelerators seen in some data centers, I wonder if this is even true today?


something which was on HN a few days ago (I/O is faster than CPU): https://news.ycombinator.com/item?id=19818899


How will that help memory latency, which is the biggest bottleneck to most programs?


I don't know if desktops have a future, or if even local computing has one. Maybe a portion of "personal computing" ends up only done on mobile devices, and some of the rest of it moves to public clouds.

There might not be a business case for companies to produce chips & other hardware for desktops. So, let's equate buying desktops to buying low-end servers for all practical purposes. As computing gets cloud-ier, is there a business case for low-end servers either, when a high-end server can be virtualized to get the same result?

If we are stuck with power-sensitive mobile devices & virtualized cloud instances on high-end servers as the only businesses that sustain themselves, the FPGA's only play is on the servers. You'd expect lots of heterogeneity in the workloads there, more than on desktops. Reconfiguring many FPGAs very frequently? Not sure it would fly, even with a small selection of applications vetted not to be damaging to the FPGA.

You can do personal computing with FPGAs using Adapteva boards. How safely can you reconfigure their FPGA, after you have managed to code for it?


Security. Buyers will have to be convinced that remote computing is as secure as local. Especially considering the latest CPU-based exploits, I can't see a near future where buyers will give the same trust to remote computing, regardless of the latest technology advancements.


We are reaching the end of Moore's law. Things will not get dramatically cheaper unless some breakthrough technology comes up.


The problem is that FPGAs just can't drive memory bandwidth. (And FPGA configuration times are obnoxiously long).


Can't drive memory bandwidth? I can saturate any memory interface with ease. All line-rate Ethernet systems (pushing 100+ Gb/s now) are done with FPGAs. What are you talking about??

I can program a Kintex UltraScale at 16 bits transmitted at something like 100 MHz. The system is fully up in less than 5 msec from power-good. This feature is used in every system where milliseconds count, including military systems. Are you confusing JTAG reconfiguration speeds with production configuration speeds?

I'm very confused by what you said, can you elaborate?


> 16 bits transmitted at something like 100 MHz

That's 1.6 Gbit/s. Looking up random Intel Xeon specs, the memory bandwidth from the CPU to memory is 100 GB/s, roughly 500x more than what you are describing.


You only need to pass a bin file to program an FPGA - it's like 20Mbits (hence a few msec).

Memory interfaces are not the same as the reconfiguration programming interface.

The memory interfaces (to external memory) are standardized: DDR3/4. Are you talking about an L1 cache? Because the equivalent in an FPGA (BRAM, in Xilinx terminology) is accessible within a single clock cycle (assuming you can meet P&R/timing). Each DDR4 chip can do something like 20 Gbit/sec, and I can fit 5 of them on a KCU115 1924...

And that's external memory. Internally there is BRAM (and it's newer variants, which means you don't even have to go off chip).


NPUs like Tomahawk, etc are not FPGAs.


Not in their final version, no. I see what you meant - I should have said "are developed with". Lemme edit that.

https://store.digilentinc.com/netfpga-sume-virtex-7-fpga-dev...

But for 7k you can get an older design with 4x 10Gbps Ethernet. It's used at Stanford and Cambridge for... High speed network processing research.

High speed trading firms almost universally use FPGAs. You can't go faster without custom silicon. But you can sure as shit max out as many DDR and network interfaces as you can physically attach.


10G is a very slow speed at this point. But yes, there are very good use cases for FPGAs.


You only configure it once, when you buy the software; installing software takes time right now too. FPGAs do come with embedded memory with very low latency. Someone will figure out how to make it faster if things get implemented that way.


I have a computer engineering degree and what you are saying about using FPGAs for general purpose computing was correct when I graduated 20 years ago. It was 10-100 times more correct a decade after that, and is another 10-100 times more correct today. So somewhere between 100 and 10,000 times more correct now than in the 90s. And between 1000 and 1,000,000 times more correct in 2029!

But we're still using the same 3 GHz (effectively single threaded) chips today as then. Sure, RAM frequency is higher, but so is latency. Computers today are closer to 10 times faster than computers of the 90s, not 10,000 times faster like they should be. Except for video cards, which really are 100 or 1000 times faster, because they break with the single-threaded model.

FPGAs would work great for general purpose computing, but we're still missing a proper language for programming them. VHDL/Verilog is more like assembly language than the functional language we would need to do it effectively.

I haven't really kept up on this because it's been too depressing, but here are some promising approaches:

https://github.com/dgrnbrg/piplin

https://catherineh.github.io/programming/2016/12/26/haskell-...

https://www.eetimes.com/document.asp?doc_id=%201329857&page_...

https://news.ycombinator.com/item?id=14546535

So we'd need:

* An HDL wrapper that guarantees that any hardware description downloaded to an FPGA won't short it out. [might already exist, but needs proven examples/unit tests of pathological edge cases]

* A Lisp to HDL compiler, preferably with optimization. [time-consuming, but not difficult]

* A high-level functional or vector or stream language (Clojure/Elixir/MATLAB/Erlang/Go) to Lisp transpiler. [somewhere between trivial and straightforward, might already exist]

* An example implementation of a CPU written in a functional/vector/stream language, probably MIPS [not difficult, just time consuming to convert an existing spec to code]

It would be good to keep the transistor count under 1 million (say 100,000 gates), and possibly start with an 8, 16 or 32 bit implementation and emulate 64 bit and higher operations in microcode. Most CPU gates are wasted on out-of-order execution and cache, which aren't as important in parallel and stream-based/DSP computing:

https://en.wikipedia.org/wiki/Transistor_count

When I graduated, I saw a brand new chip where 75% of the die was for cache, although I can't remember which model number it was. Chips today are probably worse, since their speed to transistor count ratio has mostly gone down.

It looks like FPGAs might have stopped reporting gate count since they've doubled down on the proprietary route. A Xilinx Virtex-7 2000T FPGA from 2011 had 6.8 billion transistors which could implement roughly 20 million ASIC gates:

https://www.eetimes.com/document.asp?doc_id=1316816

So 8 years later with Moore's law that should be 5 doublings (32 times), or roughly 200 billion transistors and 640 million ASIC gates on a 2019 FPGA. The article says it takes about 340 transistors to form a gate on an FPGA, which seems high to me. But conservatively, 640 million gates would allow us to put 6,400 of our 100,000 gate cores on one FPGA.

Maybe someone ambitious can figure out how much RAM could be allocated per core in a 2D grid, based on the total core count. I'd like at least 1 MB per core, but that might conservatively require something like 10 million gates, which might limit us to 64 cores in RAM alone:

https://www.quora.com/How-many-transistors-flip-flops-and-ga...

It’s unfortunate that there is so much hand waving around transistor, gate and logic cell count. But it looks like we could put something like 64, 640 or 6,400 cores on a single FPGA today depending on how much RAM we allocate per core, and if we had the right software. And this would be a true parallel computer, probably running between 100 MHz and 1 GHz. So we could play around with things like copy-on-write (COW), content-addressable memory and embarrassingly parallel computation in all the areas that today's computers are weak at (things like ray tracing, voxels, genetic programming, neural nets and so on) and at least see how classic (UNIX-style) programming compares with the bare-hands/burdensome languages like CUDA and OpenCL.
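
A quick back-of-the-envelope in Python, just reproducing the rough figures above (all of these numbers are guesses layered on guesses, as noted):

    # 2011 Virtex-7 2000T: ~20M ASIC-equivalent gates; assume ~5 doublings since then.
    asic_gates_2019 = 20e6 * 2**5                  # ~640 million ASIC-equivalent gates

    gates_per_core = 100_000                       # the simple MIPS-style core proposed above
    gates_per_mb_ram = 10e6                        # rough guess for 1 MB of local RAM

    print(asic_gates_2019 / gates_per_core)                       # ~6,400 bare cores
    print(asic_gates_2019 / (gates_per_core + gates_per_mb_ram))  # ~63 cores with 1 MB each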

If an FPGA can do all that, then surely it could devote some of its gates to the kind of reprogrammable logic you mentioned (perhaps for video transcoding, bitcoin mining, emulating video game consoles, things of that nature).


> Computers today are closer to 10 times faster than computers of the 90s, not 10,000 times faster like they should be.

This isn't even close to true for many classes of problems.

For 3D rendering software running on CPUs, this is observably not true: today's machines are computing 10,000x faster than what was possible in 1994, when Toy Story was rendered. Not to mention massively larger data sets are being rendered now thanks to storage, memory, and general I/O improvements.

In fact, path tracing (now the dominant 3D rendering method, including at Pixar) was considered elegant but impractical in the 90s because it required way too many CPU cycles and way too much memory (the whole scene, roughly, needs to be in memory at all times). It took a 1000x (now 10,000x) increase in computer performance across all dimensions to make path tracing viable.

Does any of that sound like just a roughly 10x improvement in 25 years?


I was excluding video cards, and comparing a 1999 computer (say a 300 MHz 1 core Intel Pentium II with 100 MHz RAM) to a 2019 computer (say a 3 GHz 8 core Intel I9 with 1 GHz RAM). There is some wiggle room there on clock and bus speeds and width, but all within a similar order of magnitude:

https://www.cpubenchmark.net/singleThread.html

This chart doesn’t go as low as we need, but we can extrapolate the Pentium II to a score of about 50 and the I9 to 2500 ( although if we include the I9’s 8 cores, we get a score more like 20,000 https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i9-9900K... ). So we’re looking at about 50x faster single-threaded performance between 1999 and 2019.

That means that since 2019 clocks are 10x faster than 1999 clocks, there has been only a 5x performance increase per cycle (most of this probably due to longer pipelines, larger cache and wider busses which are evolutionary rather than revolutionary technologies).

Yes, there have been major price decreases. But only minor increases in performance. Computers have exponentially more transistors, but only linearly better processing power and bus speed. Had performance kept up with the 100x increase in transistor count per decade (10,000x every 20 years), the I9 with 8 cores should have scored about 50 * 10,000 * 8 = 4,000,000. But it only scored 20,000, so it's only reaching about half a percent of its potential performance.

We don’t hear any of this talked about much because it’s an inconvenient truth. But from my perspective, Moore’s law ended around 2000. If it was up to me, I would scrap the rat race towards incrementally faster single-threaded performance. One divided by half a percent indicates that chips today should have roughly 8*50 = 400 cores to adequately make use of their potential processing power. For a hardwired (non-FPGA) chipset, I’d reach higher than that and target 1024 cores running at 3 GHz, with as much as 16 GB of on-chip RAM running at 1 GHz (16 MB per core so they can be programmed with local copies of their own operating system), for a chip with roughly the same transistor count as today, and costing under $1000.

My training was for designing a MIPS-style chip running at roughly 100 MHz. I don’t see any mystery in designing the scaled-up chip I’m suggesting. In fact, it would be far simpler and cheaper to design because we’d only have to do it once for perhaps a 4-6 stage pipeline with little or no branch prediction or cache logic and then repeat that in a 2D mesh. A single core would run up to 5x slower than an I9’s, but we’d have over 1000 more of them to do whatever we wanted with.


In case I wasn't clear, the kind of 3D rendering I'm describing doesn't use graphics cards. It's all on the CPU.

As for ideas to make a faster CPU, sure, knock yourself out. But there's no question that computers today are much closer to 10,000x faster after 25 years of development on actual, real-world code being run to earn billions of dollars. Your approximately 10x faster claim after 25 year is just wrong, at least with the software systems I'm familiar with.


To spell out the arithmetic above:

50 * 10,000 * 8 = 4,000,000

That was important to my central point that an I9 has roughly 400 times fewer cores than it would have had it kept up with Moore's law, so I didn't want to leave it unclear.


At the lower end of your range this sounds somewhat like a Xeon Phi: up to 72 cores at 1.5 GHz with 512 KB of L2 cache per core. Intel has discontinued the Xeon Phi line, presumably in part because there were limited applications where such an architecture proved compelling. What advantages would you see your FPGA-based architecture having over something like the Xeon Phi that might lead it to have more success?


That's a good question, and I'm honestly not sure. I think highly parallel, general purpose and customizable computers have uses outside the mainstream. So they would make great ray tracing video cards and platforms for physics simulations for science research, stuff like that.

I think most problems in computer science today can be traced to the original sin of choosing Von Neumann architecture and imperative programming over reprogrammable hardware and functional programming. That's why we think of things like machine learning as rather difficult problems, even though they are just simple algorithms that are compute-bound.

Basically my complaint is that this stuff has progressed in spite of our choices, not because of them. The proprietary and closed-source hardware, drivers and programming paradigms we have to deal with today have set us back at least 20 years. Probably more like 40 years if you include some of the shenanigans that happened between IBM, Intel, Microsoft and Apple in the 80s and 90s. They solved personal computer problems wonderfully, but inadvertently hindered the progress of computer science in general.

I don't have an answer for why the market tends to reject the stuff I talk about. I just know that mathematically it's very first-principles stuff that I really hoped would have caught on by now. To me it feels like these ideas keep getting sabotaged for economic and political reasons rather than scientific ones.

Edit: I forgot to mention that FPGAs solve this problem by providing generic reprogrammable hardware whose use can't be monopolized the way that single-purpose hardware is. Something like open source FPGAs (that don't burn up when they're programmed wrong) would allow us to explore many more problem spaces in computer science than we can currently.


Xeon Phi evolved from Larrabee which was Intel's attempt to make a GPU based around highly parallel general purpose x86 cores. Many smart people thought it was a good idea and worked on it but it failed. Xeon Phi in theory might be good for ray tracing but all the action in ray tracing acceleration today is with custom hardware added to traditional GPUs.

Intel poured a lot of resources into this approach and it didn't find success so I'm not really sure you can say it was sabotaged for economic or political reasons, rather it failed to find a market.

I think the reasons it failed are complex and perhaps it's an idea whose time has just yet to come rather than being a bad idea but it has been tried without much success more than once. I backed the Parallella Kickstarter and that didn't really go anywhere either.


Ya I actually bought an Nvidia GeForce RTX 2070, specifically to play around with its ray tracing capabilities. I kind of wish I could play around with a Google TensorFlow processor (TPU):

https://en.wikipedia.org/wiki/Tensor_processing_unit

It's an ok chip and has some of the ideas that I mentioned (like an 8 bit design that could possibly emulate 64 bit and higher via microcode or software) but I'm afraid that it's too narrowly scoped to be of much use in other domains. Companies seem to be exploring branches of the parallel/DSP problem space rather than providing general solutions like the Xeon Phi.

I'm not totally sold on high multicore (say 64 or more cores), but it's the only way that I can see forward since Moore's law ended. It would be much better to have increased single-threaded performance since that can emulate multicore. Multicore can't emulate single-threaded above Amdahl's limit of about 4x, which is why very few chips today have more than 4 or 8 cores.

That said, most of the problems that I find interesting are also embarrassingly parallel. And since there is currently no consumer high-multicore processor, I can't explore those problem spaces. From first principles, it should have gone:

1) Single core: 6502, Z80, 286
2) Low multicore: Pentium, POWER
3) High multicore (MIMD): Xeon Phi, Parallella
4) Optimized vector processing (SIMD): GPU, TPU

Without a viable #3, we have no way of knowing if #4 is an optimal solution or an evolutionary dead end. We have to drink the kool-aid and proceed out of trust. That's the part that I find extremely disappointing about the last 20-25 years.

Another way to say that is: if there is no market fit for ever-faster processors at cheaper prices.. that consumers would rather have low-cost mobile processors and regressively rent time on data centers.. that they'd rather have photorealistic rasterized video games than true supercomputing at home, then computer science is at the very least wounded (and I would argue, slowly dying).

This is a nuanced issue that the mainstream isn't even aware of. I'm excited to see some people trying to get projects like Parallella off the ground. From what I can tell, they are one of the only chips working from first principles:

https://en.wikipedia.org/wiki/Adapteva

But as you pointed out, it may take more than enlightened engineering to bridge the gap to next-gen computing. That's the reason why we are only today seeing advances in machine learning that were fairly well understood by 2000. I just don't want to wait around another 20 years for the other good stuff to get here. Which is why I think it might take an open source design on inexpensive FPGA hardware to show people what they're missing.


> And since there is currently no consumer high-multicore processor, I can't explore those problem spaces.

It depends on your definition of "consumer" but you can buy a 32 core Threadripper at the top end of consumer now and there's a chance we might see 64 core Threadripper this year. They're not exactly cheap but they are likely cheaper and faster than anything comparable you could do with an FPGA and they are fully compatible with existing toolchains and operating systems.


If it's so trivial to write a high performance Lisp-to-LUT or Lisp-to-gates toolchain, why don't we have one?


We do! It's called Fairylog, and it was on HN last month.

https://news.ycombinator.com/item?id=19701279


Thank you very much! Somehow I missed that article, but have favorited it because it's such an important stepping stone towards real mainstream adoption of reprogrammable hardware.


>"* A Lisp to HDL compiler, preferably with optimization. [time-consuming, but not difficult]"

>"* A high-level functional or vector or stream language (Clojure/Elixir/MATLAB/Erlang/Go) to Lisp transpiler.

I am curious about the need to use LISP and a functional language for general purpose FPGA computing. Can you elaborate?

>"So we could play around with things like copy-on-write (COW), content-addressable memory and embarrassingly parallel computation in all the areas that today's computers are weak at (things like ray tracing, voxels, genetic programming, neural nets and so on)"

What is COW here? I am only familiar with it in the context of a kernel optimization technique for child processes, but I think you might be referring to something else here?


Oh ya, well, most programming today is akin to the macro programming that was done in the 80s in languages like Visual Basic. There is essentially no difference between that and the imperative languages like C++, Javascript, PHP, Ruby, Rust, Swift, etc that have captured the market.

VHDL and Verilog aren't really imperative languages. They're functional programming (FP) like spreadsheets, Lisp and languages built on FP like Elixir, Clojure, Julia, Haskell, etc but with ugly syntax.

Another way of saying that is that a digital circuit is analogous to a spreadsheet (which is FP). So we could use referential transparency and other analyses/optimizations from FP and apply them to circuit analysis like Thevenin and Norton equivalent circuits, but at scale.

Sorry the copy-on-write (COW) thing was a little obscure. What I was getting at there is that buzzwords like Map Reduce and big data are hand waving over the central concept of data locality. So for example, a CPU-RAM computer might store a database on an SSD and use a bunch of fancy algorithms to load, cache and process that data. But a multicore computer could slice up the database and store a percentage of the data locally by each core. That's what Map Reduce and Erlang/Go and the Actor model do.

At that point, we can use concepts from the web like content-addressable memory and have an abstraction that frees us from the micromanagement of caching things. Basically we wouldn't have to worry about finding the fast path anymore, because with 1000 cores, anything we do is fast. We'd just have to keep the cores fed with data.

IMHO this core issue is why multicore computers that connect to a single RAM have generally failed. We need multicore computers with data locality, so that each core kind of acts like it has its own OS or is a node in a cluster. That would allow us to build on supercomputer algorithms that are well-understood. Then with those basic building blocks, things like running genetic algorithms and protein folding and physics simulations become straightforward because we're able to use simple algorithms running in parallel rather than increasingly complex algorithms running serially on CPUs whose performance have stopped improving.
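
If it helps, here's a toy Python sketch of that "slice the data up and compute where it lives" idea, with ordinary lists and functions standing in for per-core local memories and cores:

    from functools import reduce

    # Shard a "database" across cores so each core only ever touches its local slice;
    # the cross-core step only has to combine small partial results.
    database = list(range(1_000))
    n_cores = 8
    shards = [database[i::n_cores] for i in range(n_cores)]   # data placed next to its core

    local_results = [sum(shard) for shard in shards]          # purely local work, in parallel
    total = reduce(lambda a, b: a + b, local_results)         # tiny combine step across cores
    print(total)                                              # same answer as sum(database)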


Thanks for the wonderfully detailed response and insights. I had no idea that RTL like VHDL/Verilog were FP.

Would you mind elaborating on the following which I found really interesting:

>"Oh ya, well, most programming today is akin to the macro programming that was done in the 80s in languages like Visual Basic. There is essentially no difference between that and the imperative languages like C++, Javascript, PHP, Ruby, Rust, Swift, etc that have captured the market."

I have no experience with Visual Basic, so I think an important point you're making is being lost on me. How is there no difference between Visual Basic and all those other disparate languages, exactly? Thanks.


"I had no idea that RTL like VHDL/Verilog were FP."

They really aren't. I mean, not syntax-wise at least. Conceptually there are similarities, but really they are process programming, if anything. Processes are everywhere: not OS processes, not OS threads, basically green threads. It's really asynchronous event-driven programming. Each process models a distinct bit of hardware, with all the bits of hardware running in parallel. If you've done Node or Python's Twisted then you are actually in a good place to understand.

It's tricky though, because there are several different language constructs for creating processes and it's all shared memory to communicate between them. If you aren't careful you don't always realize that you've created yet another process (the assign keyword tricks people). If you get outside the subset of the language that is synthesizable (able to be translated directly into hardware), like for simulations, you get to deal with race conditions and everything.
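
A rough Python analogy, with generators standing in for processes (this ignores delta cycles, nonblocking assignment, and the real event scheduler, so treat it as a mental model only):

    # Two HDL-style "processes" sharing signals: each is a green thread that wakes
    # on the clock edge, reads the shared signals, and updates its outputs.
    signals = {"count": 0, "led": 0}

    def counter():
        while True:
            yield                                  # wait for the next clock edge
            signals["count"] = (signals["count"] + 1) % 4

    def blinker():
        while True:
            yield
            signals["led"] = 1 if signals["count"] == 0 else 0

    processes = [counter(), blinker()]
    for p in processes:
        next(p)                                    # advance each process to its first wait
    for _edge in range(8):                         # simulate eight clock edges
        for p in processes:
            next(p)                                # "all processes run in parallel" each edge
        print(signals)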


Also worth a mention, the Swarm architecture: https://people.csail.mit.edu/sanchez/papers/2015.swarm.micro...


Link leads to a login page.



