>If you want to counter Ken Thompson’s “Trusting Trust” attack, you would want to start with a minimal compiler on a minimal chip; StoneKnifeForth might be a good approach.
The ga144 seems more like a replacement for a cpld+mcu, so I wouldn't call it "high performance" in the sense that a mid ranged arm board will outperform it in terms of raw throughput for most tasks.
Well, I think an FPGA might be a closer equivalent. The GA144 can do something like 100 billion 18-bit integer additions per second, or two billion 17×17-bit integer multiplications (my rough notes on http://www.greenarraychips.com/home/documents/greg/WP003-100... are where I'm getting that). With a CPLD like the US$1.00 Altera 5M40ZE64C5N you might be able to get 100 million 16-bit integer adds per second, three orders of magnitude slower — and CPLDs don't usually have multipliers on them, so multiplication will be slower by the same amount.
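In back-of-envelope terms (the per-core rate is my reading of the GreenArrays docs, and the CPLD clock is a guess, so treat both as rough):

    # GA144 vs. CPLD adder throughput, rough numbers
    ga144_cores = 144
    ops_per_core = 700e6               # asynchronous cores, ~700M ops/s each (my reading of the docs)
    ga144_adds = ga144_cores * ops_per_core   # ~1e11, i.e. ~100 billion 18-bit adds/s
    cpld_adds = 100e6                  # one 16-bit adder at ~100 MHz (a guess for the 5M40ZE64C5N)
    print(ga144_adds / cpld_adds)      # ~1000, i.e. three orders of magnitude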
What do you mean by "a mid ranged arm board"? This Blue Pill is a US$2 ARM board, and a MacBook Pro is a US$2048 ARM board, so maybe something like the geometric mean, US$64? Maybe like https://www.digikey.com/en/products/detail/stmicroelectronic... a US$54 STM32F750N8† eval board?
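(That's just the geometric mean of the two prices:)

    print((2 * 2048) ** 0.5)   # 64.0, i.e. US$64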
Well, that has 64 KB of Flash and 340 KB of RAM (an order of magnitude more than the 9216 18-bit words on the GA144), and a screen, and a lot of I/Os and integrated peripherals, and the MCU has a bunch of fixed-function blocks like AES and SHA-2. But "in terms of raw throughput for most tasks" I don't think it's anywhere close: it's just a 216-MHz Cortex-M7, which is a 32-bit 2.1 DMIPS/MHz dual-issue core with a 6-stage pipeline, "DSP instructions", and floating point. The "DSP instructions" turn out to be multiply-accumulate and integer SIMD, so you can do four 16-bit multiply-accumulates per cycle instead of one, if the pipeline doesn't stall. Or eight 8-bit multiply-accumulates.
That adds up to about 0.9 billion 16-bit multiplies per second, which is indeed in the same league as the GA144. But it's also only 0.9 billion 16-bit adds per second, or 1.8 billion 8-bit adds, which falls short of the GA144's 100 billion adds per second by about a factor of 64! And that's while using an order of magnitude more power (thus "GreenArrays", I suppose). And it's a very inflexible 0.9 billion multiplies — it's hard to keep the pipeline from stalling and to avoid accidentally serializing on a single functional unit. The GA144 cores don't have pipelines — but you'll probably build pipelines when you floorplan your app on it.
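The arithmetic behind those numbers, assuming the dual-issue pipeline actually stays full (which, again, is the hard part):

    f_cpu = 216e6                 # STM32F750 Cortex-M7 clock
    print(f_cpu * 4 / 1e9)        # 0.864: ~0.9 billion 16-bit MACs or adds/s via SIMD
    print(f_cpu * 8 / 1e9)        # 1.728: ~1.8 billion 8-bit adds/s
    print(100e9 / (f_cpu * 8))    # ~58: the "about a factor of 64" shortfall vs. the GA144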
I mean, I imagine the AES core on the STM32 is actually faster than AES on the GA144, which takes 38 μs per 128-bit block using 17 of the 144 processors.※ That's 420 kilobytes per second, or 3.3 megabytes per second if you replicate it 8 times. I don't know how slow the STM32's AES hardware is but I bet it takes less than 70 clock cycles per byte. That's a bit faster than (one core of) this laptop's CPU, an i7-3840QM at 2.8 GHz, with no AES-NI, which gets 2.1 megabytes per second. But it's not very flexible. You can switch those 17 cores to run a different application in about a microsecond; the AES hardware is forever AES hardware.
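For the record, the AES arithmetic (the 70-cycles-per-byte figure for the STM32 peripheral is only my guess):

    ga144_aes = 16 / 38e-6              # one 16-byte block every 38 us on 17 cores: ~421 kB/s
    print(ga144_aes, ga144_aes * 8)     # ~4.2e5 and ~3.4e6 bytes/s (8 replicas)
    print(216e6 / 70)                   # ~3.1e6 bytes/s if the STM32 really needed 70 cycles/byte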
With floating-point math I think the comparison might be arguable. The GA144 has to implement floating point in software, like we did on our 486es, which usually imposes a speed penalty on the order of 32×. The ARM core has single- and double-precision floating-point hardware. The GA144 executes around 128× as many instructions per second as the ARM, but it has to multiply one multiplier bit at a time, so it might end up being slower in this case.
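The hand-waving, made explicit (all of these factors are rough assumptions on my part):

    ga144_ips = 100e9                   # raw GA144 instruction rate
    arm_ips = 0.8e9                     # ~216 MHz, dual-issue-ish: ~128x fewer instructions/s
    soft_float_penalty = 32             # typical cost of doing floating point in software
    print(ga144_ips / arm_ips / soft_float_penalty)   # ~3.9x margin left over...
    # ...and the GA144's bit-serial multiply step (one multiplier bit per instruction)
    # can easily eat that remaining ~4x, which is why it might end up slower here.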
Believe me, I love me some "mid ranged arm boards" for "raw throughput" (see http://canonical.org/~kragen/sa-beowulf/ for some context) but I think an FPGA is a closer analogy. A small FPGA, like a US$6 ICE40UP5K, which has eight 16-bit multipliers running, I think, up to about 100 MHz, so it can do nearly one billion multiplies per second. But when it comes to addition, or state machines, or square roots, or FFTs, the GA144 hopelessly outclasses it. You need to move up to a much larger FPGA to get into the same order of magnitude.
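The iCE40UP5K figure, under the same kind of guesswork (8 DSP blocks at roughly 100 MHz):

    print(8 * 100e6 / 1e9)   # 0.8: "nearly one billion" 16-bit multiplies per second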
So why would anyone use a CPLD or an FPGA or an MCU when they could use a GA144? Because you can program the MCU in C or Python, and you can program the CPLD or FPGA in Verilog, but if you want to use the GA144 you have to program it in Forth. Worse, until recently you had to program the GA144 in a dialect of colorForth. With an FPGA, nextpnr does the floorplanning for you; with arrayForth, you have to do the floorplanning yourself. There are efforts to improve this situation, like Chlorophyll‡ ("Compared to MSP430, GA144 is 19 times more energy efficient and 23 times faster when running this [accelerometer-based hand-gesture recognition] application"), but they're still research projects, and they still require you to learn new languages.
C was a research project in 1972, Verilog was a research project in 1986, and yosys was a research project in 2014 (and still is if you want to use the 5M40ZE64C5N), but now they're battle-tested tools that lots of people know how to use. The GreenArrays tools that would let you program at a higher level aren't there yet, because Chuck Moore doesn't think that's the way to do things.
Disclaimer: I haven't used any of the chips mentioned above except the i7, so probably some of the things I said above are wrong. Corrections would be greatly appreciated.
Some people have been working on fairly good Forths for different, minimal hardware including the Parallax boards.
https://github.com/prof-braino/PropForth5.5
There's also Chuck Moore's GreenArrays GA144 if you want a high-performance Forth machine; the eval board includes a proto area.