Moore added this wonderful passage to his blog today. I think it captures something of the specialness of the man:
GreenArrays is starting production of its 144-computer chip with an accompanying evaluation board. Details at greenarrays.com. This is the result of funding, testing and refinement.
On the lighter side, regarding my Five Fingers shoes: continued delight; more miles of hiking. My gait is changing: shorter strides, less heel strike, less toe-out. And I've not experienced any lower back pain. Somehow my soles are less sensitive and I can walk barefoot more easily.
This is a very interesting product from the great Chuck Moore that needs a "killer app" to take off. I wish I had time to invest in it...
What's interesting is how different this company is from the idea of a startup that most of us here have in our heads. Just look at the team and try to figure out what the average age of the company is:
I wonder if this has something to do with the fact that chip design requires a lot of experience or the fact that these people just happen to be in great Chuck Moore's professional network. Or likely both.
I think both. Plus, consumer web applications, consumer smartphone apps, and social media apps are something of a fashion among young people. (I am 35 years old and even I feel the generational difference from younger folks a bit.) Older folks are not so sensitive to fashions.
Edit: or maybe every generation has a love affair with whatever was the 'next big thing' of its youth? Maybe in 2040 a social media startup will be founded by old folks, while the young guys are enthusiastic about something completely different. My father is a mechanical engineer, I am a programmer; maybe my son will have a profession built around the next 'next big thing'.
What, this thing costs $20! As a hardware novice (read: dabbled with Arduino a little), how hard would it be for me to create something useful with this chip?
It would be a significant effort to put a board together to do anything with this, certainly more than the cost of the eval board if you place a value on your time (and aren't looking at this just as a hobby).
Most people who might use this in a commercial setting would buy the eval board first to try it out. That is probably what I would do even if this were a hobby project.
Of course a student could try to find a QFN-88 to DIP adapter for like $20 (or make their own) and plug it into a breadboard. I could only find a QFN-72 to DIP adapter with five minutes of searching.
This is interesting - each node is tiny and can't do much on its own (mainly due to RAM constraints). But because they are tiny, you can pack tons of them on a chip/board and coordinate several nodes to do larger work items. Since all actions are asynchronous (as opposed to clocked), you use very little power when you aren't actively doing work, and only the minimum amount of time required to actually do the work.
Look at the chips he’s making. 144 cores, but the cores (nodes) are tiny - why would you want them big, if you believe you can do anything with almost no resources? And they use 18-bit words, presumably on the assumption that 18 bits is a good quantity: not too small, not too large. Then they write an application note about implementing the MD5 hash function:
"MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, MD5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer."
Then they solve these problems by manually implementing 32-bit addition and splitting the code across nodes. But if MD5 weren’t a standard, you could implement your own hash function without going to all this trouble.
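The application note's actual Forth code isn't reproduced here, but the basic trick of carrying a 32-bit quantity across narrower machine words can be sketched in Python. This assumes the 32-bit value is held as two 16-bit halves, each of which fits comfortably in an 18-bit word with room for a carry bit; the function names are mine, not GreenArrays'.

```python
MASK16 = 0xFFFF  # a 16-bit half fits in an 18-bit word with room for carry

def add32(a, b):
    """Add two 32-bit values held as (hi, lo) 16-bit halves, mod 2**32."""
    a_hi, a_lo = a
    b_hi, b_lo = b
    lo = a_lo + b_lo                # may exceed 16 bits; carry sits in bit 16
    hi = a_hi + b_hi + (lo >> 16)   # propagate the carry into the high half
    return (hi & MASK16, lo & MASK16)

def rotl32(x, n):
    """32-bit left rotation on a (hi, lo) pair, as MD5 requires.

    Reassembling the full 32-bit value is a shortcut for clarity; on the
    real hardware the rotation would itself be done in 18-bit pieces.
    """
    hi, lo = x
    v = (hi << 16) | lo
    v = ((v << n) | (v >> (32 - n))) & 0xFFFFFFFF
    return (v >> 16, v & MASK16)
```

For example, `add32((0xFFFF, 0xFFFF), (0, 1))` wraps around to `(0, 0)`, exactly as modulo-2^32 addition should.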
In his chip design tools, Chuck Moore naturally did not use the standard equations:
"Chuck showed me the equations he was using for transistor models in OKAD and compared them to the SPICE equations that required solving several differential equations. He also showed how he scaled the values to simplify the calculation. It is pretty obvious that he has sped up the inner loop a hundred times by simplifying the calculation. He adds that his calculation is not only faster but more accurate than the standard SPICE equation. He said, 'I originally chose mV for internal units. But using 6400 mV = 4096 units replaces a divide with a shift and requires only 2 multiplies per transistor. ... Even the multiplies are optimized to only step through as many bits of precision as needed.'"
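The "6400 mV = 4096 units" remark is the classic fixed-point trick: pick an internal full-scale value that is a power of two, so that normalizing by full scale becomes a shift instead of a divide. Here is a hypothetical Python sketch of the idea - the function names and the Q12 convention are my illustration, not OKAD's actual code:

```python
# Full scale is 6400 mV, represented internally as 4096 units (2**12),
# so "fraction of full scale" becomes a shift instead of a divide.
UNITS_PER_FULL_SCALE = 4096   # 2**12 units == 6400 mV

def mv_to_units(mv):
    # One-time conversion at the boundary; the inner loop never divides.
    return mv * UNITS_PER_FULL_SCALE // 6400

def scale_q12(value, frac_q12):
    # Multiply by a Q12 fraction, then shift right 12 bits instead of
    # dividing by 4096 - the divide has left the inner loop entirely.
    return (value * frac_q12) >> 12
```

So `mv_to_units(3200)` gives 2048 (half scale), and scaling 1000 by that half-scale fraction via `scale_q12(1000, 2048)` gives 500 with only a multiply and a shift.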
Could somebody clarify what this means? The only reference to "3A991A" I could find is www.uptodateregs.com/_eccn/ECCN.asp?ECCN=3A991, however there is nothing on that page about a .3 subsection.
a. “Microprocessor microcircuits”, “microcomputer microcircuits”, and microcontroller microcircuits having any of the following:
[...]
a.3 More than one data or instruction bus or serial communication port that provides a direct external interconnection between parallel “microprocessor microcircuits” with a transfer rate of 2.5 Mbyte/s.
Good God, that's cheap. I can buy that evaluation board (and I very well may some time). Best of luck to them with the sales, that's at a scale and price that could open up a lot of options.
Compare and contrast to the E-ink evaluation board: http://store.nexternal.com/shared/StoreFront/products.asp?CS... 6" = $3,000, and not too long ago they were over $8,000. Out of the reach of many seeking to innovate without a company pushing them to do so.
How does this array do I/O? How does one bring in data (from disk/network) into the chip? I couldn't find anything demonstrating high throughput I/O on the eval board...
Can you do something useful on this chip? It seems to me that while the total number of instructions looks good, what can actually be accomplished per instruction is very bad. Not only do the instructions operate on very small words, you need a ton of communication between the cores to do anything. E.g. how fast would this thing realistically be for (integer -- it doesn't even have floating point) matrix multiplication?
"144 computers" == "144-core parallel CPU"? I'm guessing this is not x86 and is targeted at academic researchers looking to build custom massively parallel computational clusters on the cheap (computational neuroscience?). If anyone can volunteer additional context or applications for this, please do; I'm not as familiar with hardware as I'd like to be.
Each node has 64 words of RAM and 64 words of ROM, and uses 18-bit words for ALU operations and instructions. Each word packs more than one MISC opcode - three full-width slots plus a short final slot.
I cannot wrap my head around how to program that... it's not a beast, more like a field of tiny windmills. One of the designers of the preceding chips once wrote about using it as a systolic engine, but the area of systolic algorithms is quite narrow, AFAIK.
I cannot find any C/Fortran compiler or compiler for any other high-level language.
My overall impression is that this looks like all the bad ideas from Cell BE ported to the Forth language.
> Charles Moore created Forth in the 1960s and 1970s to give computers real-time control over astronomical equipment. A number of Forth's features (such as its interactive style) make it a useful language for AI programming, and devoted adherents have developed Forth-based expert systems and neural networks.
Still, 100 billion neurons in the brain / 144 cores per chip * $20 per chip = ~$14 billion. Also, I would guess most modern researchers in this area don't know Forth and are doing high-level programming, virtualizing neurons rather than taking a low-level hardware approach.
I have a book whose authors developed Lisp on Forth and then proceeded to develop Prolog on the newly created Lisp. They then demonstrated how to use that Prolog to build a rule-based expert system.
There is a saying that Forth amplifies a programmer's ability to develop programs - and to make mistakes. If you need an AI tool but do not need your mistakes amplified, stay away from Forth. I think that applies to other areas of domain-specific development as well.
While I adore Forth, I cannot recommend it to anyone. Especially not for simulating a brain - what if you introduce an error, Forth amplifies it, and we get a hidden psychopath? ;)
Contains LISP and Prolog emulations in Forth, including a unification algorithm. It also has some minimum distance classifier code. The application is fault diagnosis in locomotives.
Yep, pretty much all simulation is currently being done with Phil Goodman's Neocortical Simulator (Matlab/C) and NEURON (C, plus recently a Python API) on Blue Gene supercomputers. http://en.wikipedia.org/wiki/Blue_Brain_Project . So the mystery of who will use this chip continues.
Perhaps the military? Small embedded neural nets could be highly useful for visual recognition algos on missiles and drones (the majority of military planes now being built are unmanned). However, I believe the military is now trying to move all drone tech to a common operating system/language to increase code portability between platforms.
Don't forget that these chips are a couple of million times faster than a neuron firing at 200 Hz. If we ignore that the brain has far more interconnect, you'd only need about $7,000 worth of chips.
It would also occupy a plane with an area of 12,500 m^2 - according to Wolfram|Alpha, about 1.7 times the area of a FIFA-sanctioned international soccer field.
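These figures can be sanity-checked with the "eight computers per square millimeter" density quoted from GreenArrays' own material, assuming one core per neuron as a deliberately rough model:

```python
# Back-of-envelope check of the thread's figures (rough model: 1 core = 1 neuron).
neurons = 100e9
cores_per_chip = 144
chips = neurons / cores_per_chip        # ~694 million chips

cost = chips * 20                       # ~$13.9 billion at $20/chip
area_mm2 = chips * cores_per_chip / 8   # 8 computers per mm^2 (180 nm figure)
area_m2 = area_mm2 / 1e6                # ~12,500 m^2 of silicon

print(round(cost / 1e9, 1), round(area_m2))
```

This reproduces both numbers in the thread: about $14 billion of chips covering about 12,500 m^2 of die area (die area only, not boards or packaging).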
Yeah - however, if someone were to buy $1 billion of them, GreenArrays might be able to shrink the design to a 30-40 nm process, significantly reducing the footprint. You can also picture it as the footprint of a decent cluster, with all the 1U blades spread out over a certain area. Definitely feasible, though I recognize your computation is die size, not total computer size.
Each 18-bit word holds three 5-bit instruction slots followed by a single 3-bit slot that can only encode a restricted subset of the instruction set (5+5+5+3 = 18 bits). Since each node has 64 words of RAM and 64 of ROM, you could theoretically pack 512 instructions into each node. In practice this figure will be a fair bit lower - jumps, for example, store their target address in the remaining slots of a word, so they can consume as many as four "theoretical instructions".
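A hypothetical packer makes the slot arithmetic concrete. This assumes three 5-bit slots plus one 3-bit slot (5+5+5+3 = 18) with slot 0 in the high bits; the exact bit layout and the rule for which opcodes fit the final slot are my illustration, not the F18A's precise encoding:

```python
def pack_word(ops):
    """Pack up to four opcodes into one 18-bit word.

    Slots 0-2 take full 5-bit opcodes; slot 3 has only 3 bits, so it can
    hold just the subset of opcodes whose low two bits are zero.
    """
    assert len(ops) <= 4
    word = 0
    for i, op in enumerate(ops):
        if i < 3:
            assert 0 <= op < 32
            word |= op << (13 - 5 * i)   # slot bit offsets: 13, 8, 3
        else:
            assert op & 0b11 == 0        # restricted final-slot subset
            word |= op >> 2              # low 3 bits of the word
    return word

def unpack_word(word):
    """Recover the four opcode slots from an 18-bit word."""
    return [
        (word >> 13) & 0x1F,
        (word >> 8) & 0x1F,
        (word >> 3) & 0x1F,
        (word & 0x7) << 2,   # final slot expands back to 5-bit space
    ]
```

Round-tripping `[1, 2, 3, 4]` through `pack_word`/`unpack_word` returns the same list, and the 512-instruction ceiling is just 128 words times 4 slots.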
However, communicating in parallel between CPUs is very easy. I/O lines between CPUs have essentially a hardware semaphore that will cause reading CPUs to block until they get a write and writing CPUs to block until they get a corresponding read. By bit-indexing ports you also get pretty easy fanout.
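The blocking-port behavior is essentially a zero-capacity rendezvous channel. Here is a software sketch of that semantics using Python threads - an analogy only, since the real chip blocks in hardware with no polling and effectively no power draw while suspended:

```python
import threading
import queue

class Port:
    """Rendezvous channel: writer blocks until a reader takes the value,
    reader blocks until a writer supplies one - like the hardware
    semaphore on the inter-node ports described above."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def write(self, value):
        self._slot.put(value)   # blocks if a previous write is pending
        self._ack.get()         # blocks until a reader has taken it

    def read(self):
        value = self._slot.get()  # blocks until a writer arrives
        self._ack.put(None)       # release the blocked writer
        return value
```

Usage: start a thread that calls `port.write(42)`; it stays blocked until the main thread calls `port.read()`, at which point both sides proceed.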
The docs also mention that CPUs can directly "push" instructions to one another without needing a bootstrap on the receiving end, which allows CPUs to act as extended memory for one another, eases debugging, and opens up tantalizing possibilities for self-modifying code.
You aren't going to get very far trying to execute a conventional language on this architecture, but color me interested.
Cell BE shipped without real compiler support: you couldn't feed your C/C++/Fortran program to a compiler and obtain a more or less parallel version of it. This complicates things - you had to parallelize your program manually.
The tool support for the Cell BE SPU was close to existent (no, I didn't mix the words up). It was of such low quality that you pretty much had to use Emacs with an assembler-highlighting mode to do any serious work for the SPU. The difference in speed between gcc output and hand-written assembler circa 2007 was about 1.5-2x.
Both the PPU and SPU are in-order, so you had to avoid a minefield of random memory accesses. You had to write your own allocators and such while respecting the SPU's constraints.
In-order architectures do not facilitate abstractions. You cannot simply recompile code from an out-of-order x86 for the in-order Cell BE and obtain reasonable performance (say, 80% of maximum). You have to optimize aggressively.
You cannot load much into the 256 KB of combined data and program memory of an SPU. Divide those 256 KB in two and you have about 128 KB of program memory (which you should divide again: one half for the working program, another for the program being loaded) and 128 KB, or 8K quadwords (16 bytes per quadword), of data memory. The data memory you should divide again - one part is being worked on while another is being loaded. That leaves 4K quadwords. At two quadword operations per cycle, that is 2K cycles to process the whole data block. The latency of the Cell memory subsystem is very high, so those 2K cycles are comparable to the time needed to load that amount of data into the SPU. So you have to be very, very careful to keep the SPU loaded and working.
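The budget above can be restated as a few lines of arithmetic (the halving assumptions are the commenter's, not exact Cell specs):

```python
# SPU local-store budget, following the halving argument above.
local_store = 256 * 1024        # bytes of local store, code + data share it
code = local_store // 2         # half reserved for code
data = local_store - code       # 128 KB left for data
working = data // 2             # double-buffer: process one half while
                                # DMA fills the other
quadwords = working // 16       # 16 bytes per quadword
cycles = quadwords // 2         # two quadword ops per cycle

print(quadwords, cycles)        # 4K quadwords, 2K cycles
```

Which lands on the same numbers: 4096 working quadwords and roughly 2048 cycles of useful compute per data block to hide behind each DMA transfer.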
Many new chips suffer from a lack of compiler support, especially automatic parallelization. Cell BE surely did; so does GreenArrays. Cell BE suffered from a lack of memory on the SPU, its main parallel engine; GreenArrays does as well. Cell BE used a simple-to-implement but hard-to-program in-order architecture in all its processing parts; GreenArrays uses a stack architecture, which is extremely hard to program.
So, in my eyes, GreenArrays is Cell BE ported to Forth. ;)
My understanding is that this work is in conjunction with Alan Kay's Viewpoints Research: www.viewpointsresearch.org and what they're working on in languages.
This very powerful and versatile chip consists of an 18x8 array of architecturally identical, independent, complete F18A computers, or nodes, each of which operates asynchronously. Each computer is capable of performing a basic ALU instruction in approx. 1.5 nanoseconds for an energy cost on the order of 7 picojoules. Nothing else available today comes close to that winning combination. Twenty-two of the computers on the edges of the array have one or more I/O pins and one of several classes of circuitry associated with them, as illustrated below.
The F18A is a stable, mature design for a computer and its I/O whose robustness has been proven in many chip configurations. It has been proven in 180nm geometry, and a prototype in 130 nm has also performed well. The computer is small; eight fit in roughly a square millimeter. Depending on chip configurations, this yields between 100,000 and 200,000 computers per 8 inch wafer, contributing to the low cost of our chips.
The linked PDF files also contain some information about possible applications. But it seems quite tough to find outside information about the chip.
I am wondering whether this could be used for real-time ray tracing or even real-time photon mapping. I always had the impression that for highly parallel algorithms, an enormous number of very primitive processors gives the best MIPS/transistor-count ratio. I know, though, that the bottleneck is usually bus communication / memory access.
http://colorforth.com/blog.htm