Moore added this wonderful passage to his blog today. I think it captures something of the specialness of the man:
GreenArrays is starting production of its 144-computer chip with an accompanying evaluation board. Details at greenarrays.com. This is the result of funding, testing and refinement.
On the lighter side, regarding my Five Fingers shoes: continued delight; more miles of hiking. My gait is changing: shorter strides, less heel strike, less toe-out. And I've not experienced any lower back pain. Somehow my soles are less sensitive and I can walk barefoot more easily.
This is a very interesting product from the great Chuck Moore that needs a "killer app" to take off. I wish I had time to invest in it...
What's interesting is how different this company is from the idea of a startup that most of us here have in our heads. Just look at the team and try to figure out what the average age of the company is:
I wonder if this has something to do with the fact that chip design requires a lot of experience or the fact that these people just happen to be in great Chuck Moore's professional network. Or likely both.
I think both. Plus, consumer web applications, consumer smartphone apps, and social media apps are something of a fashion among young people. (I am 35 years old and even I feel the generational difference from younger folks a bit.) Older folks are not so sensitive to fashions.
Edit: or maybe every generation has a love affair with whatever was the 'next big thing' of its youth? Maybe in 2040 a social media startup will be founded by old folks, while the young guys are enthusiastic about something completely different. My father is a mechanical engineer, I am a programmer; maybe my son will have a profession built around the next 'next big thing'.
What, this thing costs $20! As a hardware novice (read: dabbled with Arduino a little), how hard would it be for me to create something useful with this chip?
It would be a significant effort to put a board together to do anything with this, certainly more than the cost of the eval board if you place a value on your time (and aren't looking at this just as a hobby).
Most people who might use this in a commercial setting would buy the eval board first to try it out. That is probably what I would do even if this were a hobby project.
Of course a student could try to find a QFN-88 to DIP adapter for like $20 (or make their own) and plug it into a breadboard. I could only find a QFN-72 to DIP adapter with five minutes of searching.
This is interesting - each node is tiny and can't do much on its own (mainly due to RAM constraints). But because they are tiny, you can pack tons of them on a chip/board and coordinate several nodes to do larger work items. Since all actions are asynchronous (as opposed to clocked), you use very little power when you aren't actively doing work, and only the minimum amount of time required to actually do the work.
Look at the chips he’s making. 144 cores, but the cores (nodes) are tiny - why would you want them big, if you believe you can do anything with almost no resources? And they use 18-bit words, presumably on the assumption that 18 bits is a good quantity: not too small, not too large. Then they write an application note about implementing the MD5 hash function:
"MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, MD5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer."
Then they solve these problems by manually implementing 32-bit addition and splitting the code across nodes. But if MD5 weren’t a standard, you could implement your own hash function without going to all this trouble.
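The application note's actual Forth code isn't reproduced here, but the basic trick of carrying a 32-bit quantity across narrower machine words can be sketched in Python. This assumes the 32-bit value is held as two 16-bit halves, each of which fits comfortably in an 18-bit word with room for a carry bit; the function names are mine, not GreenArrays'.

```python
MASK16 = 0xFFFF  # a 16-bit half fits in an 18-bit word with room for carry

def add32(a, b):
    """Add two 32-bit values held as (hi, lo) 16-bit halves, mod 2**32."""
    a_hi, a_lo = a
    b_hi, b_lo = b
    lo = a_lo + b_lo                # may exceed 16 bits; carry sits in bit 16
    hi = a_hi + b_hi + (lo >> 16)   # propagate the carry into the high half
    return (hi & MASK16, lo & MASK16)

def rotl32(x, n):
    """32-bit left rotation on a (hi, lo) pair, as MD5 requires.

    Reassembling the full 32-bit value is a shortcut for clarity; on the
    real hardware the rotation would itself be done in 18-bit pieces.
    """
    hi, lo = x
    v = (hi << 16) | lo
    v = ((v << n) | (v >> (32 - n))) & 0xFFFFFFFF
    return (v >> 16, v & MASK16)
```

For example, `add32((0xFFFF, 0xFFFF), (0, 1))` wraps around to `(0, 0)`, exactly as modulo-2^32 addition should.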
In his chip design tools, Chuck Moore naturally did not use the standard equations:
"Chuck showed me the equations he was using for transistor models in OKAD and compared them to the SPICE equations that required solving several differential equations. He also showed how he scaled the values to simplify the calculation. It is pretty obvious that he has sped up the inner loop a hundred times by simplifying the calculation. He adds that his calculation is not only faster but more accurate than the standard SPICE equation. He said, 'I originally chose mV for internal units. But using 6400 mV = 4096 units replaces a divide with a shift and requires only 2 multiplies per transistor. ... Even the multiplies are optimized to only step through as many bits of precision as needed.'"
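The "6400 mV = 4096 units" remark is the classic fixed-point trick: pick an internal full-scale value that is a power of two, so that normalizing by full scale becomes a shift instead of a divide. Here is a hypothetical Python sketch of the idea - the function names and the Q12 convention are my illustration, not OKAD's actual code:

```python
# Full scale is 6400 mV, represented internally as 4096 units (2**12),
# so "fraction of full scale" becomes a shift instead of a divide.
UNITS_PER_FULL_SCALE = 4096   # 2**12 units == 6400 mV

def mv_to_units(mv):
    # One-time conversion at the boundary; the inner loop never divides.
    return mv * UNITS_PER_FULL_SCALE // 6400

def scale_q12(value, frac_q12):
    # Multiply by a Q12 fraction, then shift right 12 bits instead of
    # dividing by 4096 - the divide has left the inner loop entirely.
    return (value * frac_q12) >> 12
```

So `mv_to_units(3200)` gives 2048 (half scale), and scaling 1000 by that half-scale fraction via `scale_q12(1000, 2048)` gives 500 with only a multiply and a shift.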
Could somebody clarify what this means? The only reference to "3A991A" I could find is www.uptodateregs.com/_eccn/ECCN.asp?ECCN=3A991, however there is nothing on that page about a .3 subsection.
a. “Microprocessor microcircuits”, “microcomputer microcircuits”, and microcontroller microcircuits having any of the following:
[...]
a.3 More than one data or instruction bus or serial communication port that provides a direct external interconnection between parallel “microprocessor microcircuits” with a transfer rate of 2.5 Mbyte/s.
Good God, that's cheap. I can buy that evaluation board (and I very well may some time). Best of luck to them with the sales, that's at a scale and price that could open up a lot of options.
Compare and contrast to the E-ink evaluation board: http://store.nexternal.com/shared/StoreFront/products.asp?CS... 6" = $3,000, and not too long ago they were over $8,000. Out of the reach of many seeking to innovate without a company pushing them to do so.
How does this array do I/O? How does one bring in data (from disk/network) into the chip? I couldn't find anything demonstrating high throughput I/O on the eval board...
Can you do something useful on this chip? It seems to me that while the total number of instructions looks good, what can actually be accomplished per instruction is very bad. Not only do the instructions operate on very small words, you need a ton of communication between the cores to do anything. E.g. how fast would this thing realistically be for (integer -- it doesn't even have floating point) matrix multiplication?
"144 computers" == "144-core parallel CPU"? I'm guessing this is not x86 and is targeted at academic researchers looking to build custom massively parallel computational clusters on the cheap (computational neuroscience?). If anyone can volunteer additional context or applications for this, please do; I'm not as familiar with hardware as I'd like to be.
Each node has 64 words of RAM and 64 words of ROM, and uses 18-bit words for ALU operations and instructions. Each word packs more than one MISC opcode - three full-width slots plus a short final slot.
I cannot wrap my head around how to program that... it's not a beast, more like a field of tiny windmills. One of the designers of the preceding chips once wrote about using it as a systolic engine, but the area of systolic algorithms is quite narrow, AFAIK.
I cannot find any C/Fortran compiler or compiler for any other high-level language.
My overall impression is that this looks like all the bad ideas from Cell BE ported to the Forth language.
> Charles Moore created Forth in the 1960s and 1970s to give computers real-time control over astronomical equipment. A number of Forth's features (such as its interactive style) make it a useful language for AI programming, and devoted adherents have developed Forth-based expert systems and neural networks.
Still, 100 billion neurons in the brain / 144 cores per chip * $20 per chip = ~$14 billion. Also, I would guess most modern researchers in this area don't know Forth and are doing high-level programming, virtualizing neurons rather than taking a low-level hardware approach.
I have a book whose authors developed Lisp on Forth and then proceeded to develop Prolog on the newly created Lisp. They then demonstrated how to use that Prolog to build a rule-based expert system.
There is a saying that Forth amplifies a programmer's ability to develop programs - and to make mistakes. If you need an AI tool but do not need your mistakes amplified, stay away from Forth. I think that applies to other areas of domain-specific development as well.
While I adore Forth, I cannot recommend it to anyone. Especially not for simulating a brain - what if you introduce an error, Forth amplifies it, and we get a hidden psychopath? ;)
Contains LISP and Prolog emulations in Forth, including a unification algorithm. It also has some minimum distance classifier code. The application is fault diagnosis in locomotives.
Yep, pretty much all simulation is currently being done with Phil Goodman's Neocortical Simulator (Matlab/C) and NEURON (C, plus recently a Python API) on Blue Gene supercomputers. http://en.wikipedia.org/wiki/Blue_Brain_Project . So the mystery of who will use this chip continues.
Perhaps the military? Small embedded neural nets could be highly useful for visual recognition algos on missiles and drones (the majority of military planes now being built are unmanned). However, I believe the military is now trying to move all drone tech to a common operating system/language to increase code portability between platforms.
Don't forget that these chips are a couple of million times faster than a neuron firing at 200 Hz. If we ignore that the brain has far more interconnect, you'd only need about $7,000 worth of chips.
It would also occupy a plane with an area of 12,500 m^2 - according to Wolfram|Alpha, about 1.7 times the area of a FIFA-sanctioned international soccer field.
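These figures can be sanity-checked with the "eight computers per square millimeter" density quoted from GreenArrays' own material, assuming one core per neuron as a deliberately rough model:

```python
# Back-of-envelope check of the thread's figures (rough model: 1 core = 1 neuron).
neurons = 100e9
cores_per_chip = 144
chips = neurons / cores_per_chip        # ~694 million chips

cost = chips * 20                       # ~$13.9 billion at $20/chip
area_mm2 = chips * cores_per_chip / 8   # 8 computers per mm^2 (180 nm figure)
area_m2 = area_mm2 / 1e6                # ~12,500 m^2 of silicon

print(round(cost / 1e9, 1), round(area_m2))
```

This reproduces both numbers in the thread: about $14 billion of chips covering about 12,500 m^2 of die area (die area only, not boards or packaging).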
Yeah - however, if someone were to buy $1 billion of them, GreenArrays might be able to shrink the design to a 30-40 nm process, significantly reducing the footprint. You can also picture it as the footprint of a decent cluster, with all the 1U blades spread out over a certain area. Definitely feasible, though I recognize your computation is die size, not total computer size.
Each 18-bit word holds three 5-bit instruction slots followed by a single 3-bit slot that can only encode a restricted subset of the instruction set (5+5+5+3 = 18 bits). Since each node has 64 words of RAM and 64 of ROM, you could theoretically pack 512 instructions into each node. In practice this figure will be a fair bit lower - jumps, for example, store their target address in the remaining slots of a word, so they can consume as many as four "theoretical instructions".
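A hypothetical packer makes the slot arithmetic concrete. This assumes three 5-bit slots plus one 3-bit slot (5+5+5+3 = 18) with slot 0 in the high bits; the exact bit layout and the rule for which opcodes fit the final slot are my illustration, not the F18A's precise encoding:

```python
def pack_word(ops):
    """Pack up to four opcodes into one 18-bit word.

    Slots 0-2 take full 5-bit opcodes; slot 3 has only 3 bits, so it can
    hold just the subset of opcodes whose low two bits are zero.
    """
    assert len(ops) <= 4
    word = 0
    for i, op in enumerate(ops):
        if i < 3:
            assert 0 <= op < 32
            word |= op << (13 - 5 * i)   # slot bit offsets: 13, 8, 3
        else:
            assert op & 0b11 == 0        # restricted final-slot subset
            word |= op >> 2              # low 3 bits of the word
    return word

def unpack_word(word):
    """Recover the four opcode slots from an 18-bit word."""
    return [
        (word >> 13) & 0x1F,
        (word >> 8) & 0x1F,
        (word >> 3) & 0x1F,
        (word & 0x7) << 2,   # final slot expands back to 5-bit space
    ]
```

Round-tripping `[1, 2, 3, 4]` through `pack_word`/`unpack_word` returns the same list, and the 512-instruction ceiling is just 128 words times 4 slots.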
However, communicating in parallel between CPUs is very easy. I/O lines between CPUs have essentially a hardware semaphore that will cause reading CPUs to block until they get a write and writing CPUs to block until they get a corresponding read. By bit-indexing ports you also get pretty easy fanout.
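The blocking-port behavior is essentially a zero-capacity rendezvous channel. Here is a software sketch of that semantics using Python threads - an analogy only, since the real chip blocks in hardware with no polling and effectively no power draw while suspended:

```python
import threading
import queue

class Port:
    """Rendezvous channel: writer blocks until a reader takes the value,
    reader blocks until a writer supplies one - like the hardware
    semaphore on the inter-node ports described above."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def write(self, value):
        self._slot.put(value)   # blocks if a previous write is pending
        self._ack.get()         # blocks until a reader has taken it

    def read(self):
        value = self._slot.get()  # blocks until a writer arrives
        self._ack.put(None)       # release the blocked writer
        return value
```

Usage: start a thread that calls `port.write(42)`; it stays blocked until the main thread calls `port.read()`, at which point both sides proceed.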
The docs also mention that CPUs can directly "push" instructions to one another without needing a bootstrap on the receiving end, which allows CPUs to act as extended memory for one another, eases debugging, and opens up tantalizing possibilities for self-modifying code.
You aren't going to get very far trying to execute a conventional language on this architecture, but color me interested.
Cell BE shipped without real compiler support: you couldn't feed your C/C++/Fortran program to a compiler and obtain a more or less parallel version of it. This complicates things - you had to parallelize your program manually.
The tool support for the Cell BE SPU was close to existent (no, I didn't mix the words up). It was of such low quality that you pretty much had to use Emacs with an assembler-highlighting mode to do any serious work for the SPU. The difference in speed between gcc output and hand-written assembler circa 2007 was about 1.5-2x.
Both the PPU and SPU are in-order, so you had to avoid a minefield of random memory accesses. You had to write your own allocators and such while respecting the SPU's constraints.
In-order architectures do not facilitate abstractions. You cannot simply recompile code from an out-of-order x86 for the in-order Cell BE and obtain reasonable performance (say, 80% of maximum). You have to optimize aggressively.
You cannot load much into the 256 KB of combined data and program memory of an SPU. Divide those 256 KB in two and you have about 128 KB of program memory (which you should divide again: one half for the working program, another for the program being loaded) and 128 KB, or 8K quadwords (16 bytes per quadword), of data memory. The data memory you should divide again - one part is being worked on while another is being loaded. That leaves 4K quadwords. At two quadword operations per cycle, that is 2K cycles to process the whole data block. The latency of the Cell memory subsystem is very high, so those 2K cycles are comparable to the time needed to load that amount of data into the SPU. So you have to be very, very careful to keep the SPU loaded and working.
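The budget above can be restated as a few lines of arithmetic (the halving assumptions are the commenter's, not exact Cell specs):

```python
# SPU local-store budget, following the halving argument above.
local_store = 256 * 1024        # bytes of local store, code + data share it
code = local_store // 2         # half reserved for code
data = local_store - code       # 128 KB left for data
working = data // 2             # double-buffer: process one half while
                                # DMA fills the other
quadwords = working // 16       # 16 bytes per quadword
cycles = quadwords // 2         # two quadword ops per cycle

print(quadwords, cycles)        # 4K quadwords, 2K cycles
```

Which lands on the same numbers: 4096 working quadwords and roughly 2048 cycles of useful compute per data block to hide behind each DMA transfer.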
Many new chips suffer from a lack of compiler support, especially automatic parallelization. Cell BE surely did; so does GreenArrays. Cell BE suffered from a lack of memory on the SPU, its main parallel engine; GreenArrays does as well. Cell BE used a simple-to-implement but hard-to-program in-order architecture in all its processing parts; GreenArrays uses a stack architecture, which is extremely hard to program.
So, in my eyes, GreenArrays is Cell BE ported to Forth. ;)
My understanding is that this work is in conjunction with Alan Kay's Viewpoints Research: www.viewpointsresearch.org and what they're working on in languages.
This very powerful and versatile chip consists of an 18x8 array of architecturally identical, independent, complete F18A computers, or nodes, each of which operates asynchronously. Each computer is capable of performing a basic ALU instruction in approx. 1.5 nanoseconds for an energy cost on the order of 7 picojoules. Nothing else available today comes close to that winning combination. Twenty-two of the computers on the edges of the array have one or more I/O pins and one of several classes of circuitry associated with them, as illustrated below.
The F18A is a stable, mature design for a computer and its I/O whose robustness has been proven in many chip configurations. It has been proven in 180nm geometry, and a prototype in 130 nm has also performed well. The computer is small; eight fit in roughly a square millimeter. Depending on chip configurations, this yields between 100,000 and 200,000 computers per 8 inch wafer, contributing to the low cost of our chips.
The linked PDF files also contain some information about possible applications. But it seems quite tough to find outside information about the chip.
I am wondering whether this could be used for real-time ray tracing or even real-time photon mapping. I always had the impression that for highly parallel algorithms, an enormous number of very primitive processors gives the best MIPS/transistor-count ratio. I know, though, that the bottleneck is usually bus communication / memory access.
http://colorforth.com/blog.htm