From what I have read, the cores we typically run today use snooping: each core constantly listens for interesting changes to other cores' caches. Clearly this doesn't scale; you don't want to listen to the cache traffic of 200,000 cores. So instead they have a centralised 'directory' to manage cache coherency. The directory scheme has higher latency, since presumably every load/store miss has to send a message to a potentially distant cache directory manager.
Does anyone really know this stuff, I would be interested to hear a more knowledgeable take on it :)
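Not an expert either, but here's a toy sketch of the directory idea as I understand it (names and structure are my own illustration, not any real protocol implementation): the directory tracks which cores hold each line, so a writer only sends invalidations to the listed sharers instead of broadcasting to every core the way snooping does.

```python
# Toy model of directory-based coherence (hypothetical, MSI-like, simplified).
# The key saving: a write contacts only the cores the directory lists as
# sharers of that line, not all N cores as a snooping broadcast would.

class Directory:
    def __init__(self):
        self.sharers = {}   # line address -> set of core ids holding the line
        self.messages = 0   # point-to-point invalidation messages sent

    def read(self, core, addr):
        # A read registers the core as a sharer of the line.
        self.sharers.setdefault(addr, set()).add(core)

    def write(self, core, addr):
        # A write invalidates every *other* sharer the directory knows about.
        others = self.sharers.get(addr, set()) - {core}
        self.messages += len(others)
        self.sharers[addr] = {core}   # writer becomes the sole holder

d = Directory()
d.read(0, 0x40)
d.read(1, 0x40)
d.write(2, 0x40)    # invalidates only cores 0 and 1, however many cores exist
print(d.messages)   # -> 2
```

The trade-off the parent comment describes shows up here: the write can't proceed until the round trip to the directory completes, which is the extra latency.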
I find myself thinking that the pattern of computing is to first implement an idea in discrete boxes, and then translate that into regions on an IC.
Just observe the path from mainframe, via minicomputer, to today's SoCs.
And as I was reading your comment I found myself thinking about how to get multiple servers to talk to a common datastore, and how that looks eerily similar to this cache coherence issue.
To wander into the weird for a bit, the phrase "as above, so below" keeps going through my mind while contemplating all this.
IIRC the L3 cache of a multicore cpu effectively works as a directory for cache coherence, at least on Intel machines which have fully inclusive caches.
So it's the more-than-a-decade-old UltraSPARC T1, the so-called CoolThreads, all over again. That architecture didn't do Sun any good back then (exactly as all the normal engineers expected/predicted at the time): a bunch of weak cores (8 or 16, if I remember correctly, at a time when Intel was only dreaming about 2), choked by the memory bus, and without many business tasks around that had widely parallelized software available to run (besides those SSL benchmarks). Maybe it was just that far ahead of its time...
I think one of the big problems with the T1 was that the whole chip had only one FPU shared across all the cores, so anything that needed floating-point math was really slow. Since this design only has two threads per core, that should be much less of a problem, along with anything else that came from having so many threads.
It seems that no matter how much we would love to do everything in parallel, below a certain scale the tasks simply have to be done serially. And insisting on doing them in parallel just turns into a hot mess.
> The goal was to design a chip that could be used in large data centers that handle social networking requests, search and cloud services. The response time in social networking and search is tied to the horsepower of servers in data centers.
So, if I understand this correctly, they are targeting embarrassingly parallel tasks, although tied to a central database. Why should this super-NUMA architecture be good for that? If it were, why are Facebook and Google running on distributed Intel Xeon clusters, and not on one of Fujitsu's or Oracle's big-SPARC-machine offerings?
> Some open-source processor designs are for fun. For example, Open Core Foundation ... design for the SH2 processor, which was in Sega’s 1994 Saturn gaming console.
Pretty sure that project is not for "fun." In the talk I watched they seemed pretty damned serious about producing it for 'serious' uses.
This looks quite similar to what Adapteva did with their Parallella platform, though they were targeting the low end of hobbyists rather than big data centers. Maybe that was their mistake... https://www.parallella.org/
Generally you need to make the most of such tiny cores. This usually means dedicating any given core to a specific task, pulling its work off a queue in batches. That way data tends to stay in each processor's local cache and you can minimize synchronization. If instead you make every core do everything, data falls out of cache, you're forced to synchronize across cores, and performance suffers.
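Something like this pattern, roughly (a sketch with made-up names, using threads as a stand-in for cores): each worker is dedicated to one kind of task and drains its queue in batches, so the synchronization cost of the queue is paid once per batch rather than once per item.

```python
# Sketch of the dedicated-worker-with-batching pattern described above.
# One worker per task type; it grabs up to batch_size items in one go,
# then processes them without touching shared state.

import queue
import threading

def batch_worker(q, handle, batch_size=64):
    while True:
        batch = [q.get()]                    # block for the first item
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())  # opportunistically fill the batch
            except queue.Empty:
                break
        for item in batch:
            if item is None:                  # sentinel: shut the worker down
                return
            handle(item)                      # cache-friendly, lock-free work

results = []
q = queue.Queue()
t = threading.Thread(target=batch_worker, args=(q, results.append))
t.start()
for i in range(5):
    q.put(i * i)
q.put(None)
t.join()
print(results)  # -> [0, 1, 4, 9, 16]
```

On a real manycore part you'd additionally pin each worker to a core so its working set stays in that core's local cache, which Python threads obviously don't model.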
Erlang is a great way to write code that's distributed across a network and which is generally I/O bound. It's a terrible way to usefully use compute on manycore systems.
https://en.wikipedia.org/wiki/Cache_coherence