From what I have read, the cores we typically run today use snooping: each core constantly listens for interesting changes to other cores' caches. Clearly this doesn't scale; you don't want to listen to the cache traffic of 200,000 cores. So instead they have a centralised 'directory' to manage cache coherency. The directory scheme has higher latency, since presumably every load/store miss has to send a message to a potentially distant cache directory manager.
Does anyone really know this stuff, I would be interested to hear a more knowledgeable take on it :)
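Not an expert either, but here's a toy sketch of the directory idea as I understand it (names and structure are my own illustration, not any real protocol implementation): the directory tracks which cores hold each line, so a writer only sends invalidations to the listed sharers instead of broadcasting to every core the way snooping does.

```python
# Toy model of directory-based coherence (hypothetical, MSI-like, simplified).
# The key saving: a write contacts only the cores the directory lists as
# sharers of that line, not all N cores as a snooping broadcast would.

class Directory:
    def __init__(self):
        self.sharers = {}   # line address -> set of core ids holding the line
        self.messages = 0   # point-to-point invalidation messages sent

    def read(self, core, addr):
        # A read registers the core as a sharer of the line.
        self.sharers.setdefault(addr, set()).add(core)

    def write(self, core, addr):
        # A write invalidates every *other* sharer the directory knows about.
        others = self.sharers.get(addr, set()) - {core}
        self.messages += len(others)
        self.sharers[addr] = {core}   # writer becomes the sole holder

d = Directory()
d.read(0, 0x40)
d.read(1, 0x40)
d.write(2, 0x40)    # invalidates only cores 0 and 1, however many cores exist
print(d.messages)   # -> 2
```

The trade-off the parent comment describes shows up here: the write can't proceed until the round trip to the directory completes, which is the extra latency.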
I find myself thinking that the pattern of computing is to first implement an idea in discrete boxes, and then translate that into regions on an IC.
Just observe the path from mainframe, via minicomputer, to today's SoCs.
And as I was reading your comment I found myself thinking about how to get multiple servers to talk to a common datastore, and how that looks eerily similar to this cache coherence issue.
To wander into the weird for a bit, the phrase "as above, so below" keeps going through my mind while contemplating all this.
IIRC the L3 cache of a multicore cpu effectively works as a directory for cache coherence, at least on Intel machines which have fully inclusive caches.
So it's the more-than-a-decade-old UltraSPARC T1, the so-called CoolThreads, all over again. That architecture didn't do Sun any good back then (exactly as all the normal engineers expected/predicted at the time): a bunch of weak cores (8 or 16, if I remember correctly, at a time when Intel was only dreaming about 2), choked by the memory bus, and without many business tasks around that had widely parallelized software available to run (besides those SSL benchmarks). Maybe it was just that far ahead of its time...
I think one of the big problems with the T1 was that the whole chip had only one FPU shared across all the cores, so anything that needed floating-point math was really slow. Since this design only has two threads per core, that should be much less of a problem, along with anything else that came from having so many threads.
It seems that no matter how much we would love to do everything in parallel, below a certain scale the tasks simply have to be done serially. And insisting on doing them in parallel just turns into a hot mess.
> The goal was to design a chip that could be used in large data centers that handle social networking requests, search and cloud services. The response time in social networking and search is tied to the horsepower of servers in data centers.
So, if I understand this correctly, they are targeting embarrassingly parallel tasks, although tied to a central database. Why should this super-NUMA architecture be good for that? If it were, why are Facebook and Google running on distributed Intel Xeon clusters, and not on one of Fujitsu's or Oracle's big-SPARC-machine offerings?
> Some open-source processor designs are for fun. For example, Open Core Foundation ... design for the SH2 processor, which was in Sega’s 1994 Saturn gaming console.
Pretty sure that project is not for "fun." In the talk I watched they seemed pretty damned serious about producing it for 'serious' uses.
This looks quite similar to what Adapteva did with their Parallella platform, though they were targeting the low end of hobbyists rather than big data centers. Maybe that was their mistake... https://www.parallella.org/
Generally you need to make the most of such tiny cores. This usually means dedicating any given core to a specific task, pulling its work off a queue in batches. That way data tends to stay in each processor's local cache and you can minimize synchronization. If instead you make every core do everything, data falls out of cache, you're forced to synchronize across cores, and performance suffers.
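Something like this pattern, roughly (a sketch with made-up names, using threads as a stand-in for cores): each worker is dedicated to one kind of task and drains its queue in batches, so the synchronization cost of the queue is paid once per batch rather than once per item.

```python
# Sketch of the dedicated-worker-with-batching pattern described above.
# One worker per task type; it grabs up to batch_size items in one go,
# then processes them without touching shared state.

import queue
import threading

def batch_worker(q, handle, batch_size=64):
    while True:
        batch = [q.get()]                    # block for the first item
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())  # opportunistically fill the batch
            except queue.Empty:
                break
        for item in batch:
            if item is None:                  # sentinel: shut the worker down
                return
            handle(item)                      # cache-friendly, lock-free work

results = []
q = queue.Queue()
t = threading.Thread(target=batch_worker, args=(q, results.append))
t.start()
for i in range(5):
    q.put(i * i)
q.put(None)
t.join()
print(results)  # -> [0, 1, 4, 9, 16]
```

On a real manycore part you'd additionally pin each worker to a core so its working set stays in that core's local cache, which Python threads obviously don't model.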
Erlang is a great way to write code that's distributed across a network and which is generally I/O bound. It's a terrible way to usefully use compute on manycore systems.
https://en.wikipedia.org/wiki/Cache_coherence