The UltraSparc T2 has 8 cores and each core can run 8 concurrent threads, so you have 64 concurrent threads running "simultaneously" on a single chip: http://en.wikipedia.org/wiki/UltraSPARC_T2
One of the techniques they use to solve the memory bandwidth problem is to have four memory controllers.
Why not do for memory controllers the same thing we are doing for cores? If 8 cores is the practical limit of a single memory controller, why not ship 16-core CPUs with 2 independent controllers, one for each set of eight cores?
True, the two groups of cores would probably not be able to share memory (at least not at full speed), but that's something the OS scheduler and memory manager could take care of.
For one, there are plenty of pins. The socket used by the Core i7 nearly doubled the number of pins (it's in the 1300s now, up from 775), yet QuickPath Interconnect (Intel's replacement for the FSB) only takes 84 pins. Surely they can squeeze those in.
I would guess that having a high speed cache on chip would be a better use of real-estate.
Going with multiple mem controllers means that performance would still probably be pretty variable, depending on how the mem controllers are mapped to your physical addresses + how your data is laid out.
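To make that concrete, here's a toy model of the mapping, assuming a round-robin interleave of 64-byte cache lines across four controllers (real chips use more elaborate, often undocumented hashes):

    #include <stdio.h>
    #include <stdint.h>

    /* Toy model: 64-byte cache lines interleaved round-robin across 4 memory
     * controllers. Only meant to show why data layout changes which
     * controllers you actually exercise. */
    enum { NUM_CONTROLLERS = 4, LINE_BYTES = 64 };

    static unsigned controller_for(uint64_t phys_addr) {
        return (unsigned)((phys_addr / LINE_BYTES) % NUM_CONTROLLERS);
    }

    static void show_stride(uint64_t stride) {
        printf("stride %4llu bytes:", (unsigned long long)stride);
        for (int i = 0; i < 8; i++)
            printf(" MC%u", controller_for((uint64_t)i * stride));
        printf("\n");
    }

    int main(void) {
        show_stride(64);   /* MC0 MC1 MC2 MC3 ... -> all four controllers busy */
        show_stride(256);  /* MC0 MC0 MC0 MC0 ... -> one controller hammered   */
        return 0;
    }

With a 64-byte stride you exercise all four controllers; with a 256-byte stride every access lands on the same one, so effective bandwidth quietly drops to a quarter.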
In Core 2, the FSB was the CPU's main connection to the outside world, including the I/O bridges and busses and the off-chip memory controller, and thus to RAM.
That's not true for i7. QPI connects to the I/O bridges and busses, but the on-chip memory controller connects to RAM directly.
And I would guess that the memory controller uses a ton of pins, and is the reason why the triple-channel i7 has such a high pin count. The upcoming consumer-grade dual-channel model will have something like 200 fewer pins.
Memory latency problems can be worked around with pipelined requests, if you have enough concurrency in the CPU; memory bandwidth problems can be worked around with more independent banks of memory and more memory channels to the CPU, which means more wires. Both can be worked around by putting more memory locally.
Absent such workarounds, it's true that there are some problems (large-mesh finite element analysis, maybe) that more cores won't help with, and other problems that exhibit enough locality (large cellular automata) that more cores will help with. This article observes that some problems are in the first category. It would be absurd to claim that no problems are in the second category.
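As a rough software sketch of the latency-hiding idea (assuming GCC/Clang's __builtin_prefetch; the gather pattern and the prefetch distance of 8 are arbitrary illustrations):

    #include <stddef.h>

    /* Sketch of "pipelined requests": issue the next memory request before the
     * current one is consumed, so several cache misses are in flight at once. */
    double gather_sum(const double *values, const size_t *idx, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&values[idx[i + 8]], 0 /* read */, 0);
            sum += values[idx[i]];   /* likely a cache miss without the prefetch */
        }
        return sum;
    }

Whether that actually helps depends on how many outstanding misses the core and the memory controller can keep in flight.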
Perhaps this would be an extremely inefficient use of chip real estate, but what if extra cores were used to execute the same set of instructions in parallel with both outcomes of conditional branches (i.e. completely replace branch prediction) and take whichever one ends up being correct?
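A software caricature of that, using pthreads; slow_condition(), path_taken(), and path_not_taken() are hypothetical stand-ins, and real eager execution would happen in hardware rather than in threads:

    #include <pthread.h>
    #include <stdbool.h>

    static bool slow_condition(void) { return true; /* imagine a cache-missing load */ }
    static int  path_taken(void)     { return 42;   /* stand-in for the taken path  */ }
    static int  path_not_taken(void) { return 7;    /* stand-in for the other path  */ }

    static void *run_taken(void *out)     { *(int *)out = path_taken();     return NULL; }
    static void *run_not_taken(void *out) { *(int *)out = path_not_taken(); return NULL; }

    /* Run both sides of the branch while the condition resolves, keep the winner. */
    int eager_branch(void) {
        int a, b;
        pthread_t ta, tb;
        pthread_create(&ta, NULL, run_taken, &a);      /* speculate both ways */
        pthread_create(&tb, NULL, run_not_taken, &b);
        bool cond = slow_condition();                  /* resolve the branch  */
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        return cond ? a : b;                           /* discard the loser   */
    }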
If a problem is parallelisable to that extent, are the cores actually hitting the same memory? (Presumably not, since you don't want multiple threads mutating the same memory).
So the problem is really that we have:
[lots of cores] <=chokepoint of memory bus=> [lots of ram]
when we could have:
k times:
[1/k of our cores] <=single mem bus=> [1/k of our RAM]
That increases our aggregate memory bandwidth by a factor of k. Of course, k has to be <= the number of cores, and the problem being solved needs to be partitionable into k chunks.
This is basically the clustering approach (individual proc+ram working on the problem), with the added advantage that we can leave some of the RAM unsplit so we get 'local' shared RAM for free.
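A minimal sketch of that k-way split, assuming k = 4 worker threads that each touch only their own slice of the data (pinning each group to its own memory bus is platform-specific and hand-waved away here):

    #include <pthread.h>
    #include <stddef.h>

    /* Each worker owns 1/k of the data; ideally that slice lives behind its
     * own memory bus so the workers don't fight over one chokepoint. */
    enum { K = 4 };

    struct slice { double *data; size_t len; double sum; };

    static void *work(void *arg) {
        struct slice *s = arg;
        s->sum = 0.0;
        for (size_t i = 0; i < s->len; i++)   /* only this slice's RAM is touched */
            s->sum += s->data[i];
        return NULL;
    }

    double partitioned_sum(double *data, size_t n) {
        pthread_t t[K];
        struct slice s[K];
        for (int k = 0; k < K; k++) {
            s[k].data = data + (n / K) * k;
            s[k].len  = (k == K - 1) ? n - (n / K) * k : n / K;
            pthread_create(&t[k], NULL, work, &s[k]);
        }
        double total = 0.0;
        for (int k = 0; k < K; k++) {
            pthread_join(t[k], NULL);
            total += s[k].sum;              /* the only cross-partition traffic */
        }
        return total;
    }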
No, the crossbar between the cores and memory controllers is not a problem. The problem is getting enough pins to attach more than four memory channels.
OK, thanks. So the problem is that transistor density is going up faster than pin density? Makes sense.
I guess one approach to that is a bigger die, so transistors per cm^2 stay fixed and the pin count can grow with the area? That would mean die sizes doubling with Moore's law, I guess. So you'd want other ways of packing them in (use the flip side of the mobo, or try 3D arrays and suffer the heat problems).
Or the cores come with attached (non-shared) RAM, which is of course where we are with adding cache.
Other random thoughts: why have memory busses stayed parallel when peripheral busses (SCSI, USB) have gone serial? Wouldn't that reduce the pin count per connection?
Can we avoid going to full 'macro' pins for the CPU-memory bus (and thus pack more pins into the same area for memory connections)? Instead have a smaller, denser collection of pins which are attached as a group to each memory connector?
Sorry for being clueless and thinking aloud, but it's an interesting problem.
I don't think the die size is a problem; AFAIK you can get very many pins out of a die, but getting them out of the package is where it gets expensive.
FB-DIMM is already a serial-style high-frequency interconnect, but it isn't needed on low-end systems.
I think this is really about business, not technology. What they want is not what the mainstream wants, so they must choose between cheap but memory starved systems or very expensive balanced systems. HPC people have been whining about killer micros for 20 years; this is just another version of it.
Did you actually read the article? How is it in any way related to the "five computers" thingy?
The article is talking about an architectural limitation not a business/consumer one.
Claiming limitations when it comes to technology usually proves short-sighted. The quotes about five computers, 640k of memory, etc, are not formally comparable but they give the anecdotal hint that limitations will always be beaten.
The point made in this article surrounds current architectures and their limitations. The conclusion that more than 16 cores makes no sense might well be a good conclusion for now, but "more than 16 cores may well be pointless" is by no means a conclusion for the long or even mid term.
AI hasn't lived up to the promises made a few decades ago.
* 1958, H. A. Simon and Allen Newell: "within ten years a digital computer will be the world's chess champion" and "within ten years a digital computer will discover and prove an important new mathematical theorem."[53]
* 1965, H. A. Simon: "machines will be capable, within twenty years, of doing any work a man can do."[54]
* 1967, Marvin Minsky: "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved."[55]
* 1970, Marvin Minsky (in Life Magazine): "In from three to eight years we will have a machine with the general intelligence of an average human being."[56]
I like how you found 2 people who support your argument, but there is a gap separating predictions from promises. Many classical definitions of AI have been surpassed, but because people are still better at a variety of tasks we say we don't have AI.
PS: A digital computer is the world's chess champion, or would be if we let them compete. Making a useful CAPTCHA is hard, but computers don't compose poetry, so we can still say we don't have AI.
Computer translation falls somewhere between a freshman HS language student and an expert. I am not going to trust it for diplomatic negotiations, but I have still used it for some verifiable tasks. Voice-to-text fits the same pattern: if you can't type then it's OK, but if you need a high level of accuracy then use a person. Which IMO describes the state of most computer AI: it's picking stocks and rejecting parts, but I still want a real doctor.
"The conclusion that more than 16 cores makes no sense might well be a good conclusion for now, but "more than 16 cores may well be pointless" is by no means a conclusion for the long or even mid term."
From the article:
"But, to my knowledge, these die-stacking schemes are further from down the road than the production of a mass-market processor with greater than 16 cores."
The article is pretty clear that "more than 16 cores may well be pointless" is for now.
The "five computers" claim was a reasonable one if one presumed that the size and cost problem for computer technology was insurmountable.
In the same way "more than 16 cores is pointless" is a reasonable claim if one presumes that the technical limitations in the article are insurmountable.
Given our past experience, no technological limitation is truly insurmountable.
Therefore, I'm willing to assign as much belief to this article as I do to the "five computers" claim.
You ask a good question, and I'm getting sick of people voting down posts (like yours) merely because they don't like them, rather than because they're actually wrong. That is not the HN way, but it's rapidly becoming so, sadly.
Join me in voting up all people who make sensible posts (even if you disagree with them) and who are being voted down by newbies abusing the points system :)
You make a good point. I was, perhaps wrongly, inferring an implicit attribution to the man who is usually accused of saying that. This makes me guilty of misattribution.
Assuming "stacking memory chips on top of the processor" means local stores for each processor, it's pretty much a given at this point. That's how GPUs work (~16k local), and that's how the PS3 SPUs are (256k local).
It's extraordinarily painful to code for, but hey, all performance optimization is an exercise in caching. So it goes.
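To give a flavor of "painful": you end up staging data through the local store by hand, something like the sketch below, where dma_get()/dma_put() are hypothetical stand-ins for the platform's real transfer primitives (SPU DMA, a GPU shared-memory copy, etc.):

    #include <stddef.h>
    #include <string.h>

    #define LOCAL_BYTES (256 * 1024)              /* PS3 SPU-sized local store */
    #define CHUNK       (LOCAL_BYTES / sizeof(float) / 2)

    /* Hypothetical transfer primitives; here just memcpy so the sketch compiles. */
    static void dma_get(void *local, const void *remote, size_t n) { memcpy(local, remote, n); }
    static void dma_put(void *remote, const void *local, size_t n) { memcpy(remote, local, n); }

    /* Stream a big array through a small on-chip buffer in chunks, instead of
     * letting a hardware cache do it for you. */
    void scale_in_place(float *big_array_in_ram, size_t n, float factor) {
        static float local[CHUNK];                /* lives in the local store */
        for (size_t off = 0; off < n; off += CHUNK) {
            size_t count = (n - off < CHUNK) ? n - off : CHUNK;
            dma_get(local, big_array_in_ram + off, count * sizeof(float));
            for (size_t i = 0; i < count; i++)    /* all work hits local memory */
                local[i] *= factor;
            dma_put(big_array_in_ram + off, local, count * sizeof(float));
        }
    }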
Yeah, essentially the processor cores start looking like ccNUMA boxes. The late 1990s called, they want their architecture back :-)
IMHO, it looks like we'll need some smarter memory bus management. If we're looking at the 1990s, anyone remember the crossbar switches SGI used to put in their short-lived x86 boxes? Thoughts on effectiveness?
That's what I was going to post. Use NUMA - each core gets its own memory. Doesn't Linux support NUMA?
It seems to me that each thread already works mostly within its own region of memory; just make sure that memory lives on the same CPU the thread runs on.
It isn't really necessary for each core to be able to access all memory (or at least it'll be way slower to access memory outside its area).
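Linux does support NUMA (libnuma, numactl, and the kernel's allocation policies). A minimal sketch using libnuma, assuming the machine actually has more than one node:

    #include <numa.h>      /* link with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>

    /* Pin the current thread to a node and allocate its working memory from
     * that same node, so core and data stay together. */
    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return EXIT_FAILURE;
        }
        int node = 0;                                   /* first NUMA node */
        size_t bytes = 64 * 1024 * 1024;

        numa_run_on_node(node);                         /* run on node 0's cores */
        double *buf = numa_alloc_onnode(bytes, node);   /* allocate node 0's RAM */
        if (!buf) return EXIT_FAILURE;

        for (size_t i = 0; i < bytes / sizeof(double); i++)
            buf[i] = (double)i;                         /* all accesses stay local */

        numa_free(buf, bytes);
        return EXIT_SUCCESS;
    }

By default Linux already uses first-touch allocation, so memory tends to land on the node of the thread that first writes it; the explicit calls just make the intent obvious.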
Difficult to program for in what language? It sounds like a perfect fit for the Actor model and languages, like Erlang, that directly support the idea of partitioned memory and asynchronous communication.
Well, the basic premise of the article is common sense. The memory needs to feed the processor instructions and data.
If the clock speed and number of cores increases at a faster rate than memory speed increases (assuming all the cores share memory) then at some point the memory can't keep up.
Whether or not 16 cores is the magic number, I don't know.
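Back-of-envelope version of that, with made-up but plausible numbers (the per-core traffic and per-channel bandwidth are pure assumptions):

    #include <stdio.h>

    /* How much bandwidth do N cores want vs. what the memory channels deliver? */
    int main(void) {
        double ghz           = 3.0;   /* core clock                                   */
        double bytes_per_clk = 1.0;   /* assumed off-chip traffic per core, per clock */
        double chan_gb_s     = 12.8;  /* e.g. one DDR3-1600 channel                   */
        int    channels      = 3;     /* triple-channel i7-style setup                */

        double supply = channels * chan_gb_s;
        for (int cores = 4; cores <= 32; cores *= 2) {
            double demand = cores * ghz * bytes_per_clk;   /* GB/s */
            printf("%2d cores want %5.1f GB/s, channels supply %5.1f GB/s%s\n",
                   cores, demand, supply, demand > supply ? "  <-- starved" : "");
        }
        return 0;
    }

With these particular numbers the wall shows up somewhere between 8 and 16 cores, but shift the assumptions and the "magic number" moves with them.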
You're assuming a linear relationship between memory accesses and the number of cores, but the size of the L2/L3 cache mitigates the problem to some degree. We already have chips with 12+ MB of cache, which is plenty of RAM to run Windows 3.11. The real question is what type of workload you have and how well that plays with the number of cores you're using, etc.