Analysis: more than 16 cores may well be pointless (arstechnica.com)
35 points by alexandros on Dec 7, 2008 | 46 comments



The UltraSparc T2 has 8 cores and each core can run 8 concurrent threads, so you have 64 concurrent threads running "simultaneously" on a single chip: http://en.wikipedia.org/wiki/UltraSPARC_T2 One of the techniques they use to solve the memory bandwidth problem is to have four memory controllers.


Yeah, but the way it does it is by switching to another thread when the current one stalls on memory.

It's brilliant, and seems effective, but it's still just 8 cores per socket.


This is useful for getting around latency issues, which lets you use more of your memory bandwidth before stalling a core.


Why not do for memory controllers the same thing we are doing for cores? If 8 cores is the practical limit of a single memory controller, why not ship 16-core CPUs with 2 independent controllers, one for each set of eight cores?

True, the two groups of cores would probably not be able to share memory (at least not at full speed), but that's something the OS scheduler and memory manager could take care of.

For one, there are plenty of pins. The Core i7 socket nearly doubled the pin count (1366 now, up from 775), yet a QuickPath Interconnect link (Intel's replacement for the FSB) only takes 84 pins. Surely they can squeeze those in.


I would guess that a high-speed on-chip cache would be a better use of real estate.

Going with multiple memory controllers means that performance would still probably be pretty variable, depending on how the controllers are mapped to your physical addresses and how your data is laid out.


In Core 2, the FSB was the CPU's main connection to the outside world, including the I/O bridges and busses and the off-chip memory controller, and thus to RAM.

That's not true for i7. QPI connects to the I/O bridges and busses, but the on-chip memory controller connects to RAM directly.

And I would guess that the memory controller uses a ton of pins, and is the reason why the triple-channel i7 has such a high pin count. The upcoming consumer-grade dual-channel model will have something like 200 fewer pins.


Memory latency problems can be worked around with pipelined requests, if you have enough concurrency in the CPU; memory bandwidth problems can be worked around with more independent banks of memory and more memory channels to the CPU, which means more wires. Both can be worked around by putting more memory locally.

Absent such workarounds, it's true that there are some problems (large-mesh finite element analysis, maybe) that more cores won't help with, and other problems that exhibit enough locality (large cellular automata) that more cores will help with. This article observes that some problems are in the first category. It would be absurd to claim that no problems are in the second category.


Perhaps this would be an extremely inefficient use of chip real estate, but what if extra cores were used to execute the same set of instructions in parallel with both outcomes of conditional branches (i.e. completely replace branch prediction) and take whichever one ends up being correct?

On second thought, it's probably not worth it. It seems modern branch predictors are at least 90% accurate (http://en.wikipedia.org/wiki/Branch_predictor), and only a few percent of instructions are conditional branches anyway (http://bloggablea.wordpress.com/2007/04/27/so-does-anyone-ev...).
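As a rough sanity check on that "not worth it" conclusion, here's a back-of-the-envelope sketch in C. It only uses the two figures above (90% prediction accuracy, a few percent of instructions being branches); the misprediction penalty and the baseline CPI of 1 are my own assumptions:

    /* Illustrative estimate of what eagerly executing both branch paths
       could buy over a predictor. Every constant is an assumption. */
    #include <stdio.h>

    int main(void)
    {
        double branch_fraction   = 0.05;  /* "a few percent" of instructions are branches */
        double predictor_hit     = 0.90;  /* "at least 90% accurate" */
        double mispredict_cycles = 15.0;  /* assumed pipeline-flush cost */

        /* extra cycles per instruction lost to mispredictions today */
        double penalty_cpi = branch_fraction * (1.0 - predictor_hit) * mispredict_cycles;

        /* best case: eager execution removes the penalty entirely (base CPI assumed 1.0) */
        printf("penalty: %.3f cycles/instruction, max speedup ~%.0f%%\n",
               penalty_cpi, 100.0 * penalty_cpi / (1.0 + penalty_cpi));
        return 0;
    }

With those numbers the entire misprediction penalty is only a few percent of total cycles, which is the ceiling on what burning extra cores on both paths could ever recover.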


If a problem is parallelisable to that extent, are the cores actually hitting the same memory? (Presumably not, since you don't want multiple threads mutating the same memory).

So the problem is really that we have:

[lots of cores] <=chokepoint of memory bus=> [lots of ram]

when we could have:

k times: [1/k of our cores] <=single mem bus=> [1/k of our RAM]

That increases our aggregate memory bandwidth by a factor of k. k has to be <= the number of cores, and the problem being solved needs to be partitionable into k chunks.

This is basically the clustering approach (individual proc+ram working on the problem), with the added advantage that we can leave some of the RAM unsplit so we get 'local' shared RAM for free.


No, the crossbar between the cores and memory controllers is not a problem. The problem is getting enough pins to attach more than four memory channels.


OK, thanks. So the problem is that transistor density is going up faster than pin density? Makes sense.

I guess one approach to that is a bigger die, keeping transistors/cm^2 fixed? That would mean die sizes doubling with Moore's law. So you'd want other ways of packing them in (use the flip side of the mobo, try 3D arrays and suffer the heat problems).

Or the cores come with attached (non-shared) RAM, which is of course where we are with adding cache.

Other random thoughts: why have memory buses stayed parallel when peripheral buses (SCSI, USB) have gone serial? Wouldn't that reduce the pin count per connection?

Can we avoid going to full 'macro' pins for the CPU-memory bus (and thus pack more pins into the same area for memory connections)? Instead have a smaller, denser collection of pins which are attached as a group to each memory connector?

Sorry for being clueless and thinking aloud, but it's an interesting problem.


I don't think the die size is a problem; AFAIK you can get very many pins out of a die, but getting them out of the package is where it gets expensive.

FB-DIMM is already a serial-style high-frequency interconnect, but it isn't needed on low-end systems.

I think this is really about business, not technology. What HPC users want is not what the mainstream wants, so they must choose between cheap but memory-starved systems and very expensive balanced systems. HPC people have been whining about killer micros for 20 years; this is just another version of it.


I think there is a world market for maybe five computers.


Did you actually read the article? How is it in any way related to the "five computers" quote? The article is talking about an architectural limitation, not a business/consumer one.


Claiming limitations when it comes to technology usually proves short-sighted. The quotes about five computers, 640k of memory, etc, are not formally comparable but they give the anecdotal hint that limitations will always be beaten.

The point made in this article surrounds current architectures and their limitations. The conclusion that more than 16 cores makes no sense might well be a good conclusion for now, but "more than 16 cores may well be pointless" is by no means a conclusion for the long or even mid term.


"they give the anecdotal hint that limitations will ALWAYS be beaten."

Don't you mean usually? I haven't seen the big breakthroughs in AI that were expected. I haven't seen a solution to the halting problem, etc.


Because, by definition, when an AI breakthrough is achieved, it's no longer AI...


That's a good soundbite, but it's nonsense.

AI hasn't lived up to the promises made a few decades ago.

* 1958, H. A. Simon and Allen Newell: "within ten years a digital computer will be the world's chess champion" and "within ten years a digital computer will discover and prove an important new mathematical theorem."[53]

* 1965, H. A. Simon: "machines will be capable, within twenty years, of doing any work a man can do."[54]

* 1967, Marvin Minsky: "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved."[55]

* 1970, Marvin Minsky (in Life Magazine): "In from three to eight years we will have a machine with the general intelligence of an average human being."[56]


I like how you found two people who support your argument, but there is a gap separating predictions from promises. Many classical definitions of AI have been surpassed, but because people are still better at a variety of tasks we say we don't have AI.

PS: A digital computer is the world's chess champion, or would be if we let one compete. Making a useful CAPTCHA is hard, but computers don't compose poetry, so we can still say we don't have AI.


Poetry is a bad example. A computer still can't translate from one language to another at a level anyone would trust for something important.


Computer translations fall somewhere between a freshman high-school language student and an expert. I am not going to trust them for diplomatic negotiations, but I have still used them for some verifiable tasks. Voice-to-text fits the same pattern: if you can't type it's fine, but if you need a high level of accuracy, use a person. Which, IMO, describes the state of most computer AI: it's picking stocks and rejecting parts, but I still want a real doctor.


Exactly: AI hasn't lived up to the promises, which was my original point. Remember, I was responding to this quote:

"they give the anecdotal hint that limitations will ALWAYS be beaten."

Limitations will not always be beaten.


Perhaps I should have said numerical limitations or limitations of scale, rather than limitations of technique.


"The conclusion that more than 16 cores makes no sense might well be a good conclusion for now, but "more than 16 cores may well be pointless" is by no means a conclusion for the long or even mid term."

From the article: "But, to my knowledge, these die-stacking schemes are further from down the road than the production of a mass-market processor with greater than 16 cores."

The article is pretty clear that "more than 16 cores may well be pointless" applies only for now.


The "five computers" claim was a reasonable one if one presumed that the size and cost problem for computer technology was insurmountable.

In the same way "more than 16 cores is pointless" is a reasonable claim if one presumes that the technical limitations in the article are insurmountable.

Given our past experience, no technological limitation is truly insurmountable.

Therefore, I'm willing to place as much belief in this article as I do in the "five computers" claim.


research result != hunch


640K ought to be enough for anybody.


All 9000 of your misattributions are belong to Kefka.


How can this be a misattribution if he didn't attribute it to anyone?


You ask a good question, and I'm getting sick of people voting down posts (like yours) that they merely don't like, rather than posts that are actually wrong. That is not the HN way, but it's rapidly becoming so, sadly.

Join me in voting up all people who make sensible posts (even if you disagree with them) and who are being voted down by newbies abusing the points system :)


You make a good point. I was, perhaps wrongly, inferring an implicit attribution to the man who is usually accused of saying it. This makes me guilty of misattribution.


Yaaooouch! Seafood soup is NOT on the menu!


How does this development affect the issue then?

http://news.ycombinator.com/item?id=389857


Assuming "stacking memory chips on top of the processor" means local stores for each processor, it's pretty much a given at this point. That's how GPUs work (~16k local), and that's how the PS3 SPUs are (256k local).

It's extraordinarily painful to code for, but hey, all performance optimization is an exercise in caching. So it goes.
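For concreteness, this is roughly the shape such code takes: a minimal C sketch of the explicit staging a small local store forces on you. The sizes are arbitrary, and the synchronous memcpy() calls stand in for what would be asynchronous DMA on a real SPU:

    /* Sketch of the "local store" style: stage a chunk in, work on it, stage it out. */
    #include <string.h>
    #include <stddef.h>

    #define LOCAL_BYTES (256 * 1024)                    /* assume an SPU-sized local store */
    #define CHUNK (LOCAL_BYTES / sizeof(float) / 2)     /* leave room for code and stack */

    static float local[CHUNK];                          /* stand-in for on-chip memory */

    void scale_array(float *big, size_t n, float k)
    {
        for (size_t base = 0; base < n; base += CHUNK) {
            size_t len = (n - base < CHUNK) ? (n - base) : CHUNK;

            memcpy(local, big + base, len * sizeof(float));  /* "DMA get" */
            for (size_t i = 0; i < len; i++)
                local[i] *= k;                               /* compute on local data only */
            memcpy(big + base, local, len * sizeof(float));  /* "DMA put" */
        }
    }

The pain comes from having to choreograph those transfers by hand (and overlap them with compute) instead of letting a cache do it for you.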


Yeah, essentially the processor cores start looking like ccNUMA boxes. The late 1990s called, they want their architecture back :-)

IMHO, it looks like we'll need some smarter memory bus management. If we're looking at the 1990s, anyone remember the crossbar switches SGI used to put in their short-lived x86 boxes? Thoughts on effectiveness?


That's what I was going to post. Use NUMA - each core gets its own memory. Doesn't Linux support NUMA?

It seems to me that each thread already works mostly on its own memory; just make sure that memory is on the same CPU the thread runs on.

It isn't really necessary for each core to be able to access all memory (or at least it'll be way slower to access memory outside its area).
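Linux has had NUMA support for a while (libnuma / numactl). Here is a minimal sketch of the "keep a thread's memory on its own node" idea; the node number and buffer size are just placeholders:

    /* Pin the calling thread to one NUMA node and allocate its working
       memory from that same node (link with -lnuma). Illustrative only. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support here\n");
            return 1;
        }

        int node = 0;                               /* assume we want node 0 */
        size_t size = 64 * 1024 * 1024;             /* 64 MB working set */

        numa_run_on_node(node);                     /* only run on this node's CPUs */
        char *buf = numa_alloc_onnode(size, node);  /* pages physically on this node */
        if (!buf)
            return 1;

        for (size_t i = 0; i < size; i++)           /* all accesses stay node-local */
            buf[i] = (char)i;

        numa_free(buf, size);
        return 0;
    }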


Difficult to program for in what language? It sounds like a perfect fit for the Actor model and languages, like Erlang, that directly support the idea of partitioned memory and asynchronous communication.


Local store vs. ccNUMA is orthogonal to stacking vs. non-stacking. The local store architecture looks like an evolutionary dead end at this point.


Don't supercomputers give all CPUs shared, direct access to all of the memory?

Perhaps it is time for home desktops to adopt the supercomputer architecture!


Alright, I'm not a hardware engineer; I don't even like assembly.

Can somebody explain how mainframe DMA is different from the DMA in your home PC?

I know mainframes have more than 16 CPUs. The memory problem we're talking about here does not apply to them, right?


Why stack? Just make it an onion.


"640K ought to be enough for anybody." - (Never actually said by) Bill Gates


That quote is irrelevant, except in the sense that both the quote and the article's title are misleading.

The real title should have been "With the current architectures and/or memory speeds, more than 16 cores may well be pointless".


"may or may not be pointless, becasue author couldn't find the data to backup his point".

There, FTFY.


Well, the basic premise of the article is common sense. The memory needs to feed the processor instructions and data.

If the clock speed and number of cores increases at a faster rate than memory speed increases (assuming all the cores share memory) then at some point the memory can't keep up.

Whether or not 16 cores is the magic number, I don't know.
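Here's the back-of-the-envelope version of that, with every constant an assumption picked purely for illustration (3 GHz cores that each want 2 bytes of off-chip data per cycle when streaming, roughly 25 GB/s of total memory bandwidth):

    /* How many cores can the memory system feed, as a function of how much
       traffic the caches absorb? All numbers are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 3.0e9;   /* per-core clock */
        double bytes_per_cycle = 2.0;     /* off-chip demand per core, uncached */
        double mem_bandwidth   = 25.0e9;  /* total memory bandwidth, bytes/s */

        double hit_rates[] = { 0.0, 0.5, 0.9, 0.99 };
        for (int i = 0; i < 4; i++) {
            double per_core = clock_hz * bytes_per_cycle * (1.0 - hit_rates[i]);
            printf("cache hit rate %.0f%%: memory keeps up with ~%.0f cores\n",
                   100.0 * hit_rates[i], mem_bandwidth / per_core);
        }
        return 0;
    }

With these made-up numbers the crossover is a handful of cores for streaming workloads and dozens once caches absorb most of the traffic, which is exactly why the "magic number" is hard to pin down.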


You're assuming a linear relationship between memory traffic and the number of cores, but the size of the L2/L3 cache mitigates the problem to some degree. We already have chips with 12+ MB of cache, which is plenty of RAM to run Windows 3.11. The real question is what type of workload you have and how well that plays with the number of cores you're using, etc.



