Intel shows 8 core 528 thread processor with silicon photonics (servethehome.com)
496 points by NavinF on Aug 30, 2023 | 199 comments



66 threads per core looks more like a barrel processor than anything else. We shouldn’t expect those threads to be very fast, but we can assume that, if the processor has enough work, it should be able to be doing something useful most of the time (rather than waiting for memory).


I don’t know if Intel has the appetite to attempt another barrel processor.

The primary weakness of barrel processors is human; only a handful of people grok how to design codes that really exploit their potential. They look deceptively familiar at a code level, because normal code will run okay, but won’t perform well unless you do things that look extremely odd to someone that has only written code for CPUs. It is a weird type of architecture to design data structures and algorithms for and there isn’t a lot of literature on algorithm design for barrel processors.

I love barrel processors and have designed codes for a few different such architectures, starting with the old Tera systems, and became quite good at it. In the hands of someone that knows what they are doing I believe they can be more computationally efficient than just about any other architecture given a similar silicon budget for general purpose computing. However, the reality is that writing efficient code for a barrel processor requires carrying a much more complex model in your head than the equivalent code on a CPU; the economics favors architectures like CPUs where an average engineer can deliver adequate efficiency. At this point, I’ve given up on the idea that I’ll ever see a mainstream barrel processor, despite their strengths from a pure computational efficiency standpoint.


With the advent of GPU coding and the number of people now exposed to it, there is a chance that enough programmers are willing and able to try an unfamiliar architecture when it gives them an advantage, so this just might be viable now.


What kind of advantage do barrel processors give a regular programmer (especially for things that can't be done efficiently on a CPU/GPU)?


Much closer communication lines between the two, thread level guarantees that would be hard to match otherwise. A barrel processor will 'hit' the threads far more frequently on tasks that would be hard to adapt to a GPU (so for instance, more branching).

Though CPU/GPU combinations are slowly moving in that direction anyway. Incidentally, if this sort of thing really interests you: when I was playing around with machine learning, my trick to see how efficiently my code was using the GPU was really simple. First I ran a graphics benchmark that maxed out the GPU and measured power consumption. Then I did the same for just the CPU. Afterwards, while running my own code, I'd compare its power draw against the maximums obtained during the benchmarks. This gave a pretty good indication of whether or not I had made some large mistake, and it showed a nice and steady increase with every optimization step.
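For anyone who wants to try that power-ratio heuristic, here is a minimal sketch of the idea, assuming an NVIDIA GPU with the stock nvidia-smi tool on the PATH; the benchmark figure and the sampling window are placeholders you would measure and choose yourself:

  /* Rough heuristic: sample GPU power draw while your code runs and
   * compare it to the draw measured during a saturating benchmark. */
  #include <stdio.h>
  #include <unistd.h>

  static double sample_power_watts(void) {
      FILE *p = popen("nvidia-smi --query-gpu=power.draw "
                      "--format=csv,noheader,nounits", "r");
      if (!p) return -1.0;
      double watts = -1.0;
      if (fscanf(p, "%lf", &watts) != 1) watts = -1.0;
      pclose(p);
      return watts;
  }

  int main(void) {
      const double benchmark_watts = 280.0; /* placeholder: max draw measured
                                               while a graphics benchmark
                                               saturated the GPU */
      double sum = 0.0;
      int n = 0;
      for (int i = 0; i < 30; i++) {        /* sample roughly once a second */
          double w = sample_power_watts();
          if (w > 0.0) { sum += w; n++; }
          sleep(1);
      }
      if (n > 0)
          printf("average draw: %.1f W (%.0f%% of benchmark max)\n",
                 sum / n, 100.0 * (sum / n) / benchmark_watts);
      return 0;
  }

A ratio that stays well below what the benchmark achieved is a hint that the GPU is being underfed, exactly the "large mistake" signal described above.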


Could you not also see an impact on run times? If algorithm A draws 100 W and runs in 10 seconds, and algorithm B draws 200 W… shouldn't it finish in well under 10 seconds to warrant being called a better algorithm?


Not really, it's an apples-to-oranges comparison. If you ran the same distributed algorithm on a single core you wouldn't see the same speed improvements. These chips are ridiculously efficient because they need far fewer gates to accomplish the same purpose if you have programmed to them specifically. Just like you couldn't simulate a GPU for the same power budget on a CPU. The loss of generality is more than made up for by the increase in efficiency. This is very similar in that respect, but with the caveat that a GPU is even more specialized and so even more efficient.

Imagine what kind of performance you could get out of hardware that is task specific. That's why for instance crypto mining went through a very rapid set of iterations: CPU->GPU->ASIC in a matter of a few years with an extremely brief blip of programmable hardware somewhere in there as well (FPGA based miners, approximately 2013).

Any loss of generality can be traded for a gain in efficiency and vice versa; the question is whether or not it is economically feasible, and there are different points on that line that have resulted in marketable (and profitable) products. But there are also plenty of wrecks.


1. Thread Switching:
   • Hardware Mechanisms: Specific circuitry used for rapid thread context switching.
   • Context Preservation: How the state of a thread is saved/restored during switches.
   • Overhead: The time and resources required for a thread switch.
2. Memory Latency and Hiding:
   • Prefetching Strategies: Techniques to load data before it's actually needed.
   • Cache Optimization: Adjusting cache behavior for optimal thread performance.
   • Memory Access Patterns: Sequences that reduce wait times and contention.
3. Parallelism:
   • Dependency Analysis: Identifying dependencies that might hinder parallel execution.
   • Fine vs. Coarse Parallelism: Granularity of tasks that can be parallelized.
   • Task Partitioning: Dividing tasks effectively among available threads.
4. Multithreading Models:
   • Static vs. Dynamic Multithreading: Predetermined vs. on-the-fly thread allocation.
   • Simultaneous Multithreading (SMT): Running multiple threads on one core.
   • Hardware Threads vs. Software Threads: Distinguishing threads at the CPU vs. OS level.
5. Data Structures:
   • Lock-free Data Structures: Structures designed for concurrent access without locks.
   • Thread-local Storage: Memory that's specific to a thread.
   • Efficient Queue Designs: Optimizing data structures like queues for thread communication.
6. Synchronization Mechanisms:
   • Barrier Synchronization: Ensuring threads reach a point before proceeding.
   • Atomic Operations: Operations that complete without interruption.
   • Locks: Techniques to avoid common pitfalls like deadlocks and contention.
7. Stall Causes and Mitigation:
   • Instruction Dependencies: Managing data and control dependencies.
   • IO-bound vs. CPU-bound: Balancing IO and CPU workloads.
   • Handling Page Faults: Strategies for minimal disruption during memory page issues.
8. Instruction Pipelining:
   • Pipeline Hazards: Situations that can disrupt the smooth flow in a pipeline.
   • Out-of-Order Execution: Executing instructions out of their original order for efficiency.
   • Branch Prediction: Guessing the outcome of conditional operations to prepare.

These are some of the complications related to barrel processing.


people aren't really doing GPU coding; they're calling CUDA libraries. As with other Intel projects this needs dev support to survive and Intel has not been good at providing it. Consider the fate of the Xeon Phi and Omnipath.


Huh, that is a solid point. How big a fraction would you estimate is actually capable of doing real GPU coding?


I don't think I have a broad enough perspective of the industry to assess that. In my corner of the world (scientific computing) it's probably three quarters simple use of accelerators and one quarter actual architecture based on the GPU itself.


There are two single-threaded cores, but the four multithreaded cores are indeed Tera-style barrel processors. The Tera paper is even cited directly in the PIUMA white paper: https://arxiv.org/abs/2010.06277

edit: s/eight/four


Unfortunate name. Anyone familiar with the ways PUMA was misapplied in the past might reflexively read that as "Peeeuuw!ma".


What are the kinds of challenges to write things efficiently?

From skimming Wikipedia, it looks like a big challenge is cache pollution. Is it possible that the hit to cache locality is what inhibits uptake? After all, most threads in the OS are sitting idle doing nothing, which means you’re penalized for any “hot code” that’s largely serial (ie typically you have a small number of “hot” applications unless your problem is embarrassingly parallel)


> What are the kinds of challenges to write things efficiently?

The challenge is mostly that you have to create enough fine-grained parallelism and that per-thread performance is relatively low. Amdahl's law is in full effect here; a sequential part is going to bite you hard. That's why each die on this chip has two sequential-performance cores.
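To put a rough number on that (the 1% sequential fraction below is purely an illustrative assumption, not a figure from the talk):

  S(N) = 1 / (s + (1 - s) / N)                        -- Amdahl's law
  S(528) with s = 0.01:  1 / (0.01 + 0.99 / 528) ≈ 84

So even 1% of sequential work caps a 528-thread part at roughly an 84x speedup, no matter how well the parallel portion scales.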

The graph problems this processor is designed to handle have plenty of parallelism and most of the time those threads will be waiting for (uncached) 8B DRAM accesses.

> From skimming Wikipedia, it looks like a big challenge is cache pollution

This processor has tiny caches and the programmer decides which accesses are cached. In practice, you cache the thread's stack and do all the large graph accesses un-cached, letting the barrel processor hide the latency. There are very fast scratchpads on this thing for when you do need to exploit locality.


Is it then similar to how utilizing a GPU fully is much harder than utilizing a CPU fully?


In some aspects, yes. Arguably, it's much harder to max GPUs because they have the added difficulty of scheduling and executing by blocks of threads. If not all N threads in a block have their input data ready or are branching differently, you are idling some execution units.


It is not hard at all to fully utilize a GPU with a problem that maps well onto that type of architecture. It's impossible to fully utilize a GPU with a problem that does not.

A barrel processor would make branching more efficient than on a GPU, at the cost of throughput. The set of problems that are economically interesting and that would strongly profit from that is rather small, hence these processors remain niche. Conversely, the incentive to have problems that map well to a GPU is higher, because they are cheap and ubiquitous.


> The primary weakness of barrel processors is human; only a handful of people grok how to design codes that really exploit their potential. They look deceptively familiar at a code level, because normal code will run okay, but won’t perform well unless you do things that look extremely odd to someone that has only written code for CPUs.

Nowadays, apps that run straight on baremetal servers are the exception instead of the norm. Some cloud applications favour tech stacks based on high level languages designed to run single-threaded processes on interpreters that abstract all basic data structures, and their bottleneck is IO throughput instead of CPU.

Even if this processor is not ideal for all applications, it might just be the ideal tool for cloud applications that need to handle tons of connections and stay idling while waiting for IO operations to go through.


The new Mojo programming language is attempting to deal with such issues by raising the level of abstraction. The compiler and runtime environment are supposed to automatically optimize for major changes in hardware architecture. In practice, though, I don't know how well that could work for barrel processors.

https://www.modular.com/mojo


But how does it compare to GPUs for compute tasks (e.g. CUDA)? Both performance and difficulty wise?


Using OpenMP is often enough "to really exploit their potential" "to deliver adequate efficiency."

Which is what "average engineer" would use.


If Intel was very bullish on AI, they might be tempted to design arbitrarily complicated architectures and just trust that GPT-6 will be able to write code for it.

This was a bet that completely failed for Itanium, but maybe this time...


Is GPT-6 the proverbial magic wand that makes all your dreams come true? Because you just hand waved away all the complexity just by name dropping it. Let's get back to Earth for one second...


I think the hand-waving was the point. It was a dig at Intel, accusing them of not having a solid plan.


How could GPT-(arbitrarily large number) learn to write code for an architecture for which there is no training data?


By grokking multi threaded programming of course.


From reading the documentation in the context.


There's an obscene amount of chess literature, yet ChatGPT is mediocre and makes elementary mistakes (invalid moves).

This would work only if ChatGPT 6 is AGI, but then it doesn't make much sense to label it as ChatGPT.


If an AGI is a GPT and you can chat with it why wouldn't you call it ChatGPT?


Yeah, but that's a big IF.

If the actual requirement for this is AGI, then it's clearer to just say AGI instead of ChatGPT which might or might not ever become AGI.


Well... This is a research project, and they aren't betting the farm on it. It's not even x86-compatible.

And then, it'd require a lot of software rewriting, because we are not used to writing for hundreds of threads; a context switch is a very expensive operation on modern CPUs. On this CPU a context switch is fast and happens on any operation that makes the CPU wait for a memory access, so thinking in terms of hundreds of threads pays off. But, again, this will need some clever OS design to hide everything under an API existing programs can recognize.

It may even be that nothing comes out of it except another lesson on how not to build a computer.


Why do we need x86 compatibility? x86 chips are already fast.

Existing programs don't need to see these things at all, they can just act like microservices, and we could load them with an image and call them as black boxes.

I would think a lot more work would be done by various accelerator cards by now.


They probably aren't insanely bullish. GPT is a revolution but it's a revolution because it can do "easy" general tasks for the first time, which makes it the first actually useful general purpose AI.

But it still can't really design things. It's probably about as far away from being able to design a complex CPU architecture as GPT-4 is from Eliza.


Nvidia released a toolkit to assist chip design a few months ago:

https://techmonitor.ai/technology/ai-and-automation/nvidia-a...

Google has been working on AI's to optimize code:

https://blog.research.google/2022/07/mlgo-machine-learning-f...

> But it still can't really design things.

What do you really mean by "really" in that sentence?

In any case, the claim you responded to was not that the chips would be designed from the ground up by AI, only that AI will enable us to run code on top of chips that are even more complex than current chips.


The Nvidia toolkit you linked is for place & route, which is a much easier task than designing the actual architecture and microarchitecture. It's similar to laying out a PCB.

The Google research you linked is using AI to make better decisions about when to apply optimisations. For example when do you inline a function? Which register do you spill? Again this is a very simple (conceptually) low level task. Nothing like writing a compiler for example.

> What do you really mean by "really" in that sentence?

I mean successfully create complex new designs from scratch.


I really wouldn't say it's an easier task at all. It's so intense it's essentially impossible for a human to do from scratch. But it is "just" an optimization problem which is different from designing an architecture.


I doubt Intel leadership is that naive.


And a bet they didn't even know they made for the iAPX 432. The Ada compiler was awful optimization-wise, but that came out in an external research paper too late to save it.


So the mill is going to be released then, right? Right?

Not too psyched to use some proprietary black-box compiler, though.


They have been so careful not to leak anything that there is very little interest in it. I completely forgot about them and, if they never launch, I doubt anyone would notice.


Why not just have GPT-6 design the chip to start with?


Would you call that "general purpose" still? If the code needs to be so special to leverage the benefits of that design?


In a way, yes - it's no different from scalar CPUs from back when they didn't need to wait for memory. GPUs are specifically designed to operate on vectors.


64 of the 66 threads are slow threads, where each group of 16 threads shares one set of execution units, and all 64 threads share a scratchpad memory and the caches.

This part of each core is very similar to the existing GPUs.

What is different in this experimental Intel CPU, and unlike any previous GPU or CPU, is that each core, besides the GPU-like part, also includes 2 very fast threads, with out-of-order execution and a much higher clock frequency than the slow threads. Each of the 2 fast threads has its own non-shared execution units.

Taken separately, the 2 fast threads and the 64 slow threads are very similar to older CPUs or GPUs, but their combination into a single core with shared scratchpad memory and cache memories is novel.


> Taken separately, the 2 fast threads and the 64 slow threads are very similar to older CPUs or GPUs, but their combination into a single core with shared scratchpad memory and cache memories is novel.

Getting some Cell[1] vibes from that, except in reverse I guess.

[1]: https://en.wikipedia.org/wiki/Cell_(processor)


I'm far from a CPU or architecture expert but the way you describe it this CPU reminds me a bit of the Cell from IBM, Sony, and Toshiba. Though, I don't remember if the SPEs had any sort of shared memory in the Cell.


While there are some similarities with the Sony Cell, the differences are very significant.

The PPE of the Cell was a rather weak CPU, meant for control functions, not for computational tasks.

Here the 2 fast threads are clearly meant to execute all the tasks that cannot be parallelized, so they are very fast; according to Intel they are eight times faster than the slow threads, so the 2 fast threads concentrate 20% of the processing capability of a core, with only 80% provided by the other 64 threads.

It can be assumed that the power consumption of the 2 fast threads is much higher than that of the slow threads. It is likely that the 2 fast threads consume alone about the same power as all the other 64 threads, so they will be used at full-speed only for non-parallelizable tasks.

The second big difference was that in the Cell the communication between the PPE and the many SPEs was awkward, while here it is trivial, as all the threads of a core share the cache memories and the scratchpad memory.


The SPEs only had individual scratchpad memory that was divorced from the traditional memory hierarchy. You needed to explicitly transfer memory in and out.


So this is a processor where you would have 97% of the threads doing some I/O like task? But that can't be disk I/O, so that would leave networking?


DRAM is the new I/O. So yes, this is designed to handle 97% of the threads doing constant bad-locality DRAM accesses.


And with this, DRAM access becomes the new asynchronous IO.


Neat insight

The protocols for HPC are so amorphous that they bubbled up into the lowest common denominator: a completely software-defined async global workspace.


I think generally the threads are spending a lot of time waiting on memory. It can take >100 cycles to get something from ram so you could have all your threads try to read a pointer and still have computation to spare until the first read comes back from memory.

It could be that, e.g., 97% of your threads are looking things up in big hash tables (e.g. computing a big join for a database query) or binary-searching big arrays, rather than doing ‘some I/O task’.
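Roughly speaking it is a Little's-law argument; the numbers below are illustrative assumptions, not figures from the presentation:

  threads needed ≈ memory latency × desired access rate
  e.g. ~100 ns miss latency × (1 access per ~2 ns of per-thread work) ≈ 50 accesses in flight

So on the order of 50-100 hardware threads per core, each with one access outstanding, is roughly what it takes to keep the memory system busy when nearly every access misses.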


Let's say that I can get something from RAM in 100 cycles. But if I have 60 threads all trying to do something with RAM, I can't do 60 RAM accesses in that 100 cycles, can I? Somebody's going to have to wait, aren't they?


this would work really well with Rambus-style async memory if it ever got out from under the giant pile of patents

the 'plus' side here is that that condition gets handled gracefully, but yes, certainly you can end up in a situation where memory transactions per second are the bottleneck.

it's likely more advantageous to have a lot of memory controllers and DDR interfaces here than a lot of banks on the same bus. but that's a real cost and pin issue.

the MTA 'solved' this by fully dissociating the memory from the CPU with a fabric

maybe you could do the same with CXL today


I’m not exactly sure what you mean. RAM allows multiple reads to be in flight at once but I guess won’t be clocked as fast as the cpu. So you’ll have to do some computation in some threads instead of reads. Peak performance will have a mix of some threads waiting on ram and others doing actual work.


This processor is out of my league, but do you have any idea how a program would use that optimally? How do you code for that?


> But that can't be disk I/O, so that would leave networking?

Networking is a huge part of cloud applications, and network connections take orders of magnitude longer to go through than disk access.

There are components of any cloud architecture which are dedicated exclusively to handling networking. Reverse proxies, ingress controllers, API gateways, message broker handlers, etc etc etc. Even function-as-a-service tasks heavily favour listening and reacting to network calls.

I dare say that pure horsepower is no longer driving demand for servers. The ability to shove as many processes and threads as possible onto a single CPU is by far the thing that cloud providers and on-prem companies seek.


Reminds me of Tera, the original SMT. 128 threads per core in 1990!

https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...

(I’d put up money that the DARPA project that funded this work is from the same lineage of TLA interest that got Tera enough money to buy Cray!)


Following the history here, DARPA also funded work in ~2005 on a project called Monarch, which I saw presented at a Google tech talk back then. I believe it refers to the "butterfly" architecture of very scalable interconnects of lightweight processing units.

Bill Dally was working on the networking side (maybe equiv to the photonics interconnect here) and almost got it up and running at BBN (search for "monarch") http://franki66.free.fr/Principles%20and%20Practices%20of%20...

Here's some refs for the chip https://slideplayer.com/slide/7739558/ https://viterbischool.usc.edu/news/2007/03/monarch-system-on...

6 main RISC processors with 96 ALUs for the lightweight IO/compute processes


Their multithreaded cores are a similar design, yes. It does not do the XMT's in-hardware full/empty-bit memory access system, though.


Feels a bit like async on the programming language side. Just replace "waiting for memory" with "waiting for IO" and you're almost there.


> Just replace "waiting for memory" with "waiting for IO" and you're almost there.

You can see the latter in your code, you can’t—on a modern superscalar—see the former. Is the proposed architecture any different?


Haven't seen the ISA, but it's not insane to imagine one where explicit data cache manipulation instructions are a requirement (that a memory access outside a cache would fault). I think it's even helpful to make those operations explicit when you are writing high-performance code. On a processor like this, any operation to load a cache that's not already loaded should trigger a switch to the next runnable thread (which also become entities exposed to the ISA).

Also, I'm not even sure it'd be too painful to program (in assembly, at least). It'd be perhaps inconvenient to derive those ops from C code, but Rust, with its explicit ownership, may have a better hand here.


I don’t think it necessarily implies a barrel processor. More like a higher count for SMT which could be due to the higher CPU performance relative to the CPU <> memory bandwidth. While the slow fetches occur, the system could execute more instructions for other threads in parallel.


How useful would it be as a GPU?


Very badly, this is the polar opposite design of a GPU.

It does share the latency-hiding-by-parallelism design, but GPUs do that scheduling at a pretty coarse granularity (viz. the warp). The barrel processors on this thing round-robin between threads on every instruction.

GPUs are designed for dense compute: lots of predictable data accesses and control flow, high arithmetic intensity FLOPS.

In contrast, this is designed for lots of data-dependent unpredictable accesses at the 4-8B granularity with little to no FLOPS.


These particular chips? They seem more targeted at HPC work (and price point).

This sort of architecture? I wouldn't be surprised if current GPUs were doing something similar.

If you think about executing a shader program. You typically are running that same code over a bunch of data. You can map that to multiple threads.

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...

https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-...


To me it looks like the opposite: the processor being very fast, and exporting itself as 66 threads so as not to spend virtually all its time waiting for external circuits.


Marvell made an SMT8 ARM CPU with "768 Threads Per Node"

https://www.servethehome.com/marvell-thunderx3-arm-server-cp...

IIRC it was targeted at database workloads, and the pitch was the same: if the core is usually twiddling its thumbs waiting on RAM, it might as well work on another thread in the meantime.

And I guess IBM and Zen 4c kinda fulfill this demand, but more SMT16 cloud instances explicitly targeted at these low-IPC loads would be neat.


Niagara (Ultrasparc T1/T2) did the same thing. CMT8/SMT8 seems to be a sweet spot for these kinds of barrel processor-esque designs.


I never read or heard someone say “mind as well…” Is that common where you’re from? I’ve only ever heard “might as well…”

Interesting thank you!


Feels like a typo I think, it doesn't seem to be a thing from a Google search.


It reminds me of the difference I saw in Europe when signs posted may say “mind your step” instead of “watch your step”


Right, the London subway (used to) have an announcement on every stop: "Mind the Gap". It is weird and wonderful.


Yep, typo. It is late...


They seem to be suggesting that if you amortize the die cost and speed consequences of electrical-to-optical conversion, you can then get a pretty much distance-independent (obviously not literally, but let's limit ourselves to the distances between chip carriers inside a single chassis) interconnect at good rates in optical, with low interference and free routing within the bend radius of the light guides.

Maybe I misread it. Maybe the optical component is for something else like die stacking so you get grids of chips on a super carrier, with optical interconnect.


getting the cores apart would do a lot for sinking the heat


They had voxels in 1997 when I was there... the goal was stacking layers and using voxels in the vertical stack.


> Here is the actual die photograph and confirmation that this is being done on TSMC 7nm.

Yikes, Intel. Has to be a pretty low moment as a chipmaker to have to use your competitor's fabs for something like this.


Intel has a prototype 1.8 nm node which Nvidia has already tested and spoken well of.

But yes, the 10 nm node was a disaster for Intel and set the company back 5-10 years in the chip-making business.


If not for Intel's 10nm debacle, Apple probably wouldn't have left. All of Apple's early-2010s hardware designs were predicated on Intel's three-years-out promises of thermal+power efficiency for 10nm, that just never materialized.


I hard disagree. The chassis and cooler designs of the old Intel-based Macs sandbagged the performance a great deal. They were already building a narrative to their investors and consumers that a jump to in-house chip design was necessary. You can see this sandbagging in the Apple Silicon MBPs that reuse the old Intel chassis, where performance is markedly worse than in the newer chassis.


That doesn’t make sense: everyone else got hit by Intel’s failure to deliver, too. Even if you assume Apple had some 4-D chess plan where making their own products worse was needed to justify a huge gamble, it’s not like Dell or HP were in on it. Slapping a monster heat sink and fan on can help with performance but then you’re paying with weight, battery life, and purchase price.

I think a more parsimonious explanation is the accepted one: Intel was floundering for ages, Apple's phone CPUs were booming, and a company which had suffered a lot due to supplier issues in the PowerPC era decided that they couldn't afford to let another company have that much control over their product line. It wasn't just things like the CPUs falling further behind but also the various chipset restrictions and inability to customize things. Apple puts a ton of hardware in to support things like security or various popular tasks (image & video processing, ML, etc.) and now that's an internal conversation, and the net result is cheaper, cooler, and a unique selling point for them.


> net result is cheaper, cooler, and a unique selling point for them

That and they are not paying for Intel's profit margins either. Apple is the quintessential example of vertical integration - they own their entire stack.


I was thinking of that as cheaper but there’s also a strategic aspect: Apple is comfortable making challenging long-term plans, and if one of those required them to run the Mac division at low profitability for a couple of years they’d do it far more readily than even a core supplier like Intel.


Apple doesn't manufacture their own chips or assemble their own devices. They are certainly paying the profit margins of TSMC, Foxconn, and many other suppliers.


That seems a bit pedantic, practically every HN reader will know that Apple doesn't literally mine every chunk of silicon and aluminum out of the ground themselves, so by default they, or the end customer, are paying the profit margins of thousands of intermediary companies.


I doubt it was intentional, but you're very right that the old laptops had terrible thermal design.

Under load, my M1 laptop can pull similar wattage to my old Intel MacBook Pro while staying virtually silent. Meanwhile the old Intel MacBook Pro sounds like a jet engine.


The M1/M2 chips are generally stupidly efficient compared to Intel chips (or even AMD/ARM/etc.)... Are you sure the power draw is comparable? Apple is quite well known for kneecapping hardware with terrible thermal solutions, and I don't think there are any breakthroughs in the modern chassis.

I couldn't find good data on the older MBPs, but the M1 Max MBP used 1/3 the power of an 11th-gen Intel laptop to get almost identical scores in Cinebench R23.

https://www.anandtech.com/show/17024/apple-m1-max-performanc...


> Apple is quite well known for kneecapping hardware with terrible thermal solutions

But that was my entire point (root thread comment.)

It's not that Apple was taking existing Intel CPUs and designing bad thermal solutions around them. It's that Apple was designing hardware first, three years in advance of production; showing that hardware design and its thermal envelope to Intel; and then asking Intel to align their own mobile CPU roadmap, to produce mobile chips for Apple that would work well within said thermal envelope.

And then Intel was coming back 2.5 years later, at hardware integration time, with... basically their desktop chips but with more sleep states. No efficiency cores, no lower base-clocks, no power-draw-lowering IP cores (e.g. acceleration of video-codecs), no anything that we today would expect "a good mobile CPU" to be based around. Not even in the Atom.

Apple already knew exactly what they wanted in a mobile CPU — they built them themselves, for their phones. They likely tried to tell Intel at various points exactly what features of their iPhone SoCs they wanted Intel to "borrow" into the mobile chips they were making. But Intel just couldn't do it — at least, not at the time. (It took Intel until 2022 to put out a CPU with E-cores.)


The whole premise of this thread is that this reputation isn't fully justified, and that's one I agree with.

Intel for the last 10 years has been saying "if your CPU isn't at 100C then there's performance on the table".

They also drastically underplayed TDP compared to, say, AMD, by quoting an average TDP with frequency scaling taken into consideration.

I can easily see Intel marketing to Apple that their CPUs would be fine with 10 W of cooling while knowing that they won't perform as well, and Apple thinking that there would be a generational improvement in thermal efficiency.


>Under load, my M1 laptop can pull similar wattage to my old Intel MacBook Pro while staying virtually silent. Meanwhile the old Intel MacBook Pro sounds like a jet engine.

On a 15/16" Intel MBP, the CPU alone can draw up to 100 W. No Apple Silicon chip except an M Ultra can draw that much power.

There is no chance your M1 laptop can draw even close to that. The M1 maxes out at around 10 W. The M1 Max maxes out at around 40 W.


Where do you get the info about power draw?

Intel doesn't publish anything except TDP.

Being generous and saying TDP is actually the consumption: most Intel Macs actually shipped with "configurable power down" specced chips ranging from 23 W (like the i5-5257U) to 47 W (like the i7-4870HQ). (Note: newer chips like the i9-9980HK actually have a lower TDP, at 45 W.)

Of course TDP isn't actually a measure of power consumption, but the M2 Max has a TDP of 79 W, which is considerably more than the "high end" Intel CPUs, at least in terms of what Intel markets.


Check here: https://www.anandtech.com/show/17024/apple-m1-max-performanc...

Keep in mind that Intel might ship a 23 W chip but laptop makers can choose to boost it to whatever they want. For example, a 23 W Intel chip is often boosted to 35 W+ because laptop makers want to win benchmarks. In addition, Intel's TDP figure is quite useless because they added PL1 and PL2 boosts.


Apple always shipped their chips with "configurable power down" when it was available, which isn't available on higher specced chips like the i7/i9 - though they didn't disable boost clocks as far as I know.

The major pain for Apple was when the thermal situation was so bad that CPUs were performing below base clock -- at that point i7s were outperforming i9s because the latter were underclocking themselves due to thermal exhaustion, which feels too weird to be true.


That's not Apple. That's Intel. Intel's 14nm chips were so hot and bad that they had to be underclocked. Every laptop maker had to underclock Intel laptop chips - even today. The chips can only maintain peak performance for seconds.


Can you elaborate?

My 2019 MacBook Pro 15 with the i9-9880H can maintain the stock 2.3 GHz clock on all cores indefinitely, even with the iGPU active.


My 2019 MBP literally burned my fingertips if I used it while doing software development in the summer.


Back in the Dell/2019MBP era every day was summer for me.


> You can see this sandbagging in the Apple Silicon MBPs that reuse the old Intel chassis, where performance is markedly worse than in the newer chassis.

And you can compare both of those and Intel's newer chips to Apple's ARM offerings.


If I was a laptop manufacturer who wanted to make money selling laptops I would not intentionally make my laptops worse.


I dunno, they could have gone to AMD, who are on TSMC and have lots of design wins in other proprietary machines where the manufacturer has a lot of say in tweaking the chip (i.e. game consoles).

I think Apple really wanted to unify the Mac and iOS platforms and it would have happened regardless.


If Intel didn't have those problems, Intel would still be ahead of TSMC, and the M1 might well have been behind the equivalent Intel product in terms of performance and efficiency. It is hard to justify a switch to your in-house architecture under those conditions.


>If not for Intel's 10nm debacle, Apple probably wouldn't have left

I doubt it. Apple loves doing full vertical stack as much as possible.


Apple left for TSMC, not to do in-house chip fabrication.


Apple had been doing their own mobile processors for a decade. It was a matter of time before they vertically integrated the desktop. They definitely did not leave Intel over the process tech.


Apple has been investing directly in mobile processors since they bought a stake in ARM for the Newton. Then later they heavily invested in PortalPlayer, the designer of the iPod SoCs.

Their strategy for desktop and mobile processors has been different since the 90s and they only consolidated because it made sense to ditch their partners in the desktop space.


> Apple has been investing directly in mobile processors since they bought a stake in ARM for the Newton. Then later they heavily invested in PortalPlayer, the designer of the iPod SoCs.

Not this heavily. They bought an entire CPU design and implementation team (PA Semi).


I mean, they purchased 47% of ARM in the 90s. That was while defining the mobile space in the first place, when it was much more of a gamble than now. Heavy first-line investment to create mobile niches has empirically been their strategy for decades.


Apple invested in them for a chip for the Newton, not for the ARM architecture in particular. Apple was creating their own PowerPC architecture around this time, and they sold their share of ARM when they gave up on the Newton.

The PA Semi purchase and redirection of their team from PowerPC to ARM was completely different and obviously signaled they were all in on ARM, like their earlier ARM/Newton stuff did not.


Apple didn't, the Mac business unit did. And it makes sense to consolidate on the investment in chips for phones and tablets.


Apple famously doesn’t do business units. It’s all in on functional organization.


Let's not forget Intel had issues with Atom SoC chips dying like crazy due to power-on hours in things like Cisco routers, NASes, and other devices around this era too. I think that had a big ripple effect on them around 2018 or so.

Yes, 10nm+++++ was a big problem, too.

Apple was also busy going independent, and I think their path is to merge iOS with macOS someday, so it makes sense to dump x86 in favor of higher control and profit margins.


Right. It made no sense for Apple to have complete control over most of their devices, with custom innovations moving between them, and still remain dependent on Intel for one class of devices.

Intel downsides for Apple:

1. No reliable control of schedule, specs, CPU, GPU, DPU core counts, high/low power core ratios, energy envelopes.

2. No ability to embed special Apple designed blocks (Secure Enclave, Video processing, whatever, ...)

3. Intel still hasn't moved to on-chip RAM, shared across all core-types. (As far as I know?)

4. The need to negotiate Intel chip supplies, complicated by Intel's plans for other partner's needs.

5. An inability to differentiate Mac's basic computing capabilities from every other PC that continues to use Intel.

6. Intel requiring Apple to support a second instruction architecture, and a more complex stack of software development tools.

Apple solved 1000 problems when they ditched Intel.


> 3. Intel still hasn't moved to on-chip RAM, shared across all core-types. (As far as I know?)

Apple doesn't have on-chip RAM either. They do the exact same thing PC manufacturers do: use standard off-the-shelf DDR.


Ah yes. The CPU and RAM are mounted tightly together in a system on a chip (SOC) package, so that all RAM is shared by CPU, GPU, DPU/Neural and processor cache.

I can't seem to find any Intel chips that are packaged with unified RAM like that.


Yep. Got hit twice by that. Enough power on hours on those Atoms = some clock buffer dies and now your device no longer boots.


Apple switching to ARM also cost some time. It took like 2 years before you could run Docker on M1. Lots of people delayed their purchases until their apps could run


TSMC processes are easier to use and there's a whole IP ecosystem around them that doesn't exist for Intel's in-house processes. I can easily imagine a research project preferring to use TSMC.


Many intel products that aren't strategically tied to their process nodes use TSMC for this reason. You can buy a lot of stuff off-the-shelf and integration is a lot quicker.


Pretty much everything that isn't a logic CPU is a 2nd class citizen at Intel fabs. It explains why a lot of oddball stuff like this is TSMC. Example: Silicon Photonics itself is being done in Albuquerque, which was a dead/dying site.


They've used them for maybe 15 years for their non-PC processors, guys.

If you search Google with the time filter between 2001 and 2010 you'll find news on it.

There must be some reason HN users don't know very much about semiconductors. Probably principally a software audience? Probably the highest confidence:commentary_quality ratio on this site.


I'd assume people who do know about semiconductors are more used to NDAs and less apt to publicly pontificate.


Nvidia and AMD have already signed contracts to use Intel's fab service (angstrom-class), so it's not unfathomable to consider. AMD did the same when they dropped GlobalFoundries, and it could be argued that this was the main reason for their jump ahead vs Intel.

They're all using ASML lithography machines anyway, so who's feeding the wafers into the machine is kind of inconsequential.


" amd have already signed contracts to use Intel's fab service"

Source requested; I follow this news closely and have not seen anything saying that AMD is using Intel's fabs.

Nvidia praised their test chip and that indicates they might use Intel's fab. They have not definitively announced that either (https://www.tomshardware.com/news/nvidia-ceo-intel-test-chip...)

Amazon was using Intel Foundry for packaging, not fabrication. And Qualcomm was considering 20A (https://www.pcgamer.com/intel-announces-first-foundry-custom...) in 2021, but there was a rumor earlier this year that Qualcomm might not use it after all (https://www.notebookcheck.net/Qualcomm-reportedly-ditches-In...)


who's feeding the wafers into the machine is kind of inconsequential

If that was true Intel wouldn't be years behind.


That's a colossal overstatement. I suggest you check out Asianometry on YouTube.

TSMC is in a unique position in the market and its integration with ASML is one of the chapters in this novel.


This seems more like a proof-of-concept CPU than a marketable product: highly specialized, with the problems and workloads that it solves yet to be found.

I anticipate that photonics will be introduced to general-purpose computing in the years to come, even if only to get a handle on the rising excess heat problems. Most notable is the 10 -> 7 nm process.


Sun did something similar many years ago, but they abandoned it in later UltraSPARC CPUs. Was it that the threads were starved? Did they find that fewer but faster cores were better? I can't find much detail about it.

Perhaps HN user bcantrill will give us the inside scoop :-)


I used them in the heyday. There were workloads where they were fantastic, but anything that needed good single-core performance suffered. It also often required much more tuning of parameters to perform, or recompilation, etc. For me, it also coincided with lots more need to use SSL, which had been optimized well for x86 but not Sparc. So now you were dealing with more complexity via SSL offload cards, or reverse proxies. Basically, just too fussy to make them work well outside a few niche areas.

Maybe not a big factor, but it also was bothersome for sysadmins because much of the work we had to do was serial, single-core, etc. Meaning it showed its worst side to the group that usually signed the vendor checks.


This would be perfect for graph reduction and dataflow programming. I could build some really cool things with these if they ever go into production.


8 cores with 528 threads? What's the math? I assumed 512 but had to reread it.


From the article:

Here is an interesting one. Intel has a 66-thread-per-core processor with 8 cores in a socket (528 threads?) The cache apparently is not well used due to the workload. This is a RISC ISA not x86.


I guess it is like a GPU where you have so many threads that it hides pretty much all the memory latency?



50 shades of Cray


It's simple: 528 = 8 * (2 + 64),

where 2 is the number of slow threads, like a current CPU,

and 64 are more like GPU threads.

This architecture could be the next step in GPU integration. I hope they manage to write efficient implementations of the standard math libs. It could be faster than a separate CPU+GPU in one package.


I think it's 2 fast threads that are like normal CPU core threads (OoO, etc.) and 64 simpler threads that are expected to be mostly waiting on RAM.


"8-core processor with 66 threads per core." 66*8 = 528. Why 66 is not explained.


Look at slide 6, "Die Architecture" (https://www.servethehome.com/wp-content/uploads/2023/08/Inte...).

The diagram shows 8 gray boxes (in 2 columns of 4). Each gray box is a core.

Above that is a detailed view of what each core looks like. There's a middle section labeled "Crossbar", and above that are 3 boxes labeled STP, MTP, and MTP. Below that are another 3 identical boxes. So each core has 4x MTPs and 2x STP.

What are those? The text of the same slide says MTP is "Multi-Threaded Pipelines" with "16 threads per pipeline" and STP is "Single-Threaded Pipelines" with (self-evidently) 1 thread per pipeline.

And 4 * 16 + 2 * 1 = 66, so 66 threads per core.


Good one. These presentations are like 30 min each so it is hard to stay on top of all of them.


Yes it is. They have four 16-thread pipelines and two single-thread pipelines per core: 4*16 = 64, and 64 + 2 = 66.


So, why do they consider that 1 core? Isn't that more like a "compute complex" of multiple cores, as seen on zen 2, but with different core architectures attached instead of 4 cores of the same design?


I'd be inclined to call them multiple cores myself, but if they all share the same pipe to main memory that would be a plausible reason to call them all the same core.


Maybe 2 higher-powered primary threads + 64 aux threads or something. I don't know how that makes sense as part of a single core but I don't have a better guess.


That's what it is, as shown on slide 6 (Die Architecture). Personally I would call this 16 big cores and 32 throughput cores but for some reason Intel is calling it 8 cores.


Maybe they just couldn't fit 67?


66 = 64 slow GPU-like threads + 2 very fast CPU-like threads


66 per core does seem a bit odd! maybe 2 are for routing/metadata or something.


Time to finish rewriting all code to be event driven, io_uring, etc.


in this kind of architecture, it's generally more fruitful to leave the asynchrony to the hardware. the Tera MTA, which other people have mentioned, has/had hardware synchronization in the memory system to effect this.

there were no interrupts, just a thread waiting for someone to wake it up.


Did I read the article correctly? Is there 32GB of DRAM on-chip? Or was the DRAM just standard DIMMs?


Not on-chip, no; it's still DDR5 DIMMs. What is cool about it is that it is using custom DIMMs and memory controllers to do 8B-granularity accesses. Plus, each core has its own MC, as you can see on the die shot. If each MC manages 4 GB of DRAM, that would explain the 32 GB per chip.


Can you elaborate on the granularity? Why is that important?


Your regular DRAM memory controller will use a full 64B burst to pull in a single 64B line of memory. In your modern x86 systems, that's equal to a cache line.

If you are only accessing a single 4B or 8B element of that cache line >80% of the time, as shown on the slides, you are wasting 7/8ths of all memory bandwidth on irrelevant data. If you were instead to use those same 64 bytes of bus traffic to access 8 different 8B memory locations, you would get a large boost to your effective bandwidth.
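As a back-of-the-envelope illustration (the 400 GB/s peak figure is an assumed number, not something from the slides):

  useful bandwidth = peak bandwidth * (useful bytes / transferred bytes)
  64B-line accesses:  400 GB/s * (8 B / 64 B) =  50 GB/s of useful data
  8B-granularity:     400 GB/s * (8 B /  8 B) = 400 GB/s of useful data (minus protocol/ECC overhead)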


[edit: I think I'm wrong!] I'm pretty sure this is HBM (High Bandwidth Memory) on the same substrate: https://en.wikipedia.org/wiki/High_Bandwidth_Memory

There are some other explainers out there that go into more detail, but Nvidia is also doing this with Grace Hopper (I think?), and Apple with their M-series chips.

Update: It looks like it's just DDR5 ("custom DDR5-4400 DRAM") in this case.


Apple is not using HBM on M-Series chips, although I bet the speed benefit would be massive.


It's regular DDR5.


Just a shout-out to the STH team (hey Patrick) who have been live-blogging literally every presentation and announcement from Hot Chips. I've been reading up on next-gen Xeons and Google's TPU datacenter architecture all day.


Thanks. We will have more in a bit, and may not get to all of them. I was talking to lots of folks that came to say hi today. Trying to get at least a good portion of these done this week.


Thank you for your service. Between your coverage and Dr. Ian Cutress', we can pick up on announcements from events like Hot Chips with a lot more clarity than the PR speak the companies themselves put out.


https://youtube.com/@ServeTheHomeVideo?feature=shared

The YouTube channel for those who like me are too impatient to ever read.


Lol, that's the story of me and my best friend.

I'm far too impatient for YouTube videos. Gimme an article I can scan.

He's far too impatient for articles. Give him a YouTube video he can listen to while doing something else.

20 years and we haven't changed each other's minds :->


Mind sharing the link?


[1] https://www.servethehome.com/intel-granite-rapids-xeon-held-...

[1b] https://www.servethehome.com/intel-xeon-e-cores-for-next-gen...

[2] https://www.servethehome.com/google-details-tpuv4-and-its-cr...

Very interesting to see the adoption of fiber optic in chips (PIC, photonics in chip/photonic integrated circuit) from so many big players and newcomers.


Noob question: Can/will silicon photonics help with the clock?


Can you do efficient matrix multiplication on photonics?

Can it be further exploited by not requiring it to be 100% precise, ie. when close approximation is good enough?


In this case photonics is used to transport data from one place to another. An electrical bit is converted to a photonic bit, moves across the chip, and then is converted back to an electrical bit. I'm not aware of any even vaguely practical photonic logic gates that would start to allow us to do computation in light.



It will be great to have a setting in the OS to set the number of threads per core.


Use case?

What would be a good use case for a high-thread / low-core chip?

Parallelism, like Erlang systems?


Graph workloads and other types of sparse compute. https://www.darpa.mil/program/hierarchical-identify-verify-e...


I benchmarked the Knights-series Xeon Phi (the add-on card) for Erlang at the time; it was very impressive.

Sadly it never took off.


So what's the instructions per clock on a barrel processor?


It really seems like Intel is designing this chip to support a very specific workload, when it would have been a better use of everyone's time and money to just rewrite the workload to work well on normal chips...


That graph workload fundamentally cannot work well on normal chips.


Graph workloads can work well on normal chips, but you really have to know what you are doing. Clever algorithms that demonstrated graph workloads could be run efficiently on CPUs were a major contributor to the demise of barrel-processing research, since a lot of the interest in those architectures was for the purposes of efficient graph analysis. Given the option, economics heavily favors commodity silicon.


Citation needed. Huge sparse graphs can be processed fairly efficiently on both CPUs and GPUs, but the code has to be written with the architecture in mind.


Their inefficiency at large-scale graph problems is the entire reason DARPA created the program that funded this chip.


I have no expertise in this area, but I think the presentation explains the issue quite well. The workloads they consider have a large amount of indirection and poor memory locality, so a lot of memory operations have to wait for cache misses, but you have to do the same thing a lot of times independently with different data.

Normal multithreading is not a solution because it is not granular enough; you cannot context switch in the middle of each memory access to run a different thread while waiting for the access to complete. You could manually unroll loops and, instead of only following one path through the graph, follow several paths at the same time by interleaving their operations (see the sketch below). While this hides the waiting times, because you can continue working on other paths, it makes your code more complex and increases register pressure, as you still have the same number of registers but are operating on multiple independent paths. Letting the hardware interleave many parallel paths keeps your code simpler and, since each hardware thread has its own registers, does not increase register pressure.
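For the familiar CPU case, that manual interleaving looks roughly like the sketch below; the node layout and the fixed interleave factor of 8 are assumptions for illustration, not anything taken from the PIUMA paper:

  /* Follow several independent chains through a linked structure at once,
   * so the out-of-order core can keep multiple cache misses in flight
   * instead of stalling on every hop of a single chain. */
  #include <stddef.h>

  struct node {
      struct node *next;
      long value;
  };

  enum { LANES = 8 };  /* interleave factor: each extra lane costs registers */

  long sum_chains_interleaved(struct node *heads[LANES], int hops) {
      struct node *cur[LANES];
      long sum = 0;
      for (int l = 0; l < LANES; l++)
          cur[l] = heads[l];
      for (int h = 0; h < hops; h++) {
          /* touch every lane once before coming back to any of them,
             so their (likely missing) loads overlap */
          for (int l = 0; l < LANES; l++) {
              if (cur[l]) {
                  sum += cur[l]->value;
                  cur[l] = cur[l]->next;  /* probable cache miss */
              }
          }
      }
      return sum;
  }

The barrel-processor version of this is just LANES independent hardware threads, each chasing one chain with the simple single-path code.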


The solution is SIMD with a queue of 'to be processed' nodes, and a SIMD program that grabs 32 nodes at once from the queue, processes them all in parallel, and adds any new nodes to process to the end of the queue. There is then plenty of parallelism, great branch predictor behaviour, and the length of the queue ensures that blocking on memory reads pretty much never happens.

Downside: Your business logic of what to do with every node needs to be written in SIMD intrinsics. But that has to be cheaper to do than asking Intel to design a new CPU for you.
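A minimal sketch of that pattern with AVX2 gathers, assuming 32-bit node indices in a flat array and a toy "sum one field per node" as the business logic (the real per-node logic would be more involved, as noted above); compile with -mavx2:

  #include <immintrin.h>
  #include <stdint.h>

  /* Process 8 queued node indices at a time with an AVX2 gather. */
  int64_t process_queue_simd(const int32_t *node_values,
                             const int32_t *queue, int queue_len) {
      __m256i acc = _mm256_setzero_si256();
      int i = 0;
      for (; i + 8 <= queue_len; i += 8) {
          __m256i idx = _mm256_loadu_si256((const __m256i *)&queue[i]);
          /* gather node_values[queue[i..i+7]]; each lane may hit a
             different cache line */
          __m256i vals = _mm256_i32gather_epi32((const int *)node_values,
                                                idx, 4);
          acc = _mm256_add_epi32(acc, vals);
      }
      /* horizontal sum of the 8 lanes */
      int32_t lanes[8];
      _mm256_storeu_si256((__m256i *)lanes, acc);
      int64_t sum = 0;
      for (int l = 0; l < 8; l++) sum += lanes[l];
      /* scalar tail for whatever is left in the queue */
      for (; i < queue_len; i++) sum += node_values[queue[i]];
      return sum;
  }

As the replies point out, each gather lane can still miss in a different cache line, so this buys memory-level parallelism but does not by itself fix the bandwidth-waste problem.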


Does this actually help? I don't know much about microarchitectural details, so I may be completely wrong. Let's take a simple example: I have some nodes and want to load a field and then do some calculations with the value. I load the pointers of 32 nodes into a SIMD register and issue a gather instruction, and I get 32 cache misses. Now it seems to me there is nothing I can do but wait. I can not put this work item back into the queue in the middle of a memory access, but even if I could, the overhead seems completely prohibitive to get things out and put them back into queues every other instruction or so.

But with hundreds of hardware threads I could just continue executing some other thread that already got its pointer dereferenced. SIMD is great when you can issue loads and they all get satisfied from the same cache line, but if you have to gather data, then values will arrive one after another and you can not do any dependent work until the last one has arrived. I guess the whole point is that all the threads can be at slightly different execution points: while some are waiting for reads to complete, others can continue working after their read completes, without having to wait for an entire group of reads.


Correct.

This is also why hyperthreading yields a benefit for sparse/graph workloads. I was getting good results on the KNL Xeon Phi with its four hyperthreads. But you easily hit the 'outstanding loads' limit if your hyperthreads all do gather instructions every 4-8 instructions. The architecture needs to be balanced around it.


Those nodes in the queue have to come from somewhere and that is where the inefficiency lies.

If you use SIMD gather instructions to fetch all the children of a node, that is still going to be very inefficient on e.g. a Xeon. Each of those gather loads is likely to hit a different cache line, which is accessed only once and wastes 7/8ths of the memory bandwidth pulling it in. SIMD is also not going to help if you are memory-bandwidth or latency bound; we're talking ~0.1 ops/byte here.


I can't help picturing the OpenBSD developers getting their hands on this and promptly disabling 65 threads per core due to security concerns.


Those who sacrifice security for performance deserve neither :-)


Founding fathers aside, if you're not connected to the internet or even a wider LAN, but are in some private lab running specific workloads, you don't need security from malicious side-loaded apps.


Oh yes you do.

"14 Popular Air-Gapped Data Exfiltration Techniques Used to Steal the Data" - https://thesecmaster.com/14-popular-air-gapped-data-exfiltra...

"A Survey on Air-Gap Attacks: Fundamentals, Transport Means, Attack Scenarios and Challenges" - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10054827/


The data could be public anyway, like publicly funded research data. Not all number crunching implies secrecy.

Also, none of the above are about software-level security.

You're probably not doing those anyway, even if you have all the security fixes for CPU timing stuff (which was the context) enabled.


Pretty snarky comment in my opinion. From my understanding, OpenBSD's main reason for not supporting things (apart from the obvious lack of resources) is that they are not free (require blobs etc.).

Security concerns are typically solved in software by them as long as they can get access to the hardware in a free (as in freedom) way.

I actually applaud them for their hard stance and am happy that we have this end of the spectrum as well as the Linux end (pragmatic, just get devices to work somehow, willing to accept some non-freedom). It's certainly not the easiest path to follow.


> Pretty snarky comment in my opinion. From my understanding, OpenBSD's main reason for not supporting things (apart from the obvious lack of resources) is that they are not free (require blobs etc.).

Your comment reads as someone who doesn’t really interact with the OpenBSD ecosystem very much.

I'm pretty sure the commenter you're replying to was referring to the fact that hyperthreading is disabled by default on OpenBSD systems out of caution: https://news.ycombinator.com/item?id=17350278

And as for their attitude toward firmware blobs, while they ideally prefer them to be free, they only require them to be redistributable; this is a less hard stance than GNU. Plenty of OpenBSD drivers require proprietary blobs to function.


I mean, considering Intel has had one of the least performant SMT designs in history, easily being trounced by both SPARC and POWER, you'd probably still get 70% of the chips' total performance with them disabled.

Now, when they disable Zen SMT, then there might be a decent talking point.


Can I have context on whether they are innovation killers or prophets vindicated after years of mockery or somewhere in between?


Interesting question. I don't believe that either Intel or AMD has actually found a way to make SMT completely safe against Microarchitectural Data Sampling attacks, so maybe it's not actually possible?

If you only care about security, then I think OpenBSD's approach is currently the best, but it also seems like they got lucky a few times, like with Zenbleed, where for some unknown reason they never really adopted AVX to the same extent as Linux or Windows.


I mean, physically speaking, unless you are deliberately going fully Procrustean on your computations, there's no way to really avoid those types of micro-architectural side-channel disclosures. It's a trade-off. Either you get the computation result faster (but you have side-effects that can be measured as an alternate form of info disclosure), or you trade some minimum possible execution time to gain fewer side-channels through which unintended disclosure can happen.


Imo prophets.


How about people disabling JS by default?



