Jeez - how is this supposed to work with OS scheduling? The charm of big.little on arm is that the instruction sets between the big and the little cores are identical. Now the OS has to pin processes based on support for different instructions? What a ridiculous nuisance. Intel really couldn’t discipline themselves just this once and actually implement the same instructions, even if in microcode, for both types of cores?
Are the little cores’ instructions at least a complete subset of the big cores’? Are we going to have some ridiculous situations where the little cores are completely pegged but the OS can’t migrate their processes off to the big core?
Or do kernel programmers need to start chasing Intel and trap/software implement every single future AVX8192AESNISSSE instruction Intel jams into future instruction sets to provide Xeon market differentiation?
NVIDIA had a much funnier (infuriating) one, where Tegra X1 had caches between the big and little cores that weren't always coherent. Their solution was to just turn off the little cores while continuing to spew marketing bullshit about it being an 8-core chipset.
Yes, you look to be right that the Samsung-designed Mongoose core in the Exynos SoC had a 128-byte cache line size, while ARM designed their generic Big cores to use 64 bytes like their Little cores. Still, it seems ARM did not provide guidance that these specs must match between Big & Little cores, and Samsung didn't catch it either.
Different instruction sets are not too bad. Copied from my comment elsewhere.
> The OS can install an illegal-opcode exception handler. When a process is first run on the small CPU, the unsupported opcode will raise the exception. The exception handler can simply set the processor affinity of the process to the main CPU and put it to sleep. The OS will handle it like it normally would - putting the process in the run queue of the affined processor.
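Roughly, a user-space sketch of the same idea (hedged: a kernel would do this in its invalid-opcode handler, and I'm assuming logical CPU 0 is the big core):

    /* Sketch of trap-and-migrate from a process's own SIGILL handler.
     * Assumes CPU 0 is the big core; a real OS would do the equivalent
     * in its invalid-opcode exception handler. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void on_sigill(int sig)
    {
        (void)sig;
        cpu_set_t big;
        CPU_ZERO(&big);
        CPU_SET(0, &big);                       /* only the big core allowed */
        if (sched_setaffinity(0, sizeof big, &big) != 0)
            _exit(1);
        /* Returning re-executes the faulting instruction, now on CPU 0.
         * Caveat: a genuinely invalid instruction would loop forever. */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigill;
        sigaction(SIGILL, &sa, NULL);
        /* ... run code that may contain instructions the little cores lack ... */
        return 0;
    }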
What if my program has like 1 avx2 instruction? Then we still wake up the big core and run my process there? And keep running it there? Or do we try to migrate back to the little core every once in a while? And then go back to the big core when we trap again?
It's not quite relevant to your point, but generally you wouldn't want to have one AVX2 instruction as they are only really useful if you have a bunch of data.
Yes, even if the process has 1 avx2 instruction, it will be scheduled to run on the main CPU. I mean you could switch the process back to the small CPU when you detect a long stretch of opcodes without any avx2 instructions, but the constant switching between the main CPU and small CPU probably is not good for performance.
Thread migration only costs on the order of 100 microseconds, including the effect of cold caches. If you keep the AVX thread on the big core for at least 100 milliseconds at a time, you only lose ~0.2% performance.
There are processor architectures (MIPS) that can't do unaligned memory access where Linux will catch the fault, emulate the memory access in kernel and return to the program. On every unaligned access.
At least this only has to be done once, and frankly these AVX instructions are slow initially anyway.
> At least this only has to be done once, and frankly these AVX instructions are slow initially anyway.
FWIW I believe only some of them, and those are just some (or all?) AVX-512 instructions. I think AVX2 is implemented on the main die and doesn't have to be powered up first.
AVX (256 bit) instructions also suffer a penalty.
Both the 256-bit and 512-bit instructions resulted in a slowdown lasting about 9 microseconds at roughly a quarter of the usual instructions per clock, but the 512-bit instructions incurred an additional penalty of about 11 microseconds during which no instructions executed.
The first penalty was associated with the voltage transition, and the second with the frequency transition. Heavier 256-bit instructions would probably have triggered the frequency transition as well.
It only needs to be done once at the process launch. Once the processor affinity is set, it will stick to the main CPU at full speed for the entirety of the process lifetime.
Once per process. I doubt operating systems will start modifying executables and libraries on-disk to tag them as using relevant instruction set extensions.
And generally speaking, when an application first starts issuing SIMD instructions, that's probably not a great time to be interrupting it, even if it only needs to happen once.
the issue is that once you have an active process using a feature only available on the larger cores, you can't shut off the larger cores to save power without paying a large latency to wake up that process.
You could imagine it as an ELF note of some kind. It's not a direction I'd love to go in, and catching illegal-instruction traps seems fast enough that the more complicated design doesn't seem worthwhile.
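For illustration only, such a note could be as small as this (the ".note.isa" section, the "ISA" owner name and the 0x20 payload are all made up, not any real convention):

    /* Hypothetical ELF note tagging a binary as wanting AVX2, so a loader
     * or scheduler could read it up front instead of waiting for a trap. */
    __attribute__((section(".note.isa"), used, aligned(4)))
    static const struct {
        unsigned int namesz, descsz, type;   /* standard ELF note header */
        char         name[4];                /* "ISA" plus NUL padding */
        unsigned int desc;                   /* pretend 0x20 means "uses AVX2" */
    } isa_note = { 4, 4, 1, "ISA", 0x20 };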
Well, once when it's about to try to wring maximum performance out of the processor, and if you only do it once you're ensured that your smaller cores never get used…
A trap, followed by migrating the process to a different core, which may have been powered off and definitely has cold caches. Context switches don't get any worse than that.
Oh, I thought the suggestion was to trap to the large core every time an unsupported instruction was executed and let the process drift back to the smaller one later. Which would be slow I would assume. If you never switched back, wouldn't any process using advanced vectorized instructions (like anything using a decent libc) be permanently pinned to the large core?
> Oh, I thought the suggestion was to trap to the large core every time an unsupported instruction was executed and let the process drift back to the smaller one later.
Yes that's the idea.
> Which would be slow I would assume.
How expensive do you think a trap is? It's on the order of 10 billionths of a second.
> If you never switched back, wouldn't any process using advanced vectorized instructions (like anything using a decent libc) be permanently pinned to the large core?
I think you can switch back next time you schedule.
> How expensive do you think a trap is? It's on the order of 10 billionths of a second.
are you sure about that? I would expect at least a couple of orders of magnitude more just for the userspace->kernel transition.
edit: for what it's worth, a syscall takes 250ns on my (admittedly vintage) machine. That's using the low-latency sysenter path. An interrupt is probably going to cost more.
Anyway the cost of scheduling on another core is going to dwarf that.
edit2: for reference, this was a Sandy Bridge turboing at 3.5 Ghz during the test. With spectre mitigations on (which is going to be a good chunk of that overhead).
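If anyone wants to reproduce that kind of number, a crude measurement looks something like this (Linux; results swing a lot with the CPU and mitigation settings):

    /* Rough estimate of raw syscall round-trip cost. */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);            /* forces a real kernel entry */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per syscall\n", ns / iters);
        return 0;
    }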
> I think you can switch back next time you schedule.
Ok, yes, then we're on the same page. I would still think that would be slow? You'd need a full transition-to-kernel and context switch before you could execute again, which AFAIK would take at least microseconds…unless you think there would be a faster path to resume execution?
I believe the OS can turn off features in the big cores, disabling the corresponding bits in CPUID so that the two sets of cores will advertise the same capabilities.
If this is going to be anything like how arm did it, intel most likely upgraded the atom CPUs to support all the instructions of the big core, probably just very slowly. For example, you will find that the lowly Cortex-A7 supports virtualization extensions. Why? Because it was paired with big cores that supported them.
The only way to make big.LITTLE work and allow task migration between them is to have them support the same instruction set exactly. So, I suspect that that is why AVX512 was removed from the big core here, and I suspect the atom cores aren't exactly garden variety either. They probably support most other extensions of modern x86 CPUs, at a slow speed.
For example, AVX2 can be microcoded by running multiple pieces of the data through the ALU, one at a time.
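Purely as an illustration of "one piece at a time" (not how the hardware literally does it), a 256-bit packed add is architecturally just this per-lane loop:

    /* What one AVX2 vpaddd (eight 32-bit lanes at once) computes,
     * performed one lane at a time; the result is identical. */
    #include <stdint.h>

    static void paddd_256_one_lane_at_a_time(const uint32_t a[8],
                                             const uint32_t b[8],
                                             uint32_t out[8])
    {
        for (int lane = 0; lane < 8; lane++)
            out[lane] = a[lane] + b[lane];   /* per-lane wrap-around add */
    }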
Brings me back to coding for the ARM7TDMI on the Gameboy Advance - it supported both 16-bit THUMB[0] and the standard 32-bit instruction sets, and you would switch modes at runtime. Fun times. This does seem more annoying, though.
I don’t think users would be happy if it were a lottery whether their code runs slow, using microcoded instructions, or fast, on the better CPU, especially given that, I guess, the difference in instruction set is for vector instructions (because adding those is what makes a CPU big nowadays).
So, performance oriented code will pin its performance-critical threads to the fast CPU. Now, the question is: what code won’t try to claim the faster CPU, given that benchmarks typically run programs while there is no contention from other programs?
I'm not sure how this is different from the existing big.LITTLE "lottery", between being scheduled on a CPU with a deep speculative pipeline and a high clock rate or a CPU with less.
If the microcoded vector ops cause your code to run slow, then you will be consuming lots of CPU time, and should be migrated to the big core anyway, right?
> So, performance oriented code will pin its performance-critical threads to the fast CPU. Now, the question is: what code won’t try to claim the faster CPU
Do apps have CCX pinning logic for Ryzen? No? Why would they add it for an Intel processor that probably won't even be sold in volume.
That seems hard to do with preemptive scheduling. AFAIK, the decision is made at the function level, and done at startup. Even if you can decide at call time, it's still only at the function level, and preemption can happen mid-function.
I have to imagine the differing instruction sets means different extensions, not entirely different ISA.
I'm not a programmer, but doesn't compiled code often offer multiple codepaths, checking feature flags at runtime? So if your code runs on the big core, it finds AVX2 is supported and uses the AVX2 codepath. If it runs on the little core, it finds AVX2 is not supported and takes a legacy floating-point codepath instead.
The problem is that you want to be able to migrate processes between cores. It's not the end of the world if a number-crunching process gets stuck on the big core. But if a process detects an instruction that it's only using for convenience purposes, and that causes it to get stuck on the big core or stuck on the little cores, that's not good. In a different scenario, if it was set up so that a process starting on a little core never detected an important instruction, and was stuck in slow mode forever, that would also not be good.
It depends a lot on which instructions differ and how the feature flags work, of course. They might have set it up in a way that's fine.
Tons of asterisks here, but: you'll have to explicitly write the code to do that at runtime, if APIs are provided to do so. When you build for a target, say x86, you're usually targeting the lowest common denominator you're willing to support, and the compiler won't emit less widely supported instructions even if the CPU it's compiling on supports them.
Yes, most programs target the generic x86-64 ISA, which means SSE2 at most. Still, some have accelerated paths that are enabled by looking at feature bits. So a simple solution is for the OS to mask out some of those bits for processes that may migrate to the less featured cores, while giving the user the option to "pin" a process to the fast cores and enable all features.
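Those accelerated paths usually boil down to something like this (GCC/Clang builtins; sum_avx2 and sum_scalar are hypothetical stand-ins, and the AVX2 one would normally live in a file compiled with -mavx2):

    #include <stddef.h>

    static void sum_scalar(const float *x, size_t n, float *out)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        *out = s;
    }

    static void sum_avx2(const float *x, size_t n, float *out)
    {
        sum_scalar(x, n, out);          /* stand-in: imagine 256-bit loads + vaddps */
    }

    void sum(const float *x, size_t n, float *out)
    {
        __builtin_cpu_init();           /* needed before the feature checks on GCC */
        if (__builtin_cpu_supports("avx2"))
            sum_avx2(x, n, out);        /* fast path, chosen once at runtime */
        else
            sum_scalar(x, n, out);
    }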
Maybe it wouldn't be possible to have several processes working in parallel, but instead a single process with 5 cores available for threads? Consider the PS3 architecture, for instance: it had 7 cores, 1 for management and the rest for computing. Could this be the way Intel envisions its future lines of products?
> Now the OS has to pin processes based on support for different instructions?
The only complication here would be if they have differing extensions like AVX512, but that's easily solved by the OS by just advertising the common baseline. Nothing about this looks difficult to support?
Runtime detection is the process querying what extensions are available, and then selectively using those. You adjust what the query returns to only return the common set.
Runtime detection has been a pretty standard thing for well over a decade now - it's how we all manage to run the same compiled binaries over the years despite variability in SSE & AVX support. You don't download different versions of Chrome/Photoshop/Gimp/Premiere/Blender/Whatever compiled for different CPU micro-architectures, do you? You might if you run Gentoo I suppose, but that'd be about it.
> Does it mean that I can’t compile with -mavx2 anymore?
You already can't if you're shipping binaries to users unless you only support Skylake & newer? There's a lot of CPUs currently in use that don't support AVX2. So... you either already have this problem and you're familiar with it, or you're not doing this and it's moot.
> Runtime detection is the process querying what extensions are available, and then selectively using those. You adjust what the query returns to only return the common set.
Except that they almost always do this runtime detection once, on startup, and then choose/thunk codepaths accordingly. If the OS just happens to start my avx2 process on a little core (and how is it going to know better?), that's going to turn off all of my optimizations, regardless of where the process subsequently gets migrated to.
> You already can't if you're shipping binaries to users unless you only support Skylake & newer? There's a lot of CPUs currently in use that don't support AVX2. So... you either already have this problem and you're familiar with it, or you're not doing this and it's moot.
Except nobody in 30 years of x86 dev expects to get a different answer from CPUID during runtime.
But that defeats the purpose of supporting any extensions at all in the big core that the little core doesn't support. Software will get the lowest common denominator answer and just not use avx2. So why support it in the first place? Why not just do the right thing and have uniform extension support like big.LITTLE?
Currently Intel likes to disable features on some models of Core cores for product segmentation reasons. I'm pretty sure processors sold under the Pentium or Celeron brands already have enough features fused off that they're roughly comparable to the Atoms this is being paired with. You also fuse off things like sections of cache or whole cores that have manufacturing defects.
Chips aren't designed from scratch. They're assembled out of previously designed components and in this case Core cores and Atom cores were never designed to work together.
No - but that just means that Intel shouldn’t do this at all. Either don’t support stuff like avx and avx2 in the big core by disconnecting those blocks or support a slow microcode version of avx and avx2 in the little cores. Supporting different extensions for a CPU used with modern preemptive OSes doesn’t make any sense.
> Either don’t support stuff like avx and avx2 in the big core by disconnecting those blocks
That's partly what they did. From the article: "One thing we can confirm in advance – the Sunny Cove does not appear to be AVX-512 enabled."
Maybe they also fused off AVX & AVX2 support in the Sunny Cove core as well, we'll see.
And disabling AVX in cores that otherwise support it is already a common thing - see the Pentium & Celeron lineups that Intel currently sells. They don't have AVX/AVX2, even though the cores inside them definitely could offer it.
The typical x86 extension query (cpuid) is an unprivileged user-mode instruction; unless the application takes care to ask the OS, there's no real way for the OS to select the common denominator.
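i.e. any process can just ask the silicon directly, with no syscall involved and nothing for the kernel to filter (short of running everything under a hypervisor that traps CPUID):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 7, sub-leaf 0: EBX bit 5 is the AVX2 feature flag. */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            printf("AVX2: %s\n", (ebx & (1u << 5)) ? "yes" : "no");
        return 0;
    }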
Edit: Also: AVX2 is a lot older than Skylake. You're probably thinking of AVX512.
The OS could still intercept CPUID - this is, after all, what VMs do.
But it looks like Intel is doing this anyway, as Sunny Cove in this application has had its AVX-512 removed: "One thing we can confirm in advance – the Sunny Cove does not appear to be AVX-512 enabled."
Running every thread in a hypervisor just to trap cpuid sounds like extreme overkill. And probably some applications want the features only available on the big core, and masking those out doesn't solve that end of the problem.
> You adjust what the query returns to only return the common set.
But this shoots the big core in the foot. You run at the lowest common denominator and the big core doesn't have the advantage of higher clocks. And this might just lead to a lot of software that only runs on the big core.
Imagine you wanted to fly around between multiple points but if you want the option to ever switch to a bus then your plane will circle around each airport until it's as slow as the bus. Or you can opt for "plane only".
It only shoots the big core in the foot for things that would make meaningful use of the extensions present on the big cores but not on the little ones. Which in a 7w application is what, exactly?
Anything that uses any AVX or FMA3 off the top of my head.
A better mix would have been simply using lower clocked, lower powered cores of the same type or really close derivatives of the big core where the manufacturing process and clocks are what keep power low. Not a mix of Ice Lake and Atom. But right now Intel would throw everything at the wall to see what sticks.
And it seems like a good way for developers to make sure their software stays on the big core.
The fact that they put the extensions in implies they expect meaningful use, doesn't it? If not on this specific part, on a future part with multiple big cores.
>> Does it mean that I can’t compile with -mavx2 anymore?
> You already can't if you're shipping binaries to users unless you only support Skylake & newer?
Huh? My 6+ years old gaming PC supports AVX2, and it definitely doesn't have a "Skylake & newer" CPU!
AVX2 support started at Haswell, or Intel Core 4th generation. We're at gen 10 now.
While there certainly are still a lot of systems without AVX2 support, new games (and other performance hungry software) requiring it would not be completely unreasonable.
Intel is still launching new processors with AVX2 disabled. They use AVX2 support for product segmentation and disable it on low-end parts. So a Comet Lake Pentium Gold CPU launched in Q2 2020 doesn't support AVX2.
(I recall a recent news story about how unawareness of this among people writing or documenting compilers has started to cause problems.)
The OS can just turn off the extensions that have XSAVE-managed state (AVX, AVX-512) by setting the XCR0 register to whatever it wants. Unsupported instructions will then cause invalid-opcode faults (SIGILL on Linux). Whether that's what the OS wants to do is a totally separate question. It would be kinda funny if this thing led to the ability of a process to request a certain XCR0 that the kernel would then serve, because I've been asking for that feature for unrelated reasons for quite some time and didn't get the warmest response to it on the mailing lists.
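For reference, the check applications (or their runtimes) already do before touching AVX reads exactly that register back via XGETBV, so masking bits in XCR0 really does make the extension disappear from the standard detection sequence:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        __get_cpuid(1, &eax, &ebx, &ecx, &edx);

        if (!(ecx & (1u << 27))) {              /* OSXSAVE: has the OS enabled XSAVE? */
            puts("OS has not enabled XSAVE; XGETBV would fault");
            return 0;
        }

        unsigned int lo, hi;
        __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        unsigned long long xcr0 = ((unsigned long long)hi << 32) | lo;

        /* Bits 1 (SSE/XMM) and 2 (AVX/YMM) must both be set for AVX to be usable. */
        puts((xcr0 & 0x6) == 0x6 ? "OS allows AVX state"
                                 : "AVX state masked off in XCR0");
        return 0;
    }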
This is not ARM; on Intel, the CPUID instruction can also be used by normal unprivileged programs. Not all programs look only at /proc/cpuinfo or the ELF auxiliary vector.
SSE itself tops out at 4.2, but Tremont does support the newer SHA extensions.
AVX appears to be the only missing thing, but AVX isn't even standard across Intel's other lines, either. The Pentium line doesn't support AVX either, for example, even though they are using Skylake & newer micro-architectures.
So you currently can't assume AVX support, and you still won't be able to assume AVX support. Why does this matter?
Operating systems move threads between cores. If those cores support different features, threads that are migrated to low-feature cores might experience illegal instruction traps despite correctly checking for instruction features.
Maybe standard application code all runs on the big core, and the little cores are like the system assist processors on IBM mainframes - used to run OS jobs or specific application support code, to free up the main processor for other work (or idling).
For this to work, there would need to be a small set of undemanding tasks that account for a lot of machine time. Feeding video to the GPU? All sorts of GUI compositing and housekeeping? Handling network connections in the browser?
I don't think this is a good explanation, but it's fun to think about.
A lot of things are background tasks until suddenly, without warning, the user ends up waiting for them to happen. Take your example of handling network connections, for example: this can definitely be a background thing! Your computer might want to keep an IMAP connection open, periodically poll a CalDAV server, sync photos with your phone, etc., and all of these would be very reasonable things to run on low-power CPU cores. Kick them over to a wimpy core, throttle down the frequency to its most power-efficient setting, insert big scheduler delays to coalesce timer wake-ups, whatever. Good stuff.
But what happens when the user opens up a photo viewer app and suddenly wants those photos to be synced right now?
If your code is running on a recent iPhone -- the heterogeneous-core platform I'm most familiar with -- then the answer is that the kernel will immediately detect the priority inversion when a foreground process does an IPC syscall, bump up the priority of the no-longer-background process, probably migrate it to the fastest core available, and make it run ASAP. Then, once the process no longer has foreground work to do, it can go back to more power-efficient scheduling.
This kind of pattern is super common, and it would be way more annoying and perilous to try to split tasks into always-foreground and always-background.
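On those platforms the split isn't static at all; the work is just tagged with a quality-of-service class and the kernel decides placement. A loose sketch with libdispatch (Apple platforms; sync_photos is a made-up stand-in):

    #include <dispatch/dispatch.h>
    #include <sys/qos.h>
    #include <stdio.h>

    static void sync_photos(void *ctx) { (void)ctx; puts("syncing photos..."); }

    int main(void)
    {
        /* Background QoS: the scheduler is free to park this on an
         * efficiency core with relaxed timer coalescing. */
        dispatch_async_f(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0),
                         NULL, sync_photos);

        /* When the user is suddenly waiting on it, the same work can be
         * issued (or boosted) at a user-facing QoS instead. */
        dispatch_async_f(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0),
                         NULL, sync_photos);

        dispatch_main();   /* park the main thread; async work keeps running */
        return 0;
    }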
The standard operating system model today is mostly to do what the application(s) request and then get out of the way quickly. I'd expect to cede all cores to applications most of the time, rather than reserving the low power cores for OS tasks.
That would be pretty disappointing if they can't handle normal processes too. It's easy to end up with a bunch of processes that are using small amounts of CPU but aren't dedicated background tasks.
Even just looking at a browser, I might have half a dozen generic tab processes open, each using a small amount of CPU. But then I navigate one to a game, and I want that particular tab to get near-exclusive access to the big core while the others use only the little cores.
It will. The performance difference between an Intel Atom and a Core i3 has basically nothing to do with differing instruction set extensions. AVX2 support is not why a Core i3 runs circles around an Atom, particularly since the overwhelming majority of instructions executed are not AVX anyway.
The only mechanism I can come up with is to detect illegal instruction traps on small cores, then flag the thread as big-core-only and re-start the execution at the bad instruction. That's not ideal but maybe workable.
(If it traps on the big core, too, it's just a bad instruction and SIGILL is raised to userspace like usual.)
I think the obvious solution is that AVX (or any other extension not in Tremont) will not be supported at all on these CPUs. These are 7W chips, after all.
There's no mixed instructions as Anandtech speculated. The Sunny Cove core was cut down to match what the Tremont cores support. So no AVX at all, no weird extension mismatch, no OS headaches beyond the expected big.LITTLE headaches.
How much would the lack of AVX hurt performance apart from scientific computing? It's interesting to see Intel offering a chip with a clear focus on tablet computing, so the lack of AVX might not be so bad if the Tremont cores improve battery life significantly.
I can answer that: a lot. AVX can speed things up anywhere from a modest 20% (if not very well implemented) to a massive 15-30x when carefully used (usually things that are very good fits for AVX/AVX2 instructions/use cases).
Anecdotally, I have a "netbook" with a Goldmont+ Celeron N4000 that works very respectably for everyday business (web browsing, office suites, watching videos, etc.) but crawls when trying to run scientific code. 100x slowdowns at worst, 10x slowdowns on the parts it takes "nice and easy" (wrt a 7200u notebook that I usually carry around).
You mean not much. The question was apart from scientific computing how much does it hurt. Per your own comment it's only worth maybe 20%, and that everyday needs work fine.
It'll show up occasionally in some things that would be relevant to a device with this SoC, like noise cancellation, but very little else. And even then you can do noise cancellation even better on a GPU (RTX Voice says hi), and this does still have a GPU and a decent one at that (~500 gflops), so it's not even that simple.
Read my second paragraph. Also, Intel, despite having theoretically amazing GPUs, has for years been a lackluster player in that area. Ironic too, since they're one of the largest iGPU manufacturers in the world.
I'm really curious how they envision this working, given the different instruction sets between the main core and the small cores. That seems like a bit of a nightmare to handle on the OS side of things, unless you turn things in to a two tier environment somehow, and require software (users?) to explicitly opt-in to the small cores somehow.
IBM did this with the PowerPC chips used in the Playstation 3. It was a development headache but part of that was the low power cores were extremely limited... no shared memory so you had to DMA data over and double buffer. They ran mighty fast though. It was a lot like programming a DSP really. If this is anything like that I would expect only some workloads to be pushed to the small cores in a similar way to the way stuff is pushed to GPU today. Frameworks will pop up to make it easier for developers but it will be rough at the get-go.
> IBM did this with the PowerPC chips used in the Playstation 3
It was a completely different situation: the Cell's SPEs were vector coprocessors. They were a completely different architecture from the PPE, and they had to be driven explicitly from the PPE.
How did Cell compare to doing compute on a GPU? Was the difference in complexity due to support libraries etc, or because it required non-standard thinking about how it can be utilised?
I remember reading or watching something by Naughty Dog about how they maximised use of Cell which was fascinating. (And something about how they ported to PS4 and made extremely efficient use of the eight cores by running game/physics in parallel with the previous frame’s rendering/graphics. Or something like that. It was ages ago and I can’t seem to find the talk online.)
The Cell was very different, and horrendously hard to program. The SPUs had no direct access to memory and had to be fed via DMA transfers coordinated from the PPUs.
Yeah, there was no way to pretend the Cell was anything close to a traditional SMP design. But Intel's Lakefield and Arm's big.LITTLE are very much meant to be treated as SMP systems, with just a few tweaks on top to deal with/take advantage of the fact that they're not completely symmetrical. That means the CPU designers actually have to be more careful about what functional differences they allow to creep in.
My guess is the little cores are mainly for OS background jobs and a few applications which explicitly opt in to them.
Also, while they have different instruction sets they don't have a different architecture, i.e. they have a "shared base" of instructions which is likely good enough for many background applications, like updaters, downloaders, background mail fetchers, etc.
In the end it fits in with their "always connected" approach. Though I'm still somewhat skeptical about that approach: even in many first-world countries, always connected is just not a thing for many users. Even on phones it's not fully a given, though much more so than on laptops.
I would imagine the difference between the instruction sets of the two CPU types is small since the small CPU is Atom which is not too different from the main CPU.
The program machine code can be scanned when loaded to look for specific assembly opcodes to determine the required capability of the CPU to execute it on. The code with instructions not fitted for Atom will be sent to the main CPU only.
Edit: Just a thought. The OS can install an illegal-opcode exception handler. When a process is first run on the small CPU, the unsupported opcode will raise the exception. The exception handler can simply set the processor affinity of the process to the main CPU and put it to sleep. The OS will handle it like it normally would - putting the process in the run queue of the affined processor.
> The program machine code can be scanned when loaded to look for specific assembly opcodes to determine the required capability of the CPU to execute it on.
That works as a heuristic, but it's not perfect, since JITs and self-modifying code are a thing.
I expect the chip will raise a fault and the OS will move the process.
This is one of the things we tried when I worked (2011-12) at Intel on QuickIA [0] (dual socket, Atom on one side Xeon on the other). One of my first projects was to write a vectorized matrix multiplication that could only run on the Xeon, so we could demo the fault-to-big-core behavior.
That's true. The code could be modified on the fly after the scanning so it's good to have a last ditch illegal-opcode exception and let the OS catch the exception to migrate the process.
You would only need to do that scanning if the program had hard requirements like it was compiled to assume AVX512 support was present. Which is going to be extremely rare, since that would already mean an x86 program that's highly non-portable.
Runtime detection would be the only actual concern here, but you can easily just advertise the common baseline. As in, just pretend the sunny cove core doesn't support AVX512. The only problem then becomes the big core is potentially slower than it could be, but given how rare things like AVX512 is in typical desktop applications will anyone actually care?
EDIT: Oh, and this appears to be what they're basically doing. Even though Sunny Cove itself supports AVX-512, it's being disabled in this application: "One thing we can confirm in advance – the Sunny Cove does not appear to be AVX-512 enabled."
Seems like the OS could solve it easily by trapping on undefined operation exceptions and retrying on the big core. If it still works simply mark the process as big core only.
Trap the undefined instructions, and dynamically replace them with a jump to emulation code if running on a little core.
Then let the OS and software folks do a systemwide profile to find out which bits of code most frequently run on which cores. Then configure the compiler not to output any instructions not available on the little cores for code which usually runs on the little cores.
> Trap the undefined instructions, and dynamically replace them with a jump to emulation code if running on a little core.
This was more or less what VMware pioneered to deal with privileged instructions. I expect there are thickets of patents involved, even if the earliest have expired. But it's likely there is or would be some licensing agreement.
Disclosure: I work for VMware, though not in VMs per se. Speaking for myself only.
No, unfortunately not. Only if you implement it exactly as it was done back in the day for FPUs. Any deviation and you might run into a more recent patent that covers some aspect of a later implementation, maybe even something seemingly obvious or necessary. This is why it is called a patent “minefield.”
I was going to joke about Intel "discovering the big.LITTLE architecture" until you said different instruction sets. It will be interesting to hear how they hope that will work out...
Sure but that’s still a nuisance from the point of view of OS scheduling. You can’t transparently move a process from a big core to a little core or vice versa the way you can with big.little.
No, but it can be a small nuisance: you can leave a process eligible to run on any core until it faults for an illegal instruction; then you pin it to the big core.
The fact that it's heterogenous kills it for me. Remember the Cell Broadband Engine? You could get some pretty crazy performance out of that thing, but nobody used the execution units because the instruction set was different.
GPUs get away with it because they are really a completely different kind of processor, but even there GPU processing is under-utilized for this reason. It's really a pain.
There was a proof-of-concept "smart network adapter" done by either Intel or MS (don't remember the exact details), which allowed finishing the download of a large file, or receiving emails while the host computer was asleep. I'd imagine tasks like these are well suited for the smaller cores.
A lot of game consoles have this model too. PS3, PS4, Wii, and WiiU off the top of my head run pretty complex operations on the I/O processor even when the main system is off.
The big.LITTLE concept was announced in 2011. It's not the concept of using differently sized cores which is odd, it's using cores with different instruction sets.
What's the user and developer experience, do programs need to be recompiled? Are the extra cores meant to typically run something like the JVM so no recompilation is needed?
I doubt you will need to recompile, although doing so is often a good idea as new compilers can use new instructions/optimizations.
I’d bet money that the different ISAs are full x86 and a subset of x86, which is a great idea. x86 has a lot of old instructions that almost no one uses. Perhaps the small cores will trap and move the process to the big core if it uses an old instruction.
I don't see how the cost of supporting outmoded instructions would be significant. The decoder is a tiny proportion of the die, and not a bottleneck at all. Obsolete instructions that can't be translated to the same uops everything else uses are already microcoded anyway.
I'd bet the instruction set difference is mainly level of SIMD support. Sunny Cove will have AVX2, Tremont won't have enough execution units to go wider than SSE2.
Sure big.LITTLE chips have been around for a while. But my understanding is the instruction set is identical between the cores, they just operate at different speeds. That in itself must be a fun balancing act for the scheduler.
To clarify my concerns (I haven't dug in to the specific instruction set differences):
Assume the main core supports AVX2 and the smaller cores don't. Which core do you execute the code on? Which one will get you the best performance per watt? How do you account for that in the OS scheduler? What do you want to optimise for?
If your code is compiled for AVX2, it'll fail on the small cores unless it does continuous runtime checking (which is expensive, but given processes can migrate between cores, presumably necessary).
It's not so much the speeds that differ, but the micro-architecture. The big cores are typically out-of-order cores with a large amount of cache, to get as many instructions as possible per cycle. The SMALL cores are in-order cores with a small amount of cache. These have a lower IPC, but they also use much less energy per instruction compared to the big cores.
You do the same with FP. The first floating-point instruction traps and the process is then flagged as needing FP. That flag is then used in context switches to also save and restore the FP registers.
The advantage is you avoid storing fp registers unless you are going to use them.
That flag could easily determine what you can run where.
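A conceptual sketch of that classic lazy-FP trick (not real kernel code; all names here are made up), including where the "what can run where" flag would live:

    /* The first FP/AVX instruction traps (#NM), the handler flags the task,
     * and only flagged tasks pay for FP state on context switches.
     * save_fp_state/restore_fp_state stand in for fxsave/xsave etc. */
    struct task {
        int           uses_fp;          /* set on first FP trap; a big.LITTLE
                                           scheduler could also read this */
        unsigned char fp_state[512];
    };

    static void save_fp_state(struct task *t)    { (void)t; /* fxsave  */ }
    static void restore_fp_state(struct task *t) { (void)t; /* fxrstor */ }

    void handle_device_not_available(struct task *current)
    {
        current->uses_fp = 1;           /* from now on, FP state is switched */
        /* re-enable the FPU for this task and retry the faulting instruction */
    }

    void context_switch(struct task *prev, struct task *next)
    {
        if (prev->uses_fp)
            save_fp_state(prev);
        if (next->uses_fp)
            restore_fp_state(next);
    }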
As mentioned in your link, that's ultimately against ARM's requirements for big.LITTLE and was due to buggy Samsung patches. A fixed kernel only exposes the subset of features available on all CPUs.
How does the scheduler know which process will use CPU instructions that the small cores don't support? Will it just try it out and then move the process to the higher support core if it detects an error?
>How does the scheduler know which process will use CPU instructions that the small cores don't support?
I don't see that as required? If you catch the illegal-instruction signal the CPU throws, you can just run it on the other CPU, since the program counter won't have advanced past the faulting instruction. There's a delay in the catch and retry on the other CPU, but I don't see the big deal here?
I once ran into the same sort of issue with this very chip, in that the little A55 cores supported half-float compute, while the big Mongoose cores didn't. Seriously a pain.
Can you give an example? I've seen ARM chips with co-processors with different features (like no MMU for instance) but not a completely different instruction set. Besides in my experience in these configurations the OS doesn't really do much with the other core, it's generally left to specific applications to have code explicitly meant to run on it, there's no automatic scheduling.
Unless you think about embedded co-processors that are generally here to control a complex subsystem (like video processing or something like that, arguably GPUs fit in that description) but in general those aren't handled like real CPU cores at the OS level, they have dedicated drivers or userland libraries dedicated to a specific purpose.
Something tells me that if Intel wants this architecture to be popular they'll have to work on a tighter and more transparent integration, otherwise this is going to end up like the Cell.
It's a pretty interesting approach though, I'm genuinely curious to see how that's going to end up working.
I'm very familiar with some of these chips but my point still stands, in my experience it generally requires tailor-made software to really benefit from these architectures.
If these Intel CPUs are meant to be general purpose I wonder how that's going to work out.
With the heavy rumours of Apple moving forward with ARM chips in Macs, Intel clearly showed the plans for this chip to Apple last year, and Apple said no thank you.
While Intel showed this, Apple was currently selling iPhone 11s with processors that would clearly outpace this. The A13 has two large cores and four small cores, whereas either one of the large cores outperform the comparable 'Core' core in this, and I'm betting the smaller cores do as well, possibly at less power. Supposedly their ARM Mac CPU is a 12-core architecture, I imagine with four large cores and 8 small cores. That's a massive hardware difference...did I mention that the 12-core CPU is supposed to be built on TSMC's 5nm? Intel is trying to get current and honestly, maybe Apple will ship a MacBook Air with it.. but probably not.
> The A13 has two large cores and four small cores, whereas either one of the large cores outperform the comparable 'Core' core in this
The evidence of this is flimsy at best. It's possible the A13's big cores are faster than Intel's Sunny Cove, but also not very likely, and certainly not established fact. The only real evidence here is Geekbench, which is a highly questionable benchmark that also has extreme OS dependencies (compare results for the same CPU on Windows & Linux, for example).
When/if Apple releases something running macOS with their custom ARM cores, then you'd finally get a good comparison to isolate the CPU's performance by itself.
This is not true; we have a pretty good idea of where the A13 stands. Anandtech has run SPECint and SPECfp on the A13 and shows it to be in the same range of performance as a Xeon Silver 4112 (2.6GHz base/3.0GHz turbo), which is a 14nm chip from 2017.
So not as fast as the latest processors on a per core basis, but still pretty damn fast.
Can't really make assumptions, but you would think with the higher thermal envelope Apple can boost clock speeds making them even more competitive.
There are lots of other tests, from SPEC to JavaScript benchmarks (using the same JS engine), that show what the A13 is capable of. And so far the results are comparable in many cases, better in some, and slower in others.
We already knew what to expect from Intel Willow Cove, we will have to wait and see what A14 has in store for us.
A12's spec results are close to Intel's single-thread performance at comparable clock speeds in SPEC, yes, but that wasn't the claim. The claim was that A12 was faster and that is what doesn't really have much evidence to support. Particularly since at comparable clock speeds is great and all, but not if your competitor can just outclock you by a landslide. The A12 can almost certainly go higher than 2.5ghz when actively cooled. But can it hit 4ghz? That's a different story.
>Geekbench, which is a highly questionable benchmark that also has extreme OS dependencies (compare results for the same CPU on Windows & Linux, for example)
OS and compiler dependencies, really. On the other hand, comparing macOS and iOS results may well be more meaningful (they use the same compiler and much of the OS is shared, especially at layers likely to significantly affect performance).
Geekbench (v4 especially) is definitely flawed, but the results in this case don't seem out-of-line with what we see on specINT for example.
e.g., https://browser.geekbench.com/v4/cpu/compare/15541405?baseli... the comparison shows an A12X roughly matching a Sunny Cove 1060NG7 on single-thread performance (and it mostly follows where I'd expect Intel's strengths to be, with a wider vector unit and specialized instructions); in multi-thread performance the A12X unsurprisingly wins, but it also has double the number of cores (okay, four of them are little, but that's still providing extra capacity to get work done).
IMHO, Apple's phones were considered fantastic for a few reasons, in order: brand (perceived quality, luxury status, pricing, "Apple"), software lock-in (iMessage, iCloud), and privacy (also a branding thing since they still use Google for Safari search). The hardware and camera were usually matched or exceeded by top-tier Android manufacturers. Now, Apple-designed CPUs are leaving their competitors in the dust, and provide, perhaps for the first time in a long time, hardware superiority. I don't think consumers quite understand the actual performance gap yet. But once it gets out, it's probably going to be second or third in the list for why people buy iPhones.
I'm going to switch from a Pixel to an iPhone this year. The CPU is just clearly better, and seems to actually matter for taking photos and web browsing. (The Pixel still takes like 5 seconds to post-process a photo. iPhones do it instantly.)
I switched for the opposite reason. Phone hardware is plenty fast for anything I want to do, but Apple's really restrictive closed software ecosystem and lack of hardware variety drove me to Android last year after being on Apple since iPhone 3G.
I just wanted a Touch ID, notchless, fast, and current phone, and Apple did not offer one until recently.
I am not going back, and I am also now seeing the same thing with their laptops. Crappy keyboards, non optional and useless touch bars, low performance per dollar and software that gets in your way. No thanks. Windows 10, WSL, Surface Book/Go/Alienware, wow. Just good, open, usable stuff.
I'd do the same if I wasn't concerned about Android privacy. My pi-hole is blocking literally 10x as much phoning home stuff on the Android device I've got vs iPhone. And that is with the iPhone having way more crap installed on it.
The Pixel 4 has renamed it the "Pixel Neural Core" and has given it even more machine learning tasks and offloading such as Google Assistant and face unlock.
What are your sources for the claim that the hardware superiority actually matters in real life? Every once in a while when I'm looking to buy a phone, I like to check out videos that do real life speed tests (mainly opening apps but other tasks as well) and iPhones have never been on top. Curious to see if that has changed recently.
You're asking for sources and then citing unnamed “real life speed test” videos without any way for people to see what they're measuring or how solid their methodology was?
The main area where normal people notice this is in web usage, where Mobile Safari has handily outpaced Android browsing for many years — see e.g. https://discuss.emberjs.com/t/why-was-ember-3x-5x-slower-on-... from 2014. How much that matters depends on how much a particular website is limited by single-core JavaScript performance — well-engineered sites probably don't have a huge impact but it's quite noticeable on anything which has a bloated SPA and the web has been moving in the latter direction for years.
Gaming is the other area where this is fairly noticeable but that varies both in where the bottlenecks are (CPU vs. GPU), how prominent the effect is, and the relative quality of the ports so it's harder to do a fair comparison.
Burden of proof is on the one who makes the claim. I'm just asking for some links and I explained why I doubted the claim.
I appreciate your effort with giving more background, although I think 2014 is a bit dated with Firefox Quantum becoming more of a thing on Android. Can't remember if the Preview has it yet or not.
Just to add color to that - Apple doesn't want external control of their semiconductors. They want to do their own thing starting with the mobile SOCs, then the secure enclave, then the M2 chip, and the GPU, and now mainstream high power processor. What Intel offers is orthogonal.
What Intel offers they still have to compete with.
It always seems like a good idea to do your own thing when your supplier is stumbling. The problem then is you're on your own. If they find their footing you're in trouble -- either you've blown a huge pile of money developing something which you then don't use because the competition has something better, or you use it anyway and get to relive the final days of Sun Microsystems.
And the same thing happens if anybody can beat you. If Intel can't but AMD does, you lose. If AMD can't but Qualcomm does, you lose.
Worse, success precipitates failure. If you make an in-house processor which is only a couple of percent faster, that's boring. You spend a lot for a little. If you make one which is more than a couple of percent faster, that's war. Intel can't have that. Google can't have that. Samsung can't have that. Microsoft can't have that. Even if you're bigger than any of them, you're not bigger than all of them. So they double their R&D, combine their resources, whatever it takes, and soon you're Sun Microsystems again.
When was the last time that a commodity mobile phone soc was even remotely competitive with Apple's latest? 2012? That seems like rather strong evidence against your theory.
I see a lot of Android devices at the top of those lists.
Apple likes to cherry pick. For example, they require everything on iOS to use their browser engine, then they spend a lot of time optimizing their browser engine for their CPUs. But that's not superior hardware performance, it's software optimization, which they could do with whichever commodity CPU they chose as well.
Intel does the same thing. They spend resources optimizing popular software for their CPUs. Qualcomm not so much.
Agreed- all those competitors have been shown up by Apple for years. They've had to take it because they haven't been able to come up with a technical response- even those using the same foundries...
no, it's not purely about control. apple wants reliability and (predictable) advancement in its supply chain. if intel had delivered on that by meeting its own roadmap, apple probably wouldn't be looking to move away from them.
it's not apple simply being control freaks (although they are and can afford to be), it's intel shooting itself in the foot.
Even when Intel started recycling 14nm why would Apple actually care? Some of their computers have gone years without updates despite newer processors being available, it doesn't seem like an important feature of most of their computers.
In 2008 Apple bought PA Semi to design processors for iOS; making processors for Macs is likely something they've been working on since way before Intel was late on 10nm.
I have only seen a sentence or two mentioned in Bloomberg, etc. But it’s widely known that Intel haven’t delivered a true generational upgrade since Skylake.
10nm Icelake in the new 13" MBP is looking like the first solid generational upgrade in recent memory, but yeah, one qualified success in the last 4ish years does not a roadmap make.
It's one of a few reasons. Macs have worse thermal characteristics than they should; part of that is Apple's fault for making such thin computers with poor ventilation, but part of that is Intel's fault for making such shitty hot chips. Of course there are other matters that can't be laid on Intel, like the keyboard fiasco, but that's beside the point I think.
The display cables fraying "Flexgate"[1], the numerous battery recalls[2][3], that the iPhone and the MBP, as shipped, could not be connected to each other (the MBP came with USB-C to USB-C; the iPhone with USB-A to Lightning; the magic trackpads also suffered this)[4], the kernel panics with USB-C until they finally got patched, that the flagship monitor (admittedly an LG device, but advertised on Apple's website at the time) had issues with … Wi-Fi (just being near it, from improper shielding), that same display also had huge issues with disconnecting peripherals (you would connect to the display and get picture/power, but no peripherals), the crappy noisy RSI-inducing redesign of the already poor keyboard (admittedly subjective except for the objective pain in my wrist, but, e.g., [5]¹), the poor design of the device s.t. any breakage in a component results in the complete waste of the remaining good hardware as nothing is independently replaceable/serviceable, routinely receiving a score of 1 out of 10 by iFixit, Apple fighting right to repair[6], … oh, and the keyboards breaking the entire laptop from a single grain of sand, the terrible suspend/resume times, … and I haven't upgraded to Catalina yet. I hear that's gone so well for so many.
Yeah, Intel's chips are not the biggest fish in the Apple fish fry.
I'm not even an Apple customer, I'm just forced to use them by my employer. (And it feels like most employers nowadays.)
From what I heard when I worked there, Lakefield was purpose-built for this new Microsoft platform, so it's possible that they would not have shown this to Apple until after some time.
*Edit: I wanted to add that I left about 2 years ago, and thought the project had been cancelled, along with so many other 10nm products.
Yes, BK was fired right around the time I gave my notice.
> Why left?
A lot of reasons that would sound like griping. I was recruited by Google for a similar job, but with a chance to build something from the ground up. That sounded like the adventure I was looking for, so I took the offer. I must say, it's been an adventure!
Thanks. I guess that does sum it up and aligns with lots of other similar sentiments I've heard over the years.
Off topic: I think we live in a world where we put too much emphasis on the positives and neglect any negatives, so we end up having any valid complaints viewed as moaning or griping. I wish as a society we had better ways to handle these rather than simply ignoring them, which leads to all sorts of bad things happening, as we've seen in the world today.
When Kaby Lake-G (an Intel CPU with a Radeon iGPU) was released, I thought it was for Apple devices, but Apple never released devices that use it. I'm curious what its purpose was.
This has a lot of parallels to how Intel incorporated some of the best ideas from RISC processor designs into their x86 CPUs to produce a better product under intense competition in the 90s. They're doing the same thing here with ARM processor design ideas, like the big.LITTLE ARM core designs for low power, translated to x86.
Why? I do not understand. It would make sense if the Atoms were performance-per-watt champions, but https://www.cpubenchmark.net/power_performance.html#intel-cp... in this chart you basically only see various generations of ultra low (Y) and some low (U) Core chips, going back to Broadwell when the Y chips were introduced. Atoms make a very rare appearance. The top-performing Sunny Cove (which itself has a much poorer score than the latest Amber Lake Y, but I digress) brings in 623 CPU Mark / Max TDP, where the top-performing Atom, a Pentium Silver N5000, features 455. You can argue the finer details of this particular benchmark, but we are talking about an absolutely brutal 40% difference. Core CPUs switch P-states extraordinarily quickly, so if less performance is needed, they will consume less power just fine. What's the point...?
Cheat sheet for reading Intel CPU codes: if the number starts with 10 and contains a G, it's Ice Lake. Otherwise, if there is a Y somewhere, it's ultra low power. A final U means 15W, rarely 28W. A final T means a 35W desktop chip. E means embedded. H means 45W mobile. All of these are Core chips; a first letter of N, J, or Z means Atom. The first digit (or two digits for 10) is the generation, which became an absolute mess past the 7th gen; the most important change is that 8-U is quad core where 7-U was dual core.
They really aren't comparable. For example, standby power for Lakefield appears to be 2.5mW. Idle is probably significantly lower as well. They're targeting different use cases.
I can't find any figures for the standby power for Ryzen 4000 series, but standby is basically off. The standby power consumption should be negligible in both cases.
It's also not obvious what that use case would be. You can put most laptops into standby and they'll run on battery like that for many weeks. What's the thing that needs more than that?
It would be interesting to compare idle power consumption, but for that we'd have to know what it actually is.
Also, don't most x86 CPU's support being powered off entirely during standby? Ie. flush the caches, write the registers back to RAM, set RAM to self-refresh, and then write some config registers in some power control IC to cut the power to the CPU.
Then it doesn't matter what the standby power consumption is - it simply matters how quickly it can get back into a working state from OFF.
However those are Zen+ fabricated on 14nm, so not really the best that AMD could do in this area. They could fight either with a slowed down Renoir (that could meet the TDP requirements while not requiring any development) or a specific processor based on Zen2.
The battery costs money too, so roll it into the denominator.
The other questions are perf/gram and perf/W (the latter matters more for temperature management if you’ve rolled batteries into the other two metrics)
They do say they’ve improved idle wattage by an order of magnitude. That’s good. (Imagine a laptop that stays on with the screen off for a week. Some idle cell phones can do that.)
It all sounds like a respin of the Cell Processor concept from the PS3 .. big standard CPU core surrounded by little DSP like cores that you have to farm out work to and manage and keep fed.
I'm unsure of how this is being positioned in terms of CPU "horsepower". Is this supposed to be something like the next gen version of their "U" chips? More powerful? Less powerful but more portable with lower watt/TDP?
Also the "stacking" approach is really interesting, I wonder how far that approach may go though-- heat dissipation seems like it would quickly become a problem with multiple layers.
It would be nice if they could shift some of the CPU cores towards asynchronous circuit design (https://en.wikipedia.org/wiki/Asynchronous_circuit) and that way be able to run those parts at a variable clock rate, instead of one part at this frequency and another at a slightly lower frequency.
For power efficiency under dynamic loads, an asynchronous circuit design would win over current solutions. However, designing asynchronous circuits is an order of magnitude more complex than designing synchronous ones.
But for parts like AVX, extensions that will only be available on the main core, that would pay dividends. Then again, they may find they can offer such extensions as a separate chiplet/stack and negate the question of whether a given core supports them, since all cores could tap into it.
What I'd find interesting would be the actual design and instruction set under the hood; x86 gets translated via microcode, and I do wonder how much the underlying way things work has diverged from that original instruction set.
Async design is basically snake oil AFAIK that's only produced cores in the 10Ks of gates.
And there are still benefits to lower-speed sections, as there are physical differences in the transistors that trade power consumption against switching speed, and that would carry over into async designs.
On top of that, part of what we're seeing is dark silicon and the specifics of Dennard scaling. You can't light up the whole chip and not melt the chip, so you're going to see mobile TDP chips where you turn half the chip on or off at a time either way.
Thank you for that; I had somewhat overlooked the aspect that the parts not in use would in effect act as heat sinks, and the whole big/little approach does in many ways give you easily controlled areas, which for multi-core designs can work well.
But the biggest issue with any growth in adoption of async design is the tools and skills to do such work. It does prove hard to compare, though, and most of the development in the async area has been driven by the EM advantages for space-based usage.
It's easier to trademark words you make up, is what I imagine the explanation is. Apparently they use face-to-face connections instead of vias with that, which limits them to two layers, but that's all that's needed here.
I mean, if it can do minimalist 3D with OpenGL ES, with good performance, or the same performance as a PlayStation 2, it would still be interesting, since high-end, bleeding-edge GPU graphics are not always interesting for everybody (and developing on bleeding-edge GPUs seems to require a lot of work).
A dedicated GPU always made sense for high performance, but at some point, having a console or a gaming PC that can run games with a single big chip might not be a bad idea. I have been able to play wow classic on a laptop's i5 with integrated graphics, and it was just fine.
Although I'm not sure if that new CPU is just that kind of "hybrid" design.