
> Now the OS has to pin processes based on support for different instructions?

The only complication here would be if they have differing extensions like AVX512, but that's easily solved by the OS by just advertising the common baseline. Nothing about this looks difficult to support?




> by the OS by just advertising the common baseline.

What does that mean? Advertise to whom? The process/process loader? Does it mean that I can’t compile with -mavx2 anymore? What if I do?

The extensions are the whole problem.


> What does that mean? Advertise to whom?

Runtime detection is the process querying what extensions are available, and then selectively using those. You adjust what the query returns to only return the common set.

Runtime detection has been a pretty standard thing for well over a decade now - it's how we all manage to run the same compiled binaries over the years despite variability in SSE & AVX support. You don't download different versions of Chrome/Photoshop/Gimp/Premiere/Blender/Whatever compiled for different CPU micro-architectures, do you? You might if you run Gentoo I suppose, but that'd be about it.
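
For illustration, that detection typically looks something like this in C; a minimal sketch using the GCC/Clang builtin, which boils down to a cached CPUID query:

    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();   /* initialize the feature cache (GCC/Clang builtin) */
        if (__builtin_cpu_supports("avx2"))
            puts("dispatching to the AVX2 code path");
        else
            puts("dispatching to the baseline code path");
        return 0;
    }

Real code would branch to actual implementations rather than printing, but the shape is the same.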

> Does it mean that I can’t compile with -mavx2 anymore?

You already can't if you're shipping binaries to users unless you only support Skylake & newer? There's a lot of CPUs currently in use that don't support AVX2. So... you either already have this problem and you're familiar with it, or you're not doing this and it's moot.


> Runtime detection is the process querying what extensions are available, and then selectively using those. You adjust what the query returns to only return the common set.

Except that they almost always do this runtime detection once, on startup, and then choose/thunk codepaths accordingly. If the OS just happens to start my avx2 process on a little core (and how is it going to know better?), that's going to turn off all of my optimizations, regardless of where the process subsequently gets migrated to.
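
That caching pattern looks roughly like this; a sketch, not any particular library's code, with made-up names:

    /* "Detect once, cache forever": if the very first call happens to run on a
     * little core without AVX2, the scalar path is locked in for the life of
     * the process, even after the thread later migrates to a big core. */
    #include <stddef.h>
    #include <string.h>

    static void copy_scalar(char *d, const char *s, size_t n) { memcpy(d, s, n); }

    /* placeholder body; a real one would use AVX2 intrinsics */
    static void copy_avx2(char *d, const char *s, size_t n) { memcpy(d, s, n); }

    static void copy_resolve(char *d, const char *s, size_t n);

    /* starts out pointing at the resolver; overwritten on first use */
    static void (*copy_impl)(char *, const char *, size_t) = copy_resolve;

    static void copy_resolve(char *d, const char *s, size_t n)
    {
        copy_impl = __builtin_cpu_supports("avx2") ? copy_avx2 : copy_scalar;
        copy_impl(d, s, n);
    }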

> You already can't if you're shipping binaries to users unless you only support Skylake & newer? There's a lot of CPUs currently in use that don't support AVX2. So... you either already have this problem and you're familiar with it, or you're not doing this and it's moot.

Except nobody in 30 years of x86 dev expects to get a different answer from CPUID during runtime.


If all the cores are configured to advertise the lowest common denominator instructions it will work.


But that defeats the purpose of supporting any extensions at all in the big core that the little core doesn't support. Software will get the lowest common denominator answer and just not use avx2. So why support it in the first place? Why not just do the right thing and have uniform extension support like big.LITTLE?


Currently Intel likes to disable features on some models of Core cores for product segmentation reasons. I'm pretty sure processors sold under the Pentium or Celeron brands already have enough features fused off that they're roughly comparable to the Atoms this is being paired with. They also fuse off things like sections of cache or whole cores that have manufacturing defects.


Chips aren't designed from scratch. They're assembled out of previously designed components and in this case Core cores and Atom cores were never designed to work together.


No - but that just means that Intel shouldn't do this at all. Either don't support stuff like AVX and AVX2 in the big core by disconnecting those blocks, or support a slow microcode version of AVX and AVX2 in the little cores. Supporting different extensions within a CPU used with modern preemptive OSes doesn't make any sense.


> Either don’t support stuff like avx and avx2 in the big core by disconnecting those blocks

That's partly what they did. From the article: "One thing we can confirm in advance – the Sunny Cove does not appear to be AVX-512 enabled."

Maybe they also fused off AVX & AVX2 support in the Sunny Cove core as well, we'll see.

And disabling AVX in cores that otherwise support it is already a common thing - see the Pentium & Celeron lineups that Intel currently sells. They don't have AVX/AVX2, even though the cores inside them definitely could offer it.


.. and within a function because the scheduler decided to move you to a different core.


The typical x86 extension query (CPUID) is an unprivileged user-mode instruction; unless the application takes care to ask the OS instead, there's no real way for the OS to select the common denominator.

Edit: Also: AVX2 is a lot older than Skylake. You're probably thinking of AVX512.
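
To make the point concrete, here's a minimal sketch using GCC/Clang's <cpuid.h> helper; nothing in this goes through the kernel:

    /* CPUID is just an ordinary user-mode instruction. AVX2 is reported in
     * leaf 7, subleaf 0, EBX bit 5. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 5)))
            puts("CPUID says this core has AVX2");
        else
            puts("CPUID says this core has no AVX2");
        return 0;
    }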


The OS could still intercept CPUID - this is, after all, what VMs do.

But it looks like Intel is doing this anyway; Sunny Cove in this application has had its AVX-512 removed: "One thing we can confirm in advance – the Sunny Cove does not appear to be AVX-512 enabled."


Running every thread in a hypervisor just to trap cpuid sounds like extreme overkill. And probably some applications want the features only available on the big core, and masking those out doesn't solve that end of the problem.


Well then you need to set up a VMM for the OS…


> You adjust what the query returns to only return the common set.

But this shoots the big core in the foot. You run at the lowest common denominator and the big core doesn't have the advantage of higher clocks. And this might just lead to a lot of software that only runs on the big core.

Imagine you wanted to fly around between multiple points but if you want the option to ever switch to a bus then your plane will circle around each airport until it's as slow as the bus. Or you can opt for "plane only".


It only shoots the big core in the foot for things that would make meaningful use of the extensions present on the big cores but not on the little ones. Which in a 7w application is what, exactly?


Anything that uses any AVX or FMA3 off the top of my head.

A better mix would have been simply using lower-clocked, lower-power cores of the same type, or really close derivatives of the big core, where the manufacturing process and clocks are what keep power low - not a mix of Ice Lake and Atom. But right now Intel seems willing to throw anything at the wall to see what sticks.

And it seems like a good way for developers to make sure their software stays on the big core.


The fact that they put the extensions in implies they expect meaningful use, doesn't it? If not on this specific part, on a future part with multiple big cores.


Meaningful use in a particular market != meaningful use in all markets.

AVX2/AVX-512 is great in HPC workloads, for example. But nobody is running an HPC workload on a 7w netbook, now are they?

What is useful on Xeon and what is useful on Atom are different. This is an Atom-class SoC used in Atom-class applications, not a Xeon-class one.


That doesn't address what I said at all. This is about extensions that are in this Atom-class SoC.


memcpy?


>> Does it mean that I can’t compile with -mavx2 anymore?

> You already can't if you're shipping binaries to users unless you only support Skylake & newer?

Huh? My 6+ years old gaming PC supports AVX2, and it definitely doesn't have a "Skylake & newer" CPU!

AVX2 support started with Haswell, i.e. 4th-generation Intel Core. We're at gen 10 now.

While there certainly are still a lot of systems without AVX2 support, new games (and other performance hungry software) requiring it would not be completely unreasonable.
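
There's also a middle ground that doesn't require raising the whole binary's baseline: compiler function multi-versioning. A sketch with GCC (and recent Clang) on x86; the function is mine, for illustration:

    /* The compiler emits an AVX2 clone and a baseline clone, and an IFUNC
     * resolver picks one at load time with the usual CPUID check, so the rest
     * of the binary doesn't need -mavx2. (Note the choice is still made once
     * per process, which is exactly the heterogeneous-core concern upthread.) */
    #include <stddef.h>

    __attribute__((target_clones("avx2", "default")))
    void scale(float *x, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)
            x[i] *= k;
    }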


Intel is still launching new processors with AVX2 disabled. They use AVX2 support for product segmentation and disable it on low-end parts. So a Comet Lake Pentium Gold CPU launched in Q2 2020 doesn't support AVX2.

(I recall a recent news story about how unawareness of this among people writing or documenting compilers has started to cause problems.)


Using the ISA for market segmentation purposes is such a classic from the "Bad Intel Ideas" basket...


That's very surprising, disappointing and extremely short-sighted from Intel.

> (I recall a recent news story about how unawareness of this among people writing or documenting compilers has started to cause problems.)

You can't blame them!


The OS can just turn off instruction extensions by setting the XCR0 register to whatever it wants. The affected instructions will then cause invalid-opcode faults (SIGILL on Linux). Whether that's what the OS wants to do is a totally separate question. It would be kinda funny if this led to processes being able to request a certain XCR0 that the kernel would then serve, because I've been asking for that feature for unrelated reasons for quite some time and didn't get the warmest response to it on the mailing lists.
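
The application-visible side of that is the standard OSXSAVE/XGETBV dance; a minimal sketch (assuming GCC/Clang on x86-64) of what a well-behaved AVX check looks like, so a kernel that masks AVX state out of XCR0 is detected cleanly instead of the program faulting later:

    #include <cpuid.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool os_enables_avx(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return false;
        if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))   /* OSXSAVE, AVX */
            return false;

        unsigned int lo, hi;
        __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));  /* read XCR0 */
        return (lo & 0x6) == 0x6;   /* both XMM and YMM state enabled by the OS */
    }

    int main(void)
    {
        puts(os_enables_avx() ? "AVX usable" : "AVX not usable");
        return 0;
    }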


Wait for OpenMultiCore, which will standardize heterogeneous CPU sets.


This is not ARM; on Intel, the CPUID instruction can also be used by normal unprivileged programs. Not all programs look only at /proc/cpuinfo or the ELF auxiliary vector.


> The only complication here would be if they have differing extensions like AVX512

Which they do, Atom tops out at SSE4.2.


> Atom tops out at SSE4.2.

SSE itself tops out at 4.2, but Tremont does support the newer SHA extensions.

AVX appears to be the only missing thing, but AVX isn't even standard across Intel's other lines. The Pentium line doesn't support AVX either, for example, even though those parts use Skylake & newer micro-architectures.

So you currently can't assume AVX support, and you still won't be able to assume AVX support. Why does this matter?


> Why does this matter?

Operating systems move threads between cores. If those cores support different features, threads that are migrated to low-feature cores might experience illegal instruction traps despite correctly checking for instruction features.


Maybe standard application code all runs on the big core, and the little cores are like the system assist processors on IBM mainframes - used to run OS jobs or specific application support code, to free up the main processor for other work (or idling).

For this to work, there would need to be a small set of undemanding tasks that account for a lot of machine time. Feeding video to the GPU? All sorts of GUI compositing and housekeeping? Handling network connections in the browser?

I don't think this is a good explanation, but it's fun to think about.


A lot of things are background tasks until suddenly, without warning, the user ends up waiting for them to happen. Take your example of handling network connections: this can definitely be a background thing! Your computer might want to keep an IMAP connection open, periodically poll a CalDAV server, sync photos with your phone, etc., and all of these would be very reasonable things to run on low-power CPU cores. Kick them over to a wimpy core, throttle down the frequency to its most power-efficient setting, insert big scheduler delays to coalesce timer wake-ups, whatever. Good stuff.

But what happens when the user opens up a photo viewer app and suddenly wants those photos to be synced right now?

If your code is running on a recent iPhone -- the heterogeneous-core platform I'm most familiar with -- then the answer is that the kernel will immediately detect the priority inversion when a foreground process does an IPC syscall, bump up the priority of the no-longer-background process, probably migrate it to the fastest core available, and make it run ASAP. Then, once the process no longer has foreground work to do, it can go back to more power-efficient scheduling.

This kind of pattern is super common, and it would be way more annoying and perilous to try to split tasks into always-foreground and always-background.
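
For what it's worth, the application side of that on Apple platforms is mostly just declaring QoS and letting the kernel handle placement and boosting. A rough sketch with libdispatch; the function names here are made up:

    /* Work submitted at background QoS is eligible for the efficiency cores,
     * and the kernel can boost it later if a user-visible task ends up
     * waiting on its result. */
    #include <dispatch/dispatch.h>

    static void sync_photos(void *ctx)
    {
        (void)ctx;
        /* ... long-running background sync ... */
    }

    void start_photo_sync(void)
    {
        dispatch_async_f(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0),
                         NULL, sync_photos);
    }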


The standard operating system model today is mostly to do what the application(s) request and then get out of the way quickly. I'd expect to cede all cores to applications most of the time, rather than reserving the low power cores for OS tasks.


That would be pretty disappointing if they can't handle normal processes too. It's easy to end up with a bunch of processes that are using small amounts of CPU but aren't dedicated background tasks.

Even just looking at a browser, I might have half a dozen generic tab processes open, each using a small amount of CPU. But then I navigate one to a game, and I want that particular tab to get near-exclusive access to the big core while the others use only the little cores.


Audio is a big one. You usually want to peg a core, for latency, but it doesn't have to be a particularly fast core.


They can easily disable the differing extensions in the BIG core.


Won't be BIG anymore if it's gimped, will it?


It will. The performance difference between an Intel Atom and a Core i3 has basically nothing to do with differing instruction set extensions. AVX2 support is not why a Core i3 runs circles around an Atom, particularly since the overwhelming majority of instructions executed are not AVX anyway.


I'm pretty sure my computer does a lot of AVX while it matches virtual backgrounds to my video feed, or while it tries to figure out voice commands.

All that can be done with SSE and probably with x87 instructions, but, still, software will try to pick the best option.


Binaries need to target a minimum subset, but can use run time feature detection to query differences.

If a binary does that, the OS can just not migrate that binary across cores.


How would the OS know?

The only mechanism I can come up with is to detect illegal instruction traps on small cores, then flag the thread as big-core-only and re-start the execution at the bad instruction. That's not ideal but maybe workable.

(If it traps on the big core, too, it's just a bad instruction and SIGILL is raised to userspace like usual.)
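
Just to make the idea concrete, a rough user-space approximation; the big-core CPU IDs are invented, and as noted a real version would live in the kernel scheduler rather than a signal handler:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>

    static volatile sig_atomic_t pinned_to_big;

    static void on_sigill(int sig)
    {
        if (pinned_to_big) {          /* faulted on a big core too: genuinely bad */
            signal(sig, SIG_DFL);     /* let the re-execution raise SIGILL normally */
            return;
        }
        pinned_to_big = 1;

        cpu_set_t big;
        CPU_ZERO(&big);
        for (int cpu = 0; cpu < 4; cpu++)   /* hypothetical big-core IDs */
            CPU_SET(cpu, &big);

        /* Pin this thread to the big cores; returning from the handler
         * re-executes the faulting instruction, which should now succeed there. */
        sched_setaffinity(0, sizeof(big), &big);
    }

    int main(void)
    {
        signal(SIGILL, on_sigill);
        /* ... run feature-detected code as usual ... */
        return 0;
    }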



