
That sounds really slow.



There are processor architectures (e.g. MIPS) that can't do unaligned memory access; Linux will catch the fault, emulate the access in the kernel, and return to the program. On every unaligned access.
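
For the curious, the fixup boils down to something like this (a minimal sketch of a 32-bit little-endian load; the real kernel handler first decodes the faulting instruction out of the trap frame and writes the result back into the saved register state):

    /* Sketch only: reassemble an unaligned 32-bit little-endian load from
       four byte loads, which are always aligned. */
    #include <stdint.h>

    static uint32_t emulate_unaligned_load32(const uint8_t *p)
    {
        return (uint32_t)p[0]
             | (uint32_t)p[1] << 8
             | (uint32_t)p[2] << 16
             | (uint32_t)p[3] << 24;
    }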

At least this only has to be done once, and frankly these AVX instructions are slow initially anyway.


Yeah, and that was suuuper slow and everyone told you to not do it for exactly that reason…


It's absurdly slow on hot paths, yet entirely unnoticeable to the vast majority of software out there.


Most performance anti-patterns in almost any program are not important at all, as most of the code is not on the hot path.


As it happens, MIPS is also practically dead as a general-purpose architecture...


It also happens on 32-bit ARM, which is maybe on its way out, but still way more common.


And MIPS is slow as a dog, in part as a result of software handlers like this (TLB refills and page-table walks are handled in software too, IIRC).


> At least this only has to be done once, and frankly these AVX instructions are slow initially anyway.

FWIW I believe that only applies to some of them, namely some (or all?) of the AVX-512 instructions. I think AVX2 is implemented on the main die and doesn't have to be powered up first.


https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

AVX (256-bit) instructions also suffer a penalty. Both 256-bit and 512-bit instructions caused a roughly 9-microsecond period at about a quarter of the normal instructions per clock, and the 512-bit instructions caused an additional roughly 11-microsecond stall during which no instructions executed at all.

The first penalty was associated with voltage, and the second with frequency. Heavier 256-bit instructions would probably have triggered the frequency transition as well.
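
If you want to see the warm-up on your own machine, a rough sketch (not the blog's methodology) is to time short bursts of dependent 256-bit FMAs with rdtsc and watch the first bursts come out slower. Assumes x86-64 with AVX2/FMA, built with -mavx2 -mfma:

    /* Rough sketch: the first burst or two should take noticeably more
       cycles while the CPU is still in its reduced-IPC warm-up period. */
    #include <immintrin.h>
    #include <x86intrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m256d acc = _mm256_set1_pd(1.0);
        const __m256d mul = _mm256_set1_pd(1.000001);
        const __m256d add = _mm256_set1_pd(0.000001);

        for (int burst = 0; burst < 20; burst++) {
            unsigned long long t0 = __rdtsc();
            for (int i = 0; i < 10000; i++)
                acc = _mm256_fmadd_pd(acc, mul, add);   /* dependent 256-bit FMAs */
            unsigned long long t1 = __rdtsc();
            printf("burst %2d: %llu cycles\n", burst, t1 - t0);
        }

        double out[4];
        _mm256_storeu_pd(out, acc);
        volatile double sink = out[0];                  /* keep the work alive */
        (void)sink;
        return 0;
    }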


It only needs to be done once, at process launch. Once the processor affinity is set, the process will stick to the big core at full speed for the rest of its lifetime.
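
For reference, pinning a process from user space is just a sched_setaffinity(2) call. A sketch, with core 4 as an arbitrary stand-in for a big core (real numbering is platform-specific):

    /* Pin the calling process to one CPU.  Core 4 is a hypothetical "big"
       core; pid 0 means "this process". */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(4, &set);

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to CPU 4\n");
        return 0;
    }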


It only has to happen once.


Once per process. I doubt operating systems will start modifying executables and libraries on-disk to tag them as using relevant instruction set extensions.

And generally speaking, when an application first starts issuing SIMD instructions, that's probably not a great time to be interrupting it, even if it only needs to happen once.


Once per process is perfectly fine.

The issue is that once you have an active process using a feature only available on the larger cores, you can't shut off the larger cores to save power without paying a large wake-up latency for that process.


You could imagine it as an ELF note of some kind. It's not a direction I'd love to go in, and catching illegal-instruction traps seems fast enough that the more complicated design doesn't seem worthwhile.
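
For illustration, emitting such a note from C is easy enough; the name "ISAHint", the type, and the payload below are all made up, and nothing would consume them without a corresponding loader-side ABI:

    /* Purely illustrative ELF note (GNU as syntax, x86-64: "@note" marks the
       section SHT_NOTE).  Check it with: readelf -n ./a.out */
    __asm__(
        ".pushsection .note.isahint, \"a\", @note\n"
        ".balign 4\n"
        ".long 8\n"              /* namesz: strlen("ISAHint") + NUL */
        ".long 4\n"              /* descsz: one 32-bit word of payload */
        ".long 1\n"              /* type: invented value */
        ".asciz \"ISAHint\"\n"   /* name, 8 bytes, already 4-byte aligned */
        ".balign 4\n"
        ".long 0x1\n"            /* desc: hypothetical "uses AVX-512" bit */
        ".popsection\n"
    );

    int main(void) { return 0; }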


Well, once, right when it's about to try to wring maximum performance out of the processor. And if you only do it once, you've ensured that your smaller cores never get used…


A single trap? That's not slow.


A trap, followed by migrating the process to a different core, which may have been powered off and definitely has cold caches. Context switches don't get any worse than that.
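
The migration part is measurable from user space: changing your own affinity to exclude the current core forces the kernel to move you before the call returns. A rough sketch bouncing between cores 0 and 1 (arbitrary choices); the number includes the syscall plus the migration, but not the cold-cache misses that follow:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    /* Pin the calling thread to a single CPU. */
    static void pin(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);
    }

    int main(void)
    {
        const int iters = 1000;
        struct timespec t0, t1;
        double total_ns = 0;

        pin(0);
        for (int i = 0; i < iters; i++) {
            int target = (i % 2 == 0) ? 1 : 0;          /* always a real move */
            clock_gettime(CLOCK_MONOTONIC, &t0);
            pin(target);               /* returns once we run on the target */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            total_ns += (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        }
        printf("%.0f ns per forced migration (avg)\n", total_ns / iters);
        return 0;
    }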


If the core has been powered off, then that shows how rare it is.


Yes, so you pay one really bad context switch penalty per process.


Oh, I thought the suggestion was to trap to the large core every time an unsupported instruction was executed and let the process drift back to the smaller one later. Which would be slow I would assume. If you never switched back, wouldn't any process using advanced vectorized instructions (like anything using a decent libc) be permanently pinned to the large core?


> Oh, I thought the suggestion was to trap to the large core every time an unsupported instruction was executed and let the process drift back to the smaller one later.

Yes that's the idea.

> Which would be slow I would assume.

How expensive do you think a trap is? It's on the order of 10 billionths of a second.

> If you never switched back, wouldn't any process using advanced vectorized instructions (like anything using a decent libc) be permanently pinned to the large core?

I think you can switch back next time you schedule.


> How expensive do you think a trap is? It's on the order of 10 billionths of a second.

Are you sure about that? I would expect at least a couple of orders of magnitude more just for the userspace->kernel transition.

edit: for what it's worth, a syscall takes 250 ns on my (admittedly vintage) machine. That's using the low-latency sysenter path. An interrupt is probably going to cost more.

Anyway, the cost of scheduling on another core is going to dwarf that.

edit2: for reference, this was a Sandy Bridge turboing at 3.5 GHz during the test, with Spectre mitigations on (which is going to be a good chunk of that overhead).
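
If you want a comparable number on your own machine, timing a cheap syscall in a tight loop gets you in the ballpark. A rough Linux-specific sketch, going through syscall(2) directly so we actually enter the kernel every iteration:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);           /* a real kernel round trip each time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per syscall\n", ns / iters);
        return 0;
    }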


> I think you can switch back next time you schedule.

Ok, yes, then we're on the same page. I would still think that would be slow? You'd need a full transition-to-kernel and context switch before you could execute again, which AFAIK would take at least microseconds…unless you think there would be a faster path to resume execution?


> which AFAIK would take at least microseconds…

No, that's around 30 ns on modern hardware, I believe.



