
That sounds really slow.



There are processor architectures (e.g. MIPS) that can't do unaligned memory access; Linux will catch the fault, emulate the access in the kernel, and return to the program. On every unaligned access.
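
For the curious, the fixup boils down to something like this (a minimal sketch of a 32-bit little-endian load; the real kernel handler first decodes the faulting instruction out of the trap frame and writes the result back into the saved register state):

    /* Sketch only: reassemble an unaligned 32-bit little-endian load from
       four byte loads, which are always aligned. */
    #include <stdint.h>

    static uint32_t emulate_unaligned_load32(const uint8_t *p)
    {
        return (uint32_t)p[0]
             | (uint32_t)p[1] << 8
             | (uint32_t)p[2] << 16
             | (uint32_t)p[3] << 24;
    }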

At least this only has to be done once, and frankly these AVX instructions are slow initially anyway.


Yeah, and that was suuuper slow and everyone told you to not do it for exactly that reason…


It's absurdly slow on hot paths, yet entirely unnoticeable to the vast majority of software out there.


Most performance anti-patterns in almost any program are not important at all, as most of the code is not on the hot path.


As it happens, MIPS is also practically dead as a general-purpose architecture...


It also happens on 32-bit ARM, which is maybe on its way out, but still way more common.


And MIPS is slow as a dog, in part as a result of software handlers like this (TLB refills and page-table walks are handled in software too, IIRC).


> At least this only has to be done once, and frankly these AVX instructions are slow initially anyway.

FWIW I believe that only applies to some of them, namely some (or all?) of the AVX-512 instructions. I think AVX2 is implemented on the main die and doesn't have to be powered up first.


https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

AVX (256-bit) instructions also suffer a penalty. Both 256-bit and 512-bit instructions caused a roughly 9-microsecond period at about a quarter of the normal instructions per clock, and the 512-bit instructions caused an additional roughly 11-microsecond stall during which no instructions executed at all.

The first penalty was associated with voltage, and the second with frequency. Heavier 256-bit instructions would probably have triggered the frequency transition as well.
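
If you want to see the warm-up on your own machine, a rough sketch (not the blog's methodology) is to time short bursts of dependent 256-bit FMAs with rdtsc and watch the first bursts come out slower. Assumes x86-64 with AVX2/FMA, built with -mavx2 -mfma:

    /* Rough sketch: the first burst or two should take noticeably more
       cycles while the CPU is still in its reduced-IPC warm-up period. */
    #include <immintrin.h>
    #include <x86intrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m256d acc = _mm256_set1_pd(1.0);
        const __m256d mul = _mm256_set1_pd(1.000001);
        const __m256d add = _mm256_set1_pd(0.000001);

        for (int burst = 0; burst < 20; burst++) {
            unsigned long long t0 = __rdtsc();
            for (int i = 0; i < 10000; i++)
                acc = _mm256_fmadd_pd(acc, mul, add);   /* dependent 256-bit FMAs */
            unsigned long long t1 = __rdtsc();
            printf("burst %2d: %llu cycles\n", burst, t1 - t0);
        }

        double out[4];
        _mm256_storeu_pd(out, acc);
        volatile double sink = out[0];                  /* keep the work alive */
        (void)sink;
        return 0;
    }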


It only needs to be done once, at process launch. Once the processor affinity is set, the process will stick to the big core at full speed for the rest of its lifetime.
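
For reference, pinning a process from user space is just a sched_setaffinity(2) call. A sketch, with core 4 as an arbitrary stand-in for a big core (real numbering is platform-specific):

    /* Pin the calling process to one CPU.  Core 4 is a hypothetical "big"
       core; pid 0 means "this process". */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(4, &set);

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to CPU 4\n");
        return 0;
    }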


It only has to happen once.


Once per process. I doubt operating systems will start modifying executables and libraries on-disk to tag them as using relevant instruction set extensions.

And generally speaking, when an application first starts issuing SIMD instructions, that's probably not a great time to be interrupting it, even if it only needs to happen once.


Once per process is perfectly fine.

The issue is that once you have an active process using a feature only available on the larger cores, you can't shut off the larger cores to save power without paying a large wake-up latency for that process.


You could imagine it as an ELF note of some kind. It's not a direction I'd love to go in, and catching illegal-instruction traps seems fast enough that the more complicated design doesn't seem worthwhile.
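
For illustration, emitting such a note from C is easy enough; the name "ISAHint", the type, and the payload below are all made up, and nothing would consume them without a corresponding loader-side ABI:

    /* Purely illustrative ELF note (GNU as syntax, x86-64: "@note" marks the
       section SHT_NOTE).  Check it with: readelf -n ./a.out */
    __asm__(
        ".pushsection .note.isahint, \"a\", @note\n"
        ".balign 4\n"
        ".long 8\n"              /* namesz: strlen("ISAHint") + NUL */
        ".long 4\n"              /* descsz: one 32-bit word of payload */
        ".long 1\n"              /* type: invented value */
        ".asciz \"ISAHint\"\n"   /* name, 8 bytes, already 4-byte aligned */
        ".balign 4\n"
        ".long 0x1\n"            /* desc: hypothetical "uses AVX-512" bit */
        ".popsection\n"
    );

    int main(void) { return 0; }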


Well, once, right when it's about to try to wring maximum performance out of the processor. And if you only do it once, you've ensured that your smaller cores never get used…


A single trap? That's not slow.


A trap, followed by migrating the process to a different core, which may have been powered off and definitely has cold caches. Context switches don't get any worse than that.
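
The migration part is measurable from user space: changing your own affinity to exclude the current core forces the kernel to move you before the call returns. A rough sketch bouncing between cores 0 and 1 (arbitrary choices); the number includes the syscall plus the migration, but not the cold-cache misses that follow:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    /* Pin the calling thread to a single CPU. */
    static void pin(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);
    }

    int main(void)
    {
        const int iters = 1000;
        struct timespec t0, t1;
        double total_ns = 0;

        pin(0);
        for (int i = 0; i < iters; i++) {
            int target = (i % 2 == 0) ? 1 : 0;          /* always a real move */
            clock_gettime(CLOCK_MONOTONIC, &t0);
            pin(target);               /* returns once we run on the target */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            total_ns += (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        }
        printf("%.0f ns per forced migration (avg)\n", total_ns / iters);
        return 0;
    }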


If the core has been powered off, then that shows how rare it is.


Yes, so you pay one really bad context switch penalty per process.


Oh, I thought the suggestion was to trap to the large core every time an unsupported instruction was executed and let the process drift back to the smaller one later. Which would be slow I would assume. If you never switched back, wouldn't any process using advanced vectorized instructions (like anything using a decent libc) be permanently pinned to the large core?


> Oh, I thought the suggestion was to trap to the large core every time an unsupported instruction was executed and let the process drift back to the smaller one later.

Yes that's the idea.

> Which would be slow I would assume.

How expensive do you think a trap is? It's on the order of 10 billionths of a second.

> If you never switched back, wouldn't any process using advanced vectorized instructions (like anything using a decent libc) be permanently pinned to the large core?

I think you can switch back next time you schedule.


> How expensive do you think a trap is? It's on the order of 10 billionths of a second.

Are you sure about that? I would expect at least a couple of orders of magnitude more just for the userspace->kernel transition.

edit: for what it's worth, a syscall takes 250 ns on my (admittedly vintage) machine. That's using the low-latency sysenter path. An interrupt is probably going to cost more.

Anyway, the cost of scheduling on another core is going to dwarf that.

edit2: for reference, this was a Sandy Bridge turboing at 3.5 GHz during the test, with Spectre mitigations on (which is going to be a good chunk of that overhead).
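
If you want a comparable number on your own machine, timing a cheap syscall in a tight loop gets you in the ballpark. A rough Linux-specific sketch, going through syscall(2) directly so we actually enter the kernel every iteration:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);           /* a real kernel round trip each time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per syscall\n", ns / iters);
        return 0;
    }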


> I think you can switch back next time you schedule.

Ok, yes, then we're on the same page. I would still think that would be slow? You'd need a full transition-to-kernel and context switch before you could execute again, which AFAIK would take at least microseconds…unless you think there would be a faster path to resume execution?


> which AFAIK would take at least microseconds…

No, that's around 30 ns on modern hardware, I believe.



