Golang proudly serves as the archetype of a particular kind of language design hubris which seems to be spreading like a disease lately: the attitude that although feature X was required for my implementation, you shouldn't need it or can't be trusted with it. It's famously the language that aims to be "90% perfect, 100% of the time". In practice, for new algorithms or novel implementations using newer or obscure hardware features, that 90% should be read as being measured on a logarithmic scale.

The standard distribution of Golang has historically been unreceptive to providing intrinsics, to the capability for inlining assembly function calls (note: not inlining assembly source), and in general to proposals that would trivially extend the language or standard library to address performance concerns whose need would not be considered controversial in any other context (I'm not talking about generics here). In quite a few cases it has been communicated that certain features will never be made available because fuck you that's why, although in recent releases there has been a trend of backtracking on these "promises" under the onslaught of justifiable need.

These conscious "trade-offs" severely impact the specific cases where dedicated hardware instructions exist that can provide improvements of sometimes orders of magnitude, but that are not yet encapsulated by builtins or blessed with snowflake exceptions in the standard library.

Your options in Golang for implementing the kinds of compute-bound tight loops which benefit from specialized instructions are:

1) Write the whole thing in Golang assembly so you only pay the price of a function call once, on entry (let's call this 0.4ns per operation for a 2.5GHz CPU using a single-cycle-latency instruction).
2) Use hilarious bit-twiddling hacks that the compiler can inline and that perform the equivalent operation with 10-20 cycles of latency (4-8ns).
3) Use a Golang assembly library which presents a single invocation of the hardware instruction as a Golang function, and have all of the gains absorbed by the function call overhead (~10ns).
4) Use the primitives available in the standard library to implement the function (as a general rule, not worth the time to optimize and comparatively benchmark for this class of function).
5) Fork Golang, implement what you need, offer it as a pull request, and have upstream tell you to just go back to C or assembly if you care about performance (literally).
6) Target gccgo/(llgo?) (much better for this specific class of function but less performant in other areas; a different set of trade-offs).

In short: optimization trade-offs.