I'd like to see some numbers comparing "backwards compatible" x86_64 performance with "bleeding edge" x86_64. That was something I had never considered, but it seems obvious in hindsight that you cannot use any modern instruction sets if you want to retain binary compatibility with all x86_64 systems.
.NET Core did quite a bit of work on introducing "hardware intrinsics". They updated their standard libraries to use those intrinsics and let the JIT optimize userland code. .NET Core performance is pretty good, if not the best [1].
I have myself seen 2-3x improvements when migrating older .NET applications to .NET Core (called just .NET from version 5 onwards). I think JITed code has a unique advantage here: hardware-specific optimizations are applied automatically by the JIT, and newer JIT versions bring further performance improvements as the ISAs change.
I suspect the difference will not be very large, as compilers are fairly bad at automatic vectorization. Most of the places that could benefit reside in your libc anyway, and those are certainly tuned to your processor model.
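To illustrate the auto-vectorization point, here is a rough sketch of the kind of loop compilers usually handle versus one they usually don't (illustrative only; actual behavior depends on the compiler and flags, e.g. gcc -O3 -fopt-info-vec):

    #include <stddef.h>

    /* Simple reduction with no loop-carried dependence: most compilers
       will vectorize this at -O3, using wider vectors with -march=native. */
    int sum(const int *a, size_t n) {
      int s = 0;
      for (size_t i = 0; i < n; i++) s += a[i];
      return s;
    }

    /* Each iteration depends on the previous one, so compilers generally
       cannot vectorize this as written, no matter what -march says. */
    void prefix_sum(int *a, size_t n) {
      for (size_t i = 1; i < n; i++) a[i] += a[i - 1];
    }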
I dunno, the "Trickle-Down Performance" section on Cosmopolitan describing the hand-tuned memcpy optimization gives me the impression that there might still be quite a lot of missed optimization opportunities there due to pessimistic assumptions about register use:
I probably agree with Justine about instruction cache bloat for these functions, but I remain unconvinced that diverging from System-V is worth the tradeoffs. That discussion would likely be lengthy and mostly unrelated to this topic, except that compiling with newer CPU features would likely make the performance comparison look worse under her scheme, since the vector registers are temporaries (caller-saved) in the calling convention.
The Cosmopolitan Libc memcpy() implementation is still binary compatible with the System V Application Binary Interface (x86_64 Processor Supplement). It gains its performance by restricting itself to a subset of the ABI, which the optional memcpy() macro is able to take advantage of. For example, the macro trick isn't even defined in strict ANSI mode.
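A rough sketch of the general idea (not Cosmopolitan's actual code; the symbol name and register choices below are made up): a macro can route the call through inline asm that promises the compiler a much narrower clobber list than a normal System V call would imply, so live values can stay in registers across it. The called routine then has to be written in assembly so it actually honors that promise.

    #include <stddef.h>

    #if defined(__GNUC__) && !defined(__STRICT_ANSI__)
    /* __memcpy_narrow is a hypothetical assembly routine that only
       touches rdi, rsi, rdx and rcx, rather than everything a normal
       System V function call is allowed to clobber. */
    #define memcpy(DST, SRC, N)                        \
      ({                                               \
        void *Dst_ = (DST);                            \
        const void *Src_ = (SRC);                      \
        size_t Size_ = (N);                            \
        void *Ret_ = Dst_;                             \
        asm("call __memcpy_narrow"                     \
            : "+D"(Dst_), "+S"(Src_), "+d"(Size_)      \
            :                                          \
            : "rcx", "memory", "cc");                  \
        Ret_;                                          \
      })
    #endif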
> I remain unconvinced that diverging from System-V
I'm afraid I'm not well-versed enough in the subject to follow the leap here. I thought I was just linking to an example of what (to me) appears to be manually doing register coloring optimizations (or something close to it). How does that lead to "diverging from System-V"?
(I'm not asking you to dive into the discussion you alluded to; I'll take your word for it that it's a complicated one. I'm just asking you to clarify what the discussion even is.)
I assume the parent meant the System V Application Binary Interface, AMD64 Architecture Processor Supplement, which defines the calling convention and, as part of that, how registers are used.
So unless it is OK to break the ABI (i.e. you are the only caller, so go wild), there won't be any register coloring optimization across exported symbols.
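A small example of what that means in practice: under the System V AMD64 ABI all vector registers are call-clobbered, so any vector value that is live across a call to an exported function has to be spilled to the stack and reloaded, even if the callee never touches those registers. (Illustrative sketch; compile with AVX enabled, e.g. -mavx.)

    #include <immintrin.h>

    void some_exported_function(void);  /* opaque to the compiler */

    __m256 keep_across_call(__m256 v) {
      /* v lives in a ymm register, but every ymm register is a
         temporary under the ABI, so the compiler must store v to the
         stack before this call and reload it afterwards. */
      some_exported_function();
      return _mm256_add_ps(v, v);
    }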
From experience, using `-march=native` is not a magical release-the-brakes-and-speed-up switch. It works great for some projects and does almost nothing for others.
Including multiple code paths usually adds very little overhead, unless the dispatch happens in a tight loop or the branch predictor gets confused by it.
That's mostly true. For something like TensorFlow, -march=native can be like night and day. What I try to do with Cosmopolitan Libc is CPUID dispatching, so it doesn't matter what the -march flag is set to for crucial code paths like CRC32, which goes 51x faster if Westmere is available.
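For reference, here is a minimal sketch of what CPUID dispatching can look like in plain GNU C (my own illustration, not Cosmopolitan's actual code; the function names are made up): the CRC32C routine is picked once based on what the CPU reports, so the binary itself stays baseline x86_64.

    #include <stddef.h>
    #include <stdint.h>
    #include <nmmintrin.h>  /* _mm_crc32_u8 (SSE4.2, i.e. Westmere and later) */

    /* Bitwise fallback that runs on any x86_64 CPU (CRC-32C polynomial). */
    static uint32_t crc32c_portable(uint32_t crc, const void *p, size_t n) {
      const unsigned char *s = p;
      crc = ~crc;
      while (n--) {
        crc ^= *s++;
        for (int i = 0; i < 8; i++)
          crc = (crc >> 1) ^ (0x82f63b78 & -(crc & 1));
      }
      return ~crc;
    }

    /* Hardware CRC32C; only ever called if SSE4.2 is present. */
    __attribute__((target("sse4.2")))
    static uint32_t crc32c_sse42(uint32_t crc, const void *p, size_t n) {
      const unsigned char *s = p;
      crc = ~crc;
      while (n--) crc = _mm_crc32_u8(crc, *s++);
      return ~crc;
    }

    /* Resolved once at startup; every caller goes through the pointer,
       so the per-call dispatch cost is a single indirect call. */
    static uint32_t (*crc32c)(uint32_t, const void *, size_t);

    static void crc32c_init(void) {
      crc32c = __builtin_cpu_supports("sse4.2") ? crc32c_sse42
                                                : crc32c_portable;
    }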
Your libc being tuned to your processor model seems extremely unlikely unless you're compiling from source? I dunno, I only have one amd64 DVD of Ubuntu.
There is a Linux distro that builds packages with newer instruction sets and compiler flags for performance reasons. It's Clear Linux: https://clearlinux.org/