In this space, it's more that you don't want to lose all of the progress made in these older libraries. Coming up with a new ecosystem means an immediate race to parity with the older one, and a lot of smart people were involved in building it. (So you aren't competing with a single idea or implementation, but with an ecosystem of them.)
Hats off for taking a good shot at it. But don't be surprised to see reluctance to move.
There's very promising work on upgrading older libraries/legacy code with a technique called verified lifting. The technique has been used successfully at Adobe to automatically lift image processing code written in C++ to use Halide. The technique also guarantees semantic equivalence so users can trust the lifted code.
Before reading the paper: is "verified lifting" kind of like "deterministic decompilation" — where you ensure, with every modification to the generated HLL source, that it continues to compile back into the original LL object code?
(See e.g. the Mario 64 decompilation, https://github.com/n64decomp/sm64, whose reverse-engineering process at all times kept a source tree that could build the original byte-identical ROMs of the games, despite being increasingly [manually] rewritten in an HLL.)
I'm not an expert. But with current CPUs, doesn't HPC critically depend on memory alignment? (For two reasons: Avoiding cache misses, and enabling SIMD.) Sure, algorithms matter, probably more than memory alignment. But when you've got the algorithm perfect and you still need more speed, you're probably going to need to be able to tune memory alignment.
C lets you control memory alignment, in a way that very few other languages do. Until the competitors in HPC catch up to that, C is going to continue to outrun those other languages.
(Now, for doing the tweaking to get the most out of the algorithm, do I want to do that in C? No. I want to do it in something that makes me care about fewer details while I experiment.)
Nothing C does is even remotely special anymore. If anything, it's total crap by modern standards because it makes performant abstractions harder. C code uses a lot of linked lists because linked lists are easy.

C isn't the Lingua Franca of fast software anymore. The reason why C programs can sometimes be faster today, however, is not an intrinsic property of the language but rather its inability to allow incompetent programmers to hide their bad data structures.

C also encourages programming styles that are harder to optimize, so some loop optimizations are no longer possible.
> C code uses a lot of linked lists because linked lists are easy.
Strong disagree. I don't see why a competent programmer would hand-code a linked list, which is more of a hassle to write than a simple array (e.g. a buffer pointer plus capacity and/or length fields), unless the linked list actually makes sense for performance.
I'm not saying you're duty-bound to do it; rather, I'm giving an example of something that, in my experience, C encourages because it's the path of least friction.
Try writing a type in C that can automatically switch its layout from AOS to SOA; you just can't do it.
Well, picking an invalid example (C encourages the use of linked lists in situations where they aren't a good idea) and a very questionable example (automatic conversion between SOA/AOS representations is still mostly at an experimental stage rather than an established feature of performance programming in 2022) is not a good way to support a claim that isn't empirically evident.
The only language I know of for sure that will do it for you (as in, you don't have to write the type yourself) was Jai a while back (I'm told Blow removed that feature).
The only language I've actually done it in is D. It's probably doable in many other nu-C languages these days, but D at the very least can make it basically seamless, as long as you do some try-and-break-things testing to make sure nothing relies on saving pointers when it shouldn't. This obviously constrains the definition of "automatic" ;)
I don't have my implementation to hand because it grew out of a patch that failed due to the aforementioned pointer-saving, in code that I'm not paid enough to refactor. But here's one someone else made: https://github.com/nordlow/phobos-next/blob/master/src/nxt/s... (there's another one in that repository too). I've never used those particular implementations, but they're both by people I know, so hopefully they're not too bad.
A more subtle thing, which I haven't used in anger but would like to try at some point, is to use programmer annotations (probably in the form of user-defined attributes) to group things so they're stored such that temporal locality <=> spatial locality. I've never gotten around to actually doing it, though.
There are some arrays of structs in an old bit of the D compiler that are roughly the size of a cacheline, and aren't accessed particularly uniformly. I profiled this and found that something like 75% of all LLC misses (hitting DRAM) were due to 2 particularly miserable lines... inside an O(n^2) algorithm.
> C isn't the Lingua Franca of fast software anymore.
What is it, then?
> The reason why C programs can sometimes be faster today, however, is not an intrinsic property of the language but rather its inability to allow incompetent programmers to hide their bad data structures.
I'm not sure I can parse that properly.

What's wrong with not being able to hide bad data structures? And how does this make C faster?
You're correct that low-level details very much matter in the HPC space. The types of optimizations described in this paper are exactly that! To see what I mean, check out Halide's cool visualizations/documentation (this paper compares itself to Halide): https://halide-lang.org/tutorials/tutorial_lesson_05_schedul...
Memory alignment is important, and languages that let you control it and/or align data automatically under sensible assumptions do have an edge here.
Just a random example: I wrote a two-line Python function using the Numba JIT, because I knew that some Python behavior was going to make it less performant, and it's a high-performance kernel, so we needed it fast (with a lower memory footprint, too). Compiling with the Numba JIT is a no-brainer because I just add one more line; about 10 seconds of effort.

But the codebase I'm merging into has a policy against Numba (reasonably, as we're targeting an HPC platform where Numba has a performance problem related to oversubscription if not set up carefully). So I ended up rewriting it in C++ and wrapping it with pybind11. The result is around 30% faster.

Since the algorithm is entirely trivial, the only explanation I have is exactly memory alignment: I can control that in C++, but in the Numba JIT there's no way to either guarantee that the allocated array is aligned or tell the compiler to assume that it is. (The 30% number is also in line with textbook examples.)