I would rather conclude that automatic vectorizers are still less than ideal, even though SIMD instructions have been widely available in commodity processors for 25 years now.
The language is actually great with SIMD, you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions for FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
When I tested this before (some years ago), adding "restrict" to the C pointers resulted in the same behavior, so the aliasing default seems to be the key issue.
Your compiler very likely assumes strict aliasing at whatever optimization level you're already using, so restrict usually won't do much. GCC does at O2, Keil (and Clang?) at O1.
“strict aliasing” refers to disallowing pointers of different types to alias. In the context of a function that accepts two arrays of floats, for example, it doesn’t apply.
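To make the distinction concrete, here's a minimal sketch (the function names are mine) of the case strict aliasing can't help with, and the restrict qualifier that can:

```c
#include <assert.h>
#include <stddef.h>

/* Both parameters are float*, so strict aliasing -- a rule about
 * pointers of *different* types -- tells the compiler nothing here:
 * it must assume dst and src may overlap. */
void scale(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}

/* restrict makes the promise strict aliasing can't: these arrays
 * don't overlap, so the loop can be vectorized without runtime
 * overlap checks. */
void scale_restrict(float *restrict dst, const float *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
```

With something like -O2 -fopt-info-vec, GCC will typically vectorize both, but the plain version gets a runtime aliasing check that the restrict version avoids.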
The article seems to conflate C and C++ in its very first sentence, though in reality these are very different languages. In C++ you can use std::array, and you can make your SIMD containers use restrict to make it more like Fortran.
Anyway, auto-vectorization will never compete with intrinsics for performance, so this all seems rather silly. Try expressing pshufb through an auto-vectorizer. Or a bunch of optimal loads/swizzles to outperform gather.
I believe it’s the ecosystem. For example, Python is a very high-level language; I'm not sure it even has a memory model. But it has libraries like NumPy which support all these vectorized exponents when processing long arrays of numbers.
Everything that's heavily vectorized in the Python ecosystem, including NumPy, achieves it using optimized backend code written in other languages, Fortran in particular. Python is only a thin veneer over those backends. In fact, you're constantly reminded to offload as much control flow as possible to those backends for the sake of performance, instead of doing things like looping in Python.

If that's enough to consider Python good at vectorization, I can just link high-performance Fortran libraries with C, handle the non-vector control flow from there, and call it a day. I guarantee you that this arrangement will be far more performant than anything Python can ever achieve. I have to strongly agree with the other commenter's observation that the memory model is the key to vector performance.
And of course Python has a memory model. While that model is not as well understood as C's, it is the key to Python's success and popularity as a generic programming language and as a numeric/scientific programming language. Python's memory model, unlike C's or Fortran's, isn't designed for high performance. It's designed for rich abstractions, high ergonomics and high interoperability with those performant languages. For most people, the processing time lost executing Python code is an acceptable tradeoff for the highly expressive control that Python gives them over the scheduling of lower-level operations.
NumPy has had no Fortran code for quite a long time now. SciPy still has some, and it is being rewritten. What you mention is the ufunc machinery underneath, which is all C. NumPy also has SIMD support (albeit limited to certain functions). BLAS is also C/assembly; only LAPACK is F77 (which is too much code to be rewritten).
This does not mean Fortran is bad (obligatory disclaimer for Fortran fans).
Also, Fortran has progressed immensely since F77 days. (I actually wrote some F77 code back in the day, when it was already ancient.) C has also progressed quite a bit since K&R days, but, to my mind, not nearly as much.
Right, the problem for SciPy developers, I believe, is that not enough of them know Fortran, whereas C knowledge is necessary to hack on CPython native extensions.
I wouldn't say Python is good _at_ vectorization so much as good _with_ it. It's also good with AI, web systems, etc, as a result of its "play well with others" philosophy, which pragmatically accepted from the start that no one language could be best for all purposes.
One of the best improvements to Python's memory model was extending it to allow sensible access to blobs of memory containing regular structures such as vectors and matrices, making it relatively easy to interface with and control already-available, well-tried, highly-optimised algorithms.
There's no language-level difference between int[] and int* in C. Indexing is just pointer arithmetic. No support for (non-jagged) multidimensional arrays either.
A proper array would at least know its length. That does not require a lot of runtime support, and most C code requires libc anyway for basic stuff like malloc().
This isn't true. int[] decays into an int* but is a different type.
An array member of a struct will have its data allocated contiguously with the other struct members (subject to compiler padding). An int* would point outside the struct. This works even when the length isn't known at compile time: you can declare the final member of a struct as an int[], which makes it a "flexible array member". Then you have to malloc the struct to be the sizeof the struct plus whatever size you want the array to be.
This is rather pedantic but the memory model for arrays is important here. If you're iterating over multiple struct members that are arrays, the fact the arrays are stored contiguously in memory is going to matter to SIMD.
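A small sketch of that layout difference (the struct names are mine):

```c
#include <assert.h>
#include <stddef.h>

/* The elements of an array member are stored inside the struct
 * itself, so consecutive array members form one contiguous block --
 * exactly what a SIMD loop wants to stream over. */
struct inline_arrays {
    int a[4];
    int b[4];   /* begins right where a ends (no padding needed here) */
};

/* With pointer members, only the pointers live in the struct; the
 * elements are wherever a and b happen to point, possibly far apart. */
struct pointer_members {
    int *a;
    int *b;
};
```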
>Then you have to malloc the struct to be the sizeof the struct + whatever size you want the array to be.
Ultra pedantic, but you actually do not need to alloc struct+size; you want something like:
// A flexible array member may have lower alignment requirements than the struct
// overall - in that case, it can overlap with the trailing padding of the rest
// of the struct, and a naive sizeof(base) + sizeof(flex) * count calculation
// will not take into account that overlap, and allocate more than is required.
#define SIZEOF_FLEX(type, member, count) MAX(sizeof(type), offsetof(type, member[count]))
I'm sure there's a more elegant way to calculate this, but I am no macro master.
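For what it's worth, here's a sketch of the macro in use (MAX and the helper names are mine; note that GCC and Clang accept a variable subscript inside offsetof as an extension):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define SIZEOF_FLEX(type, member, count) \
    MAX(sizeof(type), offsetof(type, member[count]))

struct flex {
    size_t len;
    int data[];              /* flexible array member */
};

/* Allocate a struct flex with room for n trailing ints. */
struct flex *flex_new(size_t n) {
    struct flex *f = malloc(SIZEOF_FLEX(struct flex, data, n));
    if (f) {
        f->len = n;
        memset(f->data, 0, n * sizeof f->data[0]);
    }
    return f;
}
```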
If using an int* over an int[] changes the memory layout of a struct, that necessarily implies a difference in the ABI.
As an example, C++'s std::array can be implemented as a struct containing a fixed-size C-style array, so it can be passed by value. This means that returning a std::array can be more performant than returning a std::vector (which might be implemented as an int* that is reallocated when you add too many elements), because a std::array's elements live inside the object itself and can come back on the stack, while a std::vector's elements always live in a separate heap allocation.
I was bit by this once when returning a std::unordered_map because I was doing a heap-allocation in a performance-critical section.
> There's no language-level difference between int[] and int* in C
sizeof(int[n]) vs sizeof(int*). Then there's 'int getint(int *i){return *i;}', which won't let me pass '&int[n]', but I can do 'int* = int[n]' and then pass '&int*'. It's not so much a hack as a very primitive implementation on top of pointers (alright, maybe it is a bit hacky.)
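In standard C terms, the visible differences look something like this (variable names are mine):

```c
#include <assert.h>

int arr[8];        /* a real array object: storage for 8 ints */
int *p = arr;      /* "decay": arr converts to &arr[0] here */

/* sizeof is one place the difference shows:
 *   sizeof arr == 8 * sizeof(int)   -- the whole array
 *   sizeof p   == sizeof(int *)     -- just a pointer
 * &arr is another: it has type int (*)[8], not int **, which is
 * why it can't be passed where a plain int * is expected. */
```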
> A proper array would at least know its length. It does not require a lot of runtime to do, and most C code require libc anyway for basic stuff like malloc().
Something like a fat pointer, 'typedef struct { long len; void *data; } MyArray;', then make a libMyArray to operate on MyArray objects. But now you have a libMyArray dependency. You also lose type information unless you add fields to MyArray with an enum, or have a MyArray for each type, which gets messy. Then you can't do shit like just use an identifier in a conditional, e.g. 'while(MyArray[n++])', unless you change how C works, which makes it notC (uh, maybe pick another name.)
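A minimal sketch of that fat-pointer idea, with the per-type wrappers the messiness comes from (all names mine):

```c
#include <assert.h>
#include <stdlib.h>

/* A "fat pointer": the length travels with the data.  Type
 * information is lost behind the void*, so you end up writing
 * one accessor family per element type (or adding a type tag). */
typedef struct {
    size_t len;
    void *data;
} MyArray;

MyArray myarray_new_int(size_t len) {
    MyArray a = { len, calloc(len, sizeof(int)) };
    return a;
}

void myarray_set_int(MyArray a, size_t i, int v) {
    assert(i < a.len);          /* the bounds check a raw int* can't do */
    ((int *)a.data)[i] = v;
}

int myarray_get_int(MyArray a, size_t i) {
    assert(i < a.len);
    return ((int *)a.data)[i];
}
```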
I feel that C is fine as it's an old machine oriented language. Leave it alone. I would look to reach for another tool when the primitive design of C makes life difficult.
Yeah, but if you need C versions of low-level SIMD algorithms from there, copy-paste the implementation and it will probably compile as C code. That’s why I mentioned the MIT license: copy-pasting from GPL libraries may or may not work depending on the project, but the MIT license is copy-paste friendly.