
I would rather conclude that automatic vectorizers are still less than ideal, despite SIMD instructions having been widely available in commodity processors for 25 years now.

The language is actually great with SIMD, you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions for FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...




Why does FORTRAN, which is older than C, have no trouble with auto-vectorization?

Mostly because of a very different memory model, of course. Arrays are first-class things, not hacks on top of pointers, and can't alias each other.


When I tested this before (some years ago), adding "restrict" to the C pointers resulted in the same behavior, so the aliasing default seems to be the key issue.


Your compiler very likely assumes strict aliasing at whatever optimization level you're already using, so restrict usually won't do much. GCC does at O2, Keil (and Clang?) at O1.


“Strict aliasing” refers to disallowing pointers of different types from aliasing. In the context of a function that accepts two arrays of floats, for example, it doesn’t apply.


That looks correct from playing around in godbolt.


But restrict says "I promise that these things don't overlap, on penalty of my program going to hell". Is that what's going on in Fortran?


yes, Fortran forbids referring to the same memory location by multiple names.


Nope. There’s lots of ways to alias storage in Fortran.


Sorry, I meant in the case of function parameters, though I'm far from a Fortran expert, so there may be exceptions.


The article seems to conflate C and C++ in its very first sentence, though in reality these are very different languages. In C++ you can use std::array, and you can make your SIMD containers use restrict to make it more like Fortran.

Anyway, auto-vectorization will never compete with intrinsics for performance, so this all seems rather silly. Try expressing pshufb with an auto-vectorizer. Or a bunch of optimal loads/swizzles to outperform gather.


I believe it’s the ecosystem. For example, Python is a very high-level language; I'm not sure it even has a memory model. But it has libraries like NumPy which support all these vectorized exponents when processing long arrays of numbers.


Everything that's heavily vectorized in the Python ecosystem, including NumPy, achieves it using optimized backend code written in other languages, Fortran in particular. Python is only a thin veneer over those backends. In fact, you're constantly reminded to offload the control flow as much as possible to those backends for the sake of performance, instead of doing things like looping in Python. If that's enough to consider Python to be good at vectorization, I can just link high-performance Fortran libraries with C, handle non-vector control flow from there, and call it a day. I guarantee you that this arrangement will be far more performant than what Python can ever achieve. I have to strongly agree with the other commenter's observation that the memory model is the key to vector performance.

And of course Python has a memory model. While that model is not as well understood as C's, it is the key to Python's success and popularity as a generic programming language and as a numeric/scientific programming language. Python's memory model, unlike C's or Fortran's, isn't designed for high performance. It's designed for rich abstractions, high ergonomics, and high interoperability with those performant languages. For most people, the processing time lost executing Python code is an acceptable tradeoff for the highly expressive control that Python gives them over the scheduling of lower-level operations.


NumPy has had no Fortran code for quite a long time now. SciPy does, and it is being rewritten. What you mention is the ufunc machinery underneath, which is all C. NumPy also has SIMD support (albeit limited to certain functions). BLAS is also C/assembly, but only LAPACK is F77 (which is too much code to be rewritten).

This does not mean Fortran is bad (obligatory disclaimer for Fortran fans).


Also, Fortran has progressed immensely since F77 days. (I actually wrote some F77 code back in the day, when it was already ancient.) C has also progressed quite a bit since K&R days, but, to my mind, not nearly as much.


Right, the problem for SciPy developers, I believe, is that not enough of them know Fortran, whereas C knowledge is necessary to hack on CPython native extensions.


Oh, my God, I actually wrote some code in Fortran II back in 1969! It's certainly a much better language now!


I wouldn't say Python is good _at_ vectorization so much as good _with_ it. It's also good with AI, web systems, etc, as a result of its "play well with others" philosophy, which pragmatically accepted from the start that no one language could be best for all purposes.

One of the best improvements to Python's memory model was extending it to allow sensible access to blobs of memory containing regular structures such as vectors and matrices, making it relatively easy to interface with and control already-available, well-tried, highly-optimised algorithms.


> Python is only a thin veneer over those backends.

There is an interesting writeup by dvyukov that shows the potential cost of "thin veneers" :) https://github.com/dvyukov/perf-load


Any Python libraries that offer any sort of real performance are written in C.


As well they should be.


> Arrays are first-class things, not hacks on top of pointers, and can't alias each other.

"Hacks" feels like the wrong term to use here. Comparing managed objects surrounded by runtime to memory isn't a fair comparison.


There's no language-level difference between int[] and int* in C. Indexing is just pointer arithmetic. No support for (non-jagged) multidimensional arrays either.

A proper array would at least know its length. That does not require a lot of runtime, and most C code requires libc anyway for basic stuff like malloc().


This isn't true. int[] decays into an int* but is a different type.

An array member of a struct will have its data allocated contiguously with the other struct members (subject to compiler padding). An int* would point outside the struct. This works even for variable-length data: you can declare the final member of a struct as an int[], which makes it a "flexible array member". Then you have to malloc the struct to be the sizeof the struct + whatever size you want the array to be.

https://en.wikipedia.org/wiki/Flexible_array_member

This is used in struct sockaddr for generic APIs where the actual struct can be of different sizes.

https://lwn.net/Articles/997094/

This is rather pedantic but the memory model for arrays is important here. If you're iterating over multiple struct members that are arrays, the fact the arrays are stored contiguously in memory is going to matter to SIMD.


>Then you have to malloc the struct to be the sizeof the struct + whatever size you want the array to be.

Ultra pedantic, but you actually do not need to alloc struct+size; you want something like:

    // A flexible array member may have lower alignment requirements than the struct
    // overall - in that case, it can overlap with the trailing padding of the rest
    // of the struct, and a naive sizeof(base) + sizeof(flex) * count calculation
    // will not take into account that overlap, and allocate more than is required.
    // (MAX is not standard; define it, e.g.:)
    #define MAX(a, b) ((a) > (b) ? (a) : (b))
    #define SIZEOF_FLEX(type, member, count) MAX(sizeof(type), offsetof(type, member[count]))
I'm sure there's a more elegant way to calculate this, but I am no macro master.


The C language spec isn't what's important here; it's that the ABI doesn't know the difference.


If using an int* over an int[] changes the memory layout of a struct, that necessarily implies a difference in the ABI.

As an example, C++'s std::array can be implemented as a struct containing a fixed-size C-style array. This can be passed by value. This means that returning a std::array can be more performant than returning a std::vector (which might be implemented as an int* that is reallocated when you add too many elements), because a std::array is returned on the stack while a std::vector's elements live on the heap.

I was bit by this once when returning a std::unordered_map because I was doing a heap-allocation in a performance-critical section.


The memory layout is the ABI, not the language spec.


The language spec for C constrains the memory layout of a struct. From 6.7.3.2.7 of the draft spec:

> a structure is a type consisting of a sequence of members, whose storage is allocated in an ordered sequence, (emphasis added)

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf


This is fair, thanks!


> There's no language-level difference between int[] and int* in C

sizeof(int[n]) vs sizeof(int*). Then there's 'int getint(int *I){return *I;}', which won't let me pass '&int[n]', but I can do 'int* = int[n]' and then pass &int*. It's not so much a hack as a very primitive implementation on top of pointers (alright, maybe it is a bit hacky).

> A proper array would at least know its length. It does not require a lot of runtime to do, and most C code require libc anyway for basic stuff like malloc().

Something like a fat pointer 'typedef struct {long len; void *data;} MyArray;' and then a libMyArray to operate on MyArray objects. But now you have a libMyArray dependency. You also lose type information unless you add fields to MyArray with an enum, or have a MyArray for each type, which gets messy. Then you can't do things like just use an identifier in a conditional, e.g. 'while(MyArray[n++])', unless you change how C works, which makes it notC (uh, maybe pick another name.)

I feel that C is fine as it's an old machine oriented language. Leave it alone. I would look to reach for another tool when the primitive design of C makes life difficult.


Oh crap, markdown strikes again and ate the extra asterisk in getint(**I)


Fortran is not a managed language, and has a runtime about as big as C's.


I've had great success with autovectorization in Rust. Majority of non-branching loops seem to consistently generate great assembly.


None of that is ISO C though, and any language can have similar extensions, no need to be stuck with C for that.


isn't DirectXMath C++, not C?


Yes, it is in C++.


Yeah, but if you need C versions of low-level SIMD algorithms from there, copy-paste the implementation and it will probably compile as C code. That’s why I mentioned the MIT license: copy-pasting from GPL libraries may or may not work depending on the project, but the MIT license is copy-paste friendly.



