
> single-threaded nonvectorized C wastes on the order of 97% of your computer's computational power

Can you elaborate on what this means exactly? For example, is there some reasonable C code that runs 33 times slower than some other ideal code? In what sense are we wasting 97% of our computer's computational power?




A good example of getting a ~3000x speedup from naive matrix multiplication in C here (slides 20 onward): https://ocw.mit.edu/courses/electrical-engineering-and-compu...

Includes a 9-level nested for loop, which is always great to see.


Thank you very much for posting this!

Roughly that 3000× is 18× from multithreading, 3× from SIMD instructions, 15× from tuning access patterns for locality of reference, and 3× for turning on compiler optimization options. This is a really great slide deck!

I was assuming "single-threaded nonvectorized C" already had compiler optimization turned on and locality of reference taken into account. As the slide deck notes, you can get some vectorization out of your compiler — but usually it requires thinking like a FORTRAN programmer.

So I think in this case reasonable C code runs about 54× slower than Leiserson's final code. However, you could probably get a bigger speedup in this particular case with GPGPU. Other cases may be more difficult to get a GPU speedup, but get a bigger SIMD speedup. So I think my 97% is generally in the ballpark.

A big problem is that we can't apply this level of human effort to optimizing every subroutine. We need better languages.


That's why you have people working on Halide, Taichi, DaCe, Tiramisu.

- https://halide-lang.org/

- http://taichi.graphics/

- http://spcl.inf.ethz.ch/Research/DAPP/

- http://tiramisu-compiler.org/

This way you can have a researcher implement the algorithm (say, bilinear filtering) and an HPC expert who tunes it with parallelism, SIMD, and tiling.

I wrote an overview of most DSLs for high performance or image processing in this issue: https://github.com/mratsim/Arraymancer/issues/347#issuecomme...


This is great! Which of these do you think could be extended to general-purpose programming without the HPC expert? Taichi and DAPP seem to be aimed at that goal, but you seem to be implying they don't reach it yet?


You can use them without the HPC expert. Halide, for example, has a good autotuner and has been used by Google and Adobe to create image filters for mobile devices.


Thank you!


I think it's referring to SIMD instructions [0].

I'm still kind of a newb myself, but from what I understand these are special CPU instructions that allow you to execute the same operation in parallel against multiple data points. This allows you to eke out a lot more performance. It's how simdjson [1] is able to outperform all other C++ JSON parsers.

[0] https://en.wikipedia.org/wiki/SIMD

[1] https://github.com/simdjson/simdjson


8 cores times 4 SIMD lanes is a 32× speedup; that's where "97%" comes from, as explained in the note I linked to.

It's pretty variable: some things we haven't figured out how to speed up with SIMD; sometimes we have a GPU; sometimes we can get 8 or 16 SIMD lanes out of SSE3 or AVX, or 32 of them out of AVX2; sometimes you only have four cores; and sometimes make -j is enough parallelism to win back the 8× factor from the cores (though not the SIMD and GPGPU factors). But I think 97% is a good ballpark estimate in general.


A realistic speedup to expect from vectorization is probably closer to 10×. Other overheads start to dominate once you've optimized your inner loops.

It's also not reasonable to expect to vectorize everything; for example a web server is unlikely to benefit from vectorization.


Just to clarify, I was only estimating a speedup of 4× from vectorization, while the other 8× comes from multithreading.

Fifteen years ago we thought regular expression matching and compilers were unlikely to benefit from vectorization, but now we have Hyperscan and Co-dfns, so it turns out they do. So I think it's likely that we will figure out ways to do a wider variety of computations in vectorized ways, now that the rewards are so great.


A great talk which gets you thinking in this mindset is "Data-Oriented Design and C++" by Mike Acton https://www.youtube.com/watch?v=rX0ItVEVjHc

As an example, if you are checking a boolean flag (1 bit) on an object, and it ends up being a cache miss (and x86_64 cache line size is 64 bytes), then your computer just went through all the expense of pulling in 512 bits from RAM yet it only used 1 of them. You are achieving 0.2% of the machine's possible throughput.


I imagine if you could make the most of the vector instruction set in your code (where instructions operate on a vector of data at once instead of one element at a time), you'd get a huge performance boost for "free". The GP seems to be working on a VM that lets you do that (a lot of it was flying over my head though; need some coffee).



