This sounds interesting, but the documentation is really sparse, basically only how to turn it on. Am I supposed to be thinking of this a bit like Intel’s Parallel Studio’s Advisor tools for analysing threading and vectorisation? How do I tell what it’s doing with e.g. adding OpenMP calls since it seems to insert them between parsing and object code without oversight?
Basically, how easy is it, and how do I approach using this, as a non domain-expert?
The best resources on understanding Polly are probably various LLVM developer conference talks (it’s come up at nearly every one since at least 2010 or so).
But just like not needing to worry too much about how GCC does this unless you’re a compiler engineer, I’d say just either let it try or not.
My understanding is that this is only about memory access patterns, and has nothing to do with multithreading. The basic example being matrix multiplications C=A∙B. If the matrices are too big to fit in the L2/L3 cache, you have to think about how to optimize memory access. So instead of having the compiler just generate instructions to read/write memory, you want it to interleave instructions to prefetch memory, ahead of time. But how do you do that? What is the best point in time to prefetch?
Another optimization is locality: The basic algorithm is to loop over i and j, then multiply A's row i with B's column j. You do that with another loop over k to compute the sum over Aᵢₖ∙Bₖⱼ, then store the result in Cᵢⱼ. But what if one row or column is already too big for the cache? Also, once you invested into loading the data into the cache, you want to make the best use of it. You don't want to load it again if you can avoid it. So what you do is you limit the loop ranges for i, j, and k such that the overall memory accessed fits into the cache. But what are the optimal loop sizes?
The answers depends on the CPU architecture (cache sizes) and probably also the memory that you're using.
I don't know about this specific example, but polyhedral optimization generally, at least, is useful for parallelization of such loops, e.g. http://pluto-compiler.sourceforge.net/. However, serial GEMM is typically most important in HPC, with parallelization at a higher level.
(GEMM shouldn't be main memory limited, if that's what you mean.)
I'd intended to submit this after seeing the results recently while referring to it apropos BLAS implementation. I've quoted matmul here as the classic example of compiler optimization failures and won't again, at least until I've understood these results. I couldn't see how to reproduce the results and it's not clear what the various matrix sizes are, related to the dips in MKL performance that I don't understand where polly wins. Also, it's not clear what the POWER results are, as BLIS doesn't support it other than via the very slow generic configuration; I haven't got round to enquiring.
This sounds very good, just what's been missing from compiler optimizations. Anyone know if this is alive, or has it been merged to LLVM? The news seem to end a year ago.
When I first looked at this, there was a comment stating that (from memory) despite ~10 years of research and numerous PhDs, papers and conference talks, polly just doesn't work for production code. That comment is now gone.
Was it retracted because it was inaccurate? Seems to me like an important part of the conversation, whether true or not.
Yes, but likewise for the classical loop optimizations that this is aiming to improve on. They clearly do work -- to a lesser extent? -- for production code, just not the sort that many commentators give the impression of thinking is all that matters. Linear algebra does crop up all over the place.
(In contrast, I would say gcc's -floop-nest-optimize "experimental" implementation doesn't work on matmul, at least.)
True.
Almost everyone who does matrix/vector operations in production code uses libraries though. That is why it is tricky to see results from these kinds of projects on production code.
I looked up the Polly reference concerning discussion of BLIS missing optimized implementations, and GCC and clang apparently failing on BLIS-type loops. (It mentioned examples of people not using the relevant libraries in national labs, not that I'd defend that.) It might also have been other types of kernel that don't exist for ARM or POWER, say. Surely it would be good to have an effective general-purpose tool (compiler) that could replace labourious hand-optimization or building special-purpose tools.
Basically, how easy is it, and how do I approach using this, as a non domain-expert?