I've been experimenting with profile-guided optimization (PGO) and link-time optimization (LTO) on a variety of applications I've been developing.
A couple of things I've noticed:
1. PGO helps most when you have branching logic in 'hot' code (either inside a tight loop or as part of a dispatch function that's called repeatedly). If set up correctly, the conditionals can be reordered, or the code restructured, so that you get very good branch prediction. This often means that code that wasn't automatically vectorized before can now be reorganized to use SIMD instructions (usually at the cost of a worse penalty on a branch mispredict, though); see the sketch after this list.
2. Your dependent functions must be in the same compilation unit if you want to take advantage of many of the optimizations. Yes, interprocedural optimization across units via LTO is a thing, but it's not perfect. If you have a loop that calls a function that can be inlined, the compiler does a great job, with PGO, of keeping everything hot in the instruction cache. If you put those functions in another compilation unit, not so much.
3. If you want to use PGO, you had better use genuinely representative inputs. The performance of a project compiled with PGO will suffer badly if the training inputs don't reflect real workloads, because the compiler ends up optimizing the wrong paths.
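To make point 1 concrete, here's a minimal sketch (the Op enum, run(), and the hot/cold split are all made up for illustration). With a training profile showing that Op::Add dominates, the compiler can lay out the hot arm first, keep it densely packed, and in simple cases vectorize it; the flip side is a more expensive recovery when the rare arm is actually taken.

    // Interpreter-style dispatch: the per-element branch is the kind of
    // thing a profile can bias toward the common case.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    enum class Op : uint8_t { Add, Sub };

    int64_t run(const std::vector<Op>& ops, const std::vector<int32_t>& args) {
        int64_t acc = 0;
        for (std::size_t i = 0; i < ops.size(); ++i) {
            if (ops[i] == Op::Add)   // hot in the training data
                acc += args[i];
            else                     // cold: rarely taken
                acc -= args[i];
        }
        return acc;
    }

    int main() {
        const std::size_t n = 1 << 22;
        std::vector<Op> ops(n, Op::Add);
        std::vector<int32_t> args(n, 1);
        for (std::size_t i = 0; i < n; i += 64) ops[i] = Op::Sub;  // ~1.5% cold
        std::printf("%lld\n", static_cast<long long>(run(ops, args)));
    }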
I presume "the compiler" you're speaking of is some version of MSVC?
I only ask because while this article is about MSVC's PGO, other compilers also have PGO and LTO, and may not have the same issues with respect to optimizing across compilation units when LTO is enabled.
GCC and, I would assume, Clang both support -fprofile-generate and -fprofile-use.
But these are based on a trace of actual execution, not a function-use profile as implied by the article. Then again, the article may have been run through an ELI5 filter.
I haven't used GCC or Clang's PGO, but Intel's PGO lets you specify the kind of instrumentation used and generates a detailed profile. Clang also lets you use instrumented code paths to generate detailed profiles (including function-use statistics).
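For anyone who wants to try this, the two open-source flows look roughly like this (the file name and optimization level are illustrative; the profile flags and the llvm-profdata step are the standard ones):

    // pgo_demo.cpp -- toy program used only to show the build steps below.
    //
    // GCC (writes *.gcda counter files next to the objects when run):
    //   g++ -O2 -fprofile-generate pgo_demo.cpp -o pgo_demo
    //   ./pgo_demo                    # run on representative inputs
    //   g++ -O2 -fprofile-use pgo_demo.cpp -o pgo_demo
    //
    // Clang, instrumentation-based (writes default.profraw when run):
    //   clang++ -O2 -fprofile-instr-generate pgo_demo.cpp -o pgo_demo
    //   ./pgo_demo
    //   llvm-profdata merge -output=default.profdata default.profraw
    //   clang++ -O2 -fprofile-instr-use=default.profdata pgo_demo.cpp -o pgo_demo
    #include <cstdio>

    int main() {
        long acc = 0;
        for (long i = 0; i < 10000000; ++i)
            acc += (i % 7 == 0) ? i : 1;   // the branch the profile records
        std::printf("%ld\n", acc);
    }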
Actually, the Intel C++ compiler. I develop for Linux, so it's still one of the best options there. Every compiler will have more difficulty inlining across compilation units, even with LTO, than it does when the relevant functions are in a single unit.
If you have a hot loop that calls a function repeatedly, you will often find the compiler inlines everything more readily if it's all in one compilation unit. So yes, it can be worth it.
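A minimal sketch of what I mean (the names are made up): when scale() is defined in the same translation unit as the hot loop, the compiler can see its body and will almost certainly inline it; if only a declaration were visible here and the definition lived in another .cpp, getting the same result would require LTO and often still wouldn't happen.

    #include <cstdio>
    #include <vector>

    // Small helper defined in the same translation unit as the loop below,
    // so the compiler can inline it directly into the loop body.
    static inline double scale(double x) { return x * 1.5 + 0.25; }

    // With only `double scale(double);` here and the definition in another
    // .cpp, inlining would depend on LTO doing cross-module work.
    int main() {
        std::vector<double> v(1 << 20, 2.0);
        double sum = 0.0;
        for (double x : v)
            sum += scale(x);   // hot loop stays compact in the i-cache
        std::printf("%f\n", sum);
    }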
The article is a bit shallow. It would be nice to see:
1. What flavours of PGO optimizations were applied? What was the isolated impact of each one of them on both speed and size of the code?
2. What tests did they use to "guide" PGO?
3. How did they analyze the PGO results (beyond the three tests that were provided)? I assume they did not blindly trust it, so there should be some way of visualizing the differences between two binaries built from millions of lines of code.
Sort of unrelated, but this is one more example of responsive design gone wrong. The chart overflows off the page on an iPhone, and because the page locks scaling, I can't zoom out to view it.
Another example of a page that would be more readable had it stuck to plain old HTML.
And I can't open the image on mobile either. Blogger has become a cesspool of UX. What is this whole JavaScript mystery they are creating? It's a damn blog, for Christ's sake.
They'll probably rewrite it in WebAssembly too before finally ditching it.
JS engines spend a lot of their time in JITted code, which isn't helped by PGO compilation of the C++. (I mean, the JIT itself is of course a kind of PGO compilation...) I believe PGO is, or at least used to be, disabled for Firefox because it didn't help much but would occasionally cause crashes due to compiler bugs.
Running JIT code is only one aspect of the runtime. Parser, interpreter (if applicable), and GC benefit. C++ helper functions that get called from JIT code also benefit.