I've been experimenting with profile-guided optimization (PGO) and link-time optimization (LTO) on a variety of applications I've been developing.
A couple of things I've noticed:
1. PGO helps most when you have branching logic in 'hot' code (either inside a tight loop or as part of a dispatch function that's called repeatedly). If set up correctly, the conditionals can be reordered, or the code restructured, so that you get very good branch prediction. This often means that code that wasn't automatically vectorized before can now be reorganized to use SIMD instructions (usually at the cost of a worse penalty on a branch mispredict, though); see the sketch after this list.
2. Your dependent functions must be in the same compilation unit if you want to take advantage of many of the optimizations. Yes, interprocedural optimization across units via LTO is a thing, but it's not perfect. If you have a loop that calls a function that can be inlined, the compiler does a great job, with PGO, of keeping everything hot in the instruction cache. If you put those functions in another compilation unit, not so much.
3. If you want to use PGO, you had better use genuinely representative inputs. The performance of a project compiled with PGO will suffer badly if the training inputs don't reflect real workloads, because the compiler ends up optimizing the wrong paths.
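To make point 1 concrete, here's a minimal sketch (the Op enum, run(), and the hot/cold split are all made up for illustration). With a training profile showing that Op::Add dominates, the compiler can lay out the hot arm first, keep it densely packed, and in simple cases vectorize it; the flip side is a more expensive recovery when the rare arm is actually taken.

    // Interpreter-style dispatch: the per-element branch is the kind of
    // thing a profile can bias toward the common case.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    enum class Op : uint8_t { Add, Sub };

    int64_t run(const std::vector<Op>& ops, const std::vector<int32_t>& args) {
        int64_t acc = 0;
        for (std::size_t i = 0; i < ops.size(); ++i) {
            if (ops[i] == Op::Add)   // hot in the training data
                acc += args[i];
            else                     // cold: rarely taken
                acc -= args[i];
        }
        return acc;
    }

    int main() {
        const std::size_t n = 1 << 22;
        std::vector<Op> ops(n, Op::Add);
        std::vector<int32_t> args(n, 1);
        for (std::size_t i = 0; i < n; i += 64) ops[i] = Op::Sub;  // ~1.5% cold
        std::printf("%lld\n", static_cast<long long>(run(ops, args)));
    }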
I presume "the compiler" you're speaking of is some version of MSVC?
I only ask because while this article is about MSVC's PGO, other compilers also have PGO and LTO, and may not have the same issues with respect to optimizing across compilation units when LTO is enabled.
GCC and, I would assume, Clang both support -fprofile-generate and -fprofile-use.
But these are based on a trace of actual execution, not a function-use profile as implied by the article. Then again, the article may have been run through an ELI5 filter.
I haven't used GCC or Clang's PGO, but Intel's PGO lets you specify the kind of instrumentation used and generates a detailed profile. Clang also lets you use instrumented code paths to generate detailed profiles (including function-use statistics).
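For anyone who wants to try this, the two open-source flows look roughly like this (the file name and optimization level are illustrative; the profile flags and the llvm-profdata step are the standard ones):

    // pgo_demo.cpp -- toy program used only to show the build steps below.
    //
    // GCC (writes *.gcda counter files next to the objects when run):
    //   g++ -O2 -fprofile-generate pgo_demo.cpp -o pgo_demo
    //   ./pgo_demo                    # run on representative inputs
    //   g++ -O2 -fprofile-use pgo_demo.cpp -o pgo_demo
    //
    // Clang, instrumentation-based (writes default.profraw when run):
    //   clang++ -O2 -fprofile-instr-generate pgo_demo.cpp -o pgo_demo
    //   ./pgo_demo
    //   llvm-profdata merge -output=default.profdata default.profraw
    //   clang++ -O2 -fprofile-instr-use=default.profdata pgo_demo.cpp -o pgo_demo
    #include <cstdio>

    int main() {
        long acc = 0;
        for (long i = 0; i < 10000000; ++i)
            acc += (i % 7 == 0) ? i : 1;   // the branch the profile records
        std::printf("%ld\n", acc);
    }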
Actually, the Intel C++ compiler. I develop for Linux, so it's still one of the best options there. Every compiler will have more difficulty inlining across compilation units, even with LTO, than it does when the relevant functions are in a single unit.
If you have a hot loop that calls a function repeatedly, you will often find the compiler inlines everything more readily if it's all in one compilation unit. So yes, it can be worth it.
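A minimal sketch of what I mean (the names are made up): when scale() is defined in the same translation unit as the hot loop, the compiler can see its body and will almost certainly inline it; if only a declaration were visible here and the definition lived in another .cpp, getting the same result would require LTO and often still wouldn't happen.

    #include <cstdio>
    #include <vector>

    // Small helper defined in the same translation unit as the loop below,
    // so the compiler can inline it directly into the loop body.
    static inline double scale(double x) { return x * 1.5 + 0.25; }

    // With only `double scale(double);` here and the definition in another
    // .cpp, inlining would depend on LTO doing cross-module work.
    int main() {
        std::vector<double> v(1 << 20, 2.0);
        double sum = 0.0;
        for (double x : v)
            sum += scale(x);   // hot loop stays compact in the i-cache
        std::printf("%f\n", sum);
    }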
The article is a bit shallow. It would be nice to see:
1. What flavours of PGO optimizations were applied? What was the isolated impact of each one of them on both speed and size of the code?
2. What tests did they use to "guide" PGO?
3. How did they analyze the PGO results (beyond the three tests that were provided)? I assume they did not blindly trust it, so there should be some way of visualizing the differences between two binaries built from millions of lines of code.
Sort of unrelated, but this is one more example of responsive design gone wrong. The chart overflows off the page on an iPhone, and because the page locks scaling, I can't zoom out to view it.
Another example of a page that would be more readable had it stuck to plain old HTML.
And I can't open the image on mobile either. Blogger has become a cesspool of UX. What is this whole JavaScript mystery they are creating? It's a damn blog, for Christ's sake.
They'll probably rewrite it in WebAssembly too before finally ditching it.
JS engines spend a lot of their time in JITted code, which isn't helped by PGO compilation of the C++. (I mean, the JIT itself is of course a kind of PGO compilation...) I believe PGO is, or at least used to be, disabled for Firefox because it didn't help much but would occasionally cause crashes due to compiler bugs.
Running JIT code is only one aspect of the runtime. Parser, interpreter (if applicable), and GC benefit. C++ helper functions that get called from JIT code also benefit.