Could someone please explain the differences and advantages of tiptop over a program like htop for analysing performance in everyday operations (if any)?
I think they're not really in the same class. The article states,
> software developers identify critical regions of their applications and evaluate design choices to select the best performing implementation.
Which is what I'd use metrics like cache misses and branch prediction misses for: really fine-tuning some section of the code that needs to execute lightning fast.
htop, on the other hand, gives a more high-level overview. Like, "who's eating all the RAM?". (Or, perhaps more often, "who do I need to kill?")
Once you've identified that a process has a lot of branch mispredictions (or other counter events), how do you identify where those mispredictions are coming from?
Use a sampling profiler with support for hardware counters (vtune, instruments, zoom, etc). Configure it to take a stack trace every Nth* event on the counter of interest, then run your program under the profiler. Look at the trace with the tree "inverted" or "bottom up" to see exactly which functions are incurring the counter events. That alone isn't terribly useful, so also look at the functions themselves for a line-by-line or instruction-by-instruction breakdown of where the events occurred.
(*) What should N be? Depends on how frequently the counter is getting hit. N between 1000 and 1000000 is pretty typical. Choosing prime N is a good idea.
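In case it helps, here's roughly what that workflow looks like with plain Linux perf instead of the tools above (program name and the 100003 sample period are just illustrative):

    # sample a call stack every 100003rd branch-miss event (a prime N); ./myprogram is a placeholder
    perf record -e branch-misses -c 100003 -g ./myprogram
    # see which functions the samples land in
    perf report
    # per-instruction breakdown of where the events hit
    perf annotate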
You can use {call,cache}grind for that, but sometimes it's a bit impractical if the software is big.
An approach I sometimes use is to throw a generic profiler at the program, make the program do something that is not fast enough (and that would need to be optimized), look at the profile to identify the function(s) that are too slow, extract them from the big code base, get a good set of input data, and run that with {call,cache}grind.
Then you can use the awesome kcachegrind to look at the data (where you can look at different cache misses, branch mispredictions, etc.).
Of course, most of the time, simply running under the profiler shows a non-optimal algorithm or terrible allocation patterns, so you don't have to do all that, but I've found this approach useful when writing inner loops for numeric computations (and of course, extracting the code is rather easy for this kind of stuff).
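For reference, the last step looks something like this (./harness stands in for the extracted code, and the actual output file name will differ):

    # simulate caches and branch prediction while running the extracted harness
    valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./harness
    # browse misses/mispredictions per function and per line
    kcachegrind callgrind.out.<pid>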
That's cool. Right now I'm at the stage of "I know how to profile, I can identify algorithmic problems and change them for better ones, I can find allocation issues and use buffers/pools".
But lower-level things like "I know which branch mispredictions kill me" or "I know which access patterns result in page faults" are out of my reach. I mean, they were before asking the question and getting answers like yours.
It monitors hardware counters, so it is more similar to 'perf top' than 'htop'/'top', but it can also monitor multiple counters at once.
As a disadvantage, it can't show the function-level counters or assembly annotation that 'perf top' can.
So perhaps you could use 'tiptop' to get a general view of what might be slow, and then drill down using 'perf top'.
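Concretely, that two-step flow might look like this (the event name is just an example):

    # system-wide, per-process view of the hardware counters
    tiptop
    # then drill down to functions/assembly for one suspicious event
    perf top -e cache-misses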
Slightly off-topic (maybe), but can this or another readily available tool be used to track counters during startup/shutdown and provide a log of it which can then be imported into some viewer? (Think something like Windows Performance Monitor/Event Tracing and the like.)
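One possible sketch (not tiptop itself): perf stat has an interval mode that can dump counters periodically to a separator-delimited file you could import into a spreadsheet or plotting tool; the event list, interval, and program name below are just examples.

    # log the counters every 1000 ms in comma-separated form to counters.csv
    perf stat -I 1000 -x, -o counters.csv \
        -e cycles,instructions,cache-misses,branch-misses -- ./myprogram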