
Can you elaborate? I have a side project (1) where all the profilers I've used give a very muddled picture, so I'm very interested in the question of what slows down code on a modern "big" CPU with wide dispatch, a few kB of decoded-op buffer, and a lot of OoO hardware.

(1) It encodes and decodes a protocol from a potentially untrusted source, so there is obviously a lot of waiting for previous results. That much is clear; however, I expected profilers to show me some causal link between the serial nature and the slow execution, and they don't. I have tried perf, Valgrind's Callgrind and AMD μProf (because I have a Ryzen CPU in both of my main private computers). I'm not sure if the tools suck, my test cases suck, or I just don't know how to interpret the tools' results - my main problem is that the assignment of cost to lines of code seems highly unreliable. Maybe the stupid things these profilers are designed to catch (most of optimization is about not doing stupid things; after that it gets properly hard) aren't the stupid or unavoidable things my code is doing.
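To make the shape of the problem concrete, here is a simplified, made-up sketch (not my actual code) of the kind of decode loop I mean - the position of the next field is only known after the previous field has been decoded, so the loads form one long serial dependency chain:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical decoder: each field starts with a tag byte that encodes
     * its length, so the position of field N+1 depends on fully decoding
     * field N - one long serial chain of dependent loads. */
    size_t decode_stream(const uint8_t *buf, size_t len,
                         uint64_t *out, size_t max_out)
    {
        size_t pos = 0, n = 0;
        while (pos < len && n < max_out) {
            uint8_t tag = buf[pos];
            size_t field_len = (size_t)(tag & 0x07) + 1;   /* 1..8 payload bytes */
            if (pos + 1 + field_len > len)
                break;                                     /* untrusted input: bounds check */
            uint64_t value = 0;
            for (size_t i = 0; i < field_len; i++)
                value = (value << 8) | buf[pos + 1 + i];
            out[n++] = value;
            pos += 1 + field_len;                          /* next position depends on this field */
        }
        return n;
    }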




KDAB's hotspot is quite nice for analyzing perf recordings, and I suggest sampling on stall-cycle events and on "cycles with less than X uops dispatched" events. Yes, attributing cost to lines of code is hard for optimized compiler output, but hotspot can (in the continuous release / `hotspot-git` AUR package) attribute it to the disassembly of a function.


One way of approaching the problem is with the top-down cycle accounting methodology (TMA).

You can find a nice spreadsheet from Intel here:

https://download.01.org/perfmon/TMA_Metrics.xlsx

Also, using Intel tools gives you much clearer answers. There are a bunch of scripts on top of perf that try to automate this; you can find them in the pmu-tools git repo.
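If you want to see the level-1 math behind it, for a 4-wide Intel core it boils down to roughly the following (a sketch using counter names from the spreadsheet above; the exact events and the slot width vary by microarchitecture, so check the sheet for your CPU - pmu-tools' toplev computes all of this for you):

    /* Sketch of the level-1 top-down (TMA) breakdown for a 4-issue Intel
     * core, using the counter names from the spreadsheet above. Slot width
     * and event names differ between microarchitectures, so treat this as
     * illustrative only. */
    struct tma_level1 {
        double frontend_bound, bad_speculation, retiring, backend_bound;
    };

    static struct tma_level1
    tma_level1_breakdown(double clk_unhalted,           /* CPU_CLK_UNHALTED.THREAD */
                         double idq_uops_not_delivered, /* IDQ_UOPS_NOT_DELIVERED.CORE */
                         double uops_issued,            /* UOPS_ISSUED.ANY */
                         double uops_retired_slots,     /* UOPS_RETIRED.RETIRE_SLOTS */
                         double recovery_cycles)        /* INT_MISC.RECOVERY_CYCLES */
    {
        double slots = 4.0 * clk_unhalted;  /* 4 issue slots per cycle */
        struct tma_level1 r;
        r.frontend_bound  = idq_uops_not_delivered / slots;
        r.bad_speculation = (uops_issued - uops_retired_slots
                             + 4.0 * recovery_cycles) / slots;
        r.retiring        = uops_retired_slots / slots;
        r.backend_bound   = 1.0 - r.frontend_bound - r.bad_speculation - r.retiring;
        return r;
    }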

The basic workflow is always the same: find out which part of the pipeline is the bottleneck. Is it stuck on PCIe I/O, memory, instruction decoding, etc.? In my experience most of the time it is memory, due to heavy pointer dereferencing - which is just the sad state of programming languages these days.
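As a toy illustration of the pointer-dereferencing point (a made-up example, not from any particular codebase): in a linked-list walk every load depends on the previous one, so the core mostly waits on memory latency, while summing a flat array lets the prefetcher and the OoO machinery overlap the loads.

    #include <stddef.h>
    #include <stdint.h>

    struct node { struct node *next; uint64_t value; };

    /* Every load depends on the previous one: the core sits waiting on
     * memory latency, and a top-down profile shows it as memory bound. */
    uint64_t sum_list(const struct node *n)
    {
        uint64_t sum = 0;
        for (; n != NULL; n = n->next)
            sum += n->value;
        return sum;
    }

    /* Same amount of data, but the addresses are independent and
     * contiguous, so the prefetcher and OoO execution hide most of the
     * latency. */
    uint64_t sum_array(const uint64_t *v, size_t len)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += v[i];
        return sum;
    }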

There are a bunch of memory benchmark tests in the NUMA tools package's source repository that demonstrate these effects very well.

The problem with memory being the slow part is that it affects the instruction fetch/decode cycle as well.

Poor instruction selection and sequencing can also hurt throughput: port saturation (e.g. with AVX2) that leaves other dispatch ports idle, premature store buffer flushes where you send 1-entry updates instead of sending stuff in chunks and leave 70% of the store buffer bandwidth idle, etc.
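For the 1-entry-update pattern, a made-up sketch of the two shapes (hypothetical queue, names invented for illustration): publishing after every element with a sequentially consistent store pays a store-buffer drain per entry on x86, while writing a chunk of plain stores and publishing once amortizes that cost.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical single-producer queue; names are invented for
     * illustration, not taken from any real library. */
    struct queue {
        uint64_t       slots[1024];
        _Atomic size_t head;   /* consumer reads entries up to head */
    };

    /* 1-entry updates: a seq_cst publish per element. On x86 each such
     * store compiles to a locked/fencing instruction that drains the
     * store buffer, so you pay a full flush for every single entry. */
    void publish_one_at_a_time(struct queue *q, const uint64_t *src, size_t n)
    {
        size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        for (size_t i = 0; i < n; i++) {
            q->slots[(h + i) % 1024] = src[i];
            atomic_store(&q->head, h + i + 1);   /* seq_cst: store buffer drain */
        }
    }

    /* Chunked updates: fill a batch of slots with plain stores that can
     * sit in the store buffer and retire back-to-back, then pay for one
     * flush per chunk instead of one per entry. */
    void publish_in_chunks(struct queue *q, const uint64_t *src,
                           size_t n, size_t chunk)
    {
        size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        for (size_t done = 0; done < n; ) {
            size_t batch = (n - done < chunk) ? (n - done) : chunk;
            for (size_t j = 0; j < batch; j++)
                q->slots[(h + done + j) % 1024] = src[done + j];
            done += batch;
            atomic_store(&q->head, h + done);    /* one flush per chunk */
        }
    }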

There are a lot of footguns in a modern processor, and usually it's a mixture of these problems with one of them being dominant. A lot of the time it is not possible to completely address the problem without rebuilding the software from scratch and properly leveraging hardware knowledge from the beginning when architecting your code.



