
Can you elaborate? I have a side project (1) where all the profilers I've used give a very muddled picture, so I'm very interested in the question of what slows down code on a modern "big" CPU with wide dispatch, a few kB of decoded-op buffer, and a lot of OoO hardware.

(1) It encodes and decodes a protocol from a potentially untrusted source, so there is obviously a lot of waiting for previous results. That much is clear; however, I expected profilers to show me some causal link between the serial nature and the slow execution, and they don't. I have tried perf, Valgrind's Callgrind and AMD μProf (because I have a Ryzen CPU in both of my main private computers). I'm not sure if the tools suck, my test cases suck, or I just don't know how to interpret the tools' results - my main problem is that the assignment of cost to lines of code seems highly unreliable. Maybe the stupid things these profilers are designed to catch (most of optimization is about not doing stupid things; after that it gets properly hard) aren't the stupid or unavoidable things my code is doing.
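To make the shape of the problem concrete, here is a simplified, made-up sketch (not my actual code) of the kind of decode loop I mean - the position of the next field is only known after the previous field has been decoded, so the loads form one long serial dependency chain:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical decoder: each field starts with a tag byte that encodes
     * its length, so the position of field N+1 depends on fully decoding
     * field N - one long serial chain of dependent loads. */
    size_t decode_stream(const uint8_t *buf, size_t len,
                         uint64_t *out, size_t max_out)
    {
        size_t pos = 0, n = 0;
        while (pos < len && n < max_out) {
            uint8_t tag = buf[pos];
            size_t field_len = (size_t)(tag & 0x07) + 1;   /* 1..8 payload bytes */
            if (pos + 1 + field_len > len)
                break;                                     /* untrusted input: bounds check */
            uint64_t value = 0;
            for (size_t i = 0; i < field_len; i++)
                value = (value << 8) | buf[pos + 1 + i];
            out[n++] = value;
            pos += 1 + field_len;                          /* next position depends on this field */
        }
        return n;
    }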




KDAB's hotspot is quite nice for analyzing perf recordings, and I suggest sampling on stall-cycle events and on "cycles with less than X uops dispatched" events. Yes, attributing cost to lines of code is hard for optimized compiler output, but hotspot can (in the continuous release / `hotspot-git` AUR package) attribute it to the disassembly of a function.


One way of approaching the problem is with the top-down cycle accounting methodology (TMA).

You can find a nice spreadsheet from Intel here:

https://download.01.org/perfmon/TMA_Metrics.xlsx

Also, using Intel tools gives you much clearer answers. There are a bunch of scripts on top of perf that try to automate this; you can find them in the pmu-tools git repo.
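If you want to see the level-1 math behind it, for a 4-wide Intel core it boils down to roughly the following (a sketch using counter names from the spreadsheet above; the exact events and the slot width vary by microarchitecture, so check the sheet for your CPU - pmu-tools' toplev computes all of this for you):

    /* Sketch of the level-1 top-down (TMA) breakdown for a 4-issue Intel
     * core, using the counter names from the spreadsheet above. Slot width
     * and event names differ between microarchitectures, so treat this as
     * illustrative only. */
    struct tma_level1 {
        double frontend_bound, bad_speculation, retiring, backend_bound;
    };

    static struct tma_level1
    tma_level1_breakdown(double clk_unhalted,           /* CPU_CLK_UNHALTED.THREAD */
                         double idq_uops_not_delivered, /* IDQ_UOPS_NOT_DELIVERED.CORE */
                         double uops_issued,            /* UOPS_ISSUED.ANY */
                         double uops_retired_slots,     /* UOPS_RETIRED.RETIRE_SLOTS */
                         double recovery_cycles)        /* INT_MISC.RECOVERY_CYCLES */
    {
        double slots = 4.0 * clk_unhalted;  /* 4 issue slots per cycle */
        struct tma_level1 r;
        r.frontend_bound  = idq_uops_not_delivered / slots;
        r.bad_speculation = (uops_issued - uops_retired_slots
                             + 4.0 * recovery_cycles) / slots;
        r.retiring        = uops_retired_slots / slots;
        r.backend_bound   = 1.0 - r.frontend_bound - r.bad_speculation - r.retiring;
        return r;
    }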

The basic workflow is always the same: find out which part of the pipeline is the bottleneck. Is it stuck on PCIe I/O, memory, instruction decoding, etc.? In my experience most of the time it is memory, due to heavy pointer dereferencing - which is just the sad state of programming languages these days.
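As a toy illustration of the pointer-dereferencing point (a made-up example, not from any particular codebase): in a linked-list walk every load depends on the previous one, so the core mostly waits on memory latency, while summing a flat array lets the prefetcher and the OoO machinery overlap the loads.

    #include <stddef.h>
    #include <stdint.h>

    struct node { struct node *next; uint64_t value; };

    /* Every load depends on the previous one: the core sits waiting on
     * memory latency, and a top-down profile shows it as memory bound. */
    uint64_t sum_list(const struct node *n)
    {
        uint64_t sum = 0;
        for (; n != NULL; n = n->next)
            sum += n->value;
        return sum;
    }

    /* Same amount of data, but the addresses are independent and
     * contiguous, so the prefetcher and OoO execution hide most of the
     * latency. */
    uint64_t sum_array(const uint64_t *v, size_t len)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += v[i];
        return sum;
    }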

There are a bunch of memory benchmark tests in the NUMA tools package's source repository that demonstrate these effects very well.

The problem with memory being the slow part is that it affects the instruction fetch/decode cycle as well.

Poor instruction selection and sequencing can also hurt throughput: port saturation (e.g. with AVX2) that leaves other dispatch ports idle, premature store buffer flushes where you send 1-entry updates instead of sending stuff in chunks and leave 70% of the store buffer bandwidth idle, etc.
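For the 1-entry-update pattern, a made-up sketch of the two shapes (hypothetical queue, names invented for illustration): publishing after every element with a sequentially consistent store pays a store-buffer drain per entry on x86, while writing a chunk of plain stores and publishing once amortizes that cost.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical single-producer queue; names are invented for
     * illustration, not taken from any real library. */
    struct queue {
        uint64_t       slots[1024];
        _Atomic size_t head;   /* consumer reads entries up to head */
    };

    /* 1-entry updates: a seq_cst publish per element. On x86 each such
     * store compiles to a locked/fencing instruction that drains the
     * store buffer, so you pay a full flush for every single entry. */
    void publish_one_at_a_time(struct queue *q, const uint64_t *src, size_t n)
    {
        size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        for (size_t i = 0; i < n; i++) {
            q->slots[(h + i) % 1024] = src[i];
            atomic_store(&q->head, h + i + 1);   /* seq_cst: store buffer drain */
        }
    }

    /* Chunked updates: fill a batch of slots with plain stores that can
     * sit in the store buffer and retire back-to-back, then pay for one
     * flush per chunk instead of one per entry. */
    void publish_in_chunks(struct queue *q, const uint64_t *src,
                           size_t n, size_t chunk)
    {
        size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        for (size_t done = 0; done < n; ) {
            size_t batch = (n - done < chunk) ? (n - done) : chunk;
            for (size_t j = 0; j < batch; j++)
                q->slots[(h + done + j) % 1024] = src[done + j];
            done += batch;
            atomic_store(&q->head, h + done);    /* one flush per chunk */
        }
    }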

There are a lot of footguns in a modern processor, and usually it's a mixture of these problems with one of them being dominant. A lot of the time it is not possible to completely address the problem without rebuilding the software from scratch and properly leveraging hardware knowledge from the beginning when architecting your code.



