> we need something that does more than just show aggregates, like a sampling profiler would.
This isn't really a limitation of sampling profilers. This is really a limitation of the front-end nearly every sampling profiler sticks its data into. Basically they just aggregate sampling data and stop there.
For the Gecko platform we use a sampling profiler, but we're able to correlate samples with unresponsiveness events. For instance, you might have only 0.3% of your samples running garbage collection or X, but that 0.3% was all consecutive and was scheduled during a VSync event, preventing the app from displaying the frame on time. You want your UI to highlight things that impact latency even if the function was fast in aggregate, because it still managed to make you skip a frame on the screen.
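As a rough illustration of that correlation step (a hypothetical sketch, not the Gecko profiler's actual code; the function names and data shapes here are made up): given sample timestamps plus the windows in which a frame deadline was missed, you can surface the stacks that were running during the jank even when they're rare in aggregate.

```python
# Hypothetical sketch: correlate profiler samples with missed-frame windows.
# Not the Gecko profiler's real code; names and data shapes are invented.
from bisect import bisect_right
from collections import Counter

def stacks_during_jank(samples, jank_windows):
    """samples: list of (timestamp, stack_tuple), sorted by timestamp.
    jank_windows: sorted list of (start, end) intervals where a frame deadline was missed."""
    hits = Counter()
    starts = [s for s, _ in jank_windows]
    for ts, stack in samples:
        i = bisect_right(starts, ts) - 1          # latest window starting at or before ts
        if i >= 0 and ts <= jank_windows[i][1]:   # sample falls inside that window
            hits[stack] += 1
    return hits.most_common()

# Example: GC shows up in only a handful of samples overall, but all of them
# land inside a missed-frame window, so it gets flagged as a latency culprit.
samples = [(16.1, ("main", "paint")), (16.4, ("main", "gc_mark")),
           (16.7, ("main", "gc_sweep")), (33.2, ("main", "paint"))]
jank_windows = [(16.3, 16.9)]   # one frame whose deadline was missed
print(stacks_during_jank(samples, jank_windows))
```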
Tracing vs. sampling is just a choice of data collection method. The issue here is the UI, and how displaying the data in a strictly aggregate/bottom-up view isn't enough.
Here's a demo of our open source cross-platform multi-process sampling low-level profiler we use:
http://people.mozilla.org/~bgirard/keyboard_timeline.mp4
The ability to show something more than aggregates is, sure, a UI/front-end issue. (And an issue with intermediate storage, obviously).
But it's misleading to say that tracing-vs-sampling isn't a factor, unless you believe that the most-performant sampling and the most-performant tracing are roughly comparable. TFA claims otherwise; the claim is that the maximum resolution of tracing is much better than the maximum resolution of sampling.
That's nonsense. If anything the maximum resolution of sampling is higher, since it has lower overhead, so it perturbs the measured thing less.
However, most sampling profilers only report CPU time spent in a task, not wall time, so the things he discusses, like time spent waiting for a task to complete, would be obscured. It would be perfectly possible to report the number of samples in which a thread was blocked in a method call (though it could not be said with certainty that it was the same invocation, it would be enough to see that the majority of blocked time was spent there); it's just that they typically do not do so.
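As an illustration of that last point, here is a minimal wall-clock sampling sketch (a toy, not any particular profiler's implementation): it snapshots every thread's stack on a timer, so a thread that is blocked in a call still accumulates samples there, which is exactly the wall-time information a CPU-time-only sampler hides.

```python
# Minimal wall-clock sampling sketch (illustrative only): samples every thread's
# current stack on a fixed interval, whether the thread is running or blocked.
import sys, time, threading, traceback
from collections import Counter

class WallClockSampler:
    def __init__(self, interval=0.01):
        self.interval = interval
        self.counts = Counter()           # stack (tuple of frame descriptions) -> sample count
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            for tid, frame in sys._current_frames().items():
                if tid == self._thread.ident:
                    continue              # don't profile the sampler itself
                stack = tuple(f"{f.name} ({f.filename}:{f.lineno})"
                              for f in traceback.extract_stack(frame))
                self.counts[stack] += 1   # blocked threads are counted too
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

# Usage: a thread that spends most of its wall time blocked still shows up.
sampler = WallClockSampler(); sampler.start()
t = threading.Thread(target=lambda: time.sleep(0.5)); t.start(); t.join()
sampler.stop()
for stack, n in sampler.counts.most_common(3):
    print(n, "samples in:", stack[-1])
```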
> maximum resolution of sampling is higher, since it has lower overhead
The article claims otherwise.
This is apparently mostly related to the costs of preemption and resulting icache replacement.
> though it could not be said with certainty that it was the same invocation, it would be enough to see that the majority of blocked time was spent there
But that's not quite what's being discussed. The main interesting thing here is tracking what oddness causes outlier worst-case times, which very much does require tracking individual invocations.
And I'm calling that aspect of the article nonsense. The cost of preemption is incurred only infrequently, so even if a sample slightly perturbs what runs after it, it does so infrequently, and the point at which it measures (assuming it has not been affected by the prior sample) more accurately represents a system without any instrumentation.
Assuming each invocation has a unique stack trace, each call site can still effectively be tracked through sampling. Looking at his examples, this seems reasonably likely, as they all have quite different behaviours.
What tracing does do is permit a clear sequential analysis of macro behaviours at an arbitrary granularity.
If the macro behaviours are chosen with sufficiently low resolution that they are large enough for instrumentation to be an immeasurable overhead, then you obviously get a very clear and accurate picture of the system behaviour at that reduced resolution.
Tracing RPC calls and other similar behaviours, as done in the blog, is a good example, but it isn't down to increased resolution; quite the opposite.
The main problem the article is talking about is tracking 99th-percentile issues (e.g. tail latency), which, given the nature of sampling, are difficult to capture otherwise.
Overhead is an issue with instrumentation, but as the article mentions, there are a number of clever ways to minimize it. Some simpler examples are physical probing and dynamic instrumentation. ftrace might be a nice example of instrumentation on Linux worth looking at.
Basically, you define structs that you then fill in your code, then hand them to a ppt-generated library routine. It's a bit of a pain, but you get a few nice features (there's a rough sketch of the general shape after this list):
- The storage is completely non-blocking: you're just writing to shared memory.
- Overhead is linear in the write rate: if you have a spare CPU core on the machine, the overhead on your active cores is a constant + the cost of filling & copying (once) the struct.
- It'll prefer to lose some data in case the disk falls behind in writing -- it won't slow down your app. Auto-generated sequence numbers tell you what/if any data was lost.
- You can leave the code on in production: the listener attaches via ptrace() and injects a destination buffer into your process (constant time), and if there's no buffer, the data gets harmlessly dropped.
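To make that concrete, here is a minimal sketch of the general shape, not ppt's actual API (the struct layout, slot count, and names are made up): fixed-size records go into a memory-mapped ring, each stamped with a monotonically increasing sequence number, so the writer never blocks and a reader can tell from gaps in the sequence exactly which records were lost.

```python
# Illustrative sketch only (not ppt's real interface): fixed-size trace records
# written to a memory-mapped ring buffer, each stamped with a sequence number
# so a reader can tell exactly which records were overwritten/lost.
import mmap, struct

REC = struct.Struct("<QQd")          # seq, event_id, value
NSLOTS = 1024

class TraceRing:
    def __init__(self):
        self.buf = mmap.mmap(-1, REC.size * NSLOTS)   # anonymous shared memory
        self.seq = 0

    def write(self, event_id, value):
        # Writer never blocks: it always overwrites the oldest slot.
        slot = self.seq % NSLOTS
        REC.pack_into(self.buf, slot * REC.size, self.seq, event_id, value)
        self.seq += 1

    def read_all(self):
        records = []
        for slot in range(NSLOTS):
            seq, event_id, value = REC.unpack_from(self.buf, slot * REC.size)
            if seq or event_id or value:              # skip never-written slots
                records.append((seq, event_id, value))
        records.sort()                                # sequence numbers give the order
        return records

ring = TraceRing()
for i in range(NSLOTS + 5):                           # overflow the ring on purpose
    ring.write(event_id=7, value=float(i))
recs = ring.read_all()
print("first surviving seq:", recs[0][0])             # 5: seqs 0-4 were overwritten
```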
I've left it alone as I haven't needed it for myself in a while, and there hasn't been interest (I don't really advertise it, though). If anyone's interested, just ping me or file an issue.
Sampling and instrumenting profilers both lie, in different ways. Sampling is incomplete; instrumenting has a potentially strong observer effect. Neither is the ultimate or necessary future. They are both useful.
The most interesting bit about that, to me, was the diagram of a Google search's RPC tree. That was awesome, and I imagine these days it's even more so.
There's also a brief mention in http://joeduffyblog.com/2015/11/19/asynchronous-everything/ about making async/remote calls not break stack traces. That feels like it ought to be somewhat related to the distributed performance profiling discussed here.
Sampling is useful to detect where to direct inquiry later; creating a flame graph of the whole application is quite taxing, and sampling can help restrict the area of research.