The ability to show something more than aggregates is, sure, a UI/front-end issue. (And an issue with intermediate storage, obviously).
But it's misleading to say that tracing-vs-sampling isn't a factor, unless you believe that the most-performant sampling and the most-performant tracing are roughly comparable. TFA claims otherwise; the claim is that the maximum resolution of tracing is much better than the maximum resolution of sampling.
That's nonsense. If anything the maximum resolution of sampling is higher, since it has lower overhead, so it perturbs the measured thing less.
However, most sampling profilers only report CPU time spent in a task, not wall time, so the things he discusses, like time spent waiting for a task to complete, would be obscured. It would be perfectly possible to report the number of samples in which a thread was blocked in a method call (though it could not be said with certainty that it was the same invocation, it would be enough to see that the majority of blocked time elapsed there); it's just that they typically do not.
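The "report samples in which a thread was blocked" idea can be sketched in a few lines, since `sys._current_frames()` captures every thread's stack regardless of whether it is running or parked in a blocking call. This is a hypothetical sketch, not anything from the article; the worker and helper names are invented.

```python
import sys
import threading
import time
from collections import Counter

def sample_all_threads(duration=0.5, interval=0.01):
    """Periodically capture every thread's innermost frame, including
    threads blocked in C calls, approximating wall-time attribution
    rather than CPU-time attribution."""
    counts = Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for tid, frame in sys._current_frames().items():
            if tid == threading.get_ident():
                continue  # skip the sampler thread itself
            # Key each sample by the innermost Python call site.
            counts["%s:%d:%s" % (frame.f_code.co_filename,
                                 frame.f_lineno,
                                 frame.f_code.co_name)] += 1
        time.sleep(interval)
    return counts

def blocked_worker():
    # Spends its time blocked; a CPU-time profiler would barely see it.
    time.sleep(1.0)

t = threading.Thread(target=blocked_worker)
t.start()
samples = sample_all_threads()
t.join()
```

The blocked call dominates `samples` even though it consumed essentially no CPU, which is exactly the blocked-time reporting the comment describes.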
> maximum resolution of sampling is higher, since it has lower overhead
The article claims otherwise.
This is apparently mostly related to the costs of preemption and resulting icache replacement.
> though it could not be said with certainty that it was the same invocation, it would be enough to see that the majority of blocked time elapsed there
But that's not quite what's being discussed. The main interesting thing here is tracking what oddness causes outlier worst-case times, which very much does require tracking individual invocations.
And I'm calling that aspect of the article nonsense. The cost of preemption is incurred only infrequently, so even if a sample slightly perturbs what runs after it, it does so only infrequently, and the point at which it measures (assuming it has not been affected by the prior sample) more accurately represents a system without any instrumentation.
Assuming each invocation has a unique stack trace, each call site can still be tracked effectively through sampling. Looking at his examples, this seems reasonably likely, as they all have quite different behaviours.
What tracing does do is permit a clear sequential analysis of an arbitrary granularity of macro behaviours.
If the macro behaviours are chosen at a sufficiently coarse granularity that the instrumentation is an immeasurable overhead, then you obviously get a very clear and accurate picture of system behaviour at that reduced resolution.
Tracing RPC calls and other similar behaviours, as done in the blog, is a good example, but the benefit isn't down to increased resolution; quite the opposite.
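A minimal sketch of that kind of coarse-grained tracing, where the per-span cost (two clock reads and a list append) is negligible against operations lasting milliseconds. The span names and `TRACE` list are invented for illustration:

```python
import time
from contextlib import contextmanager

TRACE = []  # (name, start, end) spans, appended in completion order

@contextmanager
def span(name):
    """Record one coarse-grained span of wall-clock time."""
    start = time.monotonic()
    try:
        yield
    finally:
        TRACE.append((name, start, time.monotonic()))

def fake_rpc():
    time.sleep(0.02)  # stand-in for a network round trip

with span("handle_request"):
    with span("rpc:get_user"):
        fake_rpc()
    with span("rpc:get_orders"):
        fake_rpc()
```

Because inner spans close first, the trace reads back as a clear sequential record of which macro behaviours ran, in what order, and for how long, without sampling resolution entering into it at all.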