This is why I love HN. I'm actually working on a lightweight metrics system right now, and this accurate little tool (and the author's article) is exactly what I needed.
My poison is vegeta, but one of the genius ideas behind both is HDR histograms: as a design, as an implementation detail, and as a data transport mechanism.
There are few production projects where I don't have a reason to depend on HDR histograms for various under-the-hood functionality.
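Rough sketch of the core idea, if anyone's curious (this is not the actual HdrHistogram bucket layout, which is cleverer; it's just a toy log-bucketed histogram with the same property of bounded relative error and O(1) recording, which is what makes it cheap enough to ship around as a transport format):

    import math
    from collections import Counter

    class LogBucketHistogram:
        """Toy histogram with bounded relative error, in the spirit of
        HDR histograms (the real thing uses a smarter bucket layout)."""

        def __init__(self, rel_error=0.01):
            self.base = 1.0 + rel_error
            self.counts = Counter()
            self.total = 0

        def record(self, value):
            # Bucket index grows logarithmically, so precision is relative,
            # not absolute: 1 us and 10 s latencies both get ~1% resolution.
            idx = math.ceil(math.log(value, self.base)) if value > 0 else 0
            self.counts[idx] += 1
            self.total += 1

        def value_at_percentile(self, pct):
            # Walk buckets in order until we've covered pct% of the samples.
            target = self.total * pct / 100.0
            seen = 0
            for idx in sorted(self.counts):
                seen += self.counts[idx]
                if seen >= target:
                    return self.base ** idx  # upper edge of the bucket
            return 0.0

    h = LogBucketHistogram()
    for latency_ms in (3, 5, 5, 7, 9, 12, 250):
        h.record(latency_ms)
    print(h.value_at_percentile(50), h.value_at_percentile(99))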
I think there are times when using a max makes sense and times when using a percentile makes sense.
For measurements that are entirely within my own infrastructure, I'll always use the max. Outliers there are my responsibility and I want to fix them asap.
For measurements that originate from clients, I'll always use percentiles. There you're measuring the internet, which means people on patchy wifi connections or someone whose train just went through a tunnel. The max will always be spiky and there's nothing you can do about it.
You may (or may not!) be conflating summary statistics with signal filtering here. For latencies as experienced by interactive users, the very high percentiles (including the maximum) are the summary statistics that matter, period.
It sounds like in your case the data is a mixed distribution of non-actionable garbage and actionable signal, and if you could separate out the actionable part, you would benefit from looking at the tail of that distribution.
However, the best tool you currently have for filtering out the garbage is to look only at the central portion of the mixed distribution. This comes with some false negatives and false positives, and may (or may not!) be the ideal filter in your case.
The 95th-percentile latency seems way more useful as a metric than max() would be. It’s better to spend most of your time improving the experience of most users than debugging ultra-rare issues that affect hardly anyone.
The effective latency a user sees is the maximum of all latencies observed during the session. If a session requires 10 observable requests, then 1 - 0.95^10 ≈ 40% of users will experience >95th-percentile latency at some point. There are many caveats (I specifically said "observable" because there are tons of ways to hide latency from users, and the number of observable requests can vary greatly), but it is never the case that the raw per-request latency equals the user-visible latency.
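That back-of-the-envelope arithmetic, for a few session sizes (assuming independent per-request latencies, which real systems often violate):

    # Probability that a session of n independent requests sees at least
    # one latency above the per-request 95th percentile: 1 - 0.95**n.
    for n in (1, 5, 10, 20, 50):
        p = 1 - 0.95 ** n
        print(f"{n:3d} requests -> {p:.0%} of sessions exceed p95 at least once")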
Also worth mentioning the slightly fuzzier problem that the effect of latency on user satisfaction is not linear. A user who experiences four 125 ms page loads followed by one 500 ms page load is less happy than a user experiencing five 200 ms page loads.
So the user experience of your system is strongly driven by the tails of the latency distribution. Practically all of the useful information is beyond 95 %, not below it.
Upvoting because this is one of the most common misconceptions around performance I encounter.
You have received several good responses already, and I'll add just one more: the latency distribution of many software systems is in practice closer to a power law than a normal distribution. This means the "outliers" dominate almost every way of looking at the distribution, which means you will find the very high percentiles (including the max) more informative about what's going on (on a per-request basis) than the lower percentiles.
Another feature of power laws is that it's hard to estimate their shape from the central values -- one of the more effective methods is to fit a tail exponent by MLE to the tail (i.e. what's beyond the 95th percentile) and then extrapolate from that to the central values when necessary. This yields a less biased estimate of the mean, for example.
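One concrete way to do that tail fit is the Hill estimator; a minimal sketch, assuming you have raw latency samples and glossing over the usual caveats about choosing the cutoff:

    import math

    def hill_tail_exponent(samples, tail_quantile=0.95):
        """MLE (Hill) estimate of the tail exponent alpha for values
        above the chosen quantile, assuming a roughly Pareto-shaped tail."""
        xs = sorted(samples)
        cutoff = int(len(xs) * tail_quantile)
        x_min = xs[cutoff]
        tail = xs[cutoff:]
        # alpha_hat = k / sum(log(x_i / x_min)) over the k tail samples
        return len(tail) / sum(math.log(x / x_min) for x in tail)

An alpha at or below ~2 means the tail dominates the variance, and at or below 1 even the mean, which is exactly the regime where the central percentiles tell you almost nothing.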
It depends how many requests (both frontend and backend) each user triggers and the odds of a big latency spike on at least one of them.
For single requests, targeting the 95th percentile would probably have been just fine back in the days of static sites served by a single Apache instance, but that doesn't describe many modern systems. Nowadays a single HTTP request to a load balancer can trigger a flurry of blocking queries to fulfil it (e.g. a distributed Elasticsearch query that can't combine the results from each node and return until EVERY node has responded), and then your worst-case performance quickly begins to dominate.
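A quick simulation of that fan-out effect (purely illustrative: made-up lognormal per-node latencies and a request that blocks on every node):

    import random

    def request_latency(fanout, mu=3.0, sigma=0.5):
        # The request can't return until every backend node has responded,
        # so its latency is the max of the per-node latencies.
        return max(random.lognormvariate(mu, sigma) for _ in range(fanout))

    for fanout in (1, 10, 100):
        samples = sorted(request_latency(fanout) for _ in range(10_000))
        p50 = samples[len(samples) // 2]
        print(f"fan-out {fanout:3d}: median request latency ~{p50:.0f} ms")

Even the *median* user-facing latency ends up being set by the backends' tail once the fan-out gets wide enough.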
But presumably those slow requests aren't evenly distributed. Say you have some server-generated page plus static CSS, JS, and images. One would presume the static resources are less likely to be in the 95th percentile (although I guess it depends a lot on the architecture).
Aside from everything else already mentioned (which is true, accurate, and discussed in the article), it’s important to remember that tails skew your percentiles quite heavily, so fixing your max could easily drop your 95th percentile by a good margin on its own. If that doesn’t improve the experience of most users, why believe that the 95% number represents the experience of most people? In fact, isn’t it possible, even likely, that your entire 95% number is an artifact of your common case meeting your max, rather than its own mode with its own root cause?
In practice, what you really want to focus on is solving the modes of your system, but identifying them can be quite hard (aside from the obvious one, the max). Percentiles are a useful tool, but don’t mistake the tool for the truth or accidentally mistake the measure for the target.
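A small illustration of that effect, using a made-up bimodal system where 6% of requests hit a slow path:

    import random

    random.seed(1)
    # 94% fast path around 50 ms, 6% slow path around 900 ms.
    latencies = sorted(
        random.gauss(900, 100) if random.random() < 0.06 else random.gauss(50, 10)
        for _ in range(100_000)
    )
    p90 = latencies[int(0.90 * len(latencies))]
    p95 = latencies[int(0.95 * len(latencies))]
    print(f"p90 ~{p90:.0f} ms, p95 ~{p95:.0f} ms, max ~{latencies[-1]:.0f} ms")
    # The p95 already sits inside the slow mode: fixing the slow path
    # would collapse the p95 as well, not just the max.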
Or you use microservices, where a lot of requests are sent between services, which makes it more and more likely that each user-initiated request hits the max.
"99% of users experience ~99.995%’ile response times, "
Most of the time, you will experience adequate performance from our service. However, from time to time, you will also experience anomalies. We won't be fixing the sources of anomalous behavior because they're rare.
I don't remember where I saw it but there's also the idea of setting a threshold and asking what percentage of users exceeded it, e.g. how many users are seeing page load times > 1 second.
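That metric is trivial to compute from raw samples; a sketch, with a hypothetical 1-second budget:

    def fraction_over_budget(latencies_ms, budget_ms=1000):
        # "What share of page loads took longer than the budget?"
        return sum(1 for t in latencies_ms if t > budget_ms) / len(latencies_ms)

    print(f"{fraction_over_budget([120, 340, 980, 1500, 2100]):.0%} of loads over 1 s")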
> Service time is how long it takes to do the work.
> Response time is the amount of time spent waiting before the work starts.
Is that right? I thought response time was the time from user input to completed response.
I'm not sure (I found parts of the article hard to parse), but I got the impression that the author's generator won't generate a new request until the previous one is complete. That's not how I'd write a generator.
Also, it seems odd that the test-rig would discard responses that arrive outside of the test window. Surely it should record all responses to requests that were emitted during the window, even if they arrive outside the window?
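For contrast, here is a minimal open-loop generator sketch (the names and the send_request callable are mine, not the author's tool): requests are launched on a fixed schedule whether or not earlier ones have finished, and every response to a request issued inside the window is recorded, even if it lands after the window closes.

    import asyncio
    import time

    async def timed_call(send_request, results):
        start = time.monotonic()
        await send_request()  # any awaitable that performs one request
        results.append(time.monotonic() - start)

    async def open_loop_load(send_request, rate_hz, window_s):
        interval = 1.0 / rate_hz
        results, tasks = [], []
        deadline = time.monotonic() + window_s
        while time.monotonic() < deadline:
            # Fire on schedule regardless of whether earlier requests have
            # finished, so slow responses can't throttle the offered load.
            tasks.append(asyncio.create_task(timed_call(send_request, results)))
            await asyncio.sleep(interval)
        # Wait for in-flight requests issued inside the window, even though
        # their responses arrive after it ends.
        await asyncio.gather(*tasks)
        return results

    # Example: latencies = asyncio.run(open_loop_load(my_request_fn, 100, 10))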
The 95th percentile response time is a bad metric. A typical user executes more than a single request, so it is close to 100% probable that they get a response time above the 95th percentile. Better to tune the maximum response time instead.
I found this article hard to read; it felt like bulleted notes taken after watching a presentation, and many of the statements lacked the context to back them up.
The bottom line is that measuring performance is hard, and that you have to be aware that your measurement code and your metrics system have a high likelihood of misleading you if you don’t deeply understand how they work.
As a side note, the author didn’t mention my favorite metrics pet peeve: the right-hand cliff. Systems will often graph a 0 while waiting for a metrics bucket to fill, as when you are aggregating counts over time intervals. This can make your metric appear to drop off a cliff in the very recent past.
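One mitigation, if you aggregate counts into fixed time buckets yourself (sketch, with made-up names), is simply never to plot the bucket that is still filling:

    import time
    from collections import Counter

    BUCKET_S = 60  # one-minute buckets

    def bucket_of(ts):
        return int(ts) // BUCKET_S * BUCKET_S

    def complete_buckets(event_timestamps, now=None):
        """Counts per bucket, excluding the newest (still-filling) bucket so
        the graph doesn't show a bogus drop at the right-hand edge."""
        now = time.time() if now is None else now
        counts = Counter(bucket_of(ts) for ts in event_timestamps)
        counts.pop(bucket_of(now), None)  # drop the incomplete bucket
        return dict(sorted(counts.items()))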
> I found this article hard to read; it felt like bulleted notes taken after watching a presentation, and many of the statements lacked the context to back them up.
http://web.archive.org/web/20221227140440/http://highscalabi...