This is why I love HN. I'm actually working on a lightweight metrics system right now, and this accurate little tool (and the author's article) is exactly what I needed.
My poison is vegeta, but one of the genius ideas behind both is HDR histograms: as a design, as an implementation detail, and as a data transport mechanism.
There are few production projects where I don't have a reason to depend on HDR histograms for various under-the-hood functionality.
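Rough sketch of the core idea, if anyone's curious (this is not the actual HdrHistogram bucket layout, which is cleverer; it's just a toy log-bucketed histogram with the same property of bounded relative error and O(1) recording, which is what makes it cheap enough to ship around as a transport format):

    import math
    from collections import Counter

    class LogBucketHistogram:
        """Toy histogram with bounded relative error, in the spirit of
        HDR histograms (the real thing uses a smarter bucket layout)."""

        def __init__(self, rel_error=0.01):
            self.base = 1.0 + rel_error
            self.counts = Counter()
            self.total = 0

        def record(self, value):
            # Bucket index grows logarithmically, so precision is relative,
            # not absolute: 1 us and 10 s latencies both get ~1% resolution.
            idx = math.ceil(math.log(value, self.base)) if value > 0 else 0
            self.counts[idx] += 1
            self.total += 1

        def value_at_percentile(self, pct):
            # Walk buckets in order until we've covered pct% of the samples.
            target = self.total * pct / 100.0
            seen = 0
            for idx in sorted(self.counts):
                seen += self.counts[idx]
                if seen >= target:
                    return self.base ** idx  # upper edge of the bucket
            return 0.0

    h = LogBucketHistogram()
    for latency_ms in (3, 5, 5, 7, 9, 12, 250):
        h.record(latency_ms)
    print(h.value_at_percentile(50), h.value_at_percentile(99))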
I think there are times when using a max makes sense and times when using a percentile makes sense.
For measurements that are entirely within my own infrastructure, I'll always use the max. Outliers there are my responsibility and I want to fix them asap.
For measurements that originate from clients, I'll always use percentiles. There you're measuring the internet, which means people on patchy wifi connections or someone whose train just went through a tunnel. The max will always be spiky and there's nothing you can do about it.
You may (or may not!) be conflating summary statistics with signal filtering here. For latencies as experienced by interactive users, the very high percentiles (including the maximum) are the summary statistics that matter, period.
It sounds like in your case the data is a mixed distribution of non-actionable garbage and actionable signal, and if you could separate out the actionable part, you would benefit from looking at the tail of that distribution.
However, the best tool you currently have for filtering out the garbage is to look only at the central portion of the mixed distribution. This comes with some false negatives and false positives, and may (or may not!) be the ideal filter in your case.
The 95th-percentile latency seems way more useful as a metric than max() would be. It’s better to spend most of your time improving the experience of most users than debugging ultra-rare issues that affect hardly anyone.
The effective latency a user sees is the maximum of all latencies observed during the session. If a session requires 10 observable requests, then 1 - 0.95^10 ≈ 40% of users will experience >95th-percentile latency at some point. There are many caveats (I specifically said "observable" because there are tons of ways to hide latency from users, and the number of observable requests can vary greatly), but it is never the case that the raw per-request latency equals the user-visible latency.
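That back-of-the-envelope arithmetic, for a few session sizes (assuming independent per-request latencies, which real systems often violate):

    # Probability that a session of n independent requests sees at least
    # one latency above the per-request 95th percentile: 1 - 0.95**n.
    for n in (1, 5, 10, 20, 50):
        p = 1 - 0.95 ** n
        print(f"{n:3d} requests -> {p:.0%} of sessions exceed p95 at least once")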
Also worth mentioning the slightly fuzzier problem that the effect of latency on user satisfaction is not linear. A user who experiences four 125 ms page loads followed by one 500 ms page load is less happy than a user experiencing five 200 ms page loads.
So the user experience of your system is strongly driven by the tails of the latency distribution. Practically all of the useful information is beyond 95 %, not below it.
Upvoting because this is one of the most common misconceptions around performance I encounter.
You have received several good responses already, and I'll add just one more: the latency distribution of many software systems is in practice closer to a power law than a normal distribution. This means the "outliers" dominate almost every way of looking at the distribution, which means you will find the very high percentiles (including the max) more informative about what's going on (on a per-request basis) than the lower percentiles.
Another feature of power laws is that it's hard to estimate their shape from the central values -- one of the more effective methods is to fit a tail exponent by MLE to the tail (i.e. what's beyond the 95th percentile) and then extrapolate from that to the central values when necessary. This yields a less biased estimate of the mean, for example.
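One concrete way to do that tail fit is the Hill estimator; a minimal sketch, assuming you have raw latency samples and glossing over the usual caveats about choosing the cutoff:

    import math

    def hill_tail_exponent(samples, tail_quantile=0.95):
        """MLE (Hill) estimate of the tail exponent alpha for values
        above the chosen quantile, assuming a roughly Pareto-shaped tail."""
        xs = sorted(samples)
        cutoff = int(len(xs) * tail_quantile)
        x_min = xs[cutoff]
        tail = xs[cutoff:]
        # alpha_hat = k / sum(log(x_i / x_min)) over the k tail samples
        return len(tail) / sum(math.log(x / x_min) for x in tail)

An alpha at or below ~2 means the tail dominates the variance, and at or below 1 even the mean, which is exactly the regime where the central percentiles tell you almost nothing.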
It depends how many requests (both frontend and backend) each user triggers and the odds of a big latency spike on at least one of them.
For single requests, targeting the 95th percentile would probably have been just fine back in the days of static sites served by a single Apache instance, but that doesn't describe many modern systems. Nowadays a single HTTP request to a load balancer can trigger a flurry of blocking queries to fulfil it (e.g. a distributed Elasticsearch query that can't combine the results from each node and return until EVERY node has responded), and then your worst-case performance quickly begins to dominate.
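A quick simulation of that fan-out effect (purely illustrative: made-up lognormal per-node latencies and a request that blocks on every node):

    import random

    def request_latency(fanout, mu=3.0, sigma=0.5):
        # The request can't return until every backend node has responded,
        # so its latency is the max of the per-node latencies.
        return max(random.lognormvariate(mu, sigma) for _ in range(fanout))

    for fanout in (1, 10, 100):
        samples = sorted(request_latency(fanout) for _ in range(10_000))
        p50 = samples[len(samples) // 2]
        print(f"fan-out {fanout:3d}: median request latency ~{p50:.0f} ms")

Even the *median* user-facing latency ends up being set by the backends' tail once the fan-out gets wide enough.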
But presumably those slow requests aren't evenly distributed. Say you have some server-generated page plus static CSS, JS, and images. One would presume the static resources are less likely to be in the 95th percentile (although I guess it depends a lot on the architecture).
Aside from everything else already mentioned (which is true, accurate, and discussed in the article), it’s important to remember that tails skew your percentiles quite heavily, so fixing your max could easily drop your 95th percentile by a good margin on its own. If that doesn’t improve the experience of most users, why believe that the 95% number represents the experience of most people? In fact, isn’t it possible, even likely, that your entire 95% number is an artifact of your common case meeting your max, rather than its own mode with its own root cause?
In practice, what you really want to focus on is solving the modes of your system, but identifying them can be quite hard (aside from the obvious one, the max). Percentiles are a useful tool, but don’t mistake the tool for the truth or accidentally mistake the measure for the target.
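A small illustration of that effect, using a made-up bimodal system where 6% of requests hit a slow path:

    import random

    random.seed(1)
    # 94% fast path around 50 ms, 6% slow path around 900 ms.
    latencies = sorted(
        random.gauss(900, 100) if random.random() < 0.06 else random.gauss(50, 10)
        for _ in range(100_000)
    )
    p90 = latencies[int(0.90 * len(latencies))]
    p95 = latencies[int(0.95 * len(latencies))]
    print(f"p90 ~{p90:.0f} ms, p95 ~{p95:.0f} ms, max ~{latencies[-1]:.0f} ms")
    # The p95 already sits inside the slow mode: fixing the slow path
    # would collapse the p95 as well, not just the max.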
Or you use microservices, where a lot of requests are sent between services, which makes it more and more likely that each user-initiated request hits the max.
"99% of users experience ~99.995%’ile response times, "
Most of the time, you will experience adequate performance from our service. However, from time to time, you will also experience anomalies. We won't be fixing the sources of anomalous behavior because they're rare.
I don't remember where I saw it but there's also the idea of setting a threshold and asking what percentage of users exceeded it, e.g. how many users are seeing page load times > 1 second.
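That metric is trivial to compute from raw samples; a sketch, with a hypothetical 1-second budget:

    def fraction_over_budget(latencies_ms, budget_ms=1000):
        # "What share of page loads took longer than the budget?"
        return sum(1 for t in latencies_ms if t > budget_ms) / len(latencies_ms)

    print(f"{fraction_over_budget([120, 340, 980, 1500, 2100]):.0%} of loads over 1 s")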
> Service time is how long it takes to do the work.
> Response time is the amount of time spent waiting before the work starts.
Is that right? I thought response time was the time from user input to completed response.
I'm not sure (I found parts of the article hard to parse), but I got the impression that the author's generator won't generate a new request until the previous one is complete. That's not how I'd write a generator.
Also, it seems odd that the test-rig would discard responses that arrive outside of the test window. Surely it should record all responses to requests that were emitted during the window, even if they arrive outside the window?
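For contrast, here is a minimal open-loop generator sketch (the names and the send_request callable are mine, not the author's tool): requests are launched on a fixed schedule whether or not earlier ones have finished, and every response to a request issued inside the window is recorded, even if it lands after the window closes.

    import asyncio
    import time

    async def timed_call(send_request, results):
        start = time.monotonic()
        await send_request()  # any awaitable that performs one request
        results.append(time.monotonic() - start)

    async def open_loop_load(send_request, rate_hz, window_s):
        interval = 1.0 / rate_hz
        results, tasks = [], []
        deadline = time.monotonic() + window_s
        while time.monotonic() < deadline:
            # Fire on schedule regardless of whether earlier requests have
            # finished, so slow responses can't throttle the offered load.
            tasks.append(asyncio.create_task(timed_call(send_request, results)))
            await asyncio.sleep(interval)
        # Wait for in-flight requests issued inside the window, even though
        # their responses arrive after it ends.
        await asyncio.gather(*tasks)
        return results

    # Example: latencies = asyncio.run(open_loop_load(my_request_fn, 100, 10))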
The 95th percentile response time is a bad metric. A typical user executes more than a single request, so it is close to 100% probable that they get a response time above the 95th percentile. Better to tune the maximum response time instead.
I found this article hard to read; it felt like bulleted notes taken after watching a presentation, and many of the statements lacked the context to back them up.
The bottom line is that measuring performance is hard, and that you have to be aware that your measurement code and your metrics system have a high likelihood of misleading you if you don’t deeply understand how they work.
As a side note, the author didn’t mention my favorite metrics pet peeve: the right-hand cliff. Systems will often graph a 0 while waiting for a metrics bucket to fill, as when you are aggregating counts over time intervals. This can make your metric appear to drop off a cliff in the very recent past.
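One mitigation, if you aggregate counts into fixed time buckets yourself (sketch, with made-up names), is simply never to plot the bucket that is still filling:

    import time
    from collections import Counter

    BUCKET_S = 60  # one-minute buckets

    def bucket_of(ts):
        return int(ts) // BUCKET_S * BUCKET_S

    def complete_buckets(event_timestamps, now=None):
        """Counts per bucket, excluding the newest (still-filling) bucket so
        the graph doesn't show a bogus drop at the right-hand edge."""
        now = time.time() if now is None else now
        counts = Counter(bucket_of(ts) for ts in event_timestamps)
        counts.pop(bucket_of(now), None)  # drop the incomplete bucket
        return dict(sorted(counts.items()))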
> I found this article hard to read; it felt like bulleted notes taken after watching a presentation, and many of the statements lacked the context to back them up.
http://web.archive.org/web/20221227140440/http://highscalabi...