How Not to Measure Latency (highscalability.com)
98 points by toddh on Oct 5, 2015 | 23 comments



    For example, in a user session with 5 page views that
    load 40 resources per page, how many users will not
    experience worse than 95%’ile of HTTP requests? The
    chances of not seeing it is ~.003%.
This is assuming complete independence, which is very wrong.


This!

The Google and Amazon example in particular is misleading. Most of those requests are for assets loaded from a CDN, which is highly reliable compared to the application code.


Not to mention, accurately measuring independent download times on a web page (or a native app screen) with 40 assets is next to impossible because of overlapping requests, persistent connections, and all sorts of other confounding factors.

The actual latency you should measure for such a download is the total time taken to get to a usable view for the user. How to do that is left as an exercise for the reader ;-)


You mean that the <95% requests will tend to go to the same few users?

If half of them go to a few unlucky users and half go to scattered users, the chance of avoiding a <95% request is 0.6% for the lucky users and 0 for the unlucky ones.

This split is a bit unrealistic (who would go on for five whole page views if performance is that bad?), but it doesn't matter: 99.4% or 99.997% both mean "practically all users experience slow requests".


So, he is using traditional media-heavy web sites as the example in this specific case, but the general point still holds for many web applications such as:

- Google (you hit many search nodes to get the results for a single search, and Google goes to some effort to make sure you get back both accurate and fast responses)

- Facebook (there are many distinct services that have to be queried to render any FB page)

- YouTube (union of the above)

- etc

Now, I hope the engineers building these sites understand the points he is making, but they stand true either way.


For sure - we are concerned w/ the API times. I'm not sure I agree w/ this article.


I like how the article talks about the uselessness of percentiles, when a lot of people are still using averages!

Seriously, one of the best tools we use is to count thresholds and buckets. On a per-route basis, identify a threshold that you deem worrisome (say, 50ms), and then simply count the number of responses above it. Similarly, create buckets: 0-9ms, 10-19, 20-29... I say "best" because it's also very easy to implement (even against streaming data) and doesn't take a lot of memory. Not sure if there's a better way, but we sample our percentiles, which makes me uneasy (though it keeps us to a fixed amount of memory).
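
A minimal sketch of that kind of per-route threshold/bucket counter, in Python (the route names, bucket edges, and thresholds are made up for illustration); it handles a stream of observations in fixed memory:

  import bisect
  from collections import defaultdict

  # Hypothetical bucket edges (ms) and per-route alert thresholds.
  BUCKET_EDGES_MS = [10, 20, 30, 50, 100, 250, 500, 1000]
  THRESHOLD_MS = {"GET /search": 50, "POST /checkout": 100}

  bucket_counts = defaultdict(lambda: [0] * (len(BUCKET_EDGES_MS) + 1))
  over_threshold = defaultdict(int)

  def record(route, latency_ms):
      # O(log #buckets) per observation, fixed memory per route.
      bucket_counts[route][bisect.bisect_right(BUCKET_EDGES_MS, latency_ms)] += 1
      if latency_ms > THRESHOLD_MS.get(route, float("inf")):
          over_threshold[route] += 1

  record("GET /search", 12.4)
  record("GET /search", 87.0)
  record("POST /checkout", 240.0)
  print(dict(over_threshold))          # {'GET /search': 1, 'POST /checkout': 1}
  print(bucket_counts["GET /search"])  # one count per bucket, last slot is 1000ms+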


There's a highly tuned set of libraries available for computing HDR histograms with minimal overhead: https://github.com/HdrHistogram. This gets you a little closer to a continuous distribution, and does away with the branching needed for explicit buckets.
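
The core idea, sketched in plain Python rather than the library's own API: bucket width grows with the value, so you keep roughly constant relative precision over a huge range without enumerating explicit buckets. This toy version is only for intuition; the class name and numbers are invented.

  import math

  class TinyHdrSketch:
      # Toy value-to-bucket mapping with ~1% relative precision; not the real library.
      def __init__(self, precision=0.01):
          self.growth = 1.0 + precision   # each bucket is ~1% wider than the last
          self.counts = {}

      def record(self, value_ms):
          idx = int(math.log(max(value_ms, 1e-9), self.growth))
          self.counts[idx] = self.counts.get(idx, 0) + 1

      def value_at_percentile(self, p):
          target = sum(self.counts.values()) * p / 100.0
          running = 0
          for idx in sorted(self.counts):
              running += self.counts[idx]
              if running >= target:
                  return self.growth ** (idx + 1)   # upper edge of that bucket
          return float("nan")

  h = TinyHdrSketch()
  for v in [3, 5, 8, 12, 20, 45, 300, 1200]:   # latencies in ms
      h.record(v)
  print(round(h.value_at_percentile(95), 1))   # ~1205, near the slowest sample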


NewRelic graphs are another highly misleading example. They average the response time across all endpoints and make it very hard to understand the 95th percentile for a specific endpoint.

They also don't let you exclude static assets from the graphs, so the numbers are fairly unhelpful when trying to understand performance bottlenecks of a dynamic application.


Disclaimer: I'm part of New Relic's Product Management team.

I think you might have a misunderstanding of how our stuff works. While we originally only captured aggregates, for the last ~2 years or so we've been capturing and reporting on every transaction/request that takes place.

As such, when you do a 95th percentile chart in our product, we're not "averaging the percentile" like many monitoring tools do. We are literally looking at every single record during that time.

We also allow filtering by transaction, which means you can check out just the percentile for "CheckoutController" or "AddToCartController" -- definitely not just the aggregate application.

And if you're a customer and want to verify this yourself, just pop over to New Relic Insights (insights.newrelic.com) and run a query like this to really see the power that comes from not pre-aggregating anything:

  SELECT count(*), percentile(duration, 50, 90, 87, 95)
  FROM Transaction
  SINCE 1 day ago
  FACET BY name
That will return all transactions in the last day, grouped by transaction name (ex: "CheckoutController"), along with their respective count and median, 90th, and 95th percentile. I threw in the 87th percentile just to show that you can do that too if that's your kind of thing.


> every transaction/request taking place.

It's definitely gotten better; however, I think the default view is still just averaging all requests, which isn't very helpful.

For a blocking server (like most Rails apps use) the key insight has to do with which controller actions need to be optimized to prevent slow user-facing aggregate response times.

I don't currently have it in use on any of my apps, but next time I'll give that query a try.


Yes, the first graph is for the entire app. But we (try to) make it easy to drop down to a view that shows you the most time-consuming controllers and then quickly see performance for just a single one.

Feel free to drop me an email at lightbody@newrelic.com if you decide to check it out again and have more feedback.

Take care!


I appreciate the response here, and appreciate the offer.


This is a problem with other APM tools, too... someone inevitably wants to know in one number how the system is responding, and the best you can do naively is to average all of them.

Defining critical transactions, and measuring them against their SLAs is really the only valid way of summarizing total application performance (x% of traffic is meeting SLA, y% is 1 stddev above, z% 2 stddev, etc).
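
A rough sketch of that kind of banded summary for one critical transaction (the durations, SLA value, and banding scheme here are invented; a real rollup would pull these from your APM or logs):

  from statistics import pstdev

  durations_ms = [32, 41, 38, 55, 47, 120, 36, 44, 300, 39]   # hypothetical samples
  sla_ms = 50

  sigma = pstdev(durations_ms)
  n = len(durations_ms)

  within_sla = sum(d <= sla_ms for d in durations_ms) / n            # meeting SLA
  over_by_1sd = sum(sla_ms < d <= sla_ms + sigma for d in durations_ms) / n  # within 1 sigma over
  beyond = sum(d > sla_ms + sigma for d in durations_ms) / n         # worse than that

  print(f"{within_sla:.0%} within SLA, {over_by_1sd:.0%} within 1 sigma over, {beyond:.0%} worse")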


In Prometheus we solved this with histograms: we export cumulative buckets of latencies, which can be aggregated, and the quantiles then calculated from them.

http://prometheus.io/docs/practices/histograms/#quantiles
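
A minimal sketch of the calculation this enables (not Prometheus code; the bucket bounds and counts are invented): because each bucket is cumulative (a count of everything at or below its upper bound), buckets from many instances can be summed bound-by-bound, and a quantile estimated by interpolating inside the bucket that crosses the target rank.

  def quantile_from_cumulative(buckets, q):
      # buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound.
      total = buckets[-1][1]
      target = q * total
      prev_bound, prev_count = 0.0, 0
      for bound, count in buckets:
          if count >= target:
              # Linear interpolation inside the bucket that crosses the target rank.
              fraction = (target - prev_count) / (count - prev_count)
              return prev_bound + fraction * (bound - prev_bound)
          prev_bound, prev_count = bound, count
      return buckets[-1][0]

  # Cumulative buckets from two instances, summed per bound, then the 0.95 quantile.
  instance_a = [(0.05, 60), (0.1, 90), (0.5, 98), (1.0, 100)]
  instance_b = [(0.05, 40), (0.1, 70), (0.5, 95), (1.0, 100)]
  summed = [(bound, ca + cb) for (bound, ca), (_, cb) in zip(instance_a, instance_b)]
  print(round(quantile_from_cumulative(summed, 0.95), 3))   # 0.464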


The most important thing to measure is the user-experienced response time. If your app isn't ready until some assets have loaded and your UI is actually displayed and usable, then you need to measure and instrument THAT. (domready is a poor but often usable proxy).

A good APM tool will measure this as well for ajax requests by measuring how long it takes from click to the end of the callbacks executed after receiving the response (i.e. displaying the refreshed content).


I don't know if I understand the point.

"More shocking: 99% of users experience ~99.995%’ile HTTP response times. You need to know this number from Akamai if you want to predict what 99% of your users will experience. You only know a tiny bit when you know the 99%’ile."

What does this mean? Can you explain this better? How is this true? I'm not convinced.

If I'm making requests for assets on a page that requests 100 of them, sure... But the worst asset doesn't dictate the user experience. We're primarily concerned with the API response time, not all of the individual static assets. Assets don't block, so if one little image is slow I wouldn't take that as a deal-breaker. We don't even measure entire page load times, just what the user needs to see to start engaging, especially what loads above the fold.


Interesting article; I dislike vanity metrics too. It seems most stats are designed to tell the best story, rather than the truth or nitty-gritty.

Instead of looking at normalised stats, look at the worst offenders. Be it the URLs or the heaviest queries. Then find out what is using most of the resources and focus on those first.


Averages can be useful, percentiles can be more useful.

However, the biggest asset in useful monitoring is focusing in on the right events and data. Is an average latency across all of your requests useful? Probably much less than averages per API or page.


I keep one of the previous versions of this talk in a bookmarks folder named 'Must see videos'.


This is a summary of https://www.youtube.com/watch?v=lJ8ydIuPFeU. It looks like a good summary, so I guess we won't replace the URL above, even though HN prefers original sources. We did change the title to that of the talk though.


In addition to some of the other comments, another claim of dubious validity is "that 5 percent had to be so bad it pulled the other 95 percent of observations up into bad territory". No, that is demonstrably not the case; it's even noted earlier in the article that the sample graph only saw movement in the 90% and 95% lines.

It feels like the article has taken a philosophical position, then gone through a lot of confirmation bias to support it.


He's taking a very pragmatic position that you will experience directly if you're ever responsible for the SLA of a "web scale" service.

His 5% / 95% observation is simply that you should not be focusing on the average, or even majority case if that's not what your users actually experience. One interpretation of the specific example you don't like is that more requests fell into a given range, thereby giving a distribution that the graph reflects. Another potential interpretation is just what he describes, where you have a set of especially poor performing requests at that point in time which effectively skew the distribution so that the 95th is "pulled up". His broader point is that you don't know based on this graph what you're dealing with, and he demonstrates a better visualization technique to determine the precise distribution of responses without having to look at every request.




