
Unfortunately I can't find the reference any more, but there was this great presentation by someone explaining how in computer systems developers will "optimise for the requirement", and ignore anything outside of it.

The example was latency. If the programmers were told to achieve less than 1 ms for 99% of the requests -- then sure enough -- the 1% of requests would have sky-high latencies of multiple seconds.

If told they needed to achieve 1 ms for 99% and 100 ms for 99.99%, then -- you guessed it -- the worst 0.01% would be tens of seconds or even minutes.

Inevitably, there would be a visible discontinuity in the latency histogram just above whatever the official business requirement was.

It's a difficult thing to fix, because no matter where you set your threshold, unless it's 100%, you won't meet it. And even 100% is just a fiction, because you'd need an infinite number of tests to achieve it.




> how in computer systems developers will "optimise for the requirement", and ignore anything outside of it.

It’s worth noting that this is a learned behavior which is not natural for many people.

There are plenty of people who will naturally think about a problem in context and want to solve for the problem and not just the metric.

Those people will be slower to satisfy management demands, which will be metric-driven, and in order to progress in their careers they will learn to focus on the requirement and not the problem.


I interned at a company decades ago that was doing large dynamic system simulations and worked with an older guy who was known for knowing how to make the solvers work on impossible systems. One of his biggest tricks was replacing discontinuities with suitably tuned arctan functions, so that transitions were smooth and the integrators didn't thrash about computing derivatives.
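That smoothing trick fits in a few lines. This is a generic sketch; the sharpness parameter `k` is the part that would be hand-tuned per transition:

```python
import math

def smooth_step(x, k=50.0):
    """Smooth approximation of a unit step at x = 0.

    Larger k -> closer to a true step, but steeper derivatives
    for the integrator to resolve, so k is tuned per transition.
    """
    return 0.5 + math.atan(k * x) / math.pi

# Far from the transition it matches the hard step closely...
print(smooth_step(-1.0))  # near 0
print(smooth_step(1.0))   # near 1
# ...but the derivative is finite everywhere, so a variable-step
# ODE solver doesn't grind to a halt trying to locate a jump.
```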


I don't know if it's the correct talk, but those ideas are explored in "How NOT to Measure Latency" by Gil Tene: https://www.youtube.com/watch?v=lJ8ydIuPFeU


This talk forever changed my views on metrics.


Wow, holy shit, that talk did not disappoint. I will never look at page loads, response times, and percentiles the same way again. Paradigm-shifting, and a really rare case of a groundbreaking insight.


I think it may have been a different talk either by the same guy, or someone from the same company.


From practical experience, I find it a bit hard to believe this is actually how it works for latency. Were any specific examples provided?


Lots of examples.

Typically the issue was caused by garbage collection. You can twiddle with the parameters to meet your 99% latency goal, and then fail your 1% spectacularly.

Similarly, anything involving the network would slowly be optimised to meet the "typical case" requirements, while the extremes would be terrible.

If I remember correctly, this was a talk by someone working at a real-time trading firm, where latency was a critical metric for all of their system designs. He had a lot of charts with very visible upticks in latency at the "nice round numbers" where the requirements were set.


I don't know the reasons for this, but when I hammer my Cloudflare Workers with a stress test, some requests take 10-100 times longer than the majority.


I can't quite remember it, but a few weeks ago I read an article about performance, and in the end they said something like "include the mean in your business requirement".

Nobody experiences the mean, but every outlier in your optimization will affect it.
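A toy illustration of that point, with made-up numbers: a p99 target can be met exactly while a handful of outliers completely dominate the mean.

```python
# 1000 requests: 990 fast ones, 10 pathological outliers (made-up numbers)
latencies_ms = [1.0] * 990 + [5000.0] * 10

def percentile(values, p):
    """Nearest-rank percentile."""
    s = sorted(values)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

p99 = percentile(latencies_ms, 99)            # 1.0 ms -- the p99 target looks fine
mean = sum(latencies_ms) / len(latencies_ms)  # ~51 ms -- the outliers dominate
print(p99, mean)
```

So a mean target punishes every outlier, even the ones a percentile target lets you hide past the threshold.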


I can totally imagine how that happens... The software isn't performing well enough, so some programmer says "I can rewrite the request queueing system to prioritize requests just before the 1ms deadline and deprioritize the ones where the deadline has already been missed". End result, a step...


Deep in the internals of Google, we actually had a system that was putting incoming requests not in a queue (first-in-first-out), but in a stack (last-in-first-out).

The system in question was essentially not meant to have any backlog of requests under normal operations. But when overloaded, for this particular system it was better to serve the requests it could serve very fast, and just shed load on the overflow.

(I don't remember more specifics, and even if I could, I would probably not be allowed to give them..)
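The general shape of the pattern (a generic sketch, not the actual Google system) is a bounded stack that serves newest-first and silently sheds the oldest entries:

```python
from collections import deque

class LifoShedQueue:
    """Serve the newest request first; shed the oldest when full.

    Generic sketch of LIFO-with-load-shedding: under normal load the
    structure is nearly empty so ordering barely matters, but under
    overload the freshest requests (whose callers are still waiting)
    get served, and stale ones fall off the bottom.
    """

    def __init__(self, max_depth=100):
        # deque with maxlen drops from the LEFT (oldest) on overflow
        self._stack = deque(maxlen=max_depth)

    def submit(self, request):
        self._stack.append(request)      # push newest on the right

    def next_request(self):
        return self._stack.pop() if self._stack else None  # pop newest

q = LifoShedQueue(max_depth=2)
for r in ("req1", "req2", "req3"):
    q.submit(r)
print(q.next_request())  # req3 -- newest served first
print(q.next_request())  # req2
print(q.next_request())  # None -- req1 was shed on overflow
```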


Reminded me of this old post by Facebook, well worth a read :)

https://engineering.fb.com/2014/11/14/production-engineering...


I really wish network routers would do this when delivering packets...

The other end gets far better information about congestion that way, and congestion control algorithms could be made much smarter.

Everyone says "jitter is bad in networks", but the reality is that if the jitter gives you network state information, it is a net positive - especially when you can use erasure coding schemes so the application need not see the jitter.


Currently you can assume that if you receive a packet with a higher sequence number, earlier packets were probably lost, and resend them. This heuristic won't work anymore with a LIFO buffer.

LIFO sounds annoying for bursty traffic, since you can only start processing the burst after the buffer has cleared.
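The gap heuristic from the first point can be sketched like this (heavily simplified; real TCP loss detection layers retransmit timers and duplicate-ACK counting on top of the same idea):

```python
def lost_candidates(received_seqs):
    """Naive FIFO-link heuristic: any gap below the highest sequence
    number seen so far is presumed lost."""
    highest = max(received_seqs)
    return sorted(set(range(highest + 1)) - set(received_seqs))

# On a FIFO link the heuristic is usually right:
print(lost_candidates([0, 1, 3, 4]))  # [2] -- packet 2 probably lost

# On a LIFO buffer, packet 2 may just be parked at the bottom of the
# stack; the presumed "loss" is spurious and triggers a wasted resend.
```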


Using erasure coding, there is no need to know which packets were lost. You just keep sending more erasure coded packets till you get an acknowledgement that all data has been decoded. If a packet from a while ago randomly reappears, it usually isn't wasted throughput either - it can be used to replace any other lost packet, either forwards or backwards in the packet stream (up to some limit determined by the applications end-to-end latency requirement).
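A toy single-parity version of that idea (real systems use fountain or Reed-Solomon codes that tolerate many losses, but the "any one loss, doesn't matter which" property already shows up here):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Sender: k data packets plus one XOR parity packet.
packets = [b"AAAA", b"BBBB", b"CCCC"]
parity = packets[0]
for p in packets[1:]:
    parity = xor_bytes(parity, p)

# Receiver: any ONE data packet may be lost -- it doesn't matter which.
received = {0: packets[0], 2: packets[2]}  # packet 1 never arrived
recovered = parity
for p in received.values():
    recovered = xor_bytes(recovered, p)
print(recovered)  # b'BBBB' -- the missing packet, reconstructed
```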



I have exactly that pattern in an internal system. Under load, the caller will give up after a fixed timeout and retry, so why waste time on an old request where the caller has probably already hung up?


LIFO is a perfectly reasonable mode when overloaded. FIFO will give you a stable equilibrium where every request takes too long and fails. LIFO breaks out of that operating point.

This is the key feature of the CoDel solution to buffer bloat.



