
Unfortunately I can't find the reference any more, but there was this great presentation by someone explaining how in computer systems developers will "optimise for the requirement", and ignore anything outside of it.

The example was latency. If the programmers were told to achieve less than 1 ms for 99% of the requests -- then sure enough -- the 1% of requests would have sky-high latencies of multiple seconds.

If told they needed to achieve 1 ms for 99% and 100 ms for 99.99%, then -- you guessed it -- the worst 0.01% would be tens of seconds or even minutes.

Inevitably, there would be a visible discontinuity in the latency histogram just above whatever the official business requirement was.

It's a difficult thing to fix, because no matter where you set your threshold, unless it's 100%, you won't meet it. And even 100% is just a fiction, because you'd need an infinite number of tests to achieve it.




> how in computer systems developers will "optimise for the requirement", and ignore anything outside of it.

It’s worth noting that this is a learned behavior which is not natural for many people.

There are plenty of people who will naturally think about a problem in context and want to solve for the problem and not just the metric.

Those people will be slower to satisfy management demands, which will be metric-driven, and in order to progress in their careers they will learn to focus on the requirement and not the problem.


I interned at a company decades ago that was doing large dynamic system simulations and worked with an older guy who was known for knowing how to make the solvers work on impossible systems. One of his biggest tricks was replacing discontinuities with suitably tuned arctan functions, so that transitions were smooth and the integrators didn't thrash about computing derivatives.
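That smoothing trick fits in a few lines. This is a generic sketch; the sharpness parameter `k` is the part that would be hand-tuned per transition:

```python
import math

def smooth_step(x, k=50.0):
    """Smooth approximation of a unit step at x = 0.

    Larger k -> closer to a true step, but steeper derivatives
    for the integrator to resolve, so k is tuned per transition.
    """
    return 0.5 + math.atan(k * x) / math.pi

# Far from the transition it matches the hard step closely...
print(smooth_step(-1.0))  # near 0
print(smooth_step(1.0))   # near 1
# ...but the derivative is finite everywhere, so a variable-step
# ODE solver doesn't grind to a halt trying to locate a jump.
```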


I don't know if it's the correct talk, but those ideas are explored in "How NOT to Measure Latency" by Gil Tene: https://www.youtube.com/watch?v=lJ8ydIuPFeU


This talk forever changed my views on metrics.


Wow, holy shit, that talk did not disappoint. I will never look at page loads, response times, and percentiles the same way again. Paradigm-shifting, and a really rare case of a groundbreaking insight.


I think it may have been a different talk either by the same guy, or someone from the same company.


From practical experience, I find it a bit hard to believe this is actually how it works for latency. Were any specific examples provided?


Lots of examples.

Typically the issue was caused by garbage collection. You can twiddle with the parameters to meet your 99% latency goal, and then fail your 1% spectacularly.

Similarly, anything involving the network would slowly be optimised to meet the "typical case" requirements, while the extremes would be terrible.

If I remember correctly, this was a talk by someone working at a real-time trading firm, where latency was a critical metric for all of their system designs. He had a lot of charts with very visible upticks in latency at the "nice round numbers" where the requirements were set.


I don't know the reasons for this, but when I hammer my Cloudflare Workers with a stress test, some requests take 10-100 times longer than the majority.


I can't quite remember it, but a few weeks ago I read an article about performance, and in the end they said something like "include the mean in your business requirement".

Nobody experiences the mean, but every outlier in your optimization will affect it.
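A toy illustration of that point, with made-up numbers: a p99 target can be met exactly while a handful of outliers completely dominate the mean.

```python
# 1000 requests: 990 fast ones, 10 pathological outliers (made-up numbers)
latencies_ms = [1.0] * 990 + [5000.0] * 10

def percentile(values, p):
    """Nearest-rank percentile."""
    s = sorted(values)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

p99 = percentile(latencies_ms, 99)            # 1.0 ms -- the p99 target looks fine
mean = sum(latencies_ms) / len(latencies_ms)  # ~51 ms -- the outliers dominate
print(p99, mean)
```

So a mean target punishes every outlier, even the ones a percentile target lets you hide past the threshold.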


I can totally imagine how that happens... The software isn't performing well enough, so some programmer says "I can rewrite the request queueing system to prioritize requests just before the 1ms deadline and deprioritize the ones where the deadline has already been missed". End result, a step...


Deep in the internals of Google, we actually had a system that was putting incoming requests not in a queue (first-in-first-out), but in a stack (last-in-first-out).

The system in question was essentially not meant to have any backlog of requests under normal operations. But when overloaded, for this particular system it was better to serve the requests it could serve very fast, and just shed load on the overflow.

(I don't remember more specifics, and even if I could, I would probably not be allowed to give them..)
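The general shape of the pattern (a generic sketch, not the actual Google system) is a bounded stack that serves newest-first and silently sheds the oldest entries:

```python
from collections import deque

class LifoShedQueue:
    """Serve the newest request first; shed the oldest when full.

    Generic sketch of LIFO-with-load-shedding: under normal load the
    structure is nearly empty so ordering barely matters, but under
    overload the freshest requests (whose callers are still waiting)
    get served, and stale ones fall off the bottom.
    """

    def __init__(self, max_depth=100):
        # deque with maxlen drops from the LEFT (oldest) on overflow
        self._stack = deque(maxlen=max_depth)

    def submit(self, request):
        self._stack.append(request)      # push newest on the right

    def next_request(self):
        return self._stack.pop() if self._stack else None  # pop newest

q = LifoShedQueue(max_depth=2)
for r in ("req1", "req2", "req3"):
    q.submit(r)
print(q.next_request())  # req3 -- newest served first
print(q.next_request())  # req2
print(q.next_request())  # None -- req1 was shed on overflow
```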


Reminded me of this old post by Facebook, well worth a read :)

https://engineering.fb.com/2014/11/14/production-engineering...


I really wish network routers would do this when delivering packets...

The other end gets far better information about congestion that way, and congestion control algorithms could be made much smarter.

Everyone says "jitter is bad in networks", but the reality is that if the jitter gives you network state information, it is a net positive - especially when you can use erasure coding schemes so the application need not see the jitter.


Currently you can assume that if you receive a packet with a higher sequence number, earlier packets were probably lost, and resend them. This heuristic won't work anymore with a LIFO buffer.

LIFO sounds annoying for bursty traffic, since you can only start processing the burst after the buffer has cleared.
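The gap heuristic from the first point can be sketched like this (heavily simplified; real TCP loss detection layers retransmit timers and duplicate-ACK counting on top of the same idea):

```python
def lost_candidates(received_seqs):
    """Naive FIFO-link heuristic: any gap below the highest sequence
    number seen so far is presumed lost."""
    highest = max(received_seqs)
    return sorted(set(range(highest + 1)) - set(received_seqs))

# On a FIFO link the heuristic is usually right:
print(lost_candidates([0, 1, 3, 4]))  # [2] -- packet 2 probably lost

# On a LIFO buffer, packet 2 may just be parked at the bottom of the
# stack; the presumed "loss" is spurious and triggers a wasted resend.
```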


Using erasure coding, there is no need to know which packets were lost. You just keep sending more erasure coded packets till you get an acknowledgement that all data has been decoded. If a packet from a while ago randomly reappears, it usually isn't wasted throughput either - it can be used to replace any other lost packet, either forwards or backwards in the packet stream (up to some limit determined by the applications end-to-end latency requirement).
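A toy single-parity version of that idea (real systems use fountain or Reed-Solomon codes that tolerate many losses, but the "any one loss, doesn't matter which" property already shows up here):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Sender: k data packets plus one XOR parity packet.
packets = [b"AAAA", b"BBBB", b"CCCC"]
parity = packets[0]
for p in packets[1:]:
    parity = xor_bytes(parity, p)

# Receiver: any ONE data packet may be lost -- it doesn't matter which.
received = {0: packets[0], 2: packets[2]}  # packet 1 never arrived
recovered = parity
for p in received.values():
    recovered = xor_bytes(recovered, p)
print(recovered)  # b'BBBB' -- the missing packet, reconstructed
```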



I have exactly that pattern in an internal system. Under load, the caller will give up after a fixed timeout and retry, so why waste time on an old request where the caller has probably already hung up?


LIFO is a perfectly reasonable mode when overloaded. FIFO will give you a stable equilibrium where every request takes too long and fails. LIFO breaks out of that operating point.

This is the key feature of the CoDel solution to buffer bloat.



