Suspicious Discontinuities (danluu.com)
191 points by apsec112 on Sept 10, 2021 | 54 comments



Unfortunately I can't find the reference any more, but there was this great presentation by someone explaining how in computer systems developers will "optimise for the requirement", and ignore anything outside of it.

The example was latency. If the programmers were told to achieve less than 1 ms for 99% of the requests -- then, sure enough -- the remaining 1% of requests would have sky-high latencies of multiple seconds.

If told they needed to achieve 1 ms for 99% and 100 ms for 99.99%, then -- you guessed it -- the worst 0.01% would be tens of seconds or even minutes.

Inevitably, there would be a visible discontinuity in the latency histogram just above whatever the official business requirement was.

It's a difficult thing to fix, because no matter where you set your threshold, anything outside it will be neglected unless the threshold is 100%. And even 100% is just a fiction, because you'd need an infinite number of tests to achieve it.


> how in computer systems developers will "optimise for the requirement", and ignore anything outside of it.

It’s worth noting that this is a learned behavior which is not natural for many people.

There are plenty of people who will naturally think about a problem in context and want to solve for the problem and not just the metric.

Those people will be slower to satisfy management demands, which will be metric-driven, and in order to progress in their careers they will learn to focus on the requirement and not the problem.


I interned at a company decades ago that was doing large dynamic system simulations, and worked with an older guy who was known for knowing how to make the solvers work on impossible systems. One of his biggest tricks was replacing discontinuities with various tuned arctan functions, so that transitions are smooth and the integrators don't thrash about computing derivatives.
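
For anyone who hasn't seen the trick, here is a minimal sketch in Python (the original simulations aren't described, so the step function and the sharpness parameter k are made up): a hard step has no derivative at the jump, while a tuned arctan gives a smooth transition whose steepness you control.

    import numpy as np

    def hard_step(x):
        # Discontinuous: the derivative blows up at x = 0, which makes
        # adaptive integrators shrink their step size and thrash.
        return np.where(x >= 0.0, 1.0, 0.0)

    def smooth_step(x, k=50.0):
        # arctan maps (-inf, inf) to (-pi/2, pi/2); rescale to (0, 1).
        # Larger k -> sharper transition, smaller k -> smoother.
        return 0.5 + np.arctan(k * x) / np.pi

    x = np.linspace(-0.2, 0.2, 9)
    print(hard_step(x))
    print(np.round(smooth_step(x, k=50.0), 3))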


I don't know if it's the correct talk, but those ideas are explored in "How NOT to Measure Latency" by Gil Tene: https://www.youtube.com/watch?v=lJ8ydIuPFeU


This talk forever changed my views on metrics.


Wow, holy shit, that talk did not disappoint. I will never look at page loads, response times, and percentiles the same way again. Paradigm shifting and a really rare case of a groundbreaking insight.


I think it may have been a different talk either by the same guy, or someone from the same company.


From practical experience, I find it a bit hard to believe this is actually how it works for latency. Were any specific examples provided?


Lots of examples.

Typically the issue was caused by garbage collection. You can twiddle with the parameters to meet your 99% latency goal, and then fail the remaining 1% spectacularly.

Similarly, anything involving the network would slowly be optimised to meet the "typical case" requirements, while the extremes would be terrible.

If I remember correctly, this was a talk by someone working in a real-time trading firm, where latency was a critical metric for all of their systems designs. He had a lot of charts with very visible upticks in latency at the "nice round numbers" where the requirements were set.


I don't know the reasons for this, but when I hammer my Cloudflare Workers with a stress test, some requests take 10-100 times longer than the majority.


I can't quite remember it, but a few weeks ago I read an article about performance, and in the end they said something like "include the mean in your business requirement".

Nobody experiences the mean, but every outlier in your optimization will affect it.
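
A toy illustration of that point (the numbers are made up): a handful of extreme outliers barely move the 99th percentile, but they drag the mean up, so a requirement on the mean forces you to care about every one of them.

    import numpy as np

    rng = np.random.default_rng(0)
    latencies = rng.normal(1.0, 0.1, 100_000)   # ~1 ms typical requests
    latencies[:50] = 60_000.0                   # 50 requests (0.05%) take a minute

    print("p99 :", round(np.percentile(latencies, 99), 2), "ms")  # barely moves
    print("mean:", round(latencies.mean(), 2), "ms")              # ~30x the typical latency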


I can totally imagine how that happens... The software isn't performing well enough, so some programmer says "I can rewrite the request queueing system to prioritize requests just before the 1ms deadline and deprioritize the ones where the deadline has already been missed". End result, a step...


Deep in the internals of Google, we actually had a system that was putting incoming requests not in a queue (first-in-first-out), but in a stack (last-in-first-out).

The system in question was essentially not meant to have any backlog of requests under normal operations. But when it was overloaded, it was better for this particular system to serve the requests it could serve very fast, and simply shed the overflow.

(I don't remember more specifics, and even if I could, I would probably not be allowed to give them..)
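
A rough sketch of the general pattern (not the actual Google system; the age-based shedding policy here is an assumption): under normal load the stack is nearly empty, so LIFO behaves like FIFO; under overload the newest requests get served fast, and stale requests whose callers have likely given up are dropped when they surface.

    import time
    from collections import deque

    class LifoShedQueue:
        def __init__(self, max_age_s=0.1):
            self.stack = deque()          # append/pop on the same end = LIFO
            self.max_age_s = max_age_s

        def push(self, request):
            self.stack.append((time.monotonic(), request))

        def pop(self):
            """Return the newest request still worth serving, shedding
            anything that has been waiting longer than max_age_s."""
            while self.stack:
                enqueued_at, request = self.stack.pop()   # newest first
                if time.monotonic() - enqueued_at <= self.max_age_s:
                    return request
                # too old: the caller has almost certainly timed out; drop it
            return None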


Reminded me of this old post by Facebook, well worth a read :)

https://engineering.fb.com/2014/11/14/production-engineering...


I really wish network routers would do this when delivering packets...

The other end gets far better information about congestion that way, and congestion control algorithms could be made much smarter.

Everyone says "jitter is bad in networks", but the reality is if the jitter is giving you network state information, it is a net positive - especially when you can use erasure coding schemes so the application need not see the jitter.


Currently you can assume that if you receive a packet with a higher sequence number, earlier packets were probably lost, and resend them. This heuristic won't work anymore with a LIFO buffer.

LIFO sounds annoying for bursty traffic, since you can only start processing the burst after the buffer has cleared.


Using erasure coding, there is no need to know which packets were lost. You just keep sending more erasure coded packets till you get an acknowledgement that all data has been decoded. If a packet from a while ago randomly reappears, it usually isn't wasted throughput either - it can be used to replace any other lost packet, either forwards or backwards in the packet stream (up to some limit determined by the applications end-to-end latency requirement).
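
A toy sketch of that idea (not a real FEC protocol; the payloads and coding scheme are made up): each coded packet is a random XOR of the k source packets, tagged with a bitmask of which sources went into it. The receiver never needs to know which packets were dropped; it just collects any k independent combinations and solves for the sources over GF(2).

    import random

    def encode(sources, rng):
        """Yield an endless stream of (mask, payload) coded packets."""
        k = len(sources)
        while True:
            mask = rng.randrange(1, 1 << k)        # random non-empty subset
            payload = 0
            for i in range(k):
                if mask >> i & 1:
                    payload ^= sources[i]
            yield mask, payload

    def decode(packets, k):
        """Collect coded packets until k independent ones arrive, then solve."""
        basis = {}                                 # pivot bit -> (mask, payload)
        for mask, payload in packets:
            for bit in range(k - 1, -1, -1):       # reduce against known pivots
                if (mask >> bit & 1) and bit in basis:
                    bmask, bpayload = basis[bit]
                    mask ^= bmask
                    payload ^= bpayload
            if mask == 0:
                continue                           # linearly dependent: useless
            basis[mask.bit_length() - 1] = (mask, payload)
            if len(basis) == k:
                break
        # Back-substitute so each row is left with only its pivot bit.
        for bit in sorted(basis, reverse=True):
            bmask, bpayload = basis[bit]
            for other in basis:
                omask, opayload = basis[other]
                if other != bit and (omask >> bit & 1):
                    basis[other] = (omask ^ bmask, opayload ^ bpayload)
        return [basis[i][1] for i in range(k)]

    sources = [0xDEAD, 0xBEEF, 0xCAFE, 0x0042]     # four "packets" as ints
    stream = encode(sources, random.Random(7))
    print(decode(stream, k=len(sources)) == sources)   # True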



I have exactly that pattern in an internal system. Under load, the caller will give up after a fixed timeout and retry, so why waste time on an old request where the caller has probably already hung up?


LIFO is a perfectly reasonable mode, when overloaded. FIFO will give you a stable equilibrium where every request takes too long and fails. LIFO breaks out of that operating point.

This is the key feature of the CoDel solution to buffer bloat.


> Authors of psychology papers are incentivized to produce papers with p values below some threshold, usually 0.05, but sometimes 0.1 or 0.01. Masicampo et al. plotted p values from papers published in three psychology journals and found a curiously high number of papers with p values just below 0.05.

This is on topic; there is a discontinuity there which is an example of the same type of thing the rest of the post talks about.

But it's not the biggest problem illustrated by that graph. The dot at "p is just barely less than 0.05" is an outlier. But it's an outlier from what is otherwise a regular pattern that clearly shows that smaller p-values are more likely to occur than larger ones are. That's insane. The way for that pattern to arise without indicating a problem would be "psychologists only investigate questions with very clear, obvious answers". I find that implausible.


The graph shows that those low p-values are more likely to be in papers, not that they’re more likely to occur. Is that suspicious? I don’t know enough about it to judge.


> The graph shows that those low p-values are more likely to be in papers

This is an important distinction, in my experience [0].

Many papers will report a p-value only if it is below a significance threshold; otherwise they will report "n.s." (not significant) or give a range (e.g. p > .1). This means that in addition to the pressure to shelve insignificant results, publication bias also manifests as a tendency to emphasize and carefully report significant findings, while mentioning only in passing those that don't meet whatever threshold is in use.

[0] I happen to be working on a meta-analysis of psychology and public health papers at the moment. One paper that we're reviewing constructs 32 separate statistical models, reports that many of the results are not significant, and then discusses the significant results at length.


> Many papers will report a p-value only if it is below a significance threshold; otherwise they will report "n.s." (not significant) or give a range (e.g. p > .1).

But the oddity here is a pronounced trend in the reported p-values that meet the significance threshold. The behavior you mention cannot create that trend.


> The graph shows that those low p-values are more likely to be in papers, not that they’re more likely to occur.

It looks to me like the y-axis is measured in number of papers. The lower a p-value is, the more papers there are that happened to find a result beating the p-value.

So low p-values are more likely to occur a priori than high p-values are. This is most certainly not true in general. We might guess that psychologists are fudging their p-values somehow, or that journals are much, much, much, much, much, much, much more likely to publish "chewing a stalk of grass makes you walk slower, p < 0.013" than they are to publish "chewing a stalk of grass makes you walk slower, p < 0.04".

I've emphasized the level of bias the journals would need to be showing -- over fine distinctions in a value that is most often treated as a binary yes or no -- because it is much easier to get p < 0.04 than it is to get p < 0.013.


Conditional on being published, this is true. Hence studies of the file-drawer effect and what not.

More generally, scientists are incentivised to find novel findings (i.e. unexpectedly low p-values) or lose their job.

Given that, the plot doesn't surprise me at all. (Also, people will normally not report a bunch of non-significant results, which is a similar but separate problem.)


That's part of why pre-registration of studies is so important.


> This is most certainly not true in general.

Are you saying that in other disciplines, the distribution of p-values in published papers does not follow this pattern?


I think what they meant was that we would expect the distribution of p-values to be uniform, if we had access to every p-value ever calculated (or a random sample thereof).

Publishing introduces a systematic bias, because it's difficult to get published where p>0.05 (or whatever the disciplinary standard is).
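
A quick simulation of that point (the effect size and sample size are made up): p-values from true-null experiments come out uniform, p-values from experiments with a real effect pile up near zero, and filtering on p < 0.05 is what carves the cliff into the published record.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def run_experiments(effect, n=30, trials=10_000):
        # Each row is one simulated study: n observations, tested against 0.
        x = rng.normal(effect, 1.0, size=(trials, n))
        return stats.ttest_1samp(x, 0.0, axis=1).pvalue

    null_p   = run_experiments(effect=0.0)   # no real effect
    effect_p = run_experiments(effect=0.5)   # a genuine effect

    bins = np.histogram(null_p, bins=10, range=(0, 1))[0] / len(null_p)
    print("null p-values per decile (roughly uniform):", np.round(bins, 3))
    bins = np.histogram(effect_p, bins=10, range=(0, 1))[0] / len(effect_p)
    print("real-effect p-values per decile (piled near 0):", np.round(bins, 3))
    print("fraction surviving a p < 0.05 publication filter:",
          round((effect_p < 0.05).mean(), 3))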


> Publishing introduces a systematic bias, because it's difficult to get published where p>0.05 (or whatever the disciplinary standard is).

That explains why the p-values above 0.05 are rare compared to values below 0.05. But it fails to explain why p-values above 0.02 are rare compared to values below 0.02.


I agree with your point from your previous post, that lower p-values are harder to get than higher ones, at least if one is looking at all possible causal relationships, but there are at least two possible causes for the inversion seen in publishing. The first is a general preference for lower p-values on the part of publishers and their reviewers (by 'general' I mean not just at the 0.05 value); the second is that researchers do not randomly pick what to study - they use their expertise and existing knowledge to guide their investigations.

Is that enough to tip the curve the other way across the range of p-values? Well, something is, and I am open to alternative suggestions.

One other point: while the datum immediately below 0.05 would normally be considered an outlier, the fact that it is next to a discontinuity (actual or perceived) renders that call less clear. Personally, I suspect it is not an accidental outlier, but given that it does not produce much distortion in the overall trend, I am less inclined to see the 0.05 threshold (actual or perceived) as a problem than I did before I saw this chart.


> Personally, I suspect it is not an accidental outlier, but given that it does not produce much distortion in the overall trend, I am less inclined to see the 0.05 threshold (actual or perceived) as a problem than I did before I saw this chart.

Don't be fooled by the line someone drew on the chart. There's no particular reason to view this as a smooth nonlinear relationship except that somebody clearly wanted you to do that when they prepared the chart.

I could describe the same data, with different graphical aids, as:

- uniform distribution ("75 papers") between an eyeballed p < .02 and p < .05

- large spike ("95 papers") at exactly p = 0.0499

- sharp decline between p < .05 and p < .06

- uniform distribution ("19 papers") from p < .06 to p < .10

- bizarre, elevated sawtooth distribution between p < .01 and p < .02

And if I describe it that way, the spike at .05 is having exactly the effect you'd expect, drawing papers away from their rightful place somewhere above .05. If the p-value chart were a histogram like all the others instead of a scatterplot with a misleading visual aid, it would look pretty similar to the other charts.


Well, you could extend this mode of analysis to its conclusion, for each dataset, and describe each datum in the data by its difference from its predecessor and successor, but if you do, does that help? I took it as significant that you wrote "...but it's an outlier from what is otherwise a regular pattern that clearly shows that smaller p-values are more likely to occur than larger ones are" (my emphasis) and that is what I am responding to.

I think we are both, in our own ways, making the point that there is more going on here than the spike just below 0.05 - namely, the regular pattern that you identified in your original post. If we differ, it seems to be because I think it is explicable.

WRT p-values of 0.05: I almost said, but did not, that if you curve-fitted above and below 0.05 independently, there would be a gap between the two curves, maybe even if you left out the value immediately below 0.05. No doubt that would also happen for other values, but I am guessing the gap would peak at 0.05. If I have time in the near future, I may try it. If you do, and find that I am wrong, I will be happy to recant.


> The way for that pattern to arise without indicating a problem would be "psychologists only investigate questions with very clear, obvious answers". I find that implausible.

Don't throw the Seldon out with the bathwater. I think there is a very real chance that the problems psychologists address really are extremely probable in the societies in which they investigate them.


> "psychologists only investigate questions with very clear, obvious answers".

More like, "psychologists often publish results where questions have clear, obvious answers".


In your model, a psychologist does these things in this sequence:

1. Choose a question to investigate.

2. Get some results.

3. Compute p < 0.03.

4. Toss the paper in the trash, because p < 0.03 isn't good enough.

But that's not how they operate. The reason there's a spike at 0.05 is that that's what everyone cares about. If you get p < 0.03, you're doing better than that!

So the bias in favor of even lower p-values is coming from somewhere else. It definitely is not coming from the decision point of "OK, I've done the research, but do I publish it?".


Here's my write-up of how I learned about this phenomenon in a specific software system:

https://www.solipsys.co.uk/new/GracefulDegradation.html?UI10...

In all the systems I've been involved with since, "Discontinuous Behaviour" is a failure mode we've explicitly analysed, and "Graceful Degradation" is a technique we've often implemented.

In case people want to discuss that separately I've submitted it here:

https://news.ycombinator.com/item?id=28478909


I feel like the "graceful degradation" writeup is incomplete. The obvious solution -- to me -- would be to have the tills continue operating as normal while their queue is full, but stop trying to write to the queue. This would allow you to sell as many items as people desire to buy, but you lose the ability to keep track electronically of what's been sold.

The ideal solution presented here is instead that the tills should bottleneck the customers so that they collectively can't buy more items than the central computer is capable of logging in amortized real time. This preserves the ability to keep electronic track of everything the store sells. But it does it by preventing the store from selling its inventory to customers! Instead of a loss in real-time recordkeeping, we have a hard cap on the amount of money the store can earn during the Christmas season. And the reason we've put the cap in place is that otherwise we'd blow right past it!

That solution is so anomalous-seeming that I want to see a discussion of what exactly the store's goals are, and why refusing to sell your inventory to customers in December is a good idea for a retail store.


It makes sense if:

- Reconciling electronic inventory numbers with the real ones will cost more than the store will earn from uncapping the sales, and/or

- There's a potential legal/tax risk involved with having stock numbers be bad, and/or

- There's a risk of losing customers and reputation when stock of a product physically runs out while the tracking system isn't aware of this, and a bunch of customers have to have their orders cancelled because there's nothing to fulfill them with.


It might make sense. But it probably doesn't. Without a discussion between the store and the IT guy, this looks like the store having a problem of "we want to sell our items, but the till isn't letting us" and the IT guy having a problem of "when the till locks up, the store's problem is obviously my fault", and the IT guy solving his personal problem at the store's expense because the store doesn't know any better.


I feel like you're not giving enough credit to a professional programmer.


Recent example (for me): Atlassian monthly per-user pricing vs. annual bulk pricing (discontinuous).

When moving up a range it seems more economical to revert to monthly pricing than to go annual. (Btw: that sucks!)

Source: https://www.atlassian.com/licensing/future-pricing/cloud-pri...


> A simple fix for the problems mentioned above would be to have slow phase-outs instead of sharp thresholds. Slow phase-outs are actually done for some subsidies and, while that can also have problems, they are typically less problematic than introducing a sharp discontinuity in tax/subsidy policy.

Or just don’t have phase-outs at all. What, exactly, is wrong with giving millionaires subsidies with similar dollar values as much poorer people? They’ll make it up in overall taxes paid anyway.


The problem is not with millionaires, who are rather few, but with middle-to-upper-middle-class people, who make up the bulk of taxpayers and the revenue. If you run the numbers, your scheme ends up being rather unworkable.

To put it in concrete terms: if you tried to give every household $7,000 of ACA subsidy, that would cost you about $900B a year. That’s more than US military spending. Where would you get the extra taxes to cover that? There aren’t enough millionaires to pay for it (do the numbers here too, to see), so what you’d need to do is get that $7,000 back from households that make a “mere” $150-200k. You don’t want to slap on an extra $7k tax, because that would just be a cliff like the one you tried to avoid in the first place. Once you figure out how to set up your taxation to do that, you’ll find that it is conceptually and practically much simpler to just do phase-outs.
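
Back-of-the-envelope check of those numbers (assuming roughly 130 million US households, which is not a figure from the parent comment):

    households = 130e6
    subsidy = 7_000
    print(f"universal subsidy cost: ${households * subsidy / 1e9:.0f}B per year")
    # ~$910B, in the same ballpark as the ~$900B above and more than the
    # roughly $700-750B US military budget around that time.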


We, conceptually and practically, _already_ have a way to collect extra taxes from people with higher income — it’s called tax brackets. Adjusting the brackets a bit to collect that extra $7k (or more) from people who earn a few hundred K per year is straightforward, and it more easily preserves the nice property of having net taxes paid (tax minus subsidies) being convex as a function of income.

I realize that health care in particular is a horrible mess due to most higher income people getting insurance through employer group policies. That, in and of itself, is a problem, and fixing would have the added benefit of removing a large disincentive to changing jobs or leaving a bad job.


> Adjusting the brackets a bit to collect that extra $7k (or more) from people who earn a few hundred K per year is straightforward,

It’s less straightforward than you think. Please show me how to modify the current brackets to collect an extra $7k from households making >$200k (but not more than $7k), and do a slow phase-out starting above $100k. Then do the same for TANF, SNAP, SSI, etc., all at the same time and all at different thresholds.

After doing this exercise, you’ll find that it’s much easier to think about it in terms of keeping the brackets the same for everyone and having separate, phased-out deductions, rather than fiddling with the brackets.
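
A sketch of what such a phased-out deduction looks like (all thresholds and amounts hypothetical): the household keeps the full $7,000 up to $100k of income, loses it linearly between $100k and $200k, and gets nothing above that, with no cliff anywhere.

    def phased_subsidy(income, full_amount=7_000,
                       phaseout_start=100_000, phaseout_end=200_000):
        if income <= phaseout_start:
            return float(full_amount)
        if income >= phaseout_end:
            return 0.0
        # Linear interpolation between the two thresholds.
        remaining = (phaseout_end - income) / (phaseout_end - phaseout_start)
        return full_amount * remaining

    for income in (80_000, 100_000, 150_000, 199_999, 200_000, 250_000):
        print(f"${income:>7,}: subsidy ${phased_subsidy(income):,.0f}")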


Previous discussion from Feb 2020: https://news.ycombinator.com/item?id=22378555


The UK has some similar things in its tax system. Child benefit starts being clawed back if either parent earns a penny over £50,000, but there are various "salary sacrifice" schemes (or heck, just put more into your pension, which is tax-deductible here), so financial trickery like buying put options isn't necessary.


I find the drug charges section especially damning, I'd like to hear more analysis of that particular part.


There is a footnote about the drug charges section that anyone commenting here may want to read first.


Thanks for pointing that out, it's a relief.

For the lazy, the footnote basically says that the discontinuity at 280g is not caused by an increase in seizures at this amount (which would indicate police fraud), but by prosecutors choosing to charge defendants at slightly above the 280g threshold when the amount seized was actually significantly larger than the threshold.


It was never about the drugs.

The change in the law increased the amount the police had to plant on someone to lock them away in a federal penitentiary, and they had not expected it to show up so strongly in the data. Back when the minimum was 50g they could hide in the noise, but at 280g it stands out.

In a well-functioning society this would trigger investigations, on suspicion of police misconduct, of all the cases where the suspect was charged with carrying just over the limit, but the police don't want that, so it's not going to happen.


> One thing that's maybe worth noting is that I've gotten a lot of mileage out in my career both out of being suspicious of discontinuities and figuring out where they come from and also out of applying standard techniques to smooth out discontinuities.

Just wanted to point out that quote from danluu. It's a really good practice to sit down, plot histograms, scatterplots and other visualizations, and JUST THINK for a while before trying to cram the data into the model you think applies.
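
The kind of first look that quote recommends, with made-up latency data: plot the raw distribution and eyeball it for spikes and cliffs before fitting anything or collapsing it to a single number.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    latencies = np.concatenate([
        rng.lognormal(mean=0.0, sigma=0.3, size=9_900),   # typical requests, ~1 ms
        rng.uniform(5.0, 30.0, size=100),                 # a neglected slow tail
    ])

    plt.hist(latencies, bins=200)
    plt.yscale("log")                  # makes the thin tail visible
    plt.xlabel("latency (ms)")
    plt.ylabel("requests")
    plt.title("look for cliffs and spikes before modeling")
    plt.show()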


Very nice write-up. Also reminds me of spurious correlations.



