A while ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm called P^2 that I found in an old paper[2]; it maintains good approximations using piecewise-parabolic (polynomial) interpolation.
The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
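For anyone curious, the candidate's insight fits in a few lines. This is just a toy sketch (the fixed step size is an arbitrary choice, and it is not what P^2 actually does):

```
def streaming_median_estimate(stream, step=0.1):
    """Toy stochastic estimate of the median: nudge the estimate
    toward each new observation by a small fixed step."""
    estimate = None
    for x in stream:
        if estimate is None:
            estimate = float(x)  # seed with the first observation
        elif x > estimate:
            estimate += step
        elif x < estimate:
            estimate -= step
        yield estimate

# Example: the running estimate ends up near the true median (5)
import random
data = [random.gauss(5, 2) for _ in range(100_000)]
print(list(streaming_median_estimate(data))[-1])
```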
There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%).
Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used UDDSketch[1], which provides relative error guarantees, which is pretty nifty. We thought they offered different enough tradeoffs that we wanted to implement both.
Hi, in an unrelated nitpick: the relative error should be calculated by dividing the error by the true value, not by its approximation. Still, very nice writeup!
Not an expert on this topic but I noticed that the KLL algorithm (published in 2016) was not mentioned in this thread, which provides theoretically optimal performance for streaming quantiles with guaranteed worst case performance: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (And is pretty fast in practice).
The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.
Sure there is. You simply maintain N phases of digests, and every T time you evict a phase and recompute the summary (because T-digests are easily merged).
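A rough sketch of that phase idea, assuming a hypothetical mergeable Digest class with update and merge methods (real t-digest APIs will differ):

```
from collections import deque

class PhasedDigest:
    """Keep N per-interval digests; every interval T, evict the oldest
    phase and recompute the window summary by merging the survivors."""
    def __init__(self, make_digest, phases=6):
        self.make_digest = make_digest          # factory for the hypothetical Digest
        self.phases = deque(make_digest() for _ in range(phases))

    def update(self, value):
        self.phases[-1].update(value)           # only the newest phase receives data

    def rotate(self):
        self.phases.popleft()                   # evict the oldest phase
        self.phases.append(self.make_digest())  # start a fresh one

    def summary(self):
        merged = self.make_digest()
        for d in self.phases:
            merged.merge(d)                     # digests are cheap to merge
        return merged
```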
I think this would be a tumbling window rather than a true "rolling" t-digest. I suppose you could decrement the buckets, but it gets a little weird, as splits can't really be unsplit. The tumbling-window approach would probably work, though TDigest is a little weird on merge: it's not completely deterministic with respect to ordering and merging (UDDSketch is), so you'd likely get something that is more than good enough, but it wouldn't be the same as if you just calculated it directly, which gets a little confusing and difficult.
I think the new ones started with Greenwald-Khanna, but I definitely agree - P^2 can be a little silly and misleading. In particular, it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.
That sounds familiar, I remember reading about Greenwald-Khanna before I found T-Digest after I ran into the "how to find a percentile of a massive data set" problem myself.
Yes. The goal is to find (or approximate) the median without storing all the elements. Instead, it approximates the median by finding the median of randomly selected samples from the elements.
Thanks for sharing! Hadn't heard of that algorithm; have seen a number of other ones out there, but we chose a couple that we knew about / were requested by users. (And we are open to more user requests if folks want to use other ones! Head over to https://github.com/timescale/timescaledb-toolkit and open an issue!)
Did the candidate get an offer? Genuinely curious.
I had a basic screening call fail once because the expected answer was (in my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.
Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question of the interview, and this probably wasn't the only interview.
You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find the median in O(n) time, you need to do something similar to the magic-five (median-of-medians) algorithm.
Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.
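For reference, here's a rough sketch of the two-heap approach under discussion (Python's heapq is a min-heap, so the lower half is stored negated):

```
import heapq

def running_median(stream):
    lo, hi = [], []   # lo: max-heap of the lower half (negated), hi: min-heap of the upper half
    for x in stream:
        # push onto the lower half, then move its largest element up
        heapq.heappush(lo, -x)
        heapq.heappush(hi, -heapq.heappop(lo))
        # rebalance so the halves differ in size by at most one
        if len(hi) > len(lo):
            heapq.heappush(lo, -heapq.heappop(hi))
        yield -lo[0] if len(lo) > len(hi) else (-lo[0] + hi[0]) / 2

print(list(running_median([5, 1, 9, 2, 7]))[-1])  # 5
```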
> The minheap-maxheap approach is better for streaming data, to get the median as data comes in.
I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"
But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.
My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.
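A minimal sketch of that bucketing idea, assuming values fall in a known range (0-1000 here is an arbitrary choice):

```
def bucket_of_median(values, lo=0.0, hi=1000.0, n_buckets=100):
    """Count values into fixed-width buckets, then walk the counts until
    half the total has been seen; returns the bucket holding the median."""
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    total = 0
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)  # clamp the top edge
        counts[max(idx, 0)] += 1                         # and the bottom edge
        total += 1
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= total / 2:
            return (lo + i * width, lo + (i + 1) * width)

print(bucket_of_median([3, 7, 12, 950, 18]))  # (10.0, 20.0): the bucket containing the median, 12
```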
One way to think about why we tend to use averages instead of medians is that it is related to a really deep theorem in probability: The Central Limit Theorem.
But I think, if we twist our heads a bit, we can see that this is backwards. Mathematically, the mean is much easier to work with because it is linear and we can do algebra with it. That's how we got the Central Limit Theorem. Percentiles and the median, except for symmetric distributions, are not as easy to work with: they involve solving for the inverse of the cumulative distribution function.
But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!
Assume up front that none of your measured latencies from a networked software system will be Gaussian, or <exaggeration> you will die a painful death </exaggeration>. Even ping times over the internet have no mean. The only good thing about means is that you can combine them easily, but since they are probably a mathematical fiction, combining them is even worse. Use T-Digest or one of the other algorithms being highlighted here.
This is why I try to plot a proper graph of times for any "optimization" I see in a PR. Too many times I see people making this assumption, for example, and even if they're right they usually forget to take the width of the Gaussian into account (i.e., wow, your speedup is 5% of a standard deviation!).
yep, have made that mistake before. even turned in a write-up for a measurement project in a graduate level systems course that reported network performance dependent measurements with means over trials with error bars from standard deviations.
sadly, the instructor just gave it an A and moved on. (that said, the amount of work that went into a single semester project was a bit herculean, even if i do say so myself)
> the median and percentiles are a more relevant and intuitive number to think about
A good example of this is when people say "more than half of drivers say they are above average drivers! Idiots!" Of course that's perfectly possible if most drivers are basically fine and some are really really bad. For example, 99.99% of people have greater than the average number of legs.
The correct impossible statement would be if more than half of drivers are better than the median driver.
For some things, you can't even sensibly measure the mean. For example, if you're measuring the mean response time for a service, a single failure/timeout makes the mean response time infinite (because 100 years from now the response still hasn't been received).
Gamers have an intuitive sense of this. Your average framerate can be arbitrarily high, but if you have a big stutter every second between the smooth moments, then a lower but more consistent framerate may be preferable. This is typically expressed as the 1% and 0.1% slowest frames, which, at a fairly typical 100fps, represent the slowest frame every second and every 10 seconds, respectively.
I have no hope of finding a cite for this, but a long time ago I read some command-line UI research that found that if you had a system where commands ranged from instant to taking a small but noticeable time, and you introduced delays in the faster commands so that all commands took the same small but noticeable time, people would think the system was now faster overall.
I guess that’s because our minds (and animal minds as well) are always aware of the pace of repetitive events. If something is off, the anxiety alarm rings. One old book on brain machinery described an example of a cat that was relaxing near a metronome and became alert when it suddenly stopped. Unpredictable delays are disturbing, because a mispredicted event means you may be in a dangerous situation and have to recalibrate now.
I think the explanation may be even more low-level than that. IIRC, even with a single neuron (or maybe it was very small clusters, sorry, my recollection is a bit hazy) you can see that it learns to tune out a repetitive signal.
Love this example. Might have to use that in a future post. Feel like a lot of us are running into a similar thing with remote work and video calls these days...
For games, one compounding factor is that you need to estimate how long the next frame will take in order to know how much time to simulate. If the variance is greater, the prediction will be worse leading the perceived game "speed" to vary from frame to frame.
There was a “Latency” multiplayer setting in Starcraft 1. I think it was there precisely to give players control over smoothness of network latency instead of graphics and processing latency. I think this just adds yet another example of your observation.
Yep, I often find it useful to limit framerate to a number I know my GPU can handle so that I don't get as much stuttering. It's better to run at 60 FPS all the time than 80 FPS most of the time but with stuttering.
Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions.
I would rather have a rock-steady, consistent 500 ms response time when typing text than a 100 ms average response time which randomly spikes to outliers that go past one second.
A rock-steady, though poor, event rate in a paint application is better for drawing a freehand curve (especially if the program interpolates well) than a really fast rate that suddenly has a glitch in it, spoiling your work.
> Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions
I would be shocked if gamers didn't on average have a more intuitive sense of this than any of the groups you mentioned.
Yah, I don't think anyone's arguing that ONLY gamers will observe the phenomena, but I would be shocked if they weren't more sensitive to it than most groups.
Looking particularly at latency measurements, I found the "How NOT to Measure Latency" [1] talk very illuminating. It goes quite deep into discussing how percentiles can be used and abused for measurement.
!*Plowser*! is in the 99th percentile of browsers by global user count!
# Fact Sheet
- !*Plowser*! is used by 2,050 people
- sample size of browsers is 10,000 (this includes toy apps on GitHub and almost all browsers in the sample have no record of active users)
- those using !*Plowser*! have no choice as the developing company $*DendralRot Inc*$ forces all its employees, contractors and users of their enterprise '?_shuiteware_?' (a name derived by mashing |-->shite|software|suite<--| into one word) to use their browser
- if we place the number of global browser users at a conservative 1,000,000,000, !*Plowser*! actually has 0.000205% of users
From Wikipedia description: Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.
Yes, a lesson both funny and insightful about how weak basic indicators (mean, standard deviation, correlation) can be, the value and limits of the box plot with quartiles and whiskers, and the benefit of the violin plot.
> For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller).
[0, 2, 5, 9, 9] has both median and mean = 5, but the two sides don't really have the same distribution.
Totally true...thoughts on how I could rephrase? I guess it's more the "weight" of points greater than and less than the median should be the same, so symmetric distributions definitely have it, asymmetric may or may not. Definitely open to revising...
This is excellent information, thank you for posting this! I was not familiar with this example previously, but it is a perfect example of summary statistics not capturing certain distributions well. It's very approachable, even if you had to limit the discussion to mean and variance alone. Bookmarked, and much appreciated.
That's really nifty, wish I'd heard about it earlier. Might go back and add a link to it in the post at some point too! Very useful. Definitely know I wasn't breaking new ground or anything, but fun to see it represented so succinctly.
Very discouraging if one is trying to analyze data algorithmically. Often when faced with a problem in statistics, the answer is: "Look at the graph and use intuition!".
I just recently tried giving a presentation to my department (they're developers, I'm an architect) about this stuff and they all just sort of blinked at me. It brought in Little's Law and Kingman's Formula, in an attempt to underscore why we need to limit variation in the response times of our requests.
There are a bunch of queuing theory formulas that are really cool but don't exactly apply if your responses vary a lot like the article describes. I think the assumption is that response time distributions are exponential distributions, which I don't think is a good assumption (is an Erlang distribution an exponential distribution?) Nevertheless, hooking the equations up to some models is a good way to get directional intuition. I didn't realize how steep the performance drop-off is for server utilization until I started moving sliders around.
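For anyone who wants that directional intuition without building a full model, Kingman's approximation for a G/G/1 queue is easy to plug numbers into (a rough sketch; ca and cs are the coefficients of variation of inter-arrival and service times):

```
def kingman_wait(utilization, service_time, ca, cs):
    """Kingman's approximation for mean queueing delay in a G/G/1 queue:
    E[Wq] ~= rho / (1 - rho) * (ca^2 + cs^2) / 2 * E[S]."""
    rho = utilization
    return (rho / (1 - rho)) * ((ca ** 2 + cs ** 2) / 2) * service_time

# Mean wait (ms) for a 10 ms service time as utilization climbs; note how
# sharply it blows up past ~80%, and how higher service variance (cs) hurts.
for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(rho,
          round(kingman_wait(rho, 10.0, ca=1.0, cs=1.0), 1),
          round(kingman_wait(rho, 10.0, ca=1.0, cs=2.0), 1))
```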
Our ops team doesn't really follow this either. We're not a huge department though - is this the kind of stuff that an SRE team usually pays attention to?
I found it surprisingly difficult to explain well. Took a lot of passes and a lot more words than I was expecting. It seems like such a simple concept. I thought the post was gonna be the shortest of my recent ones, and then after really explaining it and getting lots of edits and rewriting, it was 7000 words and ...whoops! But I guess it's what I needed to explain it well (hope you thought so anyway).
It's somewhat exponential, but yeah, not necessarily; it's definitely long-tailed in some way, and it sort of doesn't matter what the theoretical description is (at least in my mind). The point is that these types of distributions really don't get described well by a lot of typical statistics.
Can't talk too much about SREs at the ad analytics company I mentioned, we were the backend team that wrote a lot of the backend stores / managed databases / ran the APIs and monitored this stuff a bit (and probably not all that well). It was a bit more ad hoc I guess, probably now that company is large enough they have a dedicated team for that...
The article hit the spot pretty exactly for how I felt I needed to explain it. (Actually, I had the same experience where I thought I would be able to go over it quickly and then I felt like I was getting mired.) The graphs look great too - I've been able to explore it pretty well with JupyterLab, but I can't export that into something super-interactive that is read-only. I've thought about creating an Idyll page out of it to help others explore, but that's a bit overkill for the work style our team has.
I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.
> I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.
The best method I've found for digging into latency is profiling RPC traces to see where time is spent. This at least separates the code and parameters that have simple latency distributions from those that don't. Some distributions will be very data-dependent (SQL queries).
Side note: I wish everyone would stop using the term "average" to refer to the arithmetic mean. "Average" just means some statistic used to summarize a dataset. It could be the arithmetic mean, median, mode(s), geometric mean, harmonic mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function used by early databases and Lotus 1-2-3.
No we’re stuck with it because average was used colloquially for arithmetic mean for decades.
I wish people would stop bad-mouthing the arithmetic mean. If you have to convey information about a distribution and you’ve got only one number to do it, the arithmetic mean is for you.
“It depends” is always right but which function is better for arbitrary, unknown data?
At least the arithmetic mean is fine for Gaussian distributions, and conveys a sense of the data even on non-Gaussian ones. But the median doesn't even work at all on some common distributions, like scores of very difficult exams (where median=0).
For the mean, at least every data point contributes to the final value.
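A quick made-up illustration of the exam-score point:

```
import statistics

scores = [0, 0, 0, 0, 0, 0, 35, 60, 80, 95]   # a very hard exam
print(statistics.median(scores))  # 0  -- says nothing about the strongest students
print(statistics.mean(scores))    # 27 -- at least every score moved it
```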
> To calculate the 10th percentile, let’s say we have 10,000 values. We take all of the values, order them from largest to smallest, and identify the 1001st value (where 1000 or 10% of the values are below it), which will be our 10th percentile.
Isn't this contradictory? If we order the values from largest to smallest and take the 1001st value, then 10% of the values are above/larger and not below/smaller. I believe it should say order from smallest to largest.
One of the difficulties when dealing with percentiles is estimating error of the estimated percentile values, which is not necessarily trivial compared to estimating error of mean approximation (e.g. see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6294150/)
I often see HN articles crop up soon after a related post, in this case this Ask HN poster [0] being driven crazy by people averaging percentiles and them not seeing that it's a big deal. It's pretty funny to see such tuples of posts appearing.
This is a really important topic if you're doing web services. Especially if you're doing parallel processing services. When I was at Blekko the "interesting" queries were the ones above the 95th percentile because they always indicated "something" that hadn't worked according to plan. Sometimes it was a disk going bad on one of the bucket servers, sometimes it was a network port dropping packets, and sometimes it was a corrupted index file. But it was always something that needed to be looked at and then (usually) fixed.
It also always separated the 'good' ad networks from the 'bad' ones, as the bad ones would take too long to respond.
Be careful translating percentiles of requests to percentiles of users; if less than 10% of your requests take over 1 second, but a typical user makes 10 requests, it's possible that the majority of your users are seeing a request take over 1 second.
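To make that concrete with a toy calculation (assuming independent requests, which is generous):

```
# If 10% of requests are slow and a typical user makes 10 independent requests,
# the chance that a given user sees at least one slow request is:
p_slow_request = 0.10
requests_per_user = 10
p_user_sees_slow = 1 - (1 - p_slow_request) ** requests_per_user
print(round(p_user_sees_slow, 3))  # ~0.651, i.e. roughly 65% of users
```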
Yep! Briefly noted that in the post, but it deserves re-stating! It's definitely a more complex analysis to figure out the percentage of users affected (though often more important). It could be the majority, or it could be one user who has some data scientist programmatically making hundreds of long API calls for some task... (can you tell that I ran into that? Even worse, it was one of our own data scientists ;) ).
If you're going to use multiple quantities to summarize a distribution, wouldn't using percentiles for all of them give you the most information? The mean and standard deviation could then be estimated from that data.
Percentiles for sure are better than the average if you want to explore a distribution: there are several percentiles, compared to a single average. However, the median is harder to use if you want to do calculations based on this metric. For example, the distribution of the sample median could be a problem if you want to understand its confidence interval.
Totally can be true. In our case, we use these approximation methods that allow you to get multiple percentiles "for free", but you definitely need to choose the right ones for the job. (We talk a bit more about the whole approach, where we do the aggregation and then the accessor thing, in the previous post on two-step aggregation [1].) But there are definitely times when averages/stddev and potentially the 3rd and 4th moments are more useful, etc.
I've been trying to get the marketing team to always include a standard deviation with averages. An average alone is simply not useful; standard deviation is a simple way to essentially include percentiles.
They regularly compare experiments to the mean but don't use a t-test to ensure the results are actually different from the mean.
I heavily caution against the feeling that "standard deviation is a simple way to essentially include percentiles." The usefulness of the standard deviation depends on the distributions that you are working with. Heavy tailed distributions appear a fair amount in practice, and the combo of summary statistics mentioned would not do well on those. Also, Madars' comment in this thread is a beautiful example of this: 4 completely different distributions, with identical mean and standard deviation (among other things). Histograms and percentiles, and if necessary their approximations, are more desirable for the above reasons.
I assume most of the distributions a marketing department would be dealing with are generally normal in which case stddev is a great way to analyze the data. This can be easily verified by just plotting said data and making sure the tails don't look weird.
I can't help but idly wonder what humans are doing when they are eyeballing the tails, to see if things look good. Like lets say we wanted to do the eyeball test but automatically. Would the best way be to use an image classifier on the plot? Is there something magic about the plot representation that would make it good even for computers to use?
One thing I like about this post is that it explains things in an accessible way before getting into a deep dive. Might be worth sharing with the marketing team as they'll "get" long tail in the context of web search, so the concept is fairly transferable to stuff that they would know about.
Std deviation definitely helps a lot, though it's still often not as good as percentiles. I was actually thinking about adding some of that to the post, but it was already getting so long. It's funny how things you think are simple sometimes take the most effort to explain; definitely found that on this one.
Yeah -- std deviation has a similar problem to the mean in that it doesn't give you a full picture unless the distribution is close to normal / gaussian.
We've been meaning to add IQR as an accessor function for these, may have to go back and do it...the frequency trails [1] stuff from Brendan Gregg also goes into some of this and it's really cool as a visualization.
A histogram is basically your percentiles (just more of them), so why not show it? Box-and-whisker plots could be more complicated for them, but it's still the same sort of data.
The recovering product manager in me sees 90th-percentile queries with outlier-high latency and, instead of asking how to reduce it, starts asking how we can spin it out into a dedicated premium query feature: if they're willing to wait, they're probably also willing to pay.
If you want to calculate exact percentiles, there's a simple in-place algorithm that runs in expected O(n) time. You basically do quicksort but ignore the "wrong" partition in each recursive step. For instance, if your pivot lands at the 25th percentile and you're looking for the 10th percentile, you only recurse on the "left" partition at that point. It's pretty easy to implement. (And rather straightforward to change to a loop, if necessary.)
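A rough sketch of that idea (not literally in-place, to keep it short, but it has the same partition-and-recurse-into-one-side structure):

```
import random

def quickselect(a, k):
    """Return the k-th smallest element of a (0-indexed): partition around
    a random pivot, then recurse into only the side that can contain k."""
    if len(a) == 1:
        return a[0]
    pivot = random.choice(a)
    lows   = [x for x in a if x < pivot]
    highs  = [x for x in a if x > pivot]
    pivots = [x for x in a if x == pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

data = [7, 1, 9, 4, 3, 8, 2]
print(quickselect(data, len(data) // 2))  # median: 4
```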
I had to explain the data usage of an interface that looked extremely busy from the standard graphs.
I did a percentile graph of the usage - the data was typically using only 5% of the maximum throughput - but no one could really understand the graph.
So I did a zoomed-in version of the normal data usage graph, and it looked like a blip lasting 1/20 of the time - everyone got that - i.e., it was peaking every few seconds and then doing nothing for ages.
Can someone boil this down to a sentence? I think I understand what they are saying: I have been faced with using these metrics, and when tempted to use average response times I recognized that the average is not a good baseline, since it moves in relation to the outliers (which are usually the problem requests, and there can be many or few of them).
How do you calculate these percentiles and use them as a trigger for alerts?
If you are doing this sort of work, I highly recommend the DataSketches library. It's used in Druid, which is a database specifically designed for these sorts of aggregations over ridiculously large datasets.
> “In the below graph, half of the data is to the left (shaded in blue), and a half is to the right (shaded in purple), with the 50th percentile directly in the center.”
But the half on the right is actually shaded yellow.
Sorry for the minor nitpick, but I find it a bit unusual (disappointing?) that there's an image of a candlestick chart at the very top, but the article only uses API response times as examples...
If you think the data is normal-ish but want to account for skew and kurtosis, you can try fitting a distribution such as the skewed Student's t -- there are R packages for this.
Bah, I'll be happy if I could even get correct averages. I see pipelines getting value X1 from a server that served 100 requests, another value X2 from a server that served one request, and then it returns (X1+X2)/2.
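For the record, the fix is just to weight each server's mean by its request count; a trivial sketch:

```
def combine_means(means, counts):
    """Combine per-server means into a global mean, weighting each server's
    mean by the number of requests it actually served."""
    return sum(m * c for m, c in zip(means, counts)) / sum(counts)

# Server A: mean 200 ms over 100 requests; server B: mean 20 ms over 1 request.
print(combine_means([200.0, 20.0], [100, 1]))   # ~198.2, not (200 + 20) / 2 = 110
```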
The mean is something you can easily compute progressively and with trivial resources. Median and percentiles, on the other hand, can be super expensive and potentially unsuitable for some real-time applications, since you need to maintain a sorted list of all relevant samples.
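The progressive mean really is just a couple of lines (a running-average sketch):

```
def running_mean(stream):
    """Update the mean incrementally: constant memory, a single pass."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # standard incremental-mean update
        yield mean

print(list(running_mean([10, 20, 60]))[-1])  # 30.0
```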
I've skimmed some of the literature here when I've spent time trying to help people with their bucket boundaries for Prometheus-style instrumentation of things denominated in "seconds", such as processing time and freshness.
My use case is a little different from what's described here or in a lot of the literature. Some of the differences:
(1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.
(2) Your maximum number of buckets is pretty small -- like, no more than 10 or 15 histogram buckets probably. This is because my metrics are very high cardinality (my times get recorded alongside other dimensions that may have 5-100 distinct values, things like server instance number, method name, client name, or response status).
(3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.
(4) I think I know what values I care about knowing precisely! Sometimes people call my metrics "SLIs" and sometimes they even set an "SLO" which says, say, I want no more than 0.1% of interactions to take more than 500ms. (Yes, those people say, we have accepted that this means that 0.1% of people may have an unbounded bad experience.) So, okay, fine, let's force a bucket boundary at 500ms and then we'll always be measuring that SLO with no error.
(5) I know that the test data I use as input don't always reflect how the system will behave over time. For example I might feed my bucket-designing algorithm yesterday's freshness data and that might have been a day when our async data processing pipeline was never more than 10 minutes backlogged. But in fact in the real world every few months we get a >8 hour backlog and it turns out we'd like to be able to accurately measure the p99 age of processed messages even if they are very old... So despite our very limited bucket budget we probably do want some buckets at 1, 2, 4, 8, 16 hours, even if at design time they seem useless.
I have always ended up hand-writing my own error approximation function, which takes as input something like
(1) sample data - a representative subset of the actual times observed in my system yesterday
then returns as output info about how far off (%age error) each estimated percentile is from the actual value for my sample data.
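Roughly the kind of thing I mean, as a sketch (the estimate just takes each bucket's upper bound, which is pessimistic since real implementations interpolate within the bucket, and the function name is made up):

```
import bisect

def percentile_errors(samples, bucket_bounds, quantiles=(0.5, 0.95, 0.99, 0.999)):
    """Estimate each quantile from a histogram with the given upper bounds,
    then compare against the exact sample quantile (assumes positive samples
    and that the last bound is at least max(samples))."""
    data = sorted(samples)
    counts = [0] * len(bucket_bounds)
    for x in data:
        counts[bisect.bisect_left(bucket_bounds, x)] += 1
    report = {}
    for q in quantiles:
        exact = data[int(q * (len(data) - 1))]
        target, seen = q * len(data), 0
        for bound, c in zip(bucket_bounds, counts):
            seen += c
            if seen >= target:
                estimate = bound
                break
        report[q] = (exact, estimate, 100.0 * (estimate - exact) / exact)
    return report   # {quantile: (exact, estimated, % error)}
```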
Last time I looked at this I tried using libraries that purport to compute very good bucket boundaries, but they give me, like, 1500 buckets with very nice tiny error, and no clear way to make a real-world choice about collapsing this into a much smaller set of buckets with comparatively huge, but manageable, error.
I ended up just advising people to
* set bucket boundaries at SLO boundaries, and be sure to update when the SLO does
* actually look at your data and understand the data's shape
* minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seem to work well (a rough sketch of this follows the list)
* minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)
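A rough sketch of what I mean by log-spaced buckets plus forced SLO boundaries (log_buckets is a made-up helper, not from any library):

```
def log_buckets(min_val, max_val, n, slo_bounds=()):
    """n logarithmically spaced bucket upper bounds between min_val and
    max_val, with any SLO boundaries forced in exactly."""
    ratio = (max_val / min_val) ** (1.0 / (n - 1))
    bounds = {round(min_val * ratio ** i, 6) for i in range(n)}
    bounds.update(slo_bounds)                 # measure the SLO with zero error
    return sorted(bounds)

# e.g. 12 buckets from 5 ms to 16 h (in seconds), forcing the 500 ms SLO edge
print(log_buckets(0.005, 57600, 12, slo_bounds=(0.5,)))
```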
One thing you could try to use are variance optimal histograms. These are histograms which set bucket boundaries such that the weighted average variance in the buckets is minimized. Unfortunately, this algorithm is quadratic with the number of data points, but you could try approximating the optimum by taking random subsets of the data, computing the histogram, then seeing how well the histogram does on the whole dataset.
This is interesting and I totally get at least some of the problems you're facing. I wonder if you could take some of the strategies from t-digest and modify them a bit to accomplish this... I'd be interested in seeing some sort of implementation and would love to see if we can get it into our toolkit if you do... or you can also open up a ticket for us and we'll see if we can prioritize working on something like it ourselves.
I do think there are some interesting ways you can cut corners if you know things about the "SLO" or other sorts of cutoff values, but I'd have to think more deeply about it to say more. Essentially we want a variable error rate based on the distance away from a known value: little error in the values relatively close to the known value, little (if any) care for fine distinctions on the low end, and once you get past some high-end value you could probably bucket everything above it into a "large outliers" bucket or something too... meaning your p999 could get out of hand if you start getting lots of stuff above a threshold, but if that starts happening, probably everything's already burning down, so it might not be that useful to know it's burning down in a very specific way...
Look at Greenwald-Khanna (and the follow-on work). It adapts the bucket size to minimize the total error (and the number of buckets is proportional to log epsilon, I think).
With all the competition and 'innovation', you would think the data dogs of this world would grow up beyond time series.
I spent several years in venture capital investing, and averages were always misleading - as Nassim Taleb says, "Never cross a river that is on average 4 feet deep."
[1]: https://github.com/cxxr/LiveStats
[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf