
A while ago I wrote a Python library called LiveStats[1] that computes any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm called P^2, which I found in an old paper[2]; it uses a polynomial fit to find good approximations.

The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to keep a fixed-size sorted buffer, randomly evict items from it, and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.

[1]: https://github.com/cxxr/LiveStats

[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
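
If anyone wants to see the bare version of that insight, here's a minimal sketch (not P^2 itself, which maintains several markers and adjusts them with a piecewise-parabolic formula; the function name and fixed step size here are mine):

    import random

    def streaming_median_estimate(stream, step=0.1):
        # Crude streaming median: nudge the estimate toward each new point.
        # This is just the "move it a little each time" idea; P^2 instead
        # adapts its marker positions with parabolic interpolation.
        it = iter(stream)
        estimate = next(it)            # seed with the first observation
        for x in it:
            if x > estimate:
                estimate += step
            elif x < estimate:
                estimate -= step
        return estimate

    # Should land near 0 for a symmetric distribution centered at 0.
    data = (random.gauss(0, 1) for _ in range(100_000))
    print(streaming_median_estimate(data))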



There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%).

[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
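
If you want to poke at it from Python, the tdigest package on PyPI wraps this idea; the method names below are from memory, so treat them as assumptions and check the docs:

    # pip install tdigest  (CamDavidsonPilon/tdigest; API from memory)
    import random
    from tdigest import TDigest

    td = TDigest()
    for _ in range(1_000_000):
        td.update(random.expovariate(1.0))    # one observation at a time

    # Tail quantiles stay accurate even this far out (argument is 0-100).
    print(td.percentile(99.999))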


NB: Post author here.

Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used UDDSketch[1], which provides relative error guarantees, which is pretty nifty. We thought they offered different enough tradeoffs that we wanted to implement both.

[1]: https://arxiv.org/abs/1908.10693
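
For anyone who hasn't seen the relative-error trick: UDDSketch builds on DDSketch-style logarithmic bucketing, where bucket boundaries grow geometrically so every reconstructed value is within a chosen alpha of the true one. A rough sketch of just that indexing step (positive values only, names are mine):

    import math

    def make_log_buckets(alpha):
        # gamma = (1 + alpha) / (1 - alpha); bucket i covers (gamma^(i-1), gamma^i].
        # Reporting 2 * gamma^i / (gamma + 1) for anything in bucket i keeps the
        # relative error within alpha.
        gamma = (1 + alpha) / (1 - alpha)
        log_gamma = math.log(gamma)

        def bucket_index(x):          # x must be > 0
            return math.ceil(math.log(x) / log_gamma)

        def bucket_value(i):          # representative value for bucket i
            return 2 * gamma ** i / (gamma + 1)

        return bucket_index, bucket_value

    idx, val = make_log_buckets(alpha=0.01)
    x = 12345.678
    print(abs(val(idx(x)) - x) / x)   # relative error, always <= 0.01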


Is it using https://github.com/tvondra/tdigest under the hood, or a separate implementation?


If folks are interested:

https://github.com/timescale/timescaledb-toolkit/blob/main/e...

(The TimescaleDB Toolkit is also implemented in Rust)


Facebook seems to have an even better performance implementation using sqrt. Might make sense to port that over to Rust. https://github.com/facebook/folly/blob/master/folly/stats/TD...


a separate implementation


Hi, an unrelated nitpick: the relative error should be calculated by dividing the error by the true value, not by its approximation. Still, very nice writeup!


Thanks! I messed up the formula but had it right in the text :( Fixed now.


Not an expert on this topic, but I noticed that the KLL algorithm (published in 2016) hasn't been mentioned in this thread. It provides theoretically optimal performance for streaming quantiles with worst-case guarantees: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (And it's pretty fast in practice.)


NB: Post author here.

Interesting, will have to take a look! Thanks for sharing!


Also relevant: Single-Pass Online Statistics Algorithms

[1] http://www.numericalexpert.com/articles/single_pass_stat/


That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?


The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.


Sure there is. You simply maintain N phases of digests, and every time interval T you evict the oldest phase and recompute the summary (because T-digests are easily merged).
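
Something like this sketch, where NaiveDigest is just a store-everything stand-in to show the windowing mechanics (in practice each phase would be a real t-digest, which is what keeps the memory bounded):

    from collections import deque

    class NaiveDigest:
        # Stand-in for a real t-digest: stores values, supports merge.
        def __init__(self):
            self.values = []
        def update(self, x):
            self.values.append(x)
        def merge(self, other):
            self.values.extend(other.values)
        def quantile(self, q):
            s = sorted(self.values)
            return s[int(q * (len(s) - 1))]

    class PhasedWindow:
        # Keep N phase digests; evict the oldest every time a phase "closes".
        def __init__(self, phases=6, digest_factory=NaiveDigest):
            self.phases = deque(digest_factory() for _ in range(phases))
            self.factory = digest_factory
        def add(self, x):
            self.phases[-1].update(x)          # always write the newest phase
        def rotate(self):                      # call every T seconds/rows
            self.phases.popleft()              # evict the oldest phase
            self.phases.append(self.factory())
        def quantile(self, q):
            combined = self.factory()
            for p in self.phases:              # digests merge cheaply
                combined.merge(p)
            return combined.quantile(q)

    w = PhasedWindow(phases=3)
    for i in range(100):
        w.add(i)
        if i % 25 == 24:
            w.rotate()
    print(w.quantile(0.5))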


I think this would be a tumbling window rather than a true "rolling" t-digest. I suppose you could decrement the buckets, but it gets a little weird, as splits can't really be unsplit. The tumbling-window version would probably work, though TDigest is a little weird on merge, as it's not completely deterministic with respect to ordering and merging (UDDSketch is). So you'd likely get something that is more than good enough, but it wouldn't be the same as if you had just calculated it directly, which makes it a little confusing and difficult.

(NB: Post author here).


This is what I do, it's not a true rolling digest but it works well enough for my purposes.


Yep, I had to implement t-digest in a monitoring library. Another alternative (although older) that the Prometheus libraries use is CKMS quantiles [0].

[0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...


I think the new ones started with Greenwald-Khanna, but I definitely agree - P^2 can be a little silly and misleading. In particular, it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.


That sounds familiar. I remember reading about Greenwald-Khanna before I found T-Digest, back when I ran into the "how to find a percentile of a massive data set" problem myself.


> Find the median ... randomly evict items

So, not find, but approximate. That's a different thing.


> without sorting it... have a fixed size sorted buffer

(that you sort yourself)


That doesn't really make sense to me at all. Don't sort it, just have it?

Is the storage restriction the point?


Yes. The goal is to find (or approximate) the median without storing all the elements. Instead, it approximates the median by finding the median of randomly selected samples from the elements.
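
A sketch of that, using classic reservoir sampling so every element has an equal chance of ending up in the fixed-size buffer (buffer size picked arbitrarily):

    import random

    def approx_median(stream, buffer_size=1024):
        # Approximate the median with a fixed-size uniform random sample.
        # Reservoir sampling (Algorithm R): every element of the stream ends
        # up in the buffer with probability buffer_size / n.
        buffer = []
        for n, x in enumerate(stream, start=1):
            if len(buffer) < buffer_size:
                buffer.append(x)
            else:
                j = random.randrange(n)        # keep x with prob buffer_size/n
                if j < buffer_size:
                    buffer[j] = x
        buffer.sort()
        return buffer[len(buffer) // 2]

    print(approx_median(random.gauss(50, 10) for _ in range(1_000_000)))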


NB: Post author here.

Thanks for sharing! I hadn't heard of that algorithm; I've seen a number of other ones out there, and we chose a couple that we knew about or that users requested. (We're open to more requests if folks want other ones: head over to https://github.com/timescale/timescaledb-toolkit and open an issue!)


Did the candidate get an offer? Genuinely curious.

I had a basic screening call fail once because the expected answer was (from my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.


Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question in the interview, and this probably wasn't the only interview.


> if we estimate the median and move it a small amount for each new data point, it would be pretty close.

Yeah, this is gradient descent on the absolute loss.
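
Spelled out: the loss is L(m) = E|X - m|, whose (sub)gradient at a sample x is sign(m - x), so the stochastic update is

    m_new = m - eta * sign(m - x)

i.e. bump the estimate up when the new point is above it and down when it's below (eta is a free step size).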


> The question was basically, "Find the median of a huge data set without sorting it," [...]

You can use e.g. QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.

They don't sort the data, but they do need a linear amount of storage.
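
A sketch of QuickSelect for the median (expected O(n) time, but note the whole list still has to fit in memory, which is the catch for "huge"):

    import random

    def quickselect(data, k):
        # Return the k-th smallest element (0-indexed) without fully sorting.
        data = list(data)                      # needs the whole list in memory
        lo, hi = 0, len(data) - 1
        while True:
            pivot = data[random.randint(lo, hi)]
            # 3-way partition of the active slice around the pivot
            lt = [x for x in data[lo:hi + 1] if x < pivot]
            eq = [x for x in data[lo:hi + 1] if x == pivot]
            gt = [x for x in data[lo:hi + 1] if x > pivot]
            data[lo:hi + 1] = lt + eq + gt
            if k < lo + len(lt):
                hi = lo + len(lt) - 1          # recurse into the left part
            elif k < lo + len(lt) + len(eq):
                return pivot                   # k falls inside the pivot run
            else:
                lo = lo + len(lt) + len(eq)    # recurse into the right part

    def median(data):
        data = list(data)
        return quickselect(data, len(data) // 2)

    print(median([7, 1, 5, 3, 9, 2, 8]))   # -> 5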


> The question was basically, "Find the median of a huge data set without sorting it,"

Isn't this done using a min heap and a max heap in conjunction?


The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".


'Estimate the median of an arbitrary sized data set using a constant amount of memory'.


No, for two reasons:

1. Huge data set: heaps require storing the whole set, but "huge" means "more than you can store".

2. Without sorting it: heaps are basically semi-sorted, so you are breaking the rules.


You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find the median in O(n) time, you need to do something similar to the magic-five (median of medians) algorithm.


Something like QuickSelect is probably better in practice than median-of-medians.


> but "huge" means "more than you can store"

Really? Where's it coming from?


I think they mean having more than you can store simultaneously on a single device.

With a few exceptions, this is a pretty common scenario.


More than you can store.

And possibly it's live data.


The two heap method helps you maintain the current median given an incoming stream of “online” data, and technically is not “sorting”.

The original problem is probably more accurately described as “what if you have too much data to sort” though.
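
For reference, a sketch of the two-heap trick; Python's heapq is a min-heap, so the lower half is stored negated to fake a max-heap:

    import heapq

    class StreamingMedian:
        # Maintain the running median of a stream with two balanced heaps.
        def __init__(self):
            self.low = []   # max-heap of the lower half (values stored negated)
            self.high = []  # min-heap of the upper half

        def add(self, x):
            if self.low and x > -self.low[0]:
                heapq.heappush(self.high, x)
            else:
                heapq.heappush(self.low, -x)
            # rebalance so the sizes differ by at most one
            if len(self.low) > len(self.high) + 1:
                heapq.heappush(self.high, -heapq.heappop(self.low))
            elif len(self.high) > len(self.low) + 1:
                heapq.heappush(self.low, -heapq.heappop(self.high))

        def median(self):
            # assumes at least one element has been added
            if len(self.low) > len(self.high):
                return -self.low[0]
            if len(self.high) > len(self.low):
                return self.high[0]
            return (-self.low[0] + self.high[0]) / 2

    sm = StreamingMedian()
    for x in [5, 1, 9, 2, 7]:
        sm.add(x)
    print(sm.median())   # -> 5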


It's funny that this is often left out of data structures & algorithms classes.


Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.


The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I agree that if you have the whole thing, doing heapsort and pulling a[N/2] or a[1 + N/2] is simpler.


> The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"

But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.

My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.
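
A sketch of that bucket-counting idea (it assumes you know the value range and pick a bucket count up front; a second pass over just the winning bucket's range could then refine the answer):

    import random

    def median_bucket(stream, lo, hi, n_buckets=1000):
        # Locate the bucket containing the median using O(n_buckets) memory.
        # Assumes values fall in [lo, hi]; returns the bucket's value range,
        # not the exact median.
        width = (hi - lo) / n_buckets
        counts = [0] * n_buckets
        total = 0
        for x in stream:
            i = min(int((x - lo) / width), n_buckets - 1)
            counts[i] += 1
            total += 1
        # walk the histogram until we pass half the observations
        seen = 0
        for i, c in enumerate(counts):
            seen += c
            if seen >= (total + 1) // 2:
                return (lo + i * width, lo + (i + 1) * width)

    print(median_bucket((random.uniform(0, 100) for _ in range(1_000_000)), 0, 100))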


Because unlike many dynamic programming algorithms, it is something anyone running a large system will need.


How does this get you the median?



