
A while ago I wrote a Python library called LiveStats[1] that computes any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm called P^2, which I found in an old paper[2]; it uses a polynomial fit to find good approximations.

The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to keep a fixed-size sorted buffer, randomly evict items from it, and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.

[1]: https://github.com/cxxr/LiveStats

[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
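
If anyone wants to see the bare version of that insight, here's a minimal sketch (not P^2 itself, which maintains several markers and adjusts them with a piecewise-parabolic formula; the function name and fixed step size here are mine):

    import random

    def streaming_median_estimate(stream, step=0.1):
        # Crude streaming median: nudge the estimate toward each new point.
        # This is just the "move it a little each time" idea; P^2 instead
        # adapts its marker positions with parabolic interpolation.
        it = iter(stream)
        estimate = next(it)            # seed with the first observation
        for x in it:
            if x > estimate:
                estimate += step
            elif x < estimate:
                estimate -= step
        return estimate

    # Should land near 0 for a symmetric distribution centered at 0.
    data = (random.gauss(0, 1) for _ in range(100_000))
    print(streaming_median_estimate(data))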



There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%).

[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
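
If you want to poke at it from Python, the tdigest package on PyPI wraps this idea; the method names below are from memory, so treat them as assumptions and check the docs:

    # pip install tdigest  (CamDavidsonPilon/tdigest; API from memory)
    import random
    from tdigest import TDigest

    td = TDigest()
    for _ in range(1_000_000):
        td.update(random.expovariate(1.0))    # one observation at a time

    # Tail quantiles stay accurate even this far out (argument is 0-100).
    print(td.percentile(99.999))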


NB: Post author here.

Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used UDDSketch[1], which provides relative error guarantees, which is pretty nifty. We thought they offered different enough tradeoffs that we wanted to implement both.

[1]: https://arxiv.org/abs/1908.10693
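
For anyone who hasn't seen the relative-error trick: UDDSketch builds on DDSketch-style logarithmic bucketing, where bucket boundaries grow geometrically so every reconstructed value is within a chosen alpha of the true one. A rough sketch of just that indexing step (positive values only, names are mine):

    import math

    def make_log_buckets(alpha):
        # gamma = (1 + alpha) / (1 - alpha); bucket i covers (gamma^(i-1), gamma^i].
        # Reporting 2 * gamma^i / (gamma + 1) for anything in bucket i keeps the
        # relative error within alpha.
        gamma = (1 + alpha) / (1 - alpha)
        log_gamma = math.log(gamma)

        def bucket_index(x):          # x must be > 0
            return math.ceil(math.log(x) / log_gamma)

        def bucket_value(i):          # representative value for bucket i
            return 2 * gamma ** i / (gamma + 1)

        return bucket_index, bucket_value

    idx, val = make_log_buckets(alpha=0.01)
    x = 12345.678
    print(abs(val(idx(x)) - x) / x)   # relative error, always <= 0.01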


Is it using https://github.com/tvondra/tdigest under the hood, or a separate implementation?


If folks are interested:

https://github.com/timescale/timescaledb-toolkit/blob/main/e...

(The TimescaleDB Toolkit is also implemented in Rust)


Facebook seems to have an even better performance implementation using sqrt. Might make sense to port that over to Rust. https://github.com/facebook/folly/blob/master/folly/stats/TD...


a separate implementation


Hi, an unrelated nitpick: the relative error should be calculated by dividing the error by the true value, not by its approximation. Still, very nice writeup!


Thanks! I messed up the formula but had it right in the text :( Fixed now.


Not an expert on this topic, but I noticed that the KLL algorithm (published in 2016) hasn't been mentioned in this thread. It provides theoretically optimal performance for streaming quantiles with worst-case guarantees: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (And it's pretty fast in practice.)


NB: Post author here.

Interesting, will have to take a look! Thanks for sharing!


Also relevant: Single-Pass Online Statistics Algorithms

[1] http://www.numericalexpert.com/articles/single_pass_stat/


That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?


The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.


Sure there is. You simply maintain N phases of digests, and every time interval T you evict the oldest phase and recompute the summary (because T-digests are easily merged).
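
Something like this sketch, where NaiveDigest is just a store-everything stand-in to show the windowing mechanics (in practice each phase would be a real t-digest, which is what keeps the memory bounded):

    from collections import deque

    class NaiveDigest:
        # Stand-in for a real t-digest: stores values, supports merge.
        def __init__(self):
            self.values = []
        def update(self, x):
            self.values.append(x)
        def merge(self, other):
            self.values.extend(other.values)
        def quantile(self, q):
            s = sorted(self.values)
            return s[int(q * (len(s) - 1))]

    class PhasedWindow:
        # Keep N phase digests; evict the oldest every time a phase "closes".
        def __init__(self, phases=6, digest_factory=NaiveDigest):
            self.phases = deque(digest_factory() for _ in range(phases))
            self.factory = digest_factory
        def add(self, x):
            self.phases[-1].update(x)          # always write the newest phase
        def rotate(self):                      # call every T seconds/rows
            self.phases.popleft()              # evict the oldest phase
            self.phases.append(self.factory())
        def quantile(self, q):
            combined = self.factory()
            for p in self.phases:              # digests merge cheaply
                combined.merge(p)
            return combined.quantile(q)

    w = PhasedWindow(phases=3)
    for i in range(100):
        w.add(i)
        if i % 25 == 24:
            w.rotate()
    print(w.quantile(0.5))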


I think this would be a tumbling window rather than a true "rolling" t-digest. I suppose you could decrement the buckets, but it gets a little weird, as splits can't really be unsplit. The tumbling-window version would probably work, though TDigest is a little weird on merge, as it's not completely deterministic with respect to ordering and merging (UDDSketch is). So you'd likely get something that is more than good enough, but it wouldn't be the same as if you had just calculated it directly, which makes it a little confusing and difficult.

(NB: Post author here).


This is what I do, it's not a true rolling digest but it works well enough for my purposes.


Yep, I had to implement t-digest in a monitoring library. Another alternative (although older) that the Prometheus libraries use is CKMS quantiles [0].

[0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...


I think the new ones started with Greenwald-Khanna, but I definitely agree - P^2 can be a little silly and misleading. In particular, it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.


That sounds familiar. I remember reading about Greenwald-Khanna before I found T-Digest, back when I ran into the "how to find a percentile of a massive data set" problem myself.


> Find the median ... randomly evict items

So, not find, but approximate. That's a different thing.


> without sorting it... have a fixed size sorted buffer

(that you sort yourself)


That doesn't really make sense to me at all. Don't sort it, just have it?

Is the storage restriction the point?


Yes. The goal is to find (or approximate) the median without storing all the elements. Instead, it approximates the median by finding the median of randomly selected samples from the elements.
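
A sketch of that, using classic reservoir sampling so every element has an equal chance of ending up in the fixed-size buffer (buffer size picked arbitrarily):

    import random

    def approx_median(stream, buffer_size=1024):
        # Approximate the median with a fixed-size uniform random sample.
        # Reservoir sampling (Algorithm R): every element of the stream ends
        # up in the buffer with probability buffer_size / n.
        buffer = []
        for n, x in enumerate(stream, start=1):
            if len(buffer) < buffer_size:
                buffer.append(x)
            else:
                j = random.randrange(n)        # keep x with prob buffer_size/n
                if j < buffer_size:
                    buffer[j] = x
        buffer.sort()
        return buffer[len(buffer) // 2]

    print(approx_median(random.gauss(50, 10) for _ in range(1_000_000)))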


NB: Post author here.

Thanks for sharing! I hadn't heard of that algorithm; I've seen a number of other ones out there, and we chose a couple that we knew about or that users requested. (We're open to more requests if folks want other ones: head over to https://github.com/timescale/timescaledb-toolkit and open an issue!)


Did the candidate get an offer? Genuinely curious.

I had a basic screening call fail once because the expected answer was (from my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.


Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question in the interview, and this probably wasn't the only interview.


> if we estimate the median and move it a small amount for each new data point, it would be pretty close.

Yeah, this is gradient descent on the absolute loss.
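
Spelled out: the loss is L(m) = E|X - m|, whose (sub)gradient at a sample x is sign(m - x), so the stochastic update is

    m_new = m - eta * sign(m - x)

i.e. bump the estimate up when the new point is above it and down when it's below (eta is a free step size).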


> The question was basically, "Find the median of a huge data set without sorting it," [...]

You can use e.g. QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.

They don't sort the data, but they do need a linear amount of storage.
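
A sketch of QuickSelect for the median (expected O(n) time, but note the whole list still has to fit in memory, which is the catch for "huge"):

    import random

    def quickselect(data, k):
        # Return the k-th smallest element (0-indexed) without fully sorting.
        data = list(data)                      # needs the whole list in memory
        lo, hi = 0, len(data) - 1
        while True:
            pivot = data[random.randint(lo, hi)]
            # 3-way partition of the active slice around the pivot
            lt = [x for x in data[lo:hi + 1] if x < pivot]
            eq = [x for x in data[lo:hi + 1] if x == pivot]
            gt = [x for x in data[lo:hi + 1] if x > pivot]
            data[lo:hi + 1] = lt + eq + gt
            if k < lo + len(lt):
                hi = lo + len(lt) - 1          # recurse into the left part
            elif k < lo + len(lt) + len(eq):
                return pivot                   # k falls inside the pivot run
            else:
                lo = lo + len(lt) + len(eq)    # recurse into the right part

    def median(data):
        data = list(data)
        return quickselect(data, len(data) // 2)

    print(median([7, 1, 5, 3, 9, 2, 8]))   # -> 5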


> The question was basically, "Find the median of a huge data set without sorting it,"

Isn't this done using a min heap and a max heap in conjunction?


The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".


'Estimate the median of an arbitrary sized data set using a constant amount of memory'.


No, for two reasons:

1. Huge data set: heaps require storing the whole set, but "huge" means "more than you can store".

2. Without sorting it: heaps are basically semi-sorted, so you are breaking the rules.


You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find the median in O(n) time, you need to do something similar to the magic-five (median of medians) algorithm.


Something like QuickSelect is probably better in practice than median-of-medians.


> but "huge" means "more than you can store"

Really? Where's it coming from?


I think they mean having more than you can store simultaneously on a single device.

With a few exceptions, this is a pretty common scenario.


More than you can store.

And possibly it's live data.


The two heap method helps you maintain the current median given an incoming stream of “online” data, and technically is not “sorting”.

The original problem is probably more accurately described as “what if you have too much data to sort” though.
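
For reference, a sketch of the two-heap trick; Python's heapq is a min-heap, so the lower half is stored negated to fake a max-heap:

    import heapq

    class StreamingMedian:
        # Maintain the running median of a stream with two balanced heaps.
        def __init__(self):
            self.low = []   # max-heap of the lower half (values stored negated)
            self.high = []  # min-heap of the upper half

        def add(self, x):
            if self.low and x > -self.low[0]:
                heapq.heappush(self.high, x)
            else:
                heapq.heappush(self.low, -x)
            # rebalance so the sizes differ by at most one
            if len(self.low) > len(self.high) + 1:
                heapq.heappush(self.high, -heapq.heappop(self.low))
            elif len(self.high) > len(self.low) + 1:
                heapq.heappush(self.low, -heapq.heappop(self.high))

        def median(self):
            # assumes at least one element has been added
            if len(self.low) > len(self.high):
                return -self.low[0]
            if len(self.high) > len(self.low):
                return self.high[0]
            return (-self.low[0] + self.high[0]) / 2

    sm = StreamingMedian()
    for x in [5, 1, 9, 2, 7]:
        sm.add(x)
    print(sm.median())   # -> 5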


It's funny that this is often left out of data structures & algorithms classes.


Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.


The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I agree that if you have the whole thing, doing heapsort and pulling a[N/2] or a[1 + N/2] is simpler.


> The minheap-maxheap approach is better for streaming data, to get the median as data comes in.

I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"

But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.

My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.
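
A sketch of that bucket-counting idea (it assumes you know the value range and pick a bucket count up front; a second pass over just the winning bucket's range could then refine the answer):

    import random

    def median_bucket(stream, lo, hi, n_buckets=1000):
        # Locate the bucket containing the median using O(n_buckets) memory.
        # Assumes values fall in [lo, hi]; returns the bucket's value range,
        # not the exact median.
        width = (hi - lo) / n_buckets
        counts = [0] * n_buckets
        total = 0
        for x in stream:
            i = min(int((x - lo) / width), n_buckets - 1)
            counts[i] += 1
            total += 1
        # walk the histogram until we pass half the observations
        seen = 0
        for i, c in enumerate(counts):
            seen += c
            if seen >= (total + 1) // 2:
                return (lo + i * width, lo + (i + 1) * width)

    print(median_bucket((random.uniform(0, 100) for _ in range(1_000_000)), 0, 100))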


Because unlike many dynamic programming algorithms, it is something anyone running a large system will need.


How does this get you the median?



