
It works well only on uniformly distributed keys. Not everyone has the pleasure of working with such data.



Actually it will work with any distribution of keys, provided that you know what that distribution is.
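
For what it's worth, a minimal sketch of that in Python (the cdf argument is an assumption, standing in for whatever known distribution you have): applying the CDF maps the keys onto a roughly uniform 0..1 scale, so the usual interpolation step still works.

    # Sketch: interpolation search generalized to a known key
    # distribution. 'cdf' is an assumed callable mapping a key to
    # its cumulative probability in [0, 1]; for uniform keys it is
    # just a linear rescaling, recovering the textbook formula.
    def interpolation_search_cdf(arr, target, cdf):
        lo, hi = 0, len(arr) - 1
        while lo <= hi and arr[lo] <= target <= arr[hi]:
            denom = cdf(arr[hi]) - cdf(arr[lo])
            if denom <= 0:
                mid = lo  # degenerate span: probe the left end
            else:
                frac = (cdf(target) - cdf(arr[lo])) / denom
                mid = lo + int(frac * (hi - lo))
            if arr[mid] == target:
                return mid
            if arr[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1  # not present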


As long as they are integer keys, that is.


It doesn't need to be an integer, but you're right that it requires an additional property of the key type that most sorting and searching algorithms do not need: being able to measure the distance between two keys. Most algorithms only need to decide whether a key is less than, equal to, or greater than another key.


This is implicit in having access to the (cumulative) key distribution. Still, the interpolation function can turn out to be pretty expensive if your data isn't uniform over a simple range like 0-100.


We only need a ball-park estimate of the distance between keys, not an exact measurement. Anything better than random should be an improvement.


Technically, any data on your computer is represented as an integer. Integer subtraction is simply not guaranteed to be meaningful as a distance operation for all encodings.

For most data types that you're likely to index by (numbers, strings in alphabetical order, dates, tuples of these things), it's quite possible to build a meaningful distance operation which can be used for this. If you're indexing by something for which it's difficult to do this (graphs, versioned strings), you should probably rethink your data structure.
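
To make that concrete, here is one possible (hypothetical, deliberately lossy) mapping for lowercase strings: pack a fixed-length prefix into an integer, base 26, so that subtraction approximates alphabetical distance.

    # Sketch: map a lowercase ASCII string to an integer so that
    # subtraction approximates alphabetical distance. Only the
    # first 'width' characters contribute, which is fine because
    # we only need a ball-park estimate, not an exact one.
    def string_key(s, width=6):
        n = 0
        for i in range(width):
            c = ord(s[i]) - ord('a') if i < len(s) else 0
            n = n * 26 + c
        return n

Dates reduce to timestamps, and tuples can be packed the same way, field by field.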


I don't see why.


It doesn't need to be integer keys, but you do need to be able to do the interpolation. Hence you need some kind of arithmetic on your keys, and not all key types support that.


I agree. It does not depend on the key distribution, nor on the type of the key. But we do need to be able to estimate this

    (key - min) / (max - min)
where the variables are all keys and we want the expression to give a number between 0 and 1. We just need some way of estimating the 'distance' between keys for the subtraction to work. It does not need to be an exact distance; an estimate that is on average better than random should beat binary search.
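
Put together, a sketch of the whole search over numeric keys looks like this (illustrative Python, not anyone's production code):

    # Sketch of interpolation search over sorted numeric keys.
    def interpolation_search(arr, target):
        lo, hi = 0, len(arr) - 1
        while lo <= hi and arr[lo] <= target <= arr[hi]:
            if arr[hi] == arr[lo]:
                mid = lo  # every key in the span is equal
            else:
                # Linear estimate: where the target "should" sit if
                # keys were evenly spaced between arr[lo] and arr[hi].
                frac = (target - arr[lo]) / (arr[hi] - arr[lo])
                mid = lo + int(frac * (hi - lo))
            if arr[mid] == target:
                return mid
            if arr[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1  # not present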

The algorithm they give is effectively a linear interpolation. We could beat their algorithm by fitting a curve to the accumulated known points rather than a straight line to the nearest two points.
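
A hedged sketch of that curve-fitting variant (the quadratic fit and the use of numpy are my choices, not the article's): keep every (key, index) pair probed so far and fit a low-degree polynomial through them to pick the next probe.

    # Sketch: instead of a straight line through the two endpoints,
    # least-squares-fit a quadratic through all probes seen so far
    # and evaluate it at the target to choose the next position.
    # Assumes the probed keys are mostly distinct; purely illustrative.
    import numpy as np

    def curve_fit_search(arr, target):
        if not arr:
            return -1
        lo, hi = 0, len(arr) - 1
        probes = [(arr[lo], lo), (arr[hi], hi)]
        while lo <= hi and arr[lo] <= target <= arr[hi]:
            keys = [k for k, _ in probes]
            idxs = [i for _, i in probes]
            deg = min(2, len(probes) - 1)
            coeffs = np.polyfit(keys, idxs, deg)
            mid = int(round(float(np.polyval(coeffs, target))))
            mid = max(lo, min(hi, mid))  # clamp into the live span
            if arr[mid] == target:
                return mid
            probes.append((arr[mid], mid))
            if arr[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1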


I think it should work on many other distributions. You'll probably need to do density estimation on your keys and partition your search space accordingly if the distribution is not known a priori.
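
A minimal sketch of that, using an empirical CDF built from a sample of the keys (the sample is an assumption, and the partitioning step is left out):

    # Sketch: estimate the key distribution from a sample by
    # building an empirical CDF, which can then drive the
    # interpolation step upthread in place of a known distribution.
    from bisect import bisect_left

    def make_empirical_cdf(sample):
        s = sorted(sample)
        def cdf(key):
            # Fraction of sampled keys strictly below 'key'.
            return bisect_left(s, key) / len(s)
        return cdf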



