Beating Binary Search (sna-projects.com)
84 points by boredandroid on June 17, 2010 | 47 comments



This depends entirely on being able to do the interpolation on your keys. He computes:

    offset = (key - min) / (max - min)
but not all data can have arithmetic done on it like that. What if the data in your array are, for example, names?

It's pretty trivial to beat binary search if your data are integers, or other keys on which you can do arithmetic. You can make the uniform distribution assumption, check it as you go, degrade to binary search if the assumption seems flawed, and then proceed from there.
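
Something like this sketch, say (plain Java, not the author's code; it assumes a sorted int[] with roughly uniform keys, and gives up on interpolation after a bounded number of probes):

    // Sketch only: interpolation probes with a binary-search fallback.
    // Returns an index of `key` in the sorted array `a`, or -1 if absent.
    static int interpolationOrBinarySearch(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        // If we haven't converged after ~2*log2(n) probes, the uniformity
        // assumption looks wrong -- degrade to plain binary search.
        int maxProbes = 2 * (32 - Integer.numberOfLeadingZeros(Math.max(1, a.length)));
        int probes = 0;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (probes++ > maxProbes) break;
            int mid;
            if (a[hi] == a[lo]) {
                mid = lo;  // all keys in range are equal; avoid divide-by-zero
            } else {
                mid = lo + (int) ((long) (hi - lo) * ((long) key - a[lo]) / ((long) a[hi] - a[lo]));
            }
            if (a[mid] == key) return mid;
            if (a[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        int r = java.util.Arrays.binarySearch(a, lo, hi + 1, key);
        return r >= 0 ? r : -1;
    }

The cutoff heuristic is arbitrary; the point is just that the fallback costs nothing asymptotically.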

However, if you're given an array of unknown objects, a comparison function, and no way to do arithmetic on your objects, you can't beat binary search.


However, if you're given an array of unknown objects, a comparison function, and no way to do arithmetic on your objects, you can't beat binary search.

... if all you care about is operation count. If you're fetching pages from disk, a B+Tree will perform dramatically better than a binary search.

And if you're fetching pages from disk and have a large cache, a B+Tree can do far better than interpolation search, too. Interpolation search will thrash your cache, resulting in O(log log n) seeks, whereas a cache of size O(n/B) will allow a B+Tree to find data in O(1) seeks.


But now you're talking about different data structures - his assumption was that you had a simple sorted array. We both know that there's a lot that can be done, and when performance really matters you don't use a sorted array, you don't use binary search, and you do take into account things like access pattern, access speed, caches, etc.

But people don't, because they read simplistic articles like this and then believe they know everything. At least the author could hint that there's a lot more that can be done. As it is, it just leaves the naive reader with a sense that they've seen the light, and the experienced reader irritated.


True. Although I'd be interested in seeing actual benchmarks for interpolation vs. binary search too, since binary search has better cache locality...


What if the data in your array are, for example, names?

  offset = (key[0] - min) / (max - min)
doesn't seem very hard to me. Or you could even go by the first 2 characters:

  offset = ((key[0] << 8) + key[1] - min) / (max - min)
The issue, of course, is that names aren't evenly distributed, so this algorithm could degrade pretty poorly... you'd probably be better off with a simple binary search.
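
Spelled out a bit more (a throwaway Java sketch; twoCharRank and estimatedOffset are my names, and it assumes single-byte characters, so it inherits exactly the skew problem just mentioned):

    // Sketch: collapse the first two (assumed single-byte) characters into a
    // rank in 0..65535 and interpolate on that. Illustration only -- skewed
    // name distributions will still probe badly.
    static int twoCharRank(String s) {
        int c0 = s.length() > 0 ? (s.charAt(0) & 0xFF) : 0;
        int c1 = s.length() > 1 ? (s.charAt(1) & 0xFF) : 0;
        return (c0 << 8) | c1;
    }

    static double estimatedOffset(String key, String min, String max) {
        int k = twoCharRank(key), lo = twoCharRank(min), hi = twoCharRank(max);
        return hi == lo ? 0.0 : (double) (k - lo) / (hi - lo);
    }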


My apologies - the example I gave was a poor one, but I thought the point was clear. Obviously it wasn't.

Suppose instead I give you records of an unknown type, and a function that compares them. Now interpolation search doesn't work and you have to do something either simpler, or more complex.


But that's a different problem. Of course different problems may require different solutions.


But that's a different problem.

No, it's exactly the problem. Quoting:

    Quick, what is the fastest way
    to search a sorted array?
See? He didn't say I can do arithmetic on the keys, he just said the array is sorted. He then goes and solves a different problem to the one he originally gave.

In short, the real answer is: It depends

Yes, if you can do arithmetic on the keys, and the keys are sorted in the obvious way according to your arithmetic, then interpolation search can be faster than binary search.

But not always, not in the worst case, and not if you have interesting issues about the cache, access speed, etc.

No, I didn't change the problem, I just raised issues with the overly simplistic analysis he gave.


I'm not saying his problem description was great, but the additional constraints and assumptions he was working under did become clear over the course of the article. He certainly never said "you're given an array of unknown objects, a comparison function, and no way to do arithmetic on your objects", and the first sentence you quoted doesn't contain enough information to translate directly into that interpretation (where did it say you get a comparison function, for instance?).


I wonder if we're simply in violent agreement - it certainly seems to me that you just aren't getting my point, and while much of what you say is true, it just doesn't seem relevant to me.

Yes, Interpolation Search has better average complexity in those circumstances when it works. However, for almost all cases of Binary Search that I use, Interpolation Search won't work, and isn't applicable. I was trying to make that point.

If your vector is sorted then it's reasonable to assume you have a comparison function between the objects in your vector. Binary Search then requires nothing more than this absolute minimum - just a comparison function. In this case Interpolation Search doesn't work at all.

As Colin pointed out above, it actually gets worse for our author. In cases where the volumes are sufficient to make much of a difference, you are almost certainly going to want to take caching and access speeds into account. Now you need to move out of the realm of Binary or Interpolation Search and into other data structures anyway.

Yes, Interpolation Search is cool and works in a small niche of cases. But mostly, it's irrelevant.

And that's my summary. Cool, but mostly either inapplicable or irrelevant.


I think you're overstating how often you're only going to have a sorted vector of unknown type and an opaque comparison function, and understating how often you'll know your types and be able to make a reasonable guess at their distribution.

You're also ignoring how much simpler it is to roll a quick interpolation search than throw together a B+-Tree, to the point where it may make sense as something to try when you've got an hour to try to speed up some process just enough to get you through another week of growing load while you're rewriting your app to do it right.

But, yes, I'll agree with your summary.


One particular real-world case where it is useful is seeking (to a particular frame number or time code) within variable-bit-rate video streams.


I'm fascinated that so many of the commenters here seem to be focusing on the fact that the algorithm mentioned doesn't work in every case. My reaction was different -- I thought of a case where it would be less than ideal, and allowed it to inspire me to find an algorithm that worked better than naive binary search for that case, too.

In particular, I thought of the case of searching arrays of strings. If you are doing repeated searches against the same array, you can easily build a table breaking the array down into subarrays all sharing the same first character. (I'm assuming ASCII there!) Using that can quickly eliminate 5 or so comparisons that would be required with a naive binary search.
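
For what it's worth, a rough sketch of that table in Java (my names, not the article's; it assumes a sorted String[] whose first characters are plain ASCII):

    // Sketch: start[c] is the first index whose string begins with a character
    // >= c, so the strings starting with character c live in
    // [start[c], start[c + 1]). Binary search then only has to look there.
    static int[] buildFirstCharStarts(String[] sorted) {
        int[] start = new int[257];
        int c = 0;
        for (int i = 0; i < sorted.length; i++) {
            int first = sorted[i].isEmpty() ? 0 : (sorted[i].charAt(0) & 0xFF);
            while (c <= first) start[c++] = i;
        }
        while (c <= 256) start[c++] = sorted.length;
        return start;
    }

    static int search(String[] sorted, int[] start, String key) {
        int first = key.isEmpty() ? 0 : (key.charAt(0) & 0xFF);
        int r = java.util.Arrays.binarySearch(sorted, start[first], start[first + 1], key);
        return r >= 0 ? r : -1;
    }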

Okay, that's not brilliant, but might be useful in some circumstances. My bigger point is I thought this article was a very useful reminder that even a terrific general purpose algorithm like binary search can be beaten if you can apply specific domain knowledge to the search.


That sounds a bit like Burst Tries / burstsort.


You mean the Burst-colomon sort.


>In particular, I thought of the case of searching arrays of strings. If you are doing repeated searches against the same array, you can easily build a table breaking the array down into subarrays all sharing the same first character. (I'm assuming ASCII there!) Using that can quickly eliminate 5 or so comparisons that would be required with a naive binary search.

Depending on the extent of your "fan-out", your memory budget might explode...


ASCII can be stored in a byte, so at most he is going to use

256*(index+length).

Assuming he doesn't have more than 2^64 bytes of data, he will only need 24576 bits for the index, and only 32768 if he needs a 64-bit rather than a 32-bit length.

You might not even need the length field, as it is just the next array index minus the current one. In that case you need only 16384 bits.

If you want two-byte Unicode, you have to use 49152, 65536 and 32768 bits, respectively.

Remember, he only wanted to index the first character.


Caveat emptor: The article fails to mention that interpolation search's worst-case performance is O(n), considerably worse than binary search's O(log n).


And? Quicksort in general runs faster than Heapsort, yet the worst-case performance of the former is O(n^2) while the latter's is only O(n log n). If they implemented it, I'm sure they've found that there's a performance boost for large datasets.


Note that properly implemented quicksort has expected O(n log n) time independent of the data (dependent only on the random choices of pivots), whereas interpolation search's performance is data dependent. That's a pretty important distinction.


Interestingly, that's not quite true: see http://www.scss.tcd.ie/~pbiggar/jea-2008.pdf, section 8.3.

I'll point out too that no quicksort implementation uses a truly random choice of pivot. Typically they use the median of three elements as the pivot, which is pseudo-random at best.
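
For reference, median-of-three pivot selection looks something like this (a sketch, not any particular library's code):

    // Sketch: return the index of the median of a[i], a[j], a[k]; callers
    // typically pass i = lo, j = lo + (hi - lo) / 2, k = hi. Deterministic,
    // so an adversary who knows the scheme can still construct bad input.
    static int medianOfThree(int[] a, int i, int j, int k) {
        if (a[i] < a[j]) {
            if (a[j] < a[k]) return j;
            return a[i] < a[k] ? k : i;
        } else {
            if (a[i] < a[k]) return i;
            return a[j] < a[k] ? k : j;
        }
    }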


Your second point may be valid, though not literally. I've implemented quicksort, and I used (pseudo) random pivots, damnit! :)

The article you point to doesn't seem to address the complexity of quicksort with random pivots. However, that is proven to be expected O(n log n) time for any data in Motwani's Randomized Algorithms book, and I don't think the correctness of that has ever been questioned.


> The article you point to doesn't seem to address the complexity of quicksort with random pivots.

That paper is about experimental algorithmics, and avoids discussions of things which aren't practical (such as using truly random pivots).

> However, that is proven to be expected O(n log n) time for any data in Motwani's Randomized Algorithms book, and I don't think the correctness of that has ever been questioned.

Right, but since there are no truly random quicksorts, it's hardly a concern.


Please clarify: Why are random pivots not practical? And why do you say there are no truly random quicksorts? Because pseudorandom numbers aren't really random, or for some other reason?


> Why are random pivots not practical?

The cost of true randomness is really high (in execution time)

> And why do you say there are no truly random quicksorts?

Because pseudo-random numbers aren't really random.

The only reason this is important is that the pseudo-randomness means it's possible to construct an adversarial input, which could drive the quicksort to O(n^2) time.


> If they implemented it, I'm sure they've found that there's a performance boost for large datasets.

Argh! Perhaps I take things too literally, but the logic in this statement is just painful.


Only naive implementations of quicksort take O(n^2) time. Most good implementations will switch to another sort (generally heapsort) after detecting adversarial conditions.


In which case they cease to be quicksort, and are instead called "introsort": http://en.wikipedia.org/wiki/Introsort


My point was that we should give good hackers the benefit of the doubt when they say quicksort, as they are likely to actually be talking about introsort. GNU C uses introsort to implement qsort, and if I were to implement quicksort for an important job, I certainly wouldn't overlook the fact that it can run to O(N^2).
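
The depth-limited fallback is simple enough to sketch (illustration only, not GNU's or anyone else's actual implementation):

    // Sketch: quicksort normally, but once the recursion depth exceeds
    // ~2*log2(n) -- a sign of unlucky or adversarial pivots -- hand the
    // remaining range to heapsort, keeping the whole sort O(n log n).
    static void introSort(int[] a) {
        int limit = 2 * (32 - Integer.numberOfLeadingZeros(Math.max(1, a.length)));
        introSort(a, 0, a.length - 1, limit);
    }

    static void introSort(int[] a, int lo, int hi, int depth) {
        if (lo >= hi) return;
        if (depth == 0) { heapSort(a, lo, hi); return; }
        int p = partition(a, lo, hi);
        introSort(a, lo, p - 1, depth - 1);
        introSort(a, p + 1, hi, depth - 1);
    }

    // Plain Lomuto partition with the last element as pivot.
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    // Heapsort restricted to a[lo..hi].
    static void heapSort(int[] a, int lo, int hi) {
        int n = hi - lo + 1;
        for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, lo, i, n);
        for (int end = n - 1; end > 0; end--) {
            int t = a[lo]; a[lo] = a[lo + end]; a[lo + end] = t;
            siftDown(a, lo, 0, end);
        }
    }

    static void siftDown(int[] a, int lo, int i, int n) {
        while (2 * i + 1 < n) {
            int c = 2 * i + 1;
            if (c + 1 < n && a[lo + c + 1] > a[lo + c]) c++;
            if (a[lo + i] >= a[lo + c]) return;
            int t = a[lo + i]; a[lo + i] = a[lo + c]; a[lo + c] = t;
            i = c;
        }
    }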


I thought about this years ago, simply because it is also kind of the way humans search in telephone books. (An advantage of being born in a time when telephone books were still in use.) I think humans would also do repeated interpolation, as some letters are "thicker" (more pages in the book) than others.

Bionics for the win...


You can beat pretty much any algorithm if you make convenient extra assumptions. In this case they have suitable data for interpolation search (good for them!), but the title promises too much.

You can't beat binary search without new assumptions.


It works well only on uniformly distributed keys. Not everyone has the pleasure of working with such data.


Actually it will work with any distribution of keys, provided that you know what that distribution is.


As long as they are integer keys, that is.


It doesn't need to be an integer, but you're right that it requires an additional property of the key type that most sorting and searching algorithms do not need: being able to measure the distance between two keys. Most algorithms only need to decide whether a key is less than, equal to, or greater than another key.


This is implicit in having access to the (cumulative) key distribution. Still, the interpolation function can turn out to be pretty expensive, if your data isn't uniform 0-100.


We only need a ball-park estimate of the distance between keys, not an exact measurement. Anything better than random should be an improvement.


Technically, any data on your computer is represented as an integer. Integer subtraction is simply not guaranteed to be meaningful as a distance operation for all encodings.

For most data types that you're likely to index by (numbers, strings in alphabetical order, dates, tuples of these things), it's quite possible to build a meaningful distance operation which can be used for this. If you're indexing by something for which it's difficult to do this (graphs, versioned strings), you should probably rethink your data structure.


I don't see why.


It doesn't need to be integer keys, but you need to be able to do the interpolation. Hence you need some kind of arithmetic on your keys, and not all data are like that.


I agree. It does not depend on the key distribution, nor on the type of the key. But we do need to be able to estimate this

    (key - min) / (max - min)
where the variables are all keys and we want the expression to give a number between 0 and 1. We just need some way of estimating the 'distance' between keys for the subtraction to work. It does not need to be an exact distance; an estimate that is better than random on average should beat binary search.

The algorithm they give is effectively a linear interpolation. We could beat their algorithm by fitting a curve to the accumulated known points rather than a straight line to the nearest two points.
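
In code, that amounts to something like this (a sketch; the PositionEstimator interface and its name are mine, not from the article):

    // Sketch: interpolation search where the probe position comes from a
    // pluggable estimate of where `key` falls between two other keys. The
    // estimate only has to be roughly right on average; a bad estimate costs
    // extra probes, not correctness.
    interface PositionEstimator<K> {
        double fraction(K key, K low, K high);   // ~ (key - low) / (high - low), in [0, 1]
    }

    static <K extends Comparable<K>> int search(K[] a, K key, PositionEstimator<K> est) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key.compareTo(a[lo]) >= 0 && key.compareTo(a[hi]) <= 0) {
            double f = Math.min(1.0, Math.max(0.0, est.fraction(key, a[lo], a[hi])));
            int mid = lo + (int) (f * (hi - lo));
            int cmp = key.compareTo(a[mid]);
            if (cmp == 0) return mid;
            if (cmp > 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

For integer keys the estimator is just the article's (key - min) / (max - min); the curve-fitting idea would go inside fraction().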


I think it should work on many other distributions. You'll probably need to do density estimation on your keys and partition your search space accordingly if the distribution is not known a priori.


I do not know this language (Java?), but doesn't the line

   int offset = (int) (((size - 1) * ((long) key - min)) / (max - min));
compute (max - min) as an int (potentially overflowing), and then convert it to long, defeating the precautions obviously taken against this kind of event with the cast to long elsewhere?


Let me try to break it down:

  (max - min) is int
  ((long) key - min) is long
  long / int is long since a long is involved.
  (size - 1) is int
  int * long is long
I think the purpose of the long cast is that the multiplication produces a long and doesn't overflow. But then it is cast to an int anyway, defeating the purpose.
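
If the worry is (max - min) overflowing, the usual fix is to force that subtraction into long as well, something like this (a sketch, assuming min <= key <= max at this point):

    // Both subtractions now happen in long, so neither can overflow; the
    // quotient is at most (size - 1), so the narrowing cast back to int is safe.
    int offset = (int) (((size - 1) * ((long) key - min)) / ((long) max - min));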


> (max - min) is int

Incorrect. These are signed values. If you subtract a large negative integer from a large positive integer, the result can be larger than an int can hold, so (max - min) can overflow.


Interesting article, but without actual numbers it's hard to tell whether the change made any difference in performance.


That's what mathematics is for ... you just need the growth rate.

O(lg lg N) is a smaller growth rate than O(lg N). For N = 2^32, for instance, lg N is 32 while lg lg N is 5.

Of course, for smaller datasets it doesn't matter.

In fact if you do sorting, for small datasets (that are mostly sorted) a bubble-sort is better than quick-sort, since it doesn't involve random disk accesses and it might even require fewer steps.

I would be curious if this method for searching would apply to sorting (e.g. sorting done in O(N lg lg N)) ... since I remember that comparison-based sorting can't be better than O(N lg N) (the proof is based on the height of the comparison decision tree, which is at least lg(N!) ≈ N lg N).

So to get back on-topic ... it really depends on the size of your data.



