Naggum Nugget (groups.google.com)
55 points by signa11 on Nov 2, 2010 | 54 comments



True enough in 1999, and all but completely false now. For a competently implemented hash map with integer keys, hash tables start beating linear search somewhere between n = 2 and n = 4.

At n = 4, you start needing more than one cacheline to do your linear search. Since a worst-case cache miss is around 300 cycles on a Nehalem, a hash function is a very, very cheap price to pay for near-certainty that you'll only probe one line of the structure.

All of which makes the table-pounding tone of this post kind of funny from 11 years' hindsight. The crossover point between linear search and hashing is a result of hardware and software parameters, not rhetorical force. He was right when he wrote what he wrote, and is wrong now. Future hardware/software changes might make him right again, but might not. I've indulged in the "real programmers vs. quiche eaters" rhetorical flourish at times, too; this post is a welcome reminder of its risks. It looks better to posterity if you do the experiments, state the results, and walk away.


And based on the way he worded it, he would agree with you now. His point was that you need to apply both theory AND practice and not just blindly cite theory.


There is an aspect to theory that's comforting, though. I'm sure I would sleep better at night knowing that my application was using an O(1) hash table as its map rather than an O(n) alist, even if the constant factors are higher. More hardware helps out constant factors; it doesn't do anything with quadratic algorithms.


"Completely false" might be an overstatement. Do you have any test results you could point to? Or at least an exact specification of what the test would consist of? Particularly, are you measuring the first run from a luke warm start, guaranteed to be in memory but not in cache? Or might you be reading a page or two from disk, or making multiple passes while it still might be in some level of cache? Because I think the details are going to make a difference as to whether the break even point is n=4 or n=400, or maybe even n=4000.

Like the decade-old Erik, my intuition is still that brute force would be hard to beat for fewer than 100 integer keys to pointers if one is doing multiple passes through the list. But I'd love to bring my intuition up to date. Do you have one of these new-fangled 'competently implemented hash maps' that you could yield up so I can do some cycle counting? And is it fair if we throw in a few insertions and deletions just to make it realistic?

For the record, I completely agree that the cycles spent computing the hash are irrelevant once you might be hitting main memory. But I also think it's worth noting that the size of your L3 cache may be comparable to all the RAM Erik had in his machine. I also think you might be underestimating the power of a modern prefetch unit, whether hinted or automatic. But I'll bow to data. As you said very well: "do the experiments, state the results, and walk away."
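
(If anyone wants to try it, here is roughly the harness I have in mind - a Common Lisp sketch only, with TIME standing in for a real cycle counter, arbitrary pass counts, and a token insertion/deletion per pass. Names like CHURN-ALIST are made up.)

    (defun churn-alist (n &optional (passes 100000))
      (let ((alist (loop for i below n collect (cons i i)))
            (sum 0))
        (time (dotimes (p passes)
                (dotimes (k n) (incf sum (cdr (assoc k alist))))
                (push (cons n p) alist)      ; a token insertion ...
                (pop alist)))                ; ... and deletion
        sum))

    (defun churn-hash (n &optional (passes 100000))
      (let ((table (make-hash-table :test #'eql))
            (sum 0))
        (dotimes (i n) (setf (gethash i table) i))
        (time (dotimes (p passes)
                (dotimes (k n) (incf sum (gethash k table)))
                (setf (gethash n table) p)   ; a token insertion ...
                (remhash n table)))          ; ... and deletion
        sum))

    ;; compare, e.g., (churn-alist 4) with (churn-hash 4), then 40, 400, ...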


You are ignoring the fact that you have to calculate the hash-key. With association tables, pointer comparison is enough if you are searching for interned symbols.


No, he isn't; you entirely missed the point. Cache misses are expensive. Given that he specified integer keys, it's not a bad guess that a hash function will run in fewer than 300 cycles on the specified Nehalem. If your association list has to run across three cachelines but your hash implementation runs across only two, the hash wins under these circumstances. Add a couple more cachelines and hashes start winning for more than just integer keys, too.

Times have changed. Incidentally, the same may be true of some of the tree alternatives to hashes, too; used to be log n lookups would beat the O(1) lookups of hashes in many real-world scenarios because of the big constant on the O(1), but running across log n cachelines vs. O(1) cachelines (without a big constant) is a different story. The gulf between computation and memory continues to widen.


Cache misses are expensive, yes.

And when you have an interned string to search for, you always have to calculate its hash sum as a key before looking it up in a hash table, and you don't have to do that for a linear search (it's an interned symbol, that's why). This means you might have to read one, two, or more cache lines (with 64-byte cache lines, roughly strlen(yourstring)/64 of them).

You can't use the pointer itself as a key - either the distribution isn't good, or the language won't let you (you can't do it in Common Lisp, which is what Erik Naggum was talking about).

That calculation of the key might read one, two, or more cache lines, depending on how big the string is.

Of course, the smaller the key, the better for the hash table - and the bigger, the worse.

And then there are CPUs without caches still in popular use - for example, the PS3's Cell SPEs (SPUs).

And one more thing - in the linear search you always read the same cache lines, while with the hash table you have to read different memory every time (the strings to be searched upon).

But in both cases there are prefetch instructions. Still, you would be wasting less cache with the assoc table if you are doing a lot of lookups (e.g. you always warm it from the beginning and keep that in L1), while with the hash table you would warm all the keys. It's really hard to compare, but I hope you get the picture.
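
To make that concrete, here is a rough Common Lisp sketch of the two situations (the variable names are made up): the alist lookup only does EQ pointer comparisons, while the EQUAL hash table has to hash the whole string key - SXHASH, or whatever the implementation uses internally, reads every character - before it can even pick a bucket.

    ;; interned symbols in an alist: ASSOC just compares pointers with EQ
    (defparameter *symbol-alist*
      (list (cons 'foo 1) (cons 'bar 2) (cons 'baz 3)))
    (cdr (assoc 'bar *symbol-alist* :test #'eq))    ; => 2, no hashing anywhere

    ;; string keys in an EQUAL hash table: the key gets hashed first,
    ;; which means walking the characters of the string
    (defparameter *string-table* (make-hash-table :test #'equal))
    (setf (gethash "bar" *string-table*) 2)
    (gethash "bar" *string-table*)                  ; => 2, after hashing "bar"

    ;; the hash itself touches the whole key:
    (sxhash "a-long-string-key-can-easily-span-several-cache-lines")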


isn't this making a lot of assumptions with regard to memory locality and the organization of the hash table?

for instance: what if the initial capacity isn't set properly for the hash table? what if the hash function isn't perfect and you end up with 16 buckets and 4 elements in one bucket and none in the other 15? (at which point you end up with what is essentially a list with lots of overhead). what if you start off by allocating a hash table that is too small and midway through populating it have to expand it?

the more comments I read, the more Erik becomes right.


"isn't this making a lot of assumptions with regard to memory locality and the organization of the hash table?"

Well, yeah. Always.

But if you want to talk about how Erik is right, you probably want to go looking for an algorithm specifically designed for storing alists in a cache aware manner. Not sure what such a beast would look like, but something could probably be constructed and probably has been.
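
A guess at what such a beast might look like (this is a sketch, not a known published algorithm): keep the keys in one flat vector that you scan linearly, with the values in a parallel vector, so a search touches a few consecutive cache lines instead of chasing cons cells scattered across the heap.

    (defstruct flat-alist
      (keys (make-array 0 :adjustable t :fill-pointer 0))
      (vals (make-array 0 :adjustable t :fill-pointer 0)))

    (defun flat-put (fa key val)
      (vector-push-extend key (flat-alist-keys fa))
      (vector-push-extend val (flat-alist-vals fa)))

    (defun flat-get (fa key)
      ;; linear scan over one contiguous vector of keys, like ASSOC but
      ;; without pointer chasing; the values live in a parallel vector
      (let ((pos (position key (flat-alist-keys fa) :test #'eql)))
        (when pos (aref (flat-alist-vals fa) pos))))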


Saying "what if your hash table becomes linearized" is not a fair argument when comparing linear search to hash lookup.


I don't see why this should look funny 11 years in hindsight. the important part is knowing rather than assuming. and you seem to assume quite a lot.


The next reply in that conversation is by...Julian Assange (http://groups.google.com/group/comp.emacs/browse_thread/thre...).


another thing the world needs is people who can argue a point without an obnoxious "shut-down" mentality about it.

whatever points he's making, it feels like listening to bill o'reilly make them.


Obtaining useful information in contexts or from people you find unpleasant is a handy skill.


In this case, the useful skill is scanning that wall of text to extract the small nugget of useful information:

For small N, finding something in an association list can be faster than finding the same thing in a hash table.

Naggum also could have run some quick benchmarks and figured out at what size N hash tables become a better choice (depending on the nature of the keys, etc. etc.), in less time than he spent berating the intelligence of the questioner.


That nugget of useful information is also explained (in a lot more detail) in one of the first chapters of the "Introduction to Algorithms" book.

When I was working through the book (and I still am to some extent), it was basically beat into my brain that some algorithms with worse performance could perform better than faster algorithms for a small enough n.
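
The book even has an early exercise in this spirit: suppose insertion sort takes 8n^2 steps and merge sort takes 64 n lg n; then the 'worse' quadratic algorithm wins for every n up to about 43. A throwaway check (the constants are illustrative, the one-liner is mine):

    ;; largest n for which 8*n^2 <= 64*n*lg(n), i.e. where the quadratic
    ;; algorithm with the small constant still beats the n lg n one
    (loop for n from 2 to 100
          when (<= (* 8 n n) (* 64 n (log n 2)))
            maximize n)    ; => 43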


I doubt he is right for any N. Association lists require on average N+1 memory accesses. Small hash tables require 1 memory access.

Only in edge cases, like looking up a long string in an alist consisting of one short string, is he right.


It's good to be able to endure it when necessary, sure, but it's also good to avoid needlessly perpetuating bitterness.

Is it really that hard to say, "Complex algorithms have higher constant factors, so for small values of N, 'naive' ones can still be faster. When in doubt, measure." without chewing somebody out?


Agreed, but being able to present useful information in a context that makes people more receptive, rather than hostile, is also a handy skill.

It's easy to deliberately overlook rudeness, but it still unconsciously colors a person's responses and thoughts, which could have a negative impact on the message.


TL;DR: O(2n^2) complexity is better than O(5000n) for small n.

I believe the above to be an accurate summary.


In all the times I have seen this topic raised (and I've seen it raised a lot) I have yet to see anyone who engages in the debate actually do the experiment to determine where the actual breakeven point is. Hint: it's less than 5000.


Why do you assume there's a constant breakeven point? It depends on the specific implementations of the things being compared.


I don't assume anything. I actually did the experiment.


You did the experiment for one particular problem, with two particular implementations, on particular hardware. That doesn't mean you've discovered a universally valid break-even point.


I assume that's obvious, so lisper must mean something else. Regardless, if you know your complexities it's trivial to find the break-even point with math (list complexity = hash complexity, solve for n).


I don't see how you could possibly know what I did and didn't do. And I never claimed to have discovered a universally valid break-even point.


You're certainly not talking about it, so I'll assume you're trolling and leave it at that.


I never claimed to have discovered a universally valid break-even point.

The confusion stems from the language you used in your comment, which made it seem like you were claiming a lot more than you actually turned out to be:

"I have yet to see anyone ... actually do the experiment to determine where the actual breakeven point is. Hint: it's less than 5000."


Performance break-even is always with respect to a particular implementation. And even with only a single data point I could state (and still can) with very high confidence that no matter what implementation you run your test on, the break-even point will be less than 5000. A LOT less. You'd have to go out of your way to find a hash table implementation so bad that the break-even point was even within an order of magnitude of 5000. Even just reasoning from first principles you could reasonably guess that the break-even would be a lot less than 5000. Traversing a 5000-element AList requires 5000 memory accesses, with no guarantees of locality of reference so you're probably blowing out your cache dozens (thousands if you're unlucky) of times. You probably couldn't write a hash table implementation whose performance was that bad if you tried.

The fact of the matter is that hash tables are faster than ALists in all but the most trivial of cases. Theory predicts it, and an experiment that takes all of two minutes to do confirms it. Everything else is quibbling over angels on the head of a pin.


...and?


And it's a lot less than 5000. (And you should do your own homework.)


Don't you have a better use of your time than to tell everyone "Hey, I figured this thing out. But I'm not going to tell you the answer."

I'm struggling to think of a non-trolling motivation that explains this behavior.

This is giving me comp.lang.lisp flashbacks.


I was hoping to motivate people to do the experiment for themselves. It's not very hard. Doesn't seem to be working though. Oh well.

On Clozure Common Lisp the breakeven point for the average case is 20 items (10 for the worst case).


You did "the experiment"? As in, you compared two specific implementations and came up with a universal point?

Your condescending tone would at least not be completely unwarranted if you weren't utterly wrong.


> you compared two specific implementations

Yes.

> and came up with a universal point

No. I got one data point.

Since there seems to be so much demand for it, I compared ALists and hash tables on Clozure Common Lisp. The breakeven point there was 20 items.
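
For anyone who would rather reproduce that kind of number than take anyone's word for it, a crude sweep like the one below is enough to see roughly where the crossover sits on a given implementation. (A Common Lisp sketch only - internal run time instead of cycle counts, arbitrary repetition counts, and made-up function names.)

    (defun lookup-time (lookup-fn keys &optional (reps 200000))
      "Internal run time for REPS passes of LOOKUP-FN over KEYS."
      (let ((start (get-internal-run-time)))
        (loop repeat reps
              do (dolist (k keys) (funcall lookup-fn k)))
        (- (get-internal-run-time) start)))

    (defun crossover (&optional (max-n 64))
      (loop for n from 1 to max-n
            for keys  = (loop for i below n collect i)
            for alist = (loop for i below n collect (cons i i))
            for table = (let ((h (make-hash-table :test #'eql)))
                          (dolist (k keys h) (setf (gethash k h) k)))
            do (format t "n=~2d  alist=~8d  hash=~8d~%"
                       n
                       (lookup-time (lambda (k) (assoc k alist)) keys)
                       (lookup-time (lambda (k) (gethash k table)) keys))))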


... and using upper-case characters at the start of sentences is just silly.



I remember Erik giving this up at some point, and using a more conventional style, but I can't find it now.


He didn't give it up, but changed Emacs/Gnus to automatically capitalize sentences as he sent them. (Trivial, because he ended sentences with a period and two spaces.)


Eh, I think that more things are matters of style than people would have one believe.


I don't see him pin any actual numbers to things. It'd be interesting to see how things work out with different sized lists and hash tables.


He doesn't, but that's the crux of his entire argument. It's not like the theory is unsound, it's just that it applies to very large n, where the constant factor is insignificant in comparison. When n is small, it's the constant factor that dominates, and that's what he's saying (albeit in too condescending a manner).


Yes, I get that, but it'd still be interesting to play with some actual numbers.


Oh, definitely. Luckily, that's not very hard to do yourself...


Pretty much, although O() notation doesn't really work with constants.


I know, but it was the most concise way I could express it.


Is there a problem with just leaving off the O() altogether?


I was going to say "but then you have to say 'complexity'", but I realised I already did, so I should have done that instead.


More concise and more accurate. You're a genius!


Right. Big O notation usually excludes constant factors, but here, the constant factors are the point.


In order to improve the performance of hash tables, why not have the buckets be something like a red-black tree instead of a linked list? (You might improve the efficiency a little by spending a bit and saying "This bucket contains a single element and here it is", or "This bucket contains multiple elements and here is the root of the red-black tree".) Then, you would get log-time search instead of linear-time search when buckets start to fill up with multiple elements.


This exists, it's just generally not done because your buckets should be small enough that linear search is fastest.

From the wikipedia hash table article (http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_wi...):

Instead of a list, one can use any other data structure that supports the required operations. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g. in a real-time application), or if one expects to have many entries hashed to the same slot (e.g. if one expects extremely non-uniform or even malicious key distributions).
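
In miniature, the bucket-upgrade idea looks something like this sketch (integer keys assumed; a sorted vector with binary search stands in for the red-black tree, since the point is just getting O(log n) inside an oversized bucket):

    ;; a bucket starts life as an alist and, past a threshold, is upgraded to
    ;; a vector of (key . value) conses sorted by key and searched in O(log n)
    (defconstant +upgrade-threshold+ 8)

    (defun bucket-lookup (bucket key)
      (if (listp bucket)
          (cdr (assoc key bucket))         ; short bucket: linear scan
          (bucket-bsearch bucket key)))    ; long bucket: binary search

    (defun bucket-bsearch (vec key)
      (let ((lo 0) (hi (length vec)))
        (loop
          (when (>= lo hi) (return nil))
          (let* ((mid (floor (+ lo hi) 2))
                 (entry (aref vec mid)))
            (cond ((= key (car entry)) (return (cdr entry)))
                  ((< key (car entry)) (setf hi mid))
                  (t (setf lo (1+ mid))))))))

    (defun maybe-upgrade (bucket)
      ;; called after an insertion into an alist bucket
      (if (and (listp bucket) (> (length bucket) +upgrade-threshold+))
          (sort (coerce bucket 'vector) #'< :key #'car)
          bucket))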


Ouch! The cruel tutelage of Erik Naggum.

I miss that guy.


tl;dr: Fancy algorithms are slow when n is small, and n is usually small. (Rob Pike).



