
Multiplication is bad. Knuth actually also describes a hash function using random numbers and XOR. It's 10x faster than the modulo approach, and I believe Mikkel Thorup proved it optimal.

The idea is roughly:

Say you have a hash table of size 1024. You then create x uint arrays of size 256. These arrays you fill up with random numbers 0-1023.

To get your hash value, you take your input and for each i=0..x-1 you take byte k=input[i] and look up the value in array[i][k]. These looked-up values are then XORed together, giving a final random value between 0 and 1023, ready for indexing into the hash table.

No modulos. No multiplications. You only have to redo the random tables when the size changes from say 1024 to 2048. Easy peasy. Superfast.
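Roughly like this in C (a minimal sketch; the names, the fixed 8-byte key, and the use of rand() are mine, not from Knuth):

    #include <stdint.h>
    #include <stdlib.h>

    #define KEY_BYTES  8        /* x: number of key bytes hashed   */
    #define TABLE_SIZE 1024     /* hash table size (power of two)  */

    static uint32_t rand_table[KEY_BYTES][256];

    /* Fill the tables once; redo this only when TABLE_SIZE changes.
       rand() is just a placeholder for a decent random source. */
    static void init_tables(void)
    {
        for (int i = 0; i < KEY_BYTES; i++)
            for (int k = 0; k < 256; k++)
                rand_table[i][k] = (uint32_t)rand() % TABLE_SIZE;
    }

    /* XOR one table entry per input byte; the result is already
       in the range 0..TABLE_SIZE-1, so no modulo is needed. */
    static uint32_t tab_hash(const uint8_t key[KEY_BYTES])
    {
        uint32_t h = 0;
        for (int i = 0; i < KEY_BYTES; i++)
            h ^= rand_table[i][key[i]];
        return h;
    }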




“Superfast”, until you blow through your L1 cache, which happens pretty early on if you need 1 kB of table per byte in your key.

Even in the L1 cache, it's hard to beat the mul: A multiplication (which can hash multiple bytes in the case of Fibonacci hashing) has 3 cycles latency on modern x86. A single load, even from L1, is 5, I believe.
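For reference, the Fibonacci hashing being discussed boils down to one multiply and one shift; a minimal sketch (the constant is 2^64 divided by the golden ratio; k is the log2 of the table size):

    #include <stdint.h>

    /* Multiply by 2^64/phi and keep the top k bits, giving an
       index into a table of 2^k slots. */
    static uint64_t fib_hash(uint64_t key, unsigned k)
    {
        return (key * 11400714819323198485ull) >> (64 - k);
    }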


How much of a problem 3 cycles of latency is depends on what else your processor is doing. It might not be any problem at all.


Word. It depends on access patterns of course but yeah, L1 is a valuable resource.


Modern CPU cores can perform a multiplication and addition every clock tick. Heck, I'd expect a modern Zen4 core to be able to do like 4 parallel 64-bit multiplications per clock tick on its integer pipelines, and maybe 32 parallel 32-bit multiplications per clock tick on its vector pipelines.

Multiplications were bad 40 years ago, but the year 2020 called and FMAC is incredibly optimized today.

You should still avoid integer division (floating point division is commonly optimized as reciprocal and then multiply). But multiplications are really really fast at least as far back as 2008 or so.

-------

I'm pretty sure multiplication's latency is only 5 clocks, but with all the out-of-order processing that occurs on modern cores, a latency of just 5 ticks is rarely the bottleneck. (A DDR4 memory load is like 200+ cycles of latency. You shouldn't even worry about the 5 cycles of a multiplication, especially because those out-of-order cores will find some work to parallelize in that time.)

-----

> you lookup the value in array[i][k]

You know an L1 cache lookup these days is like 4 cycles of latency, right? And I'm pretty sure you have fewer load/store units than multiplication units. So a load/store, even to L1 cache, might use more resources than the multiply.

Might, I'd have to benchmark to be sure.


Indeed, division just doesn't have a parallel algorithm, unlike mul and add, so it's bound to be 'slow'. About 2008: Intel Core 2 (2006) had a 3-cycle mul. Edit: Pentium Pro (1995)'s imul was 4 cycles. The 386's imul was slow, though.


This sounds like Zobrist hashing, or related: https://en.wikipedia.org/wiki/Zobrist_hashing

"Zobrist hashing is the first known instance of the generally useful underlying technique called tabulation hashing."

which leads to https://en.wikipedia.org/wiki/Tabulation_hashing

"

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; [...]

Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: [...]

Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including hopscotch hashing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.

"

further

" Method: The basic idea is as follows:

First, divide the key to be hashed into smaller "blocks" of a chosen length. Then, create a set of lookup tables, one for each block, and fill them with random values. Finally, use the tables to compute a hash value for each block, and combine all of these hashes into a final hash value using the bitwise exclusive or operation.[1]

"


How come mul is bad? It has low latency - Skylake had 3 cycles per imul, and a throughput of one mul per cycle. Div is bad, but mul is great.

edit: Memory access (along with div) is pretty much the only slow operation in modern CPUs -- putting extra pressure on L1 just to get a random number is not smart at any rate; heck, Marsaglia's xorshift is likely cheaper than accessing L1, with very likely all of its latency hidden behind the memory access.
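For reference, Marsaglia's xorshift is just three shift/XOR steps and touches no memory at all; a minimal 64-bit sketch (the 13/7/17 shift triple is the one from his paper):

    #include <stdint.h>

    /* Marsaglia xorshift64: three shift/XOR steps, no memory access. */
    static uint64_t xorshift64(uint64_t x)
    {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return x;
    }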


Multiplication was bad on decades-old CPUs.


Others have hinted at this, but to be clear: This algorithm is slow, even in the optimal case where the tables are in cache. On new x86 CPUs it is theoretically limited to less than 2 bytes per cycle; probably somewhere around 1.5 for an implementation that loads 8 bytes of input at once and shifts through them in order to limit pressure on the load ports.

Even without getting into SIMD algorithms you could load 8 bytes at a time and pretty easily go faster than that, possibly while using the multiplication instruction for mixing.
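Something along these lines, for instance (a rough sketch, not a tuned or analyzed hash; the mixing constant is simply the Fibonacci one, and the tail bytes are left out):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Consume the input 8 bytes per iteration: one load plus one
       multiply per word instead of eight table lookups. */
    static uint64_t mul_mix_hash(const uint8_t *p, size_t len)
    {
        uint64_t h = 0;
        while (len >= 8) {
            uint64_t w;
            memcpy(&w, p, 8);                       /* single 8-byte load */
            h = (h ^ w) * 11400714819323198485ull;  /* single multiply    */
            p += 8;
            len -= 8;
        }
        /* Tail bytes (len % 8) are left out for brevity. */
        return h;
    }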

This of course ignores that we are not looking up values from a hash table in a vacuum. Other code will also be competing for the cache, and that generally means that everything runs slower because of more cache misses.


How do you fill the array though? Wouldn't filling it with random numbers give you a different hash each time you rebuild the hash function? I can see it being useful for a short-lived data structure, but you wouldn't be able to use it as a shared deterministic hash function?


For in-memory tables you rarely need determinism across instances, let alone runs.


Why couldn't you fill it deterministically?


I guess that was my question indeed. In the sense of how do you do it in practice? I suppose there are pseudorandom algorithms that can be easily applied.


Pseudorandom bit sequences (PRBS) are deterministic and very easy to implement (just linear feedback shift registers).

Something similar is actually done in communication systems, with scrambling, to prevent long strings of transmitted ones or zeros (which cause issues for some of the hardware components). Essentially you just add or multiply the data with the PRBS sequence. At the receiver you just do the reverse operation.
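For example, a tiny Galois LFSR is enough to fill the tables reproducibly (a minimal sketch; 0xB400 is a standard maximal-length tap mask for 16 bits):

    #include <stdint.h>

    /* 16-bit Galois LFSR: the same nonzero seed always produces the
       same sequence, so the tables come out identical on every run. */
    static uint16_t lfsr_next(uint16_t state)
    {
        unsigned lsb = state & 1u;
        state >>= 1;
        if (lsb)
            state ^= 0xB400u;
        return state;
    }

Seed it with any fixed nonzero value and reduce the output to the table size, and every run builds identical tables.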


The replies to this comment are top HackerNews: every commenter is a bigger expert than Donald Knuth but nobody quotes any actual benchmark results to go with their theories.


It isn't fair to expect the same rigor from a comment.

But you don't have to be a bigger expert than Knuth to dismiss an optimization done for hardware from, say, 40 years ago (I don't know the circumstances around this particular case).

Even in that case though it might still be relevant for embedded CPUs.


So it’s just some unfounded handwaving? You can just dismiss one of the greatest minds in computing science by just blathering in a comment, because then it’s ‘unfair to assume rigor’?

If it is all so clear and all the armchair experts here have ample experience in the field like they pretend, why is it so hard to run a few benchmarks?


It's not just handwaving. It's knowing the context. As late as the mid 80's it was not unusual for multiplication to take tens of cycles if your CPU even had a built-in multiplication instruction, and far worse if it didn't, while memory loads were often cheap in comparison. On the M68000 in the Amiga, for example, a MULU (unsigned 16-bit -> 32-bit multiplication) took from 38 to 70 cycles plus any cycles required for memory access, and if you needed to multiply numbers larger than 16 bits and rely on the overflow, as in the case of this article, you'd need multiple instructions anyway; a memory read indirect via an address register, by contrast, could be down to 8 cycles. So until years after that it did make sense to do all kinds of things to reduce the need for multiplication. But it doesn't any more.

While it'd be worthwhile doing tests to confirm a specific case, the default assumptions have changed: Today memory is slow and multiplication fast (in terms of cycles; in absolute terms both are of course far faster).

You certainly should not today pick a more complex hashing scheme to try to avoid a multiplication without carefully measuring it, just because it was discussed even by someone as smart as Knuth in a context where the relative instruction costs were entirely different.

If you're actually using the function as the primary hash function, then the distribution of the output might well make up for a significant performance difference, so this is not to suggest that tabulation hashing isn't a worthwhile consideration.


How is it unfounded? How is it unreasonable to question the relevance of an optimization from a time where the relevant parts of computing were completely different?

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.", Donald Knuth

Maybe doing benchmarks for a comment isn't worthwhile. I guarantee you'd have to do different benchmarks for different contexts anyway, so I couldn't blindly trust a single benchmark. Not to say it wouldn't be interesting.

I wouldn't mind someone plotting the cost of instructions over time and how that affects choice of algorithms. But to expect that from a comment?


So here’s a paper where the author has run the experiment:

https://arxiv.org/pdf/1011.5200.pdf

And the answer is that the speed is comparable to other functions, which however produce worse results.


A couple of issues with that interpretation:

1. Dated hardware, so the hashing algo speed comparison is no longer relevant without redoing it, but even on hardware that old a 2.2x-2.8x speed advantage for mul+shift is substantial.

2. No tests with contention for the cache; no tests with different table sizes; no code given. As a result it's impossible to tell if the performance numbers are relevant and realistic.

3. If they could demonstrate substantially better distribution, it might still be very worthwhile despite how much slower it is, but they test the hashing algorithms with runs of 100 random constants. We don't know if any of those constants are any good because they've not given them, but odds are highly against 100 random constants even approaching good. As such the comparisons of tabulation hashing with the other hashing methods are meaningless in terms of performance (but see below) - it's trivial to find constants for multiplication + shift that produce pathologically bad outcomes.

What the paper does appear to show is that tabulation hashing might have more predictable runtime given the result on the specific set of structured input they test with, and that might well be a good reason to use it for some applications.

But that is tainted by the lack of transparency in what they've actually compared against.

(This is also mostly relevant if you're considering using a multiply-shift based hash function, which is also not what the original article is advocating you use Fibonacci hashing for.)


The hardware is not that dated; they specifically target then-modern hardware that has the fast multiplication everyone is on about. And while multiplication has gotten faster, the size of the L1 cache has of course also grown. It is per-core, so there is not going to be a whole lot of contention going on. A modern budget CPU has about 32 KB per core, so it's not going to be a squeeze.

I am no expert in math but the algorithm is claimed to be better than another one because it is in a class that is better than the class the other is in. It’s not because a run of 100 shows some distribution.

You have a lot of demands for exhaustive testing but when you are asked to provide the same, it’s all too much to ask. ‘I wouldn’t mind someone plotting graphs’ yeah thanks, I wouldn’t mind someone else doing the work.

Then again someone else probably has done the work more recently or more in line with what you want to see. I found this paper in a few minutes of websearching, I'm sure you can spare the time to find a better one.


> It is per-core so there is not going to be a whole lot of contention going on.

That's only true if nothing else happens between requests to the hash table. That might be the case, or it might not. Depending on your workload that might make no difference or totally ruin your performance characteristics.

> I am no expert in math but the algorithm is claimed to be better than another one because it is in a class that is better than the class the other is in. It’s not because a run of 100 shows some distribution.

The problem with this is that while it may well have better characteristics than multiply and shift on average, the quality of the distribution of multiply and shift based hashes can vary by many orders of magnitude depending on the choice of multiplication factors. Put another way: Multiplying by 1 and shifting is a perfectly valid multiply and shift hash function. It's a very stupid one. The performance characteristics for a hash table doing that are nothing like the performance characteristics of what is proposed in the original article. I have no doubt that the table based approach will beat a large proportion of the multiply and shift hashes. But so do other multiply and shift hashes, by large factors. As such, without actually comparing against a known set of some of the best multiply-shift hashes we learn very little about whether or not it'll do well against good multiply-shift hashes.

To me, the fact that they chose random factors is very suspicious. Nobody uses random factors. The effort spent on choosing good factors over the years has been very extensive, and even hacky, ad hoc attempts will tend to use large prime numbers.
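To make the point concrete, here is the multiply-shift shape in question, parameterized by its constant (a toy sketch for a 1024-slot table, not anyone's production hash):

    #include <stdint.h>

    /* Multiply-shift into 2^10 = 1024 buckets. */
    static uint64_t mul_shift(uint64_t key, uint64_t factor)
    {
        return (key * factor) >> (64 - 10);
    }

With factor 1, every key below 2^54 lands in bucket 0; with a large odd constant such as 2^64/phi, the top bits depend on the whole key. Same family, wildly different behavior.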

> You have a lot of demands for exhaustive testing but when you are asked to provide the same, it’s all too much to ask. ‘I wouldn’t mind someone plotting graphs’ yeah thanks, I wouldn’t mind someone else doing the work.

I've not made demands for anything. I've pointed out that making a blanket claim that multiplication is bad when the performance characteristics have changed as much as they have is unreasonable, and a paper like this tells us pretty much nothing more. It's absolutely reasonable to consider table based approaches; it's quite possible, even likely they'll have desirable properties for various sets of inputs - there is no such thing as a perfect hash function for all inputs, and sometimes you care most about pathological worst case scenarios, sometimes you care about averages, sometimes you know what data you will or won't see. That it performs as well as they've shown it to means there are almost certainly situations where it will be a good choice. But because of the choices they made in that paper we can't really tell when and where that would be, and that's a shame.

What is not reasonable is just writing off the use of multiplication on the basis of hardware characteristics that are decades out of date. That doesn't mean you should blindly use that either.

If there's one thing people should know about working with hash tables it's that you should test and measure for the actual type of data you expect to see.


Sorry, all I can do is suggest you come back tomorrow and reread the comment you wrote.


I have and stand by every word, but seeing as you have no arguments and resort to implied insults we are done here.


Sorry, if you start making up giant arguments against claims I have not made there really is no response. But clearly you don’t see that.


It is funny: the paper claims to test the 64-bit code on a "Dual-core Intel Xeon 2.6 GHz 64-bit processor with 4096KB cache". That is a really poor description, as it does not tell us what architecture the processor is. But one can go through a list of all Intel Xeon processors to find the ones that match the description. Turns out that there are none.

If we broaden the search to 2.66 GHz processors there are 4: 5030, 5150, 3070 and 3075. All released in 2006 and 2007. This means it is either one of the last "NetBurst" CPUs or one of the first "Core" CPUs. Assuming "Core" the relevant operation has a 5 clock latency, as best I can tell. This is down to 3 clocks on pretty much all modern X86 CPUs. Modern CPUs also get an extra load port, so I doubt the relative difference is much different on modern CPUs.

Overall it looks like a pretty bad benchmark, thrown into a paper on collision likelihood, which itself looks like an academic exercise with no relevance for the real world.


Embedded CPUs would have much worse memory access latency, and a lot less memory to spare - so if anything wasting memory on tables is likely to perform worse as well.


How do you mean? Measured in cycles embedded devices typically have less latency to SRAM.


You're right, I guess. On devices where the memory (SRAM) tends not to have its own clock (and there is no OOO), it can effectively be one CPU cycle. PIC and ESP32 come to mind. If there is 'extra' off-chip memory, it'd be way worse, of course.


Why? Eg https://news.ycombinator.com/item?id=35760954 quotes some (vague) numbers.



