Hash Functions

antirez · on Dec 10, 2011

It is worth adding that if your goal using an hash function is sharding across different servers the CRC function can be good, especially the CRC16 function, that has a smaller output than the CRC32, tends to perform very well with many common "key name" patterns.

It's also worth to note that the CRC16 (and CRC32) function will map every possible input of 16 (or 32) bits to a different output, without collisions, similarly to a block cipher.

EDIT: I would take this web site with a bit of salt, the description of the "One Way" concept in the home page is, as far as I can tell, not correct:

"Another important goal is that "similar" input vectors produce very different output vectors. For the case of table lookup, table keys are usually either human language strings, index numbers, or other data that exhibit non-randomness. For the case of message digest functions, the goal is to make it as difficult as possible to find two different messages that result in the same hash. A hash function with this property is called a one-way hash function."

pja · on Dec 10, 2011

It's sort of almost correct. He's fumbling his way towards the avalanche property: for a perfect cryptographic hash if you flip a single bit in the input, on average half the bits in the output will change.

http://en.wikipedia.org/wiki/Avalanche_effect

antirez · on Dec 10, 2011

Yes, but the Avalanche effect does not mean it is not invertible. It is possible to design a one way hash function with remarkable avalanche (50% of flipped bits on average after 1 bit change in input) effect but that is easily invertible.

The simplest would be an hash function that outputs an N-bit hash where M bits are a pseudo-random sequence of bits (depending on the input, let's say SHA1(input)) that are used to key a stream cipher, and the rest N-M bits are the XOR between the stream cipher output and part of the input of the hash function.

Perfect Avalanche effect, but the function is invertible.

jacquesm · on Dec 10, 2011

Only in that special case where the input <= the output of the hash function, as in your example earlier in the thread.

antirez · on Dec 11, 2011

An hash function is invertible enough if it will leak the first three bytes of the input. Does not need to be invertible as F^-1 in mathematical notation.

JoshTriplett · on Dec 10, 2011

You should not use a simple hash like a CRC for sharding if you need to remain robust against hostile users. Otherwise the users can bypass the sharding and execute a denial of service.

To give an example, the Linux kernel's network stack uses cryptographically secure hash functions in its hash tables, to ensure that a hostile network cannot cause the kernel's hash tables to devolve into linked lists and thus break the O(1) performance expectations of the kernel.

antirez · on Dec 10, 2011

This can only be fixed with a crypto hash function, that IMHO would be a premature optimization here. Users having this potential problem (key names are result of external user inputs) can simply prefix the key with HMAC(key,secret) truncated to a few bytes.

JoshTriplett · on Dec 11, 2011

At that point why not just use the HMAC and drop the weak hash entirely?

thinkbohemian · on Dec 10, 2011

antirez, can you recommend some good hashing resources?

antirez · on Dec 10, 2011

I use the well know resources: wikipedia, a few web sites with many hash functions benchmarked that will show up if you google "hash functions".

However this is very interesting IMHO if you want to understand the inner working of CRC: http://www.ross.net/crc/crcpaper.html

If you are interested in cryptographic hash functions a good introduction is "Applied Cryptography" that can be integrated with "Handbook of Applied Cryptography" (all chapters free here: http://www.cacr.math.uwaterloo.ca/hac/).

xtacy · on Dec 10, 2011

Would you recommend this source as well?

http://burtleburtle.net/bob/hash/

zohebv · on Dec 10, 2011

He did muddle around the definition of one way functions, but the information provided is useful.

http://home.comcast.net/~bretm/hash/8.html This seems to indicate that CRC is not a good fit for hashing.

I have used http://burtleburtle.net/bob/hash/ before and it has worked very well for all the use cases I have thrown at it.

eis · on Dec 10, 2011

If you're looking for a cutting-edge non-cryptographic hash function, then I'd suggest using MurmurHash 3 created by Austin Appleby.

Excellent performance with great collision properties.

http://code.google.com/p/smhasher/ or more specifically http://code.google.com/p/smhasher/wiki/MurmurHash3

There exist 32, 64 and 128 bit versions.

pjscott · on Dec 10, 2011

The easy way to use MurmurHash 3 is to grab the C version of the code from here:

https://github.com/PeterScott/murmur3

Full disclosure: I did the port to C, and added the documentation and example code. </plug>

jwn · on Dec 10, 2011

I'll wait for the winner of NIST's SHA-3 competition: http://csrc.nist.gov/groups/ST/hash/sha-3/Round3/index.html

psykotic · on Dec 10, 2011

He said non-cryptographic.

kenjackson · on Dec 10, 2011

I must admit that I find it a little disappointing that an article of this length on hash functions doesn't mention universal hashing at all.

jemfinch · on Dec 10, 2011

As far as I know, no one anywhere uses universal hashing. Are there some modern, serious uses of universal hashing that I'm missing?

kenjackson · on Dec 10, 2011

UMACs, for example, are based on them. And it would seem to me to understand the power and flaws of hashing, knowing universal hashing seems important -- even if you never use them.

thinkbohemian · on Dec 10, 2011

Quick link for the rest of us: http://en.wikipedia.org/wiki/Universal_hashing

Roboprog · on Dec 10, 2011

Hash functions are generally not reversible, unless you were to very carefully define the input domain. Given the "similar inputs" -> "quite different outputs" ("avalanche") effect described here and the article, I wonder this:

Is it possible to describe a given hash function as a "cyclic" function, given some kind of transformation on the input to create a matrix/tensor out of the input -- are there common methods to re-order the inputs in such a way that there are discernible "axis" that exhibit cyclic behavior? I suppose such an analysis would be a way to determine cryptographic fitness of a function, but I am only speculating.

ypcx · on Dec 10, 2011

It should be duly noted that hashing used for table lookup is a workaround technique, a hack to overcome the inherent inefficiency of the Von Neumann design.

mmphosis · on Dec 10, 2011

Interesting. Would you elaborate, and/or give an example of some non-Von Neumann designs and how they accomplish table lookups more efficiently?

wisty · on Dec 10, 2011

Note, Python's hash function is a very bad hash function. It usually maps sequential values (i.e. 1,2,3 ..) to sequential hash values (i.e. 1,2 3 ...) - yes, hash(1) = 1. The advantage of this is, if you have sequential keys (which is very common), you minimize collisions better than more "random" distributions.

Of course, you couldn't use it for cryptography.

slig · on Dec 10, 2011

If it was made that way by design, for that specific use, how can it be "very bad"?

anfractuosity · on Dec 11, 2011

One type of hashing that I hadn't heard of before that I think is really cool, is fuzzy hashing.

Which allows you to compare hashes, to find the similarity of data.

For an example: http://ssdeep.sourceforge.net/

thinkbohemian · on Dec 10, 2011

This is great, thanks to Bret for writing this.