It is worth adding that if your goal in using a hash function is sharding across different servers, a CRC function can be a good choice. CRC16 in particular, which has a smaller output than CRC32, tends to perform very well with many common "key name" patterns.
It's also worth noting that CRC16 (and CRC32) will map every possible input of exactly 16 (or 32) bits to a different output, without collisions, similarly to a block cipher.
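This bijectivity is easy to check exhaustively for CRC16. A minimal sketch, assuming the CCITT/XModem variant (polynomial 0x1021, zero initial value — other CRC16 variants behave the same way on fixed-width input):

```python
def crc16(data: bytes, poly: int = 0x1021, init: int = 0x0000) -> int:
    """Bit-by-bit CRC16 (CCITT/XModem parameters)."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

# Every distinct 16-bit input maps to a distinct 16-bit output:
outputs = {crc16(n.to_bytes(2, "big")) for n in range(2**16)}
assert len(outputs) == 2**16  # no collisions: the map is a bijection
```

The reason is that a CRC is a linear map over GF(2), and multiplying by x^16 modulo a generator polynomial with a nonzero constant term is invertible, so equal-width inputs can never collide.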
EDIT: I would take this web site with a grain of salt; the description of the "One Way" concept on the home page is, as far as I can tell, not correct:
"Another important goal is that "similar" input vectors produce very different output vectors. For the case of table lookup, table keys are usually either human language strings, index numbers, or other data that exhibit non-randomness. For the case of message digest functions, the goal is to make it as difficult as possible to find two different messages that result in the same hash. A hash function with this property is called a one-way hash function."
It's sort of almost correct. He's fumbling his way towards the avalanche property: for a perfect cryptographic hash, if you flip a single bit in the input, on average half the bits in the output will change.
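You can see the avalanche property directly with any modern cryptographic hash. A quick sketch using SHA-256 (the specific inputs are arbitrary; they differ by a single bit):

```python
import hashlib

def hamming(a: bytes, b: bytes) -> int:
    # count the bits that differ between two equal-length byte strings
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

h0 = hashlib.sha256(b"\x00").digest()
h1 = hashlib.sha256(b"\x01").digest()  # same input with one bit flipped
diff = hamming(h0, h1)
# diff will be close to 128, i.e. roughly half of the 256 output bits
```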
Yes, but the avalanche effect does not imply that the function is not invertible. It is possible to design a hash function with a remarkable avalanche effect (50% of output bits flipped on average after a 1-bit change in the input) that is nonetheless easily invertible.
The simplest would be a hash function that outputs an N-bit hash where M bits are a pseudo-random sequence of bits depending on the input (say, SHA1(input)) used to key a stream cipher, and the remaining N-M bits are the XOR between the stream cipher output and part of the input of the hash function.
Perfect Avalanche effect, but the function is invertible.
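A minimal sketch of that construction, with SHA-1 producing the pseudo-random key part and, as a stand-in for a real stream cipher, a SHA-256-derived keystream (an assumption for brevity; the names and sizes are illustrative):

```python
import hashlib

def leaky_hash(data: bytes, n: int = 32, m: int = 16) -> bytes:
    """N-byte 'hash' with perfect avalanche that still leaks input bytes."""
    # M bytes: pseudo-random value derived from the whole input
    key = hashlib.sha1(data).digest()[:m]
    # keystream keyed by that value (stand-in for a real stream cipher)
    stream = hashlib.sha256(key).digest()[: n - m]
    part = data[: n - m].ljust(n - m, b"\x00")
    # N-M bytes: keystream XOR a chunk of the input
    return key + bytes(a ^ b for a, b in zip(stream, part))

def recover_prefix(h: bytes, n: int = 32, m: int = 16) -> bytes:
    """Anyone can invert the second half: the 'key' is in the output."""
    key, ct = h[:m], h[m:]
    stream = hashlib.sha256(key).digest()[: n - m]
    return bytes(a ^ b for a, b in zip(stream, ct))
```

Flipping any input bit changes the SHA-1 part completely (and with it the keystream), so the output shows textbook avalanche, yet `recover_prefix` extracts the first N-M bytes of the input from the hash alone.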
A hash function is "invertible enough" if it leaks, say, the first three bytes of the input. It does not need to be invertible as F^-1 in the mathematical sense.
You should not use a simple hash like a CRC for sharding if you need to remain robust against hostile users; otherwise users can bypass the sharding and mount a denial of service.
To give an example, the Linux kernel's network stack uses cryptographically secure hash functions in its hash tables, to ensure that a hostile network cannot cause the kernel's hash tables to devolve into linked lists and thus break the O(1) performance expectations of the kernel.
This can only be fixed with a cryptographic hash function, which IMHO would be a premature optimization here. Users facing this potential problem (key names that are the result of external user input) can simply prefix the key with HMAC(key, secret) truncated to a few bytes.
If you are interested in cryptographic hash functions, a good introduction is "Applied Cryptography", which can be complemented with the "Handbook of Applied Cryptography" (all chapters free here: http://www.cacr.math.uwaterloo.ca/hac/).
UMACs, for example, are based on them. And it seems to me that to understand the power and flaws of hashing, knowing universal hashing is important -- even if you never use it.
Hash functions are generally not reversible, unless you very carefully define the input domain. Given the "similar inputs" -> "quite different outputs" ("avalanche") effect described here and in the article, I wonder this:
Is it possible to describe a given hash function as a "cyclic" function? Given some kind of transformation on the input to create a matrix/tensor out of it, are there common methods to re-order the inputs in such a way that there are discernible "axes" exhibiting cyclic behavior? I suppose such an analysis would be a way to determine the cryptographic fitness of a function, but I am only speculating.
It should be duly noted that hashing used for table lookup is a workaround technique, a hack to overcome the inherent inefficiency of the Von Neumann design.
Note, Python's hash function is a very bad hash function in the avalanche sense. It usually maps sequential values (i.e. 1, 2, 3, ...) to sequential hash values (i.e. 1, 2, 3, ...) -- yes, hash(1) == 1. The advantage of this is that if you have sequential keys (which is very common), you minimize collisions better than with more "random" distributions.
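This is easy to verify in CPython (the identity holds for small integers; -1 is a known exception, hashed to -2 because -1 is an error sentinel in the C API):

```python
# In CPython, hash() of small positive integers is the identity,
# so sequential keys fill sequential hash-table slots with no collisions.
values = [hash(i) for i in range(1, 10)]
# values == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```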