
Why sha256 hash the user info to get a two-character target directory? Wouldn't md5 be much faster and solve the same problem?
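For anyone skimming, the scheme in question can be sketched roughly like this. It is a minimal sketch, assuming the key is a string user ID and one file per user; `bucket_for`, `db_path`, and the `<root>/<bucket>/<user_id>.db` layout are hypothetical names for illustration, not the actual implementation.

```python
import hashlib
from pathlib import Path


def bucket_for(user_id: str) -> str:
    """Return a two-hex-character bucket name taken from the SHA-256 digest."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return digest[:2]  # first two hex characters -> one of 256 directories


def db_path(root: Path, user_id: str) -> Path:
    # Hypothetical layout: <root>/<bucket>/<user_id>.db
    return root / bucket_for(user_id) / f"{user_id}.db"
```

The hash only decides which directory a file lands in; the file name itself still carries the full ID.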



At a guess: that hash is performed relatively few times, so any performance difference is lost in the noise floor. Never having to answer "why did you use this insecure hash" or eliminating/minimising any possibility of a class of security problem is worth more.


This has nothing to do with security. It's just wasted CPU. I imagine you have to do this every time you make a query to look up the user's DB?

Security is not a concern here. It's just literally bucketing ids. Also, this is not needed with modern file systems.


all modern server CPUs have intrinsics for sha256, it just doesn't matter CPU-wise


It looks like for something the size of a UUID on Node 18, on my 8th Gen i7 (main machine broken) MD5 is only 10% faster. I guess I was remembering a time when it was like twice as fast... Neat. :)
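A comparison like this is easy to rerun. Below is a rough sketch in Python rather than Node, assuming a UUID-sized payload; `time_hash` is a made-up helper, and the numbers will vary with the CPU and with how the runtime's crypto library was built.

```python
import hashlib
import timeit
import uuid


def time_hash(name: str, payload: bytes, number: int = 100_000) -> float:
    """Seconds taken to hash `payload` `number` times with the named algorithm."""
    return timeit.timeit(lambda: hashlib.new(name, payload).digest(), number=number)


if __name__ == "__main__":
    # A UUID string is 36 bytes, roughly the input size discussed above.
    payload = str(uuid.uuid4()).encode("ascii")
    for name in ("md5", "sha256"):
        print(f"{name}: {time_hash(name, payload):.3f}s")
```

For inputs this small, per-call overhead tends to dominate, which is one plausible reason the gap looks narrow.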


At their scale maybe they're worried about collisions?

Or, like me, they're drowning in security tooling from corporate and don't want to have to carve out exceptions for md5 usage in each.


> At their scale maybe they're worried about collisions?

With their scheme, collisions are already guaranteed to happen if they have >256 users.
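That is just the pigeonhole principle: two hex characters give 16 × 16 = 256 possible directories, so user number 257 has to share one. A small sketch, assuming random UUIDs as stand-in user IDs (`bucket_for` and `distinct_buckets` are illustrative names):

```python
import hashlib
import uuid


def bucket_for(user_id: str) -> str:
    """First two hex characters of the SHA-256 digest."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:2]


def distinct_buckets(n_users: int) -> int:
    """Count how many different buckets n random user IDs land in."""
    ids = (str(uuid.uuid4()) for _ in range(n_users))
    return len({bucket_for(user_id) for user_id in ids})


if __name__ == "__main__":
    # Can never exceed 256, no matter which hash produced the two characters.
    print(distinct_buckets(257))
```

In practice sharing starts well before 257 users, which is fine, since sharing a directory is the whole point.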


I guess parent meant abusable non-uniform distribution of collisions (they have collisions anyway, as they take only the first two characters according to GP comment)


It could be they didn't want to explain the md5 usage, yeah. But that's kinda nuts if they do this every query.


It's probably not healthy to have broken cryptographic hashes running around. If you don't need a secure hash there are plenty of fast non-cryptographic hashes.


There's nothing about security here. By this logic you should probably stop using hashmaps, then? :)


That's literally not their logic.

They said:

if you need security don't use md5.

If you don't need security, use something faster than md5.

md5 is neither secure nor fast, why use it at all?


That's dumb. Security is not a component here. They literally just want to put the files into buckets because filesystem. Also MD5 is much faster in my experience and on any benchmark I can find.


It's really not dumb.

MD5 was designed as a cryptographic hash. It's certainly faster than many other cryptographic hashes, and it's also insecure.

Many, many hashes that are not designed to be cryptographic are 10x or more faster than md5 (e.g. murmur, siphash, xxhash, and many more).

No one thinks security is a component here. Literally everyone in this thread has agreed to that fact.

What's actually kinda dumb is to use any cryptographic hash when security is not needed - there's no need to weigh the tradeoffs between md5, sha, blake, etc. Those hashes are universally slow compared to the likes of xxhash - why not just use that to bin the files into subdirectories? Why limit yourself to the slow subset of hashes if you don't need the specific properties that make them slow?
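As a rough illustration of binning without a cryptographic hash: xxhash needs a third-party package, so this sketch swaps in `zlib.crc32` from the Python standard library as a stand-in. CRC32 is a checksum rather than one of the hashes named above, and `fast_bucket` is a hypothetical helper.

```python
import zlib


def fast_bucket(user_id: str, buckets: int = 256) -> str:
    """Map an ID to a hex bucket name using a non-cryptographic checksum."""
    # Stand-in for xxhash/murmur: CRC32 is in the standard library.
    checksum = zlib.crc32(user_id.encode("utf-8"))
    return f"{checksum % buckets:02x}"


if __name__ == "__main__":
    print(fast_bucket("123456789"))
```

The caveat raised earlier still applies: if IDs are attacker-chosen, a non-keyed hash lets someone pile files into a single bucket on purpose.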


This is probably not about collisions but about filesystem limitations (max number of files in a directory).


I've done something similar and that's absolutely what it was. I'm no pro, knew I wasn't doing it the right way, but it was for a personal side project and Windows starts to get weird when you have a million files in a single directory.


having a good hash uniformly distribute content helps scaling (by sharding of data)
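One way to sanity-check that claim is to count how many IDs fall into each bucket. A minimal sketch, assuming sequential made-up IDs of the form `user-<n>` and the two-hex-character SHA-256 prefix discussed in this thread; `bucket_counts` is an illustrative name.

```python
import hashlib
from collections import Counter


def bucket_counts(n_ids: int) -> Counter:
    """Count how many synthetic IDs fall into each two-hex-character bucket."""
    counts = Counter()
    for i in range(n_ids):
        digest = hashlib.sha256(f"user-{i}".encode("utf-8")).hexdigest()
        counts[digest[:2]] += 1
    return counts


if __name__ == "__main__":
    counts = bucket_counts(51_200)  # about 200 per bucket if spread evenly
    print(len(counts), min(counts.values()), max(counts.values()))
```

Even with highly regular input IDs, the per-bucket counts should stay close to the average, which is what makes the prefix usable as a shard key.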



