Hacker News new | past | comments | ask | show | jobs | submit login

The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.

It's very cheap to detect station name collisions - just sample ~10,000 points in the file, and hope to find at least one of each station. If you find less stations than the full run, you haven't yet seen them all, keep hunting. If you find two that collide, fall back to the slow method. Checking 0.00001% of the dataset is cheap.




But how would you detect that a station name is colliding or not colliding? With a hash set?


Search for cuckoo hashing. It's a whole thing for data structures.


Cuckoo doesn't aid in detecting collisions, the algorithm is about what happens IF a collision is found. The whole reason we're hashing in the first place is to not have to linearly compare station names when indexing.

In other words: Cuckoo is a strategy to react to the case if two values map to the same hash. But how to know weather you have two different values, or two of the same, if they have an identical hash?


It seems like you have to run the slow collision-resistant thing over the whole file.


Collision detection is far cheaper. One method:

Simply add up all the bytes of all seen placenames while running your fast algorithm. This is effectively a checksum of all bytes of placename data.

Then at the end, calculate the 'correct' name-sum (which can be done cheaply). If it doesn't match, a collision occurred.


You can design inputs that defeat this scheme though. I thought the point was to be correct for all valid inputs.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: