As far as I see the currently best performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?
Yes, you are right. This came up yesterday and indeed two solutions were violating the "must work for all station names" rule by relying on specific hash functions optimized for the specific data set, which I unfortunately missed during evaluation. I've just removed these entries from the leaderboard for the time being. Both authors are reworking their submissions and then they'll be added back.
Yeah if you are releasing the competition dataset then it can and will 100% be gamed. What is to stop me from just hardcoding and printing the result without any computation?
The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.
It's very cheap to detect station name collisions - just sample ~10,000 points in the file, and hope to find at least one of each station. If you find less stations than the full run, you haven't yet seen them all, keep hunting. If you find two that collide, fall back to the slow method. Checking 0.00001% of the dataset is cheap.
Cuckoo doesn't aid in detecting collisions, the algorithm is about what happens IF a collision is found. The whole reason we're hashing in the first place is to not have to linearly compare station names when indexing.
In other words: Cuckoo is a strategy to react to the case if two values map to the same hash. But how to know weather you have two different values, or two of the same, if they have an identical hash?
- Station name: non null UTF-8 string of min length 1 character and max length 100 characters
- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit
* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported
The README says max length 100 bytes, which I suppose we can (?) assume are octets. Also, it mentions that you can assume the station string does not contain the separator ';'.
I guess the station string is also supposed to be free of control characters like newlines, though spaces are allowed. This, however, is not stated.
[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...