As far as I see the currently best performing solution [0] does not account for ...

gunnarmorling · on Jan 4, 2024

Yes, you are right. This came up yesterday and indeed two solutions were violating the "must work for all station names" rule by relying on specific hash functions optimized for the specific data set, which I unfortunately missed during evaluation. I've just removed these entries from the leaderboard for the time being. Both authors are reworking their submissions and then they'll be added back.

[0] https://twitter.com/mtopolnik/status/1742652716919251052

rkangel · on Jan 4, 2024

Have you considered having two datasets? The "development" dataset which is public, and the "competition" dataset which is private?

That might help against algorithms that are (accidentally, or on purpose) tuned to the specific dataset.

paxys · on Jan 4, 2024

Yeah if you are releasing the competition dataset then it can and will 100% be gamed. What is to stop me from just hardcoding and printing the result without any computation?

refulgentis · on Jan 4, 2024

Your code is submitted (per parent comment)

paxys · on Jan 4, 2024

Sure, but who is manually checking every entry?

946789987649 · on Jan 4, 2024

You don't need to check every entry, just the top few.

londons_explore · on Jan 4, 2024

The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.

It's very cheap to detect station name collisions - just sample ~10,000 points in the file, and hope to find at least one of each station. If you find less stations than the full run, you haven't yet seen them all, keep hunting. If you find two that collide, fall back to the slow method. Checking 0.00001% of the dataset is cheap.

posix86 · on Jan 4, 2024

But how would you detect that a station name is colliding or not colliding? With a hash set?

viraptor · on Jan 4, 2024

Search for cuckoo hashing. It's a whole thing for data structures.

posix86 · on Jan 11, 2024

Cuckoo doesn't aid in detecting collisions, the algorithm is about what happens IF a collision is found. The whole reason we're hashing in the first place is to not have to linearly compare station names when indexing.

In other words: Cuckoo is a strategy to react to the case if two values map to the same hash. But how to know weather you have two different values, or two of the same, if they have an identical hash?

anonymoushn · on Jan 4, 2024

It seems like you have to run the slow collision-resistant thing over the whole file.

londons_explore · on Jan 4, 2024

Collision detection is far cheaper. One method:

Simply add up all the bytes of all seen placenames while running your fast algorithm. This is effectively a checksum of all bytes of placename data.

Then at the end, calculate the 'correct' name-sum (which can be done cheaply). If it doesn't match, a collision occurred.

anonymoushn · on Jan 4, 2024

You can design inputs that defeat this scheme though. I thought the point was to be correct for all valid inputs.

thomaswue · on Jan 4, 2024

What are the constraints on station names - i.e. min length, max length, maximum number of different names?

gunnarmorling · on Jan 4, 2024

Just clarified this in the README:

* Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit

* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported

ykonstant · on Jan 7, 2024

The README says max length 100 bytes, which I suppose we can (?) assume are octets. Also, it mentions that you can assume the station string does not contain the separator ';'.

I guess the station string is also supposed to be free of control characters like newlines, though spaces are allowed. This, however, is not stated.