Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

As far as I see the currently best performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?

[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...



Yes, you are right. This came up yesterday and indeed two solutions were violating the "must work for all station names" rule by relying on specific hash functions optimized for the specific data set, which I unfortunately missed during evaluation. I've just removed these entries from the leaderboard for the time being. Both authors are reworking their submissions and then they'll be added back.

[0] https://twitter.com/mtopolnik/status/1742652716919251052


Have you considered having two datasets? The "development" dataset which is public, and the "competition" dataset which is private?

That might help against algorithms that are (accidentally, or on purpose) tuned to the specific dataset.


Yeah if you are releasing the competition dataset then it can and will 100% be gamed. What is to stop me from just hardcoding and printing the result without any computation?


Your code is submitted (per parent comment)


Sure, but who is manually checking every entry?


You don't need to check every entry, just the top few.


The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.

It's very cheap to detect station name collisions - just sample ~10,000 points in the file, and hope to find at least one of each station. If you find less stations than the full run, you haven't yet seen them all, keep hunting. If you find two that collide, fall back to the slow method. Checking 0.00001% of the dataset is cheap.


But how would you detect that a station name is colliding or not colliding? With a hash set?


Search for cuckoo hashing. It's a whole thing for data structures.


Cuckoo doesn't aid in detecting collisions, the algorithm is about what happens IF a collision is found. The whole reason we're hashing in the first place is to not have to linearly compare station names when indexing.

In other words: Cuckoo is a strategy to react to the case if two values map to the same hash. But how to know weather you have two different values, or two of the same, if they have an identical hash?


It seems like you have to run the slow collision-resistant thing over the whole file.


Collision detection is far cheaper. One method:

Simply add up all the bytes of all seen placenames while running your fast algorithm. This is effectively a checksum of all bytes of placename data.

Then at the end, calculate the 'correct' name-sum (which can be done cheaply). If it doesn't match, a collision occurred.


You can design inputs that defeat this scheme though. I thought the point was to be correct for all valid inputs.


What are the constraints on station names - i.e. min length, max length, maximum number of different names?


Just clarified this in the README:

* Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit

* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported


The README says max length 100 bytes, which I suppose we can (?) assume are octets. Also, it mentions that you can assume the station string does not contain the separator ';'.

I guess the station string is also supposed to be free of control characters like newlines, though spaces are allowed. This, however, is not stated.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: