
However, using a Hilbert Curve for sharding doesn't seem like the best approach.

Yes, that's also what I thought. Searching for "same size k-means" yields a simple postprocessing step to even out the clusters produced by the usual k-means algorithm.

EDIT: k-means is adapted directly here: https://elki-project.github.io/tutorial/same-size_k_means
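For illustration, here is a minimal sketch of what such a postprocessing step could look like, assuming the centroids have already been computed by ordinary k-means. The function name `balanced_assign` and the greedy ordering are my own assumptions for this example, not the ELKI implementation: each cluster gets a capacity of ceil(n / k), and points that would lose the most by missing their nearest cluster are placed first.

```python
import math

def balanced_assign(points, centroids):
    """Greedily assign each point to its nearest centroid that still has room.

    Hypothetical helper for illustration only; not taken from ELKI.
    """
    n, k = len(points), len(centroids)
    capacity = math.ceil(n / k)  # maximum number of points per cluster
    dist = [[math.dist(p, c) for c in centroids] for p in points]
    # Place first the points with the biggest gap between best and worst cluster.
    order = sorted(range(n), key=lambda i: min(dist[i]) - max(dist[i]))
    sizes = [0] * k
    labels = [None] * n
    for i in order:
        # Try clusters from nearest to farthest and take the first with room.
        for j in sorted(range(k), key=lambda j: dist[i][j]):
            if sizes[j] < capacity:
                labels[i] = j
                sizes[j] += 1
                break
    return labels
```

This only guarantees that no cluster exceeds ceil(n / k) points; when n is divisible by k the clusters come out exactly equal in size. It does not update the centroids afterwards, which a full same-size k-means would do.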



In my experience the linked algorithm behaves quite poorly when the dataset and the number of desired clusters become large. I think the issue is that it only allows pairwise switching between clusters, which leaves a lot of points still assigned to a clearly wrong cluster. Some day I would like to try more complex neighbourhood searches with it and see if that helps.
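To make the pairwise switching concrete, here is a simplified sketch of one refinement pass. It is an assumption-laden illustration and not the ELKI code: `swap_pass` is a hypothetical name, and it naively checks every pair of points in different clusters, swapping their labels whenever that lowers the summed distance to the centroids.

```python
import math

def swap_pass(points, centroids, labels):
    """Run one pass of pairwise label swaps in place; return the swap count.

    Hypothetical helper for illustration only; cluster sizes never change.
    """
    swaps = 0
    n = len(points)
    for a in range(n):
        for b in range(a + 1, n):
            la, lb = labels[a], labels[b]
            if la == lb:
                continue  # same cluster, nothing to exchange
            current = (math.dist(points[a], centroids[la])
                       + math.dist(points[b], centroids[lb]))
            swapped = (math.dist(points[a], centroids[lb])
                       + math.dist(points[b], centroids[la]))
            if swapped < current:
                # Exchange the two labels, keeping both cluster sizes fixed.
                labels[a], labels[b] = lb, la
                swaps += 1
    return swaps
```

Because every move is an exchange between exactly two clusters, a point that would need a chain of moves through three or more clusters to reach its natural cluster can stay stuck, which is the kind of case a larger neighbourhood search would target.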



