The author of the article would have been wise to include this information.

I did find "Clustering by Compression" (Cilibrasi and Vitányi), https://arxiv.org/abs/cs/0312044

The emphasis in this paper was not on "efficiency" but on phylogenetic grouping. The authors used the "Universal Declaration of Human Rights", which apparently exists in 52 translations, as the subject text; see Fig. 13 and Section 5.2 for details. Instead of just compressing and normalizing, they developed a normalized compression distance (NCD): concatenate the two texts, compress, subtract the smaller of the two texts' compressed sizes, and divide by the larger, i.e. NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed size.
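
A minimal sketch of that distance, assuming zlib stands in for the compressor C (the paper itself used real compressors such as gzip, bzip2, and PPMZ):

  import zlib

  def C(data: bytes) -> int:
      # Compressed size of the input; zlib at level 9 is this
      # sketch's stand-in for the paper's compressors.
      return len(zlib.compress(data, 9))

  def ncd(x: bytes, y: bytes) -> float:
      # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
      cx, cy = C(x), C(y)
      cxy = C(x + y)
      return (cxy - min(cx, cy)) / max(cx, cy)

When two translations share vocabulary and structure, the concatenation compresses to nearly the size of the larger text alone, pushing the distance toward 0; unrelated texts push it toward 1. (One caveat with zlib: its 32 KB window limits how far back matches can reach in long documents.)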

I think that using the same encoding for all the languages, plus some kind of normalization, would wash out of the results any differences that are due only to the encodings.
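
As a hypothetical preprocessing step (nothing like this appears in the paper; the encoding handling here is an assumption), every file could be decoded, Unicode-normalized, and re-encoded so the compressor sees one consistent byte representation:

  import unicodedata

  def canonicalize(raw: bytes, source_encoding: str) -> bytes:
      # Decode from whatever encoding the file arrived in, apply NFC
      # so equivalent characters collapse to one code-point sequence,
      # then re-encode everything uniformly as UTF-8.
      text = raw.decode(source_encoding)
      return unicodedata.normalize("NFC", text).encode("utf-8")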
