The emphasis in this paper was not on "efficiency" but rather on phylogenetic grouping. The authors used the "Universal Declaration of Human Rights", which apparently has 52 translations, as the subject text. See Fig. 13 and section 5.2 for details. Instead of just compressing and normalizing, they developed a normalized compression distance: compress the concatenation of two texts, subtract the smaller of the two individually compressed sizes, and divide by the larger.
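A minimal sketch of that distance in Python, using zlib as a stand-in for the stronger real-world compressors the paper experiments with:

    import zlib

    def clen(data: bytes) -> int:
        # compressed length of data, zlib at maximum compression level
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy = clen(x), clen(y)
        return (clen(x + y) - min(cx, cy)) / max(cx, cy)

The better the compressor approximates the true information content, the closer this gets to the (uncomputable) normalized information distance the paper builds on.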
I think that using the same encoding for all the languages, plus some kind of normalization, would wash out differences in the results that are due only to the encodings.
I did find "Clustering by Compression", https://arxiv.org/abs/cs/0312044