I'm a little skeptical about that table though, because as you point out, there's not a great way of figuring "meaning per character". For some more (really flawed) comparison, the English version of The Tale of Genji is ~60k words and 224 pages, and the Chinese version is ~75k words and 300 pages. [1] [2]
So yeah, I'm skeptical about the claim that some languages have more meaning per character, and as a result you'll end up storing less text overall. I think it'd be cool to look at more data about this though, like (for example) stats from Treasure Data [3].
But regardless, it's pretty indisputable that UTF-16 is a lot better space-wise for languages whose characters mostly take three bytes in UTF-8, such as Chinese and Japanese. Mostly what I'm trying (quixotically) to say is that "UTF-8 the world" ignores a lot of the world.
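To make the space claim concrete, here is a rough sketch of the per-character byte counts in each encoding. The sample characters and the `encoded_sizes` helper are my own picks for illustration, not anything from the linked table.

```python
# Bytes needed for one character in UTF-8 vs UTF-16.
# The sample characters are arbitrary picks, one per script.
SAMPLES = {
    "Latin": "a",
    "Cyrillic": "д",
    "CJK": "高",
    "Emoji": "😀",
}

def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for a string, with no BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for script, ch in SAMPLES.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{script:9} {ch!r}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

The UTF-16 win only shows up for characters that need three bytes in UTF-8. Two-byte scripts like Cyrillic break even, ASCII doubles in size, and anything outside the Basic Multilingual Plane is four bytes either way.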
> the Chinese version is ~75k words and 300 pages. [1] [2]
Note that link [2] says "guess [of # of words] based on page count". 75k words, 300 pages is one statistic, not two separate statistics. (Estimating words based on page count is likely to be highly reliable, but still.)
> I'm skeptical about the claim that some languages have more meaning per character, and as a result you'll end up storing less text overall.
Your skepticism is unwarranted. From a pure information-theoretic perspective, the claim that some languages have more meaning per character is a slam dunk, and less than a second of examination proves it conclusively.
For example, an entire English novel is unlikely to use more than 256 unique characters. A Chinese novel couldn't use anywhere near that few without sounding incredibly artificial.
Here are some single-character words in modern Mandarin:
高 - high/tall
低 - low
大 - big
小 - small
最 - most (superlative marker)
更 - more (comparative marker)
到 - arrive
走 - leave (go away)
玩 - play (e.g. a game)
看 - look
听 - listen
贵 - expensive
爱 - love (verb)
恨 - hate (verb)
龍 - dragon
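As a rough sketch, you can count the encoded bytes for some of these words next to their English glosses. This is a hand-picked sample from the list above (the `PAIRS` table and `total_bytes` helper are just for illustration), so treat it as a sanity check and not a measurement of the languages.

```python
# Mandarin single-character words paired with the English glosses above.
# A hand-picked sample, not a representative corpus.
PAIRS = [
    ("高", "high"), ("低", "low"), ("大", "big"), ("小", "small"),
    ("到", "arrive"), ("看", "look"), ("听", "listen"),
    ("贵", "expensive"), ("爱", "love"), ("恨", "hate"), ("龍", "dragon"),
]

def total_bytes(words: list[str], encoding: str) -> int:
    """Sum the encoded size of every word in the given encoding."""
    return sum(len(w.encode(encoding)) for w in words)

mandarin = [m for m, _ in PAIRS]
english = [e for _, e in PAIRS]

for encoding in ("utf-8", "utf-16-le"):
    print(
        f"{encoding:10} Mandarin {total_bytes(mandarin, encoding):3} bytes, "
        f"English {total_bytes(english, encoding):3} bytes"
    )
```

Every character here is three bytes in UTF-8 and two in UTF-16, so even in UTF-8 the Mandarin column comes out smaller than the English one for this sample.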
For something more representative, here are some song lyrics -- I'll enclose every word of multiple characters. Unenclosed stretches of text are words of one character each:
How many distinct one-character words do you think English could practically support? How many twos? In this verse-and-a-half, there's one word that's always[1] three characters and two that have reached three characters by picking up a verb suffix, for a total of three words that are longer than two characters. (鼓起勇气 is kind of a special case, in that it's a well-known fixed expression, but its meaning is transparent as an ordinary combination of the two words 鼓起 and 勇气, which also see use outside the expression.)
[1] Actually, 什么 ["what"] has the vernacular contraction 啥, and this also applies to 为什么 ["why"], so you could argue that the word is sometimes just two characters.
Ugh, I was trying to get a good comparison of translations, but too bad I guess.
Yeah I mean, there's no question ideographic languages have more meaning per character than alphabetic languages. But other comparisons and considerations aren't as obvious:
- how do ideographic languages compare with each other?
- do people using ideographic languages write more?
- are ideographic languages as effective at compression as general (or special) compression algorithms?
Moving up the conceptual ladder from "average bytes per codepoint" to "average size of encoded tweet" (for example) is a big leap, is all I'm saying.
> are ideographic languages as effective at compression as general (or special) compression algorithms?
That one we know; the compression algorithms are more compressive. For example, compressed Chinese text takes up less space than the same text uncompressed. Ideographic languages are still languages that real humans have to use, and they feature redundancy because that helps everyone. Compression algorithms have the luxury of stripping that redundancy out.
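If you want to check this yourself, a minimal sketch with Python's `zlib` looks like the following. The `size_report` helper is hypothetical, and the sample text is an artificial repetition of single-character words, which exaggerates redundancy; a real comparison would need a real corpus.

```python
import zlib

def size_report(text: str) -> dict[str, int]:
    """Raw and zlib-compressed byte counts for text in UTF-8 and UTF-16."""
    report = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        report[f"{encoding} raw"] = len(raw)
        report[f"{encoding} zlib"] = len(zlib.compress(raw, 9))
    return report

# Artificial sample: 15 single-character words repeated 50 times.
# Far more repetitive than real prose; swap in a real corpus to compare.
sample = "高低大小最更到走玩看听贵爱恨龍" * 50

report = size_report(sample)
for label, size in report.items():
    print(f"{label:16} {size} bytes")
```

On real text the interesting number is how much of the raw UTF-8 vs UTF-16 gap survives compression, which this toy sample can't tell you.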
> how do ideographic languages compare with each other?
Of modern languages, only Chinese and Japanese could really be described as ideographic. (Japanese much more so than Chinese, in fact.) Chinese will have more meaning per character, because Japanese makes heavy use of the comparatively less meaningful kana. (Interestingly... it has to do this precisely because of its more ideographic nature.)
> do people using ideographic languages write more?
A related question is whether people using ideographic languages mostly write electronically when they do write, and I suspect yes.
Language Log has several articles on the phenomenon where Chinese speakers seem to be forgetting how to write characters by hand because of computer input methods. https://languagelog.ldc.upenn.edu/nll/?p=7142
Handwriting isn't exactly uncommon in China. Nobody's forgetting how to write characters they write all the time, and if they do need to write a character they can't quite remember, they can squiggle something and rely on the reader to understand it.
Handwriting is actually a moderate-level problem for me -- I can't recognize most handwritten characters. Often a special handwriting form will be used.
[1]: https://www.readinglength.com/book/isbn-4805314648
[2]: https://www.readinglength.com/book/isbn-7544717275
[3]: https://www.treasuredata.com/