Yeah, that is quite inconsistent. Kanji literally means "Chinese Character" so i...

jcranmer · on Dec 3, 2019

Arabic numerals (0123456789) are not to be confused with the Arabic numerals (٠١٢٣٤٥٦٧٨٩). So the fact that kanji literally means "Chinese character" doesn't mean that kanji and hanzi should be considered the same script.

The Latin script (that which I write right now) and the Cyrillic script both derived heavily from the Greek script, especially the capital letters--fully 60% of them are identical in Latin and Greek, even more if you include obsolete letters like digamma and lunate sigma (roughly F and C, respectively). Most of these homoglyphs furthermore share identical phonetic values.

In retrospect, treating traditional Chinese, simplified Chinese, and Japanese kanji as different scripts seems like it would have been the better path. I don't know enough about the Korean and Vietnamese usage of Chinese characters to know if those scripts are themselves independent daughter scripts or complete imports of Chinese with a few extra things thrown in (consider Farsi's additions to Arabic, or Icelandic's þ and ð additions to Latin).

tasogare · on Dec 3, 2019

> So the fact that kanji literally means "Chinese character" doesn't mean that kanji and hanzi should be considered the same script.

The Japanese writing system differs from the Chinese one by having its own distinct scripts (hiragana, katakana), but the subset made of Chinese characters (kanji) is mostly the same as the Chinese script. The most comprehensive Chinese character dictionary is a Japanese one (Daikanwa jiten), which gives definitions and Japanese readings, and this is possible precisely because the script is the same.

The only differences are characters created for use in Japan (kokuji), which can be treated as an extension like the Vietnamese Nôm, characters simplified by the Japanese government (some jôyô kanji), and variations in the shapes of some glyphs (黃/黄). So, treating the full inventory of these languages as different scripts wouldn't make more sense than encoding the English, French and Czech alphabets separately because a few characters differ.

My opinion is that Han unification makes sense, but the mistake was to leave the choice of variant to the application level (e.g. HTML lang tags), which is not portable. I don't know in detail how Unicode variant forms work (appending a trailing code to a character to indicate precisely which variant is meant), but something like that at the text-encoding level could ease a lot of pain.
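For readers unfamiliar with the "trailing code" idea, here is a rough Python sketch of how variation selectors sit in a string. The selector ranges (U+FE00..U+FE0F and U+E0100..U+E01EF) are part of Unicode; the helper names are made up for illustration, and the choice of 葛 (U+845B) with U+E0100 and U+E0101 as an example pair is an assumption, so check the Ideographic Variation Database before relying on any specific sequence.

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Hypothetical helper: drop selectors so variants compare as the same base text."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


BASE = "\u845b"  # 葛, picked only as an illustrative base character (assumption)
variant_a = BASE + "\U000E0100"  # base + VS17
variant_b = BASE + "\U000E0101"  # base + VS18

print(len(variant_a))             # 2: base character plus trailing selector
print(variant_a == variant_b)     # False: the variant choice travels with the text
print(strip_variation_selectors(variant_a) == strip_variation_selectors(variant_b) == BASE)  # True
```

The variant preference is stored in the text itself, so it survives copy and paste between applications, while software that does not care can strip the selector and still match on the base character.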

mantap · on Dec 3, 2019

The alternative is worse. Look at all of the problems we have with Turkish I, just because they didn't create new codepoints to make Turkish I and Latin I distinct even though they look the same.
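A small Python sketch of the Turkish I problem, for anyone who has not run into it. Python's built-in `str.upper()` and `str.lower()` are locale-independent, so they apply the default mapping for the shared I/i code points. The `turkish_upper` and `turkish_lower` helpers are hypothetical and only handle the four I-like letters; real code would more likely use a locale-aware library.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkish_upper(text: str) -> str:
    """Hypothetical helper: Turkish pairs are i -> İ and ı -> I."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: Turkish pairs are İ -> i and I -> ı."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


word = "\u0131s\u0131"  # "ısı", written with dotless i

# Default mapping: the round trip silently turns dotless ı into dotted i.
print(word.upper().lower() == word)                # False
# Turkish-aware mapping: the round trip preserves the word.
print(turkish_lower(turkish_upper(word)) == word)  # True
# Default lowercasing of İ yields "i" plus U+0307 COMBINING DOT ABOVE.
print(len(DOTTED_CAPITAL_I.lower()))               # 2
```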

Shorel · on Dec 3, 2019

Cyrillic А is not the same as English A.

For example, some fonts render A in a way that looks like Cyrillic Л. (Like The Mandalorian title screen.)

This would be incorrect if using the same A for both: https://i.ytimg.com/vi/V8fC7bdV-mI/maxresdefault.jpg
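To make the distinction concrete, here is a minimal Python sketch using the standard `unicodedata` module. It shows that the three look-alike capitals are separate code points and that their lowercase forms diverge. The `LOOKALIKES` table and `describe` helper are names invented for this example.

```python
import unicodedata

# Three capitals that look alike in most fonts, written as escapes to avoid ambiguity.
LOOKALIKES = {
    "Latin": "\u0041",
    "Cyrillic": "\u0410",
    "Greek": "\u0391",
}


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' using the Unicode character database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for script, ch in LOOKALIKES.items():
    print(f"{script}: {describe(ch)} -> {describe(ch.lower())}")
```

Because each script has its own code point, a font can restyle the Latin A without touching the Cyrillic letters, and case mapping can send each capital to its own lowercase letter.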

bonoboTP · on Dec 3, 2019

I don't see anything special in the linked image. The As look as they would in Latin script.

(probably not what you meant, but just in case: the fourth letter is not a Cyrillic A but a D.)

GrantSolar · on Dec 3, 2019

I think the image is not meant to show the problem itself, but to show a case where, if the Cyrillic A had been stylised the same way as the A in the English version, the two distinct letters would become indistinguishable, so the Cyrillic title would effectively read "The Mlndlloriln".

bonoboTP · on Dec 3, 2019

I see. For those out of the loop: the English title screen does not have the horizontal bar in the As.

Shorel · on Dec 4, 2019

I should have linked the English title as well. Rereading my comment, I see it was ambiguous.

Here: https://en.wikipedia.org/wiki/File:The_Mandalorian_logo.jpg you can see that the A letters are rendered the same as Cyrillic Л.

To further the confusion, in Buenos Aires there are many street signs that use Л as the letter A, and they use П as the letter N.

bonoboTP · on Dec 5, 2019

Part of the confusion was because I see the English title's As are actually Λ (capital Greek lambda), rather than Л (at least in the font that HN uses). I'm guessing from context that Л is sometimes rendered as Λ in some Cyrillic fonts.

Shorel · on Dec 6, 2019

> I'm guessing from context that Л is sometimes rendered as Λ in some Cyrillic fonts.

Exactly. It is more often rendered like that in Bulgaria. But it is still the letter Л.

Which just furthers the point that glyph rendering and character code points are very different problems and the multiple code points in Unicode are the right approach.

tomp · on Dec 3, 2019

If the readers were to confuse A and Д, that's a problem with the font, not the letters. Cyrillic, Greek and Latin A are all one letter (in uppercase).