Adding all of the dead languages bloated the system very quickly. Shame UCS2 tur...

WalterBright · on Sept 2, 2023

UCS2 was not a mistake. Unicode forgot its mission.

otabdeveloper4 · on Sept 2, 2023

The idea of fixed-length characters was the mistake. Unicode is fine.

numpad0 · on Sept 2, 2023

24- or 32-bit fixed length would be probably fine…

The issue is doing a unified codepoint right run into problems of accepting that zh-CN/zh-HK/zh-TW and ja-JP/kr-KR each uses slightly different definitions for every letters in common use, which is not just memory inefficient but also have to do with Taiwanese identity problem.

Much easier to keep it to just zh-CN and zh-TW(recently renamed zh-Hans and zh-Hant) with zh-CN completely double defined with ja-JP, and kr-KR dealt as footnotes on ja-JP, which is what they did.

lifthrasiir · on Sept 2, 2023

> The issue is doing a unified codepoint right run into problems of accepting that zh-CN/zh-HK/zh-TW and ja-JP/kr-KR each uses slightly different definitions for every letters in common use, which is not just memory inefficient but also have to do with Taiwanese identity problem.

I'm not sure what exactly you wanted to say, but all of six or seven variants of Han characters are substantially different from each other. Glyphwise there are three major clusters due to the different simplification history (of the lack thereof): PRC, Japan and everyone else. (Yes, Korean Hanja is much similar to traditional Chinese.) Even when Taiwan didn't exist at all there would have been three major clusters anyway.

Also pedantry corner: ko-KR, not kr-KR. And zh-TW wasn't renamed to zh-Hant; it is valid to say zh-CN-Hant if PRC somehow wants to use traditional Chinese characters.

numpad0 · on Sept 2, 2023

My apologies for ignorance to Korean speakers.

What I was trying to say is, to fix Unicode, we must accept that whatever many of those variants don't interchange, and therefore all require its own planes, which I believe had never been the Consortium's stance on the matter. And I think that should solve about half of problems with Unicode by volume.

lifthrasiir · on Sept 2, 2023

Yeah, everyone seem to agree that the Han unification was a mistake. But I also think its impact was also exaggerated. The latest IVD contains 12,192 additional variants [1] for the original CJK unified ideograph block (currently 20,992 characters). They indeed represent most variants of common interest, so if the Han unification didn't happen and we assigned them in advance, we could have actually fit everything into the BMP! [2] Of course this also meant that a further assignment is almost impossible.

[1] https://unicode.org/ivd/data/2022-09-13/

[2] Unicode 2.0 had 38,885 assigned characters (almost in the BMP) and ~6,200 unallocable characters otherwise (mostly because they were private use characters since 1.0). 38,885 + 6,200 + 12,192 = 57,277 < 2^16. A careful block reassignment would have been required though.

numpad0 · on Sept 2, 2023

IVS is going to work out if everyone switches to always use it by default, but it's currently only used for special cases where the desired letter shape diverges from language default(the "bridge-shaped high" cases) rather than the-default and I don't see a perfect IVS world happening.

I mean, it's clearly tied to languages, so I think only viable options are either we split languages into separate tables into those reserved-for-unknown-reasons planes, or add a font selector instruction like ":flag_tw: :right_arrow: 今天是晴天 :left_arrow:", OR, if there's going to be a WWIII and one or more of Zh-Hans or ja-JP becomes obsolete, it's going to be not an important issue.