
Unicode 88 (wrongly) assumed that the set of Han characters in current use is small enough to fit in its code space (see p. 3 for specifics). This turned out to be massively incorrect because character usage has an extremely long tail. The same goes for Hangul.

EDIT:

> Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.

I do believe the initial estimate was not far off. IICore 2.2 [1] contains 9,810 Han characters across all relevant countries. Even assuming that every character had to be encoded twice (to account for simplified-traditional splits) and that IICore had some significant omissions, we could have encoded just ~40,000 characters, if everything had gone according to their plan.
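To make the arithmetic explicit, here is a rough sketch of that estimate. The factor of 2 for omissions and the 16-bit code space are my assumptions for illustration; only the IICore count and the doubling for simplified-traditional splits come from the text above.

```python
# Back-of-the-envelope check of the ~40,000 figure.
IICORE_COUNT = 9_810    # IICore 2.2 size, as cited above
VARIANT_FACTOR = 2      # every character encoded twice (simplified/traditional)
OMISSION_FACTOR = 2     # assumption: IICore missed about as many characters as it lists
CODE_SPACE = 2 ** 16    # assumption: a 16-bit code space

def estimated_han_budget(base=IICORE_COUNT,
                         variants=VARIANT_FACTOR,
                         omissions=OMISSION_FACTOR):
    """Return a pessimistic count of Han characters needing code points."""
    return base * variants * omissions

budget = estimated_han_budget()
print(budget, budget < CODE_SPACE)
```

Even with both pessimistic factors applied, the total stays well under the size of a 16-bit code space, which is why the plan looked workable on paper.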

The reality, however, was that a PUA for rare characters was an old concept (pretty much every East Asian character set has one) and had already proven ineffective. PUA is by definition not interchangeable, but rare characters are still in use and have to be interchanged anyway. So people avoided PUAs like the plague and argued for bigger character sets instead. East Asian character sets ballooned as a result, and by the time Unicode arrived they had become so large that the naive vision of Unicode 88 no longer worked.
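As a small illustration of why PUA code points do not interchange, this sketch uses Python's standard `unicodedata` module to compare a private-use code point with an encoded ideograph. The helper names `is_private_use` and `describe` are made up for this example.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """True if ch has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

def describe(ch: str) -> str:
    """Show what the character database knows about a code point."""
    # Private-use code points have no character name, so pass a default
    # instead of letting unicodedata.name() raise ValueError.
    name = unicodedata.name(ch, None)
    kind = "private use" if is_private_use(ch) else "assigned"
    return f"U+{ord(ch):04X} {kind}: {name or '(no standard name)'}"

print(describe("\u4e00"))  # an encoded Han character
print(describe("\ue000"))  # first code point of the BMP private-use area
```

The private-use code point carries no standard identity at all: what it displays depends entirely on a private agreement between sender and receiver, which is exactly what breaks down once a rare character leaves the system that defined it.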

[1] https://en.wikipedia.org/wiki/International_Ideographs_Core



