
It's not true if you count current languages of East Asia, such as Chinese. Or does the document address that omission somehow?

EDIT: Skimming it, I don't see how the issue is addressed, except in a diagram that says,

* Modern-Use Ideographs: 54x256 = 13,824

* Added Ideographs: 64x256 = 16,384

Also it talks about deduplicating identical characters in Chinese, the Japanese alphabets, and Korean (Hangul).

More interesting is the full explanation of modern use:

Distinction of "modern-use" characters: Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.

It's very interesting, and revealing about the author, to define use as newspaper and magazine publishing.




Unicode 88 (wrongly) assumed that the set of Han characters in current use was small enough (see p. 3 for specifics). This turned out to be massively incorrect because usage has an extremely long tail. The same goes for Hangul.

EDIT:

> Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.

I do believe the initial estimate was not far off. IICore 2.2 [1] contains 9,810 Han characters across all relevant countries. Even assuming that every character had to be encoded twice (to account for simplified-traditional splits) and that IICore had some significant omissions, we could have encoded just ~40,000 characters, if everything had gone according to their plan.
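For anyone who wants to check the arithmetic, here is a rough sketch in Python using only the figures quoted in this thread. The variable names are mine, and the 65,536 total assumes the fixed 16-bit design that Unicode 88 proposed, which is not spelled out in the excerpts above.

```python
# Back-of-the-envelope check of the numbers quoted in this thread.
ROW = 256                    # code points per row in the Unicode 88 diagram

modern_use = 54 * ROW        # "Modern-Use Ideographs" block
added = 64 * ROW             # "Added Ideographs" block
han_blocks = modern_use + added

# Assumption (not stated in the excerpts): a fixed 16-bit code space.
budget_16bit = 2 ** 16

iicore = 9_810               # IICore 2.2 size, per [1]
doubled = 2 * iicore         # every character encoded twice (simplified/traditional)
generous = 40_000            # the ~40,000 upper guess, allowing for omissions

print(modern_use, added, han_blocks)   # 13824 16384 30208
print(doubled, doubled <= han_blocks)  # 19620 True
print(generous <= budget_16bit)        # True
print(budget_16bit - generous)         # 25536
```

So a doubled IICore fits inside the two ideograph blocks from the diagram, and even the generous ~40,000 guess leaves room in a 16-bit space. This only checks the estimate itself, not whether usage would have stayed within it.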

The reality, however, was that a PUA for rare characters was an old concept (pretty much every East Asian character set has one) that had already proven ineffective. A PUA is by definition not interchangeable, but rare characters are still in use and have to be interchanged anyway. So people avoided PUAs like the plague and argued for bigger character sets instead, East Asian character sets ballooned, and by the time of Unicode they had become so large that the naive vision of Unicode 88 no longer worked.

[1] https://en.wikipedia.org/wiki/International_Ideographs_Core



