Unicode 88 (1988) [pdf] (unicode.org)
57 points by robin_reala on Sept 2, 2023 | 94 comments



In 1978, the initial proposal for a set of "Universal Signs" was made by Bob Belleville at Xerox PARC. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the Xerox Character Code Standard (XCCS) by the present author, a multilingual encoding which has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.

Unicode arose as the result of eight years of working experience with XCCS. Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes), and by Lee Collins (ideographic character unification). Unicode retains the many features of XCCS whose utility has been proved over the years in an international line of communicating multilingual system products.


"2.1 Sufficiency of 16 bits"

Someone should go back in time and tell them...


Their justification is:

> Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is far below 2¹⁴ = 16,384.

...and yeah, that's probably true. It wasn't until several years later that they joined forces with the ISO10646 project, which aimed to encode every character ever used by any human writing system, and 16 bits turned out not to be enough.


It's a pity that Unicode forgot its purpose and added:

1. formatting

2. fonts

3. semantic meaning

4. multiple encodings for the same code point

5. you vote for my invented character and I'll vote for your invented character

6. multiple code points for the same glyph

7. an incomprehensible document describing all this nonsense

8. an impossibility to implement it all


> multiple encodings for the same code point

I'm fairly sure a good fraction of these are due to the desire for compatibility:

PDF: https://www.unicode.org/versions/Unicode15.0.0/ch02.pdf#G110...

> Conceptually, compatibility characters are characters that would not have been encoded in the Unicode Standard except for compatibility and round-trip convertibility with other standards. Such standards include international, national, and vendor character encoding standards. For the most part, these are widely used standards that pre-dated Unicode, but because continued interoperability with new standards and data sources is one of the primary design goals of the Unicode Standard, additional compatibility characters are added as the situation warrants.

Round-trip convertibility is important for interoperability, which is important for getting the standard accepted. A data encoding standard that causes data loss is not acceptable, and trying to impose such on the world would have rightfully failed.
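
For a concrete (if minimal) illustration of round-trip convertibility, here's a sketch in Python using its built-in codecs; the bytes are the standard Shift-JIS (cp932) encodings of a halfwidth and a fullwidth capital A:

    # Shift-JIS distinguishes halfwidth "A" from fullwidth "Ａ"; Unicode keeps
    # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A as a compatibility character so
    # the distinction survives conversion to Unicode and back.
    legacy = b"\x41\x82\x60"               # cp932: halfwidth A, fullwidth A
    text = legacy.decode("cp932")          # 'AＡ' (U+0041, U+FF21)
    assert text.encode("cp932") == legacy  # lossless round trip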


> new standards ... added as the situation warrants

Never ending madness.

> impose such on the world

The imposition comes from the Unicode standard, inflicted on everyone in the world who has no use for a never-ending stream of newly invented encodings for the same thing.


Unicode being widely adopted (which is what is happening) is the solution to the problem you seem to have of Unicode having to add more codepoints for compatibility with other encodings. Even if the pre-Unicode encodings never go away, they've already taken their respective bites out of the codepoint space, and they won't take up any more of the Consortium's time.

The only way this becomes a live issue again is if someone develops a whole new encoding which is incompatible with the latest version of Unicode. And it gets enough traction to get on anyone else's radar. I simply can't see that as very likely.


> Never ending madness.

[Unicode] didn't start the fire; it was always burning, since the world's been turning.

(I.e. the "never-ending madness" is in the continual desire of human cultures to invent new characters with unique semantics and then have them faithfully digitally encoded in a way that carries those semantics. Being upset that this makes the management of a universal text encoding a never-ending job, is a bit like being upset that the Library of Congress always needs reorganizing and expanding because it keeps getting new books.)


It's different in that nobody expects anyone to implement their own copy of the LoC. This constant expansion of the character set puts an enormous burden on a very large collection of developers. The inevitable happens - they just don't bother - making full Unicode support a pipe dream.


I'm perplexed as to what you think "full support" for a novel Unicode codepoint entails.

Most of the properties of a novel codepoint that matter are attributes that the codepoint has defined within various Unicode-Consortium-defined data files. If your custom implementation of Unicode support already parses those data files, then a new codepoint is essentially "absorbed" automatically as soon as the new version of the Unicode data files is made available to your implementation.
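
As a rough sketch of what that looks like in practice (using Python's unicodedata, which is just a compiled copy of the UCD data files; the particular codepoint is only an example):

    import unicodedata

    # Properties of a codepoint come straight out of the Unicode data files,
    # so support arrives with the data, not with per-character code.
    ch = "\U0001FAE0"  # MELTING FACE, added in Unicode 14.0
    print(unicodedata.name(ch, "<not in this UCD version>"))
    print(unicodedata.category(ch))          # 'So' -- Symbol, other
    print(unicodedata.east_asian_width(ch))  # 'W' on a UCD that includes it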

Are you just complaining that new codepoints don't automatically receive glyphs in major fonts?

To that, I would say: who cares? As I just said upthread — character encoding is about semantics, not about rendering. A character doesn't have to be renderable to be useful! The character is still doing its job — "transmitting meaning into the future" — when it's showing up as a U+FFFD replacement character! Screen readers can still read it! Search engines can still index on it! People who want to know what the character was "meant to mean" can copy and paste it into a Unicode database website! It still sorts and pattern-matches correctly! And so forth.

Rendering the codepoint as a glyph — if it even makes sense to render it standalone as a glyph (for which some codepoints don't, e.g. the country-code characters that only have graphical representations in ligatured pairs) — can come at some arbitrary later point, when there's enough non-niche, non-academic, supply-side usage of the codepoint, that there arises real demand to see that codepoint rendered as something. Until then, just documenting the semantics as a codepoint and otherwise leaving it un-rendered, works fine for the needs of the anthropologists and archivists who "legibilize" existing proprietary-encoding-encoded texts through Unicode conversion.


How hard is it to use AI to generate fonts on demand?


That still needs an encoding that carries enough information to reconstruct a glyph.


I’m going to defend this set of complaints:

> “multiple encodings for the same code point”, “multiple code points for the same glyph”, and “an incomprehensible document describing all this nonsense”

One important factor for adoption was round trip idempotency. Unicode doesn’t simply encode every glyph but incorporates as many existing character sets as possible. (Of course “possible” is modulated by the fact there are/were competing encodings for the same language).

If you use a unicode-aware grep and use it to search for ‘.’, you want the result to be the same as `cat`, rather than having the result re-coded, objcopy-like.

Sure, this is less of a problem in 2023, but it is still a major problem that will continue indefinitely due to legacy files, legacy hardware, and small-resource devices. So there is no way in Unicode to tell by inspection if ‘A’ is Latin or Greek… which is true also on the printed page.

Likewise the writing of different languages doesn't merely have a different appearance, but different semantics for the visually same shape. In some languages (e.g. French), diacritics are largely semantically silent: you can understand a sentence with them stripped out, though they can simplify reading in some cases by disambiguating words that would otherwise be homonyms. In other languages (e.g. the German umlaut) they are simply contractions — fossils of medieval handwriting. Yet again, in Swedish they can be optional, like the acute accent, or form completely different letters, like ä. Capturing this is complicated and even discussing it requires, yes, an almost incomprehensible document.

And it has all been implemented, but using those libraries is hard, because language processing is hard. In general, I/O is hard.

I'm not sure what you mean about “semantic meaning” but I do agree, mostly, with your other complaints.


> semantic meaning

There's one code point for the letter `a` and another code point for the mathematical symbol `a`, despite being visually identical. I.e. the code points have acquired semantic meaning.

Spy agencies use this so different versions of a document look identical when printed. Thus, they can use this to track down the mole who leaked a sensitive document.
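
A quick way to see the distinction (and why compatibility folding is the usual defence against this kind of homoglyph trick) is a minimal sketch with Python's unicodedata:

    import unicodedata

    plain = "a"             # U+0061 LATIN SMALL LETTER A
    math_a = "\U0001D44E"   # U+1D44E MATHEMATICAL ITALIC SMALL A
    print(plain == math_a)                                 # False
    print(unicodedata.name(math_a))                        # MATHEMATICAL ITALIC SMALL A
    print(unicodedata.normalize("NFKC", math_a) == plain)  # True: folds back to 'a'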


Emoji are “fun” but it’s got to stop. These characters are going to be used until civilization collapses, so we can’t just keep being like “lol, let’s add a food that is only eaten in some regions and has been around for less than 100 years.”


And, let's be honest, add some of the glaring omissions in emoji. There's no need to use the eggplant emoji between lovers if you can use the real thing. I mean seriously, even the ancient Egyptians knew this: https://unicode-explorer.com/c/130B8 and even more explicit: https://unicode-explorer.com/c/130BA

Of course it will be inappropriate in many situations. Well, guess what, so are many words. Emoji are no longer just fun. They're also a great way for people of different languages to understand each other. No need to make things more difficult to understand, and there are also a lot of non-controversial cases for this ("doctor, my eggplant sparks when the water tap is open" is not quite as clear :) )

Of course the level of explicitness can be decided by the font designers, and the big tech companies will no doubt stylise them tastefully. The same way the gun emoji is now a green water pistol.

But I really don't understand leaving this major topic of human interaction out of what is becoming a language. I'm really serious about this. Any service, website or app not wanting to support them can simply block them just like they already do certain words, and it's much easier to do than a filter on words which can be bypassed by misspelling, inserting spaces etc. This is how 'pr0n' became a word. With emoji you can't do this.

I'm a bit tired of the tech sector pushing American prudishness on the whole world. Here in Barcelona you can buy waffles in the shape of penises and vulvas with strategically placed white chocolate and eat them on the street, but a tastefully stylised emoji is not ok? Different cultures have different standards and Unicode was supposed to represent everyone.


A lot of those weird emojis have been repurposed beyond their reason for inclusion in Unicode, so I won't discount them.

See, e.g.:

https://totallythebomb.com/stone-man-moai-emoji


Not different from the fact that we have useless C0/C1 control characters somewhere in Unicode. As long as it stays bearable (i.e. not too many), having rare characters is okay.


Emojis are also great from a technical standpoint: because they are desirable to users, the Unicode consortium can (and does) leverage them to encourage text rendering engines to properly implement features necessary for various languages, e.g. astral codepoints, ligatures, grapheme clusters, ...


They aren’t useless because they support file idempotency.

Let’s say “important while almost entirely useless” and call it a day?


But I use the khlav kalash emoji all the time!


Displaying emoji is the job of the rendering engine, not the encoding.


Japanese telcos disagreed. (Seriously. We have emojis in Unicode because of interoperability with them.)


Yes, but IMHO we could have frozen them (same as the ASCII control codes) and not made the problem worse by adding more.

And I say this as an emoji user on my keitai back in the 90s.


But they are different in every font. Only what they represent is the same. Just like other Unicode characters.


Japanese telcos eventually implemented a conversion server for emojis from other telcos (before emojis in Unicode were formally accepted). It was just a matter of time.


More like weaponized by making it a moat against “PC”, and so Google did its thing.


That's much later than the initial Emoji assignments, and pretty predictable given the much broader reach of availability. The earliest UTC decision regarding Emoji is [1] AFAIK.

[1] https://www.unicode.org/L2/L2007/07225.htm#112-C12 (links to https://www.unicode.org/L2/L2007/07274-symbols.txt)


Emoji were always carrier dependent and Japanese carriers always had conversion/filtering at the carrier e-mail MTA, years before 2007. I think you have in mind one of the later coordinated attempts at it.


Oh, I assumed the other meaning of PC (not Personal Computer). You are right, Google and Apple had to deal with them because emails went through their pipelines.


We're evolving from phonetic languages back to hieroglyphic ones, because somehow the Unicode committee imagines that 1,000,000 hieroglyphs with distinct meanings are better than 26 phonetic letters.


They hate them. So much that there have been multiple attempts to curb the future Emoji growth, including one based on WikiData QIDs by the president of the Unicode consortium [1]. Emojis persist because the current growth doesn't burden the consortium or vendors too much and no better solution was found.

> 26 phonetic letters

And I'm genuinely offended by this particular phrase. You can't even correctly spell my name in those 26 "phonetic" letters. (My preferred convention is a mere approximation.)

[1] https://www.unicode.org/L2/L2019/19082r-qid-emoji.pdf


No need to be offended. I've clearly advocated Unicode16, which surely encompasses your spelling. Unicode16 is an achievable and implementable goal. I welcomed it as a great advance.

I do note that you use ASCII for your handle. Could that be because few will be able to figure out how to enter the code point for your name? If your handle includes code point 0x123456, how do you expect users to deal with it?


I only use an ASCII handle because this forum is English only and I don't want to waste my time answering stupid questions. My Korean colleagues can type Korean handles just fine, thank you for the consideration.

But still. I'm fortunate that my Korean name does fit in the 2,350 common Hangul syllables, but there are people whose names don't (and who are greatly disadvantaged). And far more Chinese and Japanese people have rare characters in their names. I even have concrete data for them; the Japanese government defines the Koseki unified character set (戸籍統一文字) which complements common character sets and is intended to be used for family registers (戸籍). Therefore it should cover most person and location names ever used in Japan (and probably nothing else). There are 44,770 characters in this set, including 857 characters not yet encoded in Unicode [1]. So a reasonable character set supporting most Japanese proper names already requires more than 50,000 distinct characters!

Unicode16 was never ready for long tails. The Unicode committee only realized this after a massive number of character assignments, which is their only regrettable mistake in my opinion.

[1] Counts based on GlyphWiki's listing: https://en.glyphwiki.org/wiki/Group:%e6%88%b8%e7%b1%8d%e7%95...


The only fault of Unicode is 4 (and we have eventually solved it anyway). Everything else is a natural complication from human writing systems.


What should give pause in advocacy for the tarpit of Unicode is its unimplementability. That's a giant red flag that something went horribly wrong.


By this measure pretty much every common programming language is unimplementable and should never be advocated. People have implemented Unicode for decades. Only a few implemented "full" support (in some sense), but that's the same for pretty much everything complex enough. Can you actually measure one meter [1] according to the International System of Units?

[1] Or one foot. I remind you that every current standardized definition of "foot" is based on the SI.


> People have implemented Unicode for decades

Small subsets. No fonts remotely support all the characters. Programming languages do subsets.


As others pointed out, no single font is required to support all available Unicode characters. In fact it is close to impossible anyway, because it is extremely hard to harmonize all typographic conventions across all scripts. Unicode mostly does the next best thing: it gives a uniform framework to handle multilingual texts without knowing specifics. Even though your font can't render "월터 브라이트", programs can normalize, compare, search and segment it just like they do with the text "Walter Bright".
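
To make that concrete, a small sketch using Python's standard library (no Korean-specific knowledge involved):

    import unicodedata

    korean = "월터 브라이트"
    decomposed = unicodedata.normalize("NFD", korean)          # precomposed syllables -> jamo
    print(len(korean), len(decomposed))                        # 7 vs 14 code points
    print(unicodedata.normalize("NFC", decomposed) == korean)  # True: same text again

    # Generic string operations need no script-specific handling:
    print("브라이트" in korean)    # substring search
    print(korean.split())          # ['월터', '브라이트']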

But if you like pedantry, the Noto fonts are close to complete. They have to be used as a collection due to the OpenType limitation, but they are otherwise designed in unison. Most remaining characters are rarer CJK unified ideographs [1], but that is more a matter of priorities; there are several fonts that cover the entire set of CJK unified ideographs as well.

[1] https://notofonts.github.io/overview/


Fonts, for example. That is a style, and should be set by the rendering instructions (like CSS) not the code point.

Semantics: there's a code point for the letter 'a' and another code point for the mathematical symbol 'a', although the glyphs are identical. Unicode points should not have semantic meaning - semantic meaning comes from the context in which the points are used.

You know, like the text in a printed book.

And so on, for the other bullet points.


> Semantics: there's a code point for the letter 'a' and another code point for the mathematical symbol 'a', although the glyphs are identical. Unicode points should not have semantic meaning - semantic meaning comes from the context in which the points are used.

Just the opposite. A codepoint looks different in different fonts, and so things that are "identical" in one font won't be "identical" in another font. Only the semantics are the same.

Printed "A" and cursive "A" are both, semantically, "latin capital A". Those semantics are very concrete and specific: they include its sort order, its conversion re: case-conversion, its joining/splitting behavior, its categoric matching in regexps, etc.

But the codepoint "latin capital A" expresses no preference over whether it should be rendered as a printed/roman A or a cursive A. Codepoints don't define glyphs; they only define semantics! The glyphs associated with codepoints are conventional, following the codepoint's description!

Further, codepoints tell accessibility technologies and other machine agents what a character "is." Greek letter π is text-to-speech'ed as a letter, and together with other greek letters, as a greek word. Mathematical symbol 𝜋 is text-to-speech'ed as the word "pi."
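
A small illustration of those machine-readable semantics, sketched with Python's unicodedata and re modules:

    import re
    import unicodedata

    # 'A' carries its semantics no matter how a font draws it:
    print("A".lower())                           # 'a'  (case mapping)
    print(unicodedata.category("A"))             # 'Lu' (uppercase letter)
    print(re.fullmatch(r"\w", "A") is not None)  # True (word character)

    # The Greek letter and the mathematical symbol are distinct codepoints
    # with distinct names -- which is what screen readers key on:
    print(unicodedata.name("\u03c0"))      # GREEK SMALL LETTER PI
    print(unicodedata.name("\U0001D70B"))  # MATHEMATICAL ITALIC SMALL PI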


> Codepoints don't define glyphs; they only define semantics!

And this is the fundamental failure of Unicode post 16 bit code points. Unicode has set itself an absurdly impossible task.


This is how every character encoding ever has always worked. Text on computers — text over telegraph lines, even! — has always, implicitly, been a set of abstract standardized semiotic codels, with effectively-arbitrary per-OS/per-machine/per-font-family/per-style/per-size(!) graphical renditions.

> Unicode has set itself an absurdly impossible task.

Again: "Unicode" is not the problem here. "Human language" is the "absurdly impossible task", and has been, from ~40,000BCE. Unicode is just a formalist lens through which the scope of the task is made legible.

(And I think you're being a slight bit scope-insensitive here. You should expect a project like "encoding all of human language" to be something that takes hundreds or thousands of years to get right. Unicode isn't a project, it's a megaproject, something that'll still be evolving fifteen human generations from now. But humanity as a whole has built many successful megaprojects! In the only-50-years-old realm of software engineering, we don't tend to rate things of this scope and scale as "possible" or "practical" or "tenable" — but Unicode fundamentally isn't a software engineering project; it's a language engineering project, like the project of the Académie Française, or of the creation of Simplified Chinese. Language engineering is, by definition, a project of indefinite length.)


I have already explained why mathematical alphanumeric symbols exist in the first place [1]. They are an extremely small portion of Unicode; you need way more examples.

By the way, if we indeed decide to encode characters according to glyphs, we would have to distinguish a double-storey `a` from a single-storey `a` (among others). They have clearly different glyphs, so why shouldn't we? Think about that.

[1] https://news.ycombinator.com/item?id=28294032


> I have already explained

But somehow mathematical books have done just fine without any need for such encodings. Meaning in written language is inferred from the context, not the symbol. This is because meanings are infinite in variety, and symbols are finite.

> Think about that

I did. That's what fonts are for. Fonts come in endless variety, and so cannot be encoded into Unicode. They are an orthogonal property of rendered text.

And so is formatting - it's an orthogonal property.


> But somehow mathematical books have done just fine without any need for such encodings.

Printed books have no reason to think about encodings. In fact, they can use one-off metal types (or images nowadays) for unusual characters. In return printed books cannot be easily searched if the entry doesn't appear in prepared indices.

> Fonts come in endless variety, and so cannot be encoded into Unicode.

But some fonts convey semantics. For example a normal text does not distinguish two variants of `a`, but the IPA does, so Unicode encodes a single-storey `ɑ` separately and unifies a double-storey `a` with the normal lowercase `a`. (p. 299 of the Unicode standard 15.0)

Semantics is arguably the only way to judge those cases. If fonts do not matter but glyphs do matter, how would you know whether a particular glyph should be encoded identically to another glyph or not? I think you agree that a double-storey `a` and a single-storey `a` are the same character in different fonts. Semantics is how we decide the "same character" part!


It's great that you can tell the mathematical contexts from the written language contexts, but can screen readers?

If I have the equation ax + by + c = 0, ideally a screen reader would read it as "a-x plus b-y plus c equals zero", not "axe plus bye plus ku equals zero". This is not a problem printed books have because printed books do not need to read themselves, and audio books do not have this problem because there's a human in the loop doing the reading. But these are not the only uses that matter.


> Printed books have no reason to think about encodings

Exactly! It's the visual that matters. If you cannot print Unicode text on a sheet of paper and know its meaning, then Unicode has gone into a ditch.

> printed books cannot be easily searched if the entry doesn't appear in prepared indices

Are you going to invent a new encoding for every distinct use of the letter 'a'? That's not remotely possible.

> But some fonts convey semantics

People invent new fonts every day. Hell, I've invented fonts! Your mission to encode them all into Unicode is an instantly hopeless task. Again, semantics are divined from context. I know a mathematical 'i' from the english letter 'i' and the french letter 'i' and the bullet point 'i' and the electric current 'i' by context, and so do you and everyone else. I even use the letter 'A' to represent an Army in my computer game Empire. Nobody has a problem divining the meaning. (and B for Battleship, S for Submarine , etc.)


> It's the visual that matters.

I expect a good search function for my electronic book!

> Again, semantics are divined from context.

I gave an explicit counterexample. You can't distinguish an IPA `ɑ` (open back unrounded vowel) from an IPA `a` (open front unrounded vowel) from context alone---[hɑt] and [hat] are different pronunciations. So they have to be differently encoded, right? And due to its history, the "normal" `a` glyph will look exactly like one of them, depending on the font. So we can't have perfectly distinct glyphs for distinct code points anyway. Maybe we need a better rationale, something like semantics...


We have many words that are spelled the same, but sound different based on context, like "sow". That isn't a problem. They don't have to be differently encoded. Besides, are you now planning on adding accents and dialect pronunciations, too?


I'm not talking about homonyms (but I wasn't clear enough about that, sorry). I'm talking about the exact alphabets describing pronunciations. Honestly I can't really differentiate [hɑt] and [hat] sounds, but their IPA renditions are clearly different and have to be differently encoded.


> have to be differently encoded.

They don't. It's a fake problem.


So you have two different pronunciations [hat] and [hat]. Which one is which? Do you really expect to say "[hat] as in General American hot" and "[hat] as in Received Pronunciation hat" every time? Isn't the IPA supposed to be an unambiguous transcription (so to say)?


It’s a fake problem if you just want to write the word “sow” or “hat”, sure, but if you want to write the IPA pronunciation of a word you need the IPA characters in your charset.

Wikipedia has those IPA pronunciations and they’re useful. How else would you do that without including those characters in Unicode? Would you rather have them as images?


> good search function

If there are 5 encodings for the glyph 'a', how are you going to know which one to search for? And there's no guarantee whatsoever which encoding the author used, or if he used them consistently. Your search function is going to be unreliable at best.

After all, if I was writing algebra, I'd use the ASCII 'a' not the math glyph. So will most everyone else but the most pedantic.


> If there are 5 encodings for the glyph 'a', how are you going to know which one to search for?

There are two possible cases and Unicode covers both of them.

For the case of IPA `ɑ`, it is a distinct character from `a` (which doubles as an IPA character), so results containing `ɑ` will be ignored. But this is not far from what I want, because `a` and `ɑ` are distinct in the IPA and the search correctly distinguishes them. I would have to type `ɑ` to get the other results, but this is an irrelevant matter.

For the case of mathematical alphanumerics, they do normalize to normal counterparts in the compatibility mapping (i.e. NFKC or NFKD). The compatibility mapping disregards some minor differences and would be useful in this case. Of course I would like to narrow the search so that only mathematical alphanumerics match (or the inverse)---this is why Unicode defines multiple levels of comparison [1]. Assuming the default collation table, all mathematical alphanumerics are considered different from normal letters only at the level 3, while IPA `ɑ` is different at the level 1 (but sorts immediately after all `a`s).

[1] https://unicode.org/reports/tr10/#Multi_Level_Comparison
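
As a rough sketch of how those two cases behave in practice (NFKC plus casefolding stands in crudely for a collator's lower comparison levels; `loose_find` is just an illustrative helper, not a library function):

    import unicodedata

    def loose_find(needle, haystack):
        # Compatibility-fold both sides before searching.
        fold = lambda s: unicodedata.normalize("NFKC", s).casefold()
        return fold(needle) in fold(haystack)

    print(loose_find("a", "E = m\U0001D44E\u00b2"))  # True: math-italic 'a' folds to 'a'
    print(loose_find("a", "h\u0251t"))               # False: IPA 'ɑ' stays distinct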

> After all, if I was writing algebra, I'd use the ASCII 'a' not the math glyph.

What I would type in TeX doesn't exactly relate to the actual encoding.


> If you cannot print Unicode text on a sheet of paper and know its meaning, then

then you've picked the wrong font for the job. This can happen even with ASCII, e.g. if you print out cryptographic secrets in base64 for use during disaster recovery and then discover you cannot distinguish O and 0 (capital o and zero) nor I and l (capital i and small L). (Hopefully you discover this immediately after printing, instead of after losing all your data.) That doesn't mean those identical glyphs should never have been encoded differently, it means font designers need to make them more distinct.


And on top of that people decided to start supporting Unicode everywhere, like in domain names, source code, filesystems, etc.

So now we can have poop and animal symbols in filenames.

And left-to-right overrides and countless homograph attacks.

It's just sad.


It's not true if you count current languages of East Asia, such as Chinese. Or does the document address that omission somehow?

EDIT: Skimming it, I don't see how the issue is addressed, except in a diagram that says,

* Modern-Use Ideographs: 54x256 = 13,824

* Added Ideographs: 64x256 = 16,384

Also it talks about deduplicating identical characters in Chinese, the Japanese alphabets, and Korean (Hangul).

More interesting is the full explanation of modern use:

Distinction of "moderm-use" characters: Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.

It's very interesting, and revealing about the author, to define use as newspaper and magazine publishing.


Unicode 88 (wrongly) assumed that the number of Han characters in current use was small enough (see p. 3 for specifics). This turned out to be massively incorrect because they have an extremely long tail. Same goes for Hangul.

EDIT:

> Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.

I do believe that the initial estimation was not far off. IICore 2.2 [1] contains 9,810 Han characters across all relevant countries, and even assuming that every character should be encoded in two ways (to account for simplified-traditional splits) and that IICore had some significant omissions, we could have just encoded ~40,000 characters. If everything had gone according to their plan.

The reality however was that PUA for rare characters was an old concept---pretty much every East Asian character set has one---and it had proven ineffective. PUA is by definition not interchangeable, but rare characters are still in use and they have to be interchanged anyway. So people avoided them like the plague and argued for bigger character sets instead, East Asian character sets ballooned, and by the time of Unicode they had become so large that the naive vision of Unicode 88 no longer worked.

[1] https://en.wikipedia.org/wiki/International_Ideographs_Core


Adding all of the dead languages bloated the system very quickly. Shame UCS2 turned out to be such a mistake. Beyond simply not being big enough, trading off space for compute turned out to be the wrong trade: CPU speeds increased much more rapidly than memory sizes and, especially, data transmission rates.


UCS2 was not a mistake. Unicode forgot its mission.


The idea of fixed-length characters was the mistake. Unicode is fine.


24- or 32-bit fixed length would probably be fine…

The issue is that doing unified codepoints right runs into the problem of accepting that zh-CN/zh-HK/zh-TW and ja-JP/kr-KR each use slightly different definitions for every letter in common use, which is not just memory inefficient but also has to do with the Taiwanese identity problem.

Much easier to keep it to just zh-CN and zh-TW (recently renamed zh-Hans and zh-Hant), with zh-CN completely double-defined with ja-JP, and kr-KR dealt with as footnotes on ja-JP, which is what they did.


> The issue is that doing unified codepoints right runs into the problem of accepting that zh-CN/zh-HK/zh-TW and ja-JP/kr-KR each use slightly different definitions for every letter in common use, which is not just memory inefficient but also has to do with the Taiwanese identity problem.

I'm not sure what exactly you wanted to say, but all six or seven variants of Han characters are substantially different from each other. Glyphwise there are three major clusters due to the different simplification history (or the lack thereof): PRC, Japan and everyone else. (Yes, Korean Hanja is much more similar to traditional Chinese.) Even if Taiwan didn't exist at all, there would have been three major clusters anyway.

Also pedantry corner: ko-KR, not kr-KR. And zh-TW wasn't renamed to zh-Hant; it is valid to say zh-Hant-CN if the PRC somehow wants to use traditional Chinese characters.


My apologies to Korean speakers for my ignorance.

What I was trying to say is: to fix Unicode, we must accept that however many of those variants there are, they don't interchange, and therefore each requires its own planes, which I believe has never been the Consortium's stance on the matter. And I think that should solve about half of the problems with Unicode by volume.


Yeah, everyone seems to agree that the Han unification was a mistake. But I also think its impact was exaggerated. The latest IVD contains 12,192 additional variants [1] for the original CJK unified ideograph block (currently 20,992 characters). They indeed represent most variants of common interest, so if the Han unification hadn't happened and we had assigned them in advance, we could have actually fit everything into the BMP! [2] Of course this also meant that further assignment would be almost impossible.

[1] https://unicode.org/ivd/data/2022-09-13/

[2] Unicode 2.0 had 38,885 assigned characters (almost in the BMP) and ~6,200 unallocable characters otherwise (mostly because they were private use characters since 1.0). 38,885 + 6,200 + 12,192 = 57,277 < 2^16. A careful block reassignment would have been required though.


IVS is going to work out if everyone switches to using it by default, but it's currently only used for special cases where the desired letter shape diverges from the language default (the "bridge-shaped high" cases) rather than being the default, and I don't see a perfect IVS world happening.

I mean, it's clearly tied to languages, so I think the only viable options are either we split languages into separate tables in those reserved-for-unknown-reasons planes, or we add a font selector instruction like ":flag_tw: :right_arrow: 今天是晴天 :left_arrow:", OR, if there's going to be a WWIII and one or more of zh-Hans or ja-JP becomes obsolete, it's not going to be an important issue.


"it is certainly possible, albeit uninteresting, to come up with an unreasonable definition of "character" such there are more than 65536 of them".

They probably didn't foresee the "unreasonable" and "uninteresting" use of emojis.


I was thinking more of Han Unification, which we now realize was a bad idea, and to this day makes it difficult if not impossible to correctly render a block of text that contains characters in more than one CJK language.


“A sequence of Unicodes (e.g. text file, string, or character stream) is called ‘Unitext’.”

I like this, shame it didn’t catch on.


"A book written in Unicodes is called a 'Unibook', and the totality of materials available in Unicodes is known as the 'Universe'"—in another timeline...


Looks like a unipost in a uniblog


Unicycle: A digital document or data stream in Unicode with no distinct beginning or end, allowing for continuous looping.


It made me raise my unibrow


I miss Uniqlock.


It's a disappointment that we never used different conceptual "types" for text files with different encodings. You can sort of get there with MIME types and encoding data attached, but imagine if you had a filename convention like '.txt.ASCII', '.txt.UTF8LE', or '.txt.ISO8859-1' so software would know exactly what to expect rather than sniffing byte-order marks or guessing encodings with heuristics and mojibake fixes.


I think if someone had tried to make encoding-in-filename a common convention it would have just resulted in people writing “.txt.ASCII” without understanding what that meant, and we’d have the same charset sniffing we have now plus another mess on top. (For example: when you have a “config.ini.UTF-16” and a “config.ini.WINDOWS-1252”, which should your program prefer?)

---

An anecdote: I was writing some code to send data to another company as an XML file. When I sent them a test file to validate, I got a complaint that it didn’t declare its encoding as UTF-8.

“Yes, our database uses UCS-2 for exports,” I replied. “Why, can you not read it?”

“Our system imported it just fine,” they explained, “but it needs to say ‘<?xml version="1.0" encoding="UTF-8"?>’ at the top and when I looked at it with a text editor yours says something else there instead.”

“Yes, it doesn’t say UTF-8 because it’s not UTF-8. Sounds like your system is OK with that, though?”

“Yes, the rest of the file is fine, but we can’t move forward with go-live until you conform to the example in our implementation guide...”

Actually changing the encoding proved difficult (for reasons I no longer recall), so I ended up shoving in a string replacement to make the file claim the wrong encoding, along with an apologetic comment; their end apparently did sniffing (and the BOM was unambiguous) so this worked out fine, but it goes to show just how confused people can get about what character encoding a file actually uses.
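
(For what it's worth, a BOM really is unambiguous, which is presumably why their importer coped. A rough sketch of that kind of sniffing, not their actual code:)

    def sniff(data: bytes) -> str:
        # The byte-order mark, if present, wins over whatever the XML
        # declaration claims; fall back to UTF-8 otherwise.
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"
        return "utf-8"

    doc = b"\xff\xfe" + '<?xml version="1.0" encoding="UTF-8"?><root/>'.encode("utf-16-le")
    print(sniff(doc))  # 'utf-16-le', regardless of what the declaration says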


I don't think having the extensions as a part of the file name was a good idea in the first place; I see it more as a practical compromise.

HTTP does not care about extensions in URLs at all, and you can serve any content type (plus information about its encoding) independently of the contents. Early file systems did not have capabilities this advanced, and through decades of backward compatibility we're kind of stuck with .jpg (or is it .jpeg? or .jpe?), .html (or .htm?), or .doc (is this MS Word? or RTF?).


> Early file systems did not have capabilities this advanced, and through decades of backward compatibility we're kind of stuck with

Some did. Some mainframes did. And Apple pre-OSX (HFS I think was the file system) did too.

The problem isn’t just backwards compatibility with DOS. It’s that most file transfer protocols don’t forward that kind of meta information. They don’t even share whether a file is executable, a system file or hidden (for those filesystems that do have an attribute for hidden files). So the file type ended up needing to be encoded in the file name, if only to preserve it.


FTP (and friends) indeed are a factor as well, but what stuck in my mind was how Forth initially was supposed to be named "Fourth", but the OS didn't support file names this long. https://en.wikipedia.org/wiki/Forth_(programming_language)#H...

HFS (and HFS+, and APFS) indeed has "resource forks", and supporting (or just not tripping over) them on non-Apple filesystems comes with a bag of surprises, even on Apple platforms. Just last week we've run into a problem with a script that globbed for "*.mp4" files to feed them into ffmpeg, but ran into and choked on the "._*.mp4"'s (which is how macOS stores the resource forks on FAT32/exFAT). The script was tested and worked fine on every Mac I tested it on, and everything was smooth until the customer plugged in an external HD to process videos off of. https://en.wikipedia.org/wiki/Resource_fork

So unfortunately the most reliable way of knowing exactly what you're working with, is lots of heuristics, and very defensive coding.
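
One defensive-coding sketch for that particular trap (the directory path is just a placeholder, and this isn't necessarily how the script was actually fixed):

    from pathlib import Path

    def real_mp4s(root):
        # Skip the AppleDouble "._*" companion files macOS leaves on
        # FAT32/exFAT volumes next to the actual media files.
        return [p for p in Path(root).rglob("*.mp4")
                if not p.name.startswith("._")]

    for clip in real_mp4s("/mnt/external"):  # placeholder path
        print(clip)  # safe to hand to ffmpeg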


Resource forks are something different; the parent is referring to type and creator codes, which are part of the metadata for a file. (Type codes were also used to identify types inside the resource fork, which is maybe what you’re thinking of?)

The type code didn’t carry encoding information, though I believe the later https://en.wikipedia.org/wiki/Uniform_Type_Identifier can.


I didn’t mean FTP specifically, just file transfer protocols in general. But I do agree that FTP, specifically, is a garbage protocol.


> UTF8LE

There is a big-endian UTF8?!


All UTF-8 is big-endian, at least for various ways to think of it. Is little endian UTF-8 a thing that gets used?


It's called "text" now. After all what text is really not Unicode these days?


Japanese, since Unicode is set up to display Japanese wrong (CJK unification - theoretically you can configure your application to display Chinese wrong instead, but no-one does that).


Interesting, I didn't know that.

But I read up on it and it sounds like Unicode is widely used in Japan and the issue is mainly academic when it comes to old texts. It also seems to have been mitigated with selectors:

> Since the Unihan standard encodes "abstract characters", not "glyphs", the graphical artifacts produced by Unicode have been considered temporary technical hurdles, and at most, cosmetic. However, again, particularly in Japan, due in part to the way in which Chinese characters were incorporated into Japanese writing systems historically, the inability to specify a particular variant was considered a significant obstacle to the use of Unicode in scholarly work. For example, the unification of "grass" (explained above), means that a historical text cannot be encoded so as to preserve its peculiar orthography. Instead, for example, the scholar would be required to locate the desired glyph in a specific typeface in order to convey the text as written, defeating the purpose of a unified character set. Unicode has responded to these needs by assigning variation selectors so that authors can select grapheme variations of particular ideographs (or even other characters).

From: https://en.m.wikipedia.org/wiki/Han_unification


> the issue is mainly academic when it comes to old texts.

If it were just an academic concern, there could have been other solutions. It is more about how much people care about the exact glyph on display, and the Japanese were particularly pedantic in my experience (even more so than the Chinese).

The Han unification itself was a well-intentioned but ultimately misguided effort to fit the massive set of Han characters into 16 bits. More evidence of how naive a 16-bit character set was.


It’s not an academic or legacy text issue at all, systems and apps that deal with Japanese texts were usually configured to do Chinese wrong and vice versa, which worked until computer OS images became universal.

The Chinese Hanzi codespace and Japanese Kanji codespace are effectively reused for each other’s sets. That’s what it is.


Title should be Unicode 88 (1998). The paper is a ten-year-anniversary article.


It's a 1988 article; the linked version is a scan of a 1998 reprinting.


Maybe when I have some time we'll have to go back to a practical and sane 16-bit Unicode, at least for programming. Not for archeology.

Only the common scripts, proper identifiers (without the bugs unfixed for decades), forbidden mark combinations for which a single symbol exists. Of course without emoji, but also without all the silly supplementary planes.


These are the situations where I restrict myself to 7-bit ASCII (and before the torches and pitchforks arrive, I'm not a native English speaker).



