Hacker News

> multiple encodings for the same code point

I'm fairly sure a good fraction of these are due to the desire for compatibility:

PDF: https://www.unicode.org/versions/Unicode15.0.0/ch02.pdf#G110...

> Conceptually, compatibility characters are characters that would not have been encoded in the Unicode Standard except for compatibility and round-trip convertibility with other standards. Such standards include international, national, and vendor character encoding standards. For the most part, these are widely used standards that pre-dated Unicode, but because continued interoperability with new standards and data sources is one of the primary design goals of the Unicode Standard, additional compatibility characters are added as the situation warrants.

Round-trip convertibility is important for interoperability, which is important for getting the standard accepted. A data encoding standard that causes data loss is not acceptable, and trying to impose such on the world would have rightfully failed.
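To make the round-trip point concrete, here is a small sketch using Python's standard `unicodedata` module: U+FB01 LATIN SMALL LIGATURE FI exists purely for compatibility with older encodings, and NFKC normalization folds it back to the ordinary two-letter sequence.

```python
import unicodedata

# U+FB01 exists only for round-trip compatibility with pre-Unicode
# encodings that had a dedicated "fi" ligature character.
lig = "\ufb01"

print(unicodedata.name(lig))               # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKC", lig))  # "fi" -- compatibility decomposition
print(unicodedata.normalize("NFC", lig))   # unchanged: NFC preserves compatibility characters
```

NFC deliberately leaves compatibility characters alone, which is exactly what preserves lossless round-trips with the legacy standards; NFKC is the "fold them away" option for callers who don't need that.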




> new standards ... added as the situation warrants

Never ending madness.

> impose such on the world

The imposing is inflicted by the Unicode standard on everyone in the world who has no use for a never-ending stream of newly invented encodings for the same thing.


Unicode being widely adopted (which is what is happening) is the solution to the problem you seem to have of Unicode having to add more codepoints for compatibility with other encodings. Even if the pre-Unicode encodings never go away, they've already taken their respective bites out of the codepoint space, and they won't take up any more of the Consortium's time.

The only way this becomes a live issue again is if someone develops a whole new encoding that is incompatible with the latest version of Unicode, and it gets enough traction to land on anyone else's radar. I simply can't see that as very likely.


> Never ending madness.

[Unicode] didn't start the fire; it was always burning, since the world's been turning.

(I.e. the "never-ending madness" is in the continual desire of human cultures to invent new characters with unique semantics and then have them faithfully digitally encoded in a way that carries those semantics. Being upset that this makes the management of a universal text encoding a never-ending job is a bit like being upset that the Library of Congress always needs reorganizing and expanding because it keeps getting new books.)


It's different in that nobody expects anyone to implement their own copy of the LoC. This constant expansion of the character set puts an enormous burden on a very large collection of developers. The inevitable happens: they just don't bother, making full Unicode support a pipe dream.


I'm perplexed as to what you think "full support" for a novel Unicode codepoint entails.

Most of the properties of a novel codepoint that matter are attributes that the codepoint has defined within various Unicode-Consortium-defined data files. If your custom implementation of Unicode support already parses those data files, then a new codepoint is essentially "absorbed" automatically as soon as the new version of the Unicode data files is made available to your implementation.
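As a sketch of what "absorbed from data files" means in practice: Python's `unicodedata` module is one such consumer of the Consortium's data files (UnicodeData.txt and friends), and every property below is a table lookup against whichever version of those files the runtime was built with.

```python
import unicodedata

# Which version of the Unicode data files this runtime's tables were built from.
print(unicodedata.unidata_version)

ch = "\u00c9"  # U+00C9
print(unicodedata.name(ch))           # LATIN CAPITAL LETTER E WITH ACUTE
print(unicodedata.category(ch))       # Lu (uppercase letter)
print(unicodedata.decomposition(ch))  # 0045 0301 -- canonical decomposition, straight from the tables
```

Ship the implementation newer tables (for CPython, a newer build) and codepoints added in the meantime get names, categories, and decompositions with no per-codepoint code changes.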

Are you just complaining that new codepoints don't automatically receive glyphs in major fonts?

To that, I would say: who cares? As I just said upthread — character encoding is about semantics, not about rendering. A character doesn't have to be renderable to be useful! The character is still doing its job — "transmitting meaning into the future" — when it's showing up as a U+FFFD replacement character! Screen readers can still read it! Search engines can still index on it! People who want to know what the character was "meant to mean" can copy and paste it into a Unicode database website! It still sorts and pattern-matches correctly! And so forth.
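The "useful without a glyph" claim is easy to demonstrate. U+13000 EGYPTIAN HIEROGLYPH A001 is missing from most fonts, yet everything except rendering works on it:

```python
import unicodedata

s = "\U00013000"  # EGYPTIAN HIEROGLYPH A001 -- rarely present in installed fonts

# Lossless round-trip through UTF-8: no glyph required.
assert s.encode("utf-8").decode("utf-8") == s

# Pattern matching and searching work on the codepoint itself.
assert s in f"papyrus fragment: {s} (scribe unknown)"

# The semantics are still queryable by name.
print(unicodedata.name(s))  # EGYPTIAN HIEROGLYPH A001
```

Whether your terminal shows a hieroglyph or a replacement box, the character is carrying its meaning the whole way through.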

Rendering the codepoint as a glyph — if it even makes sense to render it standalone as a glyph (which for some codepoints it doesn't, e.g. the regional-indicator "country code" characters that only have graphical representations in ligatured pairs) — can come at some arbitrary later point, when there's enough non-niche, non-academic, supply-side usage of the codepoint that there arises real demand to see it rendered as something. Until then, just documenting the semantics as a codepoint and otherwise leaving it un-rendered works fine for the needs of the anthropologists and archivists who "legibilize" existing proprietary-encoding-encoded texts through Unicode conversion.
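The regional-indicator case mentioned above is a neat illustration: each indicator is an ordinary, individually encoded codepoint, but only the ligatured pair gets a flag glyph in practice.

```python
import unicodedata

# Two regional-indicator codepoints; fonts typically ligature the
# pair into a flag, while each codepoint alone has no standard glyph.
flag = "\U0001F1FA\U0001F1F8"

print(len(flag))                       # 2 -- it's two codepoints, not one
print(unicodedata.name(flag[0]))       # REGIONAL INDICATOR SYMBOL LETTER U
print(unicodedata.name(flag[1]))       # REGIONAL INDICATOR SYMBOL LETTER S
```

The semantics (which country) live entirely in the codepoint sequence; the flag image is a rendering-layer convention on top.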


How hard is it to use AI to generate fonts on demand?


That still needs an encoding that carries enough information to reconstruct a glyph from.





