No, where Unicode is complicated is where the Unicode people decided to make it ...

cryptonector · on Dec 3, 2019

You're demonstrably wrong.

Most complexity in Unicode derives from:

  - real complexity in human scripts
  - politics

neither of which is something that Unicode could have avoided. Complexity in human scripts necessarily leads to complexity in Unicode. Not having Unicode at all would be much worse than Unicode could possibly seem to you -- you'd have to know the codeset/encoding of every string/file/whatever, and never lose track. Not having politics affect Unicode is a pipe dream, and politics has unavoidably led to some duplication.

Confusability is NOT a problem created by Unicode, but by humans. Even just within the Latin character set there are confusables (like 1 and l, which often look the same, so much so that early typewriters exploited such similarities to reduce part count).

Nor were things like normalization avoidable, since even before Unicode we had combining codepoints: ASCII was actually a multibyte Latin character set, where, for example, á could be written as a<BS>' (or '<BS>a), where <BS> is backspace. Many such compositions survive today -- for example, X11 Compose key sequences are based on those ASCII compositions (sadly, many Windows compositions aren't). The moment you have combining marks, you have an equivalence (normalization) problem.
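To make the equivalence problem concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `canonically_equal` is made up for illustration; the point is only that a precomposed á and an a followed by a combining acute accent are different code point sequences that normalization treats as the same text.

```python
import unicodedata

# Two encodings of the same user-perceived character "á".
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE (one code point)
combining = "a\u0301"    # "a" followed by COMBINING ACUTE ACCENT (two code points)

def canonically_equal(s: str, t: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", s) == unicodedata.normalize("NFC", t)

# A raw code point comparison sees two different strings.
print(precomposed == combining)

# After normalization the two spellings compare equal.
print(canonically_equal(precomposed, combining))

# NFD goes the other way and splits the precomposed form apart.
print(unicodedata.normalize("NFD", precomposed) == combining)
```

Whether NFC or NFD is the better comparison form depends on the application; the sketch only shows that some normalization step is needed once combining marks exist.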

Emojis did add a bunch of complexity, but... emojis are essentially a new script that was created by developers in Japan. Eventually the Unicode Consortium was bound to have to standardize emojis for technical and political reasons.

Of course, some mistakes were made: UCS-2/BMP, UTF-16, CJK unification, and others. But the lion's share of Unicode's complexity is not due to those mistakes, but to more natural reasons.

kazinator · on Dec 3, 2019

[flagged]

rswail · on Dec 3, 2019

And the alternative would be what exactly?

happytoexplain · on Dec 3, 2019

As a developer who's been working intimately with user-facing strings for years, I have to disagree in the strongest possible terms. Unicode is one of the borderline zero standards that is almost angelic in its purity, with only an extremely few things I think might have served better if done differently.

kazinator · on Dec 3, 2019

> As a developer who's been working intimately with user-facing strings for years, ...

User-facing is easy; things go downhill when users have system-facing strings of their own, and some of those strings become other-user-facing strings.

> with only an extremely few things I think might have served better if done differently.

Thus, in spite of disagreeing in the strongest possible terms, you do have some nits to pick.

A "few things" could be far-reaching. For instance, allowing the same semantic character to be encoded in more than one way can count as "one thing". If someone happens to think this is the only problem with Unicode, then that's "extremely few things". Yet, it's pretty major.
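A rough Python sketch of how this bites with system-facing strings: the same letter Å can arrive as three different code point sequences, and a lookup keyed on the raw string treats them as three different keys. The `registry` dictionary and the `normalized_key` helper are hypothetical, and using NFC as the key form is an assumption, not the only possible choice.

```python
import unicodedata

# Three ways to encode what a reader sees as the same letter "Å".
spellings = {
    "precomposed": "\u00c5",     # LATIN CAPITAL LETTER A WITH RING ABOVE
    "angstrom_sign": "\u212b",   # ANGSTROM SIGN
    "decomposed": "A\u030a",     # "A" + COMBINING RING ABOVE
}

def normalized_key(s: str) -> str:
    """Hypothetical helper: fold canonically equivalent spellings to one key."""
    return unicodedata.normalize("NFC", s)

# Without normalization, each spelling is a distinct dictionary key.
raw_keys = set(spellings.values())

# With normalization, all three collapse to a single key.
folded_keys = {normalized_key(s) for s in spellings.values()}

# A registry keyed on the normalized form finds the entry for any spelling.
registry = {normalized_key(spellings["precomposed"]): "first user"}
found = registry.get(normalized_key(spellings["angstrom_sign"]))

print(len(raw_keys), len(folded_keys), found)
```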

jrochkind1 · on Dec 3, 2019

Your idea that either there are NO "nits to pick" (things that could have been done better in a standard, complete perfection), OR it means that the standards-makers "decided to make it complicated to bolster their egos" -- is ABSOLUTELY INSANE.

kazinator · on Dec 3, 2019

My point isn't that there must be no nits to pick, but that look, even a self-proclaimed Unicode cheerleader who disagrees with me in the "strongest possible" terms still finds it necessary to mention that he or she has some.

saagarjha · on Dec 3, 2019

> No, where Unicode is complicated is where the Unicode people decided to make it complicated to bolster their egos, to the detriment of everyone downstream of them.

Where?

ori_b · on Dec 3, 2019

Interesting statement. Other than maybe Han unification, what would you do differently?

cryptonector · on Dec 3, 2019

u/kazinator is decidedly wrong (see above), but besides not trying CJK unification, I wish we had had UTF-8 from day 0, no UCS-2, no UTF-16, no BMP, no codespace limit as low as 21 bits. That's mostly it. If we could have stood not having precompositions, I'd rather not have had those either, but that would have required a rather large leap in functionality in input modes in the late 80s or early 90s, which would not have been feasible.
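For anyone who wants to see those limits, a small Python sketch follows. The helper names are made up for illustration. It shows that a character outside the BMP needs two UTF-16 code units (a surrogate pair) but is just a longer sequence in UTF-8, and that the codespace tops out at U+10FFFF, which fits in 21 bits.

```python
# Highest code point Unicode allows; the ceiling comes from what UTF-16
# surrogate pairs can address.
MAX_CODE_POINT = 0x10FFFF

def utf16_code_units(ch: str) -> int:
    """Hypothetical helper: number of 16-bit code units UTF-16 needs."""
    return len(ch.encode("utf-16-be")) // 2

def utf8_length(ch: str) -> int:
    """Hypothetical helper: number of bytes UTF-8 needs."""
    return len(ch.encode("utf-8"))

bmp_char = "\u00e1"         # inside the BMP
astral_char = "\U0001F600"  # outside the BMP (an emoji)

# Inside the BMP: one UTF-16 unit. Outside: a surrogate pair, i.e. two units.
print(utf16_code_units(bmp_char), utf16_code_units(astral_char))

# UTF-8 has no surrogate mechanism; it just uses more bytes.
print(utf8_length(bmp_char), utf8_length(astral_char))

# The whole codespace fits in 21 bits, and Python rejects anything above it.
print(MAX_CODE_POINT.bit_length())
try:
    chr(MAX_CODE_POINT + 1)
except ValueError as exc:
    print("rejected:", exc)
```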

kazinator · on Dec 3, 2019

kazinator is wrong, decidedly so, but let me take this opportunity to opine my own impractical list of gripes that require going back in history and redesigning Unicode in fundamental ways ...

What a comic thread!

Dylan16807 · on Dec 3, 2019

That 'list of gripes' is all about the single change of having UTF-8 from the start. It's not multiple separate problems.

Also why are you implying that any gripes automatically prove you right? It's kind of ridiculous to suggest that not having UTF-8 was people "deciding to make it complicated to bolster their egos".

qtplatypus · on Dec 3, 2019

What we are contesting is your characterisation that the people in charge of Unicode added complexity in order to puff up their egos, rather than making decisions that, with the benefit of our present knowledge, were the incorrect ones.