Hacker News

> As to the idea that programmers who want nothing to do with I18N are the most common user or Unicode, that's rather insulting to the real users: the end users.

Oh for Pete's sake. Unicode / ASCII / ISO 8859-1 are encodings that computers, and thus programmers, use to represent text. Users don't read Unicode, they read text. They never, ever have to deal with Unicode, and most wouldn't know what it was if it leapt up and hit them in the face. So if Unicode justified adding features to accommodate these non-existent users, I guess that explains how we got into this mess.




They read text in many scripts (because many writers use more than one script, and many readers can read more than one script). Without Unicode you can usually use TWO scripts: ASCII English + one other (e.g., ISO8859-*, SHIFT_JIS, ...). There are NO OTHER CODESETS than Unicode that can support THREE or more scripts, or any combination of TWO where one isn't ASCII English. For example, users on the Indian subcontinent are very likely to speak multiple languages and use at least two scripts other than ASCII English. Besides normal people all over the world who have to deal with multiple scripts, there are also scholars, diplomats, support staff at many multi-nationals, and many others who need to deal with multiple scripts.

Whether you like it or not, Unicode exists to make USERS' lives better. Programmers?? Pfft. We can deal with the complexity of all of that. User needs, on the other hand, simply cannot be met reasonably with any alternatives to Unicode.


> Whether you like it or not, Unicode exists to make USERS' lives better.

Users had been using multiple scripts long before computers. When computers came along, those users quite reasonably demanded to be able to write the same scripts. This created a problem for the programmers. The obvious solution is a universal set of code points - and ISO 10646 was born. It was not rocket science. But if it had not come along, some other hack / kludge would have been used, because the market is too large to be abandoned by the computer companies. That would have put us programmers in a special kind of hell, but I can guarantee the users would not have known about it, let alone cared.

Oddly, the encoding schemes proposed by ISO 10646 universally sucked. Unicode walked into that vacuum with their first cock-up - 16 bits was enough for anybody. It was not just dumb because it was wrong - it was dumb because they didn't propose a unique encoding. They gave us BOM markers instead. Double fail. They didn't win because the one thing they added, their UCS-2 encoding, was any better than what came before. They somehow managed to turn it into a Europe vs USA popularity contest. Nicely played.
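The BOM problem mentioned above is easy to see from any language's standard library. A minimal sketch (Python is used here purely as a neutral illustration; the thread itself is not about Python): without a declared byte order, a BOM gets prepended so the decoder can tell which end comes first, and the two explicit byte orders produce mutually incompatible bytes for the same text.

```python
# Encoding without declaring a byte order prepends a BOM
# (either FF FE for little-endian or FE FF for big-endian).
data = "hi".encode("utf-16")
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")

# The same two characters, explicitly little- vs big-endian:
# identical text, incompatible byte sequences.
assert "hi".encode("utf-16-le") == b"h\x00i\x00"
assert "hi".encode("utf-16-be") == b"\x00h\x00i"
```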

Then Unicode and 10646 became the same thing. They jointly continued on in the same manner as before, inventing new encodings to paper over the UCS-2 mistake. Those new encodings all universally sucked. The encoding we programmers use today, UTF-8, was invented by, surprise, surprise, a programmer who was outside of Unicode / 10646 group think. It was popularised at a programmers' conference, USENIX, and from there on it was obvious it was going to be used regardless of what Unicode / 10646 thought of it, so they accepted it.

If Perl6 is any indication, the programmers are getting the shits with the current mess. Perl6 has explicitly added functions that treat text as a stream of graphemes rather than Unicode code points. The two are the same, of course, except when your goddamned compositions rear their ugly heads - all those functions do is eliminate that mess. Maybe you should take note. Some programmers have already started using something other than Unicode because it makes their jobs easier. If it catches on, you are going to find out real quickly just how much the users care about the encoding scheme computers use for text.
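The code-point/grapheme mismatch being complained about here is easy to demonstrate. A small sketch in Python (chosen only for illustration, since its strings are sequences of code points, exactly the model being criticised): one user-perceived character can be two code points, and naive code-point operations silently split it.

```python
# One grapheme ("é"), written as two code points:
# LATIN SMALL LETTER E + COMBINING ACUTE ACCENT.
s = "e\u0301"
assert len(s) == 2       # the code-point count, not what a reader sees
assert s[:1] == "e"      # naive slicing splits the grapheme apart
```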

None of this is to trivialise the task of assigning code points to graphemes. It's a huge task. But for Pete's sake don't over-inflate your egos by claiming you are doing some great service for mankind. The only things on the planet that use the numbers Unicode assigns to characters are computers. Computers are programmed by one profession - programmers. Your entire output is consumed by that one profession. Yet for all the world what you've written here seems to say Unicode has some higher purpose, and you are optimising for that, whatever it may be. For God's sake, come down to earth.


Thank you for your reference to Perl 6. Please note that Perl 6 has been renamed to "Raku" (https://raku.org) using #rakulang as a tag for social media.

Please also note that all string handling in Raku (the `Str` class) is based on graphemes, not just some added functions. This means, e.g., that a newline is always 1 grapheme (regardless of whether it was a CR or LF or CRLF), and that for é there is no difference between é (LATIN SMALL LETTER E WITH ACUTE, aka 0x00E9) and é (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT, aka 0x0065, 0x0301).
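For readers outside Raku, the equivalence between those two spellings of é is exactly what Unicode normalization (NFC/NFD) defines; a quick sketch using Python's standard `unicodedata` module, purely as an illustration of the composed/decomposed pair named above:

```python
import unicodedata

composed = "\u00e9"      # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # é as base letter + COMBINING ACUTE ACCENT

# Different code-point sequences, same grapheme:
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Raku's grapheme-based `Str` makes this equality hold by default; in code-point-based languages you must normalize explicitly before comparing.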

Please note that for any combination between characters and combiners for which there does not exist a composed version in Unicode, Raku will create synthetic codepoints on the fly. This ensures that you can consider your texts as graphemes, but still be able to roundtrip strange combinations.
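Raku's synthetic codepoints aren't something other runtimes expose, but the gap they paper over can be shown with Python's standard `unicodedata`: some base + combiner pairs simply have no precomposed code point, so NFC normalization cannot collapse them to one code point the way it can for é.

```python
import unicodedata

# "x" + COMBINING ACUTE ACCENT has no precomposed form in Unicode,
# so even NFC leaves it as two code points.
assert len(unicodedata.normalize("NFC", "x\u0301")) == 2

# Compare: "e" + the same accent does have a precomposed form (é).
assert len(unicodedata.normalize("NFC", "e\u0301")) == 1
```

For such pairs a code-point-based string model can never see "one character"; Raku's on-the-fly synthetic codepoints are its way of keeping the one-grapheme view anyway.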



