
See also: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...




I just read this over and it's a very dated, Windows-centric view. Several glaring errors: it glosses over the difference between UCS-2 and UTF-16, never mentions surrogate pairs for UTF-16 (it assumes only 65k code points), says UTF-8 can be up to 6 bytes (it can't; that was proposed but never standardized), suggests ASCII standardization dates to the 8088 (it's much older), mentions UTF-7 (don't), and says nothing about wchar_t changing size across platforms, Han unification, shaping, or normalization.
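For anyone who hasn't seen surrogate pairs in the wild, here's a minimal Python sketch (my own example, not from the article) showing a code point above U+FFFF taking two 16-bit code units in UTF-16:

    s = "😀"                       # U+1F600, well above U+FFFF
    utf16 = s.encode("utf-16-be")
    print(len(s), len(utf16))      # 1 code point, 4 bytes = two 16-bit units
    print(utf16.hex())             # d83dde00: high surrogate D83D + low surrogate DE00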


RFC 2279 says: "In UTF-8, characters are encoded using sequences of 1 to 6 octets." That's not technically a standard, but it was widely implemented.


UTF-8 was originally designed to handle code points up to 31 bits (U+7FFFFFFF), which is what the 5- and 6-octet sequences were for. It wasn't until later that the code point range was restricted to U+10FFFF, so that 4 octets are sufficient.
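To make that concrete, a quick Python check (my own sketch): the highest code point today is U+10FFFF, which encodes to 4 UTF-8 bytes, and nothing longer can be produced any more.

    # U+10FFFF is the ceiling; it encodes to exactly 4 UTF-8 bytes.
    print(len(chr(0x10FFFF).encode("utf-8")))   # 4
    # chr(0x110000) raises ValueError, so the old 5- and 6-octet
    # sequences from RFC 2279 simply can't come up.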


> mentions UTF-7 (don't)

Wait, what's so wrong about mentioning UTF-7? Wasn't it just a (proposed but abandoned) way to represent Unicode characters in MIME email?


Yeah, I meant don't use it. It seems to confuse things to even bring it up.


Kinda half-sad it didn't make it. It would have been cool to be able to "see" behind the curtains of UTF strings. As it is now, you can only paste a UTF string into a UTF-aware environment, and you also need the correct fonts, etc.

It would have been cool to be able to incrementally upgrade legacy environments to use UTF via UTF-7. Unaware parts would just have displayed the encoding. String lengths would have sort of worked.

(All of these things would of course have come with horrible drawbacks, so in that alternative universe I might have been cursing that we got UTF-7...)
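For the curious, Python still ships a UTF-7 codec, so you can get a feel for what "behind the curtains" would have looked like (a small sketch of my own):

    # UTF-7 stays in 7-bit ASCII; non-ASCII characters become +...- blocks
    # of base64-encoded UTF-16.
    s = "café"
    encoded = s.encode("utf-7")
    print(encoded)                       # b'caf+AOk-'
    print(encoded.decode("utf-7") == s)  # True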


UTF-8 is the sane incremental path from ASCII.
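Concretely (my own sketch): any pure-ASCII byte string is already valid UTF-8, byte for byte, which is what makes the incremental path painless.

    ascii_bytes = "plain ASCII".encode("ascii")
    utf8_bytes = "plain ASCII".encode("utf-8")
    print(ascii_bytes == utf8_bytes)   # True: ASCII is a strict subset of UTF-8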


Most issues are in old implementations and on Windows, so it's not completely off base.


Sure, but there is no way this should be used as a reference in 2019. It was wrong even in 2003 when it was written - Unicode 3.0 from 1999 defined the maximum number of code points, surrogate pairs, and code points above U+FFFF.

His single most important point still rings true, though: "It does not make sense to have a string without knowing what encoding it uses."
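A tiny Python illustration of that point (my own example): the same bytes decode to different text depending on which encoding you guess, so bytes alone aren't a string.

    data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
    print(data.decode("utf-8"))      # héllo
    print(data.decode("latin-1"))    # hÃ©llo -- same bytes, wrong guess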


Tom Scott's video is a great intro:

https://www.youtube.com/watch?v=MijmeoH9LT4



