
See also: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...




I just read this over and it's a very dated, Windows-centric view. Several glaring errors: it glosses over the difference between UCS-2 and UTF-16, never mentions surrogate pairs for UTF-16 (it assumes only 65k code points), says UTF-8 can be up to 6 bytes (it can't; that was proposed but never standardized), suggests ASCII standardization dates to the 8088 (it's much older), mentions UTF-7 (don't), and says nothing about wchar_t changing size across platforms, Han unification, shaping, or normalization.
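For anyone who hasn't seen surrogate pairs in the wild, here's a minimal Python sketch (my own example, not from the article) showing a code point above U+FFFF taking two 16-bit code units in UTF-16:

    s = "😀"                       # U+1F600, well above U+FFFF
    utf16 = s.encode("utf-16-be")
    print(len(s), len(utf16))      # 1 code point, 4 bytes = two 16-bit units
    print(utf16.hex())             # d83dde00: high surrogate D83D + low surrogate DE00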


RFC 2279 says: "In UTF-8, characters are encoded using sequences of 1 to 6 octets." That's not technically a standard, but it was widely implemented.


UTF-8 was originally designed to handle code points up to 31 bits (U+7FFFFFFF), which is what the 5- and 6-octet sequences were for. It wasn't until later that the code point range was restricted to U+10FFFF, so that 4 octets are sufficient.
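To make that concrete, a quick Python check (my own sketch): the highest code point today is U+10FFFF, which encodes to 4 UTF-8 bytes, and nothing longer can be produced any more.

    # U+10FFFF is the ceiling; it encodes to exactly 4 UTF-8 bytes.
    print(len(chr(0x10FFFF).encode("utf-8")))   # 4
    # chr(0x110000) raises ValueError, so the old 5- and 6-octet
    # sequences from RFC 2279 simply can't come up.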


> mentions UTF-7 (don't)

Wait, what's so wrong about mentioning UTF-7? Wasn't it just a (proposed but abandoned) way to represent Unicode characters in MIME email?


Yeah, I meant don't use it. It seems to confuse things to even bring it up.


Kinda half-sad it didn't make it. It would have been cool to be able to "see" behind the curtains of UTF strings. As it is now, you can only paste a UTF string into a UTF-aware environment, and you also need the correct fonts, etc.

It would have been cool to be able to incrementally upgrade legacy environments to use UTF via UTF-7. Unaware parts would just have displayed the encoding. String lengths would have sort of worked.

(All of these things would of course have come with horrible drawbacks, so in that alternative universe I might have been cursing that we got UTF-7...)
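For the curious, Python still ships a UTF-7 codec, so you can get a feel for what "behind the curtains" would have looked like (a small sketch of my own):

    # UTF-7 stays in 7-bit ASCII; non-ASCII characters become +...- blocks
    # of base64-encoded UTF-16.
    s = "café"
    encoded = s.encode("utf-7")
    print(encoded)                       # b'caf+AOk-'
    print(encoded.decode("utf-7") == s)  # True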


UTF-8 is the sane incremental path from ASCII.
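Concretely (my own sketch): any pure-ASCII byte string is already valid UTF-8, byte for byte, which is what makes the incremental path painless.

    ascii_bytes = "plain ASCII".encode("ascii")
    utf8_bytes = "plain ASCII".encode("utf-8")
    print(ascii_bytes == utf8_bytes)   # True: ASCII is a strict subset of UTF-8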


Most issues are in old implementations and on Windows, so it's not completely off base.


Sure, but there is no way this should be used as a reference in 2019. It was wrong even in 2003 when it was written - Unicode 3.0 from 1999 defined the maximum number of code points, surrogate pairs, and code points above U+FFFF.

His single most important point still rings true, though: "It does not make sense to have a string without knowing what encoding it uses."
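A tiny Python illustration of that point (my own example): the same bytes decode to different text depending on which encoding you guess, so bytes alone aren't a string.

    data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
    print(data.decode("utf-8"))      # héllo
    print(data.decode("latin-1"))    # hÃ©llo -- same bytes, wrong guess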


Tom Scott's video is a great intro:

https://www.youtube.com/watch?v=MijmeoH9LT4



