
> If you were counting code points, a surrogate pair would be 1. If it's two, you're counting code units.

And to be explicit as to why that is: surrogate pairs are a feature of the UTF-16 encoding, where two 16-bit code units ("code units" being the lexemes of the decoder) decode to a single Unicode codepoint.
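To make the code-unit versus codepoint distinction concrete, here is a minimal Python sketch. The helper names `utf16_code_units` and `decode_surrogate_pair` are made up for illustration; the arithmetic is the standard UTF-16 surrogate formula.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of `text` when encoded as UTF-16."""
    raw = text.encode("utf-16-le")  # little-endian, no BOM
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate code unit into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of payload, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


emoji = "\U0001F600"             # one codepoint outside the BMP
units = utf16_code_units(emoji)  # two UTF-16 code units

print(len(emoji))                        # codepoints: 1
print(len(units))                        # UTF-16 code units: 2
print(hex(decode_surrogate_pair(*units)))  # 0x1f600
```

Python's `str` indexes by codepoint, which is why `len(emoji)` is 1 here; a language whose strings are UTF-16 underneath would report 2 for the same text.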

I feel like everything to do with Unicode is clearer if you never bring up how it's encoded. Alternately, pretend for the sake of your tutorial that everybody uses UTF-32, so you can just talk about flinging single-code-unit codepoints around as machine words, the same way ASCII flings single-code-unit codepoints around as bytes. That's basically the model Unicode text-handling libraries present anyway, whatever they store underneath.
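As a rough illustration of the "pretend it's UTF-32" model, this sketch unpacks a UTF-32 byte string into 32-bit words and checks that each word is exactly one codepoint. The sample string is arbitrary.

```python
import struct

text = "a\u00e9\U0001F600"  # ASCII letter, Latin-1 letter, non-BMP emoji

raw = text.encode("utf-32-le")  # little-endian, no BOM
# Every codepoint occupies exactly one 32-bit code unit.
words = list(struct.unpack("<%dI" % (len(raw) // 4), raw))

print([hex(w) for w in words])              # ['0x61', '0xe9', '0x1f600']
print(words == [ord(ch) for ch in text])    # True
```

No decoding state is needed: word `i` is codepoint `i`, which is the same relationship ASCII has between bytes and characters.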

After all, from the perspective of the Unicode standard itself, all the stuff below the abstraction of "a codepoint" is implementation detail.

The standard has to let the abstraction leak in a few places, like surrogate pairs or BOMs, but these leaks aren't what the Unicode standard is supposed to be "about". They should really be thought of as features of the encodings that have found their way up a layer, rather than features of Unicode per se. Heck, even the categorization of codepoint ranges into "planes" is just an artifact of UTF-16. Putting these artifacts front and center in a discussion of "what Unicode is" is IMHO entirely backwards.
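For what it's worth, the arithmetic linking planes to UTF-16 can be sketched in a few lines. This only shows that the largest value a surrogate pair can express is U+10FFFF, which caps the code space at 17 planes of 65,536 codepoints each; `plane_of` is a hypothetical helper name.

```python
# Largest codepoint a surrogate pair can express: both 10-bit payloads all ones.
MAX_VIA_SURROGATES = 0x10000 + (0x3FF << 10) + 0x3FF

# A plane is a block of 0x10000 codepoints, so the plane index is the high bits.
PLANE_COUNT = (MAX_VIA_SURROGATES >> 16) + 1


def plane_of(codepoint: int) -> int:
    """Return the plane number (0 is the BMP) for a codepoint."""
    if not 0 <= codepoint <= MAX_VIA_SURROGATES:
        raise ValueError("outside the Unicode code space")
    return codepoint >> 16


print(hex(MAX_VIA_SURROGATES))  # 0x10ffff
print(PLANE_COUNT)              # 17
print(plane_of(0x1F600))        # 1
```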



