
One thing I was confused about. The document says there are 7 byte types, but I thought UTF-8 was variable width up to only 4 bytes. Did I misunderstand something?



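Both are correct: the original UTF-8 encoding used sequences of up to 6 bytes and could encode values up to 2^31. But because UTF-16 can only reach 17 planes of 64K values each (the BMP plus 16 supplementary planes), Unicode has a hard limit of 0x110000 code points (U+0000 through U+10FFFF), which is a little over 2^20.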

This means UTF-8 sequences of more than 4 bytes can never represent a valid Unicode code point, even though they decode to a valid 31-bit numerical value. The same goes for 4-byte sequences that decode to anything above U+10FFFF.
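
To make the arithmetic concrete, here is a rough Python sketch of how many bytes a value would take under the original 1-to-6-byte scheme, next to the Unicode cap. The names `original_utf8_length` and `is_unicode_code_point` are made up for illustration and are not from the document being discussed; the range table is the one I recall from the older UTF-8 definition (RFC 2279).

```python
# Highest code point reachable through UTF-16 surrogate pairs.
MAX_UNICODE = 0x10FFFF

# Inclusive upper bounds for 1..6 byte sequences in the original scheme
# (assumed here to match the RFC 2279 layout).
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def original_utf8_length(value: int) -> int:
    """Return the sequence length (1..6) the original scheme would use."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for length, limit in enumerate(_LIMITS, start=1):
        if value <= limit:
            return length
    raise ValueError("value does not fit in 31 bits")


def is_unicode_code_point(value: int) -> bool:
    """True if value lies in the Unicode code space U+0000..U+10FFFF."""
    return 0 <= value <= MAX_UNICODE


if __name__ == "__main__":
    for v in (0x41, 0x20AC, 0x10FFFF, 0x110000, 0x200000, 0x7FFFFFFF):
        print(hex(v), original_utf8_length(v), is_unicode_code_point(v))
```

Note that 0x110000 still fits in a 4-byte pattern but is already outside Unicode, so the cutoff is not simply "4 bytes good, 5 bytes bad".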



