
When UTF-8 was first defined, nobody knew how big the Unicode range was going to be, so it was defined as a 1-6 byte encoding that could encode any 31-bit codepoint, up to U+7FFFFFFF.
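For reference, here's a minimal sketch of that original 1-6 byte scheme (the pre-2003, RFC 2279-era bit layout; the function name is my own, purely for illustration):

    def utf8_encode_original(cp: int) -> bytes:
        # Encode a codepoint under the original 1-6 byte UTF-8 rules,
        # which reach up to U+7FFFFFFF.  Illustrative sketch only;
        # modern UTF-8 stops at 4 bytes / U+10FFFF.
        if cp < 0x80:
            return bytes([cp])                      # ASCII: one byte
        # (upper limit, lead-byte marker) for 2..6 byte sequences
        forms = [(0x800, 0xC0), (0x10000, 0xE0), (0x200000, 0xF0),
                 (0x4000000, 0xF8), (0x80000000, 0xFC)]
        for n, (limit, lead) in enumerate(forms, start=2):
            if cp < limit:
                tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                        for i in range(n - 2, -1, -1)]
                return bytes([lead | (cp >> (6 * (n - 1)))] + tail)
        raise ValueError("codepoint does not fit in 31 bits")

    >>> utf8_encode_original(0x3FFFFFF).hex()  # a five-byte sequence, lead byte 0xfb
    'fbbfbfbfbf'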

When Unicode was capped at U+10FFFF (because that's the largest value a UTF-16 surrogate pair can encode), UTF-8 was revised (in RFC 3629) to be a 1-4 byte encoding that ends in the same place.
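Both halves of that are easy to check from a CPython 3 prompt (a quick interpreter transcript, output abbreviated):

    >>> chr(0x10FFFF).encode('utf-8')   # the very last codepoint fits in 4 bytes
    b'\xf4\x8f\xbf\xbf'
    >>> chr(0x110000)                   # nothing past the cap even exists
    Traceback (most recent call last):
      ...
    ValueError: chr() arg not in range(0x110000)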

Python clearly implements UTF-8 in a way that uses at most four bytes per codepoint (why support five- and six-byte sequences if they'll never be used?). I think what we're seeing in '\xfb\x9b\xbb\xaf' is four bytes out of a five-byte sequence: under the original rules, 0xFB (0b11111011) is the lead byte of a five-byte sequence, and the following 0x9B, 0xBB, 0xAF are continuation bytes, so a fifth byte is missing.
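Consistent with that, CPython 3's decoder rejects 0xFB outright as a start byte (interpreter transcript again, output abbreviated):

    >>> b'\xfb\x9b\xbb\xaf'.decode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 0: invalid start byte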



