
No, the 140 characters don't mean 140 bytes [0]. A quick skim suggests the rule is much more lenient than that: the text is first normalized to a canonical form, so that a combination like é is represented as a single code point (and not "e" plus a combining diacritic, which would be two), and then the code points are counted.

So, contrary to popular belief, the languages that could be discriminated against seem to be not Chinese or Japanese, but languages that can stack combining marks on each character, such as European languages.
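
To make the counting rule concrete, here is a minimal Python sketch. It assumes the "preset format" is Unicode NFC (canonical composition); the helper name tweet_length is made up for illustration and is not anyone's actual API.

```python
import unicodedata


def tweet_length(text: str) -> int:
    """Count code points after normalization (NFC is an assumption here)."""
    return len(unicodedata.normalize("NFC", text))


decomposed = "e\u0301"  # "e" followed by a combining acute accent

print(len(decomposed))               # 2 code points before normalization
print(tweet_length(decomposed))      # 1: NFC composes the pair into U+00E9
print(tweet_length("q\u0303"))       # 2: q + tilde has no precomposed form
print(tweet_length("日本語"))         # 3: one code point per character
print(len("日本語".encode("utf-8")))  # 9: the same text measured in bytes
```

The third line is the case that matters for the argument above: a base letter plus a combining mark with no precomposed equivalent still costs two code points after normalization, while each CJK character costs one no matter how many bytes it takes.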



