The reason you mentioned (requiring every system to use the same string encoding) matters. Interpreting a UCS-2 (UTF-16 code unit) offset in Rust (which uses UTF-8 internally) isn’t easy. Nor, symmetrically, is patching a JavaScript string based on a UTF-8 byte offset. It’s especially hard if you want to do better than an O(n) linear scan of the entire document’s contents.
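
To make the conversion cost concrete, here is a rough sketch of the naive approach in Rust, assuming the incoming offset counts UTF-16 code units (what JavaScript string indices count). The function name is made up for illustration. It walks the string from the start, so it is exactly the O(n) scan mentioned above, not a way around it.

```rust
/// Convert an offset counted in UTF-16 code units into a UTF-8 byte offset.
/// Hypothetical helper for illustration; it is a plain O(n) scan.
/// Returns None if the offset is past the end of the string or lands
/// in the middle of a surrogate pair.
fn utf16_to_utf8_offset(s: &str, utf16_offset: usize) -> Option<usize> {
    let mut units = 0; // UTF-16 code units consumed so far
    for (byte_idx, ch) in s.char_indices() {
        if units == utf16_offset {
            return Some(byte_idx);
        }
        if units > utf16_offset {
            // We stepped over the target: it pointed inside a surrogate pair.
            return None;
        }
        units += ch.len_utf16();
    }
    // The offset may legitimately point at the very end of the string.
    if units == utf16_offset {
        Some(s.len())
    } else {
        None
    }
}
```

Doing better than this generally means keeping some kind of index alongside the text that tracks both lengths, which is extra machinery every implementation would have to carry.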

Using byte offsets also makes it possible to express a change that corrupts the encoding, like inserting in the middle of a multi-byte codepoint. That goes against the principle of “make invalid data unrepresentable”. Your code is simpler if you don’t have to guard against this sort of thing, and you don’t have to if these invalid changes are impossible to represent in the patch format.
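
As a minimal sketch of what that could look like, assume a patch format that addresses positions in Unicode codepoints (Rust `char`s) rather than bytes. `InsertPatch` and `apply_insert` are hypothetical names, not taken from any particular library. Every in-range position maps to a codepoint boundary, so the only failure left to handle is an out-of-range position.

```rust
/// A hypothetical insert operation addressed in codepoints, not bytes.
struct InsertPatch {
    char_pos: usize, // number of codepoints that precede the insertion point
    text: String,
}

/// Apply the patch to `doc`. A position can be out of range, but it can
/// never point into the middle of a multi-byte codepoint.
fn apply_insert(doc: &mut String, patch: &InsertPatch) -> Result<(), String> {
    // Byte index of every codepoint boundary, including the end of the string.
    let byte_idx = doc
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(doc.len()))
        .nth(patch.char_pos)
        .ok_or_else(|| format!("position {} is out of range", patch.char_pos))?;
    // byte_idx is always a char boundary here, so insert_str cannot panic.
    doc.insert_str(byte_idx, &patch.text);
    Ok(())
}
```

With a byte-offset patch, the same function would also need an `is_char_boundary` check and an error path for the case where it fails. This version still pays for a linear scan to find the position.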



