> For collaborative editing, the basic unit of measure is codepoints
I'd quibble that it's not the basic unit of measure so much as how changesets are represented. The user edits based on grapheme clusters. The final edit is then encoded using codepoints, which makes sense because a changeset amounts to a collection of basic string operations (splitting, concatenating, etc.). As you note, it would be undesirable for changesets to be aware of higher-level string representation details.
For that matter, as long as the format is restricted to one encoding, I don't see why the unit of a changeset can't just be a byte array.
I can see why it would happen to be a codepoint, since that might be ergonomic for the language. But it seems to me that clustering bytes into codepoints, like clustering codepoints into graphemes, is something the runtime takes care of, so a well-formed changeset would be valid at all three levels.
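To make that concrete, here is a rough Python sketch of the same insertion expressed once as a codepoint index and once as a UTF-8 byte offset. The function names are made up for illustration, and it assumes both sides have agreed on UTF-8:

```python
def insert_by_codepoint(text: str, index: int, new: str) -> str:
    # Python's str is indexed by codepoint, so plain slicing is enough.
    return text[:index] + new + text[index:]


def insert_by_byte(data: bytes, offset: int, new: bytes) -> bytes:
    # The same splice, but on the raw UTF-8 bytes.
    return data[:offset] + new + data[offset:]


doc = "na\u00efve"  # "naïve", with a precomposed U+00EF

# U+00EF takes 2 bytes in UTF-8, so codepoint index 3 is byte offset 4.
by_codepoint = insert_by_codepoint(doc, 3, "X")
by_byte = insert_by_byte(doc.encode("utf-8"), 4, b"X").decode("utf-8")
```

Both splices give the same string here, because byte offset 4 happens to sit on a codepoint boundary.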
The reason you mentioned (requiring every system to use the same string encoding) matters. Interpreting a UCS-2 byte offset in Rust (which uses UTF-8 internally) isn’t easy. Neither is the reverse: patching a JavaScript string based on a UTF-8 byte offset. It’s especially hard if you want to do better than an O(n) linear scan of the entire document’s contents.
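For illustration, here is roughly what the naive conversion looks like, sketched in Python. It assumes the incoming offset counts UTF-16 code units (the way JavaScript string indices do), and `utf16_to_utf8_offset` is a made-up name, not a library function:

```python
def utf16_to_utf8_offset(text: str, utf16_offset: int) -> int:
    """Map a UTF-16 code unit offset to a UTF-8 byte offset by scanning."""
    units = 0   # UTF-16 code units consumed so far
    nbytes = 0  # UTF-8 bytes consumed so far
    for ch in text:
        if units >= utf16_offset:
            break
        # Codepoints above U+FFFF need a surrogate pair in UTF-16.
        units += 2 if ord(ch) > 0xFFFF else 1
        nbytes += len(ch.encode("utf-8"))
    if units != utf16_offset:
        # Negative, past the end, or inside a surrogate pair.
        raise ValueError("offset is not on a codepoint boundary")
    return nbytes


doc = "a\U0001F44Db"  # 'a', a thumbs-up emoji, 'b'
offset_before_b = utf16_to_utf8_offset(doc, 3)  # 1 + 2 UTF-16 units
```

This walks the document from the start on every call, which is the O(n) cost I mean. Doing better needs some kind of index over the text that tracks both counts, and that is extra machinery every implementation would have to carry.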
Using byte offsets also makes it possible to express a change that corrupts the encoding, like inserting in the middle of a multi-byte codepoint. That goes against the principle of “make invalid data unrepresentable”. Your code is simpler if you don’t have to guard against this sort of thing. And you don’t have to worry about that if these invalid changes are impossible to represent in the patch format.
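A small Python sketch of the failure mode, plus the kind of guard a byte-offset format forces on every consumer (`is_codepoint_boundary` is a hypothetical helper, not part of any patch format):

```python
def is_codepoint_boundary(data: bytes, offset: int) -> bool:
    # UTF-8 continuation bytes look like 0b10xxxxxx; an offset pointing
    # at one of them falls inside a multi-byte codepoint.
    if offset == 0 or offset == len(data):
        return True
    if offset < 0 or offset > len(data):
        return False
    return (data[offset] & 0xC0) != 0x80


doc = "\u00e9".encode("utf-8")  # U+00E9 encodes as b'\xc3\xa9'

# A patch that says "insert 'x' at byte offset 1" splits the codepoint.
corrupted = doc[:1] + b"x" + doc[1:]

try:
    corrupted.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
```

With codepoint offsets there is simply no way to write that patch, so the check never has to exist.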