Combining characters are their own Unicode codepoints, so they count towards length. The beauty of this approach is that it's simple and objective.
If you had a list of 5 DOM Element objects and one DOM Attr object, the length of that list is 6. It's nonsensical to say "the Attr object modifies an Element object, so it's not really in the list".
Going byte by byte is useless. You can't do anything with a single byte of a Unicode codepoint (unless, by luck, the codepoint happens to be encoded as a single byte).
A codepoint is the smallest useful unit of a Unicode string. It is a character, and you can do all the character things with it.
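To illustrate the point (a Python sketch, since Python's `str` iterates by codepoint): splitting a multi-byte codepoint into its UTF-8 bytes yields values that mean nothing on their own.

```python
s = "caf\u00e9"  # "café", with precomposed é (U+00E9)

# Iterating the str yields whole codepoints -- each one is a usable character.
print([c for c in s])           # ['c', 'a', 'f', 'é']

# Iterating the UTF-8 bytes splits é into two halves (0xC3, 0xA9) that
# are meaningless individually.
print(list(s.encode("utf-8")))  # [99, 97, 102, 195, 169]
```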
If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.
> If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.
Nope. In order to deal with special casings you will have to handle mappings that span multiple codepoints, at which point it's no more work than operating on whatever the code units are.
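Two well-known special casings make this concrete. Python's `str.upper()`/`str.lower()` already apply the Unicode full case mappings, so they can be used to demonstrate:

```python
# One codepoint can uppercase to two: German sharp s (ß).
# A naive "map each codepoint through a table" toUpper can't express this.
print("straße".upper())             # STRASSE  (ß -> SS)
print(len("ß"), len("ß".upper()))   # 1 2

# Casing can also be context-sensitive: Greek capital sigma lowercases
# to ς at the end of a word but σ elsewhere, so you can't decide it
# from a single codepoint in isolation.
print("ΟΔΟΣ".lower())               # οδος  (ends in final sigma ς)
```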
> Combining characters are their own unicode codepoint, so they count towards length.
This is incredibly arbitrary - it depends entirely on what "length" means for a particular use case. From the user's perspective there might be only a single character on the screen.
Any nontrivial string operation must be based around grapheme clusters, otherwise it is fundamentally broken. Codepoints are a useful encoding agnostic way to handle basic (split & concatenate) operations, but the method by which those offsets are determined needs to be grapheme cluster aware. Raw byte offsets are encoding specific and only really useful for allocating the underlying storage.
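To show the idea without pulling in a full UAX #29 implementation, here is a deliberately simplified, stdlib-only sketch (the `graphemes` name is mine) that folds combining marks into the preceding base character. Real grapheme cluster segmentation must follow UAX #29 and also handle ZWJ emoji sequences, Hangul jamo, regional indicators, and more (e.g. via the third-party `regex` module's `\X` pattern).

```python
import unicodedata

def graphemes(s: str) -> list[str]:
    """Simplified clustering: attach combining marks to the preceding
    base character. NOT a full UAX #29 implementation."""
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # fold the mark into the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

s = "e\u0301a"            # 'e' + COMBINING ACUTE ACCENT + 'a'
print(len(s))             # 3 codepoints
print(len(graphemes(s)))  # 2 user-perceived characters
```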
Should the letter plus a combining character count as one (I think so), or two characters? Should you normalize before counting length? And so on.
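The normalization question matters because it changes the codepoint count itself. A quick Python illustration with the stdlib `unicodedata` module:

```python
import unicodedata

decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # precomposed é (U+00E9)

print(len(decomposed))            # 2 codepoints
print(len(composed))              # 1 codepoint
print(decomposed == composed)     # False -- yet they render identically
```

And normalization alone doesn't settle it: many clusters (e.g. ZWJ emoji sequences) have no precomposed form, so even NFC-normalized codepoint counts can disagree with what the user sees.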