But what about combining characters? https://en.wikipedia.org/wiki/Combining_Dia...

dahfizz · on March 26, 2021

Combining characters are their own Unicode code points, so they count towards length. The beauty of this approach is that it's simple and objective.

If you had a list of 5 DOM Element objects and one DOM Attr object, the length of that list is 6. It's nonsensical to say "The Attr object modifies an Element object, so it's not really in the list".
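
For illustration, here is a minimal sketch of that counting rule in Python, whose `len()` on a `str` counts code points. The variable names are made up for this example.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Both render the same, but the code point counts differ.
print(len(composed))     # 1
print(len(decomposed))   # 2

# NFC normalization folds the pair back into the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So "length in code points" is objective, but two strings that look identical can still report different lengths unless they are normalized first.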

shawnz · on March 26, 2021

Going by bytes is also simple and objective. And also totally arbitrary, just like going by codepoints.

Which is the most useful for dealing with strings in practice though? Is either interpretation useful at all?

dahfizz · on March 26, 2021

Going byte by byte is useless. You can't do anything with a single byte of a Unicode code point (unless, by luck, the code point is encoded in a single byte).
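
A small Python sketch of the byte problem, assuming UTF-8, UTF-16 and UTF-32 as the encodings being compared (the thread doesn't name one):

```python
euro = "\u20ac"  # "€", one code point

# The byte count depends entirely on the chosen encoding.
print(len(euro))                       # 1 code point
print(len(euro.encode("utf-8")))       # 3 bytes
print(len(euro.encode("utf-16-le")))   # 2 bytes
print(len(euro.encode("utf-32-le")))   # 4 bytes

# A single byte of a multi-byte sequence is not decodable on its own.
first_byte = euro.encode("utf-8")[:1]
try:
    first_byte.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)                       # False

# The "by luck" case: ASCII code points fit in one UTF-8 byte.
print("A".encode("utf-8")[:1].decode("utf-8"))  # A
```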

The code point is the smallest useful unit of a Unicode string. It is a character, and you can do all the character things with it.

If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.

masklinn · on March 26, 2021

> If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.

Nope. In order to deal with special casings you will have to span multiple codepoints, at which point it's no more work to operate on whatever the code units are.
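
Two examples of such special casings, sketched with Python's built-in `str.upper()` and `str.lower()`. The sample words are chosen for this example only.

```python
# One code point can uppercase to two: U+00DF "ß" becomes "SS".
word = "stra\u00dfe"            # "straße"
upper = word.upper()
print(upper)                    # STRASSE
print(len(word), len(upper))    # 6 7

# Lowercasing capital sigma depends on its neighbours:
# at the end of a word it becomes final sigma (U+03C2), otherwise U+03C3.
final = "\u0391\u03a3".lower()      # "ΑΣ"
initial = "\u03a3\u0391".lower()    # "ΣΑ"
print(final)     # ας
print(initial)   # σα
```

Neither result can be produced by a function that maps each code point to exactly one code point without looking at its context.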

d110af5ccf · on March 26, 2021

> Combining characters are their own unicode codepoint, so they count towards length.

This is incredibly arbitrary - it depends entirely on what "length" means for a particular use case. From the user's perspective there might only be a single character on the screen.

Any nontrivial string operation must be based around grapheme clusters, otherwise it is fundamentally broken. Codepoints are a useful encoding-agnostic way to handle basic (split & concatenate) operations, but the method by which those offsets are determined needs to be grapheme-cluster aware. Raw byte offsets are encoding-specific and only really useful for allocating the underlying storage.
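
To show why the split offsets matter, here is a rough Python sketch. Note the loud assumption: `approx_clusters` is a hypothetical helper that only attaches combining marks to the preceding base character. It is not real grapheme cluster segmentation as defined by UAX #29, and it ignores emoji ZWJ sequences, Hangul jamo, regional indicators and more.

```python
import unicodedata

def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplified stand-in for grapheme clusters, NOT full UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "cafe\u0301"                    # "café" with a combining acute accent

print(len(s))                       # 5 code points
print(len(approx_clusters(s)))      # 4 user-visible characters

# Splitting at a raw code point offset strips the accent off the "e".
print(s[:4])                        # cafe

# Splitting on cluster boundaries keeps the accent with its base.
print("".join(approx_clusters(s)[:4]) == s)   # True
```

A real implementation would take its boundaries from a library that follows the Unicode segmentation rules rather than from this shortcut.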

rurban · on March 26, 2021

You don't count the length. Length specifies the size of the internal encoding.

You count the width. And there are Unicode rules for how you count the width, which change every year.
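
A rough sketch of width counting in Python, under stated assumptions: "width" here means terminal columns, and `approx_width` is a hypothetical helper that uses only the East Asian Width property and the combining class from the standard `unicodedata` module. Real width tables handle many more cases (emoji, control characters, ambiguous-width characters).

```python
import unicodedata

def approx_width(text):
    """Estimate display columns: 0 for combining marks, 2 for wide, else 1."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            continue                                  # occupies no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                                # wide or fullwidth
        else:
            width += 1
    return width

print(approx_width("abc"))                 # 3
print(approx_width("\u65e5\u672c\u8a9e"))  # 6, three wide characters ("日本語")
print(approx_width("e\u0301"))             # 1, the accent adds no column

# The answer depends on which Unicode data version the runtime ships with.
print(unicodedata.unidata_version)
```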