
But what about combining characters? https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

Should the letter plus a combining character count as one (I think so), or two characters? Should you normalize before counting length? And so on.
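To make the question concrete, here is a minimal sketch in Python (my choice of language, not something from the thread), where `len()` on a `str` counts code points. The sample strings are illustrative assumptions. It suggests that normalizing first changes the answer for some inputs but not for others, so normalization alone does not settle what "length" means.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: renders as one visible letter.
decomposed = "e\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# 'q' plus the same accent has no precomposed form, so NFC leaves two code points.
unresolved = unicodedata.normalize("NFC", "q\u0301")

print(len(decomposed))  # 2
print(len(composed))    # 1
print(len(unresolved))  # 2
```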




Combining characters are their own unicode codepoint, so they count towards length. The beauty of this approach is that it's simple and objective.

If you had a list of 5 DOM Element objects and one DOM Attr object, the length of that list is 6. It's nonsensical to say "The Attr object modifies an Element object, so it's not really in the list".


Going by bytes is also simple and objective. And also totally arbitrary, just like going by codepoints.
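As a rough illustration of how each of these counts is well defined yet gives a different number, here is a small Python sketch. The sample string is an assumption picked to make the counts diverge; nothing here argues for one unit over another.

```python
# 'e' + combining acute accent + an emoji outside the Basic Multilingual Plane.
s = "e\u0301\U0001F600"

code_points = len(s)                           # Python str length is in code points
utf8_bytes = len(s.encode("utf-8"))            # bytes in the UTF-8 encoding
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units in UTF-16

print(code_points, utf8_bytes, utf16_units)  # 3 7 4
```

On screen the same string would typically show as two visible characters, which is a fourth answer again.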

Which is the most useful for dealing with strings in practice, though? Is either interpretation useful at all?


Going byte by byte is useless. You can't do anything with a single byte of a Unicode codepoint (unless, by luck, the codepoint is encoded in a single byte).

A codepoint is the smallest useful unit of a Unicode string. It is a character, and you can do all the character things with it.

If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.


> If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.

Nope. In order to deal with special casings you will have to span multiple codepoints, at which point it's no more work to operate on whatever the code units are.
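A small sketch of the special casings in question, using Python's built-in `str.upper()` and `str.lower()`, which apply Unicode's full case mappings. The example words are my own picks. One code point can expand to several, and the Greek capital sigma lowercases differently depending on its neighbours, so a strictly one-codepoint-at-a-time loop gets it wrong.

```python
word = "straße"
upper = word.upper()          # 'ß' uppercases to the two code points 'SS'
print(upper, len(word), len(upper))  # STRASSE 6 7

greek = "ΟΔΟΣ"
# Lowercasing the whole string sees that the sigma is word-final.
whole = greek.lower()
# Lowercasing each code point in isolation loses that context.
per_code_point = "".join(ch.lower() for ch in greek)

print(whole == per_code_point)  # False: final sigma differs
```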


> Combining characters are their own unicode codepoint, so they count towards length.

This is incredibly arbitrary - it depends entirely on what "length" means for a particular use case. From the user's perspective there might only be a single character on the screen.

Any nontrivial string operation must be based around grapheme clusters, otherwise it is fundamentally broken. Codepoints are a useful encoding-agnostic way to handle basic (split & concatenate) operations, but the method by which those offsets are determined needs to be grapheme-cluster aware. Raw byte offsets are encoding specific and only really useful for allocating the underlying storage.
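To show the kind of breakage meant here, a deliberately simplified Python sketch. The helper `approx_clusters` is hypothetical and only attaches combining marks to the preceding code point; it is not the real grapheme cluster algorithm from Unicode Standard Annex #29 and ignores emoji ZWJ sequences, Hangul jamo, regional indicators and more. Even this crude version is enough to show a codepoint-wise reversal moving an accent onto the wrong letter.

```python
import unicodedata

def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Approximation only: real grapheme segmentation follows UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

s = "e\u0301a"  # accented 'e' (decomposed) followed by 'a'

by_code_point = s[::-1]                          # accent now follows 'a'
by_cluster = "".join(reversed(approx_clusters(s)))  # accent stays with 'e'

print(len(s), len(approx_clusters(s)))  # 3 2
```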


You don't count the length. Length specifies the size of the internal encoding.

You count the width. And there are Unicode rules for how you count the width, which change every year.
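A hedged sketch of what counting width might look like with only Python's standard library. The helper `approx_width` is hypothetical and makes two simplifying assumptions: combining marks take zero columns, and code points whose East Asian Width property is Wide or Fullwidth take two. Real terminal width rules handle more cases (emoji sequences, control characters, ambiguous-width characters), and the underlying tables depend on the Unicode version the runtime ships with, exposed as `unicodedata.unidata_version`.

```python
import unicodedata

def approx_width(text):
    """Estimate display columns; a simplification, not a full width algorithm."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            continue  # assumed: combining marks occupy no column of their own
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width

print(approx_width("abc"))       # 3
print(approx_width("e\u0301"))   # 1
print(approx_width("日本"))       # 4
print(unicodedata.unidata_version)  # Unicode version of the bundled tables
```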



