Hacker News

> they all use the same faulty algorithm

Well, yes. To be fair, it's not like any of them make a secret of the fact that they're mistakenly counting unicode code points instead of characters.




Are they really doing so "mistakenly"? I feel like there's more to this.


It's not a mistake. Unicode's complexity is more than trivial, and since much work has gone into abstracting over it, many people are surprised when that complexity rears up at them.

Consider, for example, the wonderful piece of writing in the answer to this question.

https://stackoverflow.com/questions/1732348/regex-match-open...

How many characters do you suppose are in this string?

.

"TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚ N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ"

.

And what should Python tell you the length of this string is?
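A minimal Python demonstration of why the question has no single obvious answer (my own example string, much shorter than the one above): the same user-perceived character can be one codepoint or several, and `len()` counts codepoints.

```python
# "é" written two equivalent ways
precomposed = "\u00e9"   # single codepoint U+00E9
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False: different codepoint sequences
print(len(precomposed))           # 1
print(len(decomposed))            # 2 -- but it renders as one character
```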


Yes, what you said. It's not a mistake. It's a... useful abstraction.

Unicode is complicated in some ways because the domain it is dealing with (representing all possible human written communication, basically) is complicated. Unicode is pretty ingenious. It pays to invest in learning about it, rather than assuming your "naive" conclusions are what it "should" do (and unicode's standard docs are pretty readable).

Unicode does offer an algorithm for segmenting text into "grapheme clusters", specifically "user-perceived characters." https://unicode.org/reports/tr29/

It's worth reading that document when deciding what you think the "right" thing to do with "len()" is.

The "user-perceived character segmentation" algorithm is complicated, it has a performance cost... and it's implemented in terms of the lower-level codepoint abstraction.

Dealing with codepoints is the right thing for most platforms to do, as the basic API. Codepoints are the basic API into unicode.
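A rough sketch of that layering in Python, using only the stdlib. This is not the full TR29 algorithm (which also handles ZWJ sequences, Hangul jamo, regional indicators, and more); it is just the "don't break before a combining mark" core, built on top of the codepoint abstraction:

```python
import unicodedata

def naive_cluster_count(s: str) -> int:
    """Approximate grapheme-cluster count: start a new cluster at every
    codepoint that is not a combining mark (category M*).
    NOT full UAX #29 segmentation."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

print(len("e\u0301"))                  # 2 codepoints ('e' + combining acute)
print(naive_cluster_count("e\u0301"))  # 1 user-perceived character
```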

It's true that they ideally ought to also give you access to TR29 character segmentation. And most don't, because it's hard and confusing and nobody's gotten around to it, I guess. It would be nice.

If you want to know "well, how come codepoints are the basic unicode abstraction/API? Why couldn't user-perceived characters be?" then start reading other unicode docs too, and eventually you'll understand how we got here. (For starters, a "user-perceived character" can actually be locale-dependent: what's two characters in one language may be one in another.)


> It's not a mistake. It's a... useful abstraction.

It is specifically an abstraction that is not useful.

> It's worth reading that document when deciding what you think the "right" thing to do with "len()" is.

Technically not - the right thing to do is return the number of characters[0] - but the character segmentation parts are worth reading when deciding how to decode UTF-8 bytes into characters in the first place, so the distinction is somewhat academic.

> a [character] can actually be locale-dependent, what's two characters in one language may be one in another

[citation needed]; ch, ij, dz, etc are not examples, but I'm admittedly not exhaustively familiar with non-latin scripts[1], so I would be interested to see what other scripts do.

0: or bytes, but that's trivial

1: Which is why I hate Unicode; I'd prefer to pawn that work off on someone else and just import a library, but Unicode has ensured that all available libraries are always unusably broken.


> Technically not - the right thing to do is return the number of characters[0]

> 0: or bytes, but that's trivial

In what encoding? The utf-8, utf-32, and utf-16 encodings of the same string are different numbers of bytes.
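Concretely (Python, with an arbitrary example string of my own, mixing ASCII, a Latin Extended letter, and an emoji outside the BMP):

```python
s = "abĬ😀"   # 4 codepoints: U+0061, U+0062, U+012C, U+1F600

print(len(s))                       # 4 codepoints
print(len(s.encode("utf-8")))       # 8 bytes  (1 + 1 + 2 + 4)
print(len(s.encode("utf-16-le")))   # 10 bytes (the emoji is a surrogate pair)
print(len(s.encode("utf-32-le")))   # 16 bytes (4 per codepoint)
```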


Number of bytes would apply in cases - like the len() of a python3 bytes or python str object, or something like C's strlen function - where you're not operating on characters in the first place. It's trivial precisely because there is no encoding.

"\xC4\xAC" is two bytes regardless of whether you interpret it as [latin capital i with breve] or [hangul gye] or [latin capital a with diaeresis][not sign] ("Ĭ" / "계" / "Ä¬").


23 if I'm counting correctly (there's a space in "PO NY" for some reason). I would also accept 209 from a language that elected not to deal with large amounts of complexity in string handling. The problem with Unicode is that they go to enormous amounts of effort to deliberately give a wrong answer.



