It took me a bit to understand what you were trying to do. Here's a paste of my Python 3 shell session, showing that Python 3 does indeed return the number of characters.
~$ python3
Python 3.7.3 (default, Mar 27 2019, 09:23:15)
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> len("ẅ")
1
Python 3 uses a fixed-width representation internally (since PEP 393, each string stores its code points at 1, 2, or 4 bytes apiece, chosen by the widest code point present), so the byte representation you placed below your post is not how Python 3 represents it; len() and indexing work on code points, not bytes.
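You can get a feel for the one-slot-per-code-point idea from the outside by encoding to UTF-32 (just an illustration; the exact internal storage is a CPython implementation detail):
>>> hex(ord("ẅ"))              # a single code point, U+1E85
'0x1e85'
>>> "ẅ".encode("utf-32-le")    # UTF-32: a fixed four bytes per code point
b'\x85\x1e\x00\x00'
>>> "ẅ".encode("utf-8")        # UTF-8, by contrast, is variable-width
b'\xe1\xba\x85'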
This has disadvantages (memory usage) but for most cases where Python is used, it's an advantage (faster random access, more intuitive for situations like the one you've proposed).
Python 3.7.3 (default, Mar 27 2019, 09:23:32)
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> len("ẅ")
2
I wonder why that is. Is the Clang version relevant here?
EDIT: Your "ẅ" doesn't seem to be the same as the OP's "ẅ", although they look the same at first glance.
>>> import unicodedata
>>> w1 = "ẅ"
>>> w2 = "ẅ"
>>> unicodedata.name(w1)
'LATIN SMALL LETTER W WITH DIAERESIS'
>>> unicodedata.name(w2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: name() argument 1 must be a unicode character, not str
>>> unicodedata.name(w2[0])
'LATIN SMALL LETTER W'
>>> unicodedata.name(w2[1])
'COMBINING DIAERESIS'
So the second version (w2) does seem to consist of two separate "characters", LATIN SMALL LETTER W and COMBINING DIAERESIS, which is apparently not the same as the single-character LATIN SMALL LETTER W WITH DIAERESIS. I guess these are really Unicode code points rather than "characters" to a human reader, but as another poster pointed out, what the number of characters in a string should be isn't always clear-cut.
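(A small follow-up sketch, continuing the session above: unicodedata.normalize() converts between the two spellings, which is the usual way to compare strings without caring which form you were handed.)
>>> unicodedata.normalize("NFC", w2) == w1   # compose w + combining diaeresis into U+1E85
True
>>> unicodedata.normalize("NFD", w1) == w2   # decompose U+1E85 back into two code points
True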
Correct, w2 is one character (Latin small w with diaeresis) represented by two Unicode code points. I didn't realize there was a precomposed (NFC) code point for that character; try "f\u0308" (f̈) or "w\u033B" (w̻) instead.
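(Quick sanity check, as a sketch: NFC only composes pairs for which a precomposed code point exists, so those two suggestions stay at two code points, while w + combining diaeresis collapses to one.)
>>> import unicodedata
>>> len(unicodedata.normalize("NFC", "f\u0308"))   # no precomposed f-with-diaeresis
2
>>> len(unicodedata.normalize("NFC", "w\u033B"))   # no precomposed w-with-square-below
2
>>> len(unicodedata.normalize("NFC", "w\u0308"))   # composes to U+1E85
1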
> what the number of characters in a string should be isn't always clear-cut
This is why I use examples from Latin-with-diacritics, where there is no ambiguity about character segmentation.
Interesting, I learned a bit about Unicode here. It looks like copy/pasting combined the two code points into one when I ran my code.
Still, to the original point, I think this is more of a criticism of Unicode than of Python. It seems to me that the answer is to not use combining diacritics, and that Unicode shouldn't include those.
> this is more of a criticism of Unicode than of Python
True, although it's more specifically a criticism of Python for using Unicode, where these kinds of warts are pervasive. See also "Ǳ" (U+01F1 LATIN CAPITAL LETTER DZ, UTF-8 bytes \xC7\xB1), which is two bytes, one code point, and two letters to a reader, with no correspondence between those bytes and the letters.
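(A small illustration of the three layers, assuming UTF-8 for the byte count and using the compatibility decomposition to show the two letters:)
>>> import unicodedata
>>> dz = "\u01f1"
>>> unicodedata.name(dz)
'LATIN CAPITAL LETTER DZ'
>>> len(dz)                       # one code point
1
>>> len(dz.encode("utf-8"))       # two bytes in UTF-8
2
>>> unicodedata.normalize("NFKD", dz)   # reads as two letters, D and Z
'DZ'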
> the answer is to not use combining diacritics
This doesn't actually work, sadly, because you can't represent e.g. "f̈" [0] without some means of composing arbitrary base characters with arbitrary diacritics.
0: If Unicode has added a specific precomposed (NFC) code point for that particular character, then that's a bad example, but the general point still stands.