
It took me a bit to understand what you were trying to do. Here's a paste of my Python 3 shell session, showing that Python 3 does indeed return the number of characters.

    ~$ python3
    Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
    [Clang 10.0.1 (clang-1001.0.46.3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len("ẅ")
    1
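Python 3 exposes strings as sequences of code points (CPython picks a fixed width of 1, 2, or 4 bytes per code point for each string, so it behaves much like UTF-32), so the byte representation you placed below your post is not how Python 3 represents it. Encoded as UTF-32, it looks like this: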

    >>> "ẅ".encode('utf32')
    b'\xff\xfe\x00\x00\x85\x1e\x00\x00'
This has disadvantages (memory usage) but for most cases where Python is used, it's an advantage (faster random access, more intuitive for situations like the one you've proposed).
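To make the trade-off concrete, here is a minimal sketch comparing the code point count with the encoded size of the precomposed character. It uses an escape sequence rather than a pasted literal so that copy/paste cannot change the input; the variable names are just for illustration.

```python
# Compare code point count with encoded size for one precomposed character.
s = "\u1e85"  # LATIN SMALL LETTER W WITH DIAERESIS, written as an escape

code_points = len(s)                      # len() counts code points
utf8_bytes = len(s.encode("utf-8"))       # variable width: 3 bytes here
utf32_bytes = len(s.encode("utf-32-le"))  # fixed width: 4 bytes, no BOM

print(code_points, utf8_bytes, utf32_bytes)  # 1 3 4
```

Plain `'utf32'` as in the session above prepends a 4-byte byte order mark, which is why that output is 8 bytes long.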



So, I see:

    Python 3.7.3 (default, Mar 27 2019, 09:23:32)
    [Clang 9.0.0 (clang-900.0.39.2)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len("ẅ")
    2
I wonder why this is? Is the Clang version relevant here?

EDIT: Your "ẅ" doesn't seem to be the same as the OP's "ẅ", although they look the same at first glance.

    >>> "ẅ".encode('utf-8')
    b'w\xcc\x88'
    >>> "ẅ".encode('utf-8')
    b'\xe1\xba\x85'
EDIT 2. More info:

    >>> import unicodedata
    >>> w1 = "ẅ"
    >>> w2 = "ẅ"
    >>> unicodedata.name(w1)
    'LATIN SMALL LETTER W WITH DIAERESIS'
    >>> unicodedata.name(w2)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: name() argument 1 must be a unicode character, not str
    >>> unicodedata.name(w2[0])
    'LATIN SMALL LETTER W'
    >>> unicodedata.name(w2[1])
    'COMBINING DIAERESIS'
So the second version (w2) does seem to consist of two separate "characters", LATIN SMALL LETTER W and COMBINING DIAERESIS, which is apparently not the same as the single-character LATIN SMALL LETTER W WITH DIAERESIS. I guess these are actually Unicode code points and not so much "characters" to a human reader, but as another poster pointed out, what the number of characters should be in a string isn't always clear-cut.
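For anyone who wants to reproduce this without depending on what their clipboard does, here is a small sketch using escapes and `unicodedata.normalize`. It shows that the two spellings compare unequal as raw strings but convert into each other under NFC and NFD; the variable names are mine.

```python
import unicodedata

composed = "\u1e85"     # one code point: LATIN SMALL LETTER W WITH DIAERESIS
decomposed = "w\u0308"  # two code points: w + COMBINING DIAERESIS

same_raw = composed == decomposed                 # raw comparison is by code point
nfc = unicodedata.normalize("NFC", decomposed)    # compose where possible
nfd = unicodedata.normalize("NFD", composed)      # decompose fully

print(same_raw, nfc == composed, nfd == decomposed)  # False True True
print(len(nfc), len(nfd))                            # 1 2
```

So normalizing both sides to the same form before comparing or measuring is one way to get consistent answers for strings like these.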


Correct, w2 is one character (Latin small w with umlaut) represented by two Unicode code points. I didn't realize there was a precomposed (NFC) code point for that character; try the UTF-8 byte sequences "\x66\xCC\x88" (f̈) or "\x77\xCC\xBB" (w̻) instead.

> what the number of characters should be in a string isn't always clear-cut.

This is why I use examples from latin-with-diacritics, where there is no ambiguity in character segmentation.
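A quick sketch of those two examples, assuming the escapes above are meant as UTF-8 bytes (in a Python 3 `str` literal, `"\xCC"` would be the code point U+00CC, not a byte). It decodes each sequence and applies NFC, which should leave both at two code points because no precomposed form is defined for them.

```python
import unicodedata

results = []
for raw in (b"\x66\xcc\x88", b"\x77\xcc\xbb"):
    text = raw.decode("utf-8")                      # bytes -> str of code points
    composed = unicodedata.normalize("NFC", text)   # try to compose
    results.append((len(raw), len(text), len(composed)))

# Each tuple is (UTF-8 bytes, code points, code points after NFC).
print(results)  # [(3, 2, 2), (3, 2, 2)]
```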


Interesting, I learned a bit about Unicode here. It looks like copy/pasting combined the two code points into one when I ran my code.

Still, to the original point, I think this is more of a criticism of Unicode than of Python. It seems to me that the answer is to not use combining diacritics, and that Unicode shouldn't include those.


> this is more of a criticism of Unicode than of Python

True, although it's more specifically a criticism of Python for using Unicode, where these kinds of warts are pervasive. See also "\xC7\xB1" (U+01F1 "DZ") which is two bytes, one code point, and two characters with no correspondence to those bytes.
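Here is a rough sketch of that example, again treating the escapes as UTF-8 bytes. NFKC is used only to make the two letters visible; it is a compatibility mapping, so I am not suggesting it as the right way to count characters.

```python
import unicodedata

raw = b"\xc7\xb1"
dz = raw.decode("utf-8")                        # a single code point, U+01F1
expanded = unicodedata.normalize("NFKC", dz)    # compatibility decomposition

print(len(raw), len(dz), unicodedata.name(dz))  # 2 1 LATIN CAPITAL LETTER DZ
print(expanded, len(expanded))                  # DZ 2
```

NFC leaves U+01F1 alone, so canonical normalization does not change the count here.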

> the answer is to not use combining diacritics

This doesn't actually work, sadly, because you can't represent eg "f̈"[0] without some means of composing arbitrary base characters with arbitrary diacritics.

0: If Unicode has added a specific precomposed code point for that particular character, then that's a bad example, but the general point still stands.
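If the goal is a count that matches what a reader sees for Latin-with-diacritics text like the examples in this thread, one rough approach is to skip code points with a nonzero canonical combining class. This is only a sketch: `approx_char_count` is a name I made up, and it is not full grapheme cluster segmentation, so it will miscount things like emoji sequences or marks whose combining class is 0.

```python
import unicodedata

def approx_char_count(text):
    """Count code points that are not combining marks (a rough approximation)."""
    # unicodedata.combining() returns 0 for code points that do not combine.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(approx_char_count("w\u0308"))  # 1
print(approx_char_count("\u1e85"))   # 1
print(approx_char_count("f\u0308"))  # 1
```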


Well spotted. It was probably normalized when it was copy-pasted.



