I feel like this is the real success of Python: the "one and preferably only one obvious way to do it" idea means that new ways of doing something are rarely added, and when new ones are added, the old ones are likely deprecated. Complain all you want about the 2 to 3 transition, but it's resulted in a simpler, easier-to-use language. If the community really feels a different way is better, they add it in libraries. My only criticism of Python in this respect is that they didn't take it far enough: Do we need higher-order functions AND loops? Do we need classes AND first-class closures?

Common Lisp isn't even close to the most complicated language out there. Every




I don't think this is really true of Python; often the "correct" (pythonic, theoretically best-performing) way to do something involves using more complicated language constructs than most people are familiar with. For example, when to use list/dict comprehensions, when to use reduce functions, when to use generators. Most beginner programmers and people coming from C-inspired languages will do things the "obvious" yet incorrect way with a for-loop of appends.
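
(As a minimal sketch of that contrast, with made-up data purely for illustration:)

    import random

    data = [random.random() for _ in range(1000)]

    # The "obvious" way, familiar from C-style languages:
    # build the result with an explicit loop and append.
    squares = []
    for x in data:
        squares.append(x * x)

    # The "pythonic" way: a list comprehension, or a generator
    # expression when the result is only consumed once.
    squares = [x * x for x in data]
    total = sum(x * x for x in data)  # no intermediate list is built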


I think that Python is in general more cohesive than other languages. I like that Perl has both "if" and "unless", which is expressive, but it gives you multiple ways to do the same thing.

I also think that "pythonic" and "theoretically best performance" don't necessarily correlate.

I would personally stick to for-loops for general but tricky code, and leave complications like nested comprehensions to places like the guts of libraries, or classes that make the tradeoff of having simplified externals.

(for example argparse - very nice externals, tricky tricky guts)


> Complain all you want about the 2 to 3 transition

Okay.

> it's resulted in a simpler, easier-to-use language.

Does `print(len("ẅ"))`[0] still produce a value (2) that is neither the number of characters (1) nor the number of bytes (3)?

0: "print\x28len\x28\x22\x77\xCC\x88\x22\x29\x29"


It took me a bit to understand what you were trying to do. Here's a paste of my Python 3 shell session, showing that Python 3 does indeed return the number of characters.

    ~$ python3
    Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
    [Clang 10.0.1 (clang-1001.0.46.3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len("ẅ")
    1
Python 3 uses fixed-width code points internally (one, two, or four bytes per code point, so effectively UTF-32 in the worst case; see PEP 393), so the byte representation you placed below your post is not how Python 3 represents it. Encoded as UTF-32, it looks like this:

    >>> "ẅ".encode('utf32')
    b'\xff\xfe\x00\x00\x85\x1e\x00\x00'
This has disadvantages (memory usage) but for most cases where Python is used, it's an advantage (faster random access, more intuitive for situations like the one you've proposed).


So, I see:

    Python 3.7.3 (default, Mar 27 2019, 09:23:32)
    [Clang 9.0.0 (clang-900.0.39.2)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len("ẅ")
    2
I wonder why this is? Is the Clang version relevant here?

EDIT: Your "ẅ" doesn't seem to be the same as the OP's "ẅ", although they look the same at first glance.

    >>> "ẅ".encode('utf-8')
    b'w\xcc\x88'
    >>> "ẅ".encode('utf-8')
    b'\xe1\xba\x85'
EDIT 2. More info:

    >>> import unicodedata
    >>> w1 = "ẅ"
    >>> w2 = "ẅ"
    >>> unicodedata.name(w1)
    'LATIN SMALL LETTER W WITH DIAERESIS'
    >>> unicodedata.name(w2)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: name() argument 1 must be a unicode character, not str
    >>> unicodedata.name(w2[0])
    'LATIN SMALL LETTER W'
    >>> unicodedata.name(w2[1])
    'COMBINING DIAERESIS'
So the second version (w2) does seem to consist of two separate "characters", LATIN SMALL LETTER W and COMBINING DIAERESIS, which is apparently not the same as the single-character LATIN SMALL LETTER W WITH DIAERESIS. I guess these are actually Unicode code points and not so much "characters" to a human reader, but as another poster pointed out, what the number of characters should be in a string isn't always clear-cut.


Correct, w2 is one character (latin small w with umlaut) represented by two unicode code points. I didn't realize there was an NFC code point for that character; try "\x66\xCC\x88" (f̈) or "\x77\xCC\xBB" (w̻) instead.

> the number of characters should be in a string isn't always clear-cut.

This is why I use examples from latin-with-diacritics, where there is no ambiguity in character segmentation.
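
(A quick check of the suggested replacement, assuming a recent CPython: unlike "ẅ", there is no precomposed code point for "f̈", so even NFC normalization can't collapse it to one code point:)

    >>> import unicodedata
    >>> len(unicodedata.normalize('NFC', "f\u0308"))   # f + COMBINING DIAERESIS
    2
    >>> len(unicodedata.normalize('NFC', "w\u0308"))   # composes to U+1E85 "ẅ"
    1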


Interesting, I learned a bit about Unicode here. It looks like copy/pasting combined the two code points into one when I ran my code.

Still, to the original point, I think this is more of a criticism of Unicode than of Python. It seems to me that the answer is to not use combining diacritics, and that Unicode shouldn't include those.


> this is more of a criticism of Unicode than of Python

True, although it's more specifically a criticism of Python for using Unicode, where these kinds of warts are pervasive. See also "\xC7\xB1" (U+01F1 "DZ"), which is two bytes (in UTF-8), one code point, and two characters with no correspondence to those bytes.
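
(A minimal illustration of that one, for the curious:)

    >>> import unicodedata
    >>> dz = "\u01F1"
    >>> unicodedata.name(dz)
    'LATIN CAPITAL LETTER DZ'
    >>> len(dz)              # one code point
    1
    >>> dz.encode('utf-8')   # two bytes in UTF-8
    b'\xc7\xb1'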

> the answer is to not use combining diacritics

This doesn't actually work, sadly, because you can't represent, e.g., "f̈"[0] without some means of composing arbitrary base characters with arbitrary diacritics.

0: If Unicode has added a specific NFC code point for that particular character, then that's a bad example, but the general point still stands.


Well spotted. It was probably normalized when it was copy-pasted.
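
(You can reproduce the merge explicitly; presumably the clipboard or the terminal applied NFC normalization somewhere along the way:)

    >>> import unicodedata
    >>> w2 = "w\u0308"                          # decomposed: two code points
    >>> w1 = unicodedata.normalize('NFC', w2)   # composed: one code point
    >>> len(w2), len(w1)
    (2, 1)
    >>> unicodedata.name(w1)
    'LATIN SMALL LETTER W WITH DIAERESIS'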


len() counts code points, not abstract characters. Your "ẅ" contains two of them, U+0077 and U+0308.
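
(E.g., with the decomposed form of the string:)

    >>> import unicodedata
    >>> for c in "w\u0308":
    ...     print(f"U+{ord(c):04X}", unicodedata.name(c))
    ...
    U+0077 LATIN SMALL LETTER W
    U+0308 COMBINING DIAERESIS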


Apparently yes. A quick inspection suggests that this is the same in Ruby, Haskell, SWI-Prolog, Gauche Scheme, SBCL and D. (I might not have the latest version of everything, so maybe this has been fixed in some of them... assuming it needs fixing. Maybe there's a reason for the answer to be 2 if so many language implementations insist on it. Or, they all use the same faulty algorithm. I don't know.)


> they all use the same faulty algorithm

Well, yes. To be fair, it's not like any of them make a secret of the fact that they're mistakenly counting unicode code points instead of characters.


Are they really doing so "mistakenly"? I feel like there's more to this.


It's not mistakenly. Unicode's complexity is a bit more than trivial, and since much work has gone into abstracting over it many people are surprised when the complexity rears up at them.

Consider, for example, the wonderful piece of writing in the answer to this question.

https://stackoverflow.com/questions/1732348/regex-match-open...

How many characters do you suppose are in this string?

.

"TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚ N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ"

.

And what should Python tell you the length of this string is?


Yes, what you said. It's not a mistake. It's a... useful abstraction.

Unicode is complicated in some ways because the domain it is dealing with (representing all possible human written communication, basically) is complicated. Unicode is pretty ingenious. It pays to invest in learning about it, rather than assuming your "naive" conclusions are what it "should" do (and unicode's standard docs are pretty readable).

Unicode does offer an algorithm for segmenting text into "grapheme clusters", specifically "user-perceived characters." https://unicode.org/reports/tr29/

It's worth reading that document when deciding what you think the "right" thing to do with "len()" is.

The "user-perceived character segmentation" algorithm is complicated, it has a performance cost... and it's implemented in terms of the lower-level codepoint abstraction.

Dealing with codepoints is the right thing for most platforms to do, as the basic API. Codepoints are the basic API into unicode.

It's true that they ideally ought to also give you access to TR29 character segmentation. And most don't. Cause it's hard and confusing and nobody's done it I guess. It would be nice.
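
(If you do want TR29 grapheme clusters in Python today, one option, sketched here, is the third-party regex module, whose \X pattern matches an extended grapheme cluster:)

    >>> import regex   # third-party: pip install regex
    >>> s = "w\u0308"  # one user-perceived character, two code points
    >>> len(s)
    2
    >>> len(regex.findall(r'\X', s))
    1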

If you want to know "well, how come codepoints are the basic unicode abstraction/API? Why couldn't user-perceived characters be?", then start reading other unicode docs too, and eventually you'll understand how we got here. (For starters, a "user-perceived character" can actually be locale-dependent: what's two characters in one language may be one in another.)


> It's not a mistake. It's a... useful abstraction.

It is specifically an abstraction that is not useful.

> It's worth reading that document when deciding what you think the "right" thing to do with "len()" is.

Technically not - the right thing to do is return the number of characters[0] - but the character segmentation parts are worth reading when deciding how to decode UTF-8 bytes into characters in the first place, so the distinction is somewhat academic.

> a [character] can actually be locale-dependent, what's two characters in one language may be one in another

[citation needed]; ch, ij, dz, etc. are not examples, but I'm admittedly not exhaustively familiar with non-latin scripts[1], so I would be interested to see what other scripts do.

0: or bytes, but that's trivial

1: Which is why I hate Unicode; I'd prefer to pawn that work off on someone else and just import a library, but Unicode has ensured that all available libraries are always unusably broken.


> Technically not - the right thing to do is return the number of characters[0]

> 0: or bytes, but that's trivial

In what encoding? The utf-8, utf-32, and utf-16 encodings of the same string are different numbers of bytes.
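
(Concretely, with the decomposed "ẅ" from upthread:)

    >>> s = "w\u0308"
    >>> len(s.encode('utf-8'))
    3
    >>> len(s.encode('utf-16-le'))
    4
    >>> len(s.encode('utf-32-le'))
    8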


Number of bytes would apply in cases - like the len() of a Python 3 bytes object, a Python 2 str object, or something like C's strlen function - where you're not operating on characters in the first place. It's trivial precisely because there is no encoding.

"\xC4\xAC" is two bytes regardless of whether you interpret it as [latin capital i + breve] or [hangul gyeoh] or [latin capital a + umlaut][not sign] ("Ĭ" / "곃" / "Ĭ").


23 if I'm counting correctly (there's a space in "PO NY" for some reason). I would also accept 209 from a language that elected not to deal with large amounts of complexity in string handling. The problem with Unicode is they go to enormous amounts of effort to deliberately give a wrong answer.



