So, I see: Python 3.7.3 (default, Mar 27 2019, 09:23:32) [Clang 9.0.0 (clang-900...

a1369209993 · on June 22, 2019

Correct, w2 is one character (latin small w with umlaut) represented by two unicode code points. I didn't realize there was a NFC code point for that character; try "\x66\xCC\x88" (f̈) or "\x77\xCC\xBB" (w̻) instead.

> the number of characters should be in a string isn't always clear-cut.

This is why I use examples from latin-with-diacritics, where there is no ambiguity in character segmentation.

kerkeslager · on June 23, 2019

Interesting, I learned a bit about Unicode here. It looks like copy/pasting combined the two code points into one when I ran my code.

Still, to the original point, I think this is more of a criticism of Unicode than of Python. It seems to me that the answer is to not use combining diacritics, and that Unicode shouldn't include those.

a1369209993 · on June 23, 2019

> this is more of a criticism of Unicode than of Python

True, although it's more specifically a criticism of Python for using Unicode, where these kinds of warts are pervasive. See also "\xC7\xB1" (U+01F1 "Ǳ") which is two bytes, one code point, and two characters with no correspondence to those bytes.

> the answer is to not use combining diacritics

This doesn't actually work, sadly, because you can't represent eg "f̈"[0] without some means of composing arbitrary base characters with arbitrary diacritics.

0: If unicode has a added a specific NFC code point for that particular character, then that's bad example but the general point still stands.

wool_gather · on June 22, 2019

Well spotted. It was probably normalized when it was copy-pasted.