The Python Unicode Mess (complete.org)
284 points by psibi on Oct 6, 2018 | 272 comments



As far as I can tell this is a long-form “I used to be able to ignore encoding issues and now it’s a ‘mess’ because the language is forcing me to be correct”. Each of the examples cited was a source of latent bugs which he thought was working code only because the issues were ignored.

Only his third bit of advice isn’t wrong and treating it as something unusual shows the problem: the only safe way to handle text has always been to decode bytes as soon as you get them, work with Unicode, and then encode it when you send them out. Anything else is extremely hard to get right, even if many English-native programmers were used to being able to delay learning why for long periods of time.


> As far as I can tell this is a long-form “I used to be able to ignore encoding issues and now it’s a ‘mess’ because the language is forcing me to be correct”.

The problem with that view is that there are things for which you cannot be correct, and there are no encoding issues because there is no encoding (or if there is one it does not map to proper unicode):

* UNIX files and paths have no encoding, they're just bags of bytes, with specific bytes (not codepoints, not characters, bytes) having specific meaning

* Windows file and path names are sequences of UTF-16 code units but not actually UTF-16 (they can and do contain unpaired surrogates), as above with specific code units (again not codepoints or characters) having specific meaning

These are issues you will encounter on user systems; there is no "forcing you to be correct". A non-unicode path is not incorrect. On many systems it just is. OSX is one of the few systems where a non-unicode path is actually incorrect, and that means you will not encounter one as input, so you have no reason to handle this issue at all.

> Only his third bit of advice isn’t wrong and treating it as something unusual shows the problem: the only safe way to handle text

That's where you fail, and to an extent so does Python: some text-like things are not actually text. Path names are famously one such case. You're trying to hammer the square peg of path names into the round hole of unicode.


That’s just restating my point: Unix filenames are bytes (on most filesystems, anyway). The fact that many people were able to conflate them with text strings was a convenient fiction. Python no longer allows you to maintain that pretense, but it’s easy to deal with: treat them as opaque blobs, attempt to decode and handle errors, or perform manipulations as bytes.


> That’s just restating my point: Unix filenames are bytes (on most filesystems, anyway). The fact that many people were able to conflate them with text strings was a convenient fiction.

Python tools for backups are my worst terror because of that - they kept destroying data of our clients because they dared (gasp!) name files with characters from their own language or do unthinkable things, like create documents titled "CV - <name with non-unicode characters>.docx".

The fact that Python3 at least tries to make programmers not destroy data as soon as you type in a foreign name (which happens even in the USA) is a good thing.


> Python tools for backups are my worst terror because of that

You can have badly written tools in any language. There are even functions to get file paths as bytes (e.g. [os.getcwdb](https://docs.python.org/3/library/os.html#os.getcwdb)); it's just that most people don't use them, because it's rare-ish to see horribly broken filenames and the bytes APIs are less convenient.
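
For illustration, a quick sketch of the str/bytes pairing (the paths shown in the comments are made up):

    import os

    os.getcwd()     # str, e.g. '/home/sibi'
    os.getcwdb()    # bytes, e.g. b'/home/sibi'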

Do other languages get this right 100% of the time on all platforms? I don't think so, it's just you've never noticed.

* C: has no concept of unicode strings per se, may or may not work depending on the implementation and how you choose to display them (CLI probably "works", GUI probably not)

* Rust: seems to assume UTF-8? (https://doc.rust-lang.org/std/ffi/index.html#conversions)

* Go: gets this right, but probably breaks on Windows? "string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text" (https://golang.org/pkg/builtin/#string)

In short, either I don't understand what point you're making, or it isn't unique to Python.


Not to bring up one of my favorite languages or anything, but I do think D got this completely right.

https://tour.dlang.org/tour/en/basics/alias-strings

> This means that a plain string is defined as an array of 8-bit Unicode code units. All array operations can be used on strings, but they will work on a code unit level, and not a character level. At the same time, standard library algorithms will interpret strings as sequences of code points, and there is also an option to treat them as sequence of graphemes by explicit usage of std.uni.byGrapheme.

And perhaps my favorite part (https://tour.dlang.org/tour/en/gems/unicode):

> According to the spec, it is an error to store non-Unicode data in the D string types; expect your program to fail in different ways if your string is encoded improperly.

I should note that what I really like about this approach is the total lack of ambiguity. There is no question about what belongs in a string, and if it's not UTF then you had better be using a byte or ubyte array or you are doing it wrong by definition.


So in D it's impossible to work with files if their filename is not Unicode?


Rather it would be an error to grab a Unix filename, figure your job was done, and store it directly into a string. So you'd... handle things correctly. Somehow. I admit I've never had the bad luck of encountering a non-UTF8 encoded filename under Linux before and can't claim with any confidence that my code would handle it gracefully. In any language, assuming you're using the standard library facilities it provides, things will hopefully mostly be taken care of behind the scenes anyway.

What I like about the D approach isn't that declaring it an error actually solves anything directly (obviously it doesn't) but that it removes any ambiguity about how things are expected to work. If the encoding of strings isn't well defined, then if you're writing a library which encodings are the users going to expect you to accept? Or worse, which encodings does that minimally documented library function you're about to call accept?


OsString/OsStr are not utf-8.


Would you care to elaborate? I'm not claiming to know Rust, but the link I provided clearly says "[t]hese do inexpensive conversions from and to UTF-8 byte slices".


https://doc.rust-lang.org/std/ffi/struct.OsString.html has it pretty clearly: on unixes it’s a bag of bytes, on Windows it’s the modified UTF-16 they’ve got going on. There’s a trick called WTF-8 that bridges some gaps, though that’s considered an implementation detail: https://simonsapin.github.io/wtf-8/

They’re conversions because they’re not UTF-8 in the first place, that is, they’re not String/str. The conversions are as cheap as we can make them. That language is meant to talk about converting from OsString to String, not from the OS to OsString.


Why do people say "bag" instead of string/sequence/vector/array/list/etc.? Bags are multisets... they're by definition unordered. It's already a niche technical term so it's really weird to see it used differently in a technical context...


I think it feels really evocative. Like, a bag of dice or something. You can’t see what’s inside, you have no idea what’s on them. It reinforces the “no encoding” thing well.


I think it is more eloquently stated that "you shouldn't make assumptions about what's inside." Saying "you can't see what's inside" ignores the biggest cause of the conflation. Userspace tools allow you to trivially interpret the bag of bytes as a text string for the purpose of naming it for other people.


Yeah, that might be better, good point.


One thing that amuses me given the number of complaints about the Python 3 string transition is how vastly better Python 3 is for working with bytes. The infrastructure available is light-years ahead of what Python 2 offered, precisely because it gave up on trying to also make bags of bytes be the default string type.


Thank you for saying that. Working with strings and bytes in Python 3 is nothing short of a joy compared to the dodgy stuff Python 2 did. People who complain about the change are delusional.


The only problem I have with Python3 strings/bytes handling is the fact that there are standard library functions which accept bytestrings in Py2 (regular "" strings), and Unicode strings in Py3 (again, regular "" strings in Py3).

This has led to developers attempting to conflate the two distinctly different concepts and make APIs support both while behaving differently.

A simple solution is there in plain sight: just use b"" and u"" strings exclusively for any code you wish to run in both Py2 and Py3, and forget about "". Any and all libraries should be using those exclusively if they support both. Python3-only code should be using b"" and "" instead.

One could consider this a design oversight in Python 3: the fact that the syntax is so similar elsewhere makes people want to run the same code in both, yet a core type is basically incompatible.


u"" is a syntax error in python3 (or at least it was for a while, apparently it's not anymore, that said...). The correct cross-platform solution is to do

    from __future__ import unicode_literals
which makes python2 string literals unicode unless declared bytes. Then "" strings are always unicode and b"" strings are always bytes, no matter the language version.


> u"" is a syntax error in python3

This has not been the case since 2012. The last release of Python 3 for which this was the case reached end of life in February 2016. Please stop misinforming people.


While u"" is accepted in current Python 3, for some reason they ignored the raw unicode string ur"" which still is a syntax error in Python 3. So, unicode_literals is definitely preferable.


> The correct cross-platform solution is to do

It's absolutely not correct, because there are many APIs which take "native strings" across versions aka they take an `str` (byte string) in Python 2 and an `str` (unicode string) in Python 3. unicode_literals causes significantly more problems than it solves.

The correct cross-platform solution (and the very reason why the `u` prefix was reintroduced after having initially been removed from Python 3) is in fact to use b"" for known byte strings, u"" for known text, and "" for "native strings" to interact with APIs which specifically need that.
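
A minimal sketch of those three literal kinds in code meant to run on both 2 and 3 (the values are made up):

    data = b"\x89PNG\r\n"   # known bytes: a bytes object on both 2 and 3
    label = u"caf\xe9"      # known text: a unicode/str text object on both
    mode = "strict"         # "native string": bytes on Python 2, text on
                            # Python 3, for APIs that want whichever str is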


Agreed. So many of the complaints leave me wanting to suggest that someone try it for longer than 15 minutes and see what they think then.


I haven't noticed much pain from the ASCII -> UNICODE migration from 2->3, but the one thing that still bothers me is that they did not update the CSV module to be transparent. In particular, the need to explicitly set the newline argument to `csv.reader`[0]. For me, dealing with a lot of data translation/extraction/loading work (ETL), this has been a big annoyance in migrating from 2 to 3.

I've not really had many other issues porting from 2 -> 3 from my own code; issues usually arise from 3rd-party libs that are relied upon (especially if they utilize C/C++ extensions). DB libs have sometimes been problematic. IIRC, pysybase requires you to set the encoding of strings now (which wasn't required before). I use pysybase to talk to both Sybase & MS SQL (it talks the TDS protocol).

[0] https://docs.python.org/3/library/csv.html#csv.reader


> I haven't noticed much pain from the ASCII -> UNICODE migration from 2->3, but the one thing that still bothers me is that they did not update the CSV module to be transparent. In particular, the need to explicitly set the newline argument to `csv.reader`[0]. For me, dealing with a lot of data translation/extraction/loading work (ETL), this has been a big annoyance in migrating from 2 to 3.

???

The Python 3 CSV module works on text only (unicode), and you seem to be misreading the note: the module does in fact do newline transparency: https://docs.python.org/3/library/csv.html#id3

`newline=''` is to be specified on the file so it doesn't perform newline translation, because that would change the semantics of newlines in quoted strings: by default, `open` will universally translate newlines to `\n`; `newline=''` means it still does universal newlines (for the purpose of iteration) but doesn't translate them, returning newline sequences unchanged.
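
Concretely, a minimal sketch of what the docs ask for (file name and encoding are assumptions here):

    import csv

    # newline='' keeps newline sequences inside quoted fields unchanged,
    # while the csv module still iterates the file row by row.
    with open("data.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            print(row)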

The Python 2 CSV module only worked on byte strings, and magic bytes.


Edit: I haven't tried to handle files with Python in a long time. All of the following is the kind of thing that can change as an ecosystem matures, so it may not hold anymore.

It's better in some ways. But then, I gave up long ago on manipulating files with Python, because Python 3 simply decided that anything on my filesystem is utf-8 by default.

Want to get a stream out of a file? Better have utf-8 content. Otherwise all the documentation will be buried 15 meters deep, outdated on some aspects and dependent on some not-yet-released features.

Want to name a file? Better have a utf-8 name. I gave up on python before I was able to open a file with a name that doesn't conform to utf-8.

Want to read a line at a time? Sorry, we only do that to text. Go get a utf-8 file.

Want to match with a regular expression or parse it somehow? Sorry, we only do that to text.

And so on.


Perhaps your information is out of date? I think none of what you said is true. Maybe some of it was true in the past.

> Want to get a stream out of a file? Better have utf-8 content. Otherwise all the documentation will be buried 15 meters deep, outdated on some aspects and dependent on some not-yet-released features.

    f = open("myfile.txt", "r", encoding=ENCODING)
> Want to name a file? Better have a utf-8 name. I gave up on python before I was able to open a file with a name that doesn't conform to utf-8.

    export NAME="$(head -c 5 /dev/urandom)"; touch "$NAME"; python -c 'import os; open(os.environ["NAME"])'
> Want to read a line at a time? Sorry, we only do that to text. Go get a utf-8 file.

    fh = open("foo", "rb"); lines=[l for l in fh]
> Want to match with a regular expression or parse it somehow? Sorry, we only do that to text.

    import re; re.match(b"abc", b"abcdef")


Yes, my experience is probably out of date.


This isn't really about Python, though. There is so much crappy software out there, in all languages, that makes incorrect assumptions about things like text and filenames. To this day, I'm amazed when anything works even remotely correctly when you throw something other than ISO-8859-1 and UTF-8 at it.

I lean towards "programmer problem" rather than "language problem".


"Unix filenames are bytes"

Are they? Is the separator 0x2f or / ? Or are you talking about the filename only, not paths?


The main point was that they’re not validated as Unicode so you can store values which cannot be decoded. It’s easy to find cases where someone could find a way to create a file which couldn’t easily be renamed or deleted using the OS’s built-in tools because even they forgot about this.

Similarly, Unicode normalization means that you can have multiple visually indistinguishable values which may or may not point to the same file depending on the OS and filesystem. I’ve had to help very confused users who were told that a file they could see wasn’t present because their tools didn’t handle that, too.


The former. The separator is a byte with the value 0x2F, which is equivalent to `/` in ASCII.


0x2f is / in ASCII and UTF-8, so this question doesn't make that much sense.


> Are they? Is the separator 0x2f or / ? Or are you talking about the filename only, not paths?

It's 0xf2. Encode a path to UTF-16 and watch the entire thing burn, with the FS seeing the path you provided end up right before or after the first / (because it encodes as either 0x00f2 or 0xf200 depending on BE versus LE).


Unix paths are nul-byte-terminated, so UTF-16 generally doesn't make any sense in this context. Valid Unicode paths are encoded as UTF-8 on unix systems. UTF-16 and UTF-32 are invalid ways to encode Unicode paths. (That's not to say no one has tried to do it, just that it doesn't make any sense.)

(As other commenters have pointed out, Unix paths do not require a specific encoding, so robust applications cannot rely on any assumptions about encoding of existing files. But when creating new files, they must not try to encode paths as UTF-16.)


> Unix paths are nul-byte-terminated

That's the point. The separator is not "/", because "/" would be a character to encode. The separator is a specific byte, and so is the terminator.

> Valid Unicode paths are encoded as UTF-8 on unix systems.

There is no such thing as "unicode paths" on UNIX systems, valid or invalid.


Endianness affects order of bytes (0xbeef vs 0xefbe), not nibbles (0xfeeb).


The GP's example shows it affecting bytes, not nibbles.


>UNIX files and paths have no encoding, they're just bags of bytes, with specific bytes (not codepoints, not characters, bytes) having specific meaning

Yes, and you should treat them as such. If you decode them you're doing it wrong. os.listdir lets you treat paths like bytes and responds in kind:

https://docs.python.org/3.5/library/os.html#os.listdir
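
For example, a minimal sketch of the bytes-in, bytes-out behaviour:

    import os

    # Passing bytes means bytes come back, so undecodable filenames
    # pass through untouched.
    for name in os.listdir(b"."):
        print(name)                        # e.g. b'CV - J\xf6rg.docx'
        full = os.path.join(b".", name)    # os.path accepts all-bytes parts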

The problems you mentioned do exist, and Python 3's response is to make you deal with them. This is the whole point, not a mark against it. Often when you find that Python 3 is making it hard for you, it's because you're doing it wrong.


While correct, that also means that you can't ever correctly show filenames to users either. At some point you need to relate it to an encoding as a user displays or enters it. "utf-8 everywhere now!"


Hi again. This was addressed in my other comment.

>You shouldn't decode them except if you need to for display (and in that case, you should be prepared to handle filenames which cannot be decoded as text).


> you can't ever correctly show filenames to users either

If I create a file whose name consists entirely of bytes that correspond to non-printable characters, how would you display that in the language of your choice?

If I create a file whose name cannot possibly be valid UTF-8, how would you display that in the language of your choice?

Both of these things are legal to do on at least some systems. Presumably you have in mind a language which will magically handle this case with no extra effort from the programmer, though, so I'm curious to know the language and what it would do.


> Yes, and you should treat them as such.

Of course you should. But the debate here is not about how a programmer should do it. The debate here is that Python3 doesn't do it that way by default, and at the very least it takes extra code to do it; as the article points out, in some cases like command line arguments it's impossible to do it that way.

A well designed computer language encourages good code by making it easy and obvious to write it. By that metric Python3's handling of bytes and strings is not well designed.

By the by I have been bitten by this. As others have mentioned a backup program turns out to be a worst case. I had to redo the string / file name handling twice before I was confident it was correct. I'm an experienced Python programmer both Python2 & 3. The final design wasn't obvious to me, and took substantially more code than I thought it would.


How is that different from Python 2? You have to deal with it either way.


Python 2 encourages the unaware to silently write broken, anglocentric code. Python 3 puts the problem in your face.


I don't see how that is responsive to the question about using bytes/Py2 string for path names.


Python 2 encourages the unaware to silently write broken, anglocentric code [which interacts with paths].


Python 3 encourages the unaware to write broken code in a utf8-centric way instead. The unaware are going to write broken code.



What specific difference are you trying to highlight? The only distinction I see is that Python 3.3 adds support for file descriptors.


Likely the existence and explicit mention of os.fsencode, as well as the API giving back what you put in, so it accepts bytes if you do things safely.


The 2 API also returns the same type you put in.

os.fsencode() was added to Python 3 long after the 3.0 fork and isn't inherent to 3 -- the same change could be made to Python 2. It only wasn't because Python.org has been intentionally neglecting Python 2 since 2015.

(The Python 2 equivalent of os.fsencode() is just encoding with sys.getfilesystemencoding(). The primary difference is that fsencode() uses the surrogateescape error handler by default. Since this is a relaxed behavior, it seems like it could be added to Python 2 without regressing existing programs.)


Yes, so the documentation and language explicitly have tools to avoid encoding issues and make you aware of them. It encourages writing good, unbroken code.


In my mind dealing with binary data and dealing with textual encodings are entirely separable issues. They should not be forced together by the language.


> the only safe way to handle text has always been to decode bytes as soon as you get them, work with Unicode, and then encode it when you send them out

Which is a lossy process. The example given in the article is dealing with legacy filenames -- strings from environments where the source encoding isn't necessarily even known.

You're doing the classic Programmer Thing here where you assume that the real problem is choosing a correct algorithm. It's not. Real problems are about dealing with data, and you don't always have access to a time machine to fix the algorithms that created that data in the first place.

Honestly the poster is correct as far as it goes: modern python (and modern string libraries more generally) are pretty bad at handling dirty input. It's just not a problem area they've been interested in treating, because they'd like to work on stuff that feels like the correct algorithm instead of making tools for dirty data.


Filenames aren't text, they're bytes. You shouldn't decode them except if you need to for display (and in that case, you should be prepared to handle filenames which cannot be decoded as text). You should decode data into its canonical representation as early as possible and encode it as late as possible, but the canonical representation of a filename is not text.
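
As a minimal sketch of decoding only at the display boundary (the name below is a made-up latin1-named file seen on a UTF-8 system):

    name = b"CV - J\xf6rg.docx"                    # raw bytes from the filesystem
    print(name.decode("utf-8", errors="replace"))  # 'CV - J\ufffdrg.docx', display only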


Filenames are text if you ever want to let a user interact with them by name.


The two conditions

* OS treats filenames as bytes and allows arbitrary byte strings,

* Users want to edit filenames as text

are in conflict and you can't solve it without forcing some encoding or giving them a byte editor. The solution "filenames are bytes but we'll treat it as ASCII because 0x20-0x7E should be enough for anybody" is not acceptable for möst of thé wõrld.


I addressed this in my original comment.


> Honestly the poster is correct as far as it goes: modern python (and modern string libraries more generally) are pretty bad at handling dirty input.

This is incorrect: Python is fine for working with messy data. The difference is that when you CONVERT that data you have to handle the possibility of errors rather than hoping for the best. If you’re working with filenames, you can pass bytes around all day long without actually decoding them; it’s only when you decode them that you’re forced to handle errors.


> The difference is that when you CONVERT that data

Heh, "convert" it by concatenating it with a different directory path? Print it for the user? Stuff it into some kind of output format? Every one of those actions is going to toss an exception in python[1], and there are no tools available for reliably doing this in a safe and lossless way that matches the obvious syntax you'd use with "good" strings.

Maybe that's "fine" to you, I guess.

[1] And in lots of other environments. The quibble I have with the linked article is, again, that this kind of thinking is pervasive in modern string handling. Python certainly isn't alone.


Changing paths and other filename manipulations are supposed to be done using os.path or pathlib. The discussion at https://docs.python.org/3/library/os.path.html starts with this problem and notes that the functions support all bytes or all Unicode but you have to be consistent. Don’t force a conversion to text and it works fine.
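
A quick sketch of the "all bytes or all text, but not a mix" rule (the paths are made up):

    import os.path

    os.path.join(b"/backups", b"CV - J\xf6rg.docx")   # fine: all bytes
    os.path.join("/backups", "CV - J\xf6rg.docx")     # fine: all text
    # os.path.join("/backups", b"CV.docx")            # TypeError: can't mix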

Similarly once you’re talking about output other than passing it through unmodified to a format which can handle that, you are by definition converting it and need to handle what can go wrong. It’s easy to handle this in several ways, either by reporting an error or using some sort of escaped representation with an indication that this doesn’t match the document encoding, but you no longer have the luxury of pretending that treating string as a synonym for bytes was ever safe.


And the need to use special libraries to handle objects that have been strings since the dawn of Unix is precisely the kind of mess the poster is talking about. Yes, yes, everyone agrees that these problems "can" be solved in Python. The question treated is whether or not Python (and modern utf8-centric string libraries) solves them WELL.


Pathlib[0] is kind of a step in the right direction. I think you could do something like subclass it to track each component (and component encoding) separately if that makes sense for your application.

I'm not a fan of Python3's language fork in any way (I think it was completely unnecessary to fork the language to make the improvements they wanted to) but I'll admit things like Pathlib and UTF-8b are a step in the right direction for handling arbitrary Unix paths. (I work with a large Python 2 code base and a product that has to interact with a variety of path encodings, so this subject is... sensitive for me.)

[0]: https://www.python.org/dev/peps/pep-0428/


Python 3.1+ uses a non-lossy decoding of disk filenames called UTF-8b by default. You can read more about it here: https://www.python.org/dev/peps/pep-0383/ . I don't like Python3 at all, but I think it's a reasonable approach.
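
A minimal sketch of that round trip (the filename is made up; the decoded output shown assumes a UTF-8 filesystem encoding):

    import os

    raw = b"CV - J\xf6rg.docx"        # latin1-ish bytes, not valid UTF-8
    name = os.fsdecode(raw)            # 'CV - J\udcf6rg.docx', lone surrogate
    assert os.fsencode(name) == raw    # encodes back to the original bytes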


I didn't read the article as saying "I hate the complexity of Unicode", I read it as a complaint about language/library ergonomics. Comparing a string to a bytestring should not just always evaluate to false, it should (IMHO) raise a type error.


It read to me as not learning how things worked and then blaming the language when an upgrade revealed that misunderstanding rather than taking the time to learn how things actually work.


Even Python 3 makes it easy to ignore how things work. open happily accepts a string. Regex works on bytes. open and the standard streams can be used for reading and writing without explicitly specifying any encoding or error handling strategy. There's a default value for encoding on codecs.encode and codecs.decode. And nothing is checked statically, so you get to discover all your mistakes at runtime—and often only with a specific combination of inputs and environment.

People don't think about these things because Python encourages you to ignore them. Until it doesn't. There'd be a lot less confusion if Python 3 were more strict/explicit about conversions.
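
A sketch of the contrast being described (the file name is hypothetical):

    f = open("notes.txt")                    # encoding silently depends on the
                                             # locale the program runs under
    f = open("notes.txt", encoding="utf-8",  # intent (and failure mode) is
             errors="strict")                # spelled out explicitly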


> Comparing a string to a bytestring should not just always evaluate to false, it should (IMHO) raise a type error.

Equality testing is implicitly done in a lot of places (like searching), it would become a bit of a hassle if the basic types were to throw exceptions when compared with each other.


It would also result in correct code sooner.


Perhaps, but the ship sailed long ago on that one

    >>> 1 == '1'
    False


Finding bugs early is considered a hassle these days? Really?


Not saying it's right but:

For something like Unicode, I think the point is that it "just worked" before, where it would have been a lot more time and effort to get it "working" in, say, Java. Types and compile/runtime errors are a balance in the end that depends on the use case and on whether the programmer wants something that works (for the short term, usually) or is resilient and will work long term. IMO, the surge in popularity of untyped languages (Javascript being used on the server side, ew) is a sign that so much coding today is not done for the long haul. I think there are good balances to be hit though, and Unicode is a place that highlights that python could use some better optional typing aid, as Typescript brought to JS.


As someone with so called “Unicode-characters” in my name, I have encountered bugs and issues on every single American website in the history of the internets, and you will have to forgive me for preferring correctness over programmer convenience.

For me it’s literally personal.


How many more ads are you willing to watch to pay for the correctness?


While your comment is trollish, I’ll let you know I’ve had many issues with payments because shitty web-sites have been unable to 1. Validate my name and 2. Correlate whatever their whitelist permits with actual information returned from the payment processor.

So your inability to handle basic stuff like this is actually costing you money.


Why would this be a type error? "Are these two things equivalent?" has an unambiguous answer regardless of whether the things are of different types. When they are not the same kind of thing, the answer is simply `False`. There's no reason to have a third answer "You're not allowed to ask that" for that case.


It's not unambiguous at all. If I do

    "część" == "część".encode('<some-obscure-encoding>')
should it return True? They are 'equivalent' after all. But that would mean it should work for any encoding, so testing `s == b` would entail trying to decode `b` as every known encoding to see if any of them gives `s`.

My point is, `s == b` is not a well-defined question. It only makes sense to ask it for a particular encoding: "Are these two things isomorphic according to cp-1252?"¹². Giving a type error is a decent way of signalling all this.

---

¹ And even then, I don't like automatic type conversions that enable things like `3 == "3"`, but I guess that's a matter of taste.

² Of course you could pick a convention like `s == b <=> s = b.decode("utf8")`, but that's a different question.


No, it should return `False` because they're different types, as I said.

> I don't like automatic type conversions that enable things like `3 == "3"`

I didn't mention this, and Python, the language under discussion, doesn't do this.


Okay, it looks like I misunderstood your original point:

> "Are these two things equivalent?" has an unambiguous answer regardless of whether the things are of different types.

as

"Isomorphic things should compare equal"

So, to respond to (what I think is) your actual point: in my view, equality is defined on values of a single type, so I prefer to distinguish

"a isn't equal to b" (= False)

and

"there's no defined way to compare a and b, because they have different types" (= some kind of error or a null-like value)

To me, conflating the two seems counterproductive, but perhaps this is personal preference (probably correlated with how much one likes static typing).


Python has operator overloading, though, so a flat rule that comparisons involving values of different types must always be a type error would actually reduce the power of the language. Granted, there aren't a ton of use cases where it's important to be able to do this, but it is useful on occasion.


You're right – I guess I got too hung up on the types. What I meant to say is that I prefer to distinguish "`a` is not equal to `b`" and "equality between `a` and `b` is undefined"; the typing aspect is kind of orthogonal and muddles the discussion.

(I guess I ended up conflating equality and equivalence[1], but in my defense, most languages seem to do that too; and the presence of the `is` operator in Python mixes things up even more)

[1] https://en.m.wikipedia.org/wiki/Equality_(mathematics)#Relat...


It would be a type error because that would be helpful. Since comparing a bytes to a str is not usually what you want, Python could make you be explicit about it. It’s too late now, and it might have been considered another backwards compatibility break (so a good thing to keep out of Python 3), but it’s not a bad idea on its own. Heterogeneous collections are pretty rare.

Also, `==` doesn’t always answer the question of whether two things are equivalent in Python:

  >>> 1.0 == 1
  True
  >>> {'a': 5} == OrderedDict({'a': 5})
  True
(OrderedDict equality isn’t transitive.)


Pretty rare but not unheard of. I do know that most of the code bases I work in would require significant reworking if I have to expect `==` to throw a TypeError for basic builtin types.


Not quite with ==, but there is precedent where operators with basic builtin types throw a TypeError when the operation doesn't make sense:

    >>> 0 > None
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: '>' not supported between instances of 'int' and 'NoneType'


Absolutely; there's no good boolean answer for > when you have different types, so the `TypeError` is perfect there.


At the same time, this result is entirely dependent on the types involved. For example, "1.0 > 0" has a well-defined answer, and Python returns it.

Considering the expression "x > y", the full process is:

1. Call x.__gt__(y).

2. If that returns a bool, that's the result. If it raises an exception, propagate the exception.

3. If it returns the sentinel value NotImplemented, attempt the reflected version of the operation: y.__lt__(x).

4. If that returns a bool, that's the result. If it raises an exception, propagate the exception.

5. If it returns the sentinel value NotImplemented, raise TypeError and inform the programmer this operator is not valid on these types.
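
A sketch of that protocol with a made-up type: Meters only knows how to compare against other Meters, so it returns NotImplemented for anything else and lets Python try the reflected operation before raising TypeError.

    class Meters:
        def __init__(self, value):
            self.value = value

        def __gt__(self, other):
            if isinstance(other, Meters):
                return self.value > other.value
            return NotImplemented   # let Python try other.__lt__(self)

    Meters(3) > Meters(2)    # True
    # Meters(3) > "2"        # TypeError: '>' not supported between
                             # instances of 'Meters' and 'str'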


Could you explain how your example shows that?


Which “that”? If you’re asking about `OrderedDict`, I’m referring to equality between `OrderedDict`s:

  >>> a = OrderedDict([(1, 2), (3, 4)])
  >>> b = {1: 2, 3: 4}
  >>> c = OrderedDict(reversed(a.items()))
  >>> a == b == c
  True
  >>> a == c
  False


I was referring to this:

> Also, `==` doesn’t always answer the question of whether two things are equivalent

because the examples you posted kinda make it look like it does, i.e. `1.0 == 1` even though one is a float and the other is an int.


> "Are these two things equivalent?" has an unambiguous answer regardless of whether the things are of different types.

Absolutely not; that's basically the OG subjective question whose lack of an obvious right answer underpins a substantial quantity of the bugs in less strongly-typed languages.

Nor is an unambiguous answer possible symbolically, or in common logic, or the semantics of spoken languages.


I think you stopped reading before my next sentence, which says the answer is "false" if the types are different.


Agree, whinging about Unicode just seems like atavism for the days when computer users who did not speak English were ignored.


> As far as I can tell this is a long-form “I used to be able to ignore encoding issues and now it’s a ‘mess’ because the language is forcing me to be correct”.

That’s one of the reasons I love Swift and appreciate that it wants to get these things right, even if it means changing every year in the first few versions.


Text encoding in general is a mess, and Python 2 Unicode support was a mess, but Python 3 makes it much less of a mess.

I think the author has a mess on his hands because he's trying to do it the Python 2 way – processing text without a known encoding, which is not really possible, if you want the results to come out right.

To resolve the mess in Python 3, choose what you actually want to do:

1. Handle raw bytes without interpreting them as text – just use bytes in this case, without decoding.

2. Handle text with a known encoding – find out the encoding out-of-band from some piece of metadata, decode as early as possible, handle the text as strings.

3. Handle Unix filenames or other byte sequences that are usually strings but could contain arbitrary byte values that are invalid in the chosen encoding – use the "surrogateescape" error handler; see PEP 383: https://www.python.org/dev/peps/pep-0383/

4. Handle text with unknown encoding – not possible; try to turn this case into one of the other cases.
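
For case 3 above, a minimal sketch of the surrogateescape round trip (the byte string is a made-up non-UTF-8 filename):

    raw = b"notes-\xff.txt"
    name = raw.decode("utf-8", errors="surrogateescape")    # 'notes-\udcff.txt'
    assert name.encode("utf-8", errors="surrogateescape") == raw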

Also, watch Ned Batchelder's excellent talk, Pragmatic Unicode, or, How do I stop the pain?, from 2012: https://pyvideo.org/pycon-us-2012/pragmatic-unicode-or-how-d...


> To resolve the mess in Python 3, choose what you actually want to do...

The thing is that this is not actually going to happen. Programs are simply broken across the board, because few people can be bothered to deal with all these peculiarities.

The difference is, in Python 2, output would be corrupted in some edge cases, but generally it would "just work". In Python 3, the program falls flat on its face even in cases that would've ended up working fine in Python 2.

I don't think there's a general answer on which behavior causes less real-world problems total, but the idea that Python 3 makes less of a mess is not something I can agree with.


My experience, being from a non-English language, was the exact opposite. Python 2 would fail in horribly weird ways and you constantly needed to add weird tricks to get simple functions working. In Python 3 I haven’t even encountered any similar issues; everything just works. Of course sometimes you need to specify some encodings, but I don’t view that as a failure of the language. I think a lot of people have a biased view because tons of issues were just not apparent in English, but if you want a language to be viable for the entire world you have to look outside that limited set of characters.


This just reflects your experience of only speaking English; to most of the world, their native language is not an edge case.


Also not using non-ASCII typography, emoji, etc. I’m glad that emoji have become popular in the US since that’s dramatically decreased the time before a text processing system gets non-ASCII input from English-native users. Doubly useful, they’re outside the BMP and flush out partial Unicode support, too, like old MySQL installs using utf8mb3.


> This just reflects your experience of only speaking English; to most of the world, their native language is not an edge case.

Excuse me, I don't exclusively deal in 7-bit ASCII characters just because I happen to speak English, which isn't the only language I speak either.


As a Finn and prolific user of å, ä, ö, and € among other things, Python 3 was a massive improvement. Sure it forces me to choose what to do at the Unicode boundary but Py2 was rife with UnicodeErrors that would pop up at the most inopportune of moments. I do not miss Py2's string handling at all.


The vast majority of use cases for every program I have ever used or written would consider silent corruption of data to be a significantly worse issue than a crash.


You cannot simplify a problem by claiming it's too hard to solve for most people.

It's just the way it is.

Python 3 completely removed the need for us to talk about unicode and encodings in our third semester data analysis workshop for physics students, because it just works with umlauts and Greek letters. Python 2 was a real pain.


There is a particular use case which leads to frustration with Python 3, if you don't know the latin1 trick.

The use case is when you have to deal with files that are encoded in some unknown ASCII-compatible encoding. That is, you know that bytes with values 0–127 are compatible with ASCII, but you know nothing whatsoever about bytes with values 128–255.

The use case arises when you have files produced by legacy software where you don't know what the encoding is, but you want to process embedded ASCII-compatible parts of the file as if they were text, but pass the other parts (which you don't understand) through unchanged (for example, the files are documents in some markup language, and you want to make automatic edits to the markup but leave the rest of the text unchanged). Processing as text requires you to decode it, but you can't decode as 'ascii' because there are high-bit-set characters too.

The trick is to decode as latin1 on input, process the ASCII-compatible text, and encode as latin1 on output. The latin1 character set has a code point for every byte value, and bytes with the high bit set will pass through unchanged. So even if the file was actually utf-8 (say), it still works to decode and encode it as latin1, and multi-byte characters will survive this process.
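
A minimal sketch of the trick (the file name and markup edit are made up): latin1 maps every byte value to exactly one code point, so unknown high-bit bytes survive the decode/edit/encode cycle unchanged.

    with open("legacy.sgml", "rb") as f:
        text = f.read().decode("latin1")

    text = text.replace("<titel>", "<title>")    # edit only ASCII-range markup

    with open("legacy.sgml", "wb") as f:
        f.write(text.encode("latin1"))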

The latin1 trick deserves to be better known, perhaps even a mention in the porting guide.


> The latin1 character set has a code point for every byte value

No it doesn’t. The whole range of 128-159 is undefined. However, the old MS-DOS CP-437 encoding (which is incompatible with latin1/ISO-8859-1) does. So your trick is valid, but not with latin1.


I can’t edit my post now, but it turns out I was wrong. The range of 128-159 is defined in ISO-8859-1, as little-used “control characters”:

https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set

So, the trick described by garethrees does work with latin1, and I was mistaken.


> No it doesn’t. The whole range of 128-159 is undefined.

Not in the sense of being decodable. If you decode a byte string with Latin-1, you get a unicode string containing code points 0-255 only, each code point matching exactly the numerical value of the corresponding byte in the byte string. So you can recover exactly the original byte string by re-encoding. Plus, every possible byte string is valid for decoding in Latin-1, so you will never get any decode/encode errors. As long as you don't care about the semantic meaning of bytes 128-255, this allows you to preserve the data while still working with Unicode strings.


"Undefined" bytes in this context does not equate to "invalid" bytes; they were more like don't-cares. Say, let's assume that they were invalid per se, then ISO/IEC 8859-1 would not allow a newline and tab that are not defined in ISO/IEC 8859-1 but a part of ISO/IEC 6429 C0 control codes. But a character set without a newline sounds... absurd?

It should be pointed out that the historical model of character sets is much different from today. First, recap:

* A (coded) character set is a partial function from an integer to a defined character meaning.

* A character encoding is a total function from a stream of bytes to either a stream of characters or an error.

ISO/IEC 8859-1 is a coded character set, but not a character encoding. It was possible to treat character sets as character encodings, and in fact this separation became apparent only after the rise of Unicode. But as you see 8859-1 does not have a newline, therefore there should be something else to provide them. Thus there had been "adapter" character encodings that makes use of desired character sets: most prominently ISO/IEC 2022 and ISO/IEC 4873. In most practical implementations of both 6429 is a default building block, so as a character encoding 8859-1 contains 6429, although 8859-1 itself was not really a proper encoding.

One more point: 2022 and 4873 were not only character encodings available at that time. One may simply define character encodings by turning a partial function to a total function or defining a total function from the beginning, and that's what IANA did [1]. IANA's version of 8859-1 ("ISO-8859-1") [2] is a proper character encoding with all control codes defined. And I believe the alias "latin1" actually came from this registration!

[1] https://www.iana.org/assignments/character-sets/character-se...

[2] https://tools.ietf.org/html/rfc1345#page-63


This should definitely be documented somewhere. I (and probably many others) figured this out the hard way by trial-and-erroring through the encodings list.

(For future search engine users, this was for PGN files from Kingbase using the python-chess library)


Would not a better solution be to process the file as a byte string?


I don't think so. If you want to detect and operate on only the data that could represent ASCII characters, you could certainly process it as a byte string if you wanted, but you'd have to track the presence of non-ASCII-range character codes yourself, and keep state around to represent whether you were in the middle of a multibyte character as you read through the bytes.

If done right, it would be a (probably much slower) re-implementation of what happens when you use the latin1 trick mentioned. You have to get it right, though (sneaky edge cases abound--what if the file starts in the middle of an incomplete multibyte character?).

TL;DR this could technically work but is a poor idea.


This is talking about the case where you don't know the encoding. So you don't know which byte sequences are multibyte characters. Whether you use latin1 or bytes the edge cases are exactly the same, and they don't get handled.


You wouldn't be able to use any APIs that only work on string (unicode) type objects.


I ran into this when trying to process .srt (subtitles) files. The timestamp information is encoded in ASCII, and the actual subtitle text you would like to pass through unaltered. (In my case, I was just adjusting the timestamps).


isn't the correct practice to use errors="surrogateescape" for precisely this purpose with any encoding? So in this case, you would use .decode("ascii", errors="surrogateescape") as the first bytes are the only ones you are sure of, and then .encode("ascii", errors="surrogateescape") to save again


> perhaps even a mention in the porting guide

It is in at least one porting guide:

http://www.catb.org/esr/faqs/practical-python-porting/


Not sure I follow. How does the latin1 trick handle multibyte characters?


_If_ there aren’t any multibyte characters that contain bytes that could be ASCII characters, the “process the ASCII-compatible text” step doesn't change any multibyte characters, so they round-trip.

Of course, this will break down if multi-byte characters can contain byte values that could be ASCII. It can break HTML or TeX, for example.


If you're looking at legacy 8-bit encodings, you'll be ok; most (all?) of those have ASCII as the first 128, or if not (EBCDIC), you're pretty screwed anyway. For utf-8 you're ok too -- all of the multibyte sequences have the high bit set. For ucs-2 or utf-16, you're likely to screw things up.


It doesn't - the parent says this is for (unknown, but) ascii-compatible encodings - old fashioned codepages.


And then gives the example of utf-8.


UTF-8 is ascii-compatible. Everything with the high bit cleared (characters 0x00-0x7F) is represented identically to ASCII. All codepoints >= 0x80 are represented with multiple bytes with the high bit (0x80) set.

UTF-8 is a very elegant construct for Unix-type C systems — you could basically reuse all your nul-terminated string APIs.


Sure. But it’s not at all clear to me that this trick would actually handle multibyte utf-8 chars correctly.


Consider the codepoint U+1F4A9 ("PILE OF POO").

This encodes to the byte sequence F0 9F 92 A9 in UTF-8. Notice that every one of these bytes has a value > 0x7F, which means they're all outside the ASCII range.

That's one of the useful properties of UTF-8: you know that a code point requiring multi-byte encoding will never contain any bytes that could be confused for ASCII, because every byte of a multi-byte code point will be > 0x7F.

Which in turn means that if you use any processing mechanism that only alters bytes which are in the ASCII range, and passes all other bytes through unmodified, you are guaranteed not to modify or corrupt any multi-byte UTF-8 sequences.
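
A quick check of that property:

    encoded = "\U0001F4A9".encode("utf-8")     # U+1F4A9 PILE OF POO
    print(encoded)                              # b'\xf0\x9f\x92\xa9'
    print(all(b > 0x7F for b in encoded))       # True: no ASCII-range bytes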


Oh that’s interesting, I didn’t realize utf-8 had that nice property.


Can you provide example code of the latin1 trick?


This advice is very dangerous: both Shift JIS and UTF-16 (some of the most common non-UTF-8 encodings) can have things that are 0-127 ASCII codepoints and things that look like 0-127 ASCII but are in the second part of a multi-byte sequence, and do not represent ASCII-equivalent characters at all.


Note that they said "ASCII-compatible encoding". You're right to note the problem with shift-JIS and others, but then, those aren't ASCII-compatible. Whereas utf8 and the iso-8859 series are all ASCII-compatible in that if it looks like an ASCII character it is.


The point is, certain text, especially shift-JIS and the various EUC encodings, can look exactly like an 8 bit "extended ASCII" when it's in fact a variable width 8-16 bit encoding.

It's bad advice that leads to corruption.

If you already know the encoding, then OP's advice is useless; if you don't, but suspect it's an 8 bit extended ASCII encoding, it might not be, because the aforementioned encodings look exactly like an 8 bit encoding.


The real problem here is that

* UNIX file systems allow any byte sequence that doesn't contain / or \0 as file and directory names

* User interfaces have to render that as strings, so they must decode

* There is no metadata about what the filename encoding is

Many programs use the encoding from the current locale, which is mostly a good assumption, but the way that locales scope (basically per process) has nothing to do with how file names are scoped.

So, many programs make some assumptions. Some models are:

1) Assume file names are encoded in the current locale

2) Assume file names are encoded in UTF-8

3) Don't assume anything

The "correct" model would be 3), but it's not very useful. People want to be able to sort and display file names, which generally isn't very useful with binary data.

Which is why most programs, including python, use 1) or 2), and sometimes offer some kind of kludge for when the assumption doesn't hold -- and sometimes not.

IMHO a file system should store an encoding for the file names contained in it, and validate on writes that the names are correct. But of course that would be a huge POSIX incompatibility, and thus won't happen.

People just live with the current models, because they tend to be good enough. Mostly.


A per-file-system encoding doesn’t fix the problem either, because there are regularly problems at file system boundaries. For example, moving from a case-sensitive file system to a case-insensitive file system, you will eventually find two distinct bags of bytes that something thinks are the same; if the original disk contains files with both names, you have to pick one or error out. Even if you know the first file system is UTF-8 and the 2nd is ISO-8859-1, the case difference can still be there.

It seems to me that the only correct solution is to error out when there are at least two viable versions of a file. Even if you’re trying to “sort and display file names”, you would need to acknowledge at that point that the exact file name isn’t clear. And once your program is doing that, it doesn’t really matter why the two versions are different: maybe it’s a Unicode problem, maybe it’s not.


Would it really be POSIX-incompatible? Does the standard mandate that a filesystem can place no such restriction on top of "filenames are unencoded bytes"? If not, then it's just that tools cannot blindly assume filenames are decodable. Isn't MacOS guaranteeing UTF8 these days, while still being POSIX-compliant?


I don't know if there is an explicit mandate for that in the standard, but forbidding things that were previously allowed, both in code and documentation, is not backwards compatible.

Just imagine a file system that wouldn't allow the character "e" in file names.

Of course, the impact would be not as drastic, but still it's backwards incompatible.


I'm not so sure other languages do that any better (nodejs doesn't even support non-unicode filenames at all, for instance). Modern python does a pretty good job at supporting unicode, very far away from being a "Mess"; that's just very much not true at all. People always like to hate on python, but then other languages supposedly designed by actually capable people do mess up other stuff all the time. Look at how the great Haskell represents strings, for instance, and what a clusterfuck[1] that is.

[1] https://mmhaskell.com/blog/2017/5/15/untangling-haskells-str...


Rust is probably one of the languages which does this crap best, and that's thanks to static typing and deciding to not decide:

1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

2. it has proper bytes, entirely separate from strings

3. it has "the OS layer is a giant pile of shit" OsString, because file paths might be random bag of bytes (UNIX) or random bags of 16-bit values (and possibly some other hare-brained scheme on other platforms but I don't believe rust supports other osstrings currently)

4. and it has nul-terminated bag o'bytes CString

For the latter two, conversion to a "proper" language string is explicitly known to be lossy, and the developer has to decide what to do in that case for their application.


> 1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:

1. You want to figure out where to position the cursor when you hit left or right.

2. You want to reverse a string. (When was the last time you wanted to do that?)

The list of times when you want to iterate over Unicode codepoints:

1. When you're implementing collation, grapheme cluster searching, case modification, normalization, line breaking, word breaking, or any other Unicode algorithm.

2. When you're trying to break text into separate RFC 2047 encoded-words.

3. When you're trying to display the fonts for a Unicode string.

4. When you're trying to convert between charsets.

Cases where neither is appropriate:

1. When you want to break text to separate lines on the screen.

2. When you want to implement basic hashing/equality checks.

(I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently").

Grapheme clusters are relatively expensive to compute, and their utility is very circumscribed. Iterating over Unicode codepoints is much more useful and foundational and yet still very cheap.


> Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:

> 1. You want to figure out where to position the cursor when you hit left or right.

> 2. You want to reverse a string. (When was the last time you wanted to do that?)

You missed the big one:

3. You want to determine the logical (and often visual) length of a string.

Sure, there are some languages where logical-length is less meaningful as a concept, but there are many, many languages in which it's a useful concept, and can only be easily derived by iterating grapheme clusters.


Visual length of a string is measured in pixels and millimetres, not characters. In a font/graphics library, not in a text processing one.


Sorry, visual length as in visual number of "character-equivalent for purposes of word length" things. Those things are close to, but not exactly the same as, grapheme clusters, so the latter can often be used as an imperfect (but much more useful than unicode points or bytes) proxy for the former.

There's no perfect representation of number-of-character-equivalents that doesn't require understanding of the language being handled (and it's meaningless in some languages as I said), but there are many written languages in which knowing the length in those terms is both extremely useful and extremely hard to do without grapheme cluster identification.


>character-equivalent for purposes of word length

Serious question: why would you want to do this?

I know it's fashionable to limit usernames to X characters... but why? The main reason I've seen has been to limit the rendered length so there are some mostly-reliable UI patterns that don't need to worry about overflows or multiple lines. At least until someone names themselves:

W W W W W W W W W W W W W W W W W W W W

Which is 20 characters, no spaces, and will break loads of things.

(I'm intentionally ignoring "db column size" because that depends on your encoding, so it's unrelated to graphemes)


Serious question: why would you want to do this?

Have you never, in your entire life, encountered a string data type with a length rule? All sorts of ID values (to take an obvious example) either have fixed length, or a set of fixed lengths such that every valid value is one of those lengths, and many are alphanumeric, meaning you cannot get round length checks by trying to treat them as integers. Validating/understanding these values also often requires identifying what code point, not what grapheme, is at a specific index.

Plus there are things like parsing algorithms for standard formats. To take another example: you know how people sometimes repost the Stack Overflow question asking why "chucknorris" turns into a reddish color when used as a CSS color value? HTML5 provides an algorithm for parsing a (string) color declaration and turning it into a 24-bit RGB color value. That algorithm requires, at times, checking the length in code points of the string, and identifying the values of code points at specific indices. A language which forbids those operations cannot implement the HTML5 color parsing algorithm (through string handling; you'd instead have to do something like turn the string into a sequence of ints corresponding to the code points, and then manually manage everything, and why do that to yourself?).
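
To make that concrete, here's a tiny Python sketch; the 10-character rule is invented, purely to show length-in-code-points checks and code-point indexing:

    def looks_like_id(value: str) -> bool:
        # Hypothetical rule: exactly 10 alphanumeric code points,
        # a letter at index 0 and a digit at index 9.
        return (len(value) == 10          # length in code points
                and value.isalnum()
                and value[0].isalpha()    # code point at a specific index
                and value[9].isdigit())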


Yes. All instances I've seen have been due to byte-size restrictions (so it depends on encoding) or for visual reasons (based on fundamentally flawed assumptions). With exceptions for dubious science around word-lengths between languages / as a difficulty/intelligence proxy, or just having fun identifying patterns. (interesting, absolutely, but of questionable utility / bias)

But every example you've given has been around visuals, byte sizes, or code points (which are unambiguously useful, yes). Nothing about graphemes.


So?

Rust's stdlib provides iteration on code units and code points. The use cases where these are useful are covered.

It does not provide iteration on grapheme clusters, the use cases where this is useful are not covered (and require an external dependency).

At no point am I requesting replacing codepoints-wise iteration by clusters-wise iteration.


I think a more accurate characterization is that neither code points nor grapheme clusters are usually what you want, but when you're naively processing text it's usually better to go with grapheme clusters so you don't mess up _as_ badly :)

There are definitely some operations that make sense on code points: but if you go through your list, (1), (2), (4) are things you'll rarely implement yourself (you just need a library), (3) is ... kinda rare? The most common valid use case for dealing with code points is parsing, where the grammar is defined in ascii or in terms of code points (which is relatively common).

Treating strings as either code points or graphemes basically enshrines the assumption that segmentation operations make sense on strings at all -- they only do in specific contexts.

Most string operations you can think of come from incorrect assumptions about text. Like you said, the answer to most questions of the form "how do I X a string" is "wrong question" (reversing a string is my favorite example of this).

The only string operation that universally makes sense is concatenation (when dealing with "valid" strings, i.e. strings that actually make sense and don't do silly things like starting with a stray modifier character). Replacement makes some sense but you have to define "replacement" better based on your context. Taking substrings makes sense but typically only if you already have some metric of validity for the substring -- either that the substring was an ingredient of a prior concatenation, or that you have a defined text format like HTML that lets you parse out substrings. (This is why i actually kinda agree with Rust's decision to use bytes rather than code points for indexing strings -- if you're doing it right you should have obtained the offset from an operation on the string anyway, so it doesn't matter how you index it, so pick the fast one)

Most string operations go downhill from here where there's usually a right thing to do for that operation but it's highly context dependent.

Even hashing and equality are context-dependent, sometimes comparing bytes is enough, but other times you want to NFC or something and it gets messy quickly :)

In the midst of all this, grapheme clusters + NFC (what Swift does) are abstractions that let you naively deal with strings and mess up less. Your algorithm will still be wrong, but its incorrectness will cause fewer problems.

But yeah, you're absolutely right that grapheme clusters are pretty niche for when they're the correct tool to reach for. I'd just like to add that they're often the less blatantly incorrect tool to reach for :)

> (I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently").

This is true, and not thinking about the problem differently is what caused the iOS Arabic text crash last year.

For many if not most scripts fewer code points is not a guarantee of shorter size -- you can even get this in Latin if you have a font with some wild kerning -- it's just that this is much easier to trigger in Arabic since you have some letters that have tiny medial forms but big final forms.


There's a very sound argument to be made for the opposite conclusion, that if we care about a problem we should make it necessary to solve the problem correctly or else stuff very obviously breaks, not have broken systems seem like they kinda work until they're used in anger.

Outside of MySQL (which unaccountably had a weird MySQL-only character encoding which only covered the BMP and named it "utf8"; then, when you tried to shove actual UTF-8 strings into it, they'd get silently truncated, because YOLO MySQL) UTF-8 implementations tended to handle the other planes much better than UTF-16 implementations, many of which were in practice UCS-2 and then some thin excuses. Why? Because if you didn't handle multiple code units in UTF-8 nothing worked; you couldn't even write some English words like café properly. For years, pretending your UCS-2 code was UTF-16 would only be noticed by people using obscure writing systems or academics.
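
You can see that asymmetry from a Python prompt, counting code units for a BMP character versus an astral one:

    len("café")                           # 4 code points
    len("café".encode("utf-8"))           # 5 UTF-8 code units -- 'é' already needs two
    len("café".encode("utf-16-le")) // 2  # 4 UTF-16 code units -- looks just like UCS-2

    len("😀".encode("utf-8"))             # 4 UTF-8 code units
    len("😀".encode("utf-16-le")) // 2    # 2 UTF-16 code units -- the surrogate pair UCS-2 code misses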

I am also reminded of approaches to i18n for software primarily developed and tested mainly by monolingual English speakers. Obviously these users won't know if a localised variant they're examining is correctly translated, but they can be given a fake "locale" in which translated text is visibly different in some consistent way, e.g. it has been "flipped" upside down by abusing symbols that look kind of like the Latin alphabet upside down, or Pig Latin is used "Openway Ocumentday". The idea here again is that problems are obvious rather than corner cases, if the translations are broken or missing it'll say "Open Document" in the test locale which is "wrong" and you don't need to wait for a specialist German-speaking tester to point that out.


> There's a very sound argument to be made for the opposite conclusion, that if we care about a problem we should make it necessary to solve the problem correctly or else stuff very obviously breaks, not have broken systems seem like they kinda work until they're used in anger.

Oh, definitely :)

I'm rationalizing the focus on grapheme clusters, if I had my way "what is a string" would be a mandatory unit of programming language education and reasoning about this would be more strongly enforced by programming languages.


> 1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

Sigh, I hoped newer languages would avoid D's mistake. Auto-decoding is slow, unnecessary in most cases, and still only gives you partial correctness, depending on what you're trying to do. It also means that even the simplest string operations may fail, which has consequences on the complexity of the API.


I have no idea what you're talking about. Rust rarely does auto- anything, and certainly does not decode (or transcode) strings without an explicit request by the developer: Rust strings are not inherently iterable. As developer, you specifically request an iterable on code units or code points (or grapheme clusters or words through https://unicode-rs.github.io/unicode-segmentation)


I see, thanks for the clarification - looks like I mis-extrapolated from your comment.


That sounds more like an implementation issue than a design issue. If you are using UTF-8 actual decoding into Unicode code points is not necessary for most operations and Rust will not do that.

It also does not imply that string operations may fail. String construction from raw bytes may fail, but otherwise the use of UTF-8 strings should not introduce additional failure conditions.


> I'm not so sure other languages do that any better

I can only speak of D since I'm familiar with it.

In D, strings are arrays of chars. The standard library assumes that they contain valid UTF-8 code units and together form valid UTF-8, but it's ultimately your responsibility to ensure that. This assumption allows the standard library to present strings as ranges of Unicode code points (i.e. whole characters spanning multiple bytes).

To enforce this assumption, when raw data is interpreted as D strings it is usually checked if it's valid UTF-8. For example, readText() takes a filename, reads its contents, and checks that it is valid UTF-8 before returning it. assumeUTF() will take an array of bytes and return it as-is, but will throw in a check when the program is built in debug mode. Finally, string.representation (nothing more than a cast under the hood) gives you the raw bytes, and .byChar etc. allow iteration over code units rather than code points, if you really want to avoid auto-decoding and process a string byte-wise.

There are also types for UTF-16 and UTF-32 strings and code units, which work as the above. For other encodings, there's std.encoding which provides conversion to and from some common ones.

My only gripe with how D deals with Unicode is that its standard library insists on decoding UTF-8 into code points when processing strings as ranges (and many string processing functions in other languages are written as generic range algorithms in D). Often enough, it's unnecessary, slow, and makes processing non-UTF text a chore, but it's not too hard to avoid. Other than this, I think D's approach to Unicode is the best among the languages I've seen.


Some other languages handle Unicode just fine. Rust and Julia are fine. IMO you need to build unicode understanding into string handling from the start for it to happen. Every function and detail needs to make UTF-8 sense, not just some unicode handling library.


Haskell gets this much better than Python.

For example, if you use `readFile` from Data.Text, you'll use utf-8 names and read utf-8 content. If you use `readFile` from Data.ByteString, you'll read raw bytes instead. You just import whatever you want into the local namespace and use it.

If you define a conversion procedure, you can make code that accepts either text or bytes, and return either one too, automatically. The tools for working with text have equivalents for working with bytes too, whenever that makes sense. Combining text with bytes is a type error, but there are so many shortcuts for going from one to the other that one barely notices them (unless you are concerned about performance, then you will notice).

That small article is basically all that you have to read to be productive dealing with text and bytes differences.


Pretty much everything you've described is possible in Python. The open function can return bytes or decoded text (str), depending on what you ask for, and converting between the two is simply a call to .decode or .encode.

Combining text with bytes is a type error.

There, you're ready for text in python3.
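
A minimal sketch of that sandwich, assuming a UTF-8 encoded file at the hypothetical path data.txt:

    with open("data.txt", "rb") as f:              # bytes in
        raw = f.read()                             # -> bytes

    text = raw.decode("utf-8")                     # bytes -> str, encoding named explicitly

    with open("data.txt", encoding="utf-8") as f:  # or let open() decode for you
        assert f.read() == text

    data_out = text.encode("utf-8")                # str -> bytes again, on the way out
    assert data_out == raw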


This is one of the things that makes it hard for me to let go of Python, even though my programming style is evolving more towards functional languages: I've become very spoiled and take for granted a lot of the peripheral sanity and general well-behavedness that Python has, certainly in the unix environment.

Every other language or language implementation that I encounter seems to end up having an "oh god no" surprise or three hiding not-very-deep beneath the surface. Let's not even talk about ruby.


Java is pretty good at character processing and has been since the inception of the language. Adopting Unicode from the start helped enormously, along with clearly separating String from byte[] in the type system. Finally the fact you have static typing makes it a lot easier to avoid 'what the heck do I have here' problems with byte vs. str that still pop up even in Python3.

That said Python3 is vastly better than Python2. Basic operations like reading/writing to files and serialization to JSON for the most part just work without having to worry about encodings or manipulate anything other than str objects. I'm sure there are lots of cases where that's not true but for my work at least string handling is no longer a major issue in writing correct programs. The defaults largely seem to work.


Java's string handling is also broken by default in a few ways, due to it historically using UCS-2 internally and hence still allowing surrogate pairs to get split up, giving broken unicode strings.


I have not personally encountered this problem but it's definitely there. The other problem historically is that Java didn't require clients to specify encodings explicitly when moving between strings and bytes. That's been cleaned up quite a bit in recent releases of the JDK.

All things considered Java character handling was an enormous improvement over the languages that preceded it and still better than implementations in many other languages. (I wish the same could be said of date handling.)


What's the deal with Haskell strings? It's not a mess, it's basically enforcing the same "unicode sandwich" approach Python recommends by using the type checker. Of course, to do that you need one type for each of the different possible layers of the sandwich.

There are also types for lazy vs. non-lazy, but that's for performance optimization, and don't get me started on how Python "gets messy" when you want to do performance optimization, because it usually kicks you out of the language.


> What's the deal with Haskell strings?

I think the linked article laid out the case fairly well. Basically, Haskell has a bunch of string types you need to understand and fret about, and the one named "String" is the one you almost never want, but it's also the only one with decent ergonomics unless you know to enable a compiler extension.

I think it's a fair criticism. The "Lots of different string types" thing isn't (IMO) such a big deal coming from a language of Haskell's vintage. Given what Python's "decade-plus spent with a giant breaking change right in the middle of the platform hanging over our heads" wild ride has been like, I can't blame anyone for not wanting to replicate the adventure.

But, for newcomers, the whole thing where you need to know to

  {-# LANGUAGE Support, Twenty, First, Century #-}
is a pretty big stumbling block.


I think that "deal" is just complains coming from someone that doesn't yet understand the engineering tradeoffs between strong static and weak type systems. That the nice functions for String come in the Prologue and that you need to implement or use other libraries for the functions for byte sequences or other stuff is not an excuse or a problem of the language as a tool.


My main criticism of Python 3's changes to strings is that it has become much more specific about strings.

In Python 2, if you have a series of bytes -or- a "string", the language has no opinion about the encoding. It just passes around the bytes. If that set of bytes enters and exits Python without being changed, its format is of no concern. Interactions do not force you to define an encoding. This is not correct, but it is often functional.

Python 3, on the other hand, if you ever treat bytes as a string, forces you to have an opinion about the encoding. Same goes for if you convert back to bytes. For uncommon or unexpected encodings, the chance of this going wrong in a casual, accidental way is much higher. Of course, the approach is more correct, but it doesn't feel more correct to the programmer.
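
For example, the same bytes decode into different text depending on which opinion you pick, and Python 3 refuses to mix them without one:

    raw = b"caf\xc3\xa9"        # UTF-8 bytes for "café"

    raw.decode("utf-8")         # 'café'   -- the right opinion
    raw.decode("latin-1")       # 'cafÃ©'  -- the wrong opinion: no error, just mojibake
    # raw + "!"                 # TypeError: can't concat str to bytes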


> it doesn't feel more correct to the programmer.

I agree with the details of what you said, but the insidious thing about how Python 2 organized strings and encodings is that most programmers were free to ignore it and produce buggy software. Then, later, people who had to use that software on non-ascii data would try to use it and it would blow up. This would lead to a very painful cycle of shaking out bugs that the original author may not even be motivated to fix.

The decision to force encodings to be explicit and strings/bytes to be separate was a great design change. It literally made all our code more valuable by removing hidden bugs from it.


> Python 3, on the other hand, if you ever treat bytes as a string, forces you to have an opinion about the encoding.

Of course. How else can you translate the bytes into a meaningful textual representation without an encoding?

Python 3 requires you to consider the real world, and that’s a good thing.

(Apart from when you want to port sloppily written Python on 2.x code)


In Python 2, if you have a series of bytes -or- a "string", the language has no opinion about the encoding

This is incorrect. Python 2 lets you get away with a lot of things, but some string-y operations on Python 2 bytestrings will still trip the "need to know the encoding" issue. And absent you telling it the encoding, Python 2 will assume ASCII and begin throwing exceptions as soon as it sees a byte outside the ASCII range.


If you want a string that has no opinion on encoding you should just use bytes.

Trying to force bytes into a string without having an opinion on its encoding is asking for problems.


For anyone interested in learning why Python 3 works this way I highly recommend the blog of Victor Stinner[0].

As for the article, this is nothing new. The problem is similar to the issues raised by Armin Ronacher[1]. These problems are well known and Python developers address them one at a time. Issues around these edge cases have improved since the initial release of Python 3.0.

[0] http://vstinner.github.io

[1] http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/


This article is kind of hard to evaluate, because the OP doesn't provide an example program with an example input that fails. So it's hard to judge whether the solution presented here is actually ideal. Instead, we're forced to just take the OP's word for it, which is kind of uncomfortable.

I do somewhat agree with the general sentiment, although I find it difficult to distinguish between the problems specifically related to its handling of Unicode and the fact that the language is unityped, which makes a lot of really subtle things very implicit.


The OP links to StackOverflow, where failing inputs are mentioned in the comments on the accepted answer. And the second-most upvoted answer explains that .decode('unicode_escape') only works for Latin-1 encoded text: https://stackoverflow.com/a/24519338


The question being how to parse character escapes (backslash sequences) in Python.

To be honest, you could write a custom character-by-character parser easily or even use the regex module.


But the OP has their own problem with which they posted a solution, but didn't include the actual problematic program.


IMHO the whole python3 string mess could have been prevented if they had chosen UTF-8 as the only string encoding instead of adding a strict string type with a lot of under-the-hood magic. That way strings and byte streams could remain the same underlying data, just as in python2. The main problem I have with byte-streams vs strings in python3 is that it adds strict type checking at runtime which isn't checked at 'authoring time'. Some APIs even make it impossible to do upfront type checking even if type hints would be provided (e.g. reading file content either returns a byte stream, or a string, based on the mode string passed to the file open function).

Recommended reading: http://utf8everywhere.org/


> IMHO the whole python3 string mess could have been prevented if they had chosen UTF-8 as the only string encoding instead of adding a strict string type with a lot of under-the-hood magic.

That is basically what Python2 does, and it is completely wrong.


Can you give any reasons why this is completely wrong? The web seems to work just fine with UTF-8. The advantage is that you can pass string data around as generic byte streams without even knowing about the encoding. You'll only have to care about the encoding at the end points.


You are joking, right? Have you ever seen non-English webpages? More often than not, a multitude of ??? and Chinese characters pop up at some point or another.


I'm from Germany so I've seen a few non-English webpages. I can't remember having seen any text rendering problems since the late 90's or so.


This is because browsers have very sophisticated algorithms to detect the encoding, since this was such a frequent issue. (And yes, UTF-8 adoption/support has been growing, which also helps.)

Being German and working in a multi-national company, I can confirm it is still very much an issue with software that doesn't handle this. Excel is one of the worst offenders; document corruption is rife, especially when going between Excel on Windows and Excel for Mac. This is because Excel doesn't default to UTF-8 for legacy reasons (I think), but also either doesn't have encoding detection or has very bad encoding detection.


As am I. The encoding detection used for a standardized(!) feed file format I had to write had cyclomatic complexity of 16 and only supported 4 encodings(X). On the other hand it was almost always correct. How you would do that on a global scale is beyond me.

(X) I hear you ask, 'Why would you do that even!?' Try telling tiny companies without IT department what an encoding is. It's faster to just figure it out on the receiving side.


I don't really understand your point here? Why would you change the internal representation of the text storage type? This doesn't change anything.

Or if I read this incorrectly and you want to merge 'bytes' and 'string', but enforce 'utf-8', how would you ensure that conversion to 'utf-8' while communicating with strangely encoded content (files encoded in utf-16-be for example, or worse) would be enforced? I think you can't; it's the programmer's job to ensure everything transitioned over to the correct encoding, which is exactly the purpose of 'string' being distinct and incompatible with 'bytes'.


> And the environment? [it’s not even clear.] https://stackoverflow.com/questions/44479826/how-do-you-set-...

That question is about interpreting backslash escape sequences for bytes in an environment variable. All this person wants is `os.environb` (and look, its existence highlighted a Windows incompatibility, saving them from subtle bugs like every other Python 3 improvement). https://docs.python.org/3/library/os.html#os.environb
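
For example (SOME_VAR is just a placeholder name):

    import os

    # POSIX only: the environment as bytes -> bytes, nothing decoded behind your back.
    raw = os.environb.get(b"SOME_VAR", b"")
    text = raw.decode("utf-8", errors="replace")   # decode only if/when you need text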


Thanks, never noticed environb. I’m still learning new things about Python 3 ten years later.


Getting Unicode right, especially with various file systems and cross-platform implementations is hard, for sure. But, I think this quote:

"And, whatever you do, don’t accidentally write if filetype == "file" — that will silently always evaluate to False, because "file" tests different than b"file". Not that I, uhm, wrote that and didn’t notice it at first…"

shows a behavior that, to me, is inexcusable. The encoding of a string should never cause a comparison to fail when the two strings are equivalent except for the encoding. For example, in Delphi/FreePascal, if you compare an AnsiString or UTF-8-encoded string with a Unicode string that is equivalent, you get the correct answer: they are equal.


> The encoding of a string should never cause a comparison to fail when the two strings are equivalent except for the encoding.

If you mean comparing "file" == b"file" that's not possible on several levels.

Firstly, even if you say "just compare the bytes", the computer doesn't know what byte-format you want for "file". Sure, it's "Unicode", but is it UTF-8 or UTF-16 or what? Those choices will produce different results, and the computer cannot accurately guess the right one for you.

Secondly, that violates Python's normal rules by introducing type juggling. It's equivalent to asking for expressions like ("15"==15) or ("True"==True) to work, and involves all the same kinds of long-term problems. (Don't believe me? Work in PHP for a few years...)


As I said, Delphi/FreePascal have been handling this for years without issue, so "not possible" sounds like you're giving up a little too early. The encoding of a string is not the same thing as its type, and shouldn't be treated as such. Python has to know the encoding of the "file" string by the encoding of the source file. It then also has to know what the encoding of the b"file" string is because it is explicitly specified. That's all of the information that it needs to make the comparison, so it should either a) issue a compilation/runtime error if it's an invalid comparison, or b) return a proper comparison result. Returning an invalid comparison result is the worst of all possible outcomes.

As for character sets/code points:

https://stackoverflow.com/questions/130438/do-utf-8-utf-16-a...

A byte string is simply a string that is using the lower ASCII characters (< 127). The code points for "file" map cleanly to the same code points in any Unicode encoding.


A python byte sequence (the b"file" in the example) is not necessarily a string that is using the lower ascii characters, it's an arbitrary sequence of arbitrary (not only <127) bytes - the equality operation comparing a string with a sequence of bytes needs to be well defined for all possible byte sequences, including non-ascii (byte)strings like b'\xe2\xe8\xe7\xef' which decodes to different strings in different ANSI encodings (and bytestring data does not include any assumption about the encoding - especially if you just received those bytes over the network or read them from a file), and is not valid UTF-8.

Furthermore, even for ascii sequences like b"file" the bytes do not map to the string "file" in every Unicode encoding - for example, in UTF-16 the bytes of b"file" represent "楦敬", which is a bit different than "file".
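
You can see both points from a Python prompt:

    b"file".decode("ascii")      # 'file'
    b"file".decode("utf-16-le")  # '楦敬' -- same four bytes, different interpretation
    # b"\xe2\xe8\xe7\xef".decode("utf-8")  # would raise UnicodeDecodeError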


If the "string" b"file" does not mean the ASCII string "file", but rather is supposed to be interpreted as a byte array (equivalent to just bytes in memory with no context of the individual array members being characters), then my original point still stands: such a comparison shouldn't be allowed at all and an error should be raised. To simply return False indicates that the comparison is valid with regard to the string types, but the comparison simply returned False because the two strings were not equal.

I thought Python was strongly-typed ? Am I incorrect in this regard ?


Python is dynamically typed and "duck-typed", i.e. everything is an object, and you should be able to hand over "X-like" objects to code that expects type X, and it should work properly if your X-like object supports all the interfaces that type X does. As part of that duck-typing, it's valid to compare anything with anything without raising an error, it's just that different things are (by default) not equal, so the comparison returns False. For example, you can compare a class defining a database connection with the integer 5; that would be a valid comparison that returns False.

This behavior is a key requirement for all kinds of Python core data structures, for example, if you'd define bytestrings so that they throw an error when compared to a "normal" string, then this means that for a heterogenous list (pretty much all data structures in Python can be heterogenous regarding data types) containing both b"something" and "something" many standard operations (e.g. checking if "something" is in that list) would break because the list code would require a comparison to do that.


> Python has to know the encoding of the "file" string by the encoding off the source file.

Not when it's a string literal like in your example.

The string "foo" has no file, not in the past, present, or reasonably-predictable future. Ditto for the literal bytes.

> It then also has to know what the encoding of the b"file" string is because it is explicitly specified.

Explicitly specified by who? Where? When?

They're just bytes, they don't have a text-encoding yet, or perhaps never: It could be a picture file, or a random seed.


<< The string "foo" has no file, not in the past, present, or reasonably-predictable future. Ditto for the literal bytes. >>

Is Python not parsing/compiling the source file ? Does the "foo" string constant not live in such a source file ?

<< Explicitly specified by who? Where? When? >>

The "b" prefix states what the encoding is: it is a raw byte string whose bytes are assumed to correspond to their ASCII counterparts.


> Delphi/FreePascal have been handling this for years without issue

Except people don't really want to write and learn those languages?


That's not my point, and I think that you know that.


> The encoding of a string should never cause a comparison to fail when the two strings are equivalent except for the encoding.

You'll have to admit that the encoding is a property of a string, just like the content itself. As always, you as a programmer are bound to know both of these properties to have predictable results. To compare two strings of different encoding to one another, you'll have to find a common ground for interpreting the data contained in the string.

If you don't want or need that, then all you have is a "string" of bytes.


Sure, but you can have defined rules about what happens when you compare values with disparate encodings, similarly to how you have to have rules about how column expressions are compared in SQL with regard to their collations. The way such things are done is typically to coerce the second value into the encoding of the first value, and then compare the two values. What the Delphi compiler does is issue warnings when there might be data loss or other issues with such coercions so that the developer knows that it might not be safe and that they might want to be more explicit about how the comparison is coded.


Yeah, this behavior might be because Python doesn't store a unicode string as the raw bytes you gave it. AFAIK Python always stores strings as arrays of fixed-size units for random access. In other words, the size of an element of the array is the same as the maximum size of the characters in the string, meaning that even a 1-byte ASCII character can be stored as 4 bytes.

So when one side is bytes (filetype in this case) and the other side is a string, the underlying byte representation can be different even if they represent the same text at a higher level.


Python as of 3.3 chooses an internal representation on a per-string basis. This encoding will be either latin-1, UCS-2, or UCS-4, and the choice is made based on the widest code point in the string; Python chooses the narrowest encoding capable of representing that code point in a single unit.

This does mean that a string which contains, say, some English text and an emoji will "blow up" into UCS-4, but the overhead isn't that severe; most such strings are not especially large. It also means that strings containing only code points < U+00FF are smaller in memory on Python 3.3+ than previously, since prior to 3.3 they would be using at least two bytes per code point and now use only one.
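
You can watch this happen with sys.getsizeof; the exact numbers vary by platform and Python version, but the per-code-point jumps are visible:

    import sys

    sys.getsizeof("abcd")           # 1 byte per code point (latin-1 storage)
    sys.getsizeof("abc\u00e9")      # still 1 byte per code point
    sys.getsizeof("abc\u20ac")      # 2 bytes per code point (UCS-2 storage)
    sys.getsizeof("abc\U0001f4a9")  # 4 bytes per code point (UCS-4 storage)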


Let's be honest, the real mess is with UNIX filenames. I dare you to come up with a legitimate use case for allowing newlines and other control characters in a file name.


It's like a built-in unit test - devs have to not mangle filenames they get from the system or assume anything about them - though they still do; I've seen multiple times how my nice umlauts get mangled or my spaces cause scripts to fail.


A few years ago I tried naming my home directory with the unicode pile of poo (💩) and a space in the name to test what of my code might break. However, it broke too many of the third-party tools/scripts that I occasionally needed for something, so I reverted within a few days.

Though it might be interesting to have an integration test box where the username (and thus all the relevant paths) includes all kinds of oddities - whitespace, emoji, right-to-left marker, etc.


Backward compatibility.


With what?


with almost 50 years of unix history.


I think the point was that UNIX got it wrong, and we've been dealing with the consequences ever since. It's of course too late to change it, so yeah.


Maybe. But 50 years ago utf-8 didn't exist, unicode didn't exist, possibly not even latin-1 existed. If unix had enforced a specific encoding (which implies constraints on which byte values can appear in a path byte string), the transition to newer encodings would have been significantly harder.


Going of on a tangent a bit here, but I think there are 2 important related issues:

* API design should fit the language. In a "high on correctness" language like Haskell or Rust, I'd expect APIs to force the programmer to deal with errors, and make them hard to ignore. In a dynamically typed language like Python where many APIs are very relaxed / robust in terms of dealing with multiple data types (being able to see numbers/strings/objects generically is part of the point of the language), being super strict about string encoding sounds extra painful compared to a statically typed language. I'd expect an API in this language to err on the side of "automatically doing a useful/predictable thing" when it encounters data that is only slightly incorrect, as opposed to raising errors, which makes for very brittle code. Most Python code is the opposite of brittle, in the sense that you can take more liberties with data types before it breaks than in statically typed languages. Note that I am not advocating incorrect APIs, or APIs that silently ignore errors, just that the design should fit the language philosophy as best as possible.

* Where in a program/service should bytes be converted to text? Clearly they always come in as bytes (network, files..), and when the user sees them rendered (as fonts), those bytes have been interpreted using a particular encoding. The question is where in the program this should happen. You can do this as early as possible, or as late as possible. Doing it as early as possible increases the code surface where you have to deal with conversions, and thus possible errors and code complexity, so that doesn't seem so great to me personally, but I understand there are downsides to most of your program dealing with a "bag of bytes" approach too.


I don’t think Haskell is a very good example to promote for string handling. Things are mostly strict and well behaved once they make it into the Haskell program, but before then they either need to satisfy the program’s assumptions before being input or the program will be buggy/crash, unless it is carefully written such that its assumptions are right.


I didn't mention Haskell specifically for strings, but as a language that tends to be very precise about corner cases. That may not even be the best example, but I couldn't think of any better mainstream-ish language examples :)


Part of the problem is that encoding is treated as something that must be explicitly handled in the string API, but it's something that's just assumed by default in the IO API. Python just guesses what the encoding is, and it often guesses wrong.

The design of the API leads people to do the wrong thing. Encoding should be a required argument for `open` in text mode.
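
For example (notes.txt is just a hypothetical file):

    import locale

    locale.getpreferredencoding(False)    # what open() will silently assume in text mode

    with open("notes.txt", encoding="utf-8") as f:   # explicit: same behaviour on every machine
        text = f.read()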


Indeed py3 decided to make unicode strings the default. This fixes all sorts of thorny issues across many use cases. But it does indeed break filenames. I haven't dealt with this issue myself, but the way python was supposed (?) to have "solved" this is with surrogate escapes. There's a neat piece on the tradeoffs of the approach here: https://thoughtstreams.io/ncoghlan_dev/missing-pieces-in-pyt...

Maybe handling the surrogates better would allow you to use 'str' everywhere instead of bytes?


> For a Python program to properly support all valid Unix filenames, it must use “bytes” instead of strings, which has all sorts of annoying implications.

While in python 2, you had to use unicode strings for all sorts of actual text, which caused its own problems.

> What’s the chances that all Python programs do this correctly? Yeah. Not high, I bet.

Exactly.


Don't think of python Unicode as a "string". Think of it as "text". I don't really understand the issues the author is having with things like sys.stdout and such because he did not provide complete examples. He should cite actual examples and bug reports that he has posted for these things; I've had no such issues. There are a lot of things we need to do to accommodate non-ascii text, but they are all "right" as far as I've observed.


Part of the issue is to do with bytes and strings being considered totally different by python but confusingly similar to people.

The error from "file" != b"file" is particularly bad. It makes sense if you realise that a == b means a,b have the same type and their values are equal. But there is no way even a reasonably careful programmer could spot this without super careful testing (and who’s to say they would remember to test b"file" and not "file"). Other ways this could be solved are:

1. String == bytes is true iff converting the string to bytes gives equality (but then == can become non-transitive)

2. String == bytes raises (and so does string == string if encodings are different)

3. Type-specific equality operators like in lisp. But these are ugly and verbose which would discourage their use and so one would not think to use bytesEqual instead of ==

4. A stricter/looser notion of equality that behaves as one of the above called eg === but this is also not great


> The error from "file" != b"file" is particularly bad. It makes sense if you realise that a == b means a,b have the same type and their values are equal. But there is no way even a reasonably careful programmer could spot this without super careful testing (and who’s to say they would remember to test b"file" and not "file").

I'm of the opposite opinion. I appreciate that b'a' != 'a'.


I don’t think it’s a problem that they aren’t equal. This is reasonable. The problem is that it is hard for one to foresee this error. The mental model of bytes and strings is likely to be either their equal-looking literals or a mental concept of “bytes and strings are basically the same except for some exceptions.” One cannot reasonably trace every variable to figure out whether it is a bytes or a string. The comparison a == b being false comparing strings to bytes makes sense when a and b could be anything. However when b is already definitely a (byte) string, it is more useful to get an error when a has a different type.

What is your opinion on numbers:

Should 1 == 1.0?

What about 1+0j?

Or 1/1 (the rational number, although I’m not sure this can be constructed)?


int and float being comparable is practical, though occasionally troublesome. Complex usually doesn't compare well across types. You can use a Fraction type for 1/1. I haven't formed an opinion about that, since I don't use them often.


It makes sense if you realise that a == b means a,b have the same type and their values are equal.

Which is wrong.

In Python, "a == b" is:

True if a.__eq__(b) is True (short-circuiting the entire expression to True), or if a.__eq__(b) is NotImplemented and b.__eq__(a) is True.

False if a.__eq__(b) is False (short-circuiting the entire expression to False), or if a.__eq__(b) is NotImplemented and b.__eq__(a) is False.

If both a.__eq__(b) and b.__eq__(a) are NotImplemented, Python falls back to an identity check (a is b) rather than raising; unlike the ordering comparisons (<, >, etc.), == never raises TypeError by default.


I love unicode handling in python3; it's so much better to work with. Python2 was a mess. Migrating old code requires looking at the old code, but the result is only better code, never a mess.


> Python's unicode is a "mess" because of this single edge case I've encountered

FTFY


more like

> Unicode is a "mess" because it can not unquote arbitrary backslash strings.


The article is about a specific instance (filenames). In general, handling Unicode as a bunch of indexable code points as per Py3 turned out to be not that great. I guess the idea came from the era where people still thought that strings could be in some sense fixed length. These days we better understand that strings are inherently variable length. So there is no longer any reason to not just leave everything encoded as UTF-8 and convert to other forms as and if required. Strings are just a bunch of bytes again.


The most correct way to expose Unicode to a programmer in a high-level language is to make grapheme clusters the fundamental unit, as they correspond to what people think of as "characters". Failing that, exposing strings as sequences of code points is a second-best choice.

UTF-8 is a non-starter because it encourages people to go back to pretending "byte == character" and writing code that will fall apart the instant someone uses any code point > 007F. Or they'll pat themselves on the back for being clever and knowing that "really" UTF-8 means "rune == code point == character", and also write code that blows up, just in a different set of cases.

And yes, high-level languages should have string types rather than "here's some bytes, you deal with it". Far too many real-world uses for textual data require the ability to do things like length checks, indexing and so on, and it doesn't matter how many times you insist that this is completely wrong and should be forbidden to everyone everywhere; the use cases will still be there.


That's silly. How often have you had to work with grapheme clusters without also using a text rendering engine? But the number of times you need to know the number of bytes a string takes, even when using scripting languages, is much higher. The only way to deal with this is to not make assumptions, and not have a string.size() function, but specific accessors for 'size in bytes', 'number of code points' and (potentially, if the overhead is warranted) 'nr of grapheme clusters'.

The 'fundamental' problem here is that the average programmer doesn't understand 'strings' because it seems so easy but it's actually very hard (well, not even hard, just big and tedious). Even more so now that many people can have careers without really knowing about what a 'byte' is or how it relates to their code.
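
As a rough Python sketch of those three accessors (the grapheme count assumes the third-party regex module, since the stdlib has no grapheme segmentation):

    import regex  # third-party; the stdlib has no grapheme cluster segmentation

    s = "\U0001F44D\U0001F3FD!"      # thumbs-up + skin-tone modifier + '!'

    len(s.encode("utf-8"))           # 9 -- size in bytes
    len(s)                           # 3 -- number of code points
    len(regex.findall(r"\X", s))     # 2 -- number of grapheme clusters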


> How often have you had to work with grapheme clusters without also using a text rendering engine?

All the time. Want to truncate user text? You need grapheme clusters. Reverse text? Grapheme clusters. Get what the user thinks of as the length of text? Grapheme clusters. Not saying it’s a good idea to make them any sort of default because of that, though; you’re right that it should be explicit.


Truncating text is almost always (in my experience) a UI thing, where you pass a flag to some UI widget saying 'truncate this and that on overflow' and while rendering, it can then truncate using grapheme clusters.

How often does one reverse text? And when do users care about text length? Almost always (again, in my experience) in the context of rendering - when deciding on line length or break points, so when you know and care about much more than just 'the string' - but also font, size, things you only care about in the context of displaying. Not something that should be part of the 'core' interface of a programming language.

I mean I think we agree here; my point was that I too used to think that grapheme clusters mattered, but when I started counting in my code, it turned out they didn't. Sure I can think of cases where it would theoretically would matter, but I'm talking about what do you actually use, not what do you think you will use.


I’m biased towards websites, but truncating text server-side to provide an excerpt is something I need to do pretty often. Providing a count of remaining characters is maybe less common, but Twitter, Mastodon, etc. need to do it, and people expect emoji (for example) to count as one.

Plus sometimes you’re the one building the text widget with the truncation option.


Twitter's count of "characters" is code points after normalization[1].

I don't know who expects emoji to count as one character, but they'd be surprised by Twitter's behavior: something like 👩🏿‍🏫 [2] counts as 4 characters (woman, dark skin tone, zero width joiner, school).

[1] https://developer.twitter.com/en/docs/basics/counting-charac... [2] https://emojipedia.org/female-teacher-type-6/
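
To reproduce that count in Python (the escapes spell out the emoji from [2]):

    import unicodedata

    emoji = "\U0001F469\U0001F3FF\u200D\U0001F3EB"   # woman, dark skin tone, ZWJ, school

    len(unicodedata.normalize("NFC", emoji))         # 4 -- what Twitter's rules count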


And when do users care about text length?

When validating many types of data submitted via the web.


> The most correct way to expose Unicode to a programmer in a high-level language is to make grapheme clusters the fundamental unit, as they correspond to what people think of as "characters".

> UTF-8 is a non-starter because it encourages people to go back to pretending "byte == character" and writing code that will fall apart the instant someone uses any code point > 007F.

These are somewhat different concerns, you can provide cluster-based manipulation as the default and still advertise that the underlying encoding is UTF-8 and guarantees 0-cost encoding (and only validation-cost decoding) between proper strings and bytes (and thus "free" bytewise iteration, even if that's as a specific independent view).

> Far too many real-world uses for textual data require the ability to do things like length checks, indexing and so on

This is not a trivial concern e.g. "real-world uses for length checks" might be a check on the encoded length, the codepoint length or the grapheme cluster length. Having a "proper" string type doesn't exactly free you from this issue, and far too many languages fail at at least one and possibly all of these use cases, just for length queries.


and still advertise that the underlying encoding is UTF-8

You can, but I'd avoid it. If you want to offer multiple levels of abstraction, you'll want to offer a code point abstraction in between graphemes and bytes. And for several reasons you'll probably want to either have, or have the ability to emit, fixed-width units.

Python does this internally (3.3 and newer): the internal storage of a string is chosen at runtime, based on the widest code point in the string, and uses whichever fixed-width encoding (latin-1, UCS-2 or UCS-4) will accommodate that. This dodges problems like accidental leakage of the low-level storage into the high-level abstraction (prior to 3.3 Python, like a number of languages, would "leak" the internal implementation detail of surrogate pairs into the high-level string abstraction).


There are lots of comments indicating that the programmer is doing things wrong. But what is the right way to deal with encoding issues? Wait for code to break in production?

Whatever "best practices" there are for dealing with unexpected text encoding in Python, they do not seem to be widely well-known. I bet a large % of Python programmers (myself included) made the exact same errors the author had, with little insight as to how avoid them in the future.


His examples are all stuff that isn't Unicode. The filename thing would probably work using a latin1 encoding, since that leaves 8 bit bytes undisturbed.


That's not Python's fault - those are programmer errors.

Having said that, Python really has something to answer for with "encode" versus "decode" - WTF? Which is which? Which direction am I converting? I still have to look that up every single time I need to convert.

Why the heck are there not "thistothat" and "thattothis" functions in Python that are explicit about what they do?
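
For what it's worth, the direction is: text .encode()s into bytes, and bytes .decode() back into text:

    s = "é"                    # str: text
    b = s.encode("utf-8")      # str  -> bytes : *encode* text into some byte encoding
    b                          # b'\xc3\xa9'
    b.decode("utf-8")          # bytes -> str  : *decode* bytes back into text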


This is something I see folks trip on. I think encode/decode are fine names actually. The problem is that Unicode strings have decode defined at all, and similarly, that byte strings have encode defined. Byte strings should only have a decode operation and Unicode strings should only have an encode operation. Depending on your input, the wrong operations can actually succeed!


Not sure what you're talking about mate, on Python 3.6:

    >>> "hello world".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'decode'
    >>> b"hello world".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'encode'


I'm pretty sure I tripped up on an issue like this on python 3.7 where I double encoded or decoded something. Can't recall the exact case but it was confusing for sure.


That's good. I guess it's only in Python 2 then.


In Python 2 it works almost the same, except you only get an error when encoding/decoding doesn't work out. So I see this as an improvement.


If you're storing files with non-Unicode-compatible names, you should really stop. Even if, on Unix, you can technically use any kind of binary mess as a name, that doesn't mean you should. And this applies to all kinds of data. All current operating systems support (and default to) Unicode, so handling anything else is a job for a compatibility layer, not your application.

If you write new code to be compatible with that one Windows ME machine set to that one weird IBM encoding sitting in the back of the server room, you're just expanding your technical debt. Instead, write good, modern code, then write a bridge to translate to and from whatever garbage that one COBOL program spits out. That way, when you finally replace it, you can just throw away that compatibility layer and be left with a nice, modern program.

In EE terms, think of it like an opto-isolator. You could use a voltage divider and a Zener diode, but that's just asking for trouble.


I can't believe there are still people whining about this in 2018.

Those problems with gpodder, pexpect, etc. aren't due to Python 3, they're due to the software being broken. Without knowing the encoding, UNIX paths can't be converted to strings. It's unfortunate, but that's the way it is, and it's not Python's fault.


The author has files with invalid names and complains that Python refuses to accept them. Maybe he should fix the names first?


If all tools he had access to behaved in the same way, he wouldn't be able to fix these "wrong" file names.



Author doesn't seem to care that there is a difference between Unicode the standard and utf-8 the encoding. While the changes on the fringes to the system are debatable, they are also in a way sensible. Internal to your application everything should be encoding independent (unicode objects in Py2, strings in Py3) while when talking to stuff outside your program (be it network, local file content or filesystem names) it has to be encoded somehow. The distinction between encoding independent storage and raw byte-streams forces you to do just that!

Stop worrying and go with the flow. Just do it as it is supposed to be done and you'll be happy.


The encoding is a property of the string, just like the content, just as with any other object. If you want to compare strings with different encodings, you'll have to convert at least one of them.

I was never forced into encoding hell again, after reading this excellent post: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...


I invest some karma to point out how I'd love for str to just use UTF-8 by default, and print as UTF-8 by default:

    print(b'DONT b"EVERYTHING!"')
    print(str(b'SAME!'))
    print(str(b'I DONT WANT TO add ,"UTF-8" everywhere!','UTF-8'))

    line = "שלום עולם"
    output.write(line)            # TypeError: a bytes-like object is required, not 'str'
    fp.write(output.getvalue())   # TypeError: write() argument must be str, not bytes

Please at least allow us to set a global option via sys.setdefaultencoding('UTF8') as before to automatically encode/decode as UTF-8 by default!


Dealing with string encoding has always been the bane of my existence in Python...going back over 10 years when I first started using it. I've never had such wild issues with decoding/encoding in other languages...that may be my privilege, though, since I was dealing with internal systems before Python, and then I got into web scraping.

Regardless, string encoding/decoding in Python is hard, and it doesn't feel like it needs to be.


I agree Python3 is an awful mistake and that straight-up Unicode is not well suited for storing arbitrary byte strings from old disk images. However, Python 3.1+ handles such disk names with PEP 383's surrogateescape error handler (a.k.a. utf-8b): https://www.python.org/dev/peps/pep-0383/


This post barely scratches the tip of the iceberg.

For a more comprehensive discussion of unicode issues and how to solve them in Python, "Let’s talk about usernames" does this issue more justice than I could write in a comment: https://news.ycombinator.com/item?id=16356397


TFA is short and to the point. A few examples, a few links to other examples. Py3's insistence on shoving Unicode into every API it could possibly fit is often inconvenient for coders and for users. This thread has 100 comments, mostly disagreeing in the same fingers-in-ears-I-can't-hear-you fashion. Whom are we struggling to convince, here?


If python programmers think they are the only ones with UTF problems, try the Lazarus and Freepascal development mailing lists. The debates have been going on since forever, and I am sure issues will keep popping up every now and then.

Try Elixir. According to their docs they've had it right from the word go - I think.


Is the author saying that the Python programming language handles this badly, and all other (relevant) programming languages do not?

Or is it that Python's attention to detail means that issues that would be glossed over or hidden in other languages are brought to the fore and require addressing?


I just came from PyCon India 2018. This is exactly what the keynote was about. (It was by the author of Flask.)


Filenames are a good example to show people why forcing an encoding onto all strings simply doesn't work. The usual reaction from people is to ignore that and they'll shout: "fix your filenames!"

Here is another example: Substrings of unicodestrings. Just split a unicodestring into chunks of 1024 bytes. Forcing an encoding here and allowing automatic conversions will be a mess. People will shout: "You're splitting your Strings wrong!"

The first language I knew that fell for encoding aware strings was Delphi - people there called it "Frankenstrings" and meanwhile that language is pretty dead.

As a professional who has to handle a lot of different scenarios (barcodes, Edifact, Filenames, String-buffers, ...) - in the end you'll have to write all code using byte-strings. Then you'll have to write a lot of GUI-Libraries to be able to work with byte-strings... and in the end you'll be at the point where the old Python was... (In fact you'll never reach that point because just going elsewhere will be a lot easier)


Substrings of Unicode strings are fine. Byte-level chunking of a Unicode string requires encoding the string as bytes, then working with bytes, then decoding the text.

Splitting a piece of Unicode text every 1024 bytes is like splitting an ascii string every 37 bits. It doesn't make sense.
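
A tiny Python illustration of why the cut has to happen on the bytes side (and still needs care at the boundary):

    text = "naïve chunking"
    data = text.encode("utf-8")     # 'ï' becomes two bytes

    data[:3]                        # b'na\xc3' -- an arbitrary byte cut lands mid 'ï'
    data[:3].decode("utf-8")        # raises UnicodeDecodeError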


It's just a not-very-well-explained rant about some shitty libraries and a lot of legacy code. If you want to read about REAL complaints, read Armin Ronacher's thoughts about it instead: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/


This is mostly why PHP6 wasn't a thing.


Looks like we need a Python 4. (/s)


Well, Python already has a plan for Python 4. Python 4 will be released after Python 3.8. There have already been discussions about Python 4.0 in the dev group. It would just be the next number after 3.8, so there won't be breaking issues like 2=>3.


I think that was just some idea that was discarded.

"Seems that we've reached the consensus: we release Python 3.10 after Python 3.9. We maybe release Python 4.0 at some point if there's a significant backwards incompatible change." https://mail.python.org/pipermail/python-committers/2018-Sep...


Python 3 to be retired in 2050 =)


That sounds more like a mess handling things that are not Unicode.


Yes, but the issue here would be that Python forced these "things which are not unicode" into unicode.


This is just the cost of using a dynamic language with implicit error handling (exceptions).


You can garble filenames just as easily in statically-typed languages. Consider, for example, Windows' infamous 16-bit units that aren't actually well-formed UTF-16. I'm not aware of any widely-used programming language whose type system will save you from that sort of thing ("here's some bytes, figure out if they're a string and if so what encoding" is a historically very difficult problem).
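As a concrete illustration (the file name below is made up): Python will happily hold such a value as a str, but it can't be encoded as well-formed UTF-16 (or UTF-8) without an error handler.

    name = "report_\udc80.txt"      # contains a lone surrogate, as Windows file names can
    try:
        name.encode("utf-16")       # strict encoding rejects unpaired surrogates
    except UnicodeEncodeError as e:
        print("not well-formed UTF-16:", e)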


Rust `String`s are always UTF-8 – distinct from `OsString`s which are not. Conversions are explicit, and the programmer is often forced to decide between conversions which are a) fallible or b) lossy. If neither choice is appropriate, the only option is to avoid the conversion, which… is correct.

https://doc.rust-lang.org/std/string/struct.String.html

https://doc.rust-lang.org/std/ffi/struct.OsString.html

https://doc.rust-lang.org/std/path/struct.PathBuf.html
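For comparison, Python's codecs expose the same fallible-vs-lossy choice through error handlers; a quick sketch with an assumed example byte string:

    raw = b"caf\xe9"                               # latin-1 bytes, not valid UTF-8

    # fallible: raises if the bytes aren't valid UTF-8
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        print("decode failed")

    # lossy: always succeeds, but the bad byte becomes U+FFFD
    print(raw.decode("utf-8", errors="replace"))   # 'caf\ufffd'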


> You can garble filenames just as easily in statically-typed languages.

Only if the language assumes filenames are its regular text strings, which not all do.

> Consider, for example, Windows' infamous 16-bit units that aren't actually well-formed UTF-16.

unix filenames are literally just bags of bytes with no known or specified encoding.


Unix filenames don't pretend to be something they aren't. Windows filenames like to present a convincing façade of being UTF-16 right up until they aren't.


Windows filenames "like to present a convincing façade of being UTF-16" in the exact same way unix filenames "like to present a convincing façade of being UTF-8". Both are common assumptions neither is actually true, and all of that is well-documented.


> unix filenames "like to present a convincing façade of being UTF-8"

Except they never have? Unix paths have always been bags of bytes, both before Unicode and UTF-8 were invented and after. It's just convention that modern Unix systems use UTF-8 as the text encoding for paths.
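Python's os APIs reflect this: pass bytes and you get the raw byte names back, pass str and you get text decoded with the filesystem encoding plus surrogateescape (the outputs shown are hypothetical, depending on what the directory contains).

    import os

    print(os.listdir(b"."))   # raw bytes, e.g. [b'caf\xe9.txt', ...]
    print(os.listdir("."))    # decoded text, e.g. ['caf\udce9.txt', ...]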


> Except they never have?

And neither have Windows paths ever actually pretended to be UTF-16, that's my point.


I would express it more as "programmers in Unix environments like to act as if everything still uses the C locale everywhere, all the time".


You're getting downvoted because it's a flamewar-ish comment, but as a mostly-Python dev I think you have a point. Where types are so different that your code will throw a runtime error the first time you run it if you get the typing wrong, you don't feel that much pain from dynamic languages. And where types are used like dicts, explicitly as bags of stuff that might or might not be present/set, you can have the same problem in most typed languages with null errors. The place where, in my experience, a type system would have avoided some annoyance is where you have two very similar but not quite identical types that can fall a long way through your code before you notice. str/bytes is one; another common one for me is date/datetime. Fortunately, in Python 3 you can add type annotations and use a decorator to enforce runtime type checks; this lets you track your assumptions and catch the errors closer to their origin.
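A minimal sketch of the kind of decorator the parent describes (hypothetical, not any particular library): check annotated argument types at call time, so a stray bytes value fails near its origin instead of deep inside the code.

    import inspect
    from functools import wraps

    def enforce_types(func):
        sig = inspect.signature(func)
        hints = func.__annotations__

        @wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                expected = hints.get(name)
                if isinstance(expected, type) and not isinstance(value, expected):
                    raise TypeError(f"{name} must be {expected.__name__}, "
                                    f"got {type(value).__name__}")
            return func(*args, **kwargs)
        return wrapper

    @enforce_types
    def save(path: str) -> None:
        ...

    save("ok.txt")        # fine
    save(b"not-ok.txt")   # raises TypeError here, not somewhere downstream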



