I was looking for the catch. Here it is: "It's really simple: Know what encoding a certain piece of text, that is, a certain byte sequence, is in, then interpret it with that encoding."
That's like "knowing" the truth. How?
I have received some very interesting files that made Python yack unicode errors, again and again. Why? Not only did I not "know" what encoding it was in -- the encodings changed at different points in the stream of bytes. I call this "slamming bytes together" because somewhere along the line, someone's program did exactly that.
There is nothing you can do with a text file of unknown encoding but treat it as an array of bytes.
If you start guessing the encoding, at best it won't work in some cases, at worst you are introducing security vulnerabilities. You can try, but there is just no way to do it right.
Generally you can safely treat text in an unknown encoding as UTF-8. Since you're expecting potential failures but want to press on anyway instead of raising an exception or error, you treat invalid sequences as U+FFFD, the Replacement Character, as you would in a language or API with no exception reporting mechanism.
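For what it's worth, in Python this is a one-liner. A minimal sketch, assuming a made-up byte string (`raw` is purely illustrative) that mixes ASCII, one valid UTF-8 sequence, and one stray Latin-1 byte:

```python
# Hypothetical input: ASCII, a valid UTF-8 "e with acute" (C3 A9),
# and a lone Latin-1 byte (EF) that is not valid UTF-8 here.
raw = b"caf\xc3\xa9 na\xefve"

# errors="replace" substitutes U+FFFD for each undecodable sequence
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="replace")

# ascii() keeps the output printable on any terminal.
print(ascii(text))
```

The valid sequence survives, the ASCII is untouched, and only the stray byte becomes U+FFFD.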
There are lots of pleasing aspects to this choice. It's ASCII compatible of course, so anything that was actually ASCII is still ASCII, anything that was almost ASCII is just ASCII with U+FFFD where it deviated.
The replacement character resolutely isn't any of the specific things, nor any of the generic classes of thing you might be expected to treat differently for security reasons. It isn't a number, or a letter (of either "case"), it isn't white space, and it certainly isn't any of the separators, escapes or quote markers like ? or \ or + or . or _ or...
... yet it is still inside the BMP so it won't trigger weird (perhaps less well tested) behaviour for other planes.
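If you want to check those properties rather than take my word for it, here is a small sketch using Python's standard `unicodedata` module (the constant name `REPLACEMENT` is just for illustration):

```python
import unicodedata

REPLACEMENT = "\ufffd"

# What Unicode says this code point is.
print(unicodedata.name(REPLACEMENT))      # REPLACEMENT CHARACTER
print(unicodedata.category(REPLACEMENT))  # So (Symbol, other)

# Not a letter, digit, or white space, and it has no case.
is_special = (
    REPLACEMENT.isalnum()
    or REPLACEMENT.isspace()
    or REPLACEMENT.isupper()
    or REPLACEMENT.islower()
)
print(is_special)

# Still inside the Basic Multilingual Plane (U+0000..U+FFFF).
in_bmp = ord(REPLACEMENT) <= 0xFFFF
print(in_bmp)
```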
It's self-synchronising. If something goes wrong somehow, then within a few bytes, if what follows is UTF-8 or an ASCII-compatible encoding, the decoder will synchronise properly. You never end up "out of phase" as can happen for some encodings.
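A quick way to see the resynchronisation, as a sketch with made-up data: chop a stream in the middle of a multi-byte character and decode what is left.

```python
# "naive cafe" with the accents, written with escapes to stay ASCII-safe.
good = "na\u00efve caf\u00e9".encode("utf-8")

# Simulate a stream that starts mid-character: byte 2 is the lead byte
# of the first two-byte sequence, so slicing at 3 leaves an orphaned
# continuation byte at the front.
damaged = good[3:]

recovered = damaged.decode("utf-8", errors="replace")
print(ascii(recovered))
```

Only the orphaned byte turns into U+FFFD. The decoder is back in phase at the very next byte, and the later two-byte sequence decodes normally.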
Most usefully, whatever you're now butted up against works with UTF-8 now. Maybe some day that'll get formally documented, maybe it won't. As the years drag on the chance of specifying _anything else_ shrinks further, and the de facto popularity of UTF-8 means that even if it's never formalised anywhere, everybody will just assume UTF-8 anyway and you won't have to lift a finger.
I probably don't need to say this, but it all depends. Many operating systems and GUI frameworks internally use UTF-16 because it was more common when they were built. Lots of old files use really obscure encodings. Sometimes you get a UTF BOM to identify UTF-16 and UTF-32, other times you don't. Then there are the pesky ways you can encode characters with HTML or XML entities, the occasional double-encoding of such, and so on.
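For the cases where you do get a BOM, sniffing it is cheap. A rough sketch using the constants in Python's `codecs` module; the helper name `sniff_bom` and the table `_BOMS` are hypothetical, and a missing BOM tells you nothing either way:

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as
# the UTF-16 LE BOM, so the longer marks must be checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


sample = "hi".encode("utf-16")  # Python's "utf-16" codec prepends a BOM
print(sniff_bom(sample))
print(sniff_bom(b"no marker here"))
```

The returned names are codecs that consume the BOM themselves, so `data.decode(sniff_bom(data))` gives text without a leading U+FEFF.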
When I worked with library records, I had to deal with text encodings that pre-dated SQL, though I suppose I should be thankful that ASCII existed by then so they were mostly ASCII compatible, but even today there are systems designed to output MARC-8 + UTF-8 as a fallback only when a MARC-8 character isn't available (MARC-21) instead of just using UTF-8.
I'll admit though, outside of MARC-8 and the various Unicode encodings, I'm having trouble thinking up systems that would still be incompatible today. Old documents, yes, absolutely would be encoded in different charsets, Windows still generally defaults to encoding in their Latin1 if I recall correctly, but most systems today do expect UTF-8 over the network at least, and UTF-16 for display perhaps...
Don't get me started on line endings though, and how many files use one, both, more than one ... and especially how much fun it can be with git repos cross-platform, or when automated tools use platform default line endings when they should be configurable, etc. CSV files that aren't properly escaped are also a special mini hell...
Wanted to amend this list of character encodings with another one I came across recently -- GSM-7, used for SMS. https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_al... If a message you send includes other Unicode characters than that, including emoji, it will cost more to send and use UCS-2 encoding (which later became UTF-16).
I mean, are you though? Was it "data" ? The supposed data doesn't have metadata to tell you what it means (which encoding / character set was used), so, did it actually mean anything?
Smashing the bits that don't mean anything to U+FFFD leaves humans with the unmistakable evidence that something was lost here. It's not like U+FFFD doesn't scream "Hey stuff went wrong here" - it's an inverse question mark on a diamond, short of an animated GIF that says "Uh-oh" with an anvil dropping onto a cartoon character's head we can't do much more.
If you're sure it's supposed to be ISO-8859-1 then sure, treating it as UTF-8 eats data, likewise if it was supposed to be KOI-8 or something. But you don't know, so, if "Give up and demand a human fix things" isn't a sensible option, which it often won't be, this is the best we can do.
Yes. The fact that you couldn't interpret it doesn't mean that the consumer of your output couldn't have if you'd passed it through without going out of your way to destroy it.
There is an enormous class of software that detects specific sequences meaningful to it that exist in ASCII, and copies the other parts of its input directly into its output without caring to modify them. If you don't intentionally destroy your input, this will silently Just Work with many encodings in actual use.
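To make that concrete, here is a sketch of the kind of tool I mean. It only cares about an ASCII `=` separator and an ASCII key, and never decodes anything else. The function `rename_key` and the sample line are made up for illustration, and this only holds for ASCII-compatible encodings (it would not work on UTF-16):

```python
def rename_key(line: bytes, old: bytes, new: bytes) -> bytes:
    """Rename an ASCII key in a key=value line, passing the value through as raw bytes."""
    key, sep, value = line.partition(b"=")
    if sep and key == old:
        return new + sep + value
    return line


# The tool has no idea this value is Latin-1, and it doesn't need to.
latin1_line = "name=Jos\u00e9".encode("latin-1")
out = rename_key(latin1_line, b"name", b"full_name")

# The downstream consumer, who does know the encoding, still gets intact data.
print(ascii(out.decode("latin-1")))
```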
But this involves a sleight of hand where, at first, you deny knowing the encoding so as to insist we can't decode it, and then you declare you did know the encoding so as to declare the results "destroyed" because now you can't decode them.
Just tell the text processing tool the encoding. Or, don't use text.
If you resent having to pick an encoding, it still works - just the encoding is UTF-8 because duh, of course it is.
In a world where you design your entire data processing pipeline from the ground up for each process, sure, you the person who knows the encoding of the program, and the program itself, have the same knowledge of text encodings. You also wouldn't have this problem in the first place.
In practice, if this comes up at all it's a huge mistake to be destroying your data. Curse the person who wrote the tool in the middle that eats data, and don't be them.
Even outright throwing errors is better than replacing characters and expecting whoever looks at the data on the other end to notice.
If you want an error, throw an error. We specified that up front. The entire sub-thread you're in is about what happens if you for whatever reason can't throw an error or don't want to.
Notice that, again if you want an error you can detect U+FFFD and error out on that. I mean apparently this isn't Unicode after all right? So the only way U+FFFD got into the pipeline is because of an error you've now decided you should have caught but... didn't?
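Detecting it after the fact is trivial. A sketch, with a hypothetical helper name `decode_or_fail`; note that a legitimate U+FFFD already present in valid input would also trip it, which is the point being made above:

```python
def decode_or_fail(data: bytes) -> str:
    """Decode as UTF-8 with replacement, then refuse if anything was replaced."""
    text = data.decode("utf-8", errors="replace")
    position = text.find("\ufffd")
    if position != -1:
        raise ValueError(f"U+FFFD found at character {position}")
    return text


print(decode_or_fail(b"all fine"))

try:
    decode_or_fail(b"broken \xff here")
except ValueError as exc:
    print("rejected:", exc)
```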
Your approach randomly introduces unspecified behaviour which is likely to introduce security vulnerabilities and who knows what other problems because it resists "Full Recognition Before Processing".
Unlike treating text in unknown encoding as UTF-8, passing it through mangled by tools that didn't actually understand it as you've proposed does lead to real world vulnerabilities that can be as serious as remote code execution.
Oh, that's what I did of course. Some substitution, some Pokemon Exception Handling. At the end of the day, it was analysis, not a random file, so I wasn't worried about security.
What I am pointing out is that "know" is just doing a lot of magical work in that sentence.
It's like saying that to win a race against Usain Bolt, it's simple: run faster!
Fortunately, Python is well equipped for that. If you open a file with Python that you know might contain text in mixed encodings, you can use try/except to inform the user, or open it in binary mode and just store the bytes.
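A sketch of the try/except route, assuming the bytes were already read with something like `open(path, "rb").read()`; the function name `load_text_or_bytes` is hypothetical:

```python
from typing import Union


def load_text_or_bytes(data: bytes, encoding: str = "utf-8") -> Union[str, bytes]:
    """Return decoded text if possible, otherwise report and keep the raw bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Inform the user, then fall back to storing the bytes untouched.
        print(f"could not decode as {encoding}: {exc}")
        return data


print(type(load_text_or_bytes(b"plain ascii")).__name__)
print(type(load_text_or_bytes(b"bad \xff byte")).__name__)
```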
But my favorite way of doing it is:
open('file', errors=strategy)
Strategy can be:
- "ignore": undecodable text is skipped
- "replace": undecodable text is replaced with U+FFFD, the replacement character ("?" is what you get when encoding rather than decoding)
- "surrogateescape" (usually paired with utf8): undecodable bytes are decoded to a special representation which makes no human sense, but can be re-encoded back to their original value.
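Here is how the three handlers behave on the same input, as a small sketch with a made-up byte string. It uses `bytes.decode`, which takes the same `errors` argument as `open`:

```python
raw = b"abc\xffdef"  # hypothetical data: 0xFF is never valid in UTF-8

ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")
escaped = raw.decode("utf-8", errors="surrogateescape")

# ascii() is used because lone surrogates cannot be printed directly.
print(ascii(ignored))   # the bad byte is silently dropped
print(ascii(replaced))  # the bad byte becomes U+FFFD
print(ascii(escaped))   # the bad byte becomes the lone surrogate U+DCFF

# Only surrogateescape is reversible: encoding with the same handler
# restores the original bytes exactly.
round_trip = escaped.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)
```

The trade-off is that the surrogateescape string is not valid Unicode text, so anything that encodes it strictly (including `print` to a UTF-8 terminal) will raise.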
It's kinda ironic because people bashed Python for separating bytes/text, forcing them to deal with encoding correctly in Python 3. After all, this problem of "slamming bytes together" comes from languages that treat text as a bytes array, allowing this stupid mistake.
Yeah, that was more or less my approach to analysis for these files. Essentially it was an export that was quasi-aware of Unicode and dumped out certain fields (of course they had to be variable length) in their original encodings, whatever they were. I got more than a few of these.
Unicode is great, as long as everyone upstream follows all of the rules and nothing goes wrong.
One of the reasons there is a lot of confusion about encodings vs Unicode is that Unicode was initially an encoding. It was thought that 65K characters was enough to represent all the characters in actual use across the languages and thus you just needed to change from an 8 bit char to a 16 bit char and all would be well (apart from the issue of endianness). Thus Unicode initially specified what each symbol would look like encoded in 16 bits. (see http://unicode.org/history/unicode88.pdf, particularly section 2). Windows NT, Java, ICU, all embraced this.
Then it turned out that you needed a lot more characters than 65K and instead of each character being 16 bits, you would need 32 bit characters (or else have weird 3 byte data types). Whereas people could justify going from 8 bits to 16 bits as a cost of not having to worry about charsets, most developers balked at 32 bits for every character. In addition, you now had a bunch of the early adopters (Java and Windows NT) that had already embraced 16 bit characters. So then encodings were hacked on such as UTF-16 (surrogate pairs of 16 bit characters for some unicode code points).
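To illustrate the surrogate pair hack, here is a sketch that compares encoded sizes for one character inside the BMP and one outside it, then derives the pair by hand. The helper `to_surrogates` is a hypothetical name; the arithmetic is the standard UTF-16 rule:

```python
def to_surrogates(code_point: int):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


bmp_char = "\u00e9"         # U+00E9, fits in 16 bits
astral_char = "\U0001F600"  # U+1F600, does not

for ch in (bmp_char, astral_char):
    sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(f"U+{ord(ch):04X}", sizes)

high, low = to_surrogates(ord(astral_char))
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
print(astral_char.encode("utf-16-be").hex())
```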
I think, if the problem had been understood better at the start, namely that you have a lot more characters than will fit in 16 bits, then something like UTF-8 would likely have been chosen as the canonical encoding and we could have avoided a lot of these issues. Alas, such is the benefit of 20/20 hindsight.
I personally had a good time re-reading this over and over again when I was migrating python 2 to python 3, it's a great resource: http://farmdev.com/talks/unicode/
I love C++ so much, and it has brought me such joy as a hobbyist programmer, but good grief, this one aspect of it (dealing with encodings & charsets) is so depressing I just want to cry sometimes.
Everything is simple -- until it isn't.