This video is about “Code Pages, Character Encoding, Unicode, UTF-8 and the BOM”.
I don’t know about everyone else, but I had to implement an ISO 8859 encoding to UTF-8 converter in assembly when I was at uni around 8-10 years ago. So this is standard stuff for most developers who graduate from the University of Oslo.
Converting to UTF-8 from a simple code page like ISO 8859 is easy. The difficult part is parsing UTF-8 with all its various behaviour-changing characters.
Take the BOM as an example.
I work on backend Java projects for large banks. Over the years I've fought with the BOM on numerous occasions. For some reason 95% of software that says it is UTF-8 compliant is not. My modus operandi for dealing with it is to remove the BOM on ingestion and only add it when the string leaves the system, and only when we know the outside party absolutely requires it (though it should not...)
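For the curious, a minimal sketch of that strip-on-ingestion step might look something like this in Java (the class and method names are just made up for illustration):

    public final class BomUtil {
        private static final char BOM = '\uFEFF';

        // Removes a single leading U+FEFF, if present, from text already decoded from UTF-8.
        public static String stripLeadingBom(String s) {
            if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
                return s.substring(1);
            }
            return s;
        }
    }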
People say if the only tool you have is a hammer, everything looks like a nail - but when the only tool you have is a left-handed can opener life gets _really weird_ and that's how we got the UTF-8 BOM
The UTF-8 BOM is largely a Microsoft idea. They've got a bunch of code that thinks in UCS-2 (now retrofitted to more or less pretend it knows UTF-16) and so thinks about byte order when decoding text files, and from there a Byte Order Mark in files that don't have byte ordering seems like a reasonable idea.
If the files actually _mean_ something then a UTF-8 BOM just introduces confusion. Lots of code I'm responsible for processes UTF-8 just fine, but if it handles say files full of key = value pairs and your file begins with a BOM, well, OK then, that first key starts with U+FEFF, weird choice but no reason we should disallow that. And of course that isn't what you wanted and so now Windows users are complaining I'm not "compatible".
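To make the failure mode concrete, here's roughly the kind of naive key = value handling I mean, sketched in Java with made-up names; fed a file that begins with a BOM, the first key comes back with U+FEFF stuck to the front:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class KeyValueDemo {
        // Naive parser: everything before the first '=' is the key, trimmed of ASCII whitespace only.
        static Map<String, String> parse(String text) {
            Map<String, String> result = new LinkedHashMap<>();
            for (String line : text.split("\n")) {
                int eq = line.indexOf('=');
                if (eq < 0) continue;
                result.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
            return result;
        }

        public static void main(String[] args) {
            String withBom = "\uFEFFhost = example.com\nport = 8080\n";
            // Prints false: the first key is "\uFEFFhost", because String.trim() only strips chars <= U+0020.
            System.out.println(parse(withBom).containsKey("host"));
        }
    }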
The choice of character for the BOM isn't weird or random - it's a zero-width no-break space, which means it's supposed to be invisible even when your software is capable of displaying it. If you remove whitespace from the beginning and end of your keys you'll be fine.
It seems arbitrary, but I don't think they could have made a better choice.
Sure. So what? Text file formats aren't magically obliged to ignore leading whitespace just because that suits Microsoft. If my format would consider a U+0009 TAB or a U+0020 SPACE at the start of the key to be part of the key, why not U+FEFF?
I feel like some of the European CS programs have really neat stuff in them. I’m not sure more than a passing mention of Unicode or any of its encodings (just assume no one uses more than the first 255 code points and it’s encoded in UTF-8) was ever made in any CS class I took.
I guess assembly made a little more sense 8 years ago, and they only had so much class time, but I wouldn’t have minded a “decode UTF-8” assignment. It would have been good practice at bit manipulation, which is kind of difficult to get right in C++.
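For a taste of what such a “decode UTF-8” assignment involves, here's a rough sketch (in Java here rather than C++ or assembly, and skipping validation of overlong or truncated sequences):

    public class Utf8Decode {
        // Decodes the single code point starting at offset i in a well-formed UTF-8 byte array.
        static int codePointAt(byte[] b, int i) {
            int b0 = b[i] & 0xFF;
            if (b0 < 0x80) return b0;                                   // 0xxxxxxx: ASCII
            if (b0 < 0xE0) return (b0 & 0x1F) << 6 | (b[i + 1] & 0x3F); // 110xxxxx 10xxxxxx
            if (b0 < 0xF0) return (b0 & 0x0F) << 12 | (b[i + 1] & 0x3F) << 6 | (b[i + 2] & 0x3F);
            return (b0 & 0x07) << 18 | (b[i + 1] & 0x3F) << 12 | (b[i + 2] & 0x3F) << 6 | (b[i + 3] & 0x3F);
        }

        public static void main(String[] args) {
            byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // U+20AC EURO SIGN
            System.out.printf("U+%04X%n", codePointAt(euro, 0));    // prints U+20AC
        }
    }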
For what it's worth, one of my CS classes at Stony Brook University had an assignment to write a converter that could do UTF-8 to UTF-16 and back, and one of the test cases we were graded on included Emoji support.
I've never taken a college compsci course, but I've written UTF-8 to UTF-32 converters as part of a project. Do Unicode and other character encodings really need to be explicitly taught in a class? Most of the complicated behavior (like right-to-left text, precomposed vs. decomposed characters, grapheme vs. code point vs. glyph) gets abstracted away and would probably require a semester-long course to teach in depth. The general concepts (character encodings turn text into numbers, Unicode is the one with the funny faces), on the other hand, are so widely known that I'm pretty sure even lay-people are aware of them on a basic level.
UTF-8 in particular seems like a nice thing to show in a 1xx data structures and algorithms course. Here's a problem, and here is the beautiful yet practical data structure that somebody came up with to solve that problem.
I agree that getting into details of text rendering and so on is one of those 3xx or 4xx courses with narrow appeal because maybe one student in a thousand will actually make use of this knowledge, but grokking why UTF-8 is how it is and the basic outlines of Unicode seems very broadly applicable.
This was really nicely done and a really good initiative. I got into software development by teaching myself Python (badly) and PHP (badly), and it's probably quicker to list the things I do know than the cavernous gaps in my knowledge.
For most software development jobs you can get by without knowing this stuff but it's great there are things like this to clearly explain fundamentals that are either assumed knowledge or communicated in (to an outsider) gatekeeping levels of dense terminology.
I just read this over and it's a very dated Windows-centric view. Several glaring errors - glosses over the difference between UCS-2 and UTF-16, no mention of surrogate pairs for UTF-16 (thinks only 65k code points), says UTF-8 can be up to 6 bytes (no it can't, this was proposed but never standardized), the idea that ASCII standardization dates to the 8088 (it's much older), mentions UTF-7 (don't), no mention that wchar_t changes size based on platform, no mention of Han unification, no mention of shaping, and no mention of normalization.
UTF-8 was originally designed to handle codepoints up to a full 31 bits (sequences of up to six octets). It wasn't until later that the codepoint range was restricted so that 4 octets would be sufficient.
Kinda half-sad it didn't make it. Would have been cool to be able to "see" behind the curtains of UTF strings. As it is now, you can only paste a UTF string in a UTF-aware environment, and you also need the correct fonts etc.
It would have been cool to be able to incrementally upgrade legacy environments to use UTF via UTF-7. Unaware parts would just have displayed the encoding. String lengths would have sort of worked.
(All of these things would of course have come with horrible drawbacks, so in that alternative universe I might have been cursing that we got UTF-7...)
Sure, but there is no way this should be used as a reference in 2019. It was wrong even in 2003 when it was written - Unicode 3.0 from 1999 defined the maximum number of code points, surrogate pairs, and code points above U+FFFF.
His single most important fact still rings true though, "It does not make sense to have a string without knowing what encoding it uses."
I've seen UTF-8 with a BOM while consuming data when integrating with strongly Windows-centric environments. Relatively uncommon, but does happen. And it is very annoying!
BOM is only a problem with strict syntaxes, which robots.txt is not an example of. If the "consumer" simply ignores invalid or meaningless lines, you can avoid issues from invisible characters by not having anything meaningful on the first line of your file.
Yes, it’s widely used. Many text editors insert a UTF-8 BOM as the first character in a text file to signal that the encoding is UTF-8. It’s technically pointless since UTF-8 doesn’t depend on endianness, but since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint.
You can occasionally see it in git diffs as U+FEFF, or if you open a text file in a hex editor as EF BB BF
>since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint
Neither does any other of the hundreds of existing text encodings.
It's debatable how much of a magic number it's supposed to be anyway, considering that few people have insisted on having magic numbers in text files, and that you get the BOM at the beginning simply by naively converting a UCS-2/UTF-16 file codepoint by codepoint (and, vice versa, you end up requiring it to be there whenever you convert the other way around, because of course your conversion couldn't include that extra logic).
The nice thing about the BOM is you can't get it accidentally in an ASCII file - all the bytes have the upper bit set but all ASCII characters have that bit as zero. It makes an excellent magic number for that reason. It's probably just as unlikely to come up in other encodings that use the upper bit.
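A quick Java sketch of that kind of byte-level sniff (names invented for the example; a real version would handle exceptions and short reads more carefully):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    public class BomSniffer {
        // Consumes a leading EF BB BF if present; otherwise pushes the bytes back untouched.
        static InputStream skipUtf8Bom(InputStream in) throws IOException {
            PushbackInputStream pin = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = pin.readNBytes(head, 0, 3); // Java 9+; blocks until 3 bytes are read or the stream ends
            boolean bom = n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!bom && n > 0) {
                pin.unread(head, 0, n);
            }
            return pin;
        }
    }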