This video is about “Code Pages, Character Encoding, Unicode, UTF-8 and the BOM”.
I don’t know about everyone else, but I had to implement an ISO 8859 encoding to UTF-8 converter in assembly when I was at uni around 8-10 years ago. So this is standard stuff for most developers who graduate from the University of Oslo.
Converting to UTF-8 from a simple code page like ISO 8859 is easy. The difficult part is parsing UTF-8 with all its various behaviour-changing characters.
Take the BOM as an example.
I work on backend Java projects for large banks. Over the years I've fought with the BOM on numerous occasions. For some reason 95% of software that says it is UTF-8 compliant is not. My modus operandi for dealing with it is to remove the BOM on ingestion and only add it when the string leaves the system, and only when we know the outside party absolutely requires it (though it should not...)
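For the curious, a minimal sketch of that strip-on-ingestion step might look something like this in Java (the class and method names are just made up for illustration):

    public final class BomUtil {
        private static final char BOM = '\uFEFF';

        // Removes a single leading U+FEFF, if present, from text already decoded from UTF-8.
        public static String stripLeadingBom(String s) {
            if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
                return s.substring(1);
            }
            return s;
        }
    }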
People say if the only tool you have is a hammer, everything looks like a nail - but when the only tool you have is a left-handed can opener life gets _really weird_ and that's how we got the UTF-8 BOM
The UTF-8 BOM is largely a Microsoft idea. They've got a bunch of code that thinks in UCS-2 (now retrofitted to more or less pretend it knows UTF-16) and so thinks about byte order when decoding text files, and from there a Byte Order Mark in files that don't have byte ordering seems like a reasonable idea.
If the files actually _mean_ something then a UTF-8 BOM just introduces confusion. Lots of code I'm responsible for processes UTF-8 just fine, but if it handles say files full of key = value pairs and your file begins with a BOM, well, OK then, that first key starts with U+FEFF, weird choice but no reason we should disallow that. And of course that isn't what you wanted and so now Windows users are complaining I'm not "compatible".
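To make the failure mode concrete, here's roughly the kind of naive key = value handling I mean, sketched in Java with made-up names; fed a file that begins with a BOM, the first key comes back with U+FEFF stuck to the front:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class KeyValueDemo {
        // Naive parser: everything before the first '=' is the key, trimmed of ASCII whitespace only.
        static Map<String, String> parse(String text) {
            Map<String, String> result = new LinkedHashMap<>();
            for (String line : text.split("\n")) {
                int eq = line.indexOf('=');
                if (eq < 0) continue;
                result.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
            return result;
        }

        public static void main(String[] args) {
            String withBom = "\uFEFFhost = example.com\nport = 8080\n";
            // Prints false: the first key is "\uFEFFhost", because String.trim() only strips chars <= U+0020.
            System.out.println(parse(withBom).containsKey("host"));
        }
    }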
The choice of character for the BOM isn't weird or random - it's a zero-width no-break space, which means it's supposed to be invisible even when your software is capable of displaying it. If you remove whitespace from the beginning and end of your keys you'll be fine.
It seems arbitrary, but I don't think they could have made a better choice.
Sure. So what? Text file formats aren't magically obliged to ignore leading whitespace just because that suits Microsoft. If my format would consider a U+0009 TAB or a U+0020 SPACE at the start of the key to be part of the key, why not U+FEFF?
I feel like some of the European CS programs have really neat stuff in them. I’m not sure more than a passing mention of Unicode or any of its encodings (just assume no one uses more than the first 255 code points and it’s encoded in UTF-8) was ever made in any CS class I took.
I guess assembly made a little more sense 8 years ago, and they only had so much class time, but I wouldn’t have minded a “decode UTF-8” assignment. It would have been good practice at bit manipulation, which is kind of difficult to get right in C++.
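For a taste of what such a “decode UTF-8” assignment involves, here's a rough sketch (in Java here rather than C++ or assembly, and skipping validation of overlong or truncated sequences):

    public class Utf8Decode {
        // Decodes the single code point starting at offset i in a well-formed UTF-8 byte array.
        static int codePointAt(byte[] b, int i) {
            int b0 = b[i] & 0xFF;
            if (b0 < 0x80) return b0;                                   // 0xxxxxxx: ASCII
            if (b0 < 0xE0) return (b0 & 0x1F) << 6 | (b[i + 1] & 0x3F); // 110xxxxx 10xxxxxx
            if (b0 < 0xF0) return (b0 & 0x0F) << 12 | (b[i + 1] & 0x3F) << 6 | (b[i + 2] & 0x3F);
            return (b0 & 0x07) << 18 | (b[i + 1] & 0x3F) << 12 | (b[i + 2] & 0x3F) << 6 | (b[i + 3] & 0x3F);
        }

        public static void main(String[] args) {
            byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // U+20AC EURO SIGN
            System.out.printf("U+%04X%n", codePointAt(euro, 0));    // prints U+20AC
        }
    }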
For what it's worth, one of my CS classes at Stony Brook University had an assignment to write a converter that could do UTF-8 to UTF-16 and back, and one of the test cases we were graded on included Emoji support.
I've never taken a college compsci course, but I've written UTF-8 to UTF-32 converters as part of a project. Do Unicode and other character encodings really need to be explicitly taught in a class? Most of the complicated behavior (like right-to-left text, precomposed vs. decomposed characters, grapheme vs. code point vs. glyph) gets abstracted away and would probably require a semester-long course to teach in depth. The general concepts (character encodings turn text into numbers, Unicode is the one with the funny faces), on the other hand, are so widely known that I'm pretty sure even lay-people are aware of them on a basic level.
UTF-8 in particular seems like a nice thing to show in a 1xx data structures and algorithms course. Here's a problem, and here is the beautiful yet practical data structure that somebody came up with to solve that problem.
I agree that getting into details of text rendering and so on is one of those 3xx or 4xx courses with narrow appeal because maybe one student in a thousand will actually make use of this knowledge, but grokking why UTF-8 is how it is and the basic outlines of Unicode seems very broadly applicable.
This was really nicely done and a really good initiative. I got into software development by teaching myself Python (badly) and PHP (badly), and it's probably quicker to list the things I do know than the cavernous gaps in my knowledge.
For most software development jobs you can get by without knowing this stuff but it's great there are things like this to clearly explain fundamentals that are either assumed knowledge or communicated in (to an outsider) gatekeeping levels of dense terminology.
I just read this over and it's a very dated Windows-centric view. Several glaring errors - glosses over the difference between UCS-2 and UTF-16, no mention of surrogate pairs for UTF-16 (thinks only 65k code points), says UTF-8 can be up to 6 bytes (no it can't, this was proposed but never standardized), the idea that ASCII standardization dates to the 8088 (it's much older), mentions UTF-7 (don't), no mention that wchar_t changes size based on platform, no mention of Han unification, no mention of shaping, and no mention of normalization.
UTF-8 was originally designed to handle codepoints up to a full 31 bits (sequences of up to six octets). It wasn't until later that the codepoint range was restricted so that 4 octets would be sufficient.
Kinda half-sad it didn't make it. Would have been cool to be able to "see" behind the curtains of UTF strings. As it is now, you can only paste a UTF string in a UTF-aware environment, and you also need the correct fonts etc.
It would have been cool to be able to incrementally upgrade legacy environments to use UTF via UTF-7. Unaware parts would just have displayed the encoding. String lengths would have sort of worked.
(All of these things would of course have come with horrible drawbacks, so in that alternative universe I might have been cursing that we got UTF-7...)
Sure, but there is no way this should be used as a reference in 2019. It was wrong even in 2003 when it was written - Unicode 3.0 from 1999 defined the maximum number of code points, surrogate pairs, and code points above U+FFFF.
His single most important fact still rings true though, "It does not make sense to have a string without knowing what encoding it uses."
I've seen UTF-8 with a BOM while consuming data when integrating with strongly Windows-centric environments. Relatively uncommon, but does happen. And it is very annoying!
BOM is only a problem with strict syntaxes, which robots.txt is not an example of. If the "consumer" simply ignores invalid or meaningless lines, you can avoid issues from invisible characters by not having anything meaningful on the first line of your file.
Yes, it’s widely used. Many text editors insert a UTF-8 BOM as the first character in a text file to signal that the encoding is UTF-8. It’s technically pointless since UTF-8 doesn’t depend on endianness, but since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint.
You can occasionally see it in git diffs as U+FEFF, or if you open a text file in a hex editor as EF BB BF
>since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint
Neither does any other of the hundreds of existing text encodings.
It's debatable how much of a magic number it's supposed to be anyway, considering that few people have insisted on having magic numbers in text files, and that you get the BOM at the beginning simply by naively converting a UCS-2/UTF-16 file codepoint by codepoint (and, vice versa, you end up requiring it to be there whenever you convert the other way around, because of course your conversion couldn't include that extra logic).
The nice thing about the BOM is you can't get it accidentally in an ASCII file - all the bytes have the upper bit set but all ASCII characters have that bit as zero. It makes an excellent magic number for that reason. It's probably just as unlikely to come up in other encodings that use the upper bit.
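A quick Java sketch of that kind of byte-level sniff (names invented for the example; a real version would handle exceptions and short reads more carefully):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    public class BomSniffer {
        // Consumes a leading EF BB BF if present; otherwise pushes the bytes back untouched.
        static InputStream skipUtf8Bom(InputStream in) throws IOException {
            PushbackInputStream pin = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = pin.readNBytes(head, 0, 3); // Java 9+; blocks until 3 bytes are read or the stream ends
            boolean bom = n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!bom && n > 0) {
                pin.unread(head, 0, n);
            }
            return pin;
        }
    }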