That's the first time I've heard about UTF8 files sometimes having a BOM, so tha...

sigjuice · on Nov 20, 2019

The first time I saw a BOM (0xEF,0xBB,0xBF) was in Project Gutenberg files.

  $ curl -sO https://www.gutenberg.org/cache/epub/16681/pg16681.txt

  $ file pg16681.txt 
  pg16681.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

  $ head -c3 pg16681.txt | xxd
  00000000: efbb bf                                  ...
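
For anyone who wants the same check outside the shell, here is a rough Python sketch. The sample bytes are made up for illustration (they are not the actual contents of pg16681.txt), and `has_utf8_bom` is a hypothetical helper name. It also shows that the plain `utf-8` codec keeps the BOM as U+FEFF, while `utf-8-sig` drops it.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical file contents: a UTF-8 BOM followed by a CRLF-terminated line.
sample = codecs.BOM_UTF8 + "Some title line\r\n".encode("utf-8")

# Same idea as `head -c3 file | xxd`: look at the first three bytes.
first_three_hex = sample[:3].hex()

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
text_plain = sample.decode("utf-8")

# The "utf-8-sig" codec strips one leading BOM if it is present.
text_sig = sample.decode("utf-8-sig")

print(has_utf8_bom(sample), first_three_hex)
print(repr(text_plain[0]), repr(text_sig[0]))
```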

davidwtbuxton · on Nov 20, 2019

I've seen UTF-8 with a BOM while consuming data when integrating with strongly Windows-centric environments. Relatively uncommon, but does happen. And it is very annoying!

C1sc0cat · on Nov 20, 2019

It used to, and maybe still does, cause problems with how Google parsed robots.txt files!

Which is why all my robots.txt files have a comment on the first line.

Someone1234 · on Nov 20, 2019

> Which is why all my robots.txt files have a comment on the first line.

That doesn't stop a BOM being generated or consumed.

YSFEJ4SWJUVU6 · on Nov 20, 2019

BOM is only a problem with strict syntaxes, which robots.txt is not an example of. If the "consumer" simply ignores invalid or meaningless lines, you can avoid issues from invisible characters by not having anything meaningful on the first line of your file.
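
To make that concrete, here is a toy sketch of a lenient line-based parser. It is NOT how Google or any real crawler parses robots.txt; `parse_directives` and `KNOWN_FIELDS` are hypothetical names, and the assumption is simply that unrecognized lines are skipped and the BOM is not stripped before parsing.

```python
# Assumed set of recognized field names for this toy parser.
KNOWN_FIELDS = {"user-agent", "disallow", "allow"}


def parse_directives(text: str) -> list[tuple[str, str]]:
    """Collect (field, value) pairs, silently skipping anything unrecognized."""
    directives = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or meaningless line: ignore it
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field in KNOWN_FIELDS:
            directives.append((field, value.strip()))
    return directives


BOM = "\ufeff"  # what a UTF-8 BOM becomes if the decoder does not strip it

# The BOM sticks to the first field name, so that line is not recognized.
bom_on_directive = parse_directives(BOM + "User-agent: *\nDisallow: /private\n")

# With a comment on line 1, the BOM lands on a line that is ignored anyway.
bom_on_comment = parse_directives(BOM + "# robots\nUser-agent: *\nDisallow: /private\n")

print(bom_on_directive)
print(bom_on_comment)
```

In the first case the `User-agent` line is lost; in the second, both directives survive because the invisible character only damages the comment line.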

OskarS · on Nov 20, 2019

Yes, it’s widely used. Many text editors insert a UTF-8 BOM as the first character in a text file to signal that the encoding is UTF-8. It’s technically pointless since UTF-8 doesn’t depend on endianness, but since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint.

You can occasionally see it in git diffs as U+FEFF, or as EF BB BF if you open a text file in a hex editor.
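
As a small illustration of why U+FEFF and EF BB BF are the same thing, here is a sketch that does the three-byte UTF-8 encoding by hand and compares it with Python's built-in encoder. The variable names are just for this example.

```python
cp = 0xFEFF  # the BOM code point, binary 1111 1110 1111 1111

# UTF-8 three-byte form: 1110xxxx 10xxxxxx 10xxxxxx
b1 = 0xE0 | (cp >> 12)          # top 4 bits
b2 = 0x80 | ((cp >> 6) & 0x3F)  # middle 6 bits
b3 = 0x80 | (cp & 0x3F)         # low 6 bits
manual = bytes([b1, b2, b3])

# The built-in encoder should agree with the manual bit shuffling.
builtin = "\ufeff".encode("utf-8")

# For comparison, the same code point in the two UTF-16 byte orders,
# which is where the "byte order mark" name comes from.
utf16_le = "\ufeff".encode("utf-16-le")
utf16_be = "\ufeff".encode("utf-16-be")

print(manual.hex(" "), builtin.hex(" "))
print(utf16_le.hex(" "), utf16_be.hex(" "))
```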

YSFEJ4SWJUVU6 · on Nov 20, 2019

> since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint

Neither do any of the hundreds of other existing text encodings.

It's debatable how much of a magic number it's supposed to be anyway, considering that few people have insisted on having magic numbers in text files, and that you get the BOM at the beginning simply by naively converting a UCS-2/UTF-16 file codepoint by codepoint (and, vice versa, it has to be there if you ever happen to do the conversion the other way around, because of course your conversion couldn't include that extra logic).
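
A minimal sketch of that naive conversion, assuming a little-endian UTF-16 input that starts with a BOM. The file contents are hypothetical; the point is only that a converter which treats the BOM as just another code point carries it into the UTF-8 output, and carries it back again on the return trip.

```python
import codecs

# Hypothetical UTF-16LE file contents: a BOM followed by the text "hi".
utf16_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Naive conversion: decode every 16-bit unit as a code point, BOM included.
# (The explicit "utf-16-le" codec does not strip a leading U+FEFF.)
code_points = utf16_file.decode("utf-16-le")

# Then encode each code point as UTF-8; U+FEFF becomes EF BB BF.
utf8_file = code_points.encode("utf-8")

# Going the other way with equally naive logic restores the UTF-16 BOM.
round_trip = utf8_file.decode("utf-8").encode("utf-16-le")

print(utf8_file)
print(round_trip == utf16_file)
```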

mark-r · on Nov 20, 2019

The nice thing about the BOM is you can't get it accidentally in an ASCII file - all the bytes have the upper bit set but all ASCII characters have that bit as zero. It makes an excellent magic number for that reason. It's probably just as unlikely to come up in other encodings that use the upper bit.
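
A quick sketch of the bit argument, plus what the three bytes look like if a reader wrongly assumes Windows-1252 (chosen here only as one example of an encoding that uses the upper bit). `could_be_ascii` is a hypothetical helper name.

```python
import codecs

bom = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

# Every BOM byte has the upper bit (0x80) set.
bom_high_bits = [bool(b & 0x80) for b in bom]

# No ASCII byte (0x00 to 0x7F) has that bit set.
ascii_has_high_bit = any(b & 0x80 for b in bytes(range(128)))


def could_be_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


# Misread as Windows-1252, the BOM shows up as three odd-looking characters,
# which is an unlikely way for real text to begin.
as_cp1252 = bom.decode("cp1252")

print(bom_high_bits, ascii_has_high_bit)
print(could_be_ascii(b"User-agent: *"), could_be_ascii(bom))
print(as_cp1252)
```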