
That's the first time I've heard about UTF-8 files sometimes having a BOM, so it's nice to learn something. :)

I'm wondering if it's widely used.




The first time I saw a BOM (0xEF,0xBB,0xBF) was in Project Gutenberg files.

  $ curl -sO https://www.gutenberg.org/cache/epub/16681/pg16681.txt

  $ file pg16681.txt 
  pg16681.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

  $ head -c3 pg16681.txt | xxd
  00000000: efbb bf                                  ...


I've seen UTF-8 with a BOM while consuming data when integrating with strongly Windows-centric environments. Relatively uncommon, but does happen. And it is very annoying!


It used to cause problems with how Google parsed robots.txt files, and maybe still does!

Which is why all my robots.txt files have a comment on the first line.


> Which is why all my robots.txt files have a comment on the first line.

That doesn't stop a BOM being generated or consumed.


BOM is only a problem with strict syntaxes, which robots.txt is not an example of. If the "consumer" simply ignores invalid or meaningless lines, you can avoid issues from invisible characters by not having anything meaningful on the first line of your file.
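To make the "nothing meaningful on the first line" trick concrete, here is a rough Python sketch. The parse_rules function is a made-up lenient parser for illustration only, not how Google or any real crawler parses robots.txt; it just skips lines it doesn't recognise.

```python
BOM = "\ufeff"  # the BOM codepoint as it appears after decoding as plain UTF-8

def parse_rules(text):
    """Hypothetical lenient parser: keep known 'key: value' lines, skip the rest."""
    known = {"user-agent", "disallow", "allow"}
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments; strip() leaves U+FEFF alone
        if ":" not in line:
            continue  # blank, comment-only, or meaningless line
        key, value = line.split(":", 1)
        key = key.strip().lower()
        if key in known:  # "\ufeffuser-agent" is not a known key, so it is ignored
            rules.append((key, value.strip()))
    return rules

# BOM glued to the first directive: that directive is silently lost.
with_bom = BOM + "User-agent: *\nDisallow: /private\n"
# BOM glued to a throwaway comment line: nothing meaningful is lost.
guarded = BOM + "# comment\nUser-agent: *\nDisallow: /private\n"

print(parse_rules(with_bom))  # only the Disallow rule survives
print(parse_rules(guarded))   # both rules survive
```

The comment doesn't stop the BOM from being written, as pointed out above; it only makes sure the line the BOM corrupts is one nobody cares about.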


Yes, it’s widely used. Many text editors insert a UTF-8 BOM as the first character in a text file to signal that the encoding is UTF-8. It’s technically pointless since UTF-8 doesn’t depend on endianness, but since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint.

You can occasionally see it in git diffs as U+FEFF, or as EF BB BF if you open a text file in a hex editor.
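For anyone who wants to check the U+FEFF / EF BB BF correspondence themselves, a small Python sketch using only the standard library. The "hello" payload is just sample data.

```python
import codecs

# U+FEFF encoded as UTF-8 is exactly the three bytes EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes.hex(" "))              # ef bb bf
print(bom_bytes == codecs.BOM_UTF8)    # True

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain "utf-8" keeps the BOM as an invisible first character.
plain = raw.decode("utf-8")
# "utf-8-sig" drops a leading BOM if present, and is harmless if there is none.
stripped = raw.decode("utf-8-sig")

print(repr(plain))     # '\ufeffhello'
print(repr(stripped))  # 'hello'
```

Decoding with "utf-8-sig" is usually the least painful way to consume files that may or may not carry the BOM.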


>since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint

Neither does any other of the hundreds of existing text encodings.

It's debatable how much of a magic number it's supposed to be anyway, considering that few people have insisted on having magic numbers in text files, and that you get the BOM at the beginning simply by naively converting a UCS-2/UTF-16 file codepoint by codepoint (and vice versa: it has to be there if you ever convert the other way around, because of course your conversion couldn't include that extra logic).
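A quick Python sketch of that naive conversion, assuming a little-endian UTF-16 input that starts with its own BOM. The naive_utf16le_to_utf8 helper is hypothetical and deliberately has no BOM handling.

```python
import codecs

def naive_utf16le_to_utf8(data):
    """Hypothetical naive converter: re-encode every codepoint, BOM included."""
    text = data.decode("utf-16-le")  # "utf-16-le" does not strip a leading BOM
    return "".join(text).encode("utf-8")

# A UTF-16LE "file" that begins with its BOM (FF FE), followed by "hi".
utf16_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

utf8_file = naive_utf16le_to_utf8(utf16_file)
print(utf8_file.hex(" "))  # ef bb bf 68 69
```

U+FEFF is carried across like any other codepoint, so the UTF-8 output starts with EF BB BF without anyone having decided to put a signature there.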


The nice thing about the BOM is you can't get it accidentally in an ASCII file - all the bytes have the upper bit set but all ASCII characters have that bit as zero. It makes an excellent magic number for that reason. It's probably just as unlikely to come up in other encodings that use the upper bit.
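The high-bit argument is easy to check in Python. The has_utf8_bom helper below is a made-up name for illustration. The last lines show what the same three bytes mean in Latin-1, where they are valid but form an unlikely way to start a file.

```python
BOM = b"\xef\xbb\xbf"

# Every BOM byte has the top bit (0x80) set...
print([bool(b & 0x80) for b in BOM])  # [True, True, True]

# ...while every ASCII byte has it clear, so pure ASCII can never contain the BOM.
ascii_bytes = bytes(range(128))
print(any(b & 0x80 for b in ascii_bytes))  # False

def has_utf8_bom(data):
    """Hypothetical helper: True if the data starts with EF BB BF."""
    return data.startswith(BOM)

print(has_utf8_bom(b"plain ascii text"))  # False
print(has_utf8_bom(BOM + b"text"))        # True

# In Latin-1 the same bytes decode to three printable characters.
print(BOM.decode("latin-1"))  # ï»¿
```

That "ï»¿" is the mojibake you see at the top of a page when a BOM-prefixed UTF-8 file gets read as Latin-1 or Windows-1252.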



