What to know about encodings and character sets

Animats · on June 27, 2015

Something they should have mentioned: put a

    <meta charset="utf-8" />

in all your HTML documents that are in UTF-8. Note that this has to be in the first 1024 bytes of the document. Otherwise, the browser has to invoke the "encoding guesser"[1], which will sometimes guess wrong. (W3C: "The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream.") The result will be occasional users seeing random pages in the wrong encoding, depending on browser, browser version, platform, and page content.

I recently saw the front page of the New York Times misconverted because they didn't specify an encoding, and the only UTF-8 sequence near the beginning of the document was in

   <link rel="apple-touch-icon-precomposed" sizes="144×144" ...

The "×" there is not the letter x, it's the Unicode multiplication symbol. This confused an encoding guesser. Don't go there.

[1] http://www.w3.org/TR/html5/syntax.html#determining-the-chara...

icebraining · on June 27, 2015

You can also add "; charset=utf-8" to your Content-Type header instead.

Animats · on June 27, 2015

Yes, but then it's separated from the document. If someone saves the file, they lose the charset info.

How reliable are CDNs and caching servers about preserving Content-Type headers?

nrinaudo · on June 27, 2015

While this is true, I find the meta tag to be a horrible pain in the ass.

If you have to parse some HTML that you get over an HTTP connection - you're writing a crawler, say, or you want to extract RDFa metadata, you have to deal with the following, surprisingly common case: both the header and the html document contain encoding information, and they disagree. The RFC states that you should trust the header, but in practice, that's certainly not always the case - nor even, in my experience, is it the case the majority of the time.

If you decide to use the meta tag, that means that you have already strarted decoding your byte stream, get the encoding, then need to re-interpret the bytes you've already read. I have seen a lot of pages that declared their encoding after the title tag.

What's worse, you can't know whether you have a meta tag until you've parsed the whole head, which can be huge with hundreds of kilobytes of inlined javascript and css.

The argument that you should just read the first 1024 bytes and assume utf-8 if nothing is found is just not satisfactory - I want the encoding of the documents I'm parsing to be correct all the time, not when the remote host follows the rules. If I'm writing a crawler, the remote host cares nothing about my needs and I'm the one suffering from my unwilingness to be flexible.

So, yeah. Don't use the meta encoding tag, and trust your user agent to save the html code in a sane (utf-8) encoding. There is no reason to store encoding information in an html file, just like I doubt your source code always starts with a preprocessor instruction declaring the encoding that the compiler should use.

JupiterMoon · on June 27, 2015

Hang on isn't this argument basically: "lots of people don't follow the standards now so no one should bother?"

nrinaudo · on June 28, 2015

Well, not exactly. If people truly followed the standards, there would be no need for the meta charset element: the RFC clearly states that encoding should either be specified in the header or default to iso latin 1. I can't recall whether it makes provisions for media type specific default charsets, but either way, if you follow the HTTP standard, you should not specify content encoding in your text document (this of course does not apply to binary formats that migh encode text).

So, to be a bit pedantic about it, my argument is that you should follow the standards and ignore / work around the hacks used to make life easier for people that don't know / don't care about encoding.

Note that I do not mean that as condescending - at some point, a lot of designers were writing HTML manually, and I don't expect them to know about encoding, just the same as they hopefuly don't expect me to know about... design stuff I'm really terrible at.

JupiterMoon · on June 28, 2015

Fair enough. Not condescending. I don't know much about HTML tbh.

However, if I were faced with your situation I would try to use whatever logic is used by firefox or chromium to work out encoding. After all designers are going to (should) test if things work on one/both of these right?

yaph · on June 27, 2015

> There is no reason to store encoding information in an html file.

That's simply wrong. If you use libraries like D3.js, that contain non-ASCII characters in the source code, and you do local development with a server that sends no encoding headers or even without using a server at all, your code won't work.

nrinaudo · on June 27, 2015

That is, as far as I know, not true.

If your file is in html5, then all browsers will assume a default encoding of utf-8, which is what you should be using anyway - unless you have a very good reason, such as your file contains a majority of kanji or kana.

If you're using something older, the official defaut is iso latin 1, but I believe all modern browsers will try utf-8 first - this is not something I've verified for myself, it works on my setup but I also configured my os to use a default utf-8 encoding, so I can't tell for sure.

I'm also unclear how this relates to d3.js - that's a javascript import, not an html one. Or do you mean inline javascript that uses d3.js?

icebraining · on June 27, 2015

I have to admit I didn't consider the use case of saving the HTML page.

Regarding CDNs, I know that Cloudflare passes them as-is, I don't know about the others.

PeterisP · on June 27, 2015

We work with a lot of multilingual text, and for "what to know about encodings and character sets" we have a very simple answer to that - a guideline called "use UTF8 or die".

It's not suitable for absolutely everyone (e.g. if you have a lot of Asian text then you may want a different encoding), but for our use case every single deviation causes lot of headache, risks and unneccessary work in fixing garbage data.

In simplistic terms what we mean by this guideline:

* in your app, 100% all human text should be stored UTF8 only, no exceptions. If you need to deal with data in other encodings - other databases, old documents, whatever - make a wrapper/interface that takes some resource ID and returns the data with everything properly converted to UTF8; and has no way (or at least no convenient way) to access the original bytes in another encoding.

* in all persistence layers, store text only as UTF8. If at all possible, don't even provide options to export files in other encodings. If legacy/foreign data stores need another encoding, then in your code never have an API that requires a programmer to pass data in that encoding - the functions "store_document_to_the_mainframe_in_EBCDIC" and "send_cardholder_data_to_that_weird_CC_gateway" should take UTF8 strings only and handle the encodings themselves.

* in all [semi-]public API, treat text as UTF8-only and document that. If your API documentation mentions a text field, state the encoding so that there is no guessing or assuming by anyone.

* in all system configuration, set UTF8 as the default whenever possible. A database server? Make sure that any new databases/tables/text fields will have UTF8 set as the default, so unless someone takes explicit action then user-local-language encodings won't accidentally appear.

* Whoever introduces a single byte of different encoding data is responsible for fixing the resulting mess. This is the key part. Did you write a data input function that passed on data in the user computer default encoding; tested it only on US-ASCII nonenglish symbols; and got a bunch of garbage data stored? You're responsible for finding the relevant entries and fixing them, not only your code. Used a third party library that crashes or loses data when passed non-english unicode symbols? Either fix the library (if it's OS) or rewrite code to use a different one.

peapicker · on June 27, 2015

From the article: "Overall, Unicode is yet another encoding scheme."

It is more than that - for instance, it includes algorithms as well... for instance, dealing with RTL languages with ordering and shaping rules (i.e. Arabic), how to know what to do when RTL languages are mixed with LTR (is that '.' at the end of '123' a decimal point, or a period? (determines if it goes to the right or the left or the sequence)) and how to know when data is equivalent despite being normalized or not, etc...

scottfr · on June 27, 2015

The linked article by Joel Spolksy is also great:

http://www.joelonsoftware.com/articles/Unicode.html

r4ferrei · on June 27, 2015

It's a great article indeed. I think it was after I read this one that I really started to understand what was going on with all that encoding stuff that I was already used to doing. Funny to look back.

ninjakeyboard · on June 27, 2015

I hate that I have to know this stuff. Working on implementing a spec today where handling character encoding is a requirement.

vorg · on June 27, 2015

> It basically defines a ginormous table of 1,114,112 code points that can be used for all sorts of letters and symbols. That's plenty to encode all existing, pre-historian and future characters mankind knows about. There's even an unofficial section for Klingon. Indeed, Unicode is big enough to allow for unofficial, private-use areas.

The private use areas only encode about 137,000 codepoints (U+e000 to U+f8ff & U+f0000 to U+10ffff) and are running out quickly. Most of U+e000 to U+f8ff is used by many different private agreements, and some pseudo-public ones like the Conscript registry which encodes Klingon, linked to in the article. Conscript also uses a large chunk of plane F to encode the constructed script Kinya, i.e. the 3696 codepoints in U+f0000 to U+f0e6f, see http://www.kreativekorp.com/ucsur/charts/PDF/UF0000.pdf . It takes up so much room because it's a block script like Korean Hangul and is encoded by formula just like Hangul. Each Korean Hangul block is made up of 2 or 3 jamo: one of 19 leading consonants, one of 21 vowels, and optionally one of 27 trailing consonants, giving a total of 19 * 21 * 28 = 11,172 possible syllable blocks, generated by formula into range U+ac00 to U+d7a3. Kinya also uses such a formula to generate its script, and I'm sure many other constructed block scripts will make their way into the quasi-official Conscript Registry. I'm even working on one of my own.

In fact, rather than filling up U+f0000 to U+10ffff, such conscripts only need to fill up the first quarter of it (i.e. U+f0000 to U+f7fff) for Unicode to run out of private use space, because the remainder (U+f8000 to U+10ffff) is needed for a second-tier surrogate system (see https://github.com/gavingroovygrover/utf88 ) to extend the codepoint space back up to 1 billionish codepoints as it was originally specified by Pike and Thompson until it was clipped back down to 1 million in 2003.

So Unicode is not "plenty to encode" or "big enough to allow for" all known, future, or private-use characters.

jekub · on June 27, 2015

> see https://github.com/gavingroovygrover/utf88

This is the most stupid way to extend UTF-8 I've seen. The only acceptable solution is to remove the restriction of using only four byte per sequence which would allow to encode these easily keeping all the advantages of UTF-8.

Doing it like they does add an additional layer of encoding and so a lot of complexity a room for bugs.

It was probably made for compatibility but a lot of software will do bad thing with these new "surrogate" pairs so this solution is not really more compatible in practice. And updating software to handle UTF-8 sequence longer than 4 bytes is a lot more easier than updating them to handle such encoding.

vorg · on June 27, 2015

> The only acceptable solution is to remove the restriction of using only four byte per sequence which would allow to encode these easily keeping all the advantages of UTF-8

I agree. Extending UTF-8 with surrogates like this is intended to be temporary, only used until the pre-2003 2.1 billion codepoint limit for UTF-8 and UTF-32 is reinstated by the Unicode Consortium. Then any software using UTF-88 can easily swap the encoding to the 1 to 6-byte sequences in "reinstated" UTF-8. This surrogation scheme is actually intended for UTF-16 to use as a second-tier surrogate scheme so it can encode the same number of codepoints as UTF-8 and UTF-32. I wrote all this under "Rationale" at the bottom of the linked page, did you read that far?

Hopefully, though, UTF-16 will be on its way out when pre-2003 UTF-8 and UTF-32 are reinstated so this surrogation scheme wouldn't even see much use there.

jekub · on June 27, 2015

But "temporary" is a thing who exists only in theory. In practice its always never or (almost) forever. As soon as a few applications start using this "new" form of UTF-8, some of them may have to keep supporting it forever.

Why not directly going for the pre-2003 UTF-8 encoding ? It would even put a bit of pressure for restoring them and would show that this is the good way. It is also the only way I think to convince people to start implementing it.

vorg · on June 27, 2015

> As soon as a few applications start using this "new" form of UTF-8, some of them may have to keep supporting it forever

Not if it's used through a 3rd-party library such as the Go-implementation of UTF-88 I've provided.

> Why not directly going for the pre-2003 UTF-8 encoding ? It would even put a bit of pressure for restoring them

Because it's not a valid encoding under the current scheme, whereas using surrogates with UTF-8 is, using as it does the 2 private use planes to implement the surrogates. The goal is for restoration by the Unicode Consortium, but based on their public utterances it's not going to happen easily or quickly, and in the meantime we need an encoding that's valid under the current scheme because it may need to be used for 10 or 20 years. Of course I could have used UTF-16 with a doubly-directed surrogate system but that would be even more error-prone, and I expect whatever 2nd-level surrogate system is eventually provided with UTF-16 will be legally available with UTF-8 and UTF-16 anyway.

UTF-88 is an attempt to showcase both a surrogation scheme implementable in current UTF-16 and the fact that UTF-8 is the best encoding.

keedot · on June 27, 2015

Interesting that I actually don't need to know this stuff. I think you'll find that MOST developers actually don't need to know this stuff. People seem to forget that the vast bulk of developers are for corporate and in house development, single language, being English.

I know this stuff because I like to understand how this works, but for all the dev's under me, there are probably a thousand concepts that I want them to understand before they start tackling encoding beyond knowing when to call the correct function.

vanous · on June 27, 2015

As being from a non English speaking country I have had to deal with many instances of issues caused by this way of thinking. Before unicode happened to be widely applied, using Linux was 100% harder for us. So sometimes it may see that some things aren't fundamental but they might actually be, you simply don't know it yet.

flohofwoe · on June 27, 2015

...and then you get strange bug reports that your code doesn't work on Windows machines of co-workers from other countries because they have non-ASCII characters in their login name, or worse, the problem makes it out into the wild (http://www.eurogamer.net/articles/2015-04-14-windows-usernam...).

I would actually recommend this text http://utf8everywhere.org/ over the OP, it offers a much simplified view and solution on the mess that was international text encoding in the past. It's better to just forget about codepages and ASCII, and just use UTF-8 everywhere.

icebraining · on June 27, 2015

Well, yes, I'd they know that texts have an encoding, how to detect it, and what functions to call to convert between them, I agree they don't need to understand how they actually work.

That said, I think the "we work in English" is a mistaken belief. Clients, employees, products, suppliers, etc, all will have names with non-ASCII characters.

My name is André; send me a letter or email with my name garbled, and I will mock your company publicly and, if you are a tech company, heavily consider dropping you for incompetence.

justincormack · on June 27, 2015

And then you acquire a Japanese subsidary, or get bought by a Korean company, or hire someone with a Polish name, or you work in a country that has an official language that is not representable in ASCII, such as New Zealand.

imaok · on June 27, 2015

One thing I'm still confused about. What exactly is happening when you copy paste some text from one app to another? What encoding will the copied text be in?

JoachimS · on June 27, 2015

Good, gentle introduction that goes through everything step by step. Turns to php at the end.

carsonreinke · on June 27, 2015

...never assume one byte per glyph

SilasX · on June 27, 2015

Sorry, but now I reflexively flag-on-sight any instance if this clickbaity, obviously overstated "every programmer needs to know about semiconductor opcodes/mainframe architecture/etc".

SFjulie1 · on June 27, 2015

PHP devs are so slow they just adopted utf8 and see its glory.

I myself "UTF8 or die!"d a long time ago and discovered it was not a good idea.

I will forget the problem of the parsing of the nth character, the string length vs the memory used, the canonization of strings for comparison. And go directly to 2 problems:

* There exists cases in which latin1 & utf8 are mangled in a same string. (ex http/SIP headers are latin1 and content can be utf8 and you may want to store a full HTTP/SIP transaction verbatim for debugging purpose), and it can store in iso-latin3 (code table for esperanto to be sarcastic), but will explode in utf8 unless you rencode it (B64)

* tools are partly UTF8 compliant: mysql (which is as good as PHP in terms of quality) is clueless about UTF8 (hint index and collation), and PHP too https://bugs.php.net/bug.php?id=18556 <--- THIS BUG TOOK 10 YEARS TO BE CLOSED

The whole point is developer don't understand the organic nature of culture, and especially of its writing and the diversity of culture.

They think that because some rules applies in their language it also applies in others: BUT

* PHP devs: lowercase of I is not always i (it can be i without a dot). It took 10 years to the dev to find where their bug was! * shortening a memory representation does not always shorten its graphical representation (apples bug with sms in arabian) * sort orders are not immutable (collation not only can vary from language to language but also according to the administrative context (ex: proper name in french)) * inflexions are hell and text size for error varies a lot (hence the unstability of windows 95 in french because error message where copied in a reserved page and the fixed size was less than the whole size of the domain's corpus... hence any contiguous block in memory (lower xor upper bound) would have its memory potentially corrupted)).

My point is UTF8 is not hell. Real world is complex. And it becomes hell when some dumb devs thinking that by manipulating strings that represents any language they know about any language.. and apply universal rules that are not.

Some problems can be solved by ignoring them. But with culture it is not the case.

And actually, unicode SUX because it is US centrics

* computers should be able to store all our pasts books and make them available for the future, even in verbatim mode. But unicode HAS not archeological character sets like THIS https://fr.wikipedia.org/wiki/S_long I don't care about the USA lack of history. I see no use in the computer if it requires to sacrifice our histories and cultures, * https://modelviewculture.com/pieces/i-can-text-you-a-pile-of... some people cannot even use it in their own language

Unicode suffers a lot of problem plus a conceptual one; it is has immutable characters AND directives (change the direction of the writing, apply ligatures)... that not only will create security concerns (one of the funniest being the possibility by adding a string to reedit silently text in on an effector (screen or printer))... We are introducing type setting rules in unicode.

For those who have used tex since a long time, the non separation of the almost programmatical typography and the graphens is like not separing the model and the controller.

Which actually also calls for the view (the effection) and thus the fonts. Having the encoding of the Slong does not tell you what it looks like unless you have a canonical representation of the codepoint as a graphem.

And since we are printing/creating document for juridical purpose we may like to control the view that ensures that the mapping of the string representation will not alter graphical representation in a way that can compromise its meaning. If someone signs in a box you don't want the signature to alter the representation anywhere or worse without notice.

The devil lies in the detail. Unicode is a Babel tower that may well crash for the same reason as in the bible: hubris.