Answering your "do you have a source" question, yeah: "the entire history of the...

electroly · on Aug 21, 2023

Chromium (and I'm sure other browsers, but I didn't test) will sniff character set heuristically regardless of the HTML version or quirks mode. It's happy to choose UTF-8 if it sees something UTF-8-like in there. I don't know how to square this with your earlier claim of "Browsers don't use utf-8 unless you tell them to."

That is, the following UTF-8 encoded .html files all produce document.characterSet == "UTF-8" and render as expected without mojibake, despite not saying anything about UTF-8. Change "ä" to "a" to get windows-1252 again.

    <html>ä

    <!DOCTYPE html><html>ä

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html>ä

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>ä
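For anyone who wants to poke at why swapping "ä" for "a" flips the result, here is a rough Python sketch of the kind of check a sniffer could do. To be clear, this is not Chromium's actual detector; `looks_like_utf8` is a made-up helper that only asks whether the bytes contain anything non-ASCII and whether they decode cleanly as UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical stand-in for a sniffer, NOT Chromium's real heuristic."""
    if data.isascii():
        # Pure ASCII is byte-identical in UTF-8 and windows-1252,
        # so the bytes alone give no evidence either way.
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


with_umlaut = "<html>ä".encode("utf-8")   # b'<html>\xc3\xa4'
ascii_only = "<html>a".encode("utf-8")    # b'<html>a'
legacy = "<html>ä".encode("cp1252")       # b'<html>\xe4'

print(looks_like_utf8(with_umlaut))  # True
print(looks_like_utf8(ascii_only))   # False
print(looks_like_utf8(legacy))       # False
```

The ASCII-only case is the interesting one: there is simply nothing in the bytes to tell the two encodings apart, which fits with the fallback kicking in.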

capitainenemo · on Aug 21, 2023

A simpler test FWIW.. type:

   data:text/html,<html>

into your URL bar and inspect that. It avoids the server messing with encoding values. And yes, here on my Linux machine in Firefox it is windows-1252 too.

(You can type the complete document, but <html> is sufficient. Browsers autocomplete a valid document. BTW, data:text/html,<html contenteditable> is something I use quite a lot)

But yeah, I think windows-1252 is standard for quirks mode, for historical reasons.

slt2021 · on Aug 21, 2023

>data:text/html,<html contenteditable>

thank you, I learned nice trick today.

re windows-1252 - this could be driven by system encoding settings; for most people it is windows-1252, but for Eastern Europe it is windows-1251.

when viewed from IBM z mainframe - encoding will be something like IBM EBCDIC

capitainenemo · on Aug 22, 2023

Well, I'm on Linux - system encoding set to UTF-8 which is pretty much standard there. But I think the "windows-1252 for quirks" is just driven by what was dominant back when the majority of quirky HTML was generated decades ago.

layer8 · on Aug 21, 2023

The historical (and present?) default is to use the local character set, which on US Windows is Windows-1252, but for example on Japanese Windows is Shift-JIS. The expectation is that users will tend to view web pages from their region.

kalleboo · on Aug 22, 2023

I'm in Japan on a Mac with the OS language set to Japanese. Safari gives me Shift_JIS, but Chrome and Firefox give me windows-1252

edit: Trying data:text/html,<html>日本語 makes Chrome also use Shift_JIS, resulting in mojibake as it's actually UTF-8. Firefox shows a warning about it guessing the character set, and then it chooses windows-1252 and displays more garbage.
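The mojibake is easy to reproduce outside a browser. A small Python sketch, assuming (as stated above) that the bytes really are UTF-8, and using Python's own `shift_jis` and `cp1252` codecs, which may differ in small details from what browsers implement:

```python
text = "日本語"
raw = text.encode("utf-8")  # the bytes the page is said to actually contain

# Decode the same bytes with the encodings the browsers picked.
# errors="replace" because not every byte sequence is valid in Shift_JIS.
as_shift_jis = raw.decode("shift_jis", errors="replace")
as_cp1252 = raw.decode("cp1252", errors="replace")

print(raw.hex(" "))   # e6 97 a5 e6 9c ac e8 aa 9e
print(as_shift_jis)   # garbage: multi-byte pairs are split in the wrong places
print(as_cp1252)      # different garbage: one character per byte
print(raw.decode("utf-8") == text)  # True
```

Same nine bytes every time; only the decoder changes, so each wrong guess produces its own flavor of garbage.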

ko27 · on Aug 21, 2023

Okay, it's good that we agree then on my original premise: the vast majority of websites (by quantity and popularity) on the Internet today are using UTF-8 encoding, and Latin-1 is being phased out.

Btw I appreciate your edited response, but still you were factually incorrect about:

> Browsers don't use utf-8 unless you tell them to

Browsers can use UTF-8 even if we don't tell them. I am already aware of the extra heuristics you wrote about.

> HTML file with latin-1 ... which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8

You are incorrect here as well, try using some latin-1 special character like "ä" and you will see that browsers default to document.characterSet UTF-8 not windows-1252

lelandbatey · on Aug 21, 2023

> You are incorrect here as well, try using some latin-1 special character like "ä" and you will see that browsers default to document.characterSet UTF-8 not windows-1252

I decided to try this experimentally. In my findings, if neither the server nor the page contents indicate that a file is UTF-8, then the browser NEVER defaults to setting document.characterSet to UTF-8, instead basically always assuming that it's "windows-1252" a.k.a. "latin1". Read on for my methodology, an exact copy of my test data, and some particular oddities at the end.

To begin, we have three '.html' files: one with ASCII-only characters, a second with two separate characters that are specifically latin1 encoded, and a third with those same two characters but encoded using UTF-8. Those two characters are:

    Ë - "Latin Capital Letter E with Diaeresis" - Latin1 encoding: 0xCB  - UTF-8 encoding: 0xC3 0x8B   - https://www.compart.com/en/unicode/U+00CB
    ¥ - "Yen Sign"                              - Latin1 encoding: 0xA5  - UTF-8 encoding: 0xC2 0xA5   - https://www.compart.com/en/unicode/U+00A5
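The byte values in that table can be double-checked with a couple of lines of Python. This is just a sanity check using the standard codecs; `encodings_of` is a helper name made up for this example.

```python
def encodings_of(ch: str) -> tuple[str, str]:
    """Return (latin-1 hex, UTF-8 hex) for a single character."""
    return ch.encode("latin-1").hex(), ch.encode("utf-8").hex()


for ch in ("Ë", "¥"):
    latin1_hex, utf8_hex = encodings_of(ch)
    print(f"U+{ord(ch):04X} latin1={latin1_hex} utf8={utf8_hex}")
# U+00CB latin1=cb utf8=c38b
# U+00A5 latin1=a5 utf8=c2a5
```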

To avoid copy-paste errors around encoding, I've dumped the contents of each file as "hexdumps", which you can transform back into their binary form by feeding the hexdump form into the command 'xxd -r -p -'.

    $ cat ascii.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b2041534349493c2f7469746c653e3c2f686561643e3c626f64793e
    3c68313e4e6f74206d75636820686572652c206a75737420706c61696e20
    746578743c2f68313e3c703e4d6f7265207465787420746861742773206e
    6f74207370656369616c3c2f703e3c2f626f64793e3c2f68746d6c3e0a
    $ cat latinone.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b206c6174696e313c2f7469746c653e3c2f686561643e3c626f6479
    3e3c68313e546869732069732061206c6174696e31206368617261637465
    7220307841353a20a53c2f68313e3c703e54686973206973206368617220
    307843423a20cb3c2f703e3c2f626f64793e3c2f68746d6c3e0a
    $ cat utf8.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b207574663820203c2f7469746c653e3c2f686561643e3c626f6479
    3e3c68313e54686973206973206120757466382020206368617261637465
    7220307841353a20c2a53c2f68313e3c703e546869732069732063686172
    203078433338423a20c38b3c2f703e3c2f626f64793e3c2f68746d6c3e0a
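If xxd isn't handy, the same round trip can be sketched in Python. `from_hexdump` is a name invented for this example; it only handles the plain `xxd -p` style shown above (hex digits and whitespace, no offsets).

```python
def from_hexdump(dump: str) -> bytes:
    """Turn plain `xxd -p` output back into bytes (like `xxd -r -p`)."""
    # Drop all whitespace, including the line breaks xxd inserts.
    return bytes.fromhex("".join(dump.split()))


# The last line of the latinone.html dump above.
last_line_of_latinone = """
307843423a20cb3c2f703e3c2f626f64793e3c2f68746d6c3e0a
"""
data = from_hexdump(last_line_of_latinone)
print(data)  # b'0xCB: \xcb</p></body></html>\n'
```

To rebuild a whole file, pass the full dump and write the result in binary mode (for example `pathlib.Path("latinone.html").write_bytes(...)`) so nothing re-encodes the bytes on the way to disk.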

The full contents of my current folder are as follows:

    $ ls -a .
    .  ..  ascii.html  latinone.html  utf8.html

Now that we have our test files, we can serve them via a very basic HTTP server. But first, we must verify that responses from the HTTP server do not contain a header declaring the character encoding (i.e. no charset parameter on Content-type); we want the browser to have to make a guess based on nothing but the contents of the file. So, we run the server and check to make sure it's not being well intentioned and guessing the charset:

    $ curl -s -vvv 'http://127.0.0.1:8000/ascii.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /ascii.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html

    $ curl -s -vvv 'http://127.0.0.1:8000/latinone.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /latinone.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html

    $ curl -s -vvv 'http://127.0.0.1:8000/utf8.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /utf8.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html
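The thing to look for in those headers is a charset parameter on Content-type. As a sketch, the same check can be done in Python with the standard library's header parser; `declared_charset` is a hypothetical helper for this example, and it only covers the HTTP header, not a `<meta charset>` tag inside the document.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


print(declared_charset("text/html"))                 # None
print(declared_charset("text/html; charset=UTF-8"))  # utf-8
```

A `None` for `text/html` matches the curl output above: the server states the media type but says nothing about the encoding, so the browser is left to guess.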

Now we've verified that we won't have our observations muddled by the server doing its own detection, so our results from the browser should be able to tell us conclusively if the presence of a latin1 character causes the browser to use UTF-8 encoding. To test, I loaded each web page in Firefox and Chromium and checked what `document.characterSet` said.

    Firefox (v116.0.3):
        http://127.0.0.1:8000/ascii.html     result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/latinone.html  result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/utf8.html      result of `document.characterSet`: "windows-1252"

    Chromium (v115.0.5790.170):
        http://127.0.0.1:8000/ascii.html     result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/latinone.html  result of `document.characterSet`: "macintosh"
        http://127.0.0.1:8000/utf8.html      result of `document.characterSet`: "windows-1252"

So in my testing, neither browser EVER guesses that any of these pages are UTF-8. Both seem to mostly default to assuming that if no charset is set in the document or in the headers then the encoding is "windows-1252" (bar Chromium and the Latin1 characters, which bizarrely caused Chromium to guess that it's "macintosh" encoded?). Also note that if I add the exact character you proposed (ä) to the text body, it still doesn't cause the browser to start assuming everything is UTF-8; the only change is that Chromium starts to think the latinone.html file is also "windows-1252" instead of "macintosh".

bawolff · on Aug 21, 2023

> Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.

While technically latin-1/iso-8859-1 is a different encoding than windows-1252, the HTML5 spec says browsers are supposed to treat latin1 as windows-1252.
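To make the "different encoding" part concrete, here is a quick Python comparison of the two codecs byte by byte. This uses Python's `latin-1` and `cp1252` codecs, which is an assumption worth flagging: Python leaves a few cp1252 bytes undefined (hence `errors="replace"`), so this illustrates where the two differ rather than reproducing browser behavior exactly.

```python
# Collect every byte value that decodes differently in the two codecs.
differing = [
    b for b in range(256)
    if bytes([b]).decode("latin-1")
    != bytes([b]).decode("cp1252", errors="replace")
]

print(f"{differing[0]:#x}..{differing[-1]:#x}")  # 0x80..0x9f
print(repr(b"\x80".decode("latin-1")))  # '\x80' (an invisible C1 control)
print(b"\x80".decode("cp1252"))         # €
```

Everything outside 0x80..0x9f is identical, which is why the two names get used interchangeably in this thread: "ä", "Ë" and "¥" all sit in the shared range.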