
> You are incorrect here as well, try using some latin-1 special character like "ä" and you will see that browsers default to document.characterSet UTF-8 not windows-1252

I decided to try this experimentally. In my testing, if neither the server nor the page contents indicate that a file is UTF-8, the browser NEVER defaults to setting document.characterSet to UTF-8; instead it essentially always assumes "windows-1252" a.k.a. "latin1" (strictly speaking, windows-1252 is a superset of ISO-8859-1). Read on for my methodology, an exact copy of my test data, and some particular oddities at the end.

To begin, we have three '.html' files: one with ASCII-only characters, a second with two characters encoded specifically as latin1, and a third with those same two characters encoded as UTF-8. The two characters are:

    Ë - "Latin Capital Letter E with Diaeresis" - Latin1 encoding: 0xCB  - UTF-8 encoding: 0xC3 0x8B   - https://www.compart.com/en/unicode/U+00CB
    ¥ - "Yen Sign"                              - Latin1 encoding: 0xA5  - UTF-8 encoding: 0xC2 0xA5   - https://www.compart.com/en/unicode/U+00A5
To avoid copy-paste errors around encoding, I've dumped the contents of each file as "hexdumps", which you can transform back into their binary form by feeding the hexdump form into the command 'xxd -r -p -'.
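For example, if you save the first hexdump below to a file named ascii.hex (a name I'm using just for this example), this recreates the original ascii.html byte for byte:

    $ xxd -r -p - < ascii.hex > ascii.html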

    $ cat ascii.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b2041534349493c2f7469746c653e3c2f686561643e3c626f64793e
    3c68313e4e6f74206d75636820686572652c206a75737420706c61696e20
    746578743c2f68313e3c703e4d6f7265207465787420746861742773206e
    6f74207370656369616c3c2f703e3c2f626f64793e3c2f68746d6c3e0a
    $ cat latinone.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b206c6174696e313c2f7469746c653e3c2f686561643e3c626f6479
    3e3c68313e546869732069732061206c6174696e31206368617261637465
    7220307841353a20a53c2f68313e3c703e54686973206973206368617220
    307843423a20cb3c2f703e3c2f626f64793e3c2f68746d6c3e0a
    $ cat utf8.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b207574663820203c2f7469746c653e3c2f686561643e3c626f6479
    3e3c68313e54686973206973206120757466382020206368617261637465
    7220307841353a20c2a53c2f68313e3c703e546869732069732063686172
    203078433338423a20c38b3c2f703e3c2f626f64793e3c2f68746d6c3e0a
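As a sanity check, none of the three files declares its own encoding via a <meta charset> tag (or anything else containing the string "charset"), which grep can confirm by printing a per-file count of matching lines:

    $ grep -ci 'charset' *.html
    ascii.html:0
    latinone.html:0
    utf8.html:0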
The full contents of my current folder are as follows:

    $ ls -a .
    .  ..  ascii.html  latinone.html  utf8.html
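Now that we have our test files, we can serve them via a very basic HTTP server. Python's built-in http.server module is enough; a minimal invocation, assuming Python 3 is on the PATH (this module is what emits the "SimpleHTTP/0.6 Python/3.10.7" Server header seen below):

    $ python3 -m http.server 8000 --bind 127.0.0.1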
Before involving a browser, we must verify that the server's responses do not declare a charset. The server does send a Content-type: text/html header, but as long as that header carries no charset parameter, the browser still has to guess based on nothing but the contents of the file. So, we check that the server is not being well-intentioned and declaring an encoding on our behalf:

    $ curl -s -vvv 'http://127.0.0.1:8000/ascii.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /ascii.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html

    $ curl -s -vvv 'http://127.0.0.1:8000/latinone.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /latinone.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html

    $ curl -s -vvv 'http://127.0.0.1:8000/utf8.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /utf8.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html
Now we've verified that the server won't muddle our observations by doing its own detection, so the results from the browser should tell us conclusively whether the presence of a latin1 character causes it to use UTF-8. To test, I loaded each page in Firefox and Chromium and checked what `document.characterSet` said.

    Firefox (v116.0.3):
        http://127.0.0.1:8000/ascii.html     result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/latinone.html  result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/utf8.html      result of `document.characterSet`: "windows-1252"

    Chromium (v115.0.5790.170):
        http://127.0.0.1:8000/ascii.html     result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/latinone.html  result of `document.characterSet`: "macintosh"
        http://127.0.0.1:8000/utf8.html      result of `document.characterSet`: "windows-1252"

So in my testing, neither browser EVER guesses that any of these pages are UTF-8. Both seem to default to assuming that, if no charset is set in the document or in the headers, the encoding is "windows-1252" (bar latinone.html, whose latin1 characters bizarrely caused Chromium to guess "macintosh"?). Also note that adding the exact character you proposed (ä) to the text body still doesn't cause either browser to start assuming everything is UTF-8; the only change is that Chromium then decides latinone.html is also "windows-1252" instead of "macintosh".
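For reference, the byte values of that character, in the same format as the table above:

    ä - "Latin Small Letter A with Diaeresis"    - Latin1 encoding: 0xE4  - UTF-8 encoding: 0xC3 0xA4   - https://www.compart.com/en/unicode/U+00E4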


