Answering your "do you have a source" question, yeah: "the entire history of the web prior to HTML5's release", which the internet has already forgotten is a rather recent thing (2008). And even then, it took a while for HTML5 to become the de facto format, because it took the majority of the web years before they'd changed over their tooling from HTML 4.01 to HTML5.
> This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text
No, but I will create an HTML file with latin-1 text, because that's what we're discussing: HTML files that don't use UTF-8 (and so by definition don't contain UTF-8 either).
While modern browsers will guess the encoding by examining the content, an HTML file that contains nothing but plain text won't magically get interpreted as UTF-8: create a file with `<html><head><title>encoding check</title></head><body><h1>Not much here, just plain text</h1><p>More text that's not special</p></body></html>` in it, load it in your browser through an HTTP server (e.g. `python -m http.server`), and then hit up the dev tools console and look at `document.characterSet`.
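Spelled out as a runnable sketch (the file name check.html is arbitrary):

$ echo "<html><head><title>encoding check</title></head><body><h1>Not much here, just plain text</h1><p>More text that's not special</p></body></html>" > check.html
$ python -m http.server 8000

Then browse to http://localhost:8000/check.html and evaluate `document.characterSet` in the console.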
Both Firefox and Chrome give me "windows-1252" on Windows; the "windows" part of the name is of course irrelevant here. What matters is what it's not: it's not UTF-8, because the content has nothing in it to warrant UTF-8.
Chromium (and I'm sure other browsers, but I didn't test) will sniff character set heuristically regardless of the HTML version or quirks mode. It's happy to choose UTF-8 if it sees something UTF-8-like in there. I don't know how to square this with your earlier claim of "Browsers don't use utf-8 unless you tell them to."
That is, the following UTF-8 encoded .html files all produce document.characterSet == "UTF-8" and render as expected without mojibake, despite not saying anything about UTF-8. Change "ä" to "a" to get windows-1252 again.
<html>ä
<!DOCTYPE html><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>ä
Put those documents into your url bar as data: URLs (e.g. data:text/html,<html>ä) and inspect that; it avoids the server messing with encoding values. And yes, here on my Linux machine in Firefox it is windows-1252 too.
(You can type the complete document, but <html> is sufficient. Browsers autocomplete a valid document. BTW, data:text/html,<html contenteditable> is something I use quite a lot)
But yeah, I think windows-1252 is standard for quirks mode, for historical reasons.
Well, I'm on Linux, with the system encoding set to UTF-8, which is pretty much standard there.
But I think the "windows-1252 for quirks" is just driven by what was dominant back when the majority of quirky HTML was generated decades ago.
The historical (and present?) default is to use the local character set, which on US Windows is Windows-1252, but for example on Japanese Windows is Shift-JIS. The expectation is that users will tend to view web pages from their region.
I'm in Japan on a Mac with the OS language set to Japanese. Safari gives me Shift_JIS, but Chrome and Firefox give me windows-1252.
edit: Trying data:text/html,<html>日本語 makes Chrome also use Shift_JIS, resulting in mojibake as it's actually UTF-8. Firefox shows a warning about it guessing the character set, and then it chooses windows-1252 and displays more garbage.
Okay, it's good that we agree then on my original premise: the vast majority of websites (by quantity and popularity) on the Internet today use UTF-8 encoding, and Latin-1 is being phased out.
Btw, I appreciate your edited response, but you were still factually incorrect about:
> Browsers don't use utf-8 unless you tell them to
Browsers can use UTF-8 even if we don't tell them. I am already aware of the extra heuristics you wrote about.
> HTML file with latin-1 ... which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8
You are incorrect here as well: try using some latin-1 special character like "ä" and you will see that browsers default to a document.characterSet of UTF-8, not windows-1252.
> You are incorrect here as well: try using some latin-1 special character like "ä" and you will see that browsers default to a document.characterSet of UTF-8, not windows-1252.
I decided to try this experimentally. In my findings, if neither the server nor the page contents indicate that a file is UTF-8, then the browser NEVER defaults to setting document.characterSet to UTF-8, instead basically always assuming that it's "windows-1252" a.k.a. "latin1". Read on for my methodology, an exact copy of my test data, and some particular oddities at the end.
To begin, we have three '.html' files: one with ASCII-only characters, a second with two characters that are specifically latin1 encoded, and a third with those same two characters encoded as UTF-8. The two characters are:
Ë - "Latin Capital Letter E with Diaeresis" - Latin1 encoding: 0xCB - UTF-8 encoding: 0xC3 0x8B - https://www.compart.com/en/unicode/U+00CB
¥ - "Yen Sign" - Latin1 encoding: 0xA5 - UTF-8 encoding: 0xC2 0xA5 - https://www.compart.com/en/unicode/U+00A5
To avoid copy-paste errors around encoding, I've dumped the contents of each file as "hexdumps", which you can transform back into their binary form by feeding the hexdump form into the command 'xxd -r -p -'.
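For illustration, minimal files with the same encoding properties could be created like so (the bodies here are just examples, not my full test data):

$ printf 'just plain ascii text' > ascii.html        # ASCII-only body
$ printf 'cb a5' | xxd -r -p - > latinone.html       # Ë (0xCB) and ¥ (0xA5) as raw latin1 bytes
$ printf 'c3 8b c2 a5' | xxd -r -p - > utf8.html     # the same two characters, UTF-8 encoded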
The full contents of my current folder are as follows:
$ ls -a .
. .. ascii.html latinone.html utf8.html
Now that we have our test files, we can serve them via a very basic HTTP server. But first, we must verify that the responses from the HTTP server don't contain a header declaring a character encoding; we want the browser to have to make a guess based on nothing but the contents of the file. So, we run the server and check that it's not being well intentioned and declaring a charset on our behalf:
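A minimal sketch of that check, assuming python's built-in http.server on its default port:

$ python -m http.server 8000 &
$ curl -sI http://127.0.0.1:8000/latinone.html

The Content-Type header that comes back is just "text/html", guessed from the file extension, with no charset parameter, so the browser gets no encoding hint from the server.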
Now we've verified that we won't have our observations muddled by the server doing its own detection, so our results from the browser should be able to tell us conclusively if the presence of a latin1 character causes the browser to use UTF-8 encoding. To test, I loaded each web page in Firefox and Chromium and checked what `document.characterSet` said.
Firefox (v116.0.3):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
Chromium (v115.0.5790.170):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "macintosh"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
So in my testing, neither browser EVER guesses that any of these pages are UTF-8; both browsers mostly default to assuming that if no character set is declared in the document or in the headers, then the encoding is "windows-1252" (bar the latin1 file, which bizarrely caused Chromium to guess that it's "macintosh" encoded?). Also note that if I add the exact character you proposed (ä) to the text body, it still doesn't cause the browser to start assuming everything is UTF-8; the only change is that Chromium starts to think the latinone.html file is also "windows-1252" instead of "macintosh".
> Both Firefox and Chrome give me "windows-1252" on Windows; the "windows" part of the name is of course irrelevant here. What matters is what it's not: it's not UTF-8, because the content has nothing in it to warrant UTF-8.
While technically latin-1/iso-8859-1 is a different encoding from windows-1252, the HTML5 spec says browsers are supposed to treat latin1 as windows-1252.
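That mapping is easy to see firsthand: make a file that explicitly declares latin1 (the file name is just an example) and serve it as before:

$ printf '<html><head><meta charset="iso-8859-1"></head><body>test</body></html>' > declared-latin1.html

`document.characterSet` for that page comes back as "windows-1252", not "iso-8859-1", because the WHATWG Encoding standard maps the iso-8859-1 label onto the windows-1252 encoding.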