We all draw the line differently for features that are uncommon yet not rare enough to remove, so context always matters.
On the subject of character encodings, browsers have already been shrinking their repertoire of supported encodings for a good chunk of the last decade: the Encoding Standard [1] is substantially smaller than it used to be (I know because I implemented all of them back then). Some of them (e.g. UTF-7, not exactly "removed" from the Standard because AFAIK it was never added) caused security issues, but others (e.g. HZ) didn't and were removed only for lack of usage. This kind of decision, for or against, can only be backed by telemetry-like quantification anyway. No criticism in the issue was based on concrete evidence; Henri Sivonen is right to say that he "can't address remaining non-Latin problems without seeing concrete examples that need addressing".
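To see what that smaller repertoire means in practice: the Encoding API exposed to pages mirrors the Standard's label table, so a rough check from a browser console looks something like this (illustrative labels only, not an exhaustive list):

    // Probe a few encoding labels against the Encoding Standard via TextDecoder.
    // Labels the Standard never had (utf-7) or that map to the "replacement"
    // encoding (hz-gb-2312) are rejected with a RangeError.
    for (const label of ["utf-8", "shift_jis", "gb18030", "hz-gb-2312", "utf-7"]) {
      try {
        const dec = new TextDecoder(label);
        console.log(label, "->", dec.encoding);       // still in the Standard
      } catch (e) {
        console.log(label, "-> rejected:", String(e)); // dropped or never added
      }
    }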
> Around thirty years ago I was a kid with a computer. I learned to program quite a few years before I learned English. I also used DOS without understanding English. I knew what to type to do things, but I didn't know what the words meant. I could start programs, I'd play in QBASIC, write small programs and amusements. To me "PRINT" was the word that made text appear on the screen. I learned years later the word meant something in English.
Wow, this is exactly what happened to me (probably between 8 and 10). I had an old 286, where I learnt BASIC by reading the programs already available on it.
I wrote programs, but had no idea what the keywords meant. I remember reading "IF … THEN … ELSE" out loud, which I pronounced "if … ten … elsse", and my father, who was in the room, corrected my pronunciation: "it's if … then … else".
I was very surprised, because he didn't know BASIC at all. "How do you know this?"
"It's how we pronounced it in English. It means 'si … alors … sinon'." o_o
It was a revelation to me: the "keywords" I typed in BASIC programs were not just meaningless tokens of the BASIC language; they came from English and had a meaning outside BASIC.
Addressing the remaining non-Latin problems is extremely easy: just give users the ability to choose an encoding. Or rather, do not remove it.
Note that they are not actually removing any encodings themselves, so the argument from security / attack surface doesn't arise: the website can still send text in any of those encodings (or malformed but autodetected as such). This is purely about UX.
> Note that they are not actually removing any encodings themselves, so the argument from security / attack surface doesn't arise: the website can still send text in any of those encodings (or malformed but autodetected as such).
No. If the visitor can freely choose the page encoding, then an attacker can lure them into doing so by presenting seemingly malformed partial text. That still lets the attacker control the victim page's encoding, just less effectively.
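To make that concrete: the same bytes are a different document under a different decoder, so whoever crafts the bytes gets a say in what the switched view reads as. A tiny console sketch (illustrative bytes, not taken from any real page):

    // One byte sequence, two readings.
    const bytes = new Uint8Array([0x1b, 0x24, 0x42, 0x21, 0x21, 0x1b, 0x28, 0x42]);

    // As windows-1252 it renders as the junk "$B!!(B" (plus two invisible ESC
    // control characters), exactly the kind of mojibake that tempts a user to
    // reach for an encoding override.
    console.log(new TextDecoder("windows-1252").decode(bytes));

    // As iso-2022-jp the very same bytes are a single ideographic space
    // (U+3000); the ASCII-looking characters vanish entirely.
    console.log(new TextDecoder("iso-2022-jp").decode(bytes));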
---
I'm also very annoyed by the disproportionate amount of criticism Firefox receives here compared to Chrome, because Chrome proactively removed the encoding selection UI well before Firefox tried anything similar. And, to elaborate on what I earlier called "telemetry-like quantification": Firefox concluded from telemetry [1] that the UI couldn't be removed completely without breaking a lot of existing pages, hence this "Repair Text Encoding" feature. It is hopelessly absurd to blame Firefox alone for this issue.
Chrome is Google's playground, so it's pointless to criticize; I have long since stopped having any expectations of usability or feature-completeness there. Firefox is (was?) supposed to be different.
The good thing is that we have Vivaldi, Opera's spiritual successor in many things including this one: if somebody needs a knob, it'll be there.
https://memex.marginalia.nu/log/36-localized-programming-lan...
[1] https://encoding.spec.whatwg.org/