
That's true, but you misunderstood what I meant.

The parent comment seemed to be implying that we should drop support for non-utf8 charsets.

To me, that sounds like saying a website with 'charset=EUC-JP' (such as http://www.os2.jp/) should break, i.e. browsers should error out or display a wall of black boxes because it uses a non-UTF-8 encoding.

I'm claiming the only reason the author thinks that's really viable is that in our western-centric world we mostly see ASCII and UTF-8, both of which still look fine if you flip to UTF-8 only.

CJK websites that use their own legacy charsets (their equivalent of ASCII), on the other hand, would have to be manually upgraded to display correctly if browsers dropped that support.
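To make the contrast concrete, here is a minimal Python sketch. The sample strings are hypothetical, and the strict/lenient decoding below only approximates what a UTF-8-only browser might do; it is not a description of any real browser's behavior.

```python
# Hypothetical sample pages, chosen only for illustration.
ascii_page = "Hello, world".encode("ascii")
japanese_page = "日本語のページ".encode("euc_jp")

# Every ASCII byte sequence is already valid UTF-8 with the same meaning,
# so an ASCII-only page survives a switch to UTF-8-only decoding.
ascii_as_utf8 = ascii_page.decode("utf-8")

# EUC-JP bytes are generally not valid UTF-8, so strict decoding fails.
try:
    japanese_page.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for the invalid bytes,
# which is the "boxes" outcome.
lenient = japanese_page.decode("utf-8", errors="replace")

# Decoding with the declared charset recovers the original text.
recovered = japanese_page.decode("euc_jp")
```

Only the last line gets the text back, and it needs the legacy decoder to still exist.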

Sure, all their characters can be represented in UTF-8, but there are large swaths of websites that will never be updated to a new charset, and it's only a western-centric view that can so blithely suggest breaking them all.




Windows-1252/ISO-8859-1 (the two charsets are so commonly conflated that it's often best to treat them as one) was the dominant [non-ASCII] charset of the web until around 2007 or 2008, and its prevalence more recently is only about 5%.
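For anyone wondering why the two get conflated: they differ only in the 0x80-0x9F range, where ISO-8859-1 has C1 control characters and Windows-1252 has printable characters such as curly quotes. A small Python sketch, using made-up sample bytes:

```python
# Hypothetical bytes: the word "hi" wrapped in Windows-1252 curly quotes.
raw = bytes([0x93, 0x68, 0x69, 0x94])

# Windows-1252 maps 0x93/0x94 to the curly double quotes U+201C/U+201D.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps the same bytes to invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# From 0xA0 upward the two charsets decode identically.
shared = bytes(range(0xA0, 0x100))
same_upper_range = shared.decode("cp1252") == shared.decode("latin-1")
```

Content labelled ISO-8859-1 that contains bytes in that range almost always means Windows-1252, which is why the WHATWG Encoding Standard has browsers treat the `iso-8859-1` label as windows-1252.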

A collection of Usenet messages gathered in 2014 (see http://quetzalcoatal.blogspot.com/2014/03/understanding-emai... for full details) showed that out of 1,000,000 messages, about 530,000 were actually ASCII; 270,000 were ISO-8859-1 or Windows-1252; and only 75,000 were UTF-8. More modern numbers would probably show higher UTF-8 counts, although Usenet is notoriously conservative in terms of technology.
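Taking the rounded counts above at face value (they are approximate, so treat the results as rough), the shares work out as follows:

```python
# Approximate counts quoted above; the variable names are mine.
total = 1_000_000
counts = {
    "ASCII": 530_000,
    "ISO-8859-1 / Windows-1252": 270_000,
    "UTF-8": 75_000,
}

# Percentage share of each charset in the sample.
shares = {name: n * 100 / total for name, n in counts.items()}

# Whatever is left over was in some other charset.
other = total - sum(counts.values())
other_share = other * 100 / total

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
print(f"everything else: {other_share:.1f}%")
```

So roughly 53% ASCII, 27% ISO-8859-1/Windows-1252, 7.5% UTF-8, and about 12.5% in other charsets.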

What I'm trying to elucidate here is that the rise of UTF-8 isn't because most text is ASCII, but because there's been a rather more concerted effort to default content generation to UTF-8 and treat other charsets only as legacy inputs. Well, with the exception of the Japanese, who tend to be strongly averse to UTF-8. (I've been told that Japanese email users would rather have their text get silently mangled than silently converted to UTF-8 because you're quoting an email with a smart quote [not present in any of the 3 Japanese charsets], whereas every other locale was happy changing the default charset for writing to UTF-8).



