In 1978, the initial proposal for a set of "Universal Signs" was made by Bob Belleville at Xerox PARC. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the Xerox Character Code Standard (XCCS) by the present author, a multilingual encoding which has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.
Unicode arose as the result of eight years of working experience with XCCS. Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes), and by Lee Collins (ideographic character unification). Unicode retains the many features of XCCS whose utility has been proved over the years in an international line of communicating multilingual system products.
> Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is far below 2¹⁴ = 16,384.
...and yeah, that's probably true. It wasn't until several years later that they joined forces with the ISO 10646 project, which aimed to encode every character ever used by any human writing system, and 16 bits turned out not to be enough.
> Conceptually, compatibility characters are characters that would not have been encoded in the Unicode Standard except for compatibility and round-trip convertibility with other standards. Such standards include international, national, and vendor character encoding standards. For the most part, these are widely used standards that pre-dated Unicode, but because continued interoperability with new standards and data sources is one of the primary design goals of the Unicode Standard, additional compatibility characters are added as the situation warrants.
Round-trip convertibility is important for interoperability, which is important for getting the standard accepted. A data encoding standard that causes data loss is not acceptable, and trying to impose such on the world would have rightfully failed.
Unicode being widely adopted (which is what is happening) is the solution to the problem you seem to have with Unicode having to add more codepoints for compatibility with other encodings. Even if the pre-Unicode encodings never go away, they've already taken their respective bites out of the codepoint space, and they won't take up any more of the Consortium's time.
The only way this becomes a live issue again is if someone develops a whole new encoding which is incompatible with the latest version of Unicode. And it gets enough traction to get on anyone else's radar. I simply can't see that as very likely.
[Unicode] didn't start the fire; it was always burning, since the world's been turning.
(I.e. the "never-ending madness" is in the continual desire of human cultures to invent new characters with unique semantics and then have them faithfully digitally encoded in a way that carries those semantics. Being upset that this makes the management of a universal text encoding a never-ending job, is a bit like being upset that the Library of Congress always needs reorganizing and expanding because it keeps getting new books.)
It's different in that nobody expects anyone to implement their own copy of the LoC. This constant expansion of the character set puts an enormous burden on a very large collection of developers. The inevitable happens - they just don't bother - making full Unicode support a pipe dream.
I'm perplexed as to what you think "full support" for a novel Unicode codepoint entails.
Most of the properties of a novel codepoint that matter are attributes that the codepoint has defined within various Unicode-Consortium-defined data files. If your custom implementation of Unicode support already parses those data files, then a new codepoint is essentially "absorbed" automatically as soon as the new version of the Unicode data files is made available to your implementation.
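For concreteness, a minimal Python sketch of that "absorption": it parses a tiny subset of UnicodeData.txt (the local path and the choice of fields are illustrative assumptions), and a new Unicode release is picked up by simply swapping in the new file.

```python
# A minimal sketch of "absorbing" new codepoints by re-parsing the Consortium's
# data files. UnicodeData.txt ships with every release at
# https://www.unicode.org/Public/UCD/latest/ucd/ -- the local path is assumed.
def load_unicode_data(path="UnicodeData.txt"):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(";")
            cp = int(fields[0], 16)
            table[cp] = {
                "name": fields[1],        # e.g. "LATIN CAPITAL LETTER A"
                "category": fields[2],    # General_Category, e.g. "Lu"
                "lowercase": fields[13],  # Simple_Lowercase_Mapping, may be empty
            }
    return table

# When a new Unicode version lands, dropping in the new data file is enough:
# every newly assigned codepoint appears in the table with no code changes.
table = load_unicode_data()
print(table[0x0041])
```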
Are you just complaining that new codepoints don't automatically receive glyphs in major fonts?
To that, I would say: who cares? As I just said upthread — character encoding is about semantics, not about rendering. A character doesn't have to be renderable to be useful! The character is still doing its job — "transmitting meaning into the future" — when it's showing up as a U+FFFD replacement character! Screen readers can still read it! Search engines can still index on it! People who want to know what the character was "meant to mean" can copy and paste it into a Unicode database website! It still sorts and pattern-matches correctly! And so forth.
Rendering the codepoint as a glyph — if it even makes sense to render it standalone as a glyph (which some codepoints don't, e.g. the country-code characters that only have graphical representations in ligatured pairs) — can come at some arbitrary later point, when there's enough non-niche, non-academic, supply-side usage of the codepoint that there arises real demand to see that codepoint rendered as something. Until then, just documenting the semantics as a codepoint and otherwise leaving it un-rendered works fine for the needs of the anthropologists and archivists who "legibilize" existing proprietary-encoding-encoded texts through Unicode conversion.
> “multiple encodings for the same code point”, “multiple code points for the same glyph”, and “an incomprehensible document describing all this nonsense”
One important factor for adoption was round trip idempotency. Unicode doesn’t simply encode every glyph but incorporates as many existing character sets as possible. (Of course “possible” is modulated by the fact there are/were competing encodings for the same language).
If you use a unicode-aware grep and use it to search for ‘.’, you want the result to be the same as `cat`, rather than having the result re-coded, objcopy-like.
Sure, this is less of a problem in 2023, but it's still a major problem that will continue indefinitely due to legacy files, legacy hardware, and small-resource devices. So there is no way in Unicode to tell by inspection whether ‘A’ is Latin or Greek... which is true also on the printed page.
Likewise, the writing of different languages doesn't merely have a different appearance, but different semantics for the visually same shape. In some languages (e.g. French), diacritics are largely semantically silent: you can understand a sentence with them stripped out, though they can simplify reading in some cases by disambiguating otherwise-homonymous words. In other languages (e.g. the German umlaut) they are simply contractions — fossils of medieval handwriting. Yet again, in Swedish they can be optional, like the acute accent, or form completely different letters, like ä. Capturing this is complicated, and even discussing it requires, yes, an almost incomprehensible document.
And it has all been implemented, but using those libraries is hard, because language processing is hard. In general, I/O is hard.
I'm not sure what you mean about "semantic meaning", but I do agree, mostly, with your other complaints.
There's one code point for the letter `a` and another code point for the mathematical symbol `a`, despite being visually identical. I.e. the code points have acquired semantic meaning.
Spy agencies use this so different versions of a document look identical when printed. Thus, they can use this to track down the mole who leaked a sensitive document.
Emoji are “fun” but it’s got to stop. These characters are going to be used until civilization collapses, so we can’t just keep being like “lol, let’s add a food that is only eaten in some regions and has been around for less than 100 years.”
And, let's be honest. Add some of the glaring omissions in emoji. There's no need to use the eggplant emoji between lovers if you can use the real thing. I mean seriously, even the ancient Egyptians knew this: https://unicode-explorer.com/c/130B8 and even more explicit: https://unicode-explorer.com/c/130BA
Of course it will be inappropriate in many situations. Well, guess what, so are many words. Emoji are no longer just fun. They're also a great way for people of different languages to understand each other. No need to make things more difficult to understand, and there are also a lot of non-controversial cases for this ("doctor, my eggplant sparks when the water tap is open" is not quite as clear :) )
Of course the level of explicitness can be decided by the font designers, and the big tech companies will no doubt stylise them tastefully. The same way the gun emoji is now a green water pistol.
But I really don't understand leaving this major topic of human interaction out of what is becoming a language. I'm really serious about this. Any service, website or app not wanting to support them can simply block them just like they already do certain words, and it's much easier to do than a filter on words which can be bypassed by misspelling, inserting spaces etc. This is how 'pr0n' became a word. With emoji you can't do this.
I'm a bit tired of the tech sector pushing American prudishness on the whole world. Here in Barcelona you can buy waffles in the shape of penises and vulvas with strategically placed white chocolate and eat them on the street, but a tastefully stylised emoji is not ok? Different cultures have different standards, and Unicode was supposed to represent everyone.
Not different from the fact that we have useless C0/C1 control characters somewhere in Unicode. As long as it's bearable (i.e. not too many), having rare characters is okay.
Emojis are also great from a technical standpoint: because they are desirable to users, the Unicode Consortium can leverage them (and has) to encourage text rendering engines to properly implement features necessary for various languages, e.g. astral codepoints, ligatures, grapheme clusters, ...
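A rough illustration of that shared machinery, assuming Python plus the third-party `regex` module (the stdlib `re` has no `\X` for grapheme clusters): the same segmentation rules that keep an emoji ZWJ sequence together also keep decomposed Hangul jamo together.

```python
import unicodedata
import regex  # third-party: pip install regex; supports \X for grapheme clusters

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"   # man + ZWJ + woman + ZWJ + girl
print(len(family))                                       # 5 code points (2 astral)
print(len(regex.findall(r"\X", family)))                 # 1 grapheme cluster

decomposed = unicodedata.normalize("NFD", "브라이트")     # Hangul syllables split into jamo
print(len(decomposed))                                   # 8 code points
print(len(regex.findall(r"\X", decomposed)))             # 4 grapheme clusters
```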
Japanese telcos eventually implemented a conversion server for emojis from other telcos (before emojis in Unicode were formally accepted). It was just a matter of time.
That's much later than the initial emoji assignments, and pretty much expected given the much broader reach of availability. The earliest UTC decision regarding emoji is [1] AFAIK.
Emoji were always carrier-dependent, and Japanese carriers always had conversion/filtering at the carrier e-mail MTA, years before 2007. I think you have in mind one of the later coordinated attempts at it.
Oh, I assumed the other meaning of PC (not Personal Computer). You are right, Google and Apple had to deal with them because emails went through their pipelines.
We're evolving from phonetic languages back to hieroglyphic ones, because somehow the Unicode committee imagines that 1,000,000 hieroglyphs with distinct meanings are better than 26 phonetic letters.
They hate them. So much that there have been multiple attempts to curb the future Emoji growth, including one based on WikiData QIDs by the president of the Unicode consortium [1]. Emojis persist because the current growth doesn't burden the consortium or vendors too much and no better solution was found.
> 26 phonetic letters
And I'm genuinely offended by this particular phrase. You can't even correctly spell my name in those 26 "phonetic" letters. (My preferred convention is a mere approximation.)
No need to be offended. I've clearly advocated Unicode16, which surely encompasses your spelling. Unicode16 is an achievable and implementable goal. I welcomed it as a great advance.
I do note that you use ASCII for your handle. Could that be because few will be able to figure out how to enter the code point for your name? If your handle includes code point 0x123456, how do you expect users to deal with it?
I only use an ASCII handle because this forum is English only and I don't want to waste my time answering stupid questions. My Korean colleagues can type Korean handles just fine, thank you for the consideration.
But still. I'm fortunate that my Korean name does fit in the 2,350 common Hangul syllables, but there are people whose names don't (and who are greatly disadvantaged). And way more Chinese and Japanese people have rare characters in their names. I even have concrete data for them: the Japanese government defines the Koseki unified character set (戸籍統一文字), which complements the common character sets and is intended to be used for family registers (戸籍). Therefore it should cover most person and location names ever used in Japan (and probably nothing else). There are 44,770 characters in this set, including 857 characters not yet encoded in Unicode [1]. So a reasonable character set supporting most Japanese proper names already requires more than 50,000 distinct characters!
Unicode16 was never ready for long tails. The Unicode committee only realized this after a massive number of character assignments, which is their only regrettable mistake in my opinion.
By this measure pretty much every common programming language is unimplementable and should never be advocated. People have implemented Unicode for decades. Only a few have implemented "full" support (in some sense), but that's the same for pretty much everything complex enough. Can you actually measure one meter [1] according to the International System of Units?
[1] Or one foot. I remind you that the current standardized definition of "foot" is based on the SI.
As others pointed out, no single font is required to support all available Unicode characters. In fact that is close to impossible anyway, because it is extremely hard to harmonize all typographic conventions across all scripts. Unicode mostly does the next best thing: it gives a uniform framework to handle multilingual texts without knowing the specifics. Even though your font can't render "월터 브라이트", programs can normalize, compare, search and segment it just like they do with the text "Walter Bright".
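A font-free sketch of that claim, using only Python's stdlib `unicodedata` (no rendering involved anywhere):

```python
# Programs can normalize, compare and search "월터 브라이트" with exactly the
# same code paths as "Walter Bright", whether or not any font can draw Hangul.
import unicodedata

composed   = "월터 브라이트"                          # precomposed syllables (NFC)
decomposed = unicodedata.normalize("NFD", composed)   # conjoining jamo

print(composed == decomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: same text after normalization
print("브라이트" in composed)                                 # True: substring search just works
print(composed.split())                                      # ['월터', '브라이트'], like "Walter Bright".split()
```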
But if you like pedantry, the Noto fonts are close to complete. They have to be used as a collection due to an OpenType limitation, but otherwise they are designed in unison. Most remaining characters are rarer CJK unified ideographs [1], but that is more a matter of priority; there are several fonts that cover the entire set of CJK unified ideographs as well.
Fonts, for example. That is a style, and should be set by the rendering instructions (like CSS) not the code point.
Semantics: there's a code point for the letter 'a' and another code point for the mathematical symbol 'a', although the glyphs are identical. Unicode points should not have semantic meaning - semantic meaning comes from the context in which the points are used.
> Semantics: there's a code point for the letter 'a' and another code point for the mathematical symbol 'a', although the glyphs are identical. Unicode points should not have semantic meaning - semantic meaning comes from the context in which the points are used.
Just the opposite. A codepoint looks different in different fonts, and so things that are "identical" in one font won't be "identical" in another font. Only the semantics are the same.
Printed "A" and cursive "A" are both, semantically, "latin capital A". Those semantics are very concrete and specific: they include its sort order, its conversion re: case-conversion, its joining/splitting behavior, its categoric matching in regexps, etc.
But the codepoint "latin capital A" expresses no preference over whether it should be rendered as a printed/roman A or a cursive A. Codepoints don't define glyphs; they only define semantics! The glyphs associated with codepoints are conventional, following the codepoint's description!
Further, codepoints tell accessibility technologies and other machine-agents what a codepoint "is." Greek letter π is text-to-speech'ed as a letter, and together with other greek letters, as a greek word. Mathematical symbol 𝜋 is text-to-speech'ed as the word "pi."
This is how every character encoding ever has always worked. Text on computers — text over telegraph lines, even! — has always, implicitly, been a set of abstract standardized semiotic codels, with effectively-arbitrary per-OS/per-machine/per-font-family/per-style/per-size(!) graphical renditions.
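A quick sketch of the "glyphs can coincide, semantics don't" point, using Python's stdlib `unicodedata` (the specific codepoints are just illustrative):

```python
# Latin "A" and Greek "Α" often render identically, but everything a program
# can ask about them -- name, case mapping, what a screen reader says -- differs.
import unicodedata

latin, greek = "\u0041", "\u0391"
print(unicodedata.name(latin))          # LATIN CAPITAL LETTER A
print(unicodedata.name(greek))          # GREEK CAPITAL LETTER ALPHA
print(latin.lower(), greek.lower())     # a α -- case conversion follows the codepoint, not the shape
print(unicodedata.name("\U0001D70B"))   # MATHEMATICAL ITALIC SMALL PI -- what a screen reader keys off
```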
> Unicode has set itself an absurdly impossible task.
Again: "Unicode" is not the problem here. "Human language" is the "absurdly impossible task", and has been, from ~40,000BCE. Unicode is just a formalist lens through which the scope of the task is made legible.
(And I think you're being a slight bit scope-insensitive here. You should expect a project like "encoding all of human language" to be something that takes hundreds or thousands of years to get right. Unicode isn't a project, it's a megaproject, something that'll still be evolving fifteen human generations from now. But humanity as a whole has built many successful megaprojects! In the only-50-years-old realm of software engineering, we don't tend to rate things of this scope and scale as "possible" or "practical" or "tenable" — but Unicode fundamentally isn't a software engineering project; it's a language engineering project, like the project of the Académie Française, or of the creation of Simplified Chinese. Language engineering is, by definition, a project of indefinite length.)
I have already explained why mathematical alphanumeric symbols exist in the first place [1]. They are an extremely small portion of Unicode; you need way more examples.
By the way, if we indeed decide to encode characters according to glyphs, we would have to distinguish a double-storey `a` from a single-storey `a` (among others). They have clearly different glyphs, so why shouldn't we? Think about that.
But somehow mathematical books have done just fine without any need for such encodings. Meaning in written language is inferred from the context, not the symbol. This is because meanings are infinite in variety, and symbols are finite.
> Think about that
I did. That's what fonts are for. Fonts come in endless variety, and so cannot be encoded into Unicode. They are an orthogonal property of rendered text.
And so is formatting - it's an orthogonal property.
> But somehow mathematical books have done just fine without any need for such encodings.
Printed books have no reason to think about encodings. In fact, they can use one-off metal types (or images nowadays) for unusual characters. In return printed books cannot be easily searched if the entry doesn't appear in prepared indices.
> Fonts come in endless variety, and so cannot be encoded into Unicode.
But some fonts convey semantics. For example, normal text does not distinguish the two variants of `a`, but IPA does, so Unicode encodes the single-storey `ɑ` separately and unifies the double-storey `a` with the normal lowercase `a`. (p. 299 of the Unicode Standard 15.0)
Semantics is arguably the only way to judge those cases. If fonts don't matter but glyphs do, how would you know whether a particular glyph should be encoded identically to another glyph or not? I think you agree that a double-storey `a` and a single-storey `a` are the same character in different fonts. Semantics is how we decide the "same character" part!
It's great that you can tell the mathematical contexts from the written language contexts, but can screen readers?
If I have the equation ax + by + c = 0, ideally a screen reader would read it as "a-x plus b-y plus c equals zero", not "axe plus bye plus ku equals zero". This is not a problem printed books have because printed books do not need to read themselves, and audio books do not have this problem because there's a human in the loop doing the reading. But these are not the only uses that matter.
> Printed books have no reason to think about encodings
Exactly! It's the visual that matters. If you cannot print Unicode text on a sheet of paper and know its meaning, then Unicode has gone into a ditch.
> printed books cannot be easily searched if the entry doesn't appear in prepared indices
Are you going to invent a new encoding for every distinct use of the letter 'a'? That's not remotely possible.
> But some fonts convey semantics
People invent new fonts every day. Hell, I've invented fonts! Your mission to encode them all into Unicode is an instantly hopeless task. Again, semantics are divined from context. I know a mathematical 'i' from the english letter 'i' and the french letter 'i' and the bullet point 'i' and the electric current 'i' by context, and so do you and everyone else. I even use the letter 'A' to represent an Army in my computer game Empire. Nobody has a problem divining the meaning. (and B for Battleship, S for Submarine , etc.)
I expect a good search function for my electronic book!
> Again, semantics are divined from context.
I gave an explicit counterexample. You can't distinguish an IPA `ɑ` (open back unrounded vowel) from an IPA `a` (open front unrounded vowel) with just context---[hɑt] and [hat] are different pronunciations. So they have to be differently encoded, right? And due to its history, the "normal" `a` glyph will look exactly like one of them, depending on the font. So we can't have perfectly distinct glyphs for distinct code points anyway. Maybe we need a better rationale, something like semantics...
We have many words that are spelled the same, but sound different based on context, like "sow". That isn't a problem. They don't have to be differently encoded. Besides, are you now planning on adding accents and dialect pronunciations, too?
I'm not talking about homonyms (but I wasn't clear enough about that, sorry). I'm talking about the exact alphabets describing pronunciations. Honestly I can't really differentiate [hɑt] and [hat] sounds, but their IPA renditions are clearly different and have to be differently encoded.
So you have two different pronunciations, both written [hat]. Which one is which? Do you really expect to say "[hat] as in General American hot" and "[hat] as in Received Pronunciation hat" every time? Isn't the IPA supposed to be an unambiguous transcription (so to speak)?
It’s a fake problem if you just want to write the word “sow” or “hat”, sure, but if you want to write the IPA pronunciation of a word you need the IPA characters in your charset.
Wikipedia has those IPA pronunciations and they’re useful. How else would you do that without including those characters in Unicode? Would you rather have them as images?
If there are 5 encodings for the glyph 'a', how are you going to know which one to search for? And there's no guarantee whatsoever which encoding the author used, or if he used them consistently. Your search function is going to be unreliable at best.
After all, if I was writing algebra, I'd use the ASCII 'a' not the math glyph. So will most everyone else but the most pedantic.
> If there are 5 encodings for the glyph 'a', how are you going to know which one to search for?
There are two possible cases and Unicode covers both of them.
For the case of the IPA `ɑ`, it is a distinct character from `a` (which doubles as an IPA letter), so results containing `ɑ` will be ignored. But this is not far from what I want, because `a` and `ɑ` are distinct in IPA and the search correctly distinguishes them. I would have to type `ɑ` to get the other results, but that's a separate matter.
For the case of mathematical alphanumerics, they do normalize to their normal counterparts under the compatibility mapping (i.e. NFKC or NFKD). The compatibility mapping disregards some minor differences and would be useful in this case. Of course I might want to narrow the search so that only mathematical alphanumerics match (or the inverse)---this is why Unicode defines multiple levels of comparison [1]. Assuming the default collation table, all mathematical alphanumerics are considered different from normal letters only at level 3, while the IPA `ɑ` is different at level 1 (but sorts immediately after all `a`s).
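A minimal sketch of the two cases with Python's stdlib `unicodedata` (multi-level collation itself would need an ICU binding, so only the normalization half is shown):

```python
# Compatibility normalization folds mathematical alphanumerics back onto their
# plain counterparts, while the IPA letter stays distinct -- the two cases above.
import unicodedata

math_a = "\U0001D44E"   # MATHEMATICAL ITALIC SMALL A
ipa_a  = "\u0251"       # LATIN SMALL LETTER ALPHA (IPA open back unrounded vowel)

print(unicodedata.normalize("NFKC", math_a))        # 'a'  -- compatibility decomposition applies
print(unicodedata.normalize("NFKC", ipa_a))         # 'ɑ'  -- unchanged, a genuinely distinct character
print(unicodedata.normalize("NFKC", ipa_a) == "a")  # False
```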
> If you cannot print Unicode text on a sheet of paper and know its meaning, then
then you've picked the wrong font for the job. This can happen even with ASCII, e.g. if you print out cryptographic secrets in base64 for use during disaster recovery and then discover you cannot distinguish O and 0 (capital o and zero) nor I and l (capital i and small L). (Hopefully you discover this immediately after printing, instead of after losing all your data.) That doesn't mean those identical glyphs should never have been encoded differently, it means font designers need to make them more distinct.
It's not true if you count current languages of East Asia, such as Chinese. Or does the document address that omission somehow?
EDIT: Skimming it, I don't see how the issue is addressed, except in a diagram that says,
* Modern-Use Ideographs: 54x256 = 13,824
* Added Ideographs: 64x256 = 16,384
Also it talks about deduplicating identical characters in Chinese, the Japanese alphabets, and Korean (Hangul).
More interesting is the full explanation of modern use:
> Distinction of "modern-use" characters: Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.
It's very interesting, and revealing about the author, to define use as newspaper and magazine publishing.
Unicode 88 (wrongly) assumed that the set of Han characters in current use was small enough (see p. 3 for specifics). This turned out to be massively incorrect because they have an extremely long tail. Same goes for Hangul.
EDIT:
> Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.
I do believe that the initial estimation was not very far off. IICore 2.2 [1] contains 9,810 Han characters across all relevant countries, and even assuming that every character had to be encoded in two ways (to account for simplified-traditional splits) and that IICore had some significant omissions, we could have encoded just ~40,000 characters, if everything had gone according to their plan.
The reality, however, was that a PUA for rare characters was an old concept---pretty much every East Asian character set has one---and it had proven ineffective. A PUA is by definition not interchangeable, but rare characters are still in use and have to be interchanged anyway. So people avoided PUAs like the plague and argued for bigger character sets instead, East Asian character sets had to balloon, and by the time of Unicode they had become so large that the naive vision of Unicode 88 no longer worked.
Adding all of the dead languages bloated the system very quickly. Shame UCS-2 turned out to be such a mistake. Beyond simply not being big enough, trading off space for compute power turned out to be the wrong trade: CPUs sped up much more rapidly than memory sizes and, especially, data transmission rates.
A 24- or 32-bit fixed length would probably be fine…
The issue is that doing unified codepoints right runs into the problem of accepting that zh-CN/zh-HK/zh-TW and ja-JP/kr-KR each use slightly different definitions for every letter in common use, which is not just memory-inefficient but also touches the Taiwanese identity problem.
Much easier to keep it to just zh-CN and zh-TW (recently renamed zh-Hans and zh-Hant), with zh-CN completely double-defined with ja-JP, and kr-KR dealt with as footnotes on ja-JP, which is what they did.
> The issue is that doing unified codepoints right runs into the problem of accepting that zh-CN/zh-HK/zh-TW and ja-JP/kr-KR each use slightly different definitions for every letter in common use, which is not just memory-inefficient but also touches the Taiwanese identity problem.
I'm not sure what exactly you wanted to say, but all six or seven variants of Han characters are substantially different from each other. Glyph-wise there are three major clusters due to the different simplification histories (or the lack thereof): PRC, Japan, and everyone else. (Yes, Korean Hanja is very similar to traditional Chinese.) Even if Taiwan didn't exist at all, there would still be three major clusters.
Also, pedantry corner: ko-KR, not kr-KR. And zh-TW wasn't renamed to zh-Hant; it is valid to say zh-Hant-CN if the PRC somehow wants to use traditional Chinese characters.
What I was trying to say is: to fix Unicode, we must accept that however many of those variants there are, they don't interchange, and therefore each requires its own plane, which I believe has never been the Consortium's stance on the matter. And I think that would solve about half of the problems with Unicode by volume.
Yeah, everyone seems to agree that the Han unification was a mistake. But I also think its impact has been exaggerated. The latest IVD contains 12,192 additional variants [1] for the original CJK unified ideograph block (currently 20,992 characters). They represent most variants of common interest, so if the Han unification hadn't happened and we had assigned them in advance, we could have actually fit everything into the BMP! [2] Of course this would also have meant that further assignment was almost impossible.
[2] Unicode 2.0 had 38,885 assigned characters (almost all in the BMP) and ~6,200 unallocable code points otherwise (mostly because they had been private-use characters since 1.0). 38,885 + 6,200 + 12,192 = 57,277 < 2^16. A careful block reassignment would have been required, though.
IVS could work out if everyone switched to using it by default, but it's currently only used for special cases where the desired letter shape diverges from the language default (the "bridge-shaped high" cases) rather than being the default, and I don't see a perfect IVS world happening.
I mean, it's clearly tied to languages, so I think the only viable options are either to split languages into separate tables in those reserved-for-unknown-reasons planes, or to add a font selector instruction like ":flag_tw: :right_arrow: 今天是晴天 :left_arrow:". Or, if there's going to be a WWIII and one or more of zh-Hans or ja-JP becomes obsolete, it's not going to be an important issue.
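For anyone unfamiliar with how IVS looks at the codepoint level, here is a minimal Python sketch (the particular selector chosen is illustrative, not a claim about what's registered in the IVD):

```python
# An ideographic variation sequence is just <base ideograph, variation selector>.
# Stripping the selectors recovers the unified base character, so search and
# comparison on the base still work even when a renderer ignores the selector.
base = "\u8FBB"              # 辻 (CJK unified ideograph)
ivs  = base + "\U000E0100"   # base + VARIATION SELECTOR-17 (illustrative pairing)

print(len(ivs))              # 2 code points, intended to display as one character

def strip_variation_selectors(s):
    # VS1..VS16 are U+FE00..FE0F; VS17..VS256 are U+E0100..E01EF
    return "".join(c for c in s
                   if not (0xFE00 <= ord(c) <= 0xFE0F or 0xE0100 <= ord(c) <= 0xE01EF))

print(strip_variation_selectors(ivs) == base)   # True
```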
I was thinking more of Han Unification, which we now realize was a bad idea, and to this day makes it difficult if not impossible to correctly render a block of text that contains characters in more than one CJK language.
"A book written in Unicodes is called a 'Unibook', and the totality of materials available in Unicodes is known as the 'Universe'"—in another timeline...
It's a disappointment that we never used different conceptual "types" for text files with different encodings. You can sort of get there with MIME types and encoding data attached, but imagine if you had a filename convention like '.txt.ASCII' or '.txt.UTF8LE' or '.txt.ISO8859-1', so software would know exactly what to expect rather than sniffing byte-order marks or using heuristics to guess the encoding and fix mojibake.
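A hypothetical sketch of what that convention could look like (the suffix spellings and the helper name are made up for illustration; Python's codec registry happens to accept names like these already):

```python
# Hypothetical ".txt.<ENCODING>" convention: the trailing suffix names the codec.
import codecs

def open_tagged_text(path):
    # "notes.txt.UTF-16-LE" -> logical name "notes.txt", encoding "UTF-16-LE"
    logical_name, _, encoding = path.rpartition(".")
    codecs.lookup(encoding)            # raises LookupError for unknown encodings
    return open(path, encoding=encoding)

# Usage (file names are illustrative):
# with open_tagged_text("notes.txt.ISO8859-1") as f:
#     print(f.read())
```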
I think if someone had tried to make encoding-in-filename a common convention it would have just resulted in people writing “.txt.ASCII” without understanding what that meant, and we’d have the same charset sniffing we have now plus another mess on top. (For example: when you have a “config.ini.UTF-16” and a “config.ini.WINDOWS-1252”, which should your program prefer?)
---
An anecdote: I was writing some code to send data to another company as an XML file. When I sent them a test file to validate, I got a complaint that it didn’t declare its encoding as UTF-8.
“Yes, our database uses UCS-2 for exports,” I replied. “Why, can you not read it?”
“Our system imported it just fine,“ they explained, “but it needs to say ‘<?xml version="1.0" encoding="UTF-8"?>’ at the top and when I looked at it with a text editor yours says something else there instead.”
“Yes, it doesn’t say UTF-8 because it’s not UTF-8. Sounds like your system is OK with that, though?”
“Yes, the rest of the file is fine, but we can’t move forward with go-live until you conform to the example in our implementation guide...”
Actually changing the encoding proved difficult (for reasons I no longer recall), so I ended up shoving in a string replacement to make the file claim the wrong encoding, along with an apologetic comment; their end apparently did sniffing (and the BOM was unambiguous), so this worked out fine, but it goes to show just how confused people can get about what character encoding a file actually is.
I don't think having the extensions as a part of the file name was a good idea in the first place; I see it more as a practical compromise.
HTTP does not care about extensions in URLs at all, and you can serve any content type (plus information about its encoding) independently of the contents. Early file systems did not have capabilities this advanced, and through decades of backward compatibility we're kind of stuck with .jpg (or is it .jpeg? or .jpe?), .html (or .htm?), or .doc (is this MS Word? or RTF?).
> Early file systems did not have capabilities this advanced, and through decades of backward compatibility we're kind of stuck with
Some did. Some mainframes did. And Apple pre-OSX (HFS I think was the file system) did too.
The problem isn’t just backwards compatibility with DOS. It’s that most file transfer protocols don’t forward that kind of meta-information. They don’t even share whether a file is executable, a system file, or hidden (for those that do have an attribute for hidden files). So the file type ended up needing to be encoded in the file name, if only to preserve it.
HFS (and HFS+, and APFS) indeed has "resource forks", and supporting (or just not tripping over) them on non-Apple filesystems comes with a bag of surprises, even on Apple platforms. Just last week we ran into a problem with a script that globbed for "*.mp4" files to feed them into ffmpeg, but choked on the "._*.mp4" files (which is how macOS stores the resource forks on FAT32/exFAT). The script was tested and worked fine on every Mac I tried it on, and everything was smooth until the customer plugged in an external HD to process videos off of. https://en.wikipedia.org/wiki/Resource_fork
So unfortunately the most reliable way of knowing exactly what you're working with is lots of heuristics and very defensive coding.
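The fix in our case was one small piece of that defensive coding, roughly like this (paths are illustrative):

```python
# Defensive globbing: skip the "._*" AppleDouble companion files that macOS
# writes next to real files on FAT32/exFAT volumes, so only genuine videos
# get handed to the encoder.
from pathlib import Path

def real_mp4s(directory):
    return sorted(
        p for p in Path(directory).glob("*.mp4")
        if not p.name.startswith("._")
    )

for video in real_mp4s("/Volumes/CUSTOMER_HD"):   # illustrative path
    print(video)   # feed to ffmpeg here
```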
Resource forks are something different; the parent is referring to type and creator codes, which are part of the metadata for a file. (Type codes were also used to identify types inside the resource fork, which is maybe what you’re thinking of?)
Japanese, since Unicode is set up to display Japanese wrong (CJK unification - theoretically you can configure your application to display Chinese wrong instead, but no-one does that).
But I read up on it and it sounds like Unicode is widely used in Japan and the issue is mainly academic when it comes to old texts. It also seems to have been mitigated with selectors:
> Since the Unihan standard encodes "abstract characters", not "glyphs", the graphical artifacts produced by Unicode have been considered temporary technical hurdles, and at most, cosmetic. However, again, particularly in Japan, due in part to the way in which Chinese characters were incorporated into Japanese writing systems historically, the inability to specify a particular variant was considered a significant obstacle to the use of Unicode in scholarly work. For example, the unification of "grass" (explained above), means that a historical text cannot be encoded so as to preserve its peculiar orthography. Instead, for example, the scholar would be required to locate the desired glyph in a specific typeface in order to convey the text as written, defeating the purpose of a unified character set. Unicode has responded to these needs by assigning variation selectors so that authors can select grapheme variations of particular ideographs (or even other characters).
> the issue is mainly academic when it comes to old texts.
If it were just an academic concern, there could have been other solutions. It is more about how much people care about the exact glyph on display, and the Japanese were particularly pedantic about it in my experience (even more so than the Chinese).
The Han unification itself was a well-intentioned but ultimately misguided effort to fit the massive repertoire of Han characters into 16 bits. More evidence that a 16-bit character set was naive.
It’s not an academic or legacy-text issue at all: systems and apps that deal with Japanese text were usually configured to do Chinese wrong, and vice versa, which worked until computer OS images became universal.
The Chinese Hanzi codespace and Japanese Kanji codespace are effectively reused for each other’s sets. That’s what it is.
Maybe when I have some time we'll have to go back to a practical and sane 16-bit Unicode, at least for programming. Not for archeology.
Only the common scripts, proper identifiers (without the bugs left unfixed for decades), and forbidden mark combinations where a single precomposed symbol exists. Of course without emoji, but also without all the silly supplementary planes.