Hacker News

> Printed books have no reason to think about encodings

Exactly! It's the visual that matters. If you cannot print Unicode text on a sheet of paper and know its meaning, then Unicode has gone into a ditch.

> printed books cannot be easily searched if the entry doesn't appear in prepared indices

Are you going to invent a new encoding for every distinct use of the letter 'a'? That's not remotely possible.

> But some fonts convey semantics

People invent new fonts every day. Hell, I've invented fonts! Your mission to encode them all into Unicode is an instantly hopeless task. Again, semantics are divined from context. I know a mathematical 'i' from the English letter 'i' and the French letter 'i' and the bullet point 'i' and the electric-current 'i' by context, and so do you and everyone else. I even use the letter 'A' to represent an Army in my computer game Empire. Nobody has a problem divining the meaning. (And 'B' for Battleship, 'S' for Submarine, etc.)




> It's the visual that matters.

I expect a good search function for my electronic book!

> Again, semantics are divined from context.

I gave an explicit counterexample. You can't distinguish an IPA `ɑ` (open back unrounded vowel) from an IPA `a` (open front unrounded vowel) from context alone: [hɑt] and [hat] are different pronunciations. So they have to be encoded differently, right? And due to its history, the "normal" `a` glyph will look exactly like one or the other of them, depending on the font. So we can't have perfectly distinct glyphs for distinct code points anyway. Maybe we need a better rationale, something like semantics...
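The code-point distinction is easy to verify. A minimal sketch in Python (variable names are mine, for illustration):

```python
# IPA open back unrounded vowel vs. Latin small letter a:
# distinct Unicode code points, even when a font renders them identically.
ipa_alpha = "\u0251"  # LATIN SMALL LETTER ALPHA, used for IPA [ɑ]
latin_a = "\u0061"    # LATIN SMALL LETTER A, which doubles as IPA [a]

print(hex(ord(ipa_alpha)))   # 0x251
print(ipa_alpha == latin_a)  # False
print("h\u0251t" == "hat")   # False: [hɑt] and [hat] differ as strings
```

So the two transcriptions remain distinguishable to software regardless of how any particular font draws them.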


We have many words that are spelled the same, but sound different based on context, like "sow". That isn't a problem. They don't have to be differently encoded. Besides, are you now planning on adding accents and dialect pronunciations, too?


I'm not talking about homonyms (but I wasn't clear enough about that, sorry). I'm talking about the exact alphabets describing pronunciations. Honestly I can't really differentiate [hɑt] and [hat] sounds, but their IPA renditions are clearly different and have to be differently encoded.


> have to be differently encoded.

They don't. It's a fake problem.


So you have two different pronunciations [hat] and [hat]. Which one is which? Do you really expect to say "[hat] as in General American hot" and "[hat] as in Received Pronunciation hat" every time? Isn't the IPA supposed to be an unambiguous transcription (so to speak)?


It’s a fake problem if you just want to write the word “sow” or “hat”, sure, but if you want to write the IPA pronunciation of a word you need the IPA characters in your charset.

Wikipedia has those IPA pronunciations and they’re useful. How else would you do that without including those characters in Unicode? Would you rather have them as images?


> good search function

If there are 5 encodings for the glyph 'a', how are you going to know which one to search for? And there's no guarantee whatsoever which encoding the author used, or if he used them consistently. Your search function is going to be unreliable at best.

After all, if I was writing algebra, I'd use the ASCII 'a' not the math glyph. So will most everyone else but the most pedantic.


> If there are 5 encodings for the glyph 'a', how are you going to know which one to search for?

There are two possible cases and Unicode covers both of them.

For the case of IPA `ɑ`, it is a distinct character from `a` (which doubles as an IPA letter), so searching for `a` will skip results containing `ɑ`. But this is not far from what I want, because `a` and `ɑ` are distinct in the IPA and the search correctly distinguishes them. I would have to type `ɑ` to get the other results, but that is a minor matter.

For the case of the mathematical alphanumerics, they normalize to their ordinary counterparts under the compatibility mappings (i.e. NFKC or NFKD). The compatibility mappings disregard some minor differences and would be useful in this case. Of course I might want to narrow the search so that only mathematical alphanumerics match (or the inverse); this is why Unicode defines multiple levels of comparison [1]. Assuming the default collation table, all mathematical alphanumerics are considered different from ordinary letters only at level 3, while IPA `ɑ` differs at level 1 (but sorts immediately after all `a`s).

[1] https://unicode.org/reports/tr10/#Multi_Level_Comparison
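The two cases behave differently under normalization, which is easy to check with Python's standard library (a sketch; `unicodedata` is stdlib):

```python
import unicodedata

# MATHEMATICAL ITALIC SMALL A (U+1D44E) carries a compatibility
# decomposition, so NFKC folds it to a plain ASCII 'a'...
math_a = "\U0001D44E"
print(unicodedata.normalize("NFKC", math_a))  # 'a'

# ...while IPA alpha (U+0251) has no decomposition at all,
# so it survives NFKC unchanged and stays searchable as itself.
ipa_alpha = "\u0251"
print(unicodedata.normalize("NFKC", ipa_alpha) == ipa_alpha)  # True
```

A search engine that applies NFKC before indexing thus merges the math glyphs with ordinary letters while keeping the IPA distinction intact, exactly the split described above.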

> After all, if I was writing algebra, I'd use the ASCII 'a' not the math glyph.

What I would type in TeX doesn't necessarily correspond to the actual encoding.


> If you cannot print Unicode text on a sheet of paper and know its meaning, then

then you've picked the wrong font for the job. This can happen even with ASCII, e.g. if you print out cryptographic secrets in base64 for use during disaster recovery and then discover you cannot distinguish O and 0 (capital o and zero) nor I and l (capital i and small L). (Hopefully you discover this immediately after printing, instead of after losing all your data.) That doesn't mean those identical glyphs should never have been encoded differently, it means font designers need to make them more distinct.
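The base64 case is easy to demonstrate: the standard alphabet contains both members of each classic confusable pair, so a font that draws them alike makes a printout ambiguous. A quick check (Python sketch):

```python
import string

# The standard base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
b64_alphabet = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")

# Both members of each confusable pair are valid base64 characters,
# so nothing in the encoding itself disambiguates them on paper.
for pair in ("O0", "Il"):
    print(pair, all(c in b64_alphabet for c in pair))  # True for both
```

The fix lives in the font (or in choosing an alphabet designed to avoid confusables), not in pretending the characters should share a code point.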



