
I have already explained why mathematical alphanumeric symbols exist in the first place [1]. They are an extremely small portion of Unicode; you need way more examples.

By the way, if we indeed decide to encode characters according to glyphs, we would have to distinguish a double-storey `a` from a single-storey `a` (among others). They have clearly different glyphs, so why shouldn't we? Think about that.

[1] https://news.ycombinator.com/item?id=28294032




> I have already explained

But somehow mathematical books have done just fine without any need for such encodings. Meaning in written language is inferred from the context, not the symbol. This is because meanings are infinite in variety, and symbols are finite.

> Think about that

I did. That's what fonts are for. Fonts come in endless variety, and so cannot be encoded into Unicode. They are an orthogonal property of rendered text.

And so is formatting - it's an orthogonal property.


> But somehow mathematical books have done just fine without any need for such encodings.

Printed books have no reason to think about encodings. In fact, they can use one-off metal types (or images nowadays) for unusual characters. The tradeoff is that printed books cannot be easily searched if the entry doesn't appear in a prepared index.

> Fonts come in endless variety, and so cannot be encoded into Unicode.

But some font distinctions convey semantics. For example, normal text does not distinguish the two variants of `a`, but the IPA does, so Unicode encodes the single-storey `ɑ` separately and unifies the double-storey `a` with the normal lowercase `a`. (p. 299 of the Unicode Standard 15.0)

Semantics is arguably the only way to judge these cases. If fonts do not matter but glyphs do, how would you know whether a particular glyph should be encoded identically to another glyph or not? I think you agree that a double-storey `a` and a single-storey `a` are the same character rendered in different fonts. Semantics is how we decide the "same character" part!
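
As a minimal sketch (Python 3 standard library; the printed names come from the Unicode Character Database), you can verify that the two are distinct characters:

    import unicodedata

    # The ordinary `a` and the IPA `ɑ` are separate code points with
    # separate names, while the double-storey/single-storey distinction
    # for plain `a` is left entirely to the font.
    for ch in ['a', 'ɑ']:
        print(f'U+{ord(ch):04X}  {unicodedata.name(ch)}')
    # U+0061  LATIN SMALL LETTER A
    # U+0251  LATIN SMALL LETTER ALPHA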


It's great that you can tell the mathematical contexts from the written language contexts, but can screen readers?

If I have the equation ax + by + c = 0, ideally a screen reader would read it as "a-x plus b-y plus c equals zero", not "axe plus bye plus ku equals zero". This is not a problem printed books have because printed books do not need to read themselves, and audio books do not have this problem because there's a human in the loop doing the reading. But these are not the only uses that matter.
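
To make that concrete: software can only make the distinction mechanically if the encoding carries it. A toy Python sketch (purely illustrative; real screen readers work from accessibility APIs, not like this), assuming the equation was typed with the mathematical alphanumeric code points:

    import unicodedata

    def is_math_letter(ch):
        # Mathematical alphanumeric symbols all have character names
        # beginning with "MATHEMATICAL".
        return unicodedata.name(ch, '').startswith('MATHEMATICAL')

    print(is_math_letter('a'))           # False: an ordinary prose letter
    print(is_math_letter('\U0001D44E'))  # True: U+1D44E MATHEMATICAL ITALIC SMALL A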


> Printed books have no reason to think about encodings

Exactly! It's the visual that matters. If you cannot print Unicode text on a sheet of paper and know its meaning, then Unicode has gone into a ditch.

> printed books cannot be easily searched if the entry doesn't appear in prepared indices

Are you going to invent a new encoding for every distinct use of the letter 'a'? That's not remotely possible.

> But some fonts convey semantics

People invent new fonts every day. Hell, I've invented fonts! Your mission to encode them all into Unicode is an instantly hopeless task. Again, semantics are divined from context. I know a mathematical 'i' from the English letter 'i' and the French letter 'i' and the bullet point 'i' and the electric current 'i' by context, and so do you and everyone else. I even use the letter 'A' to represent an Army in my computer game Empire. Nobody has a problem divining the meaning. (and B for Battleship, S for Submarine, etc.)


> It's the visual that matters.

I expect a good search function for my electronic book!

> Again, semantics are divined from context.

I gave an explicit counterexample. You can't distinguish the IPA `ɑ` (open back unrounded vowel) from the IPA `a` (open front unrounded vowel) from context alone: [hɑt] and [hat] are different pronunciations. So they have to be differently encoded, right? And due to its history, the "normal" `a` glyph will look exactly like one of them, depending on the font. So we can't have perfectly distinct glyphs for distinct code points anyway. Maybe we need a better rationale, something like semantics...
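
To be concrete, the two transcriptions are distinct at the code point level, so software needs no context to tell them apart (a quick Python check):

    print('hɑt' == 'hat')                # False
    print([hex(ord(c)) for c in 'hɑt'])  # ['0x68', '0x251', '0x74']
    print([hex(ord(c)) for c in 'hat'])  # ['0x68', '0x61', '0x74']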


We have many words that are spelled the same, but sound different based on context, like "sow". That isn't a problem. They don't have to be differently encoded. Besides, are you now planning on adding accents and dialect pronunciations, too?


I'm not talking about homonyms (but I wasn't clear enough about that, sorry). I'm talking about an exact alphabet for describing pronunciations. Honestly I can't really differentiate the [hɑt] and [hat] sounds myself, but their IPA renditions are clearly different and have to be differently encoded.


> have to be differently encoded.

They don't. It's a fake problem.


So you have two different pronunciations [hat] and [hat]. Which one is which? Do you really expect to say "[hat] as in General American hot" and "[hat] as in Received Pronunciation hat" every time? Isn't the IPA supposed to be an unambiguous transcription (so to say)?


It’s a fake problem if you just want to write the word “sow” or “hat”, sure, but if you want to write the IPA pronunciation of a word you need the IPA characters in your charset.

Wikipedia has those IPA pronunciations and they’re useful. How else would you do that without including those characters in Unicode? Would you rather have them as images?


> good search function

If there are 5 encodings for the glyph 'a', how are you going to know which one to search for? And there's no guarantee whatsoever which encoding the author used, or if he used them consistently. Your search function is going to be unreliable at best.

After all, if I was writing algebra, I'd use the ASCII 'a' not the math glyph. So will most everyone else but the most pedantic.


> If there are 5 encodings for the glyph 'a', how are you going to know which one to search for?

There are two possible cases and Unicode covers both of them.

For the case of the IPA `ɑ`, it is a distinct character from `a` (which doubles as an IPA symbol), so searching for `a` will skip results containing `ɑ`. But this is not far from what I want, because `a` and `ɑ` are distinct in the IPA and the search correctly distinguishes them. I would have to type `ɑ` to get the other results, but that's beside the point.

For the case of mathematical alphanumerics, they do normalize to their normal counterparts under the compatibility mappings (i.e. NFKC or NFKD). The compatibility mappings disregard some minor differences and would be useful in this case. Of course I might want to narrow the search so that only mathematical alphanumerics match (or the inverse); this is why Unicode defines multiple levels of comparison [1]. Assuming the default collation table, all mathematical alphanumerics are considered different from normal letters only at level 3, while the IPA `ɑ` is different at level 1 (but sorts immediately after all `a`s).

[1] https://unicode.org/reports/tr10/#Multi_Level_Comparison
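
A minimal sketch of the normalization half of this, using only the Python standard library (demonstrating the collation levels would need an extra library such as PyICU, not shown here):

    import unicodedata

    # Mathematical alphanumerics fold to plain letters under the
    # compatibility mappings, so an NFKC-normalizing search matches both:
    print(unicodedata.normalize('NFKC', '\U0001D44E'))  # 'a' (from U+1D44E)

    # The IPA `ɑ` has no compatibility decomposition and survives intact:
    print(unicodedata.normalize('NFKC', '\u0251'))      # 'ɑ'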

> After all, if I was writing algebra, I'd use the ASCII 'a' not the math glyph.

What I would type in TeX doesn't exactly relate to the actual encoding.


> If you cannot print Unicode text on a sheet of paper and know its meaning, then

then you've picked the wrong font for the job. This can happen even with ASCII, e.g. if you print out cryptographic secrets in base64 for use during disaster recovery and then discover you cannot distinguish O and 0 (capital o and zero) nor I and l (capital i and small L). (Hopefully you discover this immediately after printing, instead of after losing all your data.) That doesn't mean those identical glyphs should never have been encoded differently, it means font designers need to make them more distinct.
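
For instance, a hypothetical pre-flight check before printing such a secret (the script and its CONFUSABLE set are illustrative only, not a real recovery tool) could look like this in Python:

    import base64, os

    CONFUSABLE = set('O0Il1')  # glyphs many fonts render near-identically

    secret = base64.b64encode(os.urandom(24)).decode('ascii')
    overlap = sorted(set(secret) & CONFUSABLE)
    if overlap:
        print(f'warning: printout contains easily-confused glyphs: {overlap}')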



