Two Dots Too Many (2008) (upenn.edu)
86 points by vilhelm_s on July 17, 2015 | 38 comments



Atatürk gave Turkey a lot of great things. The alphabet isn't one of them. The dotted/dotless "i" in Turkish is just plain broken and there's no way to support it without 100% confidence in the text's language (which is impossible in the real world because multi-language users are a thing).

For anyone not familiar with the problem, the two variants of the letter "i" in Turkish are considered two separate letters but neither the dotless nor the dotted "i" is compatible with the non-Turkish "i" in lower and upper case:

* the lower case of upper case "I" is the dotless "ı" (not "i" as everywhere else)

* the upper case of lower case "i" is the dotted "İ" (not "I" as everywhere else)
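
To make the two rules above concrete, here is a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` apply the language-independent Unicode mappings. The helpers `turkish_upper` and `turkish_lower` are hypothetical names introduced for illustration, and they assume the whole string is Turkish, which is exactly the assumption that tends not to hold for real-world text.

```python
DOTLESS_I = "\u0131"      # ı LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_upper(text: str) -> str:
    """Uppercase text under Turkish rules (hypothetical helper)."""
    # Map i -> İ first; ı -> I is already what the default mapping does.
    return text.replace("i", DOTTED_CAP_I).upper()


def turkish_lower(text: str) -> str:
    """Lowercase text under Turkish rules (hypothetical helper)."""
    # Map İ -> i and I -> ı before applying the default mapping.
    return text.replace(DOTTED_CAP_I, "i").replace("I", DOTLESS_I).lower()


print("istanbul".upper())         # default rules: ISTANBUL
print(turkish_upper("istanbul"))  # Turkish rules: İSTANBUL
```

The sketch only works if the caller already knows the text is Turkish; applied to English text it would turn "I" into "ı".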

This is basically a bug in the original specification (the 1928 alphabet reform) that was never fixed and now causes all kinds of problems for anyone trying to support it (because languages no longer live in isolation).


Language is what it is, and the "specification" was invented long before anyone needed precise algorithmic rules. IMHO, the real WTF is that Unicode decided to reuse the ASCII codepoints for "i" and "I" to encode the Turkish alphabet.

If you think about it, it doesn't make sense to use "LATIN SMALL LETTER I" as the lowercase form of "LATIN CAPITAL LETTER I WITH DOT ABOVE". Four new codepoints for Turkish ı/I and i/İ would have solved so many problems.


They didn't have a lot of choice: ISO/IEC 8859-3 had already established the wrong behavior by reusing i/I and only adding ı and İ. It's all been doomed from that point on.


Exactly. These are two completely different letters in Turkish, and neither of them should be represented by an English letter that does not exist in Turkish.


So... the Unicode standard was right to have separate codepoints for "greek capital letter omega" and "ohm symbol", and wrong to share the same point between "latin small letter s" and "seconds symbol"?
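
For reference, a small Python sketch of how the omega/ohm pair behaves, using only the standard `unicodedata` module. The constant names `OHM` and `OMEGA` are just labels chosen for this example.

```python
import unicodedata

OHM = "\u2126"     # OHM SIGN
OMEGA = "\u03a9"   # GREEK CAPITAL LETTER OMEGA

print(unicodedata.name(OHM), "/", unicodedata.name(OMEGA))

# The two are distinct code points...
print(OHM == OMEGA)

# ...but canonically equivalent: NFC normalization folds the ohm sign into omega.
print(unicodedata.normalize("NFC", OHM) == OMEGA)
```

So the separation exists at the code point level, but normalization treats the two as the same character.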


I would argue that it was, because there is no such thing, to my knowledge, as capital and lowercase seconds.

In this way the only dangers from accidental capitalization would be capitonyms* and possibly someone thinking you were shouting depending on the level of your capital letter abuse. <sarcasm> In the latter case I think we can all agree they deserve to be stabbed, right? </sarcasm>

*https://en.wikipedia.org/wiki/Capitonym


But Unicode isn't in the business of cataloguing all the possible meanings in the world. They're a catalog of glyphs.

Looking into it further, I see that indeed "Cyrillic Capital Letter A", "Greek Capital Letter Alpha", and "Latin Capital Letter A" are different, despite not actually being different in any way. That supports the idea that "Latin small letter i" and "Turkish small letter i with dot" should be different points. But it also supports the idea that "Latin Capital Letter A", "English Capital Letter A", and "French Capital Letter A" should all be different points, which they're not. This approach is incoherent.
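
A quick way to inspect the three look-alike capitals, sketched in Python with the standard `unicodedata` module (the list name `lookalikes` is arbitrary). It also prints the name of each letter's lowercase form, since the case pairs are one place where the three can be compared directly.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A
lookalikes = ["\u0041", "\u0391", "\u0410"]

for ch in lookalikes:
    lower = ch.lower()
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", unicodedata.name(lower))
```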

In Chinese writing, the same syllable shì is encoded differently depending on whether it means "be", "circumstance", "the world", "city", "style", "respectable person", "Shì, the surname", "room", "test", "look at", "indicate", "suitable", "knowledge", etc. And those examples I've listed are just the shì's that I personally know -- there are plenty more. This was, by any standard at all, a terrible, terrible decision (to be fair, Japanese writing is even worse). Why would you want Unicode to be more like that?


The reason Latin Capital Letter A is distinct from Cyrillic Capital Letter A and Greek Capital Letter Alpha but not English Capital Letter A or French Capital Letter A is that the latter two are actually the same thing whereas the first two are not.

For example, it might not be acceptable for a Cyrillic reader to find the Cyrillic A to be stylized the same way as a Latin A or Greek Alpha in multi-lingual text. Imagine a cursive handwriting font for example: the Latin letter "A" should likely appear in the same style as the other Latin letters whereas the Cyrillic letter "A" should likely appear in the same style as the other Cyrillic letters (but French and English -- e.g. in citations or quotes -- shouldn't be distinguished that way, except for punctuation).

The reason the s symbol for seconds isn't distinguished is that the SI explicitly defined the symbols as regular letters. Even the widely used cursive "l" (for litre/liter) is merely a legacy symbol (as is the capital "L", which is mostly used to avoid misreading it as the digit "1"). So while, for example, the sequence "km²" has a specific meaning of "square kilometers", it's intended to be represented as <k><m><superscript 2> not <SI prefix kilo><SI unit meter><superscript 2> (just like the English word "I" is represented as <Capital Latin Letter I>, not <English First Person Pronoun>).

The problem with Chinese writing is that each code point corresponds to a different glyph and there are many homophonic words (with different meanings and corresponding glyphs). Han Unification and CJK in general are complex subjects of their own, however -- even the experts seem to disagree on how CJK glyphs should be treated in detail (and it's not as easy as just saying Unicode is broken because it's written by white American men, because that's not at all true).


> The reason Latin Capital Letter A is distinct from Cyrillic Capital Letter A and Greek Capital Letter Alpha but not English Capital Letter A or French Capital Letter A is that the latter two are actually the same thing whereas the first two are not.

You say this, but you give no argument at all. What makes English A and French A the same thing in a way that English A and Cyrillic A aren't? From a Unicode perspective (at least, from the perspective that Unicode takes with respect to Chinese characters -- 青 and 靑 are different code points despite being the same character), "Cyrillic Capital Letter A" and "Cyrillic Handwritten Capital Letter A" would be different code points, since they look nothing alike (disclaimer: I don't know that in the particular case of A the printed form and the handwritten form look nothing alike, but I do know that in general the handwritten form of a cyrillic character may not resemble the printed form. Feel free to substitute some other letter where I say A).

> but French and English -- e.g. in citations or quotes -- shouldn't be distinguished that way, except for punctuation

Why not?

I have seen advocacy for using the distinction that Unicode already makes (!) between a fancily curved apostrophe that is part of the spelling of a word, and the same glyph, but used to end a quotation. This is stupid. You'd need to train everyone in the world (well, the relevant part of the world) to input the same thing differently depending on how they're using it. Even if you made the effort, it will never happen reliably -- people have enough trouble spelling words differently depending on how those words are used.

Finally, my original point about Chinese writing didn't refer to Unicode at all. Spoken Chinese is encoded into characters using a system that makes heavy reference to semantics. This is a terrible idea, and as a result Chinese speakers are severely retarded in the development of their ability to read. The claim I'm making is that mission-creeping Unicode to encode semantics rather than glyphs (much as written Chinese encodes semantics more than it encodes sound) is similarly a bad idea. The goal of Unicode is to represent writing, not meaning.


> You say this, but you give no argument at all. What makes English A and French A the same thing in a way that English A and Cyrillic A aren't?

"Latin" is used as the name of a writing system, which is used by both English and French. "Cyrillic" is also the name of a writing system, used by several languages. In some cases, there is a one-to-one relationship between named writing systems and languages, but not in these cases.

The distinction made in Unicode is between writing systems, not languages.

(Note that some valid renderings of a letter in one writing system may be similar to a letter in a different writing system, but the range of acceptable renderings is different between writing systems. This is generally not the case between languages using the same writing system.)


English and French do not use the same writing system. Here are some letters in the French alphabet that don't occur in the English alphabet (or the Latin alphabet!): ô é ç. (It is true that these are described in French as o, e, and c with diacritics. You can make this more rigorous by observing a language like German, or until recently Spanish, which promotes such marks as ö or "ch" to full letter status. Unless you're arguing that Spanish used to use a unique Spanish writing system, but recently switched to the more common Latin one?) If French and English use the same system, there is no argument that Turkish uses a different one.

You could argue that French A and English A are the same because they are identical by descent, but then you'd have to admit that Greek A and Cyrillic A are more of the same.

So I guess the question is: how are you defining "writing system"?


I mean, to be fair, Atatürk had no way to know that nearly a century later, people would be using things called computers to perform such strange operations as converting between lowercase and uppercase text.


Definitely not a bug in the alphabet.

It's a bit of a shame that Unicode doesn't cover the issue automatically, though. It could have custom code points for the pairs that convert case correctly without having to write special cases for handling the Turkish language.

https://en.wikipedia.org/wiki/Dotted_and_dotless_I


Yeah, the part of the Unicode standard that talks about case-mapping and how you might want to do it differently in Turkish has a strong undertone of "of course you aren't really going to do this".

There's almost a solution you could hack in if you could convince people and input methods to stop using plain-ASCII "i" and "I" in Turkish (which is quite a hurdle). You can stick a combining dot on the "i", which doesn't change its appearance, but now it uppercases correctly. But there's no "combining lack of dot" you can put on "I" to make it lowercase correctly.
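
The combining-dot hack can be sketched in Python as follows. This is only an illustration of the idea under the default (non-Turkish) case mappings; `COMBINING_DOT` and `dotted_i` are names made up for the example.

```python
import unicodedata

COMBINING_DOT = "\u0307"         # COMBINING DOT ABOVE
dotted_i = "i" + COMBINING_DOT   # "i" followed by an explicit combining dot

# Uppercasing yields "I" + combining dot, which NFC composes into İ (U+0130).
upper = unicodedata.normalize("NFC", dotted_i.upper())
print(upper == "\u0130")

# The default lowercase of İ goes back to "i" + combining dot.
print("\u0130".lower() == dotted_i)

# There is no equivalent trick in the other direction:
# plain "I" lowercases to "i", never to dotless ı (U+0131).
print("I".lower() == "\u0131")
```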


The obvious parallel is modifiers in other languages.

For example, umlauts in German can be considered as the regular letters A, O and U with diacritics[0]. In the same way, if the Turkish dotless upper case I was encoded as a separate letter from Capital Latin Letter I, Turkish ı could receive a "dot above" modifier to become i and Turkish I could receive the same modifier to become İ.

[0]: I'm German, so I'm aware that it's a bit more complex than that, but although we have separate "letters" for Ä, Ö and Ü, they're really just AE, OE and UE ligatures, respectively. You could argue that they're not because replacing them with the non-ligature version changes meaning but thanks to modern technology (and historical domain name restrictions) most Germans can deal with that kind of ambiguity just fine. Besides, the diaeresis in German umlauts is actually a vestigial lower case "e" (as can be seen in old manuscripts).
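
For comparison, this is how Unicode currently decomposes the letters in question, sketched in Python with the standard `unicodedata` module (`decompose` is a hypothetical helper). Ä and İ split into a base letter plus a combining mark, while dotless ı has no decomposition at all.

```python
import unicodedata


def decompose(ch: str) -> list[str]:
    """Return the names of the code points in ch's canonical decomposition."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


for ch in ["\u00c4", "\u0130", "\u0131"]:   # Ä, İ, ı
    print(ch, decompose(ch))
```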


>* the lower case of upper case "I" is the dotless "ı" (not "i" as everywhere else)

>* the upper case of lower case "i" is the dotted "İ" (not "I" as everywhere else)

It does make sense though.


Actually, it's asymmetric in that the upper case of lower case "j" is the dotless "J".


Does it, though? Would you also say it would make more sense for the lowercase B to have two enclosed spaces instead of just one?


Would it make more sense to have just one glyph for a letter instead of two? Of course. We're not getting anything out of making the distinction.


Sure, but only if you can convince every other alphabet to either replace their lower case "i" with the dotless "ı" or their upper case "I" with the dotted "İ".


>The dotted/dotless "i" in Turkish is just plain broken

A Turkish person looking at the English alphabet can make exactly the same argument in the other direction - plus, why does English only have five graphemes for vowels?


Arguing in the other direction isn't as persuasive, because Turkish and English aren't the only two languages in existence. French, Dutch, Spanish, Italian, German, and many others use the same letter i and have more vowel sounds than vowel characters. That's nobody's fault in particular, though. Unlike Turkish, these languages did not have their writing systems dictated from on high.


For example, Dutch has the letter IJ.


... which actually has a ligature of its own, making it more similar to German umlauts (give or take a few hundred years of degrading handwriting shorthands).


That sounds awfully like reinventing grue/bleen.

https://en.wikipedia.org/wiki/New_riddle_of_induction


In principle I would agree, but the Atatürk alphabet was made up on the spot based on the existing Latin alphabets.


OK. But there are many, many aspects of our own language which don't really make sense. And that would go for pretty much all of them except I guess Esperanto.


Honestly though, what killed these people was stupidity - surely if someone says something this nasty where there's an obvious ambiguity in the language, stabbing them over it without asking first is just ridiculous behaviour.


Completely agree. My language (Spanish) also has accents (tildes), tildes (virgulillas) and two dots (diéresis). But when texting, people don't always write them, so you only notice when an ambiguity arises. They were also clearly having a fight over the phone, which is even more reason for the text not to be composed completely correctly.

More to the point, they ended up dead because they were already quite upset and were sufficiently violent people to reach that point, so also not localization's fault!


Yeah, blaming this one on a weird character in text is a stretch. I'm not sure "honor killings" rank on the list of why double-checking localized fonts on devices is important.

The article does acknowledge you shouldn't kill someone until you investigate a possible misunderstanding...


Sure. And the bug should also be fixed.

Pretty much everyone here is changing the subject, so to speak, which I find telling in itself.


Telling of what, exactly?


Solid advice in this article: "Another is that it is best not to kill people who make you angry until you have carefully investigated the situation, if then."


> The Turkish newspaper Hürriyet reports a tragic consequence of the failure to localize cell phones.

Which assumes that murder is an appropriate response when someone calls someone else a whore...


Autocorrect kills


In this case (as reported), the message was correct, but the recipient's phone wasn't capable of displaying Turkish letters.


> Killing people kills

FTFY.


Wow, what a disaster.



