
A few minor things missing in OP (mostly CJK related)

- A code point is sometimes not enough to determine the glyph. For example, U+5199 must look different in Simplified Chinese and Japanese. Typically this is handled by using different fonts, but more formally it should be marked with different lang attributes (in the case of HTML).

- Top-to-bottom writing mode is still pretty much in use in Japanese. HTML support is poor, but it's common in PDF. Caveat: Latin letters are rotated by 90 degrees (as explained in OP), but for punctuation, simply rotating the glyph isn't enough because the center line is going to be slightly off. You need special glyphs for rotated punctuation.

- Unlike Latin scripts, Chinese and Japanese text can mostly break lines between any two characters, but there are exceptions: contracted letters (e.g. ちょっ) cannot be split. So you end up treating each letter as a separate word, with contracted letters treated as one word.
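A rough Python sketch of that last rule, for anyone who wants to play with it. The `SMALL_KANA` set and the `break_units` helper are my own illustrative names, the set is not exhaustive, and real line breaking (prohibited line-start punctuation and so on) has many more rules than this.

```python
# Small kana that attach to the preceding character.
# Illustrative only: this set is an assumption, not a complete list.
SMALL_KANA = set("ぁぃぅぇぉゃゅょっゎァィゥェォャュョッヮ")


def break_units(text: str) -> list[str]:
    """Split text into units; a line break is allowed only between units."""
    units: list[str] = []
    for ch in text:
        if units and ch in SMALL_KANA:
            # A small kana may not start a line, so glue it to the previous unit.
            units[-1] += ch
        else:
            units.append(ch)
    return units


print(break_units("ちょっと待って"))
```

Every ordinary character becomes its own "word", while ちょっ stays together as one unit.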




Elaboration for "ちょっ"

Japanese hiragana chi, small yo, and small tsu.

Hiragana is a syllabary (not alphabet!) comprised of 46 sounds. Small hiragana can be used to modify the sound of the preceding character to expand the set of available phonemes. Here, "Chi (small)yo" becomes "cho".

Small tsu is special. It represents a doubled or germinated consonant.

For example ちょっと (chotto) sounds like chot-to. Example:

https://m.youtube.com/watch?v=kh9Kk_RHn8Q
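If you want to see the three characters individually, Python's standard `unicodedata` module can print their Unicode names. This is just a small sketch; `kana_names` is a helper name I made up. Note that the Unicode names use a different romanization (TI, TU) than the "chi" and "tsu" above.

```python
import unicodedata


def kana_names(text: str) -> list[str]:
    """Return the Unicode character name for each character in text."""
    return [unicodedata.name(ch) for ch in text]


for ch, name in zip("ちょっ", kana_names("ちょっ")):
    print(f"U+{ord(ch):04X} {ch} {name}")
```

The small characters are separate code points from their full-size counterparts, which is why software can tell them apart at all.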

Japanese is a really fun language. If you've ever had an interest in learning it, I recommend it. Reading kanji is a joy in and of itself, because it has a pleasant inner logic that gets deeper the more you learn.


Japanese is one of the few languages with a worse writing system than English. Hats off to them for it. I do miss the pre-war spellings though, where they wrote things based on 10th-century pronunciation, e.g. AU is pronounced OH (like in French).

Tibetan is also in pretty desperate need of spelling reform. I’d love to hear replies about other terrible writing systems.


In the scheme of writing systems it's hard for me to see why English is so bad. In general writing (like spoken grammar) is so heavily used that all the "rough corners" get sanded smooth pretty rapidly and continuously. For example, though the histogram of Kanji/Hanzi has a very steep drop-off, it's not that hard for people to recognize and keep track of, because characters that were hard to read at first tend to get restructured or dropped.

As far as Indo-European languages go, languages like English and French are quite conservative in spelling and underlying meaning, which means they do diverge from pronunciation but retain semantic clues which help the reader unlock the meaning of unknown words. Languages that continuously reform to try to match pronunciation shifts (e.g. German), by contrast, discard this info. In addition they have to pick one canonical pronunciation, and themselves typically retain various homophonic issues (e.g. how many ways can you write the sound of ä (I don't mean "ae")).

Even as a kid (should be "easy to learn", right?) I struggled far more with the variants of the Devanagari alphabet even though it was so phonetic. But hundreds of millions of people use it every day without thinking about it.


English is pretty bad unless you are a total orthography history nerd.

For example, people routinely mispronounce the name of the President of China. (It’s closer to Shi than Zi.) Why? Because English has a strong preference for retaining native spellings even when the other language has a completely incompatible set of pronunciation rules. With Chinese this is extra silly because pinyin is only one of a million ways of romanizing Chinese and basically the least English compatible method.

English spelling is like a Hofstadter puzzle: to be able to master it, you must master the spellings of all languages plus English itself recursively going back to the Great Vowel Shift. It’s not a good system.


Pinyin is requested by the Chinese government. They used to use Wade-Giles but changed iirc in the 70s and requested that others do so too.

It can’t really be called part of English in any meaningful way anyway.


We don't call the country "Zhongguo" in English, we call it "China". So why do we let them tell us to call it "Beijing" instead of "Peking"? It's craziness. We gave up a perfectly good name for "Canton" and replaced it with an unreadable mess that isn't even in Cantonese ("Guangzhou").

If people who speak Chinese want to use pinyin, more power to them. But English speakers need to stop just copying other people's romanizations even when the romanizations do not connect to our spelling system at all.


> For example, people routinely mispronounce the name of the President of China. (It’s closer to Shi than Zi.) Why? Because English has a strong preference for retaining native spellings even when the other language has a completely incompatible set of pronunciation rules.

This is objecting to the .01%--I find it hard to take seriously.

English has the really good feature that there is a phonic correspondence between spelling and pronunciation. It's not always one-to-one, and it's sometimes really weird, but it's generally there.

So, if someone sees Xi Jinping in writing, the utterance out of their mouth will be close enough for me to know what they are talking about.

However, even when someone completely botches the spelling of a word, there is almost always enough logic behind the error for the person looking at it to figure out the actual word meant. That's a really nice feature in a language. (For example--I have seen phlegm or subtle spelled in all manner of ways, but I generally could tell what was meant.)

By contrast, lots of the articles talking about Chinese highlight all the really common words that native speakers can't even cough up the kanji for because there is so little phonic correspondence.


> English has the really good feature that there is a phonic correspondence between spelling and pronunciation

It's absolutely not the case and it's actually one of the most mocked features of English on the internet. The pronunciation of the name of “Sean Bean” is a good example. As a non-native speaker, I have had a lot of pain learning the pronunciation of words I've only seen written.

In American English, you have at least one direction easy (from pronunciation to spelling) but not in British English. And going from spelling to pronunciation is really hard in both dialects.


> And going from spelling to pronunciation is really hard in both dialects.

But you can try. And it will kinda work.

If you pronounce "Sean Bean" as "seen bean" (most likely), "shawn bawn", or "say-an bay-an" (that's gonna require some thought for a native ...) people will scrunch their brow, think a bit, maybe chuckle, and have an idea who you are talking about--especially if they've been around you for a couple days.

English is remarkably error tolerant. You may not get it right, but you will get your point across.

With kanji you can't even try. This isn't about whether English is easy. It's about the fact that the simple act of going from pronunciation to kanji or kanji to pronunciation simply isn't possible at all.

(And choosing a formal name as an example is just asking for exceptions. Japanese, for example, has a whole class of kanji used basically for nothing except names--good luck pronouncing those ...)


> With kanji you can't even try.

I don't want to defend kanji/hanzi as a writing system. It's also in debt to history in a crazy way. But as a non-native reader of Japanese, you can often figure out the pronunciation and meaning of unfamiliar characters based on how they look. A large majority, maybe 90%, of characters have a meaning-part and sound-part, and once you know the common roots, you can get pretty far by just looking at things. You can also move to Korean and Cantonese pretty easily because those preserved the pronunciations of old Chinese pretty well. Ironically, Mandarin did a pretty bad job of preserving old Chinese, so it's harder to match up.

E.g. 楽 (music) is old Chinese nguk, Japanese gaku, Cantonese ngok, Korean ak, and Mandarin yue(??).


> English is remarkably error tolerant

I don't know what your primary language is, but as a French person I can tell you that no, English isn't that error tolerant and people don't understand what you mean if your pronunciation isn't good enough.

I still have traumatic flashbacks of my younger self desperately trying to buy water (pronouncing “wa” as in waffles and “ter” as in territory) in Canada when I was 15 (and really thirsty)… I think I spent a whole 5 minutes before a French Canadian arrived and saved my day.


Did you try aqua? Most educated English speakers should have known that word. :-(


It's actually Sean. Pronounced shawn. Be thankful we didn't get more from Gaelic, where "Laoghaire" is pronounced leery. :)

Don't agree US English makes it easier, it's just differently idiosyncratic.

Extreme example - ghoti: Pronounced fish. The gh from tough, the o from women, the ti from nation. Works in American English. :p

If I hear in the US "cull err" and go to spell, I don't end up at color, hearing "sell fone" doesn't lead me to cell phone, etc.

Half the problems of English spelling are the three (four in US English, with Webster) significant attempts to simplify spelling over the centuries, and outsourcing some temporarily illegal printing to the Dutch.


> Extreme example - ghoti: Pronounced fish. The gh from tough, the o from women, the ti from nation. Works in American English. :p

http://zompist.com/spell.html

Initial 'gh' can never be pronounced 'f', final 'ti' can never be pronounced 'sh', and 'o' is only pronounced 'i' in one word.


> It's actually Sean.

Indeed. Fixed, thanks

> Don't agree US English makes it easier, it's just differently idiosyncratic.

I found it easier when learning, but YMMV.


I'll defer then, as you're the non-native speaker who learnt the hard way. :)

It's just that a lot of things I might have thought American English would take the opportunity to clean up still remain. Like all those silent h's from having printed the first English bibles in Holland. Ghost, aghast etc. There used to be a lot more - ghospel for one!


> you're the non-native speaker who learnt the hard way

Well, even native speakers need to learn it the hard way as a child! As far as I'm concerned, I really struggled with French spelling when I was younger ;).

> Like all those silent h's from having printed the first English bibles in Holland. Ghost, aghast etc.

That's a really cool story. Thanks!


Any time this general topic comes up, I like to drop the Chaos Poem as an example, for anyone who hasn't yet encountered it: http://ncf.idallen.com/english.html


Tibetan and French usually come up on lists of worst correspondence between orthography and pronunciation. Irish is also pretty bad. (It makes sense once you learn how it works, then stops making sense once you hear people of different dialects actually pronounce words.)


French is terrible in one direction: pronunciation to spelling is a nightmare, as most sounds can be written in at least 3 ways. «un», «um», «in», «im», «ain», «aim» and «ein» are pronounced the same, so if you hear this sound, good luck figuring out how to spell it.

But the other way around isn't really bad: a spelling almost always has an unambiguous pronunciation. (There are exceptions but they are few and far between.)

At least if you don't count ancient French, which is still prominent in place names and has a totally arbitrary pronunciation. As a little game for French speakers, try pronouncing the name of the city of «Meung sur Loire» ;).


geminated not germinated haha it's not a seed :p


Now you’re just arguing seminantics.


> but more formally it should be marked with different lang attributes (in case of HTML).

Isn't that the point of UNIcode?

To unify all text into a single character set so that everything can exist side by side without messing with code pages?

Isn't this just code pages all over again?


"Han unification" (deciding that Chinese and Japanese characters were 'basically the same' and could be represented with the same set of codepoints) was a terrible idea, yes, mainly brought about by the Unicode Consortium not being run by people who spoke or wrote either language, and not wanting to "waste" their limited code points in the BMP on languages using an inconvenient number of characters.
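To make the U+5199 case from upthread concrete, here is a small Python sketch. The `tagged` helper is a made-up name, and whether the two spans actually render differently depends on the browser and the fonts installed, so treat it as an illustration rather than a guarantee.

```python
import unicodedata

CH = "\u5199"  # the character discussed upthread


def tagged(ch: str, lang: str) -> str:
    """Wrap a character in a span carrying an HTML lang attribute."""
    return f'<span lang="{lang}">{ch}</span>'


# One code point, one Unicode name, regardless of language.
print(f"U+{ord(CH):04X}", unicodedata.name(CH))

# The language lives in the markup, not in the character itself.
print(tagged(CH, "zh-Hans"))  # Simplified Chinese
print(tagged(CH, "ja"))       # Japanese
```

The text stays in one character set; the lang attribute is only a hint to the renderer about which regional glyph shape to pick.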


Top-to-bottom writing is prevalent not just in Japanese but in Chinese too. Pick up any professionally published book in Taiwan and chances are it's written top-to-bottom.


And Mongolian, Korean, Uyghur...

If you add support for archaic systems you have to handle vertical bottom to top and boustrophedon too.


Uh, vertical writing in Korea was famously abolished decades ago when the Chosun Ilbo, a prominent conservative newspaper notorious for its love of now almost dead hanja usage and vertical writing, finally gave it up on March 2, 1999.


It’s true — actually I thought it had been abandoned in the 70s but I figured that was recent enough not to be “archaic” like some of the obscure bottom-to-top scripts.


Also Arabic mixed with English goes RTL for the Arabic parts and LTR for the English parts: https://docs.oracle.com/javase/8/javafx/user-interface-tutor... (bottom)
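You can see where the direction comes from by looking at the bidirectional class Unicode assigns to each character. A minimal Python sketch using the standard `unicodedata` module; `bidi_classes` is a hypothetical helper name and the sample string is my own.

```python
import unicodedata


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each non-space character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text if not ch.isspace()]


# Latin letters followed by Arabic letters (written as escapes so the
# source itself is not reordered by the editor).
sample = "abc \u0639\u0631\u0628\u064a"

for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```

Latin letters come out as class L (left-to-right) and Arabic letters as AL (Arabic letter, right-to-left); a renderer uses these classes to lay out each run in its own direction.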



