
Can't imagine it being difficult for Korean.



Korean is especially difficult. Chinese uses only hanzi and you have a limited set of logogram combinations that result in a word. Japanese is easier because they often add kana to the ends of words (e.g. for verb conjugation) and you can use that (in addition to the Chinese algorithm) to delineate words - there is no word* where a kanji character follows a kana character. Korean on the other hand uses only phonetic characters with no semantic component, so you just have to guess the way a human guesses.

* with a few exceptions, of course :)
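
To make that concrete, here's a rough sketch of the kana-to-kanji boundary heuristic (my own toy illustration in Python, not anyone's production algorithm; real segmenters do far more than this):

    import re

    # Insert a boundary wherever a kanji (CJK ideograph, U+4E00-U+9FFF)
    # directly follows a kana character (hiragana/katakana, U+3040-U+30FF),
    # per the heuristic above. Exception words like 食べ物 get over-split.
    def rough_split(text):
        return re.sub('([\u3040-\u30ff])([\u4e00-\u9fff])', r'\1 \2', text)

    print(rough_split('私は本を読んだ'))  # -> '私は 本を 読んだ'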


Your comment is a mix of a few right things and a lot of wrong ones, so I’ll add some information so the casual reader doesn’t form an incorrect opinion based on it.

- Korean segmentation is way easier than Chinese and Japanese because it uses spaces between words (so word boundaries are explicit, unlike Vietnamese, which puts spaces on syllable boundaries and consequently still requires segmentation)

- Chinese and Japanese segmentation are hard NLP problems that are not solved, so they are in no way "easier" than the same task for other languages

- The limited set of valid character combinations that form words in Chinese doesn’t mean segmentation is easy, because there is still ambiguity in how a sentence can be split (see the toy example after this list). There is still no tool that produces a "perfect" result

- the difference in scripts is indeed used in some segmentation algorithms for Japanese, but it doesn’t solve the problem entirely

- the phonetic/non-phonetic parts of Chinese characters have been used in at least one research paper (too lazy to find the reference again; it didn’t work well anyway) but are not used in state-of-the-art methods. So contemporary Korean no longer using much Hanja has no influence on the difficulty of segmenting it
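
To make the ambiguity point concrete, here's a toy forward-maximum-matching segmenter (my own sketch with a hand-made dictionary, not a state-of-the-art method). Greedy longest match picks 研究生 ("graduate student") even though the intended reading is 研究 / 生命 ("research / life"):

    # Toy forward maximum matching over a tiny hand-made dictionary.
    DICT = {'研究', '研究生', '生命', '命', '的', '起源'}
    MAX_LEN = max(len(w) for w in DICT)

    def forward_max_match(text):
        tokens, i = [], 0
        while i < len(text):
            # Try the longest dictionary match first, fall back to one character.
            for j in range(min(len(text), i + MAX_LEN), i, -1):
                if text[i:j] in DICT or j == i + 1:
                    tokens.append(text[i:j])
                    i = j
                    break
        return tokens

    print(forward_max_match('研究生命的起源'))
    # -> ['研究生', '命', '的', '起源']  (wrong; a human reads 研究 / 生命 / 的 / 起源)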


Curious where you came up with "phonetic characters with no semantic component". It's just an alphabet with spaces, with each block representing a syllable. It's easier than Latin.
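
To illustrate the "alphabet arranged into syllable blocks" point: precomposed Hangul syllables are laid out algorithmically in Unicode (starting at U+AC00, ordered by initial/medial/final), so each block decomposes into its letters with plain arithmetic. A small sketch:

    # Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its jamo.
    INITIALS = list('ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ')            # 19 initial consonants
    MEDIALS  = list('ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ')         # 21 vowels
    FINALS   = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')  # 27 finals + "none"

    def decompose(syllable):
        idx = ord(syllable) - 0xAC00
        initial, rest = divmod(idx, 21 * 28)
        medial, final = divmod(rest, 28)
        return INITIALS[initial] + MEDIALS[medial] + FINALS[final]

    print(decompose('한'), decompose('글'))  # -> ㅎㅏㄴ ㄱㅡㄹ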


Yes, Hangul is one of only two scripts known to me that were designed with proper logic and regularity, which makes it easy to use. (The other is Tengwar.)


Korean uses spaces.


Old Korean lacked spaces. Modern Korean uses spaces.


I don't speak Korean, but my understanding is that some applications don't just split on spaces because of how spaces are used in loan words. The Korean for "travel bag" has a space in it, but you might want it as one token, for example. There's a fork of MeCab that has some different cost calculations related to whitespace for use with Korean.


That's interesting, but it would be nice to see the actual example. Not sure why you would use the loan word for travel bag, or what Hangul you're talking about exactly.


Sorry, "travel bag" is just an example I remember someone mentioning before. You can see the Mecab fork with example output here, but the docs are all in Korean so I can't really follow it.

https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/
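
If you want to poke at it, something like this should work with the mecab-python3 bindings once the Korean fork and its dictionary are installed (the dictionary path below is a guess and depends on your install, and the sample sentence is just an arbitrary placeholder, not the "travel bag" example):

    import MeCab  # mecab-python3 bindings

    # Assumption: mecab-ko / mecab-ko-dic are installed; adjust the -d path
    # to wherever the dictionary actually lives on your system.
    tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ko-dic')

    # parse() returns one analyzed morpheme per line (surface form plus the
    # dictionary's feature columns), terminated by an EOS line.
    print(tagger.parse('안녕하세요 세계'))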



