How does Chrome decide what to highlight when you double-click Japanese text? (stackoverflow.com)
269 points by polm23 on May 8, 2020 | 117 comments



ICU (International Components for Unicode) provides an API for this: http://userguide.icu-project.org/boundaryanalysis

Assuming Blink uses the same technique for text selection that V8 uses for the public Intl.v8BreakIterator method, that's how Chrome handles this: Intl.v8BreakIterator is a pretty thin wrapper around the ICU BreakIterator implementation: https://chromium.googlesource.com/v8/v8/+/refs/heads/master/...


Digging a bit deeper into the ICU code, it looks like the source code for the CJK break engine (used for Chinese, Japanese, and Korean) is here: https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ea...

and, according to the LICENSE file[1], the dictionary is built as follows:

   #  The word list in cjdict.txt are generated by combining three word lists
   # listed below with further processing for compound word breaking. The
   # frequency is generated with an iterative training against Google web
   # corpora.
   #
   #  * Libtabe (Chinese)
   #    - https://sourceforge.net/project/?group_id=1519
   #    - Its license terms and conditions are shown below.
   #
   #  * IPADIC (Japanese)
   #    - http://chasen.aist-nara.ac.jp/chasen/distribution.html
   #    - Its license terms and conditions are shown below.
   #

It's interesting to see some of the other techniques used in that engine, such as a special function to figure out the weights of potential katakana word splits.

[1] https://github.com/unicode-org/icu/blob/6417a3b720d8ae3643f7...


According to your first link, BreakIterator uses a dictionary for several languages, including Japanese. So I guess the full answer is something like:

Chrome uses V8's Intl.v8BreakIterator, which uses icu::BreakIterator, which, for Japanese text, uses a big long list of Japanese words to try to figure out what is a word and what isn't. I've worked on a similar segmenter for Chinese, and yeah, the quality isn't great, but it works in enough cases to be useful.
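
To give a rough feel for what a word list buys you, here's a toy greedy longest-match segmenter over a made-up mini dictionary. This is only an illustration of the dictionary idea, not ICU's actual algorithm, which works from the big frequency-weighted cjdict data and picks a best path rather than scanning greedily:

  // Toy illustration only: greedy longest-match against a made-up dictionary.
  // ICU's real CJK engine uses cjdict.txt with word frequencies and a
  // best-path search, not a greedy scan like this.
  var JA_DICT = new Set(["東京", "に", "住んで", "いる"])

  function greedySegment(text) {
    var words = []
    var i = 0
    while (i < text.length) {
      var match = text[i]  // fall back to a single character
      // Prefer the longest dictionary entry starting at position i.
      for (var len = text.length - i; len > 1; len--) {
        var candidate = text.slice(i, i + len)
        if (JA_DICT.has(candidate)) { match = candidate; break }
      }
      words.push(match)
      i += match.length
    }
    return words
  }

  console.log(greedySegment("東京に住んでいる"))  // ["東京", "に", "住んで", "いる"]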


What did you use for your Chinese word segmentation? I wrote Pingtype myself to help me learn by splitting words and translating them in parallel.

https://pingtype.github.io


For firefox there's a bug discussing the pros and cons of using ICU for boundaries, so at least they are aware of the issue.

https://bugzilla.mozilla.org/show_bug.cgi?id=820261


was looking for that last link myself -- thanks!


Firefox, in contrast, breaks at script boundaries, so it'll select runs of Hiragana, Katakana, and Kanji. Not nearly as useful, and definitely makes copying Japanese text especially annoying.
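
For comparison, script-boundary selection is roughly equivalent to grouping consecutive characters by Unicode script. Here's a toy sketch (not Firefox's actual implementation):

  // Toy sketch of script-run splitting (not Firefox's actual code).
  function scriptOf(ch) {
    if (/\p{Script=Han}/u.test(ch)) return 'kanji'
    if (/\p{Script=Hiragana}/u.test(ch)) return 'hiragana'
    if (/\p{Script=Katakana}/u.test(ch)) return 'katakana'
    return 'other'
  }

  function splitByScript(text) {
    var runs = []
    var prevScript = null
    for (var ch of text) {        // iterate by code point
      var s = scriptOf(ch)
      if (s === prevScript) {
        runs[runs.length - 1] += ch   // extend the current run
      } else {
        runs.push(ch)                 // start a new run
        prevScript = s
      }
    }
    return runs
  }

  console.log(splitByScript("日本語のテキストを選択する"))
  // ["日本語", "の", "テキスト", "を", "選択", "する"]

Note that a real implementation also has to deal with punctuation, the prolonged sound mark ー (which is Script=Common), and words like 行く that straddle a kanji/hiragana boundary, which is exactly where this approach stops matching intuition.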


Personally I find double click highlighting to be a useless feature in any language, but the Firefox approach is superior imo. Breaking at script boundaries is predictable behavior the user can anticipate whereas doing some janky ad hoc natural language processing invariably results in behavior that is essentially random from a user perspective. I tried out double click highlighting on some basic Japanese sentences on Chromium and it failed to highlight any of what would be considered words.

It's not like English highlighting does complex grammatical analysis to make sure "Project Manager" gets highlighted as one chunk and "eventUpdate" gets highlighted as two chunks; most implementations just break at spaces, like the user expects.


I use double-click highlighting, and the reason is mostly selecting passages of text when editing. Double-click highlighting makes it so I don't have to find the precise character boundaries for the first and last words in the passage. Instead, I can just double click the first word, roughly move my mouse to hover over the last, and copy or delete that entire passage.

Firefox's approach is fairly useless in this regard. Even if it's predictable from a technical perspective, it's not predictable for a reader who naturally processes semantic breaks rather than technical ones. Unlike in English, where a space is both semantic and visual, hiragana-kanji boundaries often don't mean anything. As a result, for me at least, Firefox's breaks feel a lot more random than Chrome's, which, while dodgy, are often fine.

Having used Firefox as my main browser since 2006, I remember when I discovered this feature in Chrome, and being shocked at how much of an effect that minor improvement had for me. It's not a deal-breaker, certainly, but it's become my one big annoyance with Firefox.


I'm Japanese and I agree that the Firefox behavior makes sense. For example, in text with kanji like "その機能拡張は、", the word "機能拡張" consists of the words "機能" + "拡張". In Chrome, double-click selects an individual part (like 機能), which is rarely the behavior I want. In Firefox, the whole word (機能拡張) is selected, which is usually what I want.


It's pretty painless to select the next word though. It's best for the computer to err on the side of caution in this case.

[Edit: Ah, it might not be common knowledge, but you can keep dragging after the double click to select more words. You can also hold alt while editing text to jump words, and even combine alt+shift and alt+delete. Windows and Apple differ on the boundaries wrt whitespace, but I'm fairly sure they're both alt. Word jumps while editing are an unappreciated superpower imo.]

Also FWIW Safari also supports this.

https://files.catbox.moe/c3bzva.mov


I know about the drag-after-double-click action, but I don't use double-click-to-select for Japanese text; I always select text by dragging because I prefer a predictable operation.

I don't know the Alt + Shift/Delete actions; how are they useful?


Word-jumping behaviour with the arrow keys on Windows uses Ctrl, not Alt. Holding Shift selects individual characters, while Ctrl+Shift selects words.


I think all of this just highlights (hah) that the way we think of human-language strings needs to change. They're not a stream of characters, they're their own thing with complex semantic rules. They should be represented as an opaque blob only manipulated via an API or passed to an OS for rendering etc.

Machine-readable strings can still just be an ASCII byte array, but we need to keep the two separate.


I feel like this is a conclusion you could only reach by having an irrational compulsion to defend the deficiencies of Firefox, and not being a regular user of the Japanese language.


I don't even use Firefox. You know what's really irrational? Trying to find ulterior motive in an argument instead of addressing its substance.


I don't use/speak Japanese, but the GP's conclusion makes complete sense to me.

Consistent behavior > inconsistent behavior, almost always. If I can anticipate what my computer is going to do, I can plan around it, even if it means a bit of extra manual work.


That doesn't work if the consistent behaviour is completely useless. If double click selected English tokens grouped by which half of the alphabet they came from, nobody would ever double click anything, they'd just click-drag highlight.


Are spaces completely meaningless in Japanese? I was under the impression they separated phrases.


Spaces have zero importance to the Japanese language itself, but they are occasionally used like punctuation. For example, some YouTube video titles will use spaces around names of things that could otherwise be hard to parse as not being part of the sentence; another example is when you're typesetting phrases in lyrics.

In general, whitespace characters have no place or significance inside a Japanese sentence, and most of the whitespace in Japanese typesetting is built into punctuation marks.

Furthermore, even most punctuation is optional in Japanese. The full stop 。 and comma 、 are mostly a matter of preference; sometimes spaces are used in place of full stops or commas.


I actually like this behavior more, since it is predictable. Sometimes it just works for occasionally looking up proper nouns, and you can already tell when it won't.


I think it depends on how engrained the double-click highlight is for you. For me, I double-click by default, since I almost always want to select at a word boundary in English. As a result, when I need to select Chinese or Japanese text, I'm always annoyed when my double click (which, in my mind, should always select a word) selects a nonsensical sub-sentence instead, and I have to then re-select it manually.


When you double-click, say, the word "double-click", do you expect to highlight either "double" or "click", depending on which part of the word you happened to click on? Or do you expect to highlight the entire compound? Does your expectation change when the phrase is written with spacing: "double click"? When written without spacing: "doubleclick"? Does your perception of the word boundaries change between these cases?

Now consider another anglophone, chosen arbitrarily; how likely does it seem to you that they would share the exact same set of expectations as you? Because if they don't, then one of you is going to have their expectations violated.

Now give it a try: do the highlights match your expectations? (Moreover: are those expectations consistent with where you placed the word boundaries?)

If your expectations of what double-click highlighting does for English are consistent with what actually happens when you double-click, it is deeply unlikely that the causality goes word boundary -> highlight expectation; it's more likely that double-click highlighting has influenced your sense of word boundaries.

But it's even more likely that your expectations of double-click highlighting don't actually match up with your intuition for word boundaries, and you just have a pretty good model of how double-click highlighting behaves independently of where you think the word boundaries are.

(The notion of word boundaries in this context is fraught anyway. Even setting aside the problem of compound words, the phonological word boundaries of Japanese don't match up with the lexicographical ones, in both directions.)


I agree, there's definitely flexibility in what I expect, and what other people will likely expect. Here's the thing though: when I double-click "double-click", I could reasonably expect "double", "click", or "double-click" to be selected. I wouldn't expect it to select everything between the last and next punctuation characters (as an imperfect analogy).

For me at least, double-clicking is like fuzzy string matching. It's an imprecise behaviour by nature, but should be designed to be useful. Firefox's behaviour in the case of Japanese, but especially Chinese, often isn't useful, not because it doesn't perfectly reflect linguistic expectations, but because it usually selects too much--meaning you can't correct your selection by moving the cursor back and forth, and instead, have to stop and select using a different method.

Take the following passage:

もうこんなことは終わってほしいと願うばかりだ

Different people may break apart もうこんなことは differently, but it's hard to argue that it's one word. Chrome (and Word) separate it like this: もう こんな こと は, and whether or not you agree that each one of those parts is a word, simply splitting もうこんなことは into parts semi-logically makes it smoother to edit (for me, at least! I understand most people might not be as dependent on double-click selection).

The same behaviour for Chinese is so useless that it's not even worth mentioning, since you'll always select either a large part of a sentence, or an entire sentence at once. I mean, I love the word 匹配的近似度用如下方法来度量.


Yeah, that behaviour seems spectacularly useless for Chinese.


It also prevents the dictionary lookup gesture of macOS from working in Firefox, since it selects the whole sentence and looks that up (which fails).


Side note, this already doesn't work if you don't have a Touchpad / Magic mouse. The normal workflow is Right Click ==> Look Up, but Firefox overrides macOS's normal right click menu. :(

Firefox is just not a very good macOS citizen, sadly.


You can still select the text and choose Look Up from the Services menu.


Oh, there is that, but it's too far away to be useful.

If I could set up a custom keyboard shortcut for the service, it would work great—but support for custom keyboard shortcuts also doesn't work in macOS Firefox[1]! (Unless you set a combination which Firefox doesn't already use for something, and it already uses everything sensible, so you have to set something long and annoying.)

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1333781


OP here, surprised to see this took off.

I actually work with Japanese tokenization a lot; I took over maintenance of the most popular Python MeCab wrapper last year, and I have another Cython-based wrapper that I maintain.

Word boundary awareness for Japanese is a pretty uncommon feature in applications, so I was surprised to see the feature had been in Chrome all along, even if it's buried and the quality has issues.

Anyway, thanks to everyone who tracked down the ICU implementation and the relevant part of Chrome!


This is often determined by Unicode and not the browsers specifically (though some browsers could override the suggested Unicode approach).

Each Unicode character has certain properties, one of which is whether that character indicates a break opportunity before / after itself.

I've done extensive research on this for my job, but unfortunately don't have time to do the whole writeup here. Here are several resources for those who are interested

Info on break opportunities:

https://unicode.org/reports/tr14/#BreakOpportunities

The entire Unicode Character Database (~80MB XML file last I checked)

https://unicode.org/reports/tr44/

The properties within the UCD are hard to parse; here are some references if you're interested (and a small lookup sketch below):

https://unicode.org/reports/tr14/#Table1

https://www.unicode.org/Public/5.2.0/ucd/PropertyAliases.txt

https://www.unicode.org/Public/5.2.0/ucd/PropertyValueAliase...
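
As a small illustration of working with that data, here's a toy lookup of Line_Break classes parsed out of the UCD's LineBreak.txt file. It only builds the property table; determining actual break opportunities still requires the full UAX #14 pair rules:

  // Toy sketch: read Line_Break classes from the UCD's LineBreak.txt, whose
  // data lines look like "0028;OP  # ..." or "3041..3096;ID  # ...".
  function parseLineBreakClasses(fileText) {
    var table = []
    for (var line of fileText.split('\n')) {
      var data = line.split('#')[0].trim()   // strip the trailing comment
      if (!data) continue
      var fields = data.split(';')
      var range = fields[0].trim().split('..')
      table.push({
        start: parseInt(range[0], 16),
        end: parseInt(range[1] || range[0], 16),
        cls: fields[1].trim()
      })
    }
    return table
  }

  function lineBreakClass(table, codePoint) {
    for (var r of table) {
      if (codePoint >= r.start && codePoint <= r.end) return r.cls
    }
    return 'XX'  // code points not listed in the file default to XX
  }

  // e.g. lineBreakClass(table, "。".codePointAt(0)) should give "CL"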

Overall, word / line breaking for no-space languages in Unicode is a very difficult problem. Where the UCD says there can be a line break isn't always where a native speaker would put one. To do it correctly you have to bring in natural language processing, but that has its own set of complexities.

In summary: I18N is hard!


The library that Chrome uses seems to use a dictionary[0], since you can't determine word boundaries in Japanese just by looking at two characters.

Your first link also says:

> To handle certain situations, some line breaking implementations use techniques that cannot be expressed within the framework of the Unicode Line Breaking Algorithm. Examples include using dictionaries of words for languages that do not use spaces

[0] posted in another top-level comment: http://userguide.icu-project.org/boundaryanalysis


Yea, CJK (Chinese, Japanese, Korean) breaking is particularly complex. Google has done a lot of work and has this open-source implementation, which uses NLP. It's the best I've personally come across:

https://github.com/google/budou


I wrote a Chinese Segmenter that is available on the web:

https://chinese.yabla.com/chinese-english-pinyin-dictionary....

It does basic path finding, and then picks the best path based on the following rules:

1) Fewest words

2) Least variance in word length (e.g. prefer a 2,2 character split vs a 3-1 split)

3) Solo freedom (this is based on corpus analysis, which tags characters with the probability of being a 1-character word). For example, 王家庭 is either "Wang Household" (王 家庭) or "Prince's courtyard" (王家 庭), and we split it as "Wang Household", because Wang 王 is a common name that frequently appears in isolation, while 庭 is less likely to appear in isolation. It is interesting that solo freedom works better than comparing the corpus frequency of "Prince" 王家 vs "Household" 家庭. (A rough sketch of this kind of scoring follows below.)
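
A minimal sketch of how that enumerate-and-score approach might be wired up (not the actual Yabla code; the dictionary, solo-frequency numbers, and weights below are all made up):

  // Toy version of "find all dictionary splits, then score them".
  // DICT and SOLO_FREQ are made-up stand-ins for a real dictionary and corpus table.
  var DICT = new Set(["王", "王家", "家庭", "庭"])
  var SOLO_FREQ = {"王": 0.8, "庭": 0.1}   // P(character occurs as a 1-char word)

  // Enumerate every way to split `text` into dictionary words.
  function allSplits(text) {
    if (text === "") return [[]]
    var results = []
    for (var len = 1; len <= text.length; len++) {
      var head = text.slice(0, len)
      if (!DICT.has(head)) continue
      for (var rest of allSplits(text.slice(len))) {
        results.push([head].concat(rest))
      }
    }
    return results
  }

  // Lower is better: fewest words, then least length variance, then prefer
  // 1-character words that commonly occur in isolation ("solo freedom").
  function score(words) {
    var mean = words.reduce(function (a, w) { return a + w.length }, 0) / words.length
    var variance = words.reduce(function (a, w) { return a + Math.pow(w.length - mean, 2) }, 0) / words.length
    var soloPenalty = words.filter(function (w) { return w.length === 1 })
      .reduce(function (a, w) { return a + (1 - (SOLO_FREQ[w] || 0)) }, 0)
    return words.length * 100 + variance * 10 + soloPenalty
  }

  function segment(text) {
    var candidates = allSplits(text)   // assumes at least one valid split exists
    candidates.sort(function (a, b) { return score(a) - score(b) })
    return candidates[0]
  }

  console.log(segment("王家庭"))   // ["王", "家庭"] ("Wang" + "household")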

It works reasonably well. A surprising number of people use it every day.


> the corpus frequency of "Prince" 王家

What? 王家 doesn't mean "prince". Or at least, there is no such dictionary entry in the ABC dictionary or in the 汉语大词典. I would expect 王家 to mean "prince's household", in the same way that 皇家 means "imperial household".

汉语大词典 has two glosses for 王家:

1. 犹王室,王朝,朝廷。 [Equivalent to "royal family"/"royal court".]

2. 王侯之家。 [An aristocratic household.]

There's no problem with the concept of the phrase 王家庭 meaning "prince's courtyard", since a home can easily contain a courtyard. But the phrase should arguably be segmented 王-家-庭. (Or not -- there's very little to distinguish the idea 'one word, "the prince's household"' from 'two words, "prince"/"household"'.) Regardless of that choice, the courtyard is being associated with a household, not a person.


So, to illustrate the situation: the input would be like "Royal, House, Hold", and whether it's supposed to be "Royal-Household" or "Mr. Royal's household" or "Royal House, hold ..." is up to context, right?


Correct, though just to be clear, "hold" doesn't correspond to anything in the Chinese. (I only mention this because on a character-by-character basis, "royal" and "house" are fairly decent glosses of 王 and 家.)

"King" is also a decently well-known English surname, so we can draw a pretty close analogy to the distinction between 王家, the royal household, and 王家, the 王 family. Compare the English sentences

1. That's the king's house. [The king lives there]

2. That's the Kings' house. [The Kings live there]


My Chinese ability rounds down to non-existent, and I apologize for constructing a questionable example. Wang as a family name, and Wang as king or (royal-), was the first thing that came to mind. Since you know Chinese well, do you know of a good example of a 1-2 vs. 2-1 split where both are reasonable but one is much more likely to be semantically correct?

I have been told there are quite long sentences that can humorously be segmented two ways with completely different meanings, but I can't find any. My favorite in English is that expertsexchange.com had to change its domain to experts-exchange.com :)


> I apologize for constructing a questionable example

No need; as an example of segmentation, the one you've presented is fine. The biggest problem, mostly irrelevant, is that "prince's courtyard" is pretty archaic. I was objecting to the translation-in-passing of 王家 as "prince". It's defensible to segment the "prince's courtyard" sense 1-1-1, but it's also defensible to segment it 2-1.

I don't have an example of the type you're looking for to hand, but I will see about finding one.


Thanks for sharing! I also wrote a Chinese segmentation and parallel translation program, with colours for the tones. Mine only takes one definition for the literal translation though, unlike yours which only handles a single line but with more translations.

https://pingtype.github.io


I don't think their own lexing backend is actually open-source, as Budou just relies on a choice of 3 backends (MeCab, TinySegmenter, Google NLP) to do the lexing. I'm assuming Google NLP performs the best, but that isn't free and certainly isn't open source.


Does this mean that highlighting doesn't work properly when you're offline?


I think the implementation used in Chrome is different from the ones used in Budou. The implementation in Chrome is dictionary based as one of the parent threads mentioned, and that is completely open-source albeit probably doesn't produce as good a result as their homegrown NLP stuff.


Can't imagine it being difficult for Korean.


Korean is especially difficult. Chinese uses only hanzi and you have a limited set of logogram combinations that result in a word. Japanese is easier because they often add kana to the ends of words (for e.g. verb conjugation) and you can use that (in addition to the Chinese algorithm) to delineate words - there is no word* where a kanji character follows a kana character. Korean on the other hand uses only phonetic characters with no semantic component, so you just have to guess the way a human guesses.

* with a few exceptions, of course :)


Your comment mixes a few right things with a lot of wrong ones, so I'll give more information so the casual reader doesn't form an incorrect opinion based on it.

- Korean segmentation is way easier than Chinese and Japanese because it uses spaces between words (the spaces are thus distinctive, unlike in Vietnamese, which uses them on syllable boundaries and consequently also requires segmentation)

- Chinese and Japanese segmentation are hard NLP problems that are not solved, so they are in no way "easier" than the same task for other languages

- The limited set of valid character combinations that form words in Chinese doesn't mean segmentation is easy, because there is still ambiguity in how a sentence can be split. There is still no tool that produces "perfect" results

- Differences in script are indeed used in some segmentation algorithms for Japanese, but that doesn't solve the issue entirely

- The phonetic/non-phonetic parts of Chinese characters have been used in at least one research paper (too lazy to find the reference again; it didn't work well anyway) but are not part of state-of-the-art methods. So contemporary Korean not using a lot of Hanja anymore has no influence on the difficulty of segmenting it


Curious where you came up with "phonetic characters with no semantic component". It's just an alphabet with spaces, with each block representing a syllable. It's easier than Latin.


Yes, Hangul is one of the two scripts known to me that were designed with proper logic and regularity, which makes it easy to use. (The other is Tengwar.)


Korean uses spaces.


Old Korean lacked spaces. Modern Korean uses spaces.


I don't speak Korean, but my understanding is that some applications don't just use spaces because of how they're used in loan words. The Korean for "travel bag" has a space in it but you might want it as one token, for example. There's a fork of MeCab that has some different cost calculations related to whitespace for use with Korean.


That's interesting, but it would be nice to see the actual example. Not sure why you would use the loan word for travel bag, or what Hangul you're talking about exactly.


Sorry, "travel bag" is just an example I remember someone mentioning before. You can see the Mecab fork with example output here, but the docs are all in Korean so I can't really follow it.

https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/


Yes. This seems to work even when you pass it Chinese while maintaining ja-JP as the language:

  // Intl.v8BreakIterator is non-standard (Chrome/V8 only); it wraps ICU's BreakIterator.
  function tokenizeJA(text) {
    var it = Intl.v8BreakIterator(['ja-JP'], {type:'word'})
    it.adoptText(text)
    var words = []

    // Walk the boundary positions and slice out the text between them.
    var cur = 0, prev = 0

    while (cur < text.length) {
      prev = cur
      cur = it.next()
      words.push(text.substring(prev, cur))
    }

    return words
  }

  console.log(tokenizeJA("今天要去哪裡?"))
It still seems to parse just fine, so it's most likely segmenting based on the text passed in rather than the locale.


The underlying library is actually using a single dictionary for both Chinese and Japanese: https://github.com/unicode-org/icu/tree/7814980f51bca2000a96...


Yep, the browser has the UCD info built into it (a simplification... but basically). Similarly, our mobile devices and various backend languages have the same data baked into them.

This is where there are sometimes discrepancies between how a given browser or device would output this data, as it could be working off of an outdated version of Unicode's data.

Some devices even override the default Unicode behavior. There are just SO many languages and SO many regions and SO many combinations thereof that even Unicode can't cover all the bases. It's all very fascinating from an engineering perspective.


It turns out that's because ICU uses a combined Chinese/Japanese dictionary instead of separate dictionaries for each language. Which probably is a little more robust if you misdetect some Chinese text as Japanese and vice-versa.


> The quality is not amazing but I'm surprised this is supported at all.

I find this line hilarious for some reason. Reminds me of the line about being a tourist in France, "French people don't expect you to speak French, but they appreciate it when you try"


Being a resident of France I have to say that French people are the opposite of that. Same for Japanese people. The only people who get genuinely excited as you butcher their language are Arabic speakers I've noticed.


As a tourist in Paris about 10 years ago, I felt I had better interactions when I butchered the language and then switched to English after the person I was speaking with did, than when I just started off in English.

I only knew a handful of phrases though, so anything off script and I was pretty lost.


I hate that. How are they supposed to improve their language level if not by trying to speak it?


Yeah, I think the intent of the phrase is to convey the condescending attitude you might encounter, but expressed it in a subtle way. Maybe too subtle!


Same for Indian people. Everyone gets super excited when foreigners speak even one single sentence in Hindi.


Was OP referring to the fact that the V8 namespace is available inside JSFiddle?

Because I was a bit surprised about that, and it made me wonder if opening this JSFiddle on Safari would work at all (I'm on a phone so I can't test).


That's funny. Not at all the way I read it, but it could totally be read that way. Made me do a double take.

Still, I think my original reading is correct, because I don't think there is any issue with the "quality" of V8 inside of JSFiddle. Meanwhile, imagining Chrome doing its best to identify real words in long strings of Japanese text and failing spectacularly just made me laugh again.


OP here, I was not referring to it working in JSFiddle, just the idea of Chrome shipping with a Japanese tokenizer. I've known people to do stuff like compile MeCab with Emscripten to get Japanese tokenization in Electron, so the fact that the functionality was there already (even with lower quality) was surprising.


The fiddle doesn't work on Safari.

TypeError: Intl.v8BreakIterator is not a function. (In 'Intl.v8BreakIterator(['ja-JP'], {type:'word'})', 'Intl.v8BreakIterator' is undefined)


Here’s a brief write-up [0] on techniques and software for Japanese tokenization.

[0]: http://www.solutions.asia/2016/10/japanese-tokenization.html...


I've been doing some work parsing Vietnamese text, which has the opposite problem. Compound words (which make up most of the vocabulary) are broken up into their components by spaces, indistinguishable from the boundaries between words.


Is that why the name of the country is sometimes spelled with a space, "Viet nam"?


Yes, that's how it is written in Vietnamese. To oversimplify: Vietnamese words are a collection of single syllables that are always separated by a space when writing.

"Viet Nam" is also, actually, the "official" English way to write it. (Check how the UN puts it on all their stuff.) However, most Europeans don't do that in their languages, so it usually gets written as Vietnam even by Vietnamese when they're writing European languages.


Given this property of Japanese text, is there wordplay associated with a string of characters with double/reverse meanings depending on how the characters are combined?


Not exactly double / reverse meanings, but there are several sentences that are traditionally used to test Japanese tokenizers. Most of them are tongue twisters, like this one:

すももももももももの内

"Japanese plums (sumomo) and peaches are both kinds of peaches"

In speech, intonation would make the word boundaries here clear, but in writing it looks odd.

This isn't a tongue twister, but tokenizers often fail on it:

外国人参政権は難しい問題だ

"Voting rights for foreigners is a complex problem."

外国人 (foreigner) / 参政権 (gov participation rights) is the right tokenization, but a common error is to parse it as 外国 (foreign) / 人参 (carrot) / 政権 (political power).


Yes. One that comes to mind is "[name]さんじゅうななさい" which can be either interpreted as "[name]-san, 17 years old" or "[name], 37 years old", depending on whether you interpret the さん(san) as an honorific or part of a number. (The sentence would usually be written in a combination of hiragana and kanji, but is intentionally written in all hiragana here to ensure ambiguity.)


Another one: "この先生きのこるには", which should be broken up at "この先/生きのこるには" to mean "To keep surviving going forward", but since 先生 (teacher) is such a common word, it jumps at your eyes and turns the sentence into "この先生/きのこるには" which means the nonsense "For this teacher to mushroom(verb)". Usually this doesn't happen because the "survive" part is written with more kanji as 生き残る, but here it is written in hiragana to make the きのこ(mushroom) part visible and further mess with lexing.

In both cases, some liberties have been taken with notation to intentionally encourage silly misreadings; it happens much less often in ordinary text.


Word segmentation in Japanese isn't that difficult, for the most part. The mixed scripts help quite a bit.

The really interesting question is how does Chrome decide what to highlight when you double-click Thai text? It's a non-breaking, (baroquely) phonetic script and the training set is much smaller.


I wonder if this is applicable to Chinese. In Chinese, one to four characters comprise a word. In post-revolutionary Chinese, all the characters in a sentence are run together and you have to parse the words in your mind. (It's worse in pre-revolutionary Chinese, where neither sentences nor paragraphs are punctuated.) (Pre-medieval European languages used to run all their letters together without word or sentence breaks. Torah Hebrew still does this.)


Yes, I wrote https://pingtype.github.io for Chinese word segmentation to help me learn.


So what is the most accurate way to tokenize CJK text?


By hand using a native speaker.


Is there an API for that?


Yes, but it hasn't been written yet.



You could try something with Mechanical Turk, I'm sure.


Surely someone must've come up with rules for it by now?


Are there rules (as opposed to heuristics) for splitting up English, like “words—separated haphazardly—by em–dashes?”


Safari seems to do this too.


TIL: Some languages do not have spaces


Fun fact: Japanese originally didn't have punctuation. It was a European introduction.

https://www.tofugu.com/japanese/japanese-punctuation/


From a very quick, cursory look, Japanese seems to have some sort of commas and periods. Shouldn't it be simple to just adopt spaces (and proper commas and periods and parentheses, which they actually use already) too?


Who would be both interested in and able to champion such a change, though? Spaced Japanese text would look at least as jarring to someone who is used to its present state as English text following the German capitalization pattern (all Nouns to be capitalised in all Contexts) would to an English speaker, and I doubt the present anglophone world could be persuaded to adopt the latter change even if it turned out to convey immeasurable advantages to computational processing of text.


Sort of, but then no one can agree on things like whether particles are part of the word or not. To the fluent reader, the kanji/kana switching breaks up the text enough to lex it. I'm non-native, and actually reading all-kana is much more difficult for me than the usual mixture.

Many all-kana texts employ spaces. All-kana with no spaces is a nightmare.


Agree with this; but then you come across a more philosophical debate: should the very nature of written language change to adapt to the needs of technology, or the other way around?

On one hand, language has always been an amorphous thing and has changed throughout history. On the other hand, could some changes alter its very nature and heart? No one agrees on those trade-offs, and I'd argue they likely never will.


> should the very nature of written language change to adapt to the needs of technology

This has always been the case. I mean, think of how modern Latin letter forms developed. Think back to the Roman Empire, when letters were etched into stone or wax tablets, with only capitals (before the invention of minuscule), drawn with straight lines that are easy to chisel. Think of the later periods, when many books were written by scribes using pen and ink, resulting in changes to letters so they could write entire words in one hand motion. Later still, we have the printing press and its demand that each letter be a discrete unit, doing away with the wavy flow of words. Today most people write using print letters simply because the majority of writing you encounter is in print, so that's what's easiest to read.

Those changes probably don't register to you as being as egregious as the one the parent proposes, because the person enforcing the change and the person developing the technology were one and the same: a native speaker modifying his own language to fit the tools at hand. Which I think brings us to a resolution: the native speakers of the language are the ones to decide how to change it to suit their everyday needs, including making it easier to produce and consume with the tools of the day.


> Think back to the Roman Empire, when the letters were etched in stone or wax tablets, with only capitals (before the invention of minuscule), drawn with straight lines that are easy to chisel.

To choose some random capital Roman letters, how about... SPQR? Those got inscribed all the time.

Roman letters don't really show any bias towards straight lines. Don't confuse the fact that we split the Roman letter V into two letters U/V with the idea that they didn't carve curved letters. Check out this awesome plaque from the year 90: https://www.timesofisrael.com/in-a-bronze-inscription-a-remn...


Interestingly, it’s easier for me to read that bronze plaque from the first century than a typical manuscript from the fifteenth century, in part because in the early modern period, there was a purposeful revival of the Roman inscription style for writing. https://en.wikipedia.org/wiki/Roman_square_capitals


> Which I think brings us to a resolution - the native speakers of the language are the ones to decide how to change it to suit their everyday needs, including making it easier to produce and consume with the tools of the day.

Agree with you here, it's well outside my rights as an English speaking American to have an opinion of any merit. However I would still struggle to call handing it over to native speakers a resolution. Many native speakers have incredibly strong feelings on both sides of the argument; and often for very good and valid reasons.

I'm glad the Unicode Consortium exists, because I am CERTAIN I don't want to be the decider or facilitator of these discussions. Way above my pay grade.


If the Japanese language needs to change, it won't be because some Western developer found it too hard to develop some obscure, unused highlighting feature for it. The HN crowd's fetishistic need to change everything for the sake of developer convenience really reflects poorly on programmers as a profession.


> To the fluent reader, the kanji/kana switching breaks up the text enough to lex it. I'm non-native and actually reading all-kana is much more difficult for me than the usual mixture.

Don't leap to conclusions about the benefit of breaking up the text. For example, reading Mandarin pinyin is much more difficult than reading characters, but neither of those involves a mixed script. (And heck, the pinyin might have word spacing! [1])

The reason is pretty obvious - characters contain a lot more than just phonetic information. That makes things harder on the writer, and easier on the reader.

[1] Pinyin produced for foreign learners usually has word spacing. Pinyin produced for Chinese children usually doesn't.


The squarish glyph-shapes aren't a good fit for spaces. Scripts with spaces have unevenly shaped letters so that one can quickly scan a word by its length and pattern of ascenders/descenders. In Japanese spaces do not aid this type of scanning, and conversely the uneven densities of the glyphs allow for fast scanning without the aid of spaces.


Spaces have been tried in Japanese. If you look at pure-kana (hiragana/katakana) texts aimed at grade schoolers and foreign language learners, there are spaces because without them there would be too much parsing ambiguity. Text with full kanji for adults do not use spaces, and this does not hurt comprehension at all.


> If you look at pure-kana (hiragana/katakana) texts aimed at grade schoolers and foreign language learners, there are spaces because without them there would be too much parsing ambiguity.

This cannot actually be true; the spoken sentences do not indicate word breaks, but grade schoolers (and toddlers) can parse them just fine.


Because there are accents in spoken sentences. Now, if you included accent marks in the letters, it might be easier to parse. Sadly, as far as I know there are no standard accent marks for kana.


While the prosodic contour of a sentence does help in parsing, there's a reason no writing system has ever taken notice of it. It's not necessary.

The most obvious way to see this would be to think about synthesized speech, which almost never uses natural prosody. Think of the voice of Stephen Hawking, or just a splicing of some prerecorded options. ("At the tone, the time will be / TWO / TWENTY / THREE / and / TEN / seconds.")


You want to change a written language for the express purpose of enabling computers to be able to detect word boundaries easier?


Note that Japanese and Chinese already changed to commonly use horizontal left-to-right text (in addition to vertical top-to-bottom/right-to-left text, which is still usual especially in "proper published typesetting") in large part because computers handled that much better for decades.


It's not so much that computers handled horizontal text better; it's that horizontal was the only way they handled it at all. Even to this day, vertical text layout in CSS is primitive.

Early Japanese on computers also used half-width kana but now it's all properly fullwidth.

Don't mistake technical limitations for choice.


I wonder if there are any Japanese sites out there that fully use the traditional style? Text top to bottom, right to left, and with the horizontal axis unbounded (to the left) instead of the vertical.


I'm sure there are some extremely design-heavy sites that do it, but by then you would be fighting the DOM/CSS pretty hard, as default behaviors for stuff like underline position and 縦中横 (short bursts of horizontal text within vertical text, for things like two-digit numbers) seem to be sketchy at best, and many things still don't seem to be uniform across browsers. UX would be affected as well, as things like text selection or right-clicking would feel quite awkward, as it is so exceedingly rare. While we Japanese are fully capable of reading both horizontally+LtR and vertically+RtL, it's not in our expectations for the latter to appear in online web text.


It's not entirely unprecedented. Modern Cyrillic shapes are in part influenced by the printing press: when it was being imported into Russia, they didn't want to mill an entire new set of characters, so they repurposed the Latin ones that look like Cyrillic characters and only added the ones that were missing. Of course, the parent's suggestion will never pass, as it would require everyone everywhere to start using spaces, which is more trouble than it's worth to native users of the language.


Where did you get that idea? Peter I created the Civil Script a few centuries after the adoption of the printing press in Russia.


It's just lighthearted pondering on the issue. Written language evolves, or is that only a standard applied to $YOUR_LANGUAGE when defending people not learning it?

I think having spaces is useful, too, for learning speed and expression.


A similar argument also applies to English. English isn't phonetic, which makes it hard to pronounce correctly unless you are learning from a native speaker. But I don't see written English evolving in a way that makes it easier for new learners to pronounce.


I'm a life-long English speaker and am 100% OK with changing the language to make it easier to learn.


Let’s make phase 1 be the unification of American and British spelling. That should be easy enough right?


It's not like they couldn't learn English. Oh, while we're at it, let's also force English into ASCII. And let's make it Lojban for good measure.


Double-click highlighting barely makes any sense in English, let alone in a language that doesn't use spaces. The fact that mobile browsers treat long presses as a double press for urls has been the bane of my existence. I doubt any native Japanese speakers use this feature.



