Unsupervised Machine Translation Using Monolingual Corpora Only [pdf] (arxiv.org)
115 points by stablemap on Nov 2, 2017 | 33 comments



Neat! If I'm reading this right (smarter people please correct me if I'm wrong), the process they used is:

1- Train a system that translates language A sentences into a representation space, and can translate back from that space.

2- Train a second system that does the same, but with language B, onto the same representation space.

3- Train an adversarial system that looks at the representation space and tries to identify which language a sentence came from, and retrain the two translation systems to fool that recognizer.

The best way to 'hide' from the recognizer is for the two systems to produce very similar distributions, i.e. to use similar points in the representation space for the same concepts.

Brilliant stuff.
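
Here's a minimal PyTorch-style sketch of how I read that setup (my own toy reconstruction, not the paper's actual architecture or code): two encoder/decoder pairs sharing a latent space, plus a discriminator that tries to guess the source language while the encoders are trained to fool it.

  # Toy reconstruction of the adversarial shared-latent-space idea; all
  # architecture choices here are simplified stand-ins, not the paper's.
  import torch
  import torch.nn as nn

  LATENT = 256
  vocab_a = vocab_b = 10_000        # assumed toy vocabulary sizes

  class Encoder(nn.Module):
      def __init__(self, vocab_size):
          super().__init__()
          self.emb = nn.Embedding(vocab_size, LATENT)
          self.rnn = nn.GRU(LATENT, LATENT, batch_first=True)

      def forward(self, tokens):                    # tokens: (batch, seq) ids
          _, h = self.rnn(self.emb(tokens))
          return h.squeeze(0)                       # (batch, LATENT) sentence code

  class Decoder(nn.Module):
      def __init__(self, vocab_size):
          super().__init__()
          self.rnn = nn.GRU(LATENT, LATENT, batch_first=True)
          self.out = nn.Linear(LATENT, vocab_size)

      def forward(self, z, length):
          # Feed the sentence code at every step (crude decoding scheme).
          h, _ = self.rnn(z.unsqueeze(1).repeat(1, length, 1))
          return self.out(h)                        # (batch, seq, vocab) logits

  enc_a, dec_a = Encoder(vocab_a), Decoder(vocab_a)   # step 1: language A
  enc_b, dec_b = Encoder(vocab_b), Decoder(vocab_b)   # step 2: language B

  # Step 3: a discriminator guesses which language a latent code came from.
  disc = nn.Sequential(nn.Linear(LATENT, 128), nn.ReLU(), nn.Linear(128, 2))
  xent = nn.CrossEntropyLoss()

  def training_losses(batch_a, batch_b):            # batches of token ids
      z_a, z_b = enc_a(batch_a), enc_b(batch_b)

      # Each autoencoder must reconstruct its own language from the code.
      loss_rec = (xent(dec_a(z_a, batch_a.size(1)).transpose(1, 2), batch_a) +
                  xent(dec_b(z_b, batch_b.size(1)).transpose(1, 2), batch_b))

      # The discriminator tries to tell the codes apart...
      codes = torch.cat([z_a, z_b])
      labels = torch.cat([torch.zeros(len(z_a)), torch.ones(len(z_b))]).long()
      loss_disc = xent(disc(codes.detach()), labels)

      # ...while the encoders are trained with flipped labels to fool it,
      # which pushes both languages onto the same region of the latent space.
      loss_fool = xent(disc(codes), 1 - labels)
      return loss_rec, loss_disc, loss_fool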


I wonder if this relies on the sentences of the corpora referencing similar concepts. I'd be interested to see if it worked on a corpus of modern English and a corpus of classical Latin or Greek.


> the model starts with an unsupervised naive translation model obtained by making word by word translation of sentences using a parallel dictionary learned in an unsupervised way

So they DID have a parallel dictionary at first. How they learned it is another problem.
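
For context, that unsupervised dictionary can come from aligning the two monolingual embedding spaces (the general recipe from work like Conneau et al.'s "Word Translation Without Parallel Data"; I'm not claiming this is the exact method used here). Once a linear map W has been learned without supervision, the dictionary is just nearest neighbours:

  # Sketch of how a bilingual dictionary can be read off aligned monolingual
  # embeddings. Assumes emb_a (n_a x d) and emb_b (n_b x d) are word embeddings
  # trained separately on each corpus, and W is a d x d map already learned
  # without supervision (adversarially, or seeded from identical strings).
  import numpy as np

  def build_dictionary(emb_a, emb_b, W, vocab_a, vocab_b, k=1):
      # Map language-A vectors into language-B space and L2-normalise both
      # sides so that dot products are cosine similarities.
      proj = emb_a @ W
      proj /= np.linalg.norm(proj, axis=1, keepdims=True)
      tgt = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)

      sims = proj @ tgt.T                      # (n_a, n_b) cosine similarities
      best = np.argsort(-sims, axis=1)[:, :k]  # k nearest B-words per A-word
      return {vocab_a[i]: [vocab_b[j] for j in best[i]]
              for i in range(len(vocab_a))}

  # The "word by word translation" then just looks source words up in this
  # dictionary to produce the naive starting-point translations.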


Temporal drift of concepts is definitely a thing - there's a good paper from Stanford's NLP group showing it in word embeddings for English.

The interesting thing is that this model might automatically correct for it (although of course it couldn't translate things for which no concept exists).


I believe they also add a loss term corresponding to translating A to B using last iteration's model, then translating that B-sentence back to A and comparing to the original sentence.
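
Roughly something like this, reusing the toy encoder/decoder names from the sketch further up (my reading of the cross-domain loss, not the paper's code):

  # Sketch of that back-translation / cycle term. enc_b and dec_a are the toy
  # modules from the earlier sketch; prev_translate_a_to_b stands for the
  # frozen model from the previous iteration and is a hypothetical placeholder.
  import torch
  import torch.nn as nn

  xent = nn.CrossEntropyLoss()

  def cycle_loss_a(batch_a, prev_translate_a_to_b, enc_b, dec_a):
      # 1. Translate A -> B with the previous iteration's model. No gradients:
      #    it only supplies pseudo-parallel training data.
      with torch.no_grad():
          pseudo_b = prev_translate_a_to_b(batch_a)     # (batch, seq) token ids

      # 2. Translate the pseudo B sentences back to A with the current model.
      logits_a = dec_a(enc_b(pseudo_b), batch_a.size(1))

      # 3. The round trip should reconstruct the original A sentences.
      return xent(logits_a.transpose(1, 2), batch_a)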


It seems like there's a hard limit on how well this approach can work, defined by the extent to which the distribution of words in texts uniquely identifies their meaning without reference to any external semantics.

With monolingual corpora and no semantic grounding, all you can say is that certain words (or letters, morphemes, or whatever) covary in certain ways that are similar to the ways in which other words covary in the other language. It's super interesting that this is enough to identify some translation pairs across languages.
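
To put a toy version of that statement in code (all names here are made up for the example): the only cross-lingual signal available is how well a candidate word pairing preserves each language's internal similarity structure.

  # Toy illustration: with only monolingual data, translation amounts to
  # finding a correspondence that makes the two languages' internal
  # similarity structures line up.
  import numpy as np

  def similarity_structure(emb):
      # Within-language cosine-similarity matrix: word i vs word j.
      unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
      return unit @ unit.T

  def structure_mismatch(emb_a, emb_b, pairing):
      # How badly a candidate word pairing (A word i -> B word pairing[i])
      # preserves that structure; a perfect "translation" would make it zero.
      S_a = similarity_structure(emb_a)
      S_b = similarity_structure(emb_b)[np.ix_(pairing, pairing)]
      return np.abs(S_a - S_b).mean()

  # Anything the corpora don't distinguish distributionally leaves several
  # pairings with near-identical mismatch.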


In particular I wonder if it can get "left" and "right" the correct way around.


I'd imagine you could potentially get that via connections with strength, weakness, dexterity, etc.


I'm not sure I understand you... But the parent's comment is quite interesting: how could the system end up with a correct left/right understanding of our world, instead of a mirror world? Without experiencing the outside world, it seems impossible!


I guess there might be the occasional reference to the heart being on the left, or to most people being right handed. So among human languages it would eventually learn. But it wouldn't work as Star Trek's "universal translator": aliens wouldn't share these traits with us. I suppose we could get lucky and find an alien textbook describing CP violations.


The remark is that the two words, while apparently interchangeable, are not interchangeable in actual use. One is often used with a positive meaning related to dexterity and so on. They also have different political meanings. So they are not at all symmetric in actual use, much like every other word pair.


Not all of those differences are cross-cultural.


They definitely work in a lot of languages.

https://en.wiktionary.org/wiki/derecho#Spanish
https://en.wiktionary.org/wiki/direito
https://en.wiktionary.org/wiki/diritto
https://en.wiktionary.org/wiki/sinister#Latin
https://en.wiktionary.org/wiki/dexter#Latin
https://en.wiktionary.org/wiki/laevus#Latin
https://en.wiktionary.org/wiki/%CE%BB%CE%B1%CE%B9%CF%8C%CF%8...
https://en.wiktionary.org/wiki/recht#German

and many languages related to these. Some Slavic languages have close cognates linking words meaning 'right (direction)' and words meaning 'right (correct)', although I didn't immediately find examples in current use on Wiktionary where the two forms are identical. A "dated" example is Polish

https://en.wiktionary.org/wiki/prawy#Polish

Edit: a Finnish friend pointed out that this works in Finnish, which is European but not related to any of the other languages above.

https://en.wiktionary.org/wiki/oikea#Finnish

I don't know if these linguistic phenomena definitely occur in non-European languages, but I think that many people in the Middle East, many parts of Africa, and South Asia consider the left hand unlucky or impure not only because it's weaker in most people but also because it's conventionally used when going to the toilet. It would be a little surprising to me if that attitude didn't show up in languages from those places too.

Edit: it looks like it works in Arabic too

https://en.wiktionary.org/wiki/%D8%B4%D9%85%D8%A7%D9%84#Arab...
https://en.wiktionary.org/wiki/%D8%A3%D9%8A%D8%B3%D8%B1#Arab...
https://en.wiktionary.org/wiki/%D8%A3%D9%8A%D9%85%D9%86#Arab...


This is a good point. Still, for instance, the left/right pairing used to indicate political position is common across many languages.


Excellent example!


Very interesting. Can a language define all its concepts with language alone? I.e.:

Can a language map all its concepts to words and other concepts defined by just words?

Is there some sort of Gödel's incompleteness theorem for language?


> Can a language map all its concepts to words and other concepts defined by just words?

I don't see why natural language would have this property. It is certainly possible to construct codes such that the distribution over codewords does not have this property.


I also don't see why it would, but then that would of course be a limit to the idea in the paper.

I'm amazed that what the paper shows even works.


It seems extremely unlikely to me that it wouldn't. I conjecture that no such word exists in any widely used human language: that is, a word whose meaning is not contained in the span of meanings of all the other words in the language.

Can you think of one in English (or your native language)? You won't find it in any dictionary, since all those words have descriptions in terms of other words.


The example I've been thinking about is 'pain', which, according to Wikipedia, is "an unpleasant sensory and emotional experience associated with actual or potential tissue damage." The native Dutch word "Gezelligheid" has no direct English translation and thus apparently has a wiki page in English [0]. Now I wonder what the relevance is of the distance between word vectors from different languages relating to nearly the same concept (e.g. Gezelligheid to coziness).

[0]: https://en.wikipedia.org/wiki/Gezelligheid


Gullible?


easily deceived


actually, it's theoretically a hard problem -- https://aclanthology.info/pdf/J/J86/J86-4003.pdf

I would have thought that they needed to 'ground' (https://en.wikipedia.org/wiki/Symbol_grounding_problem) at least one shared reference word/concept so they could proceed to define everything relative to that and thereby communicate effectively and know how to map different languages to the same latent space. But I guess perhaps one can infer/approximate it statistically -- https://cis.upenn.edu/~ccb/publications/discriminative-bilin...


This is pretty interesting, and something I've played around with too (although not to the extent they have - I was playing with aligning word embeddings and using them for cross-lingual tasks).

Google released a paper [1] doing zero-shot translation between unseen language pairs. That relied on a shared representation which they called an "interlingua", and it seems quite similar to this.

[1] https://research.googleblog.com/2016/11/zero-shot-translatio...
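
The embedding alignment I mean can be as simple as an orthogonal Procrustes fit over a small seed dictionary (the generic textbook recipe, nothing specific to either paper):

  # Generic orthogonal-Procrustes alignment of two monolingual embedding
  # spaces. X and Y are (n x d) matrices holding the embeddings of n seed
  # word pairs (source word i <-> target word i).
  import numpy as np

  def procrustes_map(X, Y):
      # Find the orthogonal W minimising ||X W - Y||_F; closed form via SVD.
      U, _, Vt = np.linalg.svd(X.T @ Y)
      return U @ Vt

  # After that, source_vec @ W lives in the target space, so nearest-neighbour
  # lookups give cross-lingual word similarities "for free".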


https://en.wikipedia.org/wiki/Interlingual_machine_translati... is an old concept; it used to be seen as The Right Way (as opposed to the pragmatic way) of doing MT back in the day, before the success of (very pragmatic, low-knowledge) statistical approaches.

I find it interesting that in Google's system, they didn't set out to build an interlingua, but instead built a system that happened to end up with something like one …


This is pretty cool! But I would've liked to see some eval on smaller datasets (or just a graph of BLEU vs training set size), since the main use for something like this is where you don't have parallel data, in which case the monolingual resources are likely to be much smaller too. A few years back I was trying different methods to create bilingual dictionary word-pair candidates (to be post-edited by a linguist – checking is faster than writing) for Saami languages, and tried aligning pure word vectors, but the existing corpora were too small to get good vectors, so I gave up that avenue :(

Another neat approach is https://www.isi.edu/natural-language/mt/RLILdecipher.pdf , specifically designed for low-resource languages that happen to have a higher-resource related language (e.g. Haitian Creole–French). They train a source-language character-based "cipher model" (a weighted finite-state transducer that turns source characters/character pairs into target characters/character pairs) to maximise the probability of its output under a target-language language model.
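
As a very stripped-down illustration of that decipherment objective (a 1:1 character substitution hill-climbed against a target-language bigram model; the actual paper uses a weighted FST over characters/character pairs trained with EM, so treat this as a toy only):

  # Toy decipherment: search for a character mapping that makes the source
  # text look probable under a target-language character-bigram model.
  # Toy assumption: the source alphabet is no larger than the target's.
  import math
  import random
  from collections import Counter

  def bigram_logprob(text, counts, totals, vocab, alpha=1.0):
      # Add-alpha smoothed character-bigram log-probability of `text`.
      lp = 0.0
      for a, b in zip(text, text[1:]):
          lp += math.log((counts[(a, b)] + alpha) / (totals[a] + alpha * vocab))
      return lp

  def decipher(source, target_corpus, iters=5000, seed=0):
      rng = random.Random(seed)
      alphabet = sorted(set(target_corpus))
      counts = Counter(zip(target_corpus, target_corpus[1:]))
      totals = Counter(target_corpus)

      # Start from a random injection: source alphabet -> target alphabet.
      src_alpha = sorted(set(source))
      mapping = dict(zip(src_alpha, rng.sample(alphabet, len(src_alpha))))

      def score(m):
          return bigram_logprob(''.join(m[c] for c in source),
                                counts, totals, len(alphabet))

      best = score(mapping)
      for _ in range(iters):
          a, b = rng.sample(src_alpha, 2)      # propose swapping two outputs
          mapping[a], mapping[b] = mapping[b], mapping[a]
          s = score(mapping)
          if s >= best:
              best = s
          else:
              mapping[a], mapping[b] = mapping[b], mapping[a]   # revert
      return mapping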


What an awesome idea. I wonder how well this technique would work for different data types. Text and audio for example.


Most audio translation is first transcribed to a text representation, I think, and then that text is what's translated. If you assume a perfect speech recognizer (for the sake of argument), then the end translation should be the same.

Or am I misunderstanding your point?


>Or am I misunderstanding your point?

Only that you are talking about specifics, while I was talking in generality. (IMO) unsupervised learning of semantic relationships is a 'big deal'. This seems to be a very general method that would extend to any pair of data streams, but that hasn't been proven out yet.


Feels reminiscent also of CycleGAN (https://github.com/junyanz/CycleGAN)


At first I thought this was really insightful; now it seems like just a redefinition of meaning.

An analogy is an experiment where a scientist says "I can prove these two people have ESP and can pick the same number." He sits them in two separate rooms and has the first person pick a number. Then he goes to the second room and has the other person guess: 17? No, higher. 24? No, higher. 52? No, lower. 43? Yes, exactly!

"See, these two people both came up with the same number without any communication with each other. ESP!"

If they want to show this language learning is powerful, demonstrate that once trained it can now be applied to a third language without any new "adversarial feedback".

Or it reminds me of the fake "we can communicate faster than light via quantum entanglement" claims. But then comes the caveat that they can only figure out what the value of the "communicated" bit was by causal means.


I think you missed something.

Previously, machine translation required being trained on a bilingual corpus, that is, a corpus of the same set of sentences in e.g. English and French. These corpora are pretty hard to come by and expensive to produce.

The paper describes a technique to use two monolingual corpora instead, i.e. one set of sentences in English and a different set of sentences in French. That's way easier to find.

It's far from just a definitional trick.


>> These corpora are pretty hard to come by and expensive to produce.

Actually there are a lot of texts translated by qualified translators into several languages, for political reasons: the EU Commission's websites, and perhaps some other countries' official websites.

You can have a look at Linguee [0] which uses this to provide translation suggestions:

[0] https://www.linguee.com/



