It is:
1) More accurate: compared to hyperbole like "Bridging the Gap between Human and Machine Translation", we have right there in the title the domain: news.
2) A more impressive result: it was obtained on an independently set up evaluation framework, whereas Google's claim was measured on their own framework.
Compare further the papers: https://arxiv.org/pdf/1609.08144.pdf https://www.microsoft.com/en-us/research/uploads/prod/2018/0...
These researchers appear to have been much clearer about what they're actually claiming, and they also used more standard evaluation tools (Appraise) and methodology rather than something haphazardly hacked together.
The outputs of Microsoft Research are really good. At least in my field, it is one of the few places where, if they publish something, you can be sure of being able to reproduce the results using only what is described in the paper; no secret sauce required.
The issue with that evaluation was that the machines were much better at avoiding certain trivial mistakes humans don't care about (e.g. transcribing "umm", "err", etc.), but were more likely to get the meaning wrong. Kudos to MS for doing the error analysis and publishing that info, but I found the reporting of it misleading.
I don't see that claim being made. Microsoft Research has generally been left alone to do good computer science, in which "left alone" has included "by the marketing department".
As impressive as it may be, these people should refrain from claiming 'human-like' translation for a system that has no way of 'knowing' anything about context other than statistical occurrences.
It is certain that, on occasion, the system will make mistakes such as stating the opposite of what was actually said, or attributing an action to the wrong person, and whatnot. Perhaps on average it's as good as a person, but this system will make mistakes that disqualify it from being used without a bucket of salt.
Let's just define "human-like", in the context of machine translation, to mean "with full legal responsibility" from now on. Then let's see who claims their translator is "human-like".
If you defined it that way, not even human translators would meet the standard. Treaties and other official documents published in multiple languages always specify one as the "official" one for purposes of legal interpretation and that, in the event of conflict or confusion, the translations are subservient to it. Setting a bar for AI performance so high that even humans don't reach it seems unhelpful.
Actually, most international treaties specify that all language versions are equally authentic. Multilingual contracts, on the other hand, generally have a single authoritative version.
Humans, on the other hand, have all kinds of biases (intentional and unintentional, conscious or not) that creep in because they do have context.
The important thing is that we build systems that account for the process-based problems, not that we build components that are perfect. What matters more is: how frequently do these mistakes happen? What's the impact? Can we eliminate these errors with multiple layers of processes designed to identify the exploits?
This was much the point that John Searle made in his criticism of AI prognostications. Ultimately I think he'll be shown to be wrong but the time frame for this will be (I suggest) much longer than is currently touted.
Neural networks modify their behaviour by changing their data. Any non-trivial program does that. Some GOFAI programs (e.g. Eurisko) could modify their instructions, not just their data.
Searle's argument is confusing, but how the program in the Chinese Room is implemented doesn't matter. His argument is solely against strong AI. He claims that the Chinese Room (or a suitably programmed computer) cannot be conscious of understanding Chinese in the same way that people can. He doesn't deny that a suitably programmed computer could, in principle, behave as if it understood Chinese, even if it wasn't conscious of anything at all.
However, the machine translation program mentioned in the article behaves as if it understands Chinese only within the limited context of the translation. It wouldn't be able to answer wider questions about things mentioned in an article it had just translated.
Previously it was thought that machine translation systems would have to understand the text they were translating in the way a person does in order to produce a useful translation, but that has now been shown not to be true. Without hindsight it's a surprising result, but it becomes less surprising when you think of translation as pattern recognition, and of how a person might go about translating text on a highly technical subject they don't understand.
It's a completely meaningless and plain stupid argument, though. There's no reason to believe human consciousness is special, or that "understanding" Chinese is related to it at all. One can speculate that the perception of consciousness is simply the product of some form of introspective sensory system and internally directed actions, of which the former, ironically, clearly doesn't quite work when it comes to language (or it wouldn't be as much of a reverse-engineering exercise). Nothing really stops you from throwing that into your system and having it consciously understand Chinese, assuming it already maintains state.
You can, however, make a fairly solid argument that a CNN alone (as used in image/object recognition) is fundamentally incapable of dealing with images (but maybe not language): on the assumption that the task can be faithfully described as Satan's boolean satisfiability problem, then by virtue of complexity theory it can only be solved in constant time with a sufficiently massive lookup table (which there wouldn't be enough atoms in the universe to store). Microsoft are actually dealing with this in their system by repeatedly applying the network and revising the text.
Regardless, though, accurate NLP is going to come down to managing to codify how humans deal with objects, concepts and actions, because that's what the languages encode; GOFAI wasn't really too far off (and the original effort was doomed from the start by the state of hardware and linguistics). Consider how distinguishing objects as masculine-feminine-neuter and animate-inanimate(-human) is universal (but doesn't necessarily affect the grammar), and that the latter is based purely on how complex/incomprehensible the behaviour of something is (unlike grammatical gender, which seems to be fairly arbitrary). Of course that's arguable, but you can see animacy appear in English word choices (unrelated to anthropomorphic metaphors) and in how "animate" objects tend to be referred to as having intent. You could try to figure all this out the wrong way around using statistical brute force and copious amounts of text, but that's pretty roundabout, isn't it?
(Also, the assumption that an object's animacy is determined by predictability offers a pretty concise explanation of why the idea of human consciousness being produced by simple(r) interacting systems often fails to compute so spectacularly, why most programmers appear to be immune to that, and also why the illusion that image-recognition CNNs perform their intended function is so strong (regardless of how useful they are, the failures make it blatant that they're only looking at texture and low-level features, and are extremely sensitive to noise, which is the opposite of what anyone intended).)
I agree, the argument seems to come down to saying that human consciousness can't be replicated because human consciousness stems from a "soul" (a non-physical and undetectable element of someone that's at the root of their consciousness).
The amount of attention this argument has received has made me wonder whether the "rigor" used by philosophy departments is mostly just a way to obfuscate bad arguments.
That's not his argument at all. His argument is that just because you can do some task doesn't mean you "understand" it.
You don't need some fancy philosophy and complex thought experiments to see what he means. Just look at how people learn math. You can do calculations by memorizing algebraic rules, but that's not the same as understanding why those rules exist and what they mean. Even though you will calculate answers correctly in both cases, we all know there is a qualitative difference between them.
Back to Searle. His argument is that everything computers do is analogous to rote memorization and that transition to understanding requires something computers don't have.
Whether or not you buy his argument, two things are clear. First, there is a difference between just producing results and understanding the process. We have all experienced this difference. It's all theoretical as long as you stick to simple tests (like multiple-choice exams), but becomes relevant when you suddenly expand the context (like requiring the student to prove some theorem instead of doing a calculation). Second, we also know that for humans this difference isn't just quantitative. Memorizing more algebraic rules and training in their application will not automatically result in students gaining understanding of mathematical principles.
Thanks for rebutting Chathamization's gross misrepresentation of Searle's argument.
> It's all theoretical as long as you stick to simple tests (like multiple-choice exams), but becomes relevant when you suddenly expand the context (like requiring the student to prove some theorem instead of doing a calculation).
Not even expanding the context changes the situation. The proof of a theorem can be memorized without any understanding just as easily as algebraic rules.
> Second, we also know that for humans this difference isn't just quantitative. Memorizing more algebraic rules and training in their application will not automatically result in students gaining understanding of mathematical principles.
This is correct, and the same principle applies not just to humans but to computers too (which was the point of Searle's argument). No amount of computation is going to make a computer aware of, or able to understand, the meaning of the symbols. Ultimately, "meaning" is our perceptual awareness of existence, but that is a long proof for another day.
Any finite computational system can be implemented as a lookup table in the manner that Searle suggests. But that's not essential to the argument. You can imagine that the man in the room is following instructions for simulating a (finite approximation of a) Turing machine, or any other computational device that you like.
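To make that concrete, here's a minimal Python sketch (the three-input majority function is just a stand-in for any finite computation) showing that enumerating every input into a table reproduces the behaviour exactly, with no computation left to do:

    from itertools import product

    def majority(a, b, c):
        # Stand-in for any finite computation: 1 if at least two inputs are 1.
        return int(a + b + c >= 2)

    # Enumerate every possible input once and record the answer.
    table = {bits: majority(*bits) for bits in product((0, 1), repeat=3)}

    # The table now behaves identically on every input, which is the sense in
    # which behaviour alone doesn't settle whether any understanding is present.
    assert all(table[bits] == majority(*bits) for bits in product((0, 1), repeat=3))
    print(table[(1, 0, 1)])  # -> 1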
Impressive, but there's an easy formula for getting these systems to make mistakes: just input a sentence with some kind of long-distance dependency. For example, DeepL gets agreement right in English-to-Spanish translations when the two things that agree are close together:
I eat soup -> COMO sopa
They eat soup -> COMEN sopa
Impressively, it can even get agreement correct across clause boundaries in many cases. But if you do wh-movement through two or more clauses, you're usually out of luck:
Which boys does he say he believes eat soup?
->
¿Qué chicos dice que cree que COME sopa? [should be COMEN]
It doesn't really matter very much in practice if an MT system makes mistakes like this, but they are mistakes that you can rely on humans not to make systematically.
Hmm, that seems to have been fixed now. Here's a nastier case that shows a similar effect. (Nastier because the agreeing indirect object in Spanish is a subject in English.)
Which boys does she know like cheese?
->
¿A qué chicos sabe que LE gusta el queso? [should be LES]
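For anyone who wants to poke at this themselves, here's a rough sketch of how one might script these probes, assuming the official deepl Python client and a placeholder auth key (the expected substrings just encode the agreement argued for above, not output I've verified):

    import deepl  # official DeepL API client: pip install deepl

    translator = deepl.Translator("YOUR_AUTH_KEY")  # placeholder key

    # (English probe, substring the Spanish output should contain if
    # long-distance agreement is resolved correctly)
    probes = [
        ("Which boys does he say he believes eat soup?", "comen"),
        ("Which boys does she know like cheese?", "les"),
    ]

    for source, expected in probes:
        result = translator.translate_text(source, source_lang="EN", target_lang="ES")
        status = "ok" if expected in result.text.lower() else "FAIL"
        print(status, source, "->", result.text)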
AI researchers are certainly well aware of such issues, and there have been various models that try to encode long-distance dependency information, e.g. the LSTM.
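For readers who haven't run into that line of work: a toy PyTorch sketch (purely illustrative, not the architecture from the paper) of the basic idea, where the LSTM's hidden state is what lets a subject's number survive the intervening clauses so a model can pick the right verb form:

    import torch
    import torch.nn as nn

    class AgreementToy(nn.Module):
        # Toy classifier: read a token sequence, predict singular vs. plural verb.
        def __init__(self, vocab_size=1000, emb_dim=32, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 2)  # 0 = singular, 1 = plural

        def forward(self, token_ids):
            x = self.embed(token_ids)        # (batch, seq_len, emb_dim)
            _, (h_n, _) = self.lstm(x)       # final hidden state: (1, batch, hidden)
            return self.out(h_n.squeeze(0))  # (batch, 2) logits

    model = AgreementToy()
    fake_sentence = torch.randint(0, 1000, (1, 12))  # 12 arbitrary token ids
    print(model(fake_sentence).shape)                # torch.Size([1, 2])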
DeepL is what I use for European-language translations. I wish they would add more languages and maybe a nice app like Google Translate has. It blows every other translation service out of the water.
I can also confirm it's very impressive. I'm not sure how their setup differs from Google Translate's, but I've read articles translated with it and only knew they were translated because it said at the end of the article that DeepL had been used.
Translate "sentences of news" is very different to translating an entire article, which is obviously what's interesting.
Is anybody in MT or text comprehension/generation really working on systems that construct a model/"understanding" of the bigger narrative in a longer-running text? Even just to be able to do correct anaphora resolution across sentence and paragraph boundaries; but intuitively, WSD (word sense disambiguation) also seems easier if you've got some sort of abstract context over more than just a sentence.
I think Google Translate already has this. I was translating some text into German a few days ago, and after a few sentences I used a word that made it clear I was talking about a specific type of contract appointment, and it went back and adjusted earlier sentences to use more precise terminology. You only notice this when you a) speak the language you're translating into somewhat, b) actually type/compose the message in the Google Translate text box, and c) are typing something idiomatic enough that such specific phrases can be inferred. So I guess it's just something you wouldn't normally notice.
Either way, I was mightily impressed, to the point where my wife had to roll her eyes and say 'yeah yeah I understand it now' to get me to drop it. (I'm just easily excited I guess.)
I've found that Google Translate works decently (as far as these automated translations can be expected to work) from/to English; however, translating between a different pair of languages, the results are often very off in my experience. In particular, it seems to have some sort of internal bias, as if it always goes through an English-like intermediary representation.
For instance, if you ask Google to translate the Portuguese "báculo" into French, it gives you "personnel". That's nonsense as far as I can tell: a báculo is the crosier of a bishop[1]. So what's going on here? Well, if you translate it from PT to EN it gives you "staff", and suddenly it starts making sense, because while "staff" means "a long, straight, thick wooden rod or stick, especially one used to assist in walking" (which fits báculo), it can also mean "the employees of a business", which is an accurate definition of the French "personnel". And I believe that's how you end up with the nonsensical PT -> FR translation.
Similarly, Google used to be confused by the tu/vous (informal/formal) distinction that exists in many languages but not in English. At some point the Portuguese "tu és" would be translated into French as the formal "vous êtes" instead of the informal "tu es". This appears to have been fixed, however; I can't reproduce it at the moment.
Conjugations don't fare so well, however: for instance, the French imperfect "je chantais" is translated into the Portuguese preterite "eu cantei", even though "eu cantava" would make more sense, I think. Obviously with such short phrases I can't really be too harsh on Google's grammar; they're probably not optimizing for that case.
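The English-pivot effect described above is easy to mock up; the dictionaries below are made up purely to illustrate how composing PT -> EN and EN -> FR can land on the wrong sense once the English word is ambiguous (to my knowledge "crosse" is the French term for a bishop's crosier):

    # Toy, hand-written dictionaries purely to illustrate the pivot problem.
    pt_to_en = {"báculo": "staff"}           # "staff" = walking stick / crosier ...
    en_to_fr = {"staff": "personnel"}        # ... but the most frequent EN sense wins
    pt_to_fr_direct = {"báculo": "crosse"}   # a direct PT -> FR entry keeps the sense

    word = "báculo"
    print(en_to_fr[pt_to_en[word]])  # 'personnel' -- the nonsense result described above
    print(pt_to_fr_direct[word])     # 'crosse'    -- what a direct model could preserve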
That sounds extremely interesting. I had not noticed that feature before. Do you happen to have some example input at hand that triggers such an adjustment?
People have certainly worked on moving beyond sentence boundaries, although what is meant by understanding is always a bit nebulous. Certainly we need to make sure whatever process we are using has a sufficiently rich internal knowledge representation. One piece of work is this: https://github.com/chardmeier/docent/wiki which is a document-level, phrase-based statistical machine translation decoder. There have also been special-purpose evaluation tasks which include correct pronoun resolution, e.g. https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT... .
If you're interested, probably the DiscoMT workshops are a good starting point for some things people have tried.
I haven't read the paper to see how these are combined, but it makes intuitive sense that using multiple training methods can lead to better performance, that is to say, to searching the weight space of the network more effectively.
I find these kinds of "matches human performance" claims ridiculous, especially when it comes to Chinese -> English translation. Translation is both an art and a science, requiring a nuanced understanding of the languages, cultures, and context. It also demands quite a bit of creativity. No translation tool I've tried has come even close to matching the performance of a good human translator, including Microsoft's tools. AI will need to reach the point where its understanding of language, culture, and context, and its creative ability, match those of humans before it is truly capable of "human performance" in translation.
After reading the paper, my takeaway is that humans aren't really very good at translation either. None of the methods scores higher than 70% in the evaluation, and that includes several different human translations (whose quality varies greatly depending on how they were sourced). So while matching the quality of the average human translator is a great milestone, there's still lots of room to improve.
Most humans who attempt to translate are not actually translators in the proper sense. Specifically, they are not fully competent in both the source and target language and usually have no formal training in translation. Translation is hard, but there are competent people out there who do it extremely well. Sadly, as an industry it's not taken as seriously as it should be, and most of the people who are actually doing translations do not have the appropriate skill set.
I can offer some insight on this, as I've dealt with translations of papers before (Chinese to English). There have been countless times when I've been asked to check the translation of a paper and ended up retranslating it completely. I was often curious as to why the original translation was bad; it was typically one of the following two reasons: 1. The so-called translator was just an overworked research assistant who knows a bit of English. 2. The translation was outsourced to a dodgy company that spends more on marketing than on its translators, usually hiring college grads who did English-related majors.
Some people without a specific cultural understanding happen to translate content perfectly well, too. Experience and constant honing of translation skills are what count. If experienced and skilled translators cooperate in the development process, it could well be possible to reach human performance.
It's obvious that there are limits to how well machine translation can work unless the models have sensory grounding. I wonder if the problem is that people haven't figured out how to do sensory grounding or that the hardware is still too slow for it to work.
About translators relying solely on NNs: the thing is, while 70% of the output can be perfectly passable, some of the rest can be very weird if the original input was not something the network has learned, like a string of gibberish turning into 10 full sentences.
I'd suggest addressing non-English to non-English translation first, which is usually limited in most engines out there compared to translation to/from English.