NLP's ImageNet Moment: From Shallow to Deep Pre-Training

narrator · on July 9, 2018

I was at a deep learning conference recently. The topic of how AI can improve healthcare came up. One panelist said that a startup they were working with wants to help doctors use AI to use NLP to send claims to insurance companies in a way that won't be rejected. Another panelist said that he was working with another startup that wants to use AI and NLP to help insurance companies reject claims.

I think in the future we'll see their AI fighting against our AI in an arms race similar to the spam wars. The one with the most computing power and biggest dataset will win and humans will be at their mercy.

jahabrewer · on July 9, 2018

> One panelist said that a startup they were working with wants to help doctors use AI to use NLP to send claims to insurance companies in a way that won't be rejected. Another panelist said that he was working with another startup that wants to use AI and NLP to help insurance companies reject claims.

Sounds like GAN in meatspace.

vokep · on July 10, 2018

Yep, They compete endlessly, while we enjoy hyper-accurate decisions on these things, leading to greater efficiency of both.

bigiain · on July 10, 2018

Yes.

But quite possibly "greater efficiency" according to a fitness function that's not accurately mapped onto "keeping humans alive"...

I wonder if this'll end up in an equivalent state to the "tank detection neural net" which learned with 100% accuracy that the researchers/trainers had always taken pictures of tanks on cloudy days and pictures without tanks on sunny days? ( https://www.jefftk.com/p/detecting-tanks )

Who'd bet against the doctor/insurer neural net training ending up approving all procedures where, say, the doctor ends up with a kickback from a drug company - instead of optimising for maximum human health benefit?

Rainymood · on July 10, 2018

>But quite possibly "greater efficiency" according to a fitness function that's not accurately mapped onto "keeping humans alive"...

Since when was this ever the case? Especially in America? The US healthcare system is NOT built around providing adequate care for everyone, as far as I've read/heard.

Full disclosure: West-EU citizen here

aoeuid · on July 10, 2018

Note that that story is apocryphal: https://www.gwern.net/Tanks

mockingbirdy · on July 9, 2018

> humans will be at their mercy.

It was always like this. In my opinion it doesn't make a difference if some guy is more intelligent and is therefore able to suppress others or if he uses an AI that is more intelligent. For me the result is the same: I get rekt.

rasz · on July 11, 2018

There exists a pain threshold you dont want people to cross due to your automation, otherwise your developers/executives risk being killed by customers. Luddites, Ted Kaczynski, Nasim Aghdam, etc.

Google/Apple shuttle buses are being shot up with pellet guns today, imagine what happens when big AI corps openly work against population. Google AI/Amazon Rekognition protests suggest at least some employees have a shred of self awareness and survival Instinct.

sgt101 · on July 9, 2018

Meanwhile, Thomas Sandholm's company built and deployed and AI system that supports all organ transplants in the USA.

No deep learning though.

rusbus · on July 9, 2018

For more detail plus working code, lesson 4 of the fast.ai course uses this technique to obtain (what was at time of writing) a state of the art result on the imdb dataset:

http://course.fast.ai/lessons/lesson4.html

By training a language model on the dataset, then using that model to fine tune the sentiment classification task, they were able to achieve 94.5% accuracy

jph00 · on July 9, 2018

Well spotted - this is where I first created the algorithm that became ULMFiT! I wanted to show an example of transfer learning outside of computer vision for the course but couldn't find anything compelling. I was pretty sure a language model would work really well in NLP so tried it out, and was totally shocked when the very first model best the previous state of the art!

Sebastian (author of this article) saw the lesson, and was kind enough to complete lots of experiments to test out the approach more carefully, and did a great job of writing up the results in a paper, which was then accepted by the ACL.

dantheman · on July 9, 2018

For anyone just getting started on this I can't recommend fast.ai enough. It's extremely well done and I found it very intuitive. You are able to quickly apply some very advanced techniques.

sweezyjeezy · on July 9, 2018

I want to like fast.ai more, but in my opinion their code quality is just not good enough. Everything is badly named, badly explained / not explained at all, and if you need to adapt any of their code to work in different domains, you are going to have a bad time.

jph00 · on July 9, 2018

The naming convention is fully documented and based on decades of research : https://github.com/fastai/fastai/blob/master/docs/style.md

It's fine if you don't like it (although it may be you're just not quite used to it yet), but I'm not sure it's fair to call it "bad".

Everything is fully customisable in pytorch and this is explained with many examples in part 2 of the course. Written documentation is being worked on as we move towards a first release later this year (currently the library is still pre-release - it works fine and is used at many big and small companies, but there's still much to do to get it to a v1.0).

meesterdude · on July 9, 2018

I agree with the poster you're replying to - the naming is not intuitive or approachable without that previous body of knowledge. It's enough that I would call it bad - I would flag the crap out of it in a code review.

I'm glad you linked to the style guide - i've often wondered where certain names come from.

It's not all bad. K for key, V for value, i for index... no disagreement. But the seeming aversion to any variable name longer than 2-3 characters might be great in mathematics, but makes for a nightmare of code which is focused more on writing. It doesn't need to go the extreme of UIMutableUserNotificationAction; but at least use words.

AI is already clever enough; it doesn't need to be made more cryptic with poor naming conventions. I get that for people in the industry it may be fine - but for it to break out into general use it'll need to be more social; which means simpler & clearer verbage.

sweezyjeezy · on July 10, 2018

> It's fine if you don't like it (although it may be you're just not quite used to it yet), but I'm not sure it's fair to call it "bad".

It is totally fair to call a naming convention where all variables are 2 or 3 letter acronyms bad, at the very least un-pythonic, and certainly not suitable for an educational tool. If you think that it's good practice to make your code shorter at the expense your users needing an abbreviation guide, you're going to make adoption a much harder sell, especially if you're pitching at python devs. Unlike the last few decades, we now live in an age of autocomplete, there is absolutely no need for this.

Here is a few lines from the imdb ipython notebook (https://github.com/fastai/fastai/blob/master/courses/dl2/imd...), that gave me a headache just to look at -

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)

val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)

md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

The cognitive load put on someone unfamiliar with your code is not acceptable:

LanguageModelData takes two LanguageModelLoaders as arguments, and then later in the code it produces a model? Surely these two classes are named the wrong way round? You haven't explained what 'bs', or '1' are supposed to be. To figure out/ remember what the other variables are, you need to chase up and down the notebook, which would easily be avoided if you just gave them descriptive names. You only use these variables once or twice, so I don't even understand what is gained by making their names so terse.

newen · on July 11, 2018

Regarding the short variables names, I actually like it a lot. It's nice because you don't have to read as much.

Once you are familiar enough with the abbreviations, it becomes much easier to read. Just from looking at this code, I can already tell you bs is batch size, bptt is back-projection through time. And then probably dl is data loader, lm is language model, md is model data. No idea what vs or 1 is but I can just look at the docs for LanguageModelData, which takes what? less than 30 seconds to read?

I think it's worth it because you can just look at one line of code and know what's going on. Instead of parsing through a paragraph of code that does the same thing.

elyase · on July 10, 2018

fast.ai is great and I also want to love it but I find the coding style extremely unconventional and uninviting to the point that I only use the library when there is really no better alternative (ex for the lr finding features). Variable naming and wildcard imports would be my main complains. But thanks a lot Jeremy for the course and for posting the style guide.

mikert5671 · on July 9, 2018

Why is this in the comments section of every AI article?

cs702 · on July 9, 2018

The title is a little too click-baity for my taste ("has arrived," huh?), but I think the OP is unto something.

It is now possible to grab a pretrained model and start producing state-of-the-art NLP results in a wide range of tasks with relatively little effort.

This will likely enable much more tinkering with NLP, all around the world... which will lead to new SOTA results in a range of tasks.

zawerf · on July 9, 2018

Do you have links for these pretrained models? The only one I am aware of is OpenAI's where they fine tuned a Transformer architecture for 1 month on 8 gpus:

https://blog.openai.com/language-unsupervised/

https://github.com/openai/finetune-transformer-lm

sebastianruder · on July 9, 2018

You can find ELMo here: https://github.com/allenai/allennlp/blob/master/tutorials/ho...

And ULMFiT here: http://nlp.fast.ai/category/classification.html

cs702 · on July 9, 2018

For those who don't know, Sebastian Ruder is a coauthor of the ULMFiT paper: https://arxiv.org/abs/1801.06146

romaniv · on July 9, 2018

>It is now possible to grab a pretrained model and start producing state-of-the-art NLP results in a wide range of tasks with relatively little effort.

Are there any applications/websites where this can be seen in action? It's increasingly hard to judge how good state-of-the-art really is from research papers.

canada_dry · on July 9, 2018

5 yrs ago (mostly for fun) I tried out the 'state-of-the-art' - at the time - NLTK sentiment analyzer to correlate stock market changes with a variety of news/info sources.

I put it on the shelf because the sentiment analysis just wasn't up to snuf (i.e. the bias differentiation was too weak).

Might be time to try again!

bravura · on July 9, 2018

Correct me if I'm wrong, but I don't think NLTK has ever had state of the art anything.

dang · on July 9, 2018

Ok, we've replaced "has arrived" with a sub-heading above.

acganesh · on July 9, 2018

Pretrained models have enabled so much in CV, excited to see similar shifts in the language world.

A great supplement is Sebastian’s NLP progress repo: https://github.com/sebastianruder/NLP-progress

neuromantik8086 · on July 9, 2018

Not to be too obtuse, but isn't WordNet (you know, the project that inspired the creation of ImageNet) "an ImageNet for language"? It seems kind of weird to bring up ImageNet within the context of NLP and not mention WordNet once.

sebastianruder · on July 9, 2018

WordNet (as you probably know) is a database that groups English words into a set of synonyms. If you consider WordNet as a clustering of high-level classes, then you could argue that ImageNet is the "WordNet for vision", meaning the clustering of object classes. The article uses a different meaning of ImageNet, namely ImageNet as pretraining task that can be used to learn representations that will likely be beneficial for many other tasks in the problem space. In this sense, you could use WordNet as an "ImageNet for language" e.g. by learning word representations based on the WordNet definitions. This is something people have done, but there are a lot more effective approaches. I hope this helped and was not too convoluted.

neuromantik8086 · on July 9, 2018

Does WordNet know that the word "ImageNet" refers to both a database and a pretraining task? :)

wodenokoto · on July 9, 2018

No, it does not know that, or anything else about "ImageNet"

http://wordnetweb.princeton.edu/perl/webwn?c=8&sub=Change&o2...

claytonjy · on July 9, 2018

I don't think WordNet has been much of a thing in NLP, especially nothing like ImageNet has been in CV. WordNet is only simple word-to-word relationships. "NLP" tends to denote more syntactical, phrase- or sentence-level text analysis; bag-of-word tools like WordNet or TF-IDF are not often considered "true" NLP, but might be called text mining instead.

wodenokoto · on July 9, 2018

The article seems to confuse the ImageNet database with the deep models trained on ImageNet, such as AlexNet and LeNet.

jph00 · on July 9, 2018

The phrase "Imagenet moment" is generally used to refer to the success of deep learning in the ILSVRC 2012 competition, which used the Imagenet dataset. This is the case in this article.

neuromantik8086 · on July 10, 2018

If a phrase is generally used by specialists it's not generally used.

andreyk · on July 9, 2018

TLDR the standard practice of using 'word vectors' (numeric vector representation of words) may soon be superceded by just using entire pretrained neural nets as is standard in CV, and we have both conceptual and empirical reasons to believe language modeling is how it'll happen.

Helped edit this piece, think it is spot on - exciting times for NLP.

JPKab · on July 9, 2018

Definitely excited by this, but wish the article was a bit more detailed.

jph00 · on July 9, 2018

The ULMFiT, ELMO, and OpenAI Transformer papers are all quite readable and linked from the article. Sebastian and I also wrote an introduction to ULMFiT here: http://nlp.fast.ai/classification/2018/05/15/introducting-ul...

JPKab · on July 9, 2018

Thanks!

YeGoblynQueenne · on July 9, 2018

>> In order to predict the most probable next word in a sentence, a model is required not only to be able to express syntax (the grammatical form of the predicted word must match its modifier or verb) but also model semantics. Even more, the most accurate models must incorporate what could be considered world knowledge or common sense.

So, the first sentence in this passage is a huge assumption. For a model to predict the next token (word or character) in a string, all it has to do is to be able to predict the next token in a string. In other words, it needs to model structure. Modelling semantics is not required.

Indeed, there exist a wide variety of models that can, indeed, predict the most likely next token in a string. The simplest of those are n-gram models, that can do this task reasonably well. Maybe what that first sentence above is trying to say is that to predict the next token with good accuracy, modelling of semantics is required, but that is still a great, big, huge leap of reasoning. Again- structure is probably sufficient. A very accurate model modelling structure, is still only modelling structure.

It's important to consider what we mean when we're talking about modelling language probaiblistically. When humans generate (or recognise) speech, we don't do that stochastically, by choosing the most likely utterance from a distribution. Instead, we -very deterministically- say what we want to say.

Unfortunately, it is impossible to observe "what we want to say" (i.e. our motivation for emitting an utterance). We are left with observing -and modelling- only what we actually say. The result is models that can capture the structure of utterances, but are completely incapable of generating new language that makes any sense - i.e. gibberish.

It is also worth considering how semantic modelling tasks are evaluated (e.g. machine translation). Basically, a source string is matched to an arbitrary target string meant to capture the source string's intended meaning. "Arbitrary" because there may be an infinite number of strings that carry the same meaning. So what, exactly, are we measuring when we evaluate a model's ability to map between to of those infinite strings chosen just because we like them best?

Language inference and comprehension benchmarks like the ones noted in the article are particularly egregious in this regard. They are basically classification tasks, where a mapping must be found between a passage and a multiple-choice spread of "correct" labels, meant to represent its meaning. It's very hard to see how a model that does well in this sort of task is "incorporating world knowledge" let alone "common sense"!

Maybe NLP will have its ImageNet moment- but that will only be in terms of benchmarks. Don't expect to see machines understanding language and holding reasonable conversations any time soon.

DoctorOetker · on July 11, 2018

I fully agree and while you probably word it much better than me, I made a somewhat similar argument at https://news.ycombinator.com/item?id=16961233 if you are interested...