You can probably use deep learning even if you don't have a lot of data (beamandrew.github.io)
417 points by deepnotderp on June 5, 2017 | 78 comments



This post doesn't even mention the easiest way to use deep learning without a lot of data: download a pretrained model and fine-tune the last few layers on your small dataset. In many domains (like image classification, the task in this blog post) fine-tuning works extremely well, because the pretrained model has learned generic features in its early layers that are useful for many datasets, not just the one it was trained on.
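
As a rough illustration of "freeze the early layers, train only a new final layer", here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: `frozen_features` is a hypothetical stand-in for the pretrained layers (it is not a real pretrained network), and the eight-point dataset is invented. Only the final logistic layer is updated during training.

```python
import math

def frozen_features(x):
    """Hypothetical stand-in for pretrained layers: fixed, never updated."""
    x0, x1 = x
    return [x0, x1, x0 * x1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, steps=500, lr=0.5):
    """Fit only the final logistic layer (weights + bias) by gradient descent."""
    feats = [frozen_features(x) for x, _ in data]
    labels = [y for _, y in data]
    w, b = [0.0] * len(feats[0]), 0.0
    n = len(data)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(feats, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i, fi in enumerate(f):
                grad_w[i] += err * fi / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    f = frozen_features(x)
    return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

# Invented small dataset: label is 1 when both coordinates share a sign.
small_dataset = [((1, 1), 1), ((-1, -1), 1), ((1, -1), 0), ((-1, 1), 0),
                 ((2, 1), 1), ((-2, -1), 1), ((2, -1), 0), ((-1, 2), 0)]
w, b = train_head(small_dataset)
print([predict(w, b, x) for x, _ in small_dataset])
```

The raw inputs are not linearly separable, but the fixed features are, which is the reason a tiny trainable head is enough here. With a real pretrained network the same split applies: the feature extractor stays fixed and only the small head sees your data.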

Even the best skin cancer classifier [1] was pretrained on ImageNet.

[1]: http://www.nature.com/articles/nature21056


This is how the great fast.ai course begins - download VGG16, finetune the top layer with a single dense layer, get amazing results. The second or third class shows how to make the top layers a bit more complex to get even better accuracy.


Saving someone a Google.

http://course.fast.ai


Can't recommend the course highly enough!


I'm skimming through the content and it seems really great! I'm interested in the last lesson (7-Exotic CNN Arch), but I'm afraid of missing other cool stuff in past lessons.

What do you suggest for someone who has experience with Deep Learning?

EDIT: found this wiki with the course notes: http://wiki.fast.ai/index.php/Main_Page

One can use it as a guide to avoid missing anything.


I think one of the strengths of the course is that Jeremy shows parts of the process of working on a ML problem. If you have time, I recommend watching earlier lessons, even if you know the theoretical aspects of the content covered.


Same!! On lesson 2 and already feel like I know so much! The top down approach to learning is great! Will actually read the book "Making Learning Whole" that inspired them to follow this approach.


The 30 minute overview has me excited about going through it. I really like how they have structured it. Thanks to all for recommending it!


Totally agree.

Similarly for word embeddings like word2vec, GloVe, fastText, etc. in the case of NLP.

I think this is fundamental - if you teach a human how to recognize street signs, you don't need to show them millions of examples - just one or a few of each is enough because we build on reference experiences of past objects seen through life experience to encode the new images as memories.


Would you care to elaborate?

My main doubt comes from the fact that language meaning may vary between different contexts, but I am no expert and am earnestly curious about using NLP and ML with not-that-big data.


Vector representations are useful for many natural language processing tasks. In word embeddings like word2vec, GloVe, and fastText, the algorithm learns for each word an associated vector of dimension n: (x1, x2, ..., xn). A word may be close to another in one dimension (or subspace) but far away in another. Moreover, a good representation allows meaningful vector-space arithmetic: King - Man + Woman ≈ Queen. Word representations are typically trained on very large unlabeled datasets, but once the algorithm has learned the features you can use them for your small dataset. EDIT: Added more explanation.
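
To make the vector arithmetic concrete, here is a minimal sketch with hand-written 3-dimensional toy vectors. The numbers are invented for illustration and do not come from word2vec, GloVe, or fastText; real embeddings have hundreds of dimensions and are learned from large corpora.

```python
import math

# Invented toy vectors; the axes loosely mean (royalty, male, female).
toy_vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.2],
    "apple":  [0.1, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # prints "queen"
```

Excluding the three input words from the candidates follows the usual convention for this kind of analogy query.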


This blog post[1] is a great example of positive unintended consequences of deep learning:

> We were very surprised that our model learned an interpretable feature, and that simply predicting the next character in Amazon reviews resulted in discovering the concept of sentiment.

[1]: https://blog.openai.com/unsupervised-sentiment-neuron/


It is mentioned at the end of the post:

> You don’t need Google-scale data to use deep learning. Using all of the above means that even your average person with only a 100-1000 samples can see some benefit from deep learning. With all of these techniques you can mitigate the variance issue, while still benefitting from the flexibility. You can even build on others work through things like transfer learning.


To be fair to the original article, his assertion was more along the lines of you can't train a deep net without lots of data. As the second article shows, that isn't true in the general case. However, it is certainly true for creating any of the interesting models you think of when you think of deep nets (e.g., Inception, word2vec, etc.). You just can't get the richness of these without a lot of data to train them.


Deep learning can do pretty well even when it's not pre-trained.

I have this data set that is word counts for top 5k words, 5000 observations training, 5000 hold out. I consider this data pretty small.

An SVM with an RBF kernel can get around 87-88% accuracy, but a histogram kernel can get around 89.7% accuracy with a little feature engineering.

TensorFlow, after tuning some parameters, can also get around 89.7% accuracy.


Transfer learning is efficient (minimal training time) and useful for most classification tasks across various domains. Some of the first models I've used were built on Inception/ImageNet and I recall being thoroughly impressed by the performance.


This only works, though, if the pretraining data and your training data are from the same domain. Even then you will have issues if the data is from the same domain but differs in representation, e.g. 2D vs. 3D image datasets.


What would you consider the best resource for learning how to do this in Python? I have a smaller set of image data whose components I'd like to identify (e.g. 'house', 'car', etc.).


Watch the very first 1 or 2 videos from http://fast.ai . They show how to do this in about seven lines of Python and five minutes' worth of model training.


Does transfer learning apply to seq2seq as well?


You probably can... but is that really the issue?

I think the problem isn't that you can't solve problems with small amounts of data; it's that you can't solve 'the problem' at a small scale and then just apply that solution at large scale... and that's not what people want or expect.

People expect that if you have an industrial welder that can assemble aeroplanes (apparently), then you should easily be able to check it out by welding a few sheets of metal together, and if it welds well on a small scale, it should be representative of how well it welds entire vehicles.

...but that's not how DNN models work. Each solution is a specific selection of hyperparameters for the specific data and specific shape of that data. As we see here, specific even to the volume of data available.

It doesn't scale up and it doesn't scale down.

To solve a problem you just have to sort of.... just mess around with different solutions until you get a good one. ...and even then, you've got no really strong proof your solution is good; just that it's better than the other solutions you've tried.

That's the problem; it's really hard to know when DNNs are the wrong choice, vs. you're just 'doing it wrong'.


Andrew Beam's post offered very persuasive evidence that Jeff Leek's intuition (deep learning yields poor performance with small sample sizes) is incorrect. The error bars and the consistent trend of higher accuracy with a properly implemented deep learning model, particularly at smaller sample sizes, are devastating to Leek's original post.

I think this is a fantastic example of the speed and self-correcting nature of science in the internet-age.

As an aside, @simplystats blocked me on Twitter, which I assume is in response to this tweet: https://twitter.com/ErickRScott/status/871586233599893505 and it seems that I'm likely not the only one blocked: https://twitter.com/jtleek/status/871693250947624961

What's most concerning about @simplystats' blocking activity is the chilling effect it has on discourse between differing perspectives. I've tried to come up with a rationale for why highlighting the most recent evidence in reply to someone who sympathized with Leek's original post (btw, @thomasp85 liked the tweet) is grounds for blocking, but I can't come up with a reasonable one.

Further aside, is irq11 Rafael Irizarry?

Update: after emailing the members of @simplystats they have removed the block on my account and offered a reasonable explanation. SimplyStats is a force for good in the world (https://simplystatistics.org/courses/) and I look forward to their future contributions.


It's an interesting conversation but really weakened by failing to take on the generalization problem head on. This is something I see in a lot of discussions about deep nets on smaller data sets, whether transfer or not. The answer "it's built in" is particularly unsatisfying.

The plots shown certainly should raise the spectre of overtraining - and rather than handwaving about techniques to avoid it, it would be great to see a detailed discussion of how you convince yourself (i.e. with additional data) that you are reasonably generalizable. Deep learning techniques are no panacea here.


People keep saying "a lot" without considering that it is a relative term. For images, "a lot" means enough to reach x percent accuracy; for OCR of a single font, "a lot" means 26 letters plus special characters and digits. Stop saying "a lot" blindly as if everyone understands it the same way.


...but why would you?

The fact that there are people "getting their jimmies up" on questions of training massively parameterized statistical models on tiny amounts of data should tell you exactly where we are on the deep-learning hype cycle. For a while there, SVMs were the thing, but now the True Faithful have moved on to neural networks.

The argument this writer is making is essentially: "yes, there are lots of free parameters to train, and that means that using it with small data is a bad idea in general, but neural networks have overfitting tools now and they're flexible so you should use them with small data anyway." This is literally the story told by the bulleted points.

Neural networks are a tool. Don't use the tool if it isn't appropriate to your work. Maybe you can find a way to hammer a nail with a blowtorch, but it's still a bad idea.


I think you're missing the point. The jimmies are getting rustled up because someone provides false information about the performance to make his own argument seem better. This is something anyone should be against.


but they didn't.

the writer makes an unconvincing claim that the original post was wrong. the data presented shows only that if you try really hard and get lucky enough, you can probably do as well as a simple regression in this case.

the author himself admits that deep learning is probably misapplied here, and that training with such small data is difficult, at best. which again brings us back to the important question (i.e. the point being made by the original post): why would you ever do this?


> if you try really hard and get lucky enough, you can probably do as well as a simple regression in this case.

Maybe you aren't familiar with deep learning: but this isn't "trying really hard." This is doing basic stuff that anyone using deep learning probably knows.

And deep learning doesn't just "do as well" as the simpler model. It does meaningfully better at all sample sizes.


Define "meaningfully better." Perhaps you mean statistically significantly better? It may have better accuracy, but it has significantly less interpretability. What does it capture that regression couldn't capture? At least with regression you can interpret the relationship between all of the variables and their relative importance by looking at the coefficients of the regression. With deep learning, the best approaches for explanation are to train another model at the same time that you use for explanation. Additionally, it was proven that a network with a single hidden layer can approximate any function, so in some sense the "deep" part of deep learning exists because people are being lazy, since at least you could get a better interpretation of the shallow network. I don't mean to imply that there's not a place for deep learning, but I think this isn't a great refutation of the argument that fitting a deep model is somewhat inappropriate for a small dataset.


The model we are comparing against makes 10X as many errors.

I hadn't imagined someone would argue that's not a meaningful difference.

Though the difference is statistically significant too.


Not sure what kind of argument that is. If something overfits it will have less error, does that make it better? It may mean it would generalize a lot less when run on more data. Whether or not something is meaningful depends on what you take the meaning to be.


Not the OP, but I wanted to point out it has 10X less error on the holdout sample so it is not simply overfitting.


It doesn't matter that it's on the holdout, he's partitioning an already small dataset into 5 partitions and talking about the accuracy in using 80 points to predict 20 points. The whole argument is usually that in the law of large numbers you can now have a statistically significant difference in accuracy. When you're predicting 20 points each with 5 (potentially different) models you likely don't have enough to talk about statistical significance.
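
For reference, the 80/20 arithmetic above corresponds to a 5-fold split of 100 samples. A minimal sketch follows; the function name `k_fold_indices` is made up for illustration, and in practice the indices would normally be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(100, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))  # 80 20 for each of the 5 folds
```

Each sample lands in exactly one test fold, so every fold's model is scored on 20 points it never saw during training.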


We tried to mirror the original analysis as closely as possible - we did 5-fold cross validation but used the standard MNIST test set for evaluation (about 2,000 validation samples for 0s and 1s). We split the test set into 2 pieces. The first half was used to assess convergence of the training procedure while the second half was used to measure out of sample predictive accuracy.

Predictive accuracy is measured on 1000 samples, not 20.
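
To put the 20-vs-1000 question in numbers, here is a back-of-the-envelope sketch of the binomial standard error of an accuracy estimate. The 0.90 accuracy is an assumed value for illustration only, not a figure from either post, and the normal approximation is rough at n = 20.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples test points."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

assumed_accuracy = 0.90  # illustrative assumption
for n in (20, 1000):
    half_width = 1.96 * accuracy_standard_error(assumed_accuracy, n)
    print(f"n={n}: {assumed_accuracy:.2f} +/- {half_width:.3f}")
```

Under these assumptions the approximate 95% interval is about ±0.13 with 20 test points and about ±0.02 with 1000, which is why the size of the evaluation set matters so much to this disagreement.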


Honest question.. Who cares about interpretability if you're optimizing for predictive power?

Also, DL can be interpretable in different domains, much like any non-linear classifier (are you hating on random forests too for the same reason?) It just takes more work vs. looking at linear coefficients.


This is an area that fades in and out of focus with such venues as the Workshop on Human Interpretability in Machine Learning (WHI) [1]. It's becoming increasingly important for auditability and for understanding what an algorithm has actually learned: for example, preventing classifiers from learning to discriminate based on age, race, etc. [2], or in domains such as medicine where it's important to know what the algorithm is doing. DL is not really interpretable in any domain; typically people train another (simpler, less accurate) model and use that to explain what the model is doing, or use perturbation analysis to try to tease out what it is learning. If all you care about is getting the right answer and not why you get that answer, maybe it doesn't matter.

I wouldn't say I'm hating on DL nor that I hate on random forests, or ensembles, etc., but when you have very little data fitting an uninterpretable, high dimensional model might not be the right answer, in my opinion, see [3].

[1] https://arxiv.org/html/1607.02531v2
[2] https://arxiv.org/abs/1606.08813
[3] https://arxiv.org/abs/1601.04650


"it does meaningfully better at all sample sizes"

maybe you aren't familiar with reading graphs, but no, it really doesn't. one graph with mostly overlapping error bars does not inspire great confidence.

also, it isn't at all clear to me that the cross-validation method employed is sufficient to rule out overtraining. nor is it clear that the differences claimed are replicable or generally applicable enough to make a counter argument to the original post.


It might be that people want to learn to use deep learning, but they find MNIST and other sets boring and want to learn on a problem they find interesting, while still producing something that seems to work.

Plus, learning to learn how to learn on less can only help the field of learning. That's the goal of one-shot learning, right?


Obviously, you do whatever you like when you're playing around, but when you find yourself in that situation (i.e. "I want to learn tool X, but all of my problems are inappropriate for X"), it's an indication that you're misusing/misunderstanding X.

Again, with the forced metaphors: if I buy a new welder, I'm probably suddenly very keen on welding things. That doesn't mean I'm learning how to weld. A big (the biggest?) part of learning a tool is learning when to use it.


In my very limited experience, I hear about cool technology X, and have crazy fantasies about how it actually works. Picking up X and applying it to a problem, no matter how poorly informed, has always taught me more about X than anything else.

When you get your welder, you kinda have to weld a lot of things to see when it's effective and when it's not. Everybody gets a free pass with the first few months with a new toy. The only way to learn when it's appropriate is to screw up a few times.


Here is an approach that worked for us well in our startup, that may help other teams who read this thread. We learned it through making a few mistakes in our decision making process.

Whenever we have a new feature that cannot be implemented using existing frameworks, tools, in-house technology, or existing expertise, we ask the managers to add an extra two weeks to our schedule so we can evaluate as many options as possible. It is really hard to fit a lot of tools into that time, so the whole team picks up the work, even those who have no past experience or theoretical knowledge of the topic. It actually helps to have them in the research group. They are often the ones who can say "since I have no expertise in this, I instead searched and found this company, which apparently ditched this tool because they suffered from this and that". Others, who try to acquire the theory, are able to argue along the lines of "X seems to be better than Y". Once we have enough Xs, we already have use cases in which some X has proven to be useless. After that, usually only one X remains, or even none. We either pick up that remaining X or an X that is less scary but a little more boring.

Boring is good because a team can argue about a boring thing more easily. Those arguments produce quality code that remains in use for more than a year. A year or two is enough to allocate more research time to the topic, which eventually helps us find or implement a tool-set that can live much longer.

Here are some examples we used for more than a year and ditched or soon will ditch:

Stock Tesseract Server Side OCR -> Properly Trained Server Side Tesseract + Image preprocessing on the mobile device

Rethinkdb change feeds -> Postgres LISTEN

Bluebird.js -> ReactiveX

Ubuntu -> CentOS

Forever -> Docker

Edit: The reason why I am posting under your comment is that in some cases, mistakes can hurt a company in a way that is unrecoverable. My advice may not apply to pet projects.


Yes, sure. Playing with something is a good way to learn that thing.

But if you find yourself saying "I really like using this backhoe, but I'm finding that most of my hole-digging problems are too small for a backhoe. Is there a blog post on using backhoes to make sandcastles?", you have perhaps wandered off the path to enlightenment.

I'm really enjoying these metaphors though.



Small question here, what are SVNs in this context?


support vector machines. a non-linear classifier that was the hot thing in the AI world before neural networks became practical.


They're great because they immediately go to the best solution. One epoch is all you need, no matter the size of the SVM.

They suck because they're limited to one layer.

But there's a good case to be made that machine learning introductions should always be done like this : linear classifier (when that works) -> SVM (when that works) -> NN -> Deep learning.


I contend that all real work should be done this way. And don't forget things like bayes' nets and random forests, which get no love.

in other words, KISS.


>___< I only know Bayes Nets and Random Forest.

I love tree-based algorithms; they are so good in many contexts compared to neural networks. Try using neural networks in the medical field, where there is very little data since it's so costly to do R&D on humans.

Also, with a Bayesian network you can at least explain things. A neural network is a magic black box.

---

edit: also regression, but meh, I think if you know random forest then you should know regression. Otherwise you don't really know random forest.


> I love tree-based algorithms; they are so good in many contexts

random forest is delightful in that the algorithm has very few parameters, the default values of parameters are generally okay, and it generally tends to do something reasonable when you throw it at "real world data" with missing data / categorical variables / useless noise features in the input / etc.

Provided a single decision tree does not overfit, then a random forest won't overfit either.


You missed one of the most important advantages of an ensemble tree model: since each tree is grown independently of the others, you can have full parallelization during training.


To go even simpler, KNN also works really well if you can get a good weighting for your inputs and measurement noise is low. And KNN works with online/soft-RT datasets as well, where constant learning is required.


I have done a lot of work with classification algorithms, and KNN doesn't get nearly the love it deserves. It rarely turns in exceptional performance, but when used with Mahalanobis distance it is extremely robust.
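
A minimal sketch of KNN with Mahalanobis distance, restricted to two features so the covariance inverse can be written in closed form. The training points and labels are invented: the second feature is deliberately on a much larger scale than the first, which is the situation where plain Euclidean distance gets dominated by one feature.

```python
from collections import Counter

def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def euclidean_sq(a, b, cov=None):
    """Squared Euclidean distance; cov is accepted but ignored."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def knn_predict(train, query, k=3, distance=mahalanobis_sq):
    """Majority vote among the k training points nearest to query."""
    cov = covariance_2d([x for x, _ in train])
    nearest = sorted(train, key=lambda item: distance(item[0], query, cov))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Invented data: feature 0 separates the classes, feature 1 is large-scale noise.
train = [((1.0, 1000.0), "a"), ((1.2, 5000.0), "a"), ((0.9, 9000.0), "a"),
         ((3.0, 1200.0), "b"), ((3.2, 5200.0), "b"), ((2.9, 8800.0), "b")]
query = (1.1, 8900.0)
print(knn_predict(train, query))                         # Mahalanobis vote
print(knn_predict(train, query, distance=euclidean_sq))  # Euclidean vote
```

On this toy data the Mahalanobis vote is "a" while the Euclidean vote is "b", because rescaling by the covariance stops the large-valued second feature from dominating the distance.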


> But there's a good case to be made that machine learning introductions should always be done like this : linear classifier (when that works) -> SVM (when that works) -> NN -> Deep learning.

Do you know of an ML introduction course/book/site that follows this order?


You can get Linear -> SVM -> Neural Networks on Andrew Ng's Machine Learning course at Coursera CS229A. You could then go advanced with CS229 from Stanford "Academic Earth".


I feel like people are leaving out gradient boosting. For me, GBMs usually perform better and train faster than SVMs.


s/SVN/SVM


oops, typo.


Of course you can use it, but does it perform better than "shallow" methods such as Gaussian processes, SVMs, and multivariate linear regression? Is there either theoretical or empirical evidence?


The original post used a linear regression (and apparently misimplemented an intended logistic regression); this post sees better results at all sample sizes with a proper deep learning approach.


Maybe what you should do is deep learn some data and then do some deep learning with your deep learned [deep] data.

Deep.


aka Wisdom of Crowds


Also, not to mention transfer learning is a big one.

Case in point: the Silicon Valley "not hotdog" classifier, which they stopped at hotdog-or-not due to lack of training data, when in reality they could've just used a net pretrained on ImageNet. Lol, I was literally cringing through that episode so hard xD


The developer behind the real-world app mentioned why they did not use pretraining, and why you can't always use pretraining: https://news.ycombinator.com/item?id=14347513

> We ended up with a custom architecture trained from scratch due to runtime constraints more so than accuracy reasons (the inference runs on phones, so we have to be efficient with CPU + memory), but that model also ended up being the most accurate model we could build in the time we had. (With more time/resources I have no doubt I could have achieved better accuracy with a heavier model!)


You could still have pretrained the custom architecture.


This debate is meaningless for several reasons:

1. The original argument is a strawman. What do they mean by "data"? Is it survey results, microarrays, "Natural" images, "Natural" language text or readings from an audio sensor? No ML researcher would argue that applying complex models such as CNNs is useful for say survey data. But if the data is domain specific, such as Natural Language text, images taken in particular context, etc. using a model and parameters that exhibit good performance is a good starting point.

2. Unlike how statisticians view data (as say a matrix of measurements or "Data Frame"), machine learning researchers view data at a higher level of representation. E.g. An image is not merely a matrix but rather an object that can be augmented by horizontally flipping, changing contrast etc. In case of text you can render characters using different fonts, colors etc.

3. Finally, the example used in the initial blog post, predicting 1 vs 0 from images, is itself incorrect. Sure, a statistician would "train" a linear model to predict 1 vs 0; however, I as an ML researcher would NOT train any model at all and would just use [1], which has state-of-the-art performance in character recognition in widely varying conditions. When you have only 80 images, why risk assuming that they are sampled in an IID manner from the population? Why not simply use a model that's trained on a far larger population?
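
The augmentation idea in point 2 can be sketched in a few lines of plain Python. This assumes grayscale images stored as nested lists of 0-255 integers; the function names and the contrast factors are illustrative choices, not anything prescribed by the posts.

```python
def flip_horizontal(image):
    """Mirror each row of the image left to right."""
    return [list(reversed(row)) for row in image]

def adjust_contrast(image, factor, midpoint=128):
    """Scale pixel distance from the midpoint, clamping to the 0-255 range."""
    return [[max(0, min(255, round(midpoint + factor * (p - midpoint))))
             for p in row] for row in image]

def augment(image):
    """Return the original plus three transformed copies of one image."""
    return [image,
            flip_horizontal(image),
            adjust_contrast(image, 1.5),
            adjust_contrast(image, 0.5)]

tiny_image = [[0, 100, 255],
              [128, 200, 50]]
for variant in augment(tiny_image):
    print(variant)
```

Whether a transform keeps the label intact depends on the task: a horizontal flip is harmless for many natural images but can change the meaning of some characters, so the set of transforms has to be chosen per problem.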

Now the final argument might look suspicious, but it's crucial to understanding the difference between AI/ML/CV and Statistics. In AI/ML/CV the assumption is that there are higher-level problems (character recognition, object recognition, scene understanding, audio recognition) whose solutions can be applied in a wide variety of situations where they appear. Thus, when you encounter a problem like digit recognition, the answer an ML researcher would give is to use a state-of-the-art model.

[1] https://github.com/bgshih/crnn


Are you using some accepted definition of Statistics that I'm unaware of? Because I always thought Machine Learning was a branch of Statistics.


Machine learning is absolutely a branch of statistics.


If you were to simply look at the overwhelming evidence (rather than resorting to use of vague words "absolutely" etc.) e.g. department affiliations of ML researchers (presenting at ICML/NIPS) and practitioners (at Google/FB/MSFT/etc) you would find that ML is firmly a branch of CS.


Why are you trying to deduce the nature of ML by looking at org charts? You do realize ML predates all the titles you listed above? Org charts are nuanced things, often politically driven. You are looking at the wrong things if you want to truly learn instead of appeasing your personal biases.

Computer science largely deals with algorithmic space and time complexity. Look up the Wikipedia page on logistic regression and tell me if you see computer science there.

It is true that software giants are using and making huge breakthroughs in ML, specifically deep learning. That's largely because these companies invest heavily in bringing the top ML researchers onto their teams. Do you know the cost of DeepMind? Individual engineers on the team have had multimillion-dollar golden handcuff contracts, from what I heard.

Yes, software engineers can utilize the developed theory of ML and apply it to applications. For example, look at the new Google image search. You have a lot of interesting filter options now that require machine learning. Google speech recognition, for instance, has gotten so much better because Google moved from GMMs to DNNs for the decoding.

So just because Google and FB have interests in ML, doesn't mean that ML is a CS discipline. That's such faulty logic.


>> Machine Learning was a branch of Statistics

That's incorrect.

""" Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." """

-- Wikipedia


I don't personally know the state of the field in 1959 nor the accuracy of the Wikipedia page, but the phrase 'Machine Learning', as it is currently being used, is closer to statistics than it is to learning.


It's even closer to systems, optimization, etc. In fact I would argue that methods like backpropagation, dropout, batch normalization, LSTMs, and RNNs make it much closer to Optimization than to Statistics.


That's fair enough, although I'm still not really sure why statisticians have to think about data in such a different way to machine learning researchers. I feel that if a statistician didn't look at the bigger picture of what their data is actually about and whether there are existing techniques to tackle that problem, they'd make a pretty terrible statistician.


I don't know if the commenter's assertions are correct, but my anecdotal experience follows: a statistician asks me if adding a feature could make a model worse (R^2), and I am like, of course! He gets testy, and snidely chimes back 'I don't know why you think that could happen'. And I think 'I don't know why you don't think you can overfit data...'

then it hit me: statisticians' thinking revolves around running the model on the entire data set, whereas my thinking revolves around how it performs on test data


There is a good paper by Leo Breiman [1] which discusses the exact issue that you mentioned. Statisticians believe that there is a "True" model that generates the data, and that the errors observed are merely noise. ML, on the other hand, does not assume the existence of a "True" model; the assumption is that only the data is the source of truth, and any model that predicts the data with the lowest error is preferable, subject to performance on test/cross-validation data. This is a powerful approach and distinguishes ML as a separate field from statistics.

My favorite example is predicting real estate prices. A statistician will build a multi-level model that takes into account various effects (zip code level, city level, school district level, year/month of acquisition, etc.) and then build a regression model. The errors would then simply be noise. An ML approach would simply use weighted K-Nearest Neighbors, with geographic location as part of the distance metric. Sure, there are no effects to adjust, but the K-NN regression model can account for difficult-to-capture quirks of geography by representing a local non-linear decision surface.
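
A minimal sketch of the weighted K-NN idea for prices, assuming locations are given as (x, y) coordinates in kilometres and neighbours are weighted by inverse distance. The sales data, the function name `knn_price`, and the weighting scheme are all illustrative assumptions; a real model would add other features to the distance metric.

```python
import math

def knn_price(sales, query_location, k=3):
    """Inverse-distance-weighted mean price of the k nearest past sales.

    sales: list of ((x_km, y_km), price) pairs.
    """
    nearest = sorted((math.dist(loc, query_location), price)
                     for loc, price in sales)[:k]
    # Small epsilon avoids division by zero when the query matches a sale exactly.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * p for w, (_, p) in zip(weights, nearest)) / sum(weights)

# Invented sales: a cheaper cluster near the origin, a pricier one near (10, 10).
sales = [((0.0, 0.0), 300_000), ((0.0, 1.0), 320_000), ((1.0, 0.0), 310_000),
         ((10.0, 10.0), 900_000), ((10.0, 11.0), 950_000)]

print(round(knn_price(sales, (0.5, 0.5))))
print(round(knn_price(sales, (10.0, 10.5))))
```

Because the prediction depends only on nearby sales, the two clusters get very different estimates without any explicit neighbourhood effect being modelled.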

[1]: http://projecteuclid.org/euclid.ss/1009213726


There's a fairly simple difference that causes these holy wars. Machine learning likes to make distribution-free models. Statistics likes to study which distributions we can approximately fit to data, or falsify powerfully for data.


What do you mean by distribution-free? Because there are still all of the Bayesian forms of machine learning, and even within neural networks a lot of work is based around distributions (see generative NNs like the variational autoencoder).


Probabilistic machine learning is usually considered to be mostly statistics. The two classes aren't separated cleanly by a high-margin hyperplane, but we can clearly see the structure within the clustering ;-).



