I think you're missing the point. The jimmies are getting rustled because someone presented false performance claims to make his own argument look better. That's something anyone should be against.
the writer makes an unconvincing claim that the original post was wrong. the data presented shows only that if you try really hard and get lucky enough, you can probably do as well as a simple regression in this case.
the author himself admits that deep learning is probably misapplied here, and that training on such a small dataset is difficult at best. which brings us back to the important question (i.e. the point made by the original post): why would you ever do this?
if you try really hard and get lucky enough, you can probably do as well as a simple regression in this case.
Maybe you aren't familiar with deep learning, but this isn't "trying really hard." This is basic stuff that anyone using deep learning would know.
And deep learning doesn't just "do as well" as the simpler model. It does meaningfully better at all sample sizes.
Define "meaningfully better." Perhaps you mean statistically significantly better? It may have better accuracy, but it has significantly less interpretability. What does it capture that regression couldn't capture? At least with regression you can interpret the relationship between all of the variables and their relative importance by looking at the coefficients of the regression. With deep learning, the best approaches for explanation are to train another model at the same time that you use for explanation. Additionally, it was proven that a perceptron can learn any function, so in some senses the "deep" part of deep learning is because people are being lazy because at least you could get a better interpretation of the perceptron. I don't mean to imply that there's not a place for deep learning, but I think this isn't a great refutation of the argument that fitting a deep model is somewhat inappropriate for a small dataset.
Not sure what kind of argument that is. If something overfits, it will show less error; does that make it better? It may generalize a lot worse when run on more data. Whether or not something is meaningful depends on what you take the meaning to be.
It doesn't matter that it's on the holdout; he's partitioning an already small dataset into 5 folds and talking about the accuracy of using 80 points to predict 20 points. The whole argument usually leans on the law of large numbers: with enough samples, a difference in accuracy can be statistically significant. When you're predicting 20 points each with 5 (potentially different) models, you likely don't have enough to talk about statistical significance.
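To put rough numbers on that, a quick back-of-the-envelope calculation (my own sketch, assuming independent test points and 90% accuracy purely for illustration):

    import math

    # Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
    p = 0.9
    for n in (20, 1000):
        half_width = 1.96 * math.sqrt(p * (1 - p) / n)
        print(f"n = {n:4d}: accuracy {p:.2f} +/- {half_width:.3f} (95% CI)")
    # n =   20: +/- ~0.13 -> a few points of accuracy difference is noise
    # n = 1000: +/- ~0.019

With 20-point folds, the confidence interval swamps the kind of accuracy gap being claimed.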
We tried to mirror the original analysis as closely as possible: we did 5-fold cross-validation but used the standard MNIST test set for evaluation (about 2,000 validation samples for 0s and 1s). We split that test set into two pieces. The first half was used to assess convergence of the training procedure, while the second half was used to measure out-of-sample predictive accuracy.
Predictive accuracy is measured on 1000 samples, not 20.
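Roughly, the setup looks like this (a simplified sketch for illustration, not our actual code; the logistic regression is a stand-in for whichever model is being trained):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def evaluate(X_small, y_small, X_test, y_test, seed=0):
        """5-fold CV on the small training set; the MNIST 0/1 test set is
        split in half: one half to monitor convergence, the other held out
        purely for the reported out-of-sample accuracy."""
        order = np.random.RandomState(seed).permutation(len(X_test))
        monitor_idx = order[: len(order) // 2]  # would drive early stopping for a deep model (unused by this stand-in)
        holdout_idx = order[len(order) // 2:]   # reserved for the reported accuracy

        scores = []
        for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=seed).split(X_small):
            model = LogisticRegression(max_iter=1000)  # stand-in model
            model.fit(X_small[train_idx], y_small[train_idx])
            scores.append(accuracy_score(y_test[holdout_idx],
                                         model.predict(X_test[holdout_idx])))
        return float(np.mean(scores)), float(np.std(scores))

The point is that the accuracy numbers come from the roughly 1,000-sample holdout half, not from the 20-point CV folds.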
Honest question: who cares about interpretability if you're optimizing for predictive power?
Also, DL can be interpretable in different domains, much like any non-linear classifier (are you hating on random forests for the same reason?). It just takes more work than looking at linear coefficients.
This is an area that fades in and out of focus, with venues such as the Workshop on Human Interpretability in Machine Learning (WHI) [1]. It's becoming increasingly important for auditability and for understanding what an algorithm has actually learned: keeping classifiers from learning to discriminate based on age, race, etc. [2], or working in domains like medicine where it's important to know what the algorithm is doing. And the work on understanding DL doesn't really make it interpretable in any domain; typically you train another (simpler, less accurate) model and use that to explain what the network is doing, or use perturbation analysis to try to tease out what it is learning (see the sketch at the end of this comment). If all you care about is getting the right answer, and not why you get that answer, maybe it doesn't matter.
I wouldn't say I'm hating on DL, nor do I hate random forests, ensembles, etc., but when you have very little data, fitting an uninterpretable, high-dimensional model might not be the right answer, in my opinion; see [3].
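To make the surrogate-model point concrete, a minimal sketch (my own illustration, not from any of the posts; the random forest is just a stand-in black box):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier   # stand-in black box
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Made-up tabular data, purely for illustration.
    rng = np.random.RandomState(0)
    X = rng.randn(500, 5)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Global surrogate: fit a shallow, interpretable tree to the black box's
    # *predictions*, then read the explanation off the surrogate instead.
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X, black_box.predict(X))

    print(export_text(surrogate, feature_names=[f"x{i}" for i in range(5)]))
    # Fidelity: how well the surrogate mimics the black box (not ground truth).
    print("fidelity:", surrogate.score(X, black_box.predict(X)))

Note that the explanation is only as good as the surrogate's fidelity, which is exactly the complaint: you're explaining a second, simpler model, not the network itself.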
maybe you aren't familiar with reading graphs, but no, it really doesn't. one graph with mostly overlapping error bars does not inspire great confidence.
also, it isn't at all clear to me that the cross-validation method employed is sufficient to rule out overtraining. nor is it clear that the differences claimed are replicable or generally applicable enough to make a counter argument to the original post.