Interesting post, but there are several errors, or at least suggestions that don't make sense to me:
> Doing so will result in very good evaluation scores and make the user happy but instead he/she will be building a useless model with very high overfitting.
Testing on your training data doesn't in-and-of-itself lead to overfitting, but it will hide overfitting if it does exist (and is a terrible practice for that reason).
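A minimal sketch of the "hiding" part, on made-up data: an overfit model can look perfect when scored on its own training data, and only the held-out score exposes it.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic, noisy data; an unpruned tree will memorize the training half.
    X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier().fit(X_tr, y_tr)
    print("train accuracy:", clf.score(X_tr, y_tr))     # close to 1.0, looks great
    print("held-out accuracy:", clf.score(X_te, y_te))  # noticeably lower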
> At this stage only models you should go for should be ensemble tree based models.
Not sure why this should be the case. Many ensemble models are very memory hungry & slow, Random Forests being a good example. They are flexible and have minimal assumptions, sure, but that doesn't mean you shouldn't try any other modeling technique, especially if you have domain knowledge.
> We cannot apply linear models to the above features since they are not normalized.
Not true at all. You won't be able to meaningfully compare coefficient magnitudes, but you can certainly apply linear models.
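A quick sketch of why, on synthetic data: an unregularized linear model fits unscaled features just fine; scaling mainly matters for comparing coefficients and for regularized or gradient-based variants.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=500, n_features=5, noise=1.0, random_state=0)
    X[:, 0] *= 1_000  # put one feature on a wildly different scale
    model = LinearRegression().fit(X, y)
    print(model.score(X, y))  # R^2 is unchanged by the rescaling; only that coefficient shrinks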
> These normalization methods work only on dense features and don’t give very good results if applied on sparse features.
Since normalization is done feature-by-feature, there is no reason why this should be true.
> generally PCA is used decompose the data
If your latent features add linearly, ok, but do they? Is it meaningful to have negative values for your features? If not, consider using something like sparse non-negative matrix factorization.
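For example, a minimal scikit-learn NMF sketch on made-up non-negative data (NMF also has L1 penalty options if you want the factors themselves sparse):

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.RandomState(0)
    X = np.abs(rng.randn(200, 50))  # non-negative, "count-like" data
    nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
    W = nmf.fit_transform(X)   # non-negative weights per sample
    H = nmf.components_        # non-negative parts; X is approximately W @ H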
Lastly, using this approach means you have a HUGE model & parameter space to search through. Because of this you will need a ton of data to get meaningful results.
He seems to be treating machine learning methods as a bag of tricks. That's ok so far as it goes, but in my experience it's much more valuable to try and understand your data, and the process that generates it, and then build a model that tries to reflect that data generation process.
> Testing on your training data doesn't in-and-of-itself lead to overfitting, but it will hide overfitting if it does exist (and is a terrible practice for that reason).
I read that as pointing out the danger of using training data for hyper-parameter search validation. In practice if you're aggressively tuning your model and feature choices using your training data for validation you're almost certain to have overfitting. I think it actually makes sense to have 3 data sets if you can afford it.
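If you can afford it, carving out the three sets is just two stratified splits; a sketch with made-up proportions:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, random_state=0)
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
    X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
    # Tune against (X_valid, y_valid); only touch (X_test, y_test) once at the end.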
> Not sure why this should be the case. Many ensemble models are very memory hungry & slow, Random Forests being a good example. They are flexible and have minimal assumptions, sure, but that doesn't mean you shouldn't try any other modeling technique, especially if you have domain knowledge.
I read this as "Try random forests first. If you get good enough results, you're done." He mentions 9 models later and says "In my opinion, and strictly my opinion, the above models will out-perform any others and we don’t need to evaluate any other models."
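i.e. roughly this kind of baseline before anything fancier (illustrative sketch on synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    baseline = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    print(cross_val_score(baseline, X, y, cv=5).mean())  # if this is already "good enough", stop here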
> Not true at all. You won't be able to meaningfully compare coefficient magnitudes, but you can certainly apply linear models.
The documentation for StandardScaler says "For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."
> Since normalization is done feature-by-feature, there is no reason why this should be true.
StandardScaler re-centers around the mean. That can turn a SparseMatrix into a NotSoSparseMatrix. In a framework that supports only in-memory matrices that's probably not what you want, even if it might work. He mentions how to deal with it.
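For reference, the usual scikit-learn ways of dealing with it: skip the centering so the matrix stays sparse, or use a scaler built for sparse input (a sketch on random sparse data):

    import scipy.sparse as sp
    from sklearn.preprocessing import MaxAbsScaler, StandardScaler

    X_sparse = sp.random(1000, 5000, density=0.001, format="csr", random_state=0)
    X_a = StandardScaler(with_mean=False).fit_transform(X_sparse)  # scales variance, no centering
    X_b = MaxAbsScaler().fit_transform(X_sparse)                   # scales by max absolute value
    print(type(X_a), type(X_b))  # both stay sparse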
Can't comment on PCA.
Overall, this seems like practical advice from someone with a lot of experience on Kaggle. Kaggle isn't real life so I'm sure there are better things to do for real models. It's still probably reasonable to take his advice in context instead of assuming it's wrong.
A lot of the criticisms here are about inefficiencies in the approach. This is true, but the approach here does generalise well (provided you have sufficient data).
It's worth noting that Abhishek is working on the "automatic data science" problem[1] and is viewing things through that lens.
AutoML is an interesting area - there have been workshops at ICML, and good progress is being made in the area.
This talk is very interesting: they do automatic model selection, pipeline stacking (pick the right sequence of pre-processing stages to place in the pipeline) and hyperparameter optimization. They use meta-learning to improve search speed.
You can overfit the data if you're tuning the hyperparameters outside the CV loop. Ideally, you'd want to have a two-layer CV loop, whether it's k-fold CV, LOOCV, bootstrapping, or whatever. Occasionally this might be computationally infeasible, but it's the most reliable way to do it.
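In scikit-learn terms, a minimal sketch of the two-layer idea: the inner loop picks hyperparameters, the outer loop scores the whole tuning procedure on folds it never saw.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)
    inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)  # inner loop: tuning
    print(cross_val_score(inner, X, y, cv=5).mean())                   # outer loop: honest estimate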
"..., but in my experience it's much more valuable to try and understand your data, and the process that generates it, and then build a model that tries to reflect that data generation process."
There are tasks where I feel happy to accept a black-box model: image recognition, speech transcription. In these tasks understanding the model is really important for further research, but it seems to me to be ok to say "look, it just works".
When it comes to data mining, or tasks like predicting customer responses to offers, I believe that arbitrary models with high predictive power are dangerous. This is because you can't know when changes in your domain (like a new competitor or a shift in demographics) are liable to derail them, and experience shows that this can lead to really bad decisions from something that people have come to trust.
Thank you for bringing all these up - whenever something is presented as a one-size-fits-all solution in machine learning (or anything else, for that matter), I tend to be wary.
This is a great post as far as it goes. While Abhishek mentions Keras for neural networks (and Keras is a great Python library), he doesn't really go into the cases where deep neural networks are more useful than other algorithms, and how that changes a data scientist's workflow.
DNNs are really well suited to unstructured data, which isn't the kind he highlights. One reason for that is because they automatically extract features from data using optimization algorithms like stochastic gradient descent with backpropagation. What that means is they bypass the arduous process of feature engineering. They help you get around that chokepoint, so that you can deal with unstructured blobs of pixels or blobs of text.
Because unstructured data is most of the data in the world, and because DNNs excel at modeling it, they have proven to be some of the most useful and accurate algorithms we have for many problems.
Here's an overview of DNNs that goes into a bit more depth:
I think the post represents the typical data scientist's approach to a machine learning problem pretty well. Deep learning is exciting but, for practical problems, most people aren't using it yet. XGBoost is still dominant in Kaggle competitions. Lots of companies with lots of data still use simple single machine models.
So, I think it's a little misleading to say that deep learning eliminates the need for feature engineering. If we're going to provide business value to a company, we're often dealing with somewhat structured data and not "unstructured blobs of pixels or blobs of text". There are several simple linear models with highly engineered features in production where I work that resist attempts to replace them with more clever models and less feature engineering.
Deep learning is awesome and there may be a time when it solves every problem. But let's not oversell it. For now, if someone wants to do machine learning and get paid for it, they're more likely to see their colleagues using the techniques in this article than training deep multi-layer neural nets.
Deep Learning is heavily used in Kaggle competitions when there is image data.
Even in non-image data competitions, a deep neural network will often be one of the models chosen to ensemble. They generally perform slightly worse than XGBoost models, but have the advantage that they often aren't closely correlated with them, which helps fight overfitting when tuning ensembling hyperparameters.
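As a rough sketch of the blending idea (sklearn's GradientBoostingClassifier and MLPClassifier stand in here for an xgboost model and a real deep net, and the blend weights would normally be tuned on a holdout):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)

    gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0).fit(scaler.transform(X_tr), y_tr)

    p_gbt = gbt.predict_proba(X_te)[:, 1]
    p_mlp = mlp.predict_proba(scaler.transform(X_te))[:, 1]
    blend = 0.7 * p_gbt + 0.3 * p_mlp  # illustrative weights only
    for name, p in [("gbt", p_gbt), ("mlp", p_mlp), ("blend", blend)]:
        print(name, round(roc_auc_score(y_te, p), 4))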
For image competitions you are right. Neural networks are often in winning teams' ensembles, but they require a lot more work than something like xgboost (gradient-boosted decision trees). For a dataset that isn't image processing or NLP, xgboost is in general much more widely used than neural nets. Neural nets suffer from the amount of computing resources and knowledge needed to apply them, though given infinite knowledge and computing power they are probably on par with or better than xgboost. And if you need to analyze an image they are great.
I think the whole "you don't need to engineer features" thing can be a bit oversold, in that you don't need to build explicit feature extractors, but you do still need to design your architecture to fit your problem domain. It's more general than traditional feature extraction, but if you keep working on a single problem for long enough you may find that while deep learning can be a great starting point, you eventually want to add more hand-coded features if you really have some insight that the neural net isn't fully capturing.
I'm personally quite interested to see if we find meaningful ways of constructing differentiable feature extractors that we can train conjointly with a neural net.
Approaching (Almost) Any Machine Learning Classification Problem.
If you are doing sequence labeling, learning something about the data, tackling partially unlabeled data or time-varying data, you generally have to take a different approach.
It's a very insightful article about the nitty-gritty details of working on ML problems. However, as an outsider, I can't decide whether some of the very specific statements given without any reasoning (e.g. good ranges of values for parameters) are highly valuable wisdom coming from years of experience, or merely overfitted patterns he's adopted in his own realm.
I would say that table is really quite valuable. Kaggle problems come from all types of companies, so it doesn't make sense to say that it is "overfitted patterns that he's adopted in his own realm". With that said, validation on your own dataset will trump general knowledge, so you shouldn't view these parameters as hard and fast rules. But the parameters in that table will provide a useful starting point, and if you stray too far from them that is a warning sign that you might be overfitting.
Neither. The values he gives are reasonable (not bad choices) but also quite broad (usually covering several orders of magnitude), so not especially valuable per se.
Of course, things become much clearer when you actually know what the hyperparameters represent. For example, knowing the random forest algorithm, you'd know that having a higher number of estimators is always better (with the downside being diminishing returns and slower performance / more memory usage). Ironically, the blog post here gets this point wrong.
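A quick way to see the diminishing returns for yourself, on synthetic data: out-of-bag accuracy keeps inching up with more trees but flattens out, while fit time and memory keep growing.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    for n in (25, 50, 100, 300, 1000):
        rf = RandomForestClassifier(n_estimators=n, oob_score=True, n_jobs=-1, random_state=0)
        rf.fit(X, y)
        print(n, round(rf.oob_score_, 4))  # improvements flatten out as trees are added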
Throwing out a couple of stupid questions since we're talking about rules-of-thumb..
1) How much impact does feeding irrelevant features into a model have? For example, if you added several columns of normally distributed random numbers?
2) How much impact does having several highly correlated features have on a model?
3) If you had limited time and budgets would it be better spent on cleaning data (removing bad labels, noise in data); feature engineering (relevant features) or feature selection (removing irrelevant or redundant feature)?
1) It depends heavily on the model. Something like xgboost (gradient boosted decision tree) will handle irrelevant features fairly well, while other models (like linear models, especially without lasso regularization) will have much more trouble. In virtually all cases adding noise will decrease model performance (a quick way to measure this is sketched after these answers).
2) Same as 1), depends on the model. With good hyper-parameters xgboost can handle correlated features well, while other models may struggle.
3) With a good model (again like xgboost), feature engineering is usually the best use of your time. Removing "bad" labels and "noise" in the data is especially dangerous, because if you are not extremely careful you can make your model worse. If you can identify why a label is "bad" then you can remove or correct it, but you need a reason why you wouldn't have these bad labels in your test dataset. Removing outliers can help your model, but it is risky. In contrast, smart feature engineering is low risk and can provide large gains if you see a pattern the model could not. Feature selection can be important as well, and is generally pretty quick assuming you have good hardware, so you might as well do it, especially if you have some knowledge about which features you expect to be less useful.
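For question 1, the effect is easy to measure directly on your own data; a sketch on synthetic data, with sklearn's GradientBoostingClassifier standing in for xgboost:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=0)
    X_noisy = np.hstack([X, rng.randn(X.shape[0], 50)])  # 50 pure-noise columns appended

    for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                        ("gbt", GradientBoostingClassifier(random_state=0))]:
        clean = cross_val_score(model, X, y, cv=5).mean()
        noisy = cross_val_score(model, X_noisy, y, cv=5).mean()
        print(f"{name}: clean={clean:.3f} with_noise={noisy:.3f}")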
Awesome post and site, my only complaint about this blog is that you'd think the top left would be a link and 'No Free Hunch' would not be one, but in fact it's the opposite..
"Awesome post." Great! Show your support with an upvote. Bam.
"Too hungover to read all that today though. Definitely Saturday morning over a cup of coffee - assuming I'm not in the same shape then." This contributes in no way, shape, or form to promoting insightful and educational discussion.
There's definitely a school of thought on HN which would say so. Upvotes are a fine proxy for appreciation. Other online communities have taken steps to remove verbose "+1" mechanisms too - for example, GitHub, as per demand, added the ability to respond "+1" to comments without having to write an entirely new comment. That indicates that consumers of GitHub (and most HN users are probably in that demographic) value brevity.
Personally, I suspect a post saying only "Awesome post" would have garnered 0 or only a couple downvotes. However, the hangover comment was - in my opinion - a bit self-indulgent, irrelevant, and not in any of the broad categories of things people come on HN to find.
Just try not to humble-brag or add zero- or negative-value commentary. No one gives a rat's ass about your hangover or about when you'll get around to reading the article; we care about comments on the article itself.
Great post, but how am I supposed to find it credible when he uses quotation marks for emphasis?
> The splitting of data into training and validation sets “must” be done according to labels. In case of any kind of classification problem, use stratified splitting. In python, you can do this using scikit-learn very easily.
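For reference, the stratified splitting the quote refers to looks roughly like this in scikit-learn (sketch on made-up imbalanced data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, valid_idx in skf.split(X, y):
        pass  # each fold preserves the roughly 90/10 label ratio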