
>> Machine Learning was a branch of Statistics

That's incorrect.

""" Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." """

-- Wikipedia




I don't personally know the state of the field in 1959, nor the accuracy of the Wikipedia page, but the phrase 'Machine Learning', as it is currently used, is closer to statistics than it is to learning.


It's even closer to systems, optimization, etc. In fact, I would argue that methods like backpropagation, dropout, batch normalization, LSTMs, and RNNs make it much closer to optimization than to statistics.
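To make that concrete, here's a minimal sketch of learning framed as pure optimization (toy data, numpy only; everything below is made up for illustration): pick parameters that minimize a loss by gradient descent, with no distributional story anywhere.

    import numpy as np

    # Fit y ~ w*x + b purely as an optimization problem: choose the
    # parameters that minimize mean squared error via gradient descent.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)

    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(500):
        err = (w * x + b) - y            # residuals
        w -= lr * 2 * np.mean(err * x)   # d(MSE)/dw
        b -= lr * 2 * np.mean(err)       # d(MSE)/db

    print(f"w={w:.3f}, b={b:.3f}")       # converges near w=3.0, b=0.5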


That's fair enough, although I'm still not really sure why statisticians have to think about data in such a different way from machine learning researchers. I feel that if a statistician didn't look at the bigger picture of what their data is actually about, and whether there are existing techniques to tackle that problem, they'd make a pretty terrible statistician.


I don't know if the commenter's assertions are correct, but here's my anecdotal experience: a statistician asks me if adding a feature could make a model worse (R^2), and I'm like, of course! He gets testy, and snidely chimes back, 'I don't know why you think that could happen.' And I think, 'I don't know why you don't think you can overfit data...'

Then it hit me: a statistician's thinking revolves around fitting the model on the entire data set, whereas my thinking revolves around how it performs on held-out test data.
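A quick toy demonstration of exactly that exchange (sklearn; the data is made up): appending pure-noise features can never lower R^2 on the data a linear model is fit to, but it can wreck R^2 on held-out data.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(42)
    n = 60
    x = rng.normal(size=(n, 1))
    y = 2.0 * x[:, 0] + rng.normal(size=n)

    for k in (0, 10, 30):  # number of junk features appended
        X = np.hstack([x, rng.normal(size=(n, k))])
        m = LinearRegression().fit(X[:40], y[:40])
        print(f"k={k:2d}  train R^2={m.score(X[:40], y[:40]):.3f}"
              f"  test R^2={m.score(X[40:], y[40:]):.3f}")

    # Train R^2 is monotonically non-decreasing in k; test R^2 typically drops.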


There is a good paper by Leo Breiman [1] which discusses the exact issue you mention. E.g., statisticians believe that there is a "true" model that generates the data, and that the observed errors are merely noise. ML, on the other hand, does not assume the existence of a "true" model; the assumption is that the data alone is the source of truth, and any model that predicts the data with the lowest error is preferable, subject to performance on a test set, cross-validation, etc. This is a powerful approach and distinguishes ML as a field separate from statistics.

My favorite example is predicting real estate prices. A statistician will build a multi-level model that takes various effects into account (zip code, city, school district, year/month of acquisition, etc.) and then build a regression model; the errors would then be simply noise. An ML approach would be to simply use weighted K-Nearest Neighbors, with geographic location as part of the distance metric. Sure, there are no effects to adjust, but the K-NN regression model can account for hard-to-capture quirks of geography by representing a local, non-linear decision surface.
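A sketch of that K-NN approach (sklearn; the coordinates and price surface below are invented stand-ins for real listings):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # No zip-code or school-district effects, just "nearby similar homes
    # sell for similar prices", with distance-weighted neighbor votes.
    rng = np.random.default_rng(1)
    n = 500
    lat = rng.uniform(37.0, 38.0, n)        # hypothetical coordinates
    lon = rng.uniform(-122.5, -121.5, n)
    sqft = rng.uniform(800, 3500, n)
    # Made-up price surface with a local geographic quirk (a "hot" pocket):
    price = 300 * sqft + 2e5 * np.exp(-50 * ((lat - 37.5)**2 + (lon + 122.0)**2))

    X = np.column_stack([lat, lon, sqft / 1000.0])   # crude feature scaling
    knn = KNeighborsRegressor(n_neighbors=10, weights="distance")
    knn.fit(X[:400], price[:400])
    print(knn.score(X[400:], price[400:]))           # R^2 on held-out homes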

[1] http://projecteuclid.org/euclid.ss/1009213726


There's a fairly simple difference that causes these holy wars. Machine learning likes to build distribution-free models. Statistics likes to study which distributions we can approximately fit to data, or powerfully falsify against data.
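For the statistics side of that split, a minimal sketch of "fit a distribution, then try to falsify it" (scipy; the data is simulated):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    data = rng.exponential(scale=2.0, size=500)   # secretly not normal

    mu, sigma = stats.norm.fit(data)              # maximum-likelihood normal fit
    stat, p = stats.kstest(data, "norm", args=(mu, sigma))
    print(f"KS p-value: {p:.2g}")                 # tiny p => the normal model is rejected
    # Caveat: estimating mu/sigma from the same data biases this p-value;
    # this is only a sketch of the fit-then-falsify workflow.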


What do you mean by distribution-free? There are still all the Bayesian forms of machine learning, and even within neural networks a lot of work is based around distributions (see generative models like the variational autoencoder).


Probabilistic machine learning is usually considered to be mostly statistics. The two classes aren't separated cleanly by a high-margin hyperplane, but we can clearly see the structure within the clustering ;-).



