Great interview. In my experience it's amazing just how many people are talking about "Big Data" and just how exactly none of those are the ones with the necessary PhDs in statistics and algorithms to get anything of any value done.
In my experience there are very few domains within Machine Learning where you don't need to be an expert in the field to yield useful conclusions out of the data.
Even if you have a high-level conceptual understanding of the statistical methods, tuning the parameters to yield something relevant, or much more so, adapting existing algorithms to meet your needs requires some pretty serious dedication to the field.
As someone who has tried out various MOOCs and entry-level resources on machine learning, this is the same conclusion I came to. Beyond any sort of trivial example, I found I lacked the mathematical and statistical knowledge not only to interpret the results in a relatively unbiased and error-free way, but also to know "what to do next."
What scares me is that MOOCs are really pushing the data scientist field -- see Udacity and Coursera -- but giving people only enough knowledge to be dangerous. It's tough because data science is a fascinating field and many people who have the interest and aptitude don't have the means or life situation to go to grad school for it. These MOOCs are trying to appeal to such people, but they're not nearly rigorous enough.
> Beyond any sort of trivial example, I found I lacked the mathematical and statistical knowledge not only to interpret the results in a relatively unbiased and error-free way, but also to know "what to do next."
The popular MOOCs don't take you far enough to start doing serious machine learning, but you don't need a PhD to be ready to solve those problems.
It takes work. Lots of work. Re-learn linear algebra until you know why "eigenvectors" are so important. Know what the most important matrix factorizations (LU, QR, SVD, Eigen, Cholesky) do. Read the papers until the math becomes "no big deal". Pick up a probability textbook and read the whole thing; also, get a working knowledge of real analysis. It won't happen quickly.
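For anyone wondering what those factorizations look like hands-on, here is a minimal sketch with numpy/scipy (my own example, not from the parent; the matrices are random placeholders):

    import numpy as np
    from scipy.linalg import lu, qr, svd, eig, cholesky

    A = np.random.randn(4, 4)
    S = A @ A.T + 4 * np.eye(4)   # symmetric positive definite, so Cholesky applies

    P, L, U = lu(A)               # LU: A = P @ L @ U, the workhorse behind linear solves
    Q, R = qr(A)                  # QR: orthogonal Q, upper-triangular R; least squares
    U2, s, Vt = svd(A)            # SVD: singular values in s; low-rank approximation, PCA
    w, V = eig(S)                 # eigendecomposition: S @ V = V @ np.diag(w)
    C = cholesky(S, lower=True)   # Cholesky: S = C @ C.T, used for solves and sampling

Running them is the easy part; knowing when each applies (and why the SVD in particular keeps showing up) is where the real understanding is.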
The PhD is some classes, plus 3-7 years of focused work. Some of that's compressible and unnecessary to becoming a data scientist. Some of it isn't. The Coursera courses are great for getting you started; they're entry-level college courses, and if you read the papers and the seminal textbooks (e.g. Elements of Statistical Learning by Hastie et al) you can get into the intermediate territory in a couple years or so. It's not easy, but it can definitely be done. Getting to the expert level, I think, just requires real-world experience on real-world problems... but, one hopes, you can start attacking such problems once you're at the intermediate level.
What you are describing as a background is all part of a "normal" math/CS undergrad education (at least in Germany where I studied).
In terms of mathematical difficulty, Elements etc. (but also current research papers) are readable by anyone with a solid understanding of undergraduate mathematics (which is essentially a decent linear algebra course, multivariate analysis, a probability course, and a numerical computing course).
I think the reason why employers look for candidates with a PhD is that too many people "scrape by" when getting their CS degree -- e.g. they somehow fulfilled the required coursework, and somehow got their degree. The PhD requirement is essentially a bureaucratic substitute for answering the question "has this person understood math in sufficient depth to be able to do independent work with it".
Thanks muraiki and michaelochurch. This is a similar frustration I faced. The MOOCs often seem to teach you just enough to do something like formula substitution; everything starts to crumble when you move to data that's significantly different. So I have begun from the bottom, starting with MIT's Linear Algebra and Harvard's Statistics 110. Your comments have validated my journey, though this is going to be a long one.
Thank you for your advice. In my case it's not "re-learn linear algebra" but "learn linear algebra... after first learning calculus and how to understand/write a proof." :) At 32 I'm not certain if this is a worthwhile way for me to go...
That being said, I haven't given up completely. I'm starting to read "The Haskell Road to Logic, Maths, and Programming" in the hopes of finally being able to grok proofs. At the very least, I feel that learning more math can only help me as a developer.
> ... how exactly none of those are the ones with the necessary PhDs in statistics and algorithms to get anything of any value done.
I see it almost the other way around: companies strictly demand PhDs for Big Data jobs and can't find this unicorn. Yet we live in a time where we don't need a PhD program to receive an education from the likes of Ng, LeCun and Langford. We live in a time where curiosity and dedication can net you valuable results, where CUDA hackers can beat university teams. The entire field of big data visualization requires innate aptitude and creativity, not so much an expensive PhD program. I suspect Paul Graham, when solving his spam problem with ML, benefited more from his philosophy education than his computer science education.
Of course, having a PhD still shows dedication and talent. But it is no guarantee of practical ML skills; it can even hamper research and results when too much power is given to theory and reputation is at stake.
In my experience Machine Learning was locked up in academia, and even there it was subdivided. The idea that "you need to be an ML expert before you can run an algo" is detrimental to the field, not helping much in driving wider industry adoption of ML. Those ML experts set the academic benchmarks that amateurs were able to beat by trying out Random Forests and Gradient Boosting.
I predict that ML will become part of the IT stack, as much as databases have. Nowadays, you do not need to be a certified DBA to set up a database. It is helpful and in some cases heavily advisable, but databases now see a much wider adoption by laypeople. This is starting to happen in ML. I think more hobbyists are toying with convolutional neural networks right now than there are serious researchers in this area. These hobbyists can surely find and contribute valuable practical insights.
Tuning parameters is basically a gridsearch. You can bruteforce this. In goes some ranges of parameters, out come the best params found. Fairly easy to explain to a programmer.
Adapting existing algorithms is ML researcher territory. That is a few miles above the business people extracting valuable/actionable insight from (big or small or tedious) data. Also, there is a wide range of big data engineers making it physically possible for the "necessary" PhDs to extract value from Big Data.
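To make the "in goes some ranges of parameters, out come the best params" idea concrete, a minimal sketch of a brute-force grid search (my own illustration; the SVC model and parameter ranges are arbitrary, and note the reply below about why scoring on the training data like this is a trap):

    from itertools import product
    from sklearn.svm import SVC

    def brute_force_grid_search(X, y, Cs=(0.1, 1, 10, 100), gammas=(1e-3, 1e-2, 1e-1)):
        best_score, best_params = -1.0, None
        for C, gamma in product(Cs, gammas):
            model = SVC(C=C, gamma=gamma).fit(X, y)
            score = model.score(X, y)  # accuracy on the same data it was fit on
            if score > best_score:
                best_score, best_params = score, (C, gamma)
        return best_params, best_score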
While there's some truth in what you're saying, you sort of demonstrate a very common pitfall:
> Tuning parameters is basically a gridsearch. You can bruteforce this. In goes some ranges of parameters, out come the best params found.
This sounds so simple. However, if you just do a bruteforce grid search and call it a day, you're most likely going to overfit your model to the data. This is what I've seen happen when amateurs (for lack of a better word) build ML systems:
(1) You'll get tremendously good accuracies on your training dataset with grid search
(2) Business decisions will be made based on the high accuracy numbers you're seeing (90%? wow! we've got a helluva product here!)
(3) The model will be deployed to production.
(4) Accuracies will be much lower, perhaps 5-10% lower if you're lucky, perhaps a lot more.
(5) Scramble to explain low accuracies, various heuristics put in place, ad-hoc data transforms, retrain models on new data -- all essentially groping in the dark, because now there's a fire and you can't afford the time to learn about model regularization and cross-validation techniques.
And eventually you'll have a patchwork of spaghetti that is perhaps ML, perhaps just heuristics mashed together. So while there's value in being practical, when ML becomes a commodity enough to be in an IT stack, it is likely no longer considered ML.
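For what it's worth, the boring fix for steps (1)-(5) is to tune with cross-validation and keep a held-out test set that the grid search never sees. A minimal sketch with scikit-learn (synthetic data, arbitrary model and parameter grid):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)                      # every candidate scored by 5-fold CV

    print("best params:", search.best_params_)
    print("cross-validated accuracy:", search.best_score_)
    print("held-out test accuracy:", search.score(X_test, y_test))  # the number to report

If the last two numbers diverge badly, you get to have step (4) before deploying rather than after.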
I agree, machine learning currently requires huge globs of knowledge to wield. As such it is a tool for an intelligent entity to use; it is not intelligent in its own right. I disagree with Michael about the correct way forward for machine learning: the problem is not one of piecemeal engineering. If you are spending your time developing algorithms to solve a particular class of problem, you are wasting your time (in terms of the pursuit of GAI). The overarching problem needs to be tackled: what is the correct framework within which to think about intelligent systems? This is, as mentioned in the article, a question of insight -- we need the metaphorical apple to fall on some bright spark's head. But if all the bright sparks are fully engaged in chasing the short-term problem, it might take a long time.
What makes you think human brain isn't just an ensemble of hundreds of different specialized algorithms?
Trying to emulate biological brains might not be the way forward. People tried to fly by constructing bird-like feathers and wings - it obviously didn't work. We had to understand the underlying principles governing flight. The same applies to creating neural networks.
There's some underlying principle the brain uses. It doesn't mean we have to crack brain structure to achieve strong intelligence.
We should look for inspiration in biological systems but we should not try to copy them.
> In my experience it's amazing just how many people are talking about "Big Data" and just how exactly none of those are the ones with the necessary PhDs in statistics and algorithms to get anything of any value done.
The fact that you think it takes a PhD to be a decent data scientist indicates that you're out of your depth on this one.
You don't need a PhD to get useful work done in these fields. You need to work hard and tackle difficult math. It takes years, but it can be done if you have the talent and drive. A prestigious (top-10) PhD certainly makes your life easier in getting the top jobs, but it doesn't really make you more (or less) able to fulfill them. I don't have a PhD and can do what 95+ percent of "PhD Data Scientists" do for work.
A PhD is (a) focused work on a specific, usually narrow problem, and (b) the years of self-study required to know enough to attack said problem. For real-world data science, (a) only matters in the ~1% chance of overlap between your dissertation and the needs of your employer, and (b) doesn't require five years in an academic institution (although it probably does require about that much time if you study on your own, since you're likely to be doing much of the work on your own time).
The PhD is a valuable experience and I don't mean to denigrate it. I often wish I had gotten one, in my 20s, instead of becoming a world-class expert on software office politics and "only" an intermediate-plus Haskell/Clojure/machine-learning guy. The PhD is a great experience for many people, but I don't think it belongs on a pedestal.
I'm not talking strictly about a PhD, but about PhD level work, some of which can't really be done on your own unless you're a supremely talented individual.
You can look at some of the modern ML algorithms and see what I mean; many people that I know have worked with Latent Dirichlet Allocation, but they have no idea how the model works, and there's no way they could extend it to work online or under certain performance or storage constraints without having spent months and years working on that problem.
That's not a realistic expectation for anyone in the field. Yes, the algorithms you find in Weka and other ML toolkits are useful, but the actual "Big Data" problems have their own performance and algorithmic constraints that are far, far beyond dedicated self-learners.
I was grateful and surprised to see the article start off immediately with a meta-remark on the collusion between pop-science media and academics. It recalled one of my frustrations during grad school in the late 2000s: student researchers striving for recognition, and journalists sexing up our stories in ways that misinformed the public.
This feedback loop explains a great chunk of why we on HN spend so much time nit-picking through stories on e.g. Wired. What we read is not so much "reporting," but designs-by-committee of researchers doing things they think the public wants/needs and reporters bending stories toward what they think the public wants and needs.
All news is like that. When they cover stories we actually know something about, we see that it's all misinformed BS, but then for some strange reason, on other issues, we're perfectly happy to have everything voxplained to us (or to treat the NY Times as gospel, if that floats your boat). Michael Crichton:
“Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.
In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.”
> When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
It's actually worse than that. What I see is that when companies have the ability to store and "analyze" large amounts of data, their appetite for data tends to increase. So they seek to take in as much data as they can find. More often than not, the quality of the data is mixed at best. Frequently, it's horrible, and because the focus is on data acquisition and not data quality, nobody notices the bad data, missing data and duplicate data.
The result: even if you manage to come up with decent hypotheses, you can't trust the data on which you test them.
> When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
This is not necessarily a bad thing. Take the domain of application performance management. You're collecting hundreds of thousands of metrics from all over the place: OS, network, middleware, end user. Occasionally there is a performance problem that is non-obvious. You go through the obvious metrics and find nothing. It is a great thing at this point to just throw all this data at some algorithm and let it come back to you with "metric X, Y, Z looks related". This gives me some hypotheses I can go check that I would probably never have thought of on my own. And I have a direct way of verifying whether a hypothesis was correct: oh, it looks like there are 2 disks in this cluster; 1 is running at 100%, the other at 0%, so the overall utilization only shows 50%, and I didn't think that was a problem. Investigate. Oh, this disk has compression enabled, the other doesn't; turn it off, and the application runs fast now.
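As a toy illustration of the "throw all the metrics at an algorithm" step (everything here is made up: the metric names, the data, and the choice of plain correlation as the algorithm):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 1000
    metrics = pd.DataFrame({                     # hypothetical collected metrics
        "disk0_util": rng.uniform(90, 100, n),
        "disk1_util": rng.uniform(0, 5, n),
        "cpu_util": rng.uniform(20, 80, n),
        "net_errors": rng.poisson(2, n).astype(float),
    })
    response_time = 0.05 * metrics["disk0_util"] + rng.normal(0, 1, n)   # the symptom

    # rank every metric by absolute correlation with the symptom we care about
    ranked = metrics.corrwith(response_time).abs().sort_values(ascending=False)
    print(ranked)   # "metric X, Y, Z looks related" -- hypotheses to go verify by hand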
Relevant but unmentioned in his list of accolades, Jordan is the 2015 Rumelhart prize winner, which is the equivalent of the Turing award for the Cognitive Sciences.
I am glad he pointed out that most artificial neural networks bear only a superficial resemblance to our own biological ones. But I think he failed to appreciate the power behind Boltzmann Machines - a type of neural network designed to create a generative model of a dataset. Personally, I find the resemblance between these neural networks and the real ones a little uncanny. And very few people seem to realize that the formalism behind a Boltzmann Machine can be adapted to fit the activation patterns of real neurons - you just have to redefine the energy function to match that of real biological neurons.
I've only started looking at RBMs recently, but ... what are you talking about? Biological neural networks use spikes. RBMs certainly don't. RBMs look more like HMMs to me than like biological neurons.
Don't take this as me saying "you're wrong" -- I'm curious if there is another way to think about RBMs (aka "papers please", so I have a deeper understanding when I do my own implementation of RBMs).
The energy function for a Boltzmann machine is usually E = v^T W h, where v are the visible units, W contains the weights, and h represents the hidden units. The form of the energy function defines the activation function of each neuron and the learning rule that goes with it. Now, you can in theory start off with any energy function you like (this form just happens to be the simplest). You would then have to re-derive the activation function and learning rule.
Just what would the energy function look like for real neurons? I don't know but we do know that the activation function would have to "spike" in bursts. So that is a clue. We also have rudimentary ideas about the learning rule used in biological neural networks, so you would also want to take this into account when determining the actual energy function. Finally, real neurons do not send retrograde signals but are instead wired recurrently, which must also be taken into consideration.
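To make the standard case above concrete, a minimal sketch of that energy function and the activation it implies, assuming the conventional sign E(v, h) = -v^T W h and dropping the bias terms for brevity (both assumptions mine):

    import numpy as np

    def energy(v, W, h):
        # E(v, h) = -v^T W h  (conventional sign, no bias terms -- an assumption for brevity)
        return -v @ W @ h

    def p_hidden_given_visible(v, W):
        # the activation this energy implies: P(h_j = 1 | v) = sigmoid((W^T v)_j)
        return 1.0 / (1.0 + np.exp(-(W.T @ v)))

    rng = np.random.default_rng(0)
    v = rng.integers(0, 2, size=6).astype(float)   # binary visible units
    h = rng.integers(0, 2, size=4).astype(float)   # binary hidden units
    W = rng.normal(size=(6, 4))                    # weights
    print(energy(v, W, h), p_hidden_given_visible(v, W))

Change the energy function and, as noted above, the activation function and learning rule have to be re-derived to match.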
None of the parties mentioned actually deny the above equivalence. The reason backprop is a popular idea in deep learning is because people started developing continuous models, where the output (and the error) was a continuous and differentiable function of the input and the weights, which allowed the chain rule to be used to compute the gradients, which in turn allowed one to use gradient-descent methods. This shift from discrete units to continuous units was termed error backpropagation, and not just "the chain rule".
I recently worked for a very cutting edge bioinformatics company, and I particularly agree with his segment about data sizes growing.
What I would say, though, is that I think it is less an issue of the statistical strength of the data and more to do with the methods used to turn the data itself into statistics. For example, I was working with what by now (size projections are paramount in sysadmin planning for stuff like this) should be close to a petabyte's worth of genetic data. The real issue we were running into was that the traditional tools tend to fall apart on data of this size.
What we ended up doing was writing a distribution protocol for a certain application that worked well but wasn't very concurrent, and then every machine on the network besides the storage/sequencers/backup would crunch the data, helping even the big servers out. A big server would get 10-30 workers and a workstation would get 1-4. We turned 2 day analysis into 4 hour analysis.
And once we did the analysis, only one person, the company owner/genius, could decipher it.
I have to say, as a sysadmin, it was probably one of the most challenging and most educational positions I ever had. I actually enjoyed always being the only person in the room without a PhD.
Indeed, the big-data winter is just waiting to happen, after all the hot air that has been produced (& continues to be). Anyway, it's very nice to see the media hype put in perspective for a change.
I can see how some people might feel like being between a rock and a hard place: The data firehoses are all in place, our key-value stores are getting fuller by the hour, and we're supposed to sit and wait for decades before we'll be able to make any sense of it? I wouldn't be surprised if some will much rather play roulette today than make a sure bet in 10+ yrs.
"we have no idea how neurons are storing information, how they are computing, what the rules are, what the algorithms are, what the representations are, and the like."
"...you get an output from the end of the layers, and you propagate a signal backwards through the layers to change all the parameters. It’s pretty clear the brain doesn’t do something like that. "
So why can't the brain do some kind of backpropagation?
We know the brain doesn't do "backpropagation" in the sense that backpropagation is a complete description of how the brain works. It may be a component, but there has to be much more, because what backpropagation can do is fairly well characterized mathematically now and is not sufficient to build anything like a human brain. If you haven't seen a paper that clearly spells that out, it's because it would be considered too trivially obvious a result to publish.
When we don't know how a thing works that does not preclude us being able to eliminate some of the possibilities. Our ignorance is profound, but not total.
This is a reply to multiple sibling comments. There is actually recent work which shows that deep learning methods can also work WITHOUT any reverse signals: http://s.yosinski.com/dan_cownden_presentation.pdf
Interesting paper. Any more details on the architecture of the feedback connections? Also, I can't tell from the paper where and how the weights are being updated, e.g. what does "train" mean in this context?
I believe instead of multiplying the delta by W^t to backpropagate the error from layer l to l-1, you multiply it by a random projection B. It's hard to dig deeper because there doesn't appear to be any other information on it except here: http://isis-innovation.com/licence-details/accelerating-mach...
As an aside, are they really trying to patent a slight twist on backpropagation? That seems pretty counter-productive to me.
It looks like there are still reverse signals (e.g., deltas), but they are multiplied by a random matrix B instead of the usual transpose of the weight matrix for that layer. Am I misunderstanding?
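For concreteness, a minimal sketch of the two delta computations for a single layer (the shapes and the tanh nonlinearity are my own assumptions, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 20))     # forward weights from layer l-1 (20 units) to layer l (10 units)
    B = rng.normal(size=(20, 10))     # fixed random feedback matrix, never trained
    delta_l = rng.normal(size=10)     # error signal arriving at layer l
    z_prev = rng.normal(size=20)      # pre-nonlinearity activations at layer l-1 (hypothetical)

    def d_tanh(x):
        return 1.0 - np.tanh(x) ** 2

    delta_backprop = (W.T @ delta_l) * d_tanh(z_prev)   # standard backpropagation
    delta_random   = (B @ delta_l)   * d_tanh(z_prev)   # random-feedback variant being discussed
    # either delta is then used in the usual way to update the weights feeding layer l-1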
This is not entirely correct. There is a lot of evidence that action potentials backpropagate into the dendrites [1]. On a local level, this allows synapses on the dendrites to "be aware" of the activity of the post-synaptic neuron. However, I do not think this accomplishes/implements the ANN backpropagation algorithm (although a few statements and a citation in [1] alludes to this being possible).
As a weak epiphenomenalist, I'd argue that all conscious thought is a backpropagation. Is it controversial that the brain is able to perceive its own output, or am I misunderstanding how backpropagation in neural networks is implemented?
That's not what backpropagation is. Backpropagation is best thought of as a "cheat" (an algorithmic simplification) that lets you calculate the derivative of a feed-forward neural net. You need that derivative to optimize the net relative to some cost function, for example with gradient descent, but computing it naively is computationally costly.
For some neural nets you still have a gradient, but the concept of back- or forward-propagation is not definable. Based on the topology and structure of biological neural nets, which do you think is the case?
Think about it; biological neurons most definitely don't use backprop. Physical neurons can't backprop linearly if they forward prop nonlinearly. I think Hinton pointed this out, though almost all ANN researchers agree.
Hebbian learning .... maybe. I never thought too much about contrastive divergence.
It doesn't really matter; brains most certainly don't work like ANNs, except maybe in some weird mean field sense for a few things like liquid state machines. It would be a huge coincidence if LeCun or Hinton or whoever magically wrote down the brain equation....
>In the brain, we have precious little idea how learning is actually taking place.
It's Hebbian learning. When a post-synaptic neuron fires shortly after a pre-synaptic one fires, the synapse in question is strengthened (the surface area actually becomes larger). I hope he's talking about higher level concepts of learning, because otherwise he's wrong.
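As a rough sketch of that rule (the numbers, learning rate, and timing window are all made up, and real STDP curves are more involved):

    import numpy as np

    def hebbian_update(w, pre_spike_t, post_spike_t, lr=0.01, window=0.02):
        """Strengthen the synapse if the post-synaptic spike closely follows the pre-synaptic one."""
        dt = post_spike_t - pre_spike_t
        if 0.0 < dt < window:
            w += lr * np.exp(-dt / window)   # bigger boost for tighter timing
        return w

    w = 0.5
    w = hebbian_update(w, pre_spike_t=0.100, post_spike_t=0.105)   # "fires shortly after"
    print(w)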
Hebbian learning is the little idea we do have. We don't know much more: how memories are represented by neurons, control, consciousness, vision, and almost anything.
> A lot of people are building things [with big data] hoping that they work, and sometimes they will ... Eventually, we have to give real guarantees. Civil engineers eventually learned to build bridges that were guaranteed to stand up. So with big data, it will take decades, I suspect, to get a real engineering approach, so that you can say with some assurance that you are giving out reasonable answers and are quantifying the likelihood of errors.
It seems like the idea is that machine learning and data-driven inference have to grow up and become a real scientific discipline. "Why can't you be more like Civil Engineering?" This isn't the best way to look at it. Machine learning is designed for situations where data is limited and there are no guarantees. Take Amazon's recommendation engine, for example. It's not possible to peer into someone's mind and come up with a mathematical proof that states whether they will like or dislike John Grisham novels. A data-driven model can use inference to make predictions based on the person's rating history, demographic profile, etc. It's true that many machine learning approaches don't have the scientific heft of civil engineering, but they are still very useful in many situations.
I'm not disagreeing with the eminence of Michael I. Jordan. I think this is a philosophical question with no correct answer. Is the world deterministic, can we model everything with rigorous physics style equations? Or is it probabilistic, are we always making inferences based on a limited amount of data? Both of those views are valid, especially in different contexts. Some of the most interesting problems are inherently probabilistic, such as predicting the weather, economic trends and the behavior of our own bodies. "Big Data" is obviously a stupid buzzword, but the concept of data driven decision making is very sound. We should put less focus on media hype terms and continue to encourage people to make use of large amounts of information. Get rid of the bathwater, keep the baby.
> Similarly here, if people use data and inferences they can make with the data without any concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re an engineer and a statistician—then you will make lots of predictions, and there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.
He is not saying anything about the relative heft of machine learning and civil engineering. He is saying that if you don't worry about whether your predictions coming from big data are accurate, and whether you know a priori that they are accurate, you will still make predictions, but some of them will be wrong, and you don't know which ones. The analogy with engineering is only incidental to his point, which is mainly about overfitting.
You can point out afterwards that a certain prediction made using big data was correct in hindsight by collecting data after the prediction was used to make some decisions, like Amazon might. But you would really like to know whether a decision is likely to be a good one before you make it. And he, as a scientist, is interested in knowing for sure whether his results are correct.
A lot of the talk on machine learning reminds me of thermodynamics. There are some states we can say can happen. We can determine what a state can be composed of in terms of microstates with certain probabilities. We have definite answers for some things, and in other situations, we have to settle for big picture images. It all depends on the measure of the space you are working in. Nevertheless, there are ways to quantify errors in machine learning routines. There are mathematically sound ways to reduce error too, and intuition gives us even more models (to test). I do not think it should be a debate based on deterministic and probabilistic guarantees. The question should be more geared to how can we make assumptions to form better models and consistently do testing along the way.
>>Another example of a good language problem is question answering, like “What’s the second-biggest city in California that is not near a river?” If I typed that sentence into Google currently, I’m not likely to get a useful response.
So I typed that in google just to see and indeed I got nothing. I guess their [1]knowledge graph still has a long way to go.
Resolving these kinds of queries is the same as asking the system to write a Turing-complete program. A generalized query essentially sets a goal, and the resolver is expected to create a program on the fly to build an answer.
For example, you can set a query "what's the 2nd biggest city in CA not near the river that has weather same as Seattle and is not among the top 500 cities in US".
As you can see, a generalized query would literally require the system to create a program on its own. If we could do this, we would not need programmers, and it would very likely be the same kind of breakthrough as a practically unlimited supply of energy.
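To make "create a program on the fly" concrete, here is roughly the program a resolver would have to synthesize for that query, written by hand over a toy, entirely made-up table of cities (all assumed to be in CA):

    cities = [  # hypothetical data, not real city statistics
        {"name": "A", "population": 3_900_000, "near_river": True,  "climate": "mediterranean", "us_rank": 2},
        {"name": "B", "population": 1_400_000, "near_river": False, "climate": "oceanic",       "us_rank": 600},
        {"name": "C", "population": 1_000_000, "near_river": True,  "climate": "desert",        "us_rank": 10},
        {"name": "D", "population":   900_000, "near_river": False, "climate": "oceanic",       "us_rank": 700},
    ]

    candidates = [c for c in cities
                  if not c["near_river"]
                  and c["climate"] == "oceanic"     # "weather same as Seattle"
                  and c["us_rank"] > 500]           # "not among the top 500 cities in US"
    candidates.sort(key=lambda c: c["population"], reverse=True)
    answer = candidates[1]["name"] if len(candidates) > 1 else None   # "2nd biggest"
    print(answer)

Every extra clause in the question adds another filter or aggregation the system has to invent and compose correctly, which is the hard part.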
Uh, no, that's just "the second biggest city." Wolfram can't even handle "What is the second biggest city in California near a river", i.e. it can't do something one step beyond the trivial.
His comments are way off the mark. The recent advances in neural network training are not strictly due to convolutional neural networks, but rather to the discovery that gradient descent works remarkably well for training multilayer neural networks when using modern hardware. All of the best-performing pattern recognition techniques in speech, image recognition, and natural language processing now utilize "neural networks". A neural network is nothing more than a poor name for a non-linear statistical model, and, if you like, one with a hierarchical structure (which is made possible strictly by the non-linearity).
I don't think that anybody in the research community (except for maybe an occasional crazy) believes that neural networks have any biological significance beyond inspiration. NIPS (Neural Information Processing Systems) has been a reputable venue for work in statistics for some years now with no confusion over the idea that "Neural" does not mean a precise (or even imprecise) imitation of biological neurons.
How quickly did you read this? He says very nearly what you are saying:
"Well, I want to be a little careful here. I think it’s important to distinguish two areas where the word neural is currently being used.
One of them is in deep learning. And there, each “neuron” is really a cartoon. It’s a linear-weighted sum that’s passed through a nonlinearity. Anyone in electrical engineering would recognize those kinds of nonlinear systems. Calling that a neuron is clearly, at best, a shorthand. It’s really a cartoon. There is a procedure called logistic regression in statistics that dates from the 1950s, which had nothing to do with neurons but which is exactly the same little piece of architecture."
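In code, the "cartoon" he describes is just this (a minimal sketch; the numbers are arbitrary):

    import numpy as np

    def cartoon_neuron(x, w, b):
        z = np.dot(w, x) + b               # linear-weighted sum
        return 1.0 / (1.0 + np.exp(-z))    # passed through a nonlinearity (the logistic function)

    x = np.array([0.2, -1.0, 3.5])         # arbitrary inputs
    w = np.array([0.5, 0.1, -0.3])         # arbitrary weights
    print(cartoon_neuron(x, w, b=0.1))

Which is exactly the logistic-regression form he mentions; stack and compose these and you have the "deep learning" sense of the word neural.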