Empiricism and the limits of gradient descent

simonster · on May 28, 2018

There are a couple of factual errors here. First, the difference between backprop and evolution is smaller than the author indicates. The error signal used in modern backprop training is stochastic because it is computed on a minibatch (which is why it's called stochastic gradient descent). This stochasticity seems important to achieving good results. And the most popular evolutionary algorithm in the deep learning world is Evolution Strategies, which effectively approximates a gradient. Ordinary genetic algorithms are not gradient-based and have recently shown promise in limited domains, but can't compete with gradient-based algorithms for supervised learning.

The key claim in the article, that gradient descent could not discover physics from equations seems, like it is a statement about neural networks, not gradient descent. Given sufficient training data, a neural network can probably learn to model physics. I sympathize with the concern that it's very difficult to translate a neural network's knowledge into human concepts, but I see no reason to believe that optimizing the same system with an evolutionary algorithm would make this problem any easier. You could e.g. try to do program induction (which was supposed to be the future of AI many decades ago) instead of modeling the data directly, but choosing to perform program induction does not preclude the use of a neural network. Neural networks trained by gradient descent can generate ASTs (e.g. http://nlp.cs.berkeley.edu/pubs/Rabinovich-Stern-Klein_2017_...).

[Edited to remove reference to universal approximation; as comments point out, even if a neural network can approximate a function, it isn't guaranteed to be able to learn it. But I am reasonably confident that a neural network can learn Newton's second law.]

_0w8t · on May 28, 2018

Recently I was involved in calibrating a thermal infrared camera at work. A colleague out of curiosity tried to use machine learning and ended up with model containing hundreds of parameters (weights). Yet it was not better than a trivial model using a Planck integral (based on simple assumption about how things worked) and a linear regression (to account for systematic errors), 2 parameters in total. And the simple model completely ignored time dependencies assuming thermal stabilization which could be accounted using a couple of extra parameters based on a typical solution of heat transfer equation. Initially it was puzzling as I thought that Planck integral should be easily modeled with basic blocks of ML models. But then I realized that Planck function in our case was integrated over a complex profile of an infrared filter and may not be something that is easy to capture within ML.

2sk21 · on May 28, 2018

ML is a blunt instrument in such situations. I think your example illustrates the point of the original article very nicely.

joe_the_user · on May 28, 2018

Given sufficient data, according to the Universal Approximation Theorem, a neural network can learn to model physics.

The ability of a system of linked functions to approximate any continuous function seems rather far from the ability to "learn modern physics".

It would seem like knowing modern physics would involve symbolic calculations rather than just approximating the behavior of any system.

AstralStorm · on May 28, 2018

A lot of physics is functions with singularities. ANN can only approximate these to a specified limit...

I want to see a neutral network that correctly solves SAT-3.

roywiggins · on May 28, 2018

> Given sufficient data, according to the Universal Approximation Theorem, a neural network can learn to model physics.

It just says there are weights to approximate any function, not that you can actually learn the weights. Neural networks trivially can't learn how to approximate noncomputable functions to any accuracy, and there might be a lot of other functions that neural networks are terrible at actually learning.

srean · on May 29, 2018

I am waiting for this uneducated drivel of explaining NN performance by their 'universal function approximator property' to stop. There are tons other schemes that are also universal approximators, they were known before NN was a thing. Why don't we use those ? Why don't they work as well ?

Learning from examples and generalizing is a much different problem from function approximation.

sannee · on May 29, 2018

Maybe suggest using polynomials instead of neural networks next time that happens? :)

simonster · on May 28, 2018

It's a fair point that the Universal Approximation Theorem does not guarantee that the weights can be learned. OTOH, the physical laws that the article states a neural network cannot discover are computable functions.

AstralStorm · on May 28, 2018

You need a stronger bound than this. They have to be possible to approximate govern specific network size, architecture and activation functions. Calculating that (or good statistics that will say so approximately) is a hard problem... It is solvable for a bunch of activations in a layered perceptron but attempt extending this to something more complex.

togelius · on May 28, 2018

I had to make a few simplifications to spell out the differences clearly and avoid making the text infinitely long. It's true that most current gradient descent algorithms are stochastic because they are computed in batch mode, and that sophisticated evolution strategies approximate the gradient. I still think the differences are significant, in that evolution updates less often and the direction of the update is much less (if at all) dependent on the feedback.

Now, your point about to what extent this is really about neural networks is a good one. Could a network learn F=ma, even if we could not interpret it? Maybe. With the right data, represented the right way.

thesz · on May 29, 2018

>This stochasticity seems important to achieving good results.

No, it is not and may be counter resultive, so to say.

https://arxiv.org/pdf/1605.02026.pdf - page 8, figure 2(b). SGD optimized neural networks stops learning at the accuracy at which whole-dataset methods start!

Also please note that the figure I pointed to is about high energy particles analysis. SGD trained NN cannot even distinguish particles with good precision, let alone discover physics.

nabla9 · on May 28, 2018

Also. Neural networks commonly use dropout regularization. In dropout your train only fraction (typically 50%) of randomly selected neurons. Effectively creating essembles.

Gradient descent and evolutionary algorithms (and many other search algorithms) advance in the hypothesis space with incremental (stochastic) steps and both algorithms are path dependent. How they generate and update their hypothesis, how big steps they take, how they represent their state, and how they apply randomness creates unique learning bias but there is nothing fundamentally different.

goatlover · on May 28, 2018

> Given sufficient training data, a neural network can probably learn to model physics.

Maybe basic Newtonian physics, but I seriously doubt any ANN we've built to date could come up with QM or Relativity no matter how wonderfully massive and accurate the data was.

Looks to me like those required sophisticated conceptual understanding of the world in addition to leaps of the imagination and creative thought experiments.

tlb · on May 28, 2018

I'm optimistic about the potential for evolutionary algorithms. I've used both EAs and gradient descent in developing robot controllers.

But the argument here about why gradient descent won't be able to learn certain things is weak. Thought experiments are not a reliable guide to what what GD can or can't do.

It's fair enough to say that F=ma and E=mc² aren't in the data. Indeed, it took thousands of years of human thought to arrive at them. So the argument "it's not clear how an algorithm could extract F=ma from the data" isn't a strong criticism, because humans also can't do it by induction.

The long process culminating in F=ma involved a lot of abstract symbolic thought. Whether human-level abstract symbolic thought can be learned through GD (probably in combination with some sort of Monte Carlo tree search) is an open question. It can only be answered by trying to build things and seeing if they work.

If you want to make an argument about the limits of GD and induction, it'd be better to compare to a problem humans can solve reliably, rather than an insight that one genius had after decades of thought while standing on the shoulders of other geniuses.

visarga · on May 28, 2018

The author of the article mistook the importance of environment interactivity, considering that the learning algorithm is the key difference.

I think the distinction here should be between dataset based learning and simulator based learning. The genetic algorithms mentioned in the article rely on a dynamic environment, not a static dataset. Given the dynamic environment (which is like an infinite dataset) gradient methods can learn just as well - look at AlphaGo for example. But when the model can't experiment / try new actions and see the effects, it can't separate causes from correlations.

You can extract only so much from a dataset, the model needs a way to cause and observe external effects. The environment could be the real world, a simulated world, a game, a meta neural net optimiser (AutoML), or any domain where the model can act and influence the path of learning and the environment by its previous actions.

I'm happy to see the boom in RL and simulator based learning in the last few years. It means we are on the right track.

Eridrus · on May 28, 2018

Right, I think it's very easy to fall into a trap of thinking that something is too hard for ML methods, because the way we solve them is different than how computers can solve them.

There is ongoing work to try to see if we can learn NN-based controllers that do their own learning or thinking. The Neural Turing Machine work being the most famous, but there's also some amount of single-shot/few-shot learning literature that is exploring other ways of learning because trying to learn from examples via SGD is too slow.

EAs are certainly interesting and I think are show some interesting results for hyperparameter tuning/architecture search of neural nets where we don't have gradients, and doing it with far less computation than RL solutions.

perl4ever · on May 28, 2018

I don't understand how E=mc^2 can not be in the data. If it's a universal law, isn't it in more (all) data than any pattern that isn't universal?

maxander · on May 28, 2018

The article is making a simpler point than that. If I show you the table:

And so on for some arbitrary number of rows, you can look at the table all you want but you will not perceive "A+C=B". It's just not written there. To get A+C=B you have to generate something else in addition to the table, namely a hypothesis- but this is a creative act, not an empirical one.

LolWolf · on May 28, 2018

In this case, any linear model (with reasonable, minimizable loss, which could be, say convex) will learn the correct thing.

Here's a quick gist[0] doing it using least-squares, and learning it exactly (also, for B in the third row you may have intended 35 instead of 45?).

This simple regression model learns exactly the weights (-1, 1)—equivalently, it learns -A + B = C.

------

[0] https://gist.github.com/guillean/6f3ff05fa99b2b377fcf309bdc4...

tlb · on May 28, 2018

If you connect A and B as the input to a linear neural net, and train against C, it'll very quickly arrive at weights of [-1, +1] and be able to correctly predict C given A and B. Whether or not it represents it in notation humans are familiar with, it has learned it for the practical purpose of being able to compute the function.

goatlover · on May 28, 2018

But how would a neural network know to connect data for mass with data for the measured speed of light? Why would a neural network be looking for an equation for energy conversion in the first place? If you just provide tons of raw data from instruments, what does that mean? What do yo do with it?

Sure, a human can clean the data and put it into a format that gives meaningful results. But if we're just talking about an AI learning from raw data with no supervision, where does it even start?

mannykannot · on May 28, 2018

As someone who finds this interesting but does not know enough to take a position, I think a bigger question would be how does it come up with the abstract concept of energy?

I am aware that Alpha Go Zero came up with various strategic abstractions of the game that are recognized by competent players, and some novel ones, but I do not know where this program and its self-play training stands in the dichotomy of this debate.

goatlover · on May 28, 2018

Go has a clear objective that you can train for. What would be the objective in coming up with a physics law from raw data? What would it even be training for? That sounds like asking whether DL could create a new board game from a bunch of data on human behavior.

> think a bigger question would be how does it come up with the abstract concept of energy?

I don't see how it could, and that's kind of how Kant argued against empiricism. You can't derive a conceptual understanding of the world from raw data. There's nothing in raw data to structure or make sense of it without some way to interpret the data. Even calling it data is an interpretive act (as opposed to noise).

PeterisP · on May 28, 2018

Given a sufficiently large table, an agent can certainly perceive without acting (testing a hypothesis) that "A+C=B" through its compressibility; the most compact(least complex) representation of the data can replace the many individual datapoints in one column with a learned rule how to calculate it.

There's related research on how children learn language, namely, how much observed evidence (i.e. based on cases where we have monitored and counted every word a child has heard in their life) is needed for a child to switch from a "lookup table" approach for certain features to "rule based" approach (detectable by observing overregularization, applying a systematic rule even when the actual language, including examples the child has heard, has an exception to that rule) and then to a "rule+exceptions" correct understanding; the experiments point towards "learning a rule" then and only then when a "compressed representation" is beneficial from information theory point of view.

perl4ever · on May 28, 2018

It seems to me that A+C=B and E=mc^2 are both written in, or consistent with, that table. But widening the context produces something different in each from a probabilistic perspective. A+C=B is 100% certain with narrow context, but falsified with a little more context, while E=mc^2 is less certain with narrow context, but increasingly certain with wider context.*

*Like, this table is on a computer connected to a global network, which is based on electromagnetism, which is intimately related to relativity. The more you understand what the table is, the more certain you can be of E=mc^2.

AstralStorm · on May 28, 2018

Yes, but this is just total energy. Try kinetic energy instead. (Even Newton's Ek=0.5 * mv^2 much less Lorentz special relativity or general relativity.)

Now the model with kinetic energy plus rest energy. A network unaware of time will be unable to figure it out. Especially the differential in velocity.

What you need to actually devise such laws is generalizing conflict-driven clause learning with some good rule to pick models, name them and enumerate them. E.g. defining minimum generalizing set of logic clauses with support for undecidable and uncomputable functions. (Which means deciding when to give up.) This is essentially the inverse of a MAX-SAT solver. Minimax logic representation so to speak.

irishsultan · on May 28, 2018

Well, that hypotesis would be wrong in this particular case (check the middle row).

Which doesn't make your point wrong of course, and this is for a simple function.

qop · on May 28, 2018

Data has to be obtainable and is often dimensional.

How would you have measured the speed of light in the forties? Even having lots of test cases, you wouldn't be able to deduce something like that with test cases alone.

petermcneeley · on May 28, 2018

Speed of light has been experimentally determined long before the 40s. The discovery of the speed of light being constant lead to the special relativity (aka before GR which is the E=mcc). http://www.speed-light.info/measure/speed_of_light_history.h...

tlb · on May 28, 2018

The speed of light is easy to measure. What's hard is measuring the change in mass when something gains or loses energy. A large nuclear reactor running for a year only loses a few grams of mass-energy -- you can't measure it accurately enough or be sure that the mass didn't go somewhere else like into the cooling water.

AstralStorm · on May 28, 2018

Technically with billions of events you can. This is how LHC and other particle accelerator data is analysed.

The interesting part is detecting systematic errors. (Human equivalent is "are my senses lying right now".)

joe_the_user · on May 28, 2018

In wikipedia and numerous references, I get the impression that EAs, evolutionary algorithms, are a huge and fuzzily defined field.

With Genetic algorithms, just one kind of evolutionary algorithm, the fitness, the mutation and the crossover function seem to require the implementor to look at the problem domain and "come up with something" whereas once you have a goal, gradient descent requires lots of tuning but is more or less defined, you can track how well you're doing and so-forth.

Perhaps there's something I'm missing. Pointers would be welcome.

Looked at: https://en.wikipedia.org/wiki/Evolutionary_algorithm https://en.wikipedia.org/wiki/Genetic_algorithm

tlb · on May 28, 2018

Salimans et al evolved weights in convolutional neural nets -- the same structure of network that works with deep reinforcement learning -- to play Atari games from pixels. https://blog.openai.com/evolution-strategies/

So you can make it work with generic models. Although as you say, many of the big successes like the Backgammon players and Kosa's work on electronic circuits used domain-specific models.

sepranu · on May 28, 2018

The points the author makes about gradient descent are accurate, in a sense. However, they oversimplify the technique (as it is currently applied today) and the context in which it is used. It seems as if the author, like many others, has a basic understanding of the subject's basic mechanisms, but not the context in which experts understand them.

The example the author cites regarding evo algorithms learning physical laws is laughable - "It's just not in the data - it has to be invented" applies equally to both the backprop and the evolutionary learning algorithms.

"In this case, the representation (mathematical expressions represented as trees) is distinctly non-differentiable, so could not even in principle be learned through gradient descent."

This is incorrect, almost like saying NLP data is not differentiable. For instance, set this representation up as the output of a network (or, if you wanted to be fancier, the central component of an autoencoder), and see how well it predicts/correlates with the experimental data. This is the error, which is back-propagated through the network's nodes.

FWIW, many theoreticians believe that the unreasonable effectiveness of neural networks and especially transfer learning is a result of their well-suitedness to encode laws of physics and Euclidean geometry. The author's final points about a nine-year-old survey may be out of date w.r.t. contemporary neural networks, which often have spookily good local minima and do not behave in the way intuition about gradient descent might suggest.

hmartiros · on May 28, 2018

In my experience if you have even a little smoothness in your problem's cost manifold, taking advantage of gradients is invaluable to sample efficiency. Many losses which don't seem differentiable can be reformulated as such - you can look around and see a wide array of algorithms being put into end-to-end learned frameworks. If the dimensionality is small, second-order methods (or approximations thereof) can do dramatically better yet. However, I'm also a fan of evolutionary algorithms. I see no reason why evolutionary rules can't be defined with awareness of gradient signals.

vinn124 · on May 28, 2018

> Many losses which don't seem differentiable can be reformulated as such...

agreed, especially with policy gradients.

> If the dimensionality is small, second-order methods (or approximations thereof) can do dramatically better yet.

i have not seen second order derivatives in practice, presumably due to memory limitations. can you point me to examples?

hmartiros · on May 28, 2018

They aren't common in deep learning, but if you look to estimation problems like odometry, optimal control, and calibration, the typical approach is to build a least squares estimator that optimizes with a gauss-newton approximation to the Hessian, or other quasi-newton methods. Gradient descent comparatively exhibits very slow convergence in these cases, especially when there is a large condition number. In the case of an actual quadratic loss function, it can (by definition) be solved in one iteration if you have the Hessian. However, getting it efficiently within most learning frameworks is difficult, as they primarily only compute VJPs or HVPs.

pmalynin · on May 28, 2018

Once again, there is a general conflation of evolutionary algorithms and learning without a differentiable error function. I've presented this argument again with a discussion with antirez here on HN [1], but the crux of is that Reinforcement Learning as it stands is optimization over (maybe)-non-differential errors. For AlphaGo there is no gradient, per se, that says you will optimize your wins if you go this way (now, it is optimized by training towards the win-rate "score", which could be an error score) -- look at REINFORCE for other variations. Evolutionary Learning and Reinforcement Learning as two sides of the same coin.

[1] https://news.ycombinator.com/item?id=16652138

creo · on May 28, 2018

(This is of course my opinion, not scientific fact) The biggest argument for EA is that proper implementation does not get stuck in local minimum. GD and SGD have that tendency, if not in one spot then probably looping between many. Problem with EA by the other hand is that its mathematical model is based on probability which is quite tricky to operate on even for above-average programmer.

ssivark · on May 28, 2018

I don't see why stochastic has to be worse than evolutionary algorithms. In high-dimensional spaces, there is a large space of possible "mutations". SGD just biases certain mutations based on the gradient from a minibatch. That sounds a lot like evaluating the fitness of a mutated population on a minibatch and culling members with low fitness. In fact, there are many demonstrations that the stochastic nature of SGD (coming from the use of minibatches) is crucial for effective learning.

ozy · on May 28, 2018

Notice how blacks swans is a labeling/categorization problem. Whereas F=ma is a causal relationship. The first has no induction problem, there is no swan-ness except in our minds. The second has no induction problem, because you model causal relationships and predict from them, better predictions means less wrong. The problem of induction is only one of "ultimate reality".

andbberger · on May 28, 2018

Meh, weird article. None of the nice modern results (imagenet family, etc) were achieved just through gradient descent - this article, like most, seems to be missing the forest for the trees with deep learning.

It's not about the network architecture, or gradient descent on their own - it's the interaction, the dynamical system over weight space that training is.

Behind every great modern deep learning result? An enormous hyperparameter search and lots of elbow grease to carefully tune that dynamical system juuuuust right so the weight particle ends up in just the right place when training finishes. Smells like evolution to me. Deepmind even formalized the evolutionary process a deep learning researcher runs manually when fine-tuning a model into population based training https://deepmind.com/blog/population-based-training-neural-n...

PeterisP · on May 28, 2018

"An enormous hyperparameter search and lots of elbow grease to carefully tune that dynamical system" i.e. the classic GDGS (gradient descent by grad student) approach where you have a grad student train a system, decide in which direction the parameters should be updated (i.e. look at the gradient), tweak the system and repeat until convergence.