Yes (I am not an expert in neural nets, but that appears to be exactly what this is saying). If you look at what goes into a neural net and compare it to what goes into a Fourier transform, it should be obvious that neural nets have even more than they actually need to do this task.
This statement doesn't make sense to me. A neural network literally can't produce anything besides a continuous function, and the universality theorem says that there is no continuous function they can't (approximately) produce.
So what could you possibly mean when you say neural networks have "more" than they need to do something which characterizes exactly what they can and can't do?
Who cares if they can compute any function. The important question is, can they learn any function, and can they learn in a way that can generalize?
(And clearly they can for many useful domains).
This is a chapter from a textbook, and a pretty introductory one at that. I'd say this ability of neural networks to represent any function is a pretty important piece of information to convey and prove.
It seems to me that jumping to whether they can "learn any function", as you asked, comes only after this is first established.
From Wikipedia:
In 1969 in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function.
The simple answer is that we no longer use the simple linear threshold units that McCulloch and Pitts put forth in 1943. The limitations of perceptrons and their training algorithms are well-known and well-studied, and in the last 70 years we've come a long way.
A book came out a little while ago which covers basic neural networks, and culminates in showing a network that can estimate XOR [1]. I strongly recommend reading not only the chapter, but the whole book.
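Not the book's construction, but to make the XOR point concrete, here is a hand-wired 2-2-1 network with step activations (a minimal sketch; the weights are my own choice, not taken from the chapter):

    step = lambda z: 1 if z > 0 else 0

    def xor_net(x1, x2):
        # Hidden layer: one unit computes OR, the other computes NAND.
        h_or = step(x1 + x2 - 0.5)
        h_nand = step(-x1 - x2 + 1.5)
        # Output unit ANDs them together, which is exactly XOR.
        return step(h_or + h_nand - 1.5)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))   # 0, 1, 1, 0

The hidden layer is exactly what the single-layer perceptrons in Minsky and Papert's result lack.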
'Single-layer' is an easily overlooked and absolutely essential modifier. Very few networks these days are single-layer. (Three layers is the minimum for universal function approximation, given nonlinear activations.) Deep networks you see in new research papers these days have, at minimum, perhaps three layers. Some by LeCun et al. will go as high as nine or twelve layers, with four to sixteen layers in breadth.
I'm pretty sure linearity was also part of the picture -- no matter how many layers of neurons you have, as long as they perform linear combinations, they could all be replaced with a single one (ignoring numeric errors).
I mentioned this in my post ("given nonlinear activations"), but you're correct to point out its importance. The combination of linear functions is always linear. Full stop. You need at least one non-linear layer to get reasonable results.
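A two-line sanity check of that "full stop" (just a sketch with random matrices; numpy assumed):

    import numpy as np

    W1, W2, x = np.random.randn(4, 3), np.random.randn(2, 4), np.random.randn(3)

    two_layers = W2 @ (W1 @ x)                   # two stacked linear layers...
    one_layer = (W2 @ W1) @ x                    # ...equal one layer with the product matrix
    print(np.allclose(two_layers, one_layer))    # True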
If you join a bunch of perceptrons together that limitation goes away. Another path is to make the problem effectively linear again by transforming into higher dimensions, kernels do this with one clever trick that allows them to avoid the computational cost of doing so explicitly.
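The "clever trick" in one tiny example (my own sketch for the degree-2 polynomial kernel; the general theory is of course broader):

    import numpy as np

    def phi(v):
        # Explicit lift into 6-d whose inner product equals (1 + x.y)^2 in 2-d.
        x1, x2 = v
        return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                         x1**2, x2**2, np.sqrt(2)*x1*x2])

    x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
    lifted = phi(x) @ phi(y)             # dot product in the lifted space
    kernel = (1 + x @ y) ** 2            # same number, computed entirely in 2-d
    print(np.isclose(lifted, kernel))    # True

The kernel evaluates the high-dimensional inner product without ever building the high-dimensional vectors.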
The biologically-inspired nature of neural nets was a steady, mesmerizing flame for machine learning research. A chance to work on an extremely simple, provably universal system that had embarrassingly obvious rhetorical implications for profound-sounding problems like the computability of consciousness, or the computational complexity of the human brain, proved too seductive for the legions of research-moths that flocked to bask in that alluring flame.
The initial flurry of research activity slowed to a background simmer, however. As a steady march of implementation problems was knocked down, it became clear that the limitations of neural nets were inherent to neural nets themselves, rather than any particular implementation detail. For example, the Lagrangians calculated by SVMs are provably convex, in contrast to the lack of a convexity guarantee for neural nets, which happily gradient descent into local minima. There were workarounds, of course, with regularization techniques, stochastic descent, etc. providing some measure of relief, but a migration of research interest away from neural nets seemed increasingly promising, and today, the migration seems largely complete.
> but a migration of research interest away from neural nets seemed increasingly promising, and today, the migration seems largely complete.
What are you talking about? Deep learning is one of the hottest areas of research today, and a lot of it has to do with neural networks. NN's are the state of the art in several domains. Case in point: http://image-net.org/challenges/LSVRC/2014/results. All of the top entries use convolutional networks; in fact, almost all of the entries do.
The fact that the loss function represented by a neural network can be highly nonconvex is what makes them so effective in the domains in which they are used. See this presentation by Yann LeCun for more info: http://www.cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex.p...
"ML theory has essentially never moved beyond convex models, the same way control theory has not really moved beyond linear systems. Often, the price we pay for insisting on convexity is an unbearable increase in the size of the model, or the scaling properties of the optimization algorithm ... This is not by choice: nonconvex models simply work better.
Have you tried acoustic modeling in speech with a convex loss? ... To learn hierarchical representations (low-level features, mid- level representations, high-level concepts....), we need “deep architectures”. These inevitably lead to non-convex loss functions."
This isn't to say that NN's are going to solve all our problems, but to say that there has been a shift in interest away from NN's is absurd.
Parent might be living in the recent past. There was a migration away from NNs in the 90s/early 00s, then Hinton and other people brought it back to life...with a vengeance :)
Exactly. The history of NNs is full of ups and downs, and they're becoming increasingly popular again in the form of Deep Learning, thanks to increasing cloud processing power and advancements by Hinton and others. Most of the traditional criticism of NNs relates to shallow nets. But deeper and far more complex structures, like those in animal brains, are not explored enough.
If you're interested in the generalization ability of neural networks I can recommend the following paper:
"Intriguing properties of neural networks" http://arxiv.org/pdf/1312.6199v4.pdf
TLDR: The authors create adversarial examples, i.e., they slightly modify the original images, which look exactly the same to humans but neural networks can't classify the images correctly anymore. What does that imply about the ability to generalize? :)
On a more general note: NNs are always treated as something magic. I think a "sober view" is that NNs are a special way to parameterize non-linear functions. This is neither good nor bad, but it's easy to see when you look at, for example, a 2-layer NN.
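In symbols (my own notation, with sigma applied elementwise):

    f(x) = W2 * sigma(W1 * x + b1) + b2

That is an explicitly parameterized family of non-linear functions; the parameters (W1, b1, W2, b2) are what training adjusts. Nothing more magical than that.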
>"which look exactly the same to humans but neural networks can't classify the images correctly anymore. What does that imply on the ability to generalize?"
Usually that means that the model is experiencing overfitting, and that's actually a challenge for any machine learning model.
Quite. It's not hard to come up with models or families of functions which share this property.
What matters is not only whether they can learn it but how much data they need to learn it to a given degree of accuracy. This is the kind of question addressed by nonparametric statistics and statistical learning theory.
This is an important statement and should be upvoted more. Case in point: "the Weierstrass approximation theorem states that every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function."
I think this ability to compute any function is important, and useful in understanding why neural nets work. Neural nets are essentially a blank computation matrix that, through training, best approximates a function.
Comparing neural net neurons to logic gates in a computer circuit or FPGA gives a nice intuitive understanding of why the training works in the first place. The training essentially sets up a custom virtual circuit for whatever function you need.
Given this property, it seems intuitive that they would be able to learn any function, since any function is representable as logic gates.
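A tiny version of the gate analogy (my own toy example, using the NAND weights you often see in introductory treatments):

    # One threshold neuron wired as a NAND gate (weights -2, -2, bias 3).
    # NAND alone is universal for boolean logic, which is where the
    # "virtual circuit" intuition comes from.
    def nand_neuron(x1, x2):
        return 1 if (-2 * x1 - 2 * x2 + 3) > 0 else 0

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, nand_neuron(a, b))   # the NAND truth table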
Sort of. It depends on the learning algorithm used. Typically some form of local optimization is used that gets it to a good local optimum. This is usually good enough and fast. However, you can use global optimization methods, or whatever optimization algorithm you want.
NNs themselves are sort of agnostic to the algorithm you use to set their weights. It's sort of like saying "computers are Turing complete, but can we ever prove that human programmers are smart enough to program any function."
It computes because it is detached, there is no feeling. To learn, we would need to first create something that is conscious, an ever-changing self. It needs to be in reference to itself, not some notion of compute in a detached paradigm.
> "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E". This definition is notable for its defining machine learning in fundamentally operational rather than cognitive terms, thus following Alan Turing's proposal in Turing's paper "Computing Machinery and Intelligence" that the question "Can machines think?" be replaced with the question "Can machines do what we (as thinking entities) can do?"
It's important, because we don't need to care about cognition or consciousness, and we can still write programs that solve problems well by making inferences from patterns in data.
As mentioned in the article, the formal statement is actually "neural nets can approximate (arbitrarily well, using the supremum metric) any continuous function". For other norms, they can also approximate non-continuous functions.
You also need the qualifier "...any continuous function, on a compact set."
Once you add all three qualifiers in (approximate/continuous/compact) it starts to sound more like math and less like a miracle.
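Spelled out (my paraphrase of the Cybenko-style statement, not a quote from the article): for any continuous f on a compact set K in R^d and any epsilon > 0, there exist N, v_i, w_i, b_i such that

    sup over x in K of | f(x) - sum_{i=1..N} v_i * sigma(w_i . x + b_i) | < epsilon

Precise, useful, and distinctly un-miraculous.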
Incidentally, one thing of great interest is: how does the number of hidden units required behave as a function of the dimensionality of the input domain? In dramatic language, "Can neural networks get around the curse of dimensionality?"
The Cybenko proof does not give enlightenment about that question. Andrew Barron (http://www.stat.yale.edu/~arb4/) had some results that seemed to indicate the dependence was moderate (not exponential). I'm not aware what the state of the art currently is.
>> Once you add all three qualifiers in (approximate/continuous/compact) it starts to sound more like math and less like a miracle.
I agree. The problem with attempting to make subtle technical points digestible by untrained people is that the larger meaning of the exercise is likely to be lost along with the details. This would probably be a good demonstration to show to investors in your NN-based startup, because it seems impressive. But if I had to state the take-home message of this that the average person should care about, I'm not sure there is one. There are simpler systems with the same property that are taught to undergraduate engineers (and NNs don't represent a novel path to having it, even); there's a great distance between approximating functions and solving practical ML problems.
Edit:
To me, personally, I would have been more interested had this been an argument about back propagation for training NNs. I suspected NNs could do this because of what's in them, but training is as much of a constraint on what they can actually do.
Yep, it's funny how different people can have different reactions. The headline of the article said "any function", and so did the first sentence of the post ("any function at all").
Later on in the article, it's qualified, but tempers are already rising.
Here come people with their non-computable functions, their unmeasurable functions, their nowhere-continuous functions, all wanting to get approximated. In sup norm! On unbounded sets!
The proof is constructive, so actually you can. Just create a few neurons for the "tower function" and the correct output weight at every point you want to map.
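A rough 1-d sketch of that construction (my own toy code; pairs of very steep sigmoids standing in for the towers):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def tower(x, left, right, height, steepness=1000.0):
        # A steep step up at `left` minus a steep step up at `right`: a bump.
        return height * (sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right)))

    f = lambda x: np.sin(2 * np.pi * x) + 0.3 * x        # target to approximate
    xs = np.linspace(0, 1, 500)
    edges = np.linspace(0, 1, 51)                        # 50 towers = 100 hidden neurons
    approx = sum(tower(xs, a, b, f((a + b) / 2))
                 for a, b in zip(edges[:-1], edges[1:]))
    print(np.max(np.abs(approx - f(xs))))                # shrinks as you add towers

Each tower is two hidden sigmoid neurons sharing one output weight, so the neuron count scales with the resolution you want, which is exactly the efficiency question.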
If you mean learn functions efficiently with few parameters, or in ways that generalize well, the article makes no claims about that at all. There are two "no free lunch" theorems: one says that it's impossible to guarantee any method will generalize well on all problems, and the other says it's impossible to guarantee any optimization method will work well on all problems. You just have to make strong assumptions like "my function can be modeled by a neural net" or "the error function is convex."
However neural networks are agnostic to the optimization algorithm you use to set their weights. There are many different ones.
> how does the number of hidden units required behave as a function of dimensionality of the input domain
If I recall correctly, a non-linear problem can be solved as a linear problem if you consider more dimensions. The hidden layers add dimensions. So it's not a function of the input domain but of the problem domain, which usually isn't explicitly known.
The question in the OP concerns approximation of the f(.) in:
y = f(x)
where the input "x" is d-dimensional (say).
Some problems (i.e., choice of "f") could be easy. Maybe f only depends on one element of x, for example. There would be no curse of dimensionality in this case. Same situation if "f" depends only on any fixed number of elements of "x".
I think this is roughly what you mean by "dimension of problem domain." Fix an "f", that defines a problem. And you're right, efficient solution of that fixed problem is important!
My remark (which is also in the last paragraphs of the Cybenko reference cited in the OP) had to do with increasingly difficult problems.
How to get such a sequence of problems? Suppose you take a simple function like
f(x) = exp(-0.5 * dot(x,x))
i.e., the Gaussian, and approximate it with a linear superposition of 1-d sigmoids (as a neural net would).
The question is, is there an explicit dependence on dimensionality, and is that dependence exponential in d?
And of course, for more general function classes (not just the single Gaussian function above), is there such a dependence? If it is not exponential, that would be astonishing, revolutionary.
The reason this setup ("vary the problem size") is interesting is that we would clearly like to use neural nets for increasingly higher-dimensional problems (e.g., learn appearance of 32x32 cats, then 64x64 cats, then ...).
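For anyone who wants to poke at this empirically, here is a crude probe (emphatically a sketch, not a statement about approximation rates; it assumes scikit-learn's MLPRegressor and only measures training fit):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def fit_error(d, hidden, n=2000, seed=0):
        rng = np.random.RandomState(seed)
        X = rng.uniform(-2, 2, size=(n, d))
        y = np.exp(-0.5 * np.sum(X * X, axis=1))         # the Gaussian above
        net = MLPRegressor(hidden_layer_sizes=(hidden,), activation='logistic',
                           max_iter=5000, random_state=seed).fit(X, y)
        return np.mean((net.predict(X) - y) ** 2)

    for d in (1, 2, 4, 8):
        print(d, [round(fit_error(d, h), 4) for h in (4, 16, 64)])

How fast you are forced to grow `hidden` as d grows is exactly the question; an experiment like this can suggest, but never prove, the dependence.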
I think I know the problem you're referring to (despite not having the right kind of specialist training to actually work in this field). I think that the fact that random or quasirandom probes are able to explore these high-dimensional spaces fairly effectively is evidence that the problem is somehow not exponential and therefore tractable in some fashion. Does that sound relevant?
One has to be careful to separate the functional-approximation problem (which is in the OP) from the statistical-model-selection problem (which is NOT in the OP).
The statistical-model-selection problem is easier in some ways (you don't have to approximate the function everywhere, just where you get data; if you didn't get data somewhere, you don't care what happens there).
It is harder in some ways (all that probabilistic folderol, plus, your data is noisy).
There are results that give rates for the functional-approximation problem. By rates, I mean, how good is the approximation versus number of hidden units. The best work I know of is by Andrew Barron, but that was in the mid/late 1990s. He's like a genius bulldozer, so his papers are tough going. You'll note that the Cybenko results, like in the OP, do not give rates. This is obviously a huge difference.
There are also results in the statistical-model-selection problem, of course. With rates. That's Vladimir Vapnik's big contribution, later carried on by others. One of the main results is that a model class has an intrinsic complexity (VC dimension, or other measures) and you only need to have order-of that many random probes to get (close to) the best model in the class. No matter what the dimensionality of the space where the data lives.
This is precisely the point you're making above.
My comment is getting too long, but: notice that the statistical-model-selection work only speaks about how to get (close to) the best model in the model class. You then have to make sure your model class is full enough to get close to the optimal rule (which is chosen by Nature, or whatever).
So to make a full theory, you need both pieces: how to choose a big-enough model class (functional approx.) and how to select a good-enough member of the model class (statistical model selection).
Typically there is a total error having one term for each of the two effects, and therefore a balancing act between making the model class big enough to approximate anything, vs. making it small so that you can easily select a good model. If you look at it sideways, this therefore looks like an optimization problem with a Lagrange multiplier penalizing model complexity. So it's quite elegant.
"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated." - Cover, Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition., 1965
This is a well-known observation, but it's not at all the issue I was trying to get at, and which the existence proof of the OP raised.
From the last paragraph of the paper by George Cybenko referenced in the OP:
"While the approximating properties we have described are quite powerful, we have focused only on existence. The important questions that remain to be answered deal with ... how many terms in the summation (or equivalently, how many neural nodes) are required to yield an approximation of a given quality? [...] We suspect quite strongly that the overwhelming majority of approximation problems will require astronomical numbers of terms. This feeling is based on the curse of dimensionality that plagues multidimensional approximation theory and statistics."
[I'm highly familiar with both the paper by Cybenko, and the paper by Tom Cover on linear separability that is the source of the Wikipedia quote pasted above, having gone through them carefully as part of my PhD. Andrew Barron, mentioned above, was Tom Cover's student, and the work Andrew did can be viewed as another approach to this problem.]
I misunderstood your intentions. I just wanted to chime in with the name of the theorem I thought you were describing. I like knowing the names of theorems, and I had a hard time finding this name when I first looked for it.
> > it starts to sound more like math and less like a miracle.
> that applies to all of math. There are no miracles.
Err, what applies to all of math? That it sounds like math?
I do not agree that there are no miracles; plenty of mathematics is miraculous. As a first example off the top of my head, the areas of squares that can be drawn with vertices on the integer lattice in the plane are exactly the integers in whose prime factorisation each prime `p` for which `p + 1` is divisible by 4 appears an even number of times. This is a miraculous relationship between geometry and number theory, and it's just the tip of an iceberg!
It would have been pretty interesting if this had NOT held.
It would have meant that even though "neural nets can NOT approximate (arbitrarily well, using the supremum metric) any continuous function", a neural network (the humans involved) was able to discover this limitation.
I find the idea of a neural net finding a limitation of a neural net, to be interesting.
A "neural net" is what a computer scientist decided to call something that he thought behaved somewhat like a now dated abstract model of what individual neurons worked like, from a period when neuroscience was really in its infancy.
If we take the phrase neural network to mean, "a series of deeply and widely interconnected elements which assist or inhibit the transmission of signals to one another, activated by the inputs exceeding a threshold," then I think there's a pretty good amount of overlap.
I understand there's a temporal signaling model (pulses at varying rates, not steady-state signals) and stochastic information (like random firing and a bunch of noise), but once we abstract out the axons, dendrites, and neurochemicals, is there another piece of functional equipment which drastically affects things? How does our simplified view (small, individually stupid pieces acting in concert to produce complex behavior) differ from the real brain?
The point is more that you can't abstract neurons away into a simple "analog in, digital out" pseudo-transistor with fixed connections and expect that to describe how the brain works. The brain makes active use of all those details you are abstracting away, in ways that would make your model's predictions differ from reality.
No, absolutely not. I'm saying that a "neural net" a la McCulloch doesn't accurately model how computation is performed by the brain. You can't accurately model human thinking by recording the brain's connections as a classical neural net. That's all I'm saying.
You should explain in concrete terms why that is the case. I think it's apparent that the human brain is a much more complex and advanced neural network (Intel 4004 vs Intel i7, perhaps?), but to say it is not one is interesting and I would like to hear why.
Perhaps "not even close" was a bit strong, but I'm talking about a neural net as a specific mathematical model here. To say that the brain "is" an instance of a particular mathematical model isn't even really meaningful. At best you could try to argue that the brain is "modelled well by <formalism> for <purpose>", although I think you'd lose that argument for all but the weakest of purposes.
I'm not a neurobiologist (if you are please correct/clarify!), but here are just a few of doubtless many significant differences as I understand them:
* Real neurons don't update in a bunch of coordinated discrete timesteps, as an ANN learning algorithm does. They can fire independently and in continuous time.
* Real neurons' activation behaviour is closer to sudden delta-function-like spikes, rather than a smooth activation function or something which allows them to stay in a firing state for longer than a short pulse.
* The structure of the connection graph of neurons in the brain is incredibly more complex than that of artificial neural net models, can change over time, isn't split into obvious layers from input to output, there isn't a clearly-identifiable error signal which is used to train it at the outputs, ...
* A real neuron is an incredibly complex biological system whose behaviour can be modulated in all sorts of ways and which I expect would take a nightmarishly complicated set of PDEs to model with any degree of realism even as an isolated unit. To think we've captured all aspects of their behaviour relevant to human cognition with a simple weighted sum and a sigmoid (say) seems pretty naive.
Details aside though, the idea that you can start drawing deep philosophical conclusions about the nature of human thought, our ability to conceive of our own limitations etc, based on an analogy between a complex biological system and a simple mathematical formalism which is at best loosely influenced by certain limited aspects of it -- it's just silly, one of those "not even wrong" sort of statements.
I think its worth pointing out that when computer science researchers play with different kinds of neural nets which expand beyond the simplistic version used in the OP, they still call them neural nets. I have seen abstract computer-driven neural nets that fire independently and in (more) continuous time, are plastic, lack clear layers, etc.
I think when you said "but I'm talking about a neural net as a specific mathematical model here", that is fair: you are asserting that while this model is similar to a biological one at some very primitive level, the model of neural net used here is so far removed from our biological implementation as to render any comparison non-meaningful. I believe that is possible, but we still haven't discussed evidence that that is the case.
I thought it might not be a "silly" question: consider that a Turing machine simulated with pencil and paper, given enough paper and time, is capable of doing any calculation your desktop computer can. It may take 10,000+ years and many "operators'" lives, but that is beside the point.
I was thinking that the comparison between the pencil-and-paper Turing machine and the desktop machine might be similar to that between this very basic neural net model and our own much more complex biological instance. The idea is that the two systems, while varying vastly in efficiency, might be equivalent in what they can theoretically compute.
I don't see how asking how much we can simplify our biological neural net and still have it be computationally equivalent to the original (even if less efficient) is a silly question.
(I don't like "silly"; it implies the question is not worth asking. Calling your student's question silly is a good way to make them never ask another.)
Because the neural network described there is just an abstraction that has mathematical and (with some modification) practical utility. Real neurons do not behave in the way the model behaves. Synapses have plasticity, not just neurons. I'm not sure if synaptic efficacy (i.e. how much pre-synaptic input influences output) is fully understood even now.
To put it simply, it is possible for ANNs to fail at something and for us to succeed at the same thing.
Right-- in the musculoskeletal system there are mechanisms that operate approximately like pulleys; so I think you can say "a person is a neural net" in exactly the same sense as you can say "a person is a pulley." Neural nets are models of systems that we see in the human body that have been simplified and adapted such that they can be used as tools. It's not clear what lessons about humans can be drawn from studying NNs.
You could say that a neural network found this limitation of neural networks, to the extent that neural networks could be defined in terms of mathematical logic. However, it's not guaranteed that the neural networks in our brain could be explained in these terms--the physical processes underlying them are not fully understood.
This is also why Neural Nets are susceptible to overfitting and fell out of vogue in the 90s :) They will merrily fit themselves, very precisely, to your noisy, wiggly data. Obviously there are ways to combat this, but it seems like a corollary to their 'universality'.
Feedforward NN's are only useful for a very narrow* set of problems (*-> compared to 'all' problems out there).
Recurrent networks are needed for stateful operation, i.e. where some kind of memory is needed (any case where the input is spread across time or the sequence of data is important). And learning in recurrent nets is still in its very early stages, unfortunately.
They are universal function approximators, which means they can map any set of input values to any set of output values. Of course, doing this sometimes requires rote memorization of every possible input and its output, rather than generalizing the function with a few parameters.
Adding more layers improves on this and allows you to make functions that compose multiple smaller functions. The problem with this is the nonlinearities cause the gradients to explode or vanish after a few layers. So the amount of computing power required to train them is huge.
Recurrent NNs had the same problem, since they are equivalent to a very deep feedforward network where every layer is a time step and the weights between layers are shared.
But the invention of Long Short Term Memory has made training RNNs practical. Basically, as I understand it, some connections do not use nonlinearities so the gradients don't explode or vanish.
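The "vanish" half is easy to see numerically (a toy sketch: sigma'(z) never exceeds 0.25, so each layer, or each unrolled time step of an RNN, multiplies the backpropagated signal by a typically small factor):

    import numpy as np

    def dsigmoid(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)             # at most 0.25

    rng = np.random.RandomState(0)
    grad = 1.0
    for layer in range(1, 31):
        # Chain rule: one w * sigma'(z) factor per layer on the backprop path.
        grad *= rng.randn() * dsigmoid(rng.randn())
        if layer % 5 == 0:
            print(layer, abs(grad))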
My point of view was that neurons in neural nets are essentially analogue logic gates. Given that combinations of logic gates are Turing complete, combinations of neural net neurons should be also.
My writing is not quite as nice or rigorous as the parent post, but the whole point of the blog is to get better at self-expression and explaining things.
I liked this article a lot, but I found it extremely confusing how some of the diagrams were interactive and some weren't. Why not make them all interactive? Barring that, an obvious visual indicator when one is interactive would be handy. As it was, I clicked on a lot of static images.
If neural nets can compute any function (as seems neatly proven here), then can they compute a given function in more than one way? If so, then upon applying novel input (which was our goal following training), how can we know that the particular way learned from the training set is 'right' for our novel input? If this is all true, it would seem to make neural nets perfectly unreliable as a means of modeling..?
This is a wonderful post. One minor aspect which nags me is that when I read "any function", I think any "effectively calculable method". But regular feedforward MLPs are not Turing Complete (will you be going over recurrent or recursive networks?). If so, it would be useful to note this distinction as I've never seen that point for confusion dealt with properly in one place.
The same arguments as in the original Cybenko paper, or the Stone-Weierstrass theorem, lend support to the idea that SVMs are universal approximators (with most typical kernels). This has been proven by a couple of authors.
I'm not aware of universal approximation results for random forests, but since they have the same general construction, this would not be surprising.
I don't think I understand your point. By awesome I meant it's a very exciting/interesting field of research. I don't mean to call thermodynamics awesome because it powers cars or because it's a well-researched field.
Trying to understand how the brain works or making a computer that works in a similar way is awesome to me. If there are still computational limitations that makes it not practical, that's still an awesomely interesting subject.
It was the IBM SyNAPSE Chip [1]. It's actually been known since 1988/89 that neural networks can approximate any continuous function [2], but this chapter explains how in a much more intuitive sense.
May I also humbly suggest the two-volume series called "Parallel Distributed Processing" (Rumelhart et al.), which provides an excellent overview of early NN research.
http://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theor...