
You also need the qualifier "...any continuous function, on a compact set."

Once you add all three qualifiers in (approximate/continuous/compact) it starts to sound more like math and less like a miracle.

Incidentally, one thing of great interest is, how does the number of hidden units required behave as a function of dimensionality of the input domain. In dramatic language, "Can neural networks get around the curse of dimensionality?"

The Cybenko proof does not shed light on that question. Andrew Barron (http://www.stat.yale.edu/~arb4/) had some results that seemed to indicate the dependence was moderate (not exponential). I'm not aware of what the current state of the art is.




>> Once you add all three qualifiers in (approximate/continuous/compact) it starts to sound more like math and less like a miracle.

I agree. The problem with attempting to make subtle technical points digestible by untrained people is that the larger meaning of the exercise is likely to be lost along with the details. This would probably be a good demonstration to show to investors in your NN-based startup, because it seems impressive. But if I had to state the take-home message of this that the average person should care about, I'm not sure there is one. There are simpler systems with the same property that are taught to undergraduate engineers (and NNs don't represent a novel path to having it, even); there's a great distance between approximating functions and solving practical ML problems.

Edit: Personally, I would have been more interested had this been an argument about backpropagation for training NNs. I already suspected NNs could do this because of what's in them, but training is just as much of a constraint on what they can actually do.


Yep, it's funny how different people can have different reactions. The headline of the article said "any function", and so did the first sentence of the post ("any function at all").

Later on in the article, it's qualified, but tempers are already rising.

Here come people with their non-computable functions, their unmeasurable functions, their nowhere-continuous functions, all wanting to get approximated. In sup norm! On unbounded sets!

Irksome, greedy, and vexatious.


It can learn any input to output mapping. What more do you want?


The article doesn't say anything about whether it can learn it, just that it can represent it.


The proof is constructive, so actually you can. Just create a few neurons for the "tower function" and the correct output weight at every point you want to map.
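
A minimal sketch of that construction in 1-d (my own illustration, not code from the article; the target function, number of towers, and steepness are arbitrary choices):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

  def tower(x, left, right, steepness):
      # Difference of two steep sigmoids: ~1 on [left, right], ~0 elsewhere.
      return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))

  def approximate(f, x, n_towers=100, lo=0.0, hi=1.0):
      # One tower per subinterval; its output weight is f at the midpoint.
      edges = np.linspace(lo, hi, n_towers + 1)
      steepness = 50.0 * n_towers   # keep transitions narrower than the bins
      out = np.zeros_like(x)
      for left, right in zip(edges[:-1], edges[1:]):
          out += f(0.5 * (left + right)) * tower(x, left, right, steepness)
      return out

  x = np.linspace(0.0, 1.0, 2001)
  target = lambda t: np.sin(2 * np.pi * t)
  err = np.max(np.abs(approximate(target, x) - target(x)))
  print(err)   # sup-norm error on [0, 1]; shrinks as n_towers grows

Each "tower" is just two sigmoid units sharing an output weight, so the whole thing is an ordinary one-hidden-layer net; the price is that the number of units grows with the resolution you want.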

If you mean learn functions efficiently with few parameters, or in ways that generalize well, the article makes no claims about that at all. There are two "no free lunch" theorems: one says it's impossible to guarantee that any learning method will generalize well on all problems, and the other says it's impossible to guarantee that any optimization method will work well on all problems. You just have to make strong assumptions like "my function can be modeled by a neural net" or "the error function is convex."

However, neural networks are agnostic to the optimization algorithm you use to set their weights; there are many different ones.


> how does the number of hidden units required behave as a function of dimensionality of the input domain

If I recall correctly, a non-linear problem can be solved as a linear problem if you consider more dimensions. The hidden layer adds dimensions. So it's not a function of the input domain but of the problem domain, which usually isn't explicitly known.
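
The standard toy illustration of that lifting (my own example, not from the thread): XOR is not linearly separable in 2-d, but add a third coordinate x1*x2 and a single plane separates it.

  import numpy as np

  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  y = np.array([0, 1, 1, 0])                    # XOR: not linearly separable in 2-d

  # Lift to 3-d by adding the product feature x1*x2 (one extra "dimension").
  X_lifted = np.hstack([X, X[:, :1] * X[:, 1:]])

  # The plane x1 + x2 - 2*(x1*x2) = 0.5 now separates the two classes.
  w, b = np.array([1.0, 1.0, -2.0]), -0.5
  print((X_lifted @ w + b > 0).astype(int))     # matches y: [0 1 1 0]

A hidden layer does something similar, except it learns which extra coordinates to add.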


The question in the OP concerns approximation of the f(.) in:

  y = f(x)
where the input "x" is d-dimensional (say).

Some problems (i.e., choice of "f") could be easy. Maybe f only depends on one element of x, for example. There would be no curse of dimensionality in this case. Same situation if "f" depends only on any fixed number of elements of "x".

I think this is roughly what you mean by "dimension of problem domain." Fix an "f"; that defines a problem. And you're right, efficient solution of that fixed problem is important!

My remark (which is also in the last paragraphs of the Cybenko reference cited in the OP) had to do with increasingly difficult problems.

How to get such a sequence of problems? Suppose you take a simple function like

  f(x) = exp(-0.5 * dot(x,x))
i.e., the Gaussian, and approximate it with a linear superposition of 1-d sigmoids (as a neural net would).

The question is, is there an explicit dependence on dimensionality, and is that dependence exponential in d?

And of course, for more general function classes (not just the single Gaussian function above), is there such a dependence? If it is not exponential, that would be astonishing, revolutionary.

The reason this setup ("vary the problem size") is interesting is that we would clearly like to use neural nets for increasingly higher-dimensional problems (e.g., learn appearance of 32x32 cats, then 64x64 cats, then ...).
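
For what it's worth, here is a crude way to poke at that empirically (entirely my own sketch, nothing from Cybenko or Barron; the sampling scheme, widths, and error threshold are arbitrary, and it conflates approximation error with training and sampling error):

  import numpy as np
  from sklearn.neural_network import MLPRegressor

  def gaussian(X):
      return np.exp(-0.5 * np.sum(X * X, axis=1))

  def sample(n, d, rng):
      # Random directions, uniform radius in [0, 3], so the bump is covered at every d.
      v = rng.normal(size=(n, d))
      v /= np.linalg.norm(v, axis=1, keepdims=True)
      return rng.uniform(0, 3, size=(n, 1)) * v

  rng = np.random.default_rng(0)
  for d in [1, 2, 4, 8]:
      X_train, X_test = sample(10000, d, rng), sample(2000, d, rng)
      for width in [4, 8, 16, 32, 64, 128]:
          net = MLPRegressor(hidden_layer_sizes=(width,), activation='logistic',
                             max_iter=2000, random_state=0)
          net.fit(X_train, gaussian(X_train))
          err = np.max(np.abs(net.predict(X_test) - gaussian(X_test)))
          if err < 0.05:
              print(f"d={d}: {width} hidden units reached sup error {err:.3f}")
              break
      else:
          print(f"d={d}: did not reach 0.05 with <= 128 hidden units")

Treat the output as a curiosity, not evidence: the theory question is about the best possible net of a given width, while this only shows what one particular training run happened to find.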


I think I know the problem you're referring to (despite not having the right kind of specialist training to actually work in this field). I think that the fact that random or quasirandom probes are able to explore these high-dimensional spaces fairly effectively is evidence that the problem is somehow not exponential and therefore tractable in some fashion. Does that sound relevant?


Yes, your observation is relevant.

One has to be careful to separate the functional-approximation problem (which is in the OP) from the statistical-model-selection problem (which is NOT in the OP).

The statistical-model-selection problem is easier in some ways (you don't have to approximate the function everywhere, just where you get data; if you didn't get data somewhere, you don't care what happens there).

It is harder in some ways (all that probabilistic folderol, plus, your data is noisy).

There are results that give rates for the functional-approximation problem. By rates, I mean, how good is the approximation versus number of hidden units. The best work I know of is by Andrew Barron, but that was in the mid/late 1990s. He's like a genius bulldozer, so his papers are tough going. You'll note that the Cybenko results, like in the OP, do not give rates. This is obviously a huge difference.

There are also results in the statistical-model-selection problem, of course. With rates. That's Vladimir Vapnik's big contribution, later carried on by others. One of the main results is that a model class has an intrinsic complexity (VC dimension, or other measures) and you only need to have order-of that many random probes to get (close to) the best model in the class. No matter what the dimensionality of the space where the data lives.

This is precisely the point you're making above.

My comment is getting too long, but: notice that the statistical-model-selection work only speaks about how to get (close to) the best model in the model class. You then have to make sure your model class is full enough to get close to the optimal rule (which is chosen by Nature, or whatever).

So to make a full theory, you need both pieces: how to choose a big-enough model class (functional approx.) and how to select a good-enough member of the model class (statistical model selection).

Typically there is a total error having one term for each of the two effects, and therefore a balancing act between making the model class big enough to approximate anything, vs. making it small so that you can easily select a good model. If you look at it sideways, this therefore looks like an optimization problem with a Lagrange multiplier penalizing model complexity. So it's quite elegant.
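
Schematically, the balancing act looks like this (the functional forms below are invented stand-ins, not Barron's or Vapnik's actual bounds; they just have the right shape, one term falling and one rising in the complexity k):

  import numpy as np

  n_samples = 10_000

  def approx_error(k):
      # Error from the model class being too small; shrinks as the number of
      # hidden units k grows. (Stand-in for a Barron-type rate; constants invented.)
      return 5.0 / np.sqrt(k)

  def estimation_error(k, n):
      # Error from selecting a model with finite data; grows with class complexity.
      # (Stand-in for a VC-type rate; again, purely illustrative.)
      return np.sqrt(k * np.log(n) / n)

  ks = np.arange(1, 2000)
  total = approx_error(ks) + estimation_error(ks, n_samples)
  best = ks[np.argmin(total)]
  print(best, total.min())   # neither the smallest nor the biggest class wins

Sweeping k and minimizing the sum is exactly the "Lagrange multiplier penalizing model complexity" picture.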


You are correct. This is Cover's Theorem.

"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated." - Cover, Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition., 1965


This is a well-known observation, but it's not at all the issue I was trying to get at, and which the existence proof of the OP raised.

From the last paragraph of the paper by George Cybenko referenced in the OP:

"While the approximating properties we have described are quite powerful, we have focused only on existence. The important questions that remain to be answered deal with ... how many terms in the summation (or equivalently, how many neural nodes) are required to yield an approximation of a given quality? [...] We suspect quite strongly that the overwhelming majority of approximation problems will require astronomical numbers of terms. This feeling is based on the curse of dimensionality that plagues multidimensional approximation theory and statistics."

[I'm highly familiar with both the paper by Cybenko, and the paper by Tom Cover on linear separability that is the source of the Wikipedia quote pasted above, having gone through them carefully as part of my PhD. Andrew Barron, mentioned above, was Tom Cover's student, and the work Andrew did can be viewed as another approach to this problem.]


I misunderstood your intentions. I just wanted to chime in with the name of the theorem I thought you were describing. I like knowing the names of theorems, and I had a hard time finding this one when I first looked for it.


> it starts to sound more like math and less like a miracle.

that applies to all of math. There are no miracles.


> > it starts to sound more like math and less like a miracle.

> that applies to all of math. There are no miracles.

Err, what applies to all of math? That it sounds like math?

I do not agree that there are no miracles; plenty of mathematics is miraculous. As a first example off the top of my head: the areas of squares that can be drawn with vertices on the integer lattice in the plane are exactly the integers in whose prime factorisations every prime `p` for which `p + 1` is divisible by 4 appears an even number of times. This is a miraculous relationship between geometry and number theory, and it's just the tip of an iceberg!
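
For the sceptical, a quick brute-force check of that claim (a lattice square with edge vector (a, b) has area a^2 + b^2, so this is Fermat's two-squares theorem in disguise; the range 1..499 is an arbitrary cutoff):

  from sympy import factorint

  def is_lattice_square_area(n, limit=200):
      # n is achievable iff n = a^2 + b^2 for some integers a, b (edge vector (a, b)).
      return any(n - a * a >= 0 and int((n - a * a) ** 0.5) ** 2 == n - a * a
                 for a in range(limit))

  def criterion(n):
      # Every prime p with p + 1 divisible by 4 appears to an even power.
      return all(e % 2 == 0 for p, e in factorint(n).items() if (p + 1) % 4 == 0)

  assert all(is_lattice_square_area(n) == criterion(n) for n in range(1, 500))
  print("criterion matches brute force for 1..499")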



