The question in the OP concerns approximating the function f(.) in:

  y = f(x)
where the input "x" is d-dimensional (say).

Some problems (i.e., choices of "f") could be easy. Maybe f depends on only one element of x, for example. There would be no curse of dimensionality in that case. The same holds if "f" depends only on some fixed number of elements of "x".

I think this is roughly what you mean by "dimension of problem domain." Fix an "f"; that defines a problem. And you're right, efficient solution of that fixed problem is important!

My remark (which is also in the last paragraphs of the Cybenko reference cited in the OP) had to do with increasingly difficult problems.

How to get such a sequence of problems? Suppose you take a simple function like

  f(x) = exp(-0.5 * dot(x,x))
i.e., the Gaussian, and approximate it with a linear superposition of 1-d sigmoids (as a neural net would).

The question is: does the number of sigmoids needed for a given accuracy depend explicitly on the dimensionality, and is that dependence exponential in d?

And of course, for more general function classes (not just the single Gaussian function above), is there such a dependence? If it is not exponential, that would be astonishing, revolutionary.
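
To make the setup concrete, here is a minimal numpy sketch (my own illustration, not from the OP): approximate the d-dimensional Gaussian above by a linear superposition of 1-d sigmoids, using random hidden weights and a least-squares fit of the output weights in place of full neural-net training, and watch the test error as d grows with the number of hidden units held fixed.

  # Minimal sketch: approximate exp(-0.5 * dot(x, x)) in d dimensions by
  # sum_j c_j * sigmoid(dot(w_j, x) + b_j). Random hidden weights plus a
  # least-squares fit of the c_j stand in for full neural-net training.
  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def fit_and_test(d, n_hidden=200, n_train=5000, n_test=2000, seed=0):
      rng = np.random.default_rng(seed)
      X_train = rng.normal(size=(n_train, d))   # points where the Gaussian has mass
      X_test = rng.normal(size=(n_test, d))
      y_train = np.exp(-0.5 * np.sum(X_train**2, axis=1))
      y_test = np.exp(-0.5 * np.sum(X_test**2, axis=1))
      W = rng.normal(size=(d, n_hidden))        # random directions w_j
      b = rng.normal(size=n_hidden)             # random offsets b_j
      H_train = sigmoid(X_train @ W + b)        # hidden-unit outputs
      H_test = sigmoid(X_test @ W + b)
      c, *_ = np.linalg.lstsq(H_train, y_train, rcond=None)  # output weights
      return np.sqrt(np.mean((H_test @ c - y_test) ** 2))

  for d in (1, 2, 4, 8, 16):
      print(f"d = {d:2d}   test RMSE = {fit_and_test(d):.4f}")

Sweeping d with a fixed budget of hidden units proves nothing about rates, of course, but it is the shape of experiment the question is asking about.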

The reason this setup ("vary the problem size") is interesting is that we would clearly like to use neural nets for increasingly high-dimensional problems (e.g., learn the appearance of 32x32 cats, then 64x64 cats, then ...).




I think I know the problem you're referring to (despite not having the right kind of specialist training to actually work in this field). I think that the fact that random or quasirandom probes are able to explore these high-dimensional spaces fairly effectively is evidence that the problem is somehow not exponential and therefore tractable in some fashion. Does that sound relevant?


Yes, your observation is relevant.

One has to be careful to separate the functional-approximation problem (which is in the OP) from the statistical-model-selection problem (which is NOT in the OP).

The statistical-model-selection problem is easier in some ways (you don't have to approximate the function everywhere, just where you get data; if you didn't get data somewhere, you don't care what happens there).

It is harder in some ways (all that probabilistic folderol, plus your data is noisy).

There are results that give rates for the functional-approximation problem. By rates I mean how good the approximation is as a function of the number of hidden units. The best work I know of is by Andrew Barron, but that was in the mid/late 1990s. He's like a genius bulldozer, so his papers are tough going. You'll note that the Cybenko results, like the one in the OP, do not give rates. This is obviously a huge difference.
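
For what it's worth, the flavor of Barron's rate (as I remember it; check the 1993 paper for the exact statement and constants) is

  (squared L2 error of the best n-unit sigmoidal net)  <~  C_f^2 / n

  where  C_f = integral over w of |w| * |Fourier transform of f at w| dw

up to constants depending on the domain. So the rate in n is dimension-free, and whatever dimension dependence there is hides inside the constant C_f, which can itself grow with d for some function classes.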

There are also results in the statistical-model-selection problem, of course. With rates. That's Vladimir Vapnik's big contribution, later carried on by others. One of the main results is that a model class has an intrinsic complexity (VC dimension, or other measures) and you only need to have order-of that many random probes to get (close to) the best model in the class. No matter what the dimensionality of the space where the data lives.
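
Roughly, one standard form of that bound (constants and log factors vary from source to source) is: with probability at least 1 - delta over n random samples,

  true_err(f_hat)  <=  train_err(f_hat) + c * sqrt( (VCdim + log(1/delta)) / n )

uniformly over the model class, so on the order of VCdim / eps^2 samples suffice to get within eps of the best model in the class, independent of the ambient dimension of the data.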

This is precisely the point you're making above.

My comment is getting too long, but: notice that the statistical-model-selection work only speaks about how to get (close to) the best model in the model class. You then have to make sure your model class is full enough to get close to the optimal rule (which is chosen by Nature, or whatever).

So to make a full theory, you need both pieces: how to choose a big-enough model class (functional approx.) and how to select a good-enough member of the model class (statistical model selection).

Typically the total error has one term for each of the two effects, so there is a balancing act between making the model class big enough to approximate anything versus keeping it small enough that you can easily select a good model within it. If you look at it sideways, this looks like an optimization problem with a Lagrange multiplier penalizing model complexity. So it's quite elegant.
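
Here is a tiny numpy illustration of that balancing act (my own sketch, with a made-up target, not anything from the OP): nested model classes (polynomials of increasing degree) fit to noisy samples of a fixed rule chosen by "Nature". At low degree the approximation term dominates, at high degree the model-selection term dominates, and the total test error is smallest somewhere in between.

  # Nested model classes: polynomials of increasing degree fit to noisy data.
  import numpy as np

  rng = np.random.default_rng(0)

  def nature(x):
      return np.sin(2.0 * np.pi * x)              # the rule chosen by Nature

  x_train = rng.uniform(0.0, 1.0, 30)
  y_train = nature(x_train) + 0.3 * rng.normal(size=30)   # noisy training data
  x_test = np.linspace(0.0, 1.0, 500)
  y_test = nature(x_test)                          # noiseless target for test error

  for degree in (1, 3, 5, 9, 12):
      coeffs = np.polyfit(x_train, y_train, degree)        # best fit in this class
      test_rmse = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
      print(f"degree {degree:2d}   test RMSE = {test_rmse:.3f}")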



