In the beginning, there is this statement (assumption):
> when processing natural language, we can bound the maximum number of things we’ll need to store on the stack
Why is that? Is it because there is a limit N and all natural sentences ever are less than N words, and thus you also have a bound on the stack depth? But that's not strictly true, just highly unlikely, right? In principle, for every sentence of length N, I can easily add a word, and have length N+1. In a similar way, I can also always make the stack deeper.
This discussion reminds me in general of the question whether RNNs are Turing complete or not. There is a famous paper stating that they are (On the computational power of neural nets, Siegelmann & Sontag, 1992). However, that construction uses infinite-precision rational numbers for weights and activations. Once you have finite precision (standard floats), the network has finite memory. Then stack depths are bounded of course, and it is also not Turing complete anymore.
> Is it because there is a limit N and all natural sentences ever are less than N words, and thus you also have a bound on the stack depth?
It's true that you can't really find an N such that a natural language sentence with stack depth N is OK, but a sentence with stack depth N+1 is ungrammatical. But in practice, the stack depth of real sentences is small [1], and sentences that require deep stacks quickly become impossible for humans to understand.
For example,
Stack depth 1: The rat ate the cheese.
Stack depth 2: The rat [that the cat chased] ate the cheese.
Stack depth 3: The rat [that the cat [that the teacher liked] chased] ate the cheese.
Stack depth 4: The rat [that the cat [that the teacher [that the student hated] liked ] chased] ate the cheese.
English speakers usually can't handle depths 3 or higher.
This kind of phenomenon has caused a lot of controversy in linguistics. There does seem to be some sense in which the stack-depth-4 sentence is grammatical English, even though basically no one can understand it. But for the purposes of practical NLP systems, this phenomenon means that if you can model stack depth of, say, 10, then you'll be able to model almost all the naturally-occurring sentences.
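To make the depth numbers concrete, here's a tiny sketch (mine, not from anything above; the function name is made up) that just counts the deepest [...] nesting in the bracket-annotated examples; add one for the outermost clause and you get the stack depths listed.

```python
def max_embedding_depth(sentence: str) -> int:
    """Deepest level of [...] nesting in a bracket-annotated sentence."""
    depth = max_depth = 0
    for ch in sentence:
        if ch == "[":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "]":
            depth -= 1
    return max_depth

examples = [
    "The rat ate the cheese.",
    "The rat [that the cat chased] ate the cheese.",
    "The rat [that the cat [that the teacher liked] chased] ate the cheese.",
    "The rat [that the cat [that the teacher [that the student hated] liked] chased] ate the cheese.",
]

for s in examples:
    print(max_embedding_depth(s), s)   # prints 0, 1, 2, 3 levels of embedding
```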
But English speakers make games and songs out of this kind of nesting.
>> There's a fleck on the speck on the tail
>> on the frog on the bump on the branch
>> on the log in the hole in the bottom of the sea.
That is a perfectly grammatical sentence. The key to the song is building it up: it starts with "There's a hole in the bottom of the sea" and extends the sentence verse by verse. Indeed, having a grossly complicated sentence is the fun of the song, but that doesn't change the fact that it's grammatically correct (and actually understood by the listeners / singers).
The examples in GP deepen and pop the stack before the sentence is over. Your example only deepens the stack-- it's like tail recursion vs non-tail recursion. I think that's what makes it a special case of deeply nested grammatical structure that's easy to understand.
There are no weird sequences of words like "hated liked chased" that require backtracking to parse.
Exactly, "a fleck on the speck on the tail on the frog on the bump on the branch on the log in the hole in the bottom of the sea" only has stack depth 1 as long as you can do a kind of tail call optimization (aka left-corner parsing with an arc-eager transition, see https://www.aclweb.org/anthology/C92-1032.pdf)
> Once you have finite precision (standard floats), it has finite memory. Then stack depths are bounded of course, and also it is not Turing complete anymore.
My understanding is that Turing machines assume infinite memory -- could you help me understand how this differs? I'm guessing that I'm not understanding your point, but if "finite memory" is a disqualifier then Turing machines aren't Turing-complete.
Yes, for a Turing machine (TM), or a machine which is as powerful as a Turing machine, you need infinite memory. A Turing machine always has infinite memory (the tape is infinite). This is an important property of a TM. Once you have only a finite tape, it is strictly less powerful. That means that any model / machine with finite memory is not as powerful as a Turing machine, i.e. not Turing complete.
So, (standard) RNNs with finite precision (standard floats) have finite memory, and thus are not Turing complete.
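A back-of-the-envelope way to see the finite-precision point (the hidden size below is just a made-up example): a d-dimensional float32 state can only take finitely many distinct values, so the RNN is equivalent to a gigantic but finite state machine, and finite-state machines are not Turing complete.

```python
# Upper bound on distinct hidden states of an RNN with a d-dimensional float32
# state: each float32 is one of at most 2**32 bit patterns.
d = 512                     # hypothetical hidden size
states = (2 ** 32) ** d     # = 2**16384: astronomically large, but finite
print(states.bit_length())  # 16385 -- a huge but bounded state space
```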
This paper (On the computational power of neural nets) shows that RNNs with infinite precision are Turing complete. But in practice, we never model RNNs that way. There are other ways to add external (conceptually infinite) memory to an RNN, e.g. Neural Turing Machines (NTM) and the Differentiable Neural Computer (DNC).
I believe the insight is that, for practical purposes, you can restrict yourself to a bounded stack, under the assumption that with a large enough constant you will never run into an input deep enough to overrun your limit.
It is even more unreasonably effective, outperforming already unreasonably effective recursive neural networks on a reasonably spectacular number of tasks.
Actually, it's a homage to "The Unreasonable Effectiveness of Recurrent Neural Networks" which is a homage to "The Unreasonable Effectiveness of Mathematics in the Natural Sciences".
I'm surprised that genetic algorithms on resource-bounded general purpose virtual machines have not surpassed neural nets in their problem solving ability.
Is the computational advantage of neural nets on GPUs so dramatic?
It comes down to gradients. When you have useful gradients to work with, backprop on a GPU is vastly faster and more directed than GAs. One could uncharitably say that all of the impressive results from NNs in the last decade are the result of this ability to throw large amounts of data into highly parallel training, rather than new theory about the expressiveness and capabilities of NNs.
When you don't have or can't use a gradient, GAs become the go-to tool for optimization (if only because you don't have much in the way of other options).
GAs can be way, way, way more compute intensive as your population size increases.
That said, I use GAs and find them an incredibly useful tool for various optimization tasks - just generally not for learning the weights of NNs, where GPU + backprop reigns supreme (or at least works well enough and is much faster than a GA).
You might be interested in the "compact GA", which only requires keeping the percentage of 1s (or 0s) at each bit position.
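For the curious, a minimal sketch of the compact GA idea (the toy OneMax fitness and the parameters are my own choices): it stores only a per-bit probability of a 1, samples two candidates per step, and nudges each probability toward the winner's bits.

```python
import random

def compact_ga(fitness, n_bits, pop_size=100, steps=20000):
    """Compact GA: keep only the probability of a 1 at each bit position."""
    p = [0.5] * n_bits
    for _ in range(steps):
        a = [1 if random.random() < pi else 0 for pi in p]
        b = [1 if random.random() < pi else 0 for pi in p]
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        for i in range(n_bits):
            if winner[i] != loser[i]:
                # Shift the probability toward the winner's bit by 1/pop_size.
                step = 1.0 / pop_size
                p[i] += step if winner[i] == 1 else -step
                p[i] = min(1.0, max(0.0, p[i]))
        if all(pi in (0.0, 1.0) for pi in p):   # fully converged
            break
    return [round(pi) for pi in p]

# Toy problem: OneMax, i.e. maximize the number of 1s.
print(compact_ga(fitness=sum, n_bits=32))
```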
I suspect that with some more engineering and attention from people doing ML stuff, GA-style algorithms can be made just as time- and memory-efficient as gradient methods, while giving better results and being more widely applicable.
With a genetic algorithm you have to try a bunch of weight variations and see which one works best. With backpropagation you can calculate one derivative and find the locally best direction to move in a single step. It's hugely more efficient.
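A toy comparison of the two (my own example, a simple quadratic loss in numpy, not any real NN): one analytic gradient step moves straight downhill, while a GA-style step has to evaluate a whole population of random perturbations and keep the best one.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000
target = rng.normal(size=dim)
loss = lambda w: float(np.sum((w - target) ** 2))   # simple quadratic loss

w = np.zeros(dim)

# Gradient descent: one analytic derivative gives the best local direction.
grad = 2 * (w - target)
w_gd = w - 0.1 * grad

# GA-style step: evaluate many random perturbations, keep the best one.
population = [w + 0.1 * rng.normal(size=dim) for _ in range(50)]
w_ga = min(population, key=loss)

print(loss(w), loss(w_gd), loss(w_ga))
# One gradient step (one "evaluation") typically beats the best of 50 random
# perturbations here, because the gradient already encodes the direction.
```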
GA-style search isn't actually taking multiple samples to decide on a new point to move to - it's taking multiple samples to decide on a smaller region to focus on. This can be more efficient than backprop, depending on how "easy" it is to tell via sampling which subregion has better performance.
Traditional GAs have a practical problem for training large models: keeping a large population of model weights in memory is not feasible. If you had to keep 1000 variations of the GPT-3 model weights in memory for training, that's no good. Though people have ideas for how to address that as well (again, see the post I linked).
You can use genetic algorithms to train the weights of a neural network. This is called neuroevolution, and it's competitive on reinforcement learning tasks, because useful gradients are harder to come by there...
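A minimal neuroevolution sketch (my own toy example, not any particular published method): the weights of a fixed tiny network are evolved with mutation and selection only, no gradients, on XOR.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def forward(params, x):
    """Tiny 2-4-1 network: tanh hidden layer, sigmoid output."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

def fitness(params):
    pred = forward(params, X).ravel()
    return -float(np.mean((pred - y) ** 2))   # higher is better

def random_params():
    shapes = [(2, 4), (4,), (4, 1), (1,)]
    return [rng.normal(scale=0.5, size=s) for s in shapes]

def mutate(params, sigma=0.1):
    return [p + rng.normal(scale=sigma, size=p.shape) for p in params]

# Plain elitist evolution: keep the 10 best, refill the population with mutants.
population = [random_params() for _ in range(50)]
for generation in range(300):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(p) for p in parents for _ in range(4)]

best = max(population, key=fitness)
print(forward(best, X).ravel().round(2))   # should move toward [0, 1, 1, 0]
```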
I actually like this kind of title, so... `"Unreasonable repetition of 'unreasonable' and 'considered harmful' considered harmful." considered harmful.`
Some interesting discussion about RNNs and Turing completeness is also here (my own question, which also has an answer of my own, among others): https://stackoverflow.com/questions/2990277/how-useful-is-tu...