In the beginning, there is this statement (assumption):
> when processing natural language, we can bound the maximum number of things we’ll need to store on the stack
Why is that? Is it because there is a limit N and all natural sentences ever are less than N words, and thus you also have a bound on the stack depth? But that's not strictly true, just highly unlikely, right? In principle, for every sentence of length N, I can easily add a word, and have length N+1. In a similar way, I can also always make the stack deeper.
This discussion reminds me of the general question of whether RNNs are Turing complete or not. There is a famous paper stating that they are (On the computational power of neural nets, Siegelmann & Sontag, 1992). However, that result relies on infinite-precision rational numbers for weights and activations. Once you have finite precision (standard floats), the RNN has finite memory. Then stack depths are of course bounded, and it is not Turing complete anymore.
> Is it because there is a limit N and all natural sentences ever are less than N words, and thus you also have a bound on the stack depth?
It's true that you can't really find an N such that a natural language sentence with stack depth N is OK, but a sentence with stack depth N+1 is ungrammatical. But in practice, the stack depth of real sentences is small [1], and sentences that require deep stacks quickly become impossible for humans to understand.
For example,
Stack depth 1: The rat ate the cheese.
Stack depth 2: The rat [that the cat chased] ate the cheese.
Stack depth 3: The rat [that the cat [that the teacher liked] chased] ate the cheese.
Stack depth 4: The rat [that the cat [that the teacher [that the student hated] liked] chased] ate the cheese.
English speakers usually can't handle depths of 3 or higher.
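If it helps to make the bookkeeping concrete, here is a minimal sketch (my own illustration, using the bracket annotations above): each opening bracket is a clause that has to be kept open until its verb shows up, and the maximum number of simultaneously open clauses (plus the main clause) is the stack depth quoted for each sentence.

```python
def stack_depth(sentence: str) -> int:
    depth, max_depth = 1, 1          # the main clause itself occupies one slot
    for ch in sentence:
        if ch == "[":                # a relative clause opens: push
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "]":              # its verb has arrived: pop
            depth -= 1
    return max_depth

examples = [
    "The rat ate the cheese.",
    "The rat [that the cat chased] ate the cheese.",
    "The rat [that the cat [that the teacher liked] chased] ate the cheese.",
    "The rat [that the cat [that the teacher [that the student hated] liked] chased] ate the cheese.",
]
for s in examples:
    print(stack_depth(s), s)         # prints 1, 2, 3, 4
```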
This kind of phenomenon has caused a lot of controversy in linguistics. There does seem to be some sense in which the stack-depth-4 sentence is grammatical English, even though basically no one can understand it. But for the purposes of practical NLP systems, this phenomenon means that if you can model stack depth of, say, 10, then you'll be able to model almost all the naturally-occurring sentences.
But English speakers make games and songs out of this practice.
> There's a fleck on the speck on the tail
> on the frog on the bump on the branch
> on the log in the hole in the bottom of the sea.
That is a perfectly grammatical sentence. The key to the song is building it up: it starts with "There's a hole in the bottom of the sea" and builds the sentence up from there. Indeed, having a grossly complicated sentence is the fun of the song, but that doesn't change the fact that it's grammatically correct (and actually understood by the listeners / singers).
The examples in the GP deepen and pop the stack before the sentence is over. Your example only deepens the stack -- it's like tail recursion vs. non-tail recursion. I think that's what makes it a special case of deeply nested grammatical structure that's easy to understand.
There are no weird sequences of words like "hated liked chased" that require backtracking to parse.
Exactly, "a fleck on the speck on the tail on the frog on the bump on the branch on the log in the hole in the bottom of the sea" only has stack depth 1 as long as you can do a kind of tail call optimization (aka left-corner parsing with an arc-eager transition, see https://www.aclweb.org/anthology/C92-1032.pdf)
> Once you have finite precision (standard floats), the RNN has finite memory. Then stack depths are of course bounded, and it is not Turing complete anymore.
My understanding is that Turing machines assume infinite memory -- could you help me understand how this differs? I'm guessing that I'm not understanding your point, but if "finite memory" is a disqualifier then Turing machines aren't Turing-complete.
Yes, for a Turing machine (TM), or a machine that is as powerful as a Turing machine, you need infinite memory. A Turing machine always has infinite memory (the tape is infinite). This is an important property of a TM. Once you have only a finite tape, the machine is strictly less powerful. That means that any model / machine with finite memory is not as powerful as a Turing machine, i.e. it is not Turing complete.
So, (standard) RNNs with finite precision (standard floats) have finite memory, and thus are not Turing complete.
This paper (On the computational power of neural nets) shows that RNNs with infinite precision are Turing complete. But in practice, we never model RNNs that way. There are other ways to add external (conceptually infinite) memory to an RNN, e.g. Neural Turing Machines (NTM) and the Differentiable Neural Computer (DNC).
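To make the finite-precision point concrete, here is a small toy demo (my own sketch, not the construction details from the paper): the Siegelmann & Sontag trick stores an unbounded stack in the digits of a single rational-valued activation, and with ordinary floats the bottom of that stack silently falls off once the mantissa runs out of bits.

```python
from fractions import Fraction

def push(s, bit):
    # base-4 stack encoding: bit 0 -> digit 1, bit 1 -> digit 3
    return (s + 2 * bit + 1) / 4

def pop(s):
    bit = 0 if s < 0.5 else 1
    return bit, 4 * s - 2 * bit - 1

def roundtrip(bits, empty):
    """Push all bits onto the encoded stack, then pop them back off."""
    s = empty
    for b in bits:
        s = push(s, b)
    out = []
    for _ in bits:
        b, s = pop(s)
        out.append(b)
    return list(reversed(out))  # popped top-to-bottom, so reverse

bits = [1, 0] * 40  # a stack of depth 80

print(roundtrip(bits, Fraction(0)) == bits)  # True: exact rationals keep everything
print(roundtrip(bits, 0.0) == bits)          # False: a 64-bit float holds only ~26 entries
```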
I believe the insight is that for practical purposes, you can restrict yourself to a bounded stack, the assumption being that with a large enough constant you will never run into an input that overruns the limit.
Some interesting discussion about how useful Turing completeness is in practice can also be found here (my own question, which also has an answer of mine, among others): https://stackoverflow.com/questions/2990277/how-useful-is-tu...