In the beginning, there is this statement (assumption):
> when processing natural language, we can bound the maximum number of things we’ll need to store on the stack
Why is that? Is it because there is a limit N and all natural sentences ever are less than N words, and thus you also have a bound on the stack depth? But that's not strictly true, just highly unlikely, right? In principle, for every sentence of length N, I can easily add a word, and have length N+1. In a similar way, I can also always make the stack deeper.
This discussion reminds me in general of the question whether RNNs are Turing complete or not. There is a famous paper stating that they are (On the computational power of neural nets, Siegelmann & Sontag, 1992). However, that construction uses infinite-precision rational numbers for weights and activations. Once you have finite precision (standard floats), the network has finite memory. Then stack depths are bounded of course, and it is also not Turing complete anymore.
> Is it because there is a limit N and all natural sentences ever are less than N words, and thus you also have a bound on the stack depth?
It's true that you can't really find an N such that a natural language sentence with stack depth N is OK, but a sentence with stack depth N+1 is ungrammatical. But in practice, the stack depth of real sentences is small [1], and sentences that require deep stacks quickly become impossible for humans to understand.
For example,
Stack depth 1: The rat ate the cheese.
Stack depth 2: The rat [that the cat chased] ate the cheese.
Stack depth 3: The rat [that the cat [that the teacher liked] chased] ate the cheese.
Stack depth 4: The rat [that the cat [that the teacher [that the student hated] liked ] chased] ate the cheese.
English speakers usually can't handle depths 3 or higher.
This kind of phenomenon has caused a lot of controversy in linguistics. There does seem to be some sense in which the stack-depth-4 sentence is grammatical English, even though basically no one can understand it. But for the purposes of practical NLP systems, this phenomenon means that if you can model stack depth of, say, 10, then you'll be able to model almost all the naturally-occurring sentences.
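To make the depth numbers concrete, here's a tiny sketch (mine, not from anything above; the function name is made up) that just counts the deepest [...] nesting in the bracket-annotated examples; add one for the outermost clause and you get the stack depths listed.

```python
def max_embedding_depth(sentence: str) -> int:
    """Deepest level of [...] nesting in a bracket-annotated sentence."""
    depth = max_depth = 0
    for ch in sentence:
        if ch == "[":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "]":
            depth -= 1
    return max_depth

examples = [
    "The rat ate the cheese.",
    "The rat [that the cat chased] ate the cheese.",
    "The rat [that the cat [that the teacher liked] chased] ate the cheese.",
    "The rat [that the cat [that the teacher [that the student hated] liked] chased] ate the cheese.",
]

for s in examples:
    print(max_embedding_depth(s), s)   # prints 0, 1, 2, 3 levels of embedding
```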
But English speakers make games and songs out of this kind of nesting.
>> There's a fleck on the speck on the tail
>> on the frog on the bump on the branch
>> on the log in the hole in the bottom of the sea.
That is a perfectly grammatical sentence. The key to the song is building it up: it starts with "There's a hole in the bottom of the sea" and extends the sentence verse by verse. Indeed, having a grossly complicated sentence is the fun of the song, but that doesn't change the fact that it's grammatically correct (and actually understood by the listeners / singers).
The examples in GP deepen and pop the stack before the sentence is over. Your example only deepens the stack-- it's like tail recursion vs non-tail recursion. I think that's what makes it a special case of deeply nested grammatical structure that's easy to understand.
There are no weird sequences of words like "hated liked chased" that require backtracking to parse.
Exactly, "a fleck on the speck on the tail on the frog on the bump on the branch on the log in the hole in the bottom of the sea" only has stack depth 1 as long as you can do a kind of tail call optimization (aka left-corner parsing with an arc-eager transition, see https://www.aclweb.org/anthology/C92-1032.pdf)
> Once you have finite precision (standard floats), it has finite memory. Then stack depths are bounded of course, and also it is not Turing complete anymore.
My understanding is that Turing machines assume infinite memory -- could you help me understand how this differs? I'm guessing that I'm not understanding your point, but if "finite memory" is a disqualifier then Turing machines aren't Turing-complete.
Yes, for a Turing machine (TM), or a machine which is as powerful as a Turing machine, you need infinite memory. A Turing machine always has infinite memory (the tape is infinite). This is an important property of a TM. Once you have only a finite tape, it is strictly less powerful. That means that any model / machine with finite memory is not as powerful as a Turing machine, i.e. not Turing complete.
So, (standard) RNNs with finite precision (standard floats) have finite memory, and thus are not Turing complete.
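A back-of-the-envelope way to see the finite-precision point (the hidden size below is just a made-up example): a d-dimensional float32 state can only take finitely many distinct values, so the RNN is equivalent to a gigantic but finite state machine, and finite-state machines are not Turing complete.

```python
# Upper bound on distinct hidden states of an RNN with a d-dimensional float32
# state: each float32 is one of at most 2**32 bit patterns.
d = 512                     # hypothetical hidden size
states = (2 ** 32) ** d     # = 2**16384: astronomically large, but finite
print(states.bit_length())  # 16385 -- a huge but bounded state space
```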
This paper (On the computational power of neural nets) shows that RNNs with infinite precision are Turing complete. But in practice, we never model RNNs that way. There are other ways to add external (conceptually infinite) memory to an RNN, e.g. Neural Turing Machines (NTM) and the Differentiable Neural Computer (DNC).
I believe the insight is that, for practical purposes, you can restrict yourself to a bounded stack, under the assumption that with a large enough constant you will never run into an input deep enough to overrun your limit.
It is even more unreasonably effective, outperforming already unreasonably effective recursive neural networks on a reasonably spectacular number of tasks.
Actually, it's a homage to "The Unreasonable Effectiveness of Recurrent Neural Networks" which is a homage to "The Unreasonable Effectiveness of Mathematics in the Natural Sciences".
I'm surprised that genetic algorithms on resource-bounded general purpose virtual machines have not surpassed neural nets in their problem solving ability.
Is the computational advantage of neural nets on GPUs so dramatic?
It comes down to gradients. When you have useful gradients to work with, backprop on a GPU is vastly faster and more directed than GAs. One could uncharitably say that all of the impressive results from NNs in the last decade are the result of this ability to throw large amounts of data into highly parallel training, rather than new theory about the expressiveness and capabilities of NNs.
When you don't have or can't use a gradient, GAs become the go-to tool for optimization (if only because you don't have much in the way of other options).
GAs can be way, way, way more compute intensive as your population size increases.
That said, I use GAs and find them an incredibly useful tool for various optimization tasks - just generally not for learning the weights of NNs, where GPU + backprop reigns supreme (or at least works well enough and is much faster than a GA).
You might be interested in the "compact GA", which only requires keeping the percentage of 1s (or 0s) at each bit position.
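For the curious, a minimal sketch of the compact GA idea (the toy OneMax fitness and the parameters are my own choices): it stores only a per-bit probability of a 1, samples two candidates per step, and nudges each probability toward the winner's bits.

```python
import random

def compact_ga(fitness, n_bits, pop_size=100, steps=20000):
    """Compact GA: keep only the probability of a 1 at each bit position."""
    p = [0.5] * n_bits
    for _ in range(steps):
        a = [1 if random.random() < pi else 0 for pi in p]
        b = [1 if random.random() < pi else 0 for pi in p]
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        for i in range(n_bits):
            if winner[i] != loser[i]:
                # Shift the probability toward the winner's bit by 1/pop_size.
                step = 1.0 / pop_size
                p[i] += step if winner[i] == 1 else -step
                p[i] = min(1.0, max(0.0, p[i]))
        if all(pi in (0.0, 1.0) for pi in p):   # fully converged
            break
    return [round(pi) for pi in p]

# Toy problem: OneMax, i.e. maximize the number of 1s.
print(compact_ga(fitness=sum, n_bits=32))
```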
I suspect that with some more engineering and attention from people doing ML stuff, GA-style algorithms can be made just as time- and memory-efficient as gradient methods, while giving better results and being more widely applicable.
With a genetic algorithm you have to try a bunch of weight variations and see which one works best. With backpropagation you can calculate one derivative and find the locally best direction to move in a single step. It's hugely more efficient.
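A toy comparison of the two (my own example, a simple quadratic loss in numpy, not any real NN): one analytic gradient step moves straight downhill, while a GA-style step has to evaluate a whole population of random perturbations and keep the best one.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000
target = rng.normal(size=dim)
loss = lambda w: float(np.sum((w - target) ** 2))   # simple quadratic loss

w = np.zeros(dim)

# Gradient descent: one analytic derivative gives the best local direction.
grad = 2 * (w - target)
w_gd = w - 0.1 * grad

# GA-style step: evaluate many random perturbations, keep the best one.
population = [w + 0.1 * rng.normal(size=dim) for _ in range(50)]
w_ga = min(population, key=loss)

print(loss(w), loss(w_gd), loss(w_ga))
# One gradient step (one "evaluation") typically beats the best of 50 random
# perturbations here, because the gradient already encodes the direction.
```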
GA-style search isn't actually taking multiple samples to decide on a new point to move to - it's taking multiple samples to decide on a smaller region to focus on. This can be more efficient than backprop, depending on how "easy" it is to tell via sampling which subregion has better performance.
Traditional GAs have a practical problem for training large models: keeping a large population of model weights in memory is not feasible. If you had to keep 1000 variations of the GPT-3 model weights in memory for training, that's no good. Though people have ideas for how to address that as well (again, see the post I linked).
You can use genetic algorithms to train the weights of a neural network. This is called neuroevolution, and it's competitive on reinforcement learning tasks, because useful gradients are harder to come by there...
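A minimal neuroevolution sketch (my own toy example, not any particular published method): the weights of a fixed tiny network are evolved with mutation and selection only, no gradients, on XOR.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def forward(params, x):
    """Tiny 2-4-1 network: tanh hidden layer, sigmoid output."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

def fitness(params):
    pred = forward(params, X).ravel()
    return -float(np.mean((pred - y) ** 2))   # higher is better

def random_params():
    shapes = [(2, 4), (4,), (4, 1), (1,)]
    return [rng.normal(scale=0.5, size=s) for s in shapes]

def mutate(params, sigma=0.1):
    return [p + rng.normal(scale=sigma, size=p.shape) for p in params]

# Plain elitist evolution: keep the 10 best, refill the population with mutants.
population = [random_params() for _ in range(50)]
for generation in range(300):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(p) for p in parents for _ in range(4)]

best = max(population, key=fitness)
print(forward(best, X).ravel().round(2))   # should move toward [0, 1, 1, 0]
```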
I actually like this kind of title, so... `"Unreasonable repetition of 'unreasonable' and 'considered harmful' considered harmful." considered harmful.`
Some interesting discussion about RNNs and Turing completeness is also here (my own question, which also has an answer of my own, among others): https://stackoverflow.com/questions/2990277/how-useful-is-tu...