It's worth noting that, apparently (as I learned recently), RNNs are going slightly out of fashion because they are hard to parallelize and have trouble retaining important information over longer distances. Transformers are proposed as a possible solution: very roughly speaking, they use attention mechanisms instead of recurrent memory and can run in parallel.
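The core idea is compact enough to sketch. Here is a toy single-head scaled dot-product self-attention in NumPy (my own illustration; the learned query/key/value projections a real transformer would use are omitted), just to show that every position attends to every other position in one matrix multiply rather than step by step:

import numpy as np

def self_attention(X):
    # X: (seq_len, d) matrix of token vectors.
    # Real transformers first project X into learned queries, keys and values;
    # here Q = K = V = X to keep the sketch minimal.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                             # pairwise similarities, (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                                        # each output mixes the whole sequence at once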
It depends on the problem domain. Transformers are useful for NLP (language modeling, machine translation), but RNNs (and CNNs) are still being used in speech-to-text. When the input/output relationships are monotonic, transformers are probably too general.
Very nicely written post. I particularly like how you attached a link to your codebase on repl.it so anyone who is interested can tinker with the code.
One thing I have been wondering for some time is whether a vanilla RNN can learn negations (i.e. 'not good' == 'bad') and valence shifts (e.g. modifier words like 'very', which carry no sentiment connotation themselves but may amplify or dampen the sentiment of the words they modify; a negation like 'not' can be seen as a special-case valence shifter that inverts the sentiment of the following word).
My suspicion is that vanilla RNNs are not capable of modelling negations and valence shifters, since they infer the sentiment of a sentence by 'adding up' the sentiment connotations of its constituent words; negations and valence shifts, however, work more like multiplications than additions.
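A quick back-of-the-envelope check of that intuition (my own argument, for a purely additive scorer rather than for the actual RNN): if each word contributed a fixed score s(w) and the prediction were just the sign of the sum, then already the four phrases 'good', 'bad', 'not good', 'not bad' cannot all be classified correctly:

s(good) > 0 and s(bad) < 0
s(not) + s(good) < 0  =>  s(not) < -s(good) < 0
s(not) + s(bad) > 0   =>  s(not) > -s(bad) > 0

so s(not) would have to be both negative and positive. Handling negation therefore requires some non-additive interaction between tokens, which is exactly what is in question for the vanilla RNN.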
I see you already have such examples in your dataset so I thought I'd do some experiments. I simplified your original dataset to the following:
train_data = {
    'good': True,
    'bad': False,
    'not good': False,
    'not bad': True,
    'very good': True,
    'very bad': False,
    'not very good': False,
    'not very bad': True
}

test_data = {
    'very not bad': True,
    'very not good': False
}
While the test cases do not reflect how people actually speak, the hope is that the model can apply what it has learned to infer their sentiment. In my runs, however, training failed to converge with the default parameter settings (hidden_size=64).
It would be interesting to see how other architectures (e.g. an LSTM, or a Transformer) fare with negations and valence shifters.
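For anyone who wants to try, here is a minimal LSTM sketch in PyTorch that can be pointed at the little dataset above (this is my own illustration, not the author's repl.it code; the tokenisation, class names and hyperparameters are guesses):

import torch
import torch.nn as nn

# Build a word-level vocabulary from the tiny dataset above.
vocab = sorted({w for phrase in list(train_data) + list(test_data) for w in phrase.split()})
stoi = {w: i for i, w in enumerate(vocab)}

def encode(phrase):
    # Shape (1, seq_len): a batch of one token-id sequence.
    return torch.tensor([[stoi[w] for w in phrase.split()]])

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)   # one logit: positive vs. negative

    def forward(self, token_ids):
        x = self.embed(token_ids)             # (1, seq_len, embed_dim)
        _, (h, _) = self.lstm(x)              # h: (1, 1, hidden_dim), final hidden state
        return self.out(h[-1]).view(-1)       # (1,) logit

model = LSTMClassifier(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(500):
    for phrase, label in train_data.items():
        opt.zero_grad()
        loss = loss_fn(model(encode(phrase)), torch.tensor([float(label)]))
        loss.backward()
        opt.step()

for phrase, label in test_data.items():
    prob = torch.sigmoid(model(encode(phrase))).item()
    print(phrase, '->', prob > 0.5, '(expected', label, ')')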
P.S.: When calculating softmax, it is better to use a built-in implementation, or at least apply the log-sum-exp (max-subtraction) trick, to prevent numerical overflow and underflow.
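For reference, the stable version is only a couple of lines; a sketch in NumPy (shifting by the maximum changes nothing mathematically, it just keeps the exponentials in range):

import numpy as np

def stable_softmax(xs):
    # Subtracting the max before exponentiating avoids overflow;
    # the result is identical to exp(xs) / sum(exp(xs)).
    exps = np.exp(xs - np.max(xs))
    return exps / np.sum(exps)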
Nice! I like that the author wrote the code by hand rather than leaning on some framework. It makes it a lot easier to connect the math to the code. :)
As a meta-comment on these "Introduction to _____ neural network" articles (not just this one), I wish people would spend more time talking about when their neural net isn't the right tool for the job. SVMs, kNN, even basic regression techniques aren't any less effective than they were 20 years ago. They're easier to interpret and debug, require far fewer parameters, and, with a few tricks here and there, are potentially faster at both training and evaluation time.
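As one concrete example (my own sketch with scikit-learn, not something from the article; the toy texts and labels are made up), a bag-of-words logistic regression for this kind of sentiment task is only a few lines, and every learned weight is directly inspectable:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data in the same spirit as the article's examples.
texts = ['this is good', 'this is bad', 'i am happy', 'i am sad',
         'this is very good', 'this is not good']
labels = [1, 0, 1, 0, 1, 0]

# Bag-of-words features (unigrams + bigrams, so 'not good' is its own feature)
# feeding a plain linear classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(['this is very bad', 'i am happy today']))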
This kind of article is absolutely the thing everyone new to deep learning/neural networks should read. I wish there were one for each type of algorithm.
Why do people insist on mentioning the bias terms in expository essays? It's a detail that clutters the equations. Why not keep the transformations linear and then, at the end, make a note that you also need to shift using a bias term?
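For what it's worth, the bias really can be treated as bookkeeping (the standard augmentation trick, not something from the article): writing the usual update as

h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    = tanh([W_xh  W_hh  b_h] [x_t; h_{t-1}; 1])

shows that folding b_h into an extended weight matrix acting on the input padded with a constant 1 gives exactly the same computation, with no explicit bias term in sight.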
I feel I should point out that those two things are not mutually exclusive. RNNs are, after all, a mechanism for learning conditional probabilities.
I think the confusion comes from Google itself, who used the term "Statistical Machine Translation" (SMT) to refer to "Rule-based SMT". Both methods are statistical.
SMT as a term of art in the translation field means rule-based SMT; it's not a Google particularity. I see the same usage in both industry and academia.
I have to say that while I understand the problems with recurrent nets (which I've used many times), I haven't yet grokked the alternatives. Here are some decent-looking search results for you as starting points. A warning: these can be longer, heavier reads, probably not for beginners.
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c7... (there's some sensationalism here to be fair)
https://mchromiak.github.io/articles/2017/Sep/12/Transformer...
https://www.analyticsvidhya.com/blog/2019/06/understanding-t...
https://www.tensorflow.org/beta/tutorials/text/transformer
That being said, I think that understanding RNNs is very beneficial conceptually, and nowadays there are relatively easy-to-use implementations that should be pretty good for many use cases.