It's worth noting that, apparently (as I learned recently), RNNs are going slightly out of fashion because they are hard to parallelize and have trouble retaining important information over longer distances. Transformers are proposed as a possible solution: very roughly speaking, they use attention mechanisms instead of recurrent memory and can run in parallel.
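The core idea is compact enough to sketch. Here is a toy single-head scaled dot-product self-attention in NumPy (my own illustration; the learned query/key/value projections a real transformer would use are omitted), just to show that every position attends to every other position in one matrix multiply rather than step by step:

import numpy as np

def self_attention(X):
    # X: (seq_len, d) matrix of token vectors.
    # Real transformers first project X into learned queries, keys and values;
    # here Q = K = V = X to keep the sketch minimal.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                             # pairwise similarities, (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                                        # each output mixes the whole sequence at once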
It depends on the problem domain. Transformers are useful for NLP (language modeling, machine translation), but RNNs (and CNNs) are still being used in speech-to-text. When the input/output relationships are monotonic, transformers are probably too general.
Very nicely written post. I particularly like how you attached a link to your codebase on repl.it so anyone who is interested can tinker with the code.
One thing I have been wondering for some time is whether a vanilla RNN can learn negations (i.e. 'not good' == 'bad') and valence shifts (e.g. modifier words like 'very', which carry no sentiment connotation themselves but may amplify or dampen the sentiment of the words they modify; a negation like 'not' can be seen as a special-case valence shifter that inverts the sentiment of the following word).
My suspicion is that vanilla RNNs are not capable of modelling negations and valence shifters, since they infer the sentiment of a sentence by 'adding up' the sentiment connotations of its constituent words; negations and valence shifts, however, work more like multiplications than additions.
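A quick back-of-the-envelope check of that intuition (my own argument, for a purely additive scorer rather than for the actual RNN): if each word contributed a fixed score s(w) and the prediction were just the sign of the sum, then already the four phrases 'good', 'bad', 'not good', 'not bad' cannot all be classified correctly:

s(good) > 0 and s(bad) < 0
s(not) + s(good) < 0  =>  s(not) < -s(good) < 0
s(not) + s(bad) > 0   =>  s(not) > -s(bad) > 0

so s(not) would have to be both negative and positive. Handling negation therefore requires some non-additive interaction between tokens, which is exactly what is in question for the vanilla RNN.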
I see you already have such examples in your dataset so I thought I'd do some experiments. I simplified your original dataset to the following:
train_data = {
    'good': True,
    'bad': False,
    'not good': False,
    'not bad': True,
    'very good': True,
    'very bad': False,
    'not very good': False,
    'not very bad': True
}

test_data = {
    'very not bad': True,
    'very not good': False
}
While the test cases do not reflect how people actually speak, the hope is that the model can apply what it has learned to infer their sentiment. In my runs, however, training failed to converge with the default parameter settings (hidden_size=64).
It would be interesting to see how other architectures (e.g. an LSTM, or a Transformer) fare with negations and valence shifters.
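For anyone who wants to try, here is a minimal LSTM sketch in PyTorch that can be pointed at the little dataset above (this is my own illustration, not the author's repl.it code; the tokenisation, class names and hyperparameters are guesses):

import torch
import torch.nn as nn

# Build a word-level vocabulary from the tiny dataset above.
vocab = sorted({w for phrase in list(train_data) + list(test_data) for w in phrase.split()})
stoi = {w: i for i, w in enumerate(vocab)}

def encode(phrase):
    # Shape (1, seq_len): a batch of one token-id sequence.
    return torch.tensor([[stoi[w] for w in phrase.split()]])

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)   # one logit: positive vs. negative

    def forward(self, token_ids):
        x = self.embed(token_ids)             # (1, seq_len, embed_dim)
        _, (h, _) = self.lstm(x)              # h: (1, 1, hidden_dim), final hidden state
        return self.out(h[-1]).view(-1)       # (1,) logit

model = LSTMClassifier(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(500):
    for phrase, label in train_data.items():
        opt.zero_grad()
        loss = loss_fn(model(encode(phrase)), torch.tensor([float(label)]))
        loss.backward()
        opt.step()

for phrase, label in test_data.items():
    prob = torch.sigmoid(model(encode(phrase))).item()
    print(phrase, '->', prob > 0.5, '(expected', label, ')')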
P.S.: When calculating softmax, it is better to use a built-in implementation, or at least apply the log-sum-exp (max-subtraction) trick, to prevent numerical overflow and underflow.
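For reference, the stable version is only a couple of lines; a sketch in NumPy (shifting by the maximum changes nothing mathematically, it just keeps the exponentials in range):

import numpy as np

def stable_softmax(xs):
    # Subtracting the max before exponentiating avoids overflow;
    # the result is identical to exp(xs) / sum(exp(xs)).
    exps = np.exp(xs - np.max(xs))
    return exps / np.sum(exps)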
Nice! I like that the author wrote the code by hand rather than leaning on some framework. It makes it a lot easier to connect the math to the code. :)
As a meta-comment on these "Introduction to _____ neural network" articles (not just this one), I wish people would spend more time talking about when their neural net isn't the right tool for the job. SVMs, kNN, even basic regression techniques aren't any less effective than they were 20 years ago. They're easier to interpret and debug, require far fewer parameters, and, with a few tricks here and there, are potentially faster at both training and evaluation time.
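As one concrete example (my own sketch with scikit-learn, not something from the article; the toy texts and labels are made up), a bag-of-words logistic regression for this kind of sentiment task is only a few lines, and every learned weight is directly inspectable:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data in the same spirit as the article's examples.
texts = ['this is good', 'this is bad', 'i am happy', 'i am sad',
         'this is very good', 'this is not good']
labels = [1, 0, 1, 0, 1, 0]

# Bag-of-words features (unigrams + bigrams, so 'not good' is its own feature)
# feeding a plain linear classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(['this is very bad', 'i am happy today']))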
This kind of article is absolutely the thing everyone new to deep learning/neural networks should read. I wish there were one for each type of algorithm.
Why do people insist on mentioning the bias terms in expository essays? It's a detail that clutters the equations. Why not keep the transformations linear and then, at the end, make a note that you also need to shift using a bias term?
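For what it's worth, the bias really can be treated as bookkeeping (the standard augmentation trick, not something from the article): writing the usual update as

h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    = tanh([W_xh  W_hh  b_h] [x_t; h_{t-1}; 1])

shows that folding b_h into an extended weight matrix acting on the input padded with a constant 1 gives exactly the same computation, with no explicit bias term in sight.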
I feel I should point out that those two things are not mutually exclusive. RNNs are, after all, a mechanism for learning conditional probabilities.
I think the confusion comes from Google itself, who used the term "Statistical Machine Translation" (SMT) to refer to "Rule-based SMT". Both methods are statistical.
SMT as a term of art in the translation field means rule-based SMT; it's not a Google particularity. I see the same usage in both industry and academia.
I have to say that while I understand the problems with recurrent nets (which I've used many times), I haven't yet grokked the alternatives. Here are some decent-looking search results for you as starting points. A warning: these can be longer, heavier reads, probably not for beginners.
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c7... (there's some sensationalism here to be fair)
https://mchromiak.github.io/articles/2017/Sep/12/Transformer...
https://www.analyticsvidhya.com/blog/2019/06/understanding-t...
https://www.tensorflow.org/beta/tutorials/text/transformer
That being said, I think that understanding RNNs is very beneficial conceptually, and nowadays there are relatively easy-to-use implementations that should be pretty good for many use cases.