Really we are building on the shoulders of giants (calculus, linear algebra, statistics), but it seems like the modern use of recurrent neural networks crystallized in the 1980s with the publication of Parallel Distributed Processing by David Rumelhart, James L. McClelland, and the PDP Research Group (which included Geoffrey Hinton). It discussed backpropagation and recurrent neural networks, and even provided a handbook with code samples.
Jeffrey Elman (with others) wrote a successor to the PDP books, Rethinking Innateness: A Connectionist Perspective on Development (1997).
His paper Finding Structure in Time (1990) adapted backpropagation to take time into account, an approach known as backpropagation through time (BPTT):
https://crl.ucsd.edu/~elman/Papers/fsit.pdf
https://en.wikipedia.org/wiki/Jeffrey_Elman
https://web.stanford.edu/group/pdplab/pdphandbook/handbookch...
>Here we briefly discuss three of the findings from Elman (1990). Elman's work was highly significant to our understanding of how languages are acquired and also, once acquired, how sentences are comprehended. Sentences in natural languages are composed of sequences of words that are organized in phrases and hierarchical structures. The Elman network provides an important hypothesis for how neural networks - and, by analogy, the human brain - might be doing the learning and processing of such structures.
>The concept ‘word’ is actually a complicated one, presenting considerable difficulty to anyone who feels they must decide what is a word and what is not. Consider these examples: ‘linedrive’, ‘flagpole’, ‘carport’, ‘gonna’, ‘wanna’, ‘hafta’, ‘isn’t’ and ‘didn’t’ (often pronounced “dint”). How many words are involved in each case? If more than one word, where are the word boundaries? Life might be easier if we did not have to decide where the boundaries between words actually lie. Yet, we have intuitions that there are points in the stream of speech sounds that correspond to places where something ends and something else begins. One such place might be between ‘fifteen’ and ‘men’ in a sentence like ‘Fifteen men sat down at a long table’, although there is unlikely to be a clear boundary between these words in running speech.
>Elman’s approach to these issues, as previously mentioned, was to break utterances down into a sequence of elements, and present them to an SRN. In his letter-in-word simulation, he actually used a stream of sentences generated from a vocabulary of 15 words. The words were converted into a stream of elements corresponding to the letters that spelled each of the words, with no spaces. Thus, the network was trained on an unbroken stream of letters. After the network had looped repeatedly through a stream of about 5,000 elements, he tested its predictions for the first 50 or so elements of the training sequence.
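For the curious, here is a minimal sketch of that letter-in-word idea: an Elman-style simple recurrent network trained to predict the next letter in an unbroken character stream. This is not Elman's original setup; the toy vocabulary, network sizes, and training schedule are my own guesses, it uses PyTorch's nn.RNN (which implements the same tanh recurrence), and it trains with truncated BPTT over a short window, whereas Elman's original training only backpropagated one step through explicit context units.

```python
# Rough reconstruction of Elman's letter-in-word simulation, NOT his original code.
# Assumptions: toy vocabulary, hidden size, window length, and optimizer are invented.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

random.seed(0)
torch.manual_seed(0)

# Stand-in vocabulary (Elman used 15 words); "sentences" are concatenated with no spaces.
words = ["many", "years", "ago", "a", "boy", "and", "girl", "lived", "by", "the", "sea"]
stream = "".join(random.choice(words) for _ in range(1500))  # roughly 5,000 letters

chars = sorted(set(stream))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in stream])

class ElmanSRN(nn.Module):
    def __init__(self, vocab, embed=16, hidden=50):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.rnn = nn.RNN(embed, hidden, batch_first=True)  # Elman/tanh recurrence
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, h=None):
        y, h = self.rnn(self.embed(x), h)
        return self.out(y), h

model = ElmanSRN(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

window = 32  # gradients are backpropagated through this many unrolled steps
for step in range(2000):
    i = random.randrange(len(data) - window - 1)
    x = data[i:i + window].unsqueeze(0)          # current letters
    t = data[i + 1:i + window + 1].unsqueeze(0)  # next letters (targets)
    logits, _ = model(x)
    loss = F.cross_entropy(logits.view(-1, len(chars)), t.view(-1))
    opt.zero_grad()
    loss.backward()  # truncated backpropagation through time over the window
    opt.step()

# As in Elman (1990), look at per-letter prediction error on the start of the stream:
with torch.no_grad():
    x = data[:51].unsqueeze(0)
    logits, _ = model(x[:, :-1])
    per_letter = F.cross_entropy(logits.squeeze(0), x[0, 1:], reduction="none")
    for c, e in zip(stream[1:51], per_letter.tolist()):
        print(c, round(e, 2))
```

If you print the per-letter error like this, you should tend to see the same qualitative pattern Elman reported: error spikes at the start of a word and falls as the word unfolds, which is how the network "discovers" word boundaries without ever seeing spaces.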
Schmidhuber (with Hochreiter) developed the LSTM, LeCun developed CNNs, the ideas were refined as processing capabilities grew, and Hinton's group revived these connectionist ideas, leading up to the ImageNet breakthrough in 2012.