Isn’t backprop just reverse-mode AD, which has been around since the 1960s? I get the impression that a lot of physicists and scientific computing researchers have been rolling their eyes at various ML “breakthroughs” over the years.
My point was not about backprop itself, but about the time of publication.
However, "reverse-mode" AD is not as simple as just caching results.
> BP’s continuous form was derived in the early 1960s (Kelley, 1960; Bryson, 1961; Bryson and Ho, 1969). Dreyfus (1962) published the elegant derivation of BP based on the chain rule only.
> BP’s modern efficient version for discrete sparse networks (including FORTRAN code) was published by Linnainmaa (1970). Here the complexity of computing the derivatives of the output error with respect to each weight is proportional to the number of weights. That’s the method still used today.
See [1] for more details.
I do not know the exact implementation details of Linnainmaa's BP, but there are all sorts of refinements to plain reverse-mode autograd, such as formulating everything as vector-Jacobian products, which avoid ever materializing full Jacobians and save an enormous amount of memory.
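As a rough illustration (the layer sizes, tanh layers, and squared-error loss here are arbitrary choices of mine, not anything from Linnainmaa or a specific framework), reverse mode pushes a single adjoint vector back through a chain of layers with vector-Jacobian products, so the per-layer Jacobians never need to be built:

```python
import numpy as np

# Sketch: gradient of a scalar loss through a chain of layers.
# Naive chain rule would multiply full n x n Jacobians; reverse mode
# instead propagates one adjoint vector via vector-Jacobian products.

rng = np.random.default_rng(0)
n, depth = 1_000, 10
Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]

def forward(x):
    acts = [x]
    for W in Ws:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

x = rng.standard_normal(n)
acts = forward(x)
v = 2 * acts[-1]                      # adjoint of loss = sum(y**2) w.r.t. the output

# Reverse sweep: the local Jacobian of tanh(W @ a_in) is diag(1 - a_out**2) @ W,
# but we only ever compute v @ J, i.e. W.T @ ((1 - a_out**2) * v).
for W, a_out in zip(reversed(Ws), reversed(acts[1:])):
    v = W.T @ ((1.0 - a_out ** 2) * v)

grad_wrt_input = v
print(grad_wrt_input.shape)           # (1000,) -- no n x n Jacobian was ever formed
```

Each backward step is a matrix-vector product rather than a matrix-matrix product, which is also where the "proportional to the number of weights" complexity in the quote above comes from.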
Nocedal's "Numerical Optimization" covers a great deal on how to implement autograd, and I can easily recommend it to anyone interested.