Isn’t backprop just reverse-mode AD, which has been around since the 1960s? I get the impression that a lot of physicists and scientific computing researchers have been rolling their eyes at various ML “breakthroughs” over the years.
My point was not about backprop itself, but about the time of publication.
However, "reverse-mode" AD is not as simple as just caching results.
> BP’s continuous form was derived in the early 1960s (Kelley, 1960; Bryson, 1961; Bryson and Ho, 1969). Dreyfus (1962) published the elegant derivation of BP based on the chain rule only.
> BP’s modern efficient version for discrete sparse networks (including FORTRAN code) was published by Linnainmaa (1970). Here the complexity of computing the derivatives of the output error with respect to each weight is proportional to the number of weights. That’s the method still used today.
See [1] for more details.
I do not know the exact implementation details of Linnainmaa's BP, but there are all sorts of refinements to plain reverse-mode autograd, such as formulating everything as vector-Jacobian products, which avoid ever materializing full Jacobians and save an enormous amount of memory.
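As a rough illustration (the layer sizes, tanh layers, and squared-error loss here are arbitrary choices of mine, not anything from Linnainmaa or a specific framework), reverse mode pushes a single adjoint vector back through a chain of layers with vector-Jacobian products, so the per-layer Jacobians never need to be built:

```python
import numpy as np

# Sketch: gradient of a scalar loss through a chain of layers.
# Naive chain rule would multiply full n x n Jacobians; reverse mode
# instead propagates one adjoint vector via vector-Jacobian products.

rng = np.random.default_rng(0)
n, depth = 1_000, 10
Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]

def forward(x):
    acts = [x]
    for W in Ws:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

x = rng.standard_normal(n)
acts = forward(x)
v = 2 * acts[-1]                      # adjoint of loss = sum(y**2) w.r.t. the output

# Reverse sweep: the local Jacobian of tanh(W @ a_in) is diag(1 - a_out**2) @ W,
# but we only ever compute v @ J, i.e. W.T @ ((1 - a_out**2) * v).
for W, a_out in zip(reversed(Ws), reversed(acts[1:])):
    v = W.T @ ((1.0 - a_out ** 2) * v)

grad_wrt_input = v
print(grad_wrt_input.shape)           # (1000,) -- no n x n Jacobian was ever formed
```

Each backward step is a matrix-vector product rather than a matrix-matrix product, which is also where the "proportional to the number of weights" complexity in the quote above comes from.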
Nocedal's "Numerical Optimization" covers a great deal on how to implement autograd, and I can easily recommend it to anyone interested.