It's a bit beyond me too, but here's my take from skimming: It's already known that you can do differential calculus in some pretty abstract spaces (not just functions from R^n to R^m). Automatic differentiation is a way of automatically keeping differentiability and computing derivatives when sticking other differentiable stuff together. This puts those ideas together, so you can automatically differentiate some more interesting things stuck together in some interesting ways related to programming. That was what I got from my rusty semi-layperson's skim, at least.
Karpathy's explanation in CS231n is great: for any computational unit that is a function, you can calculate a derivative. So propagating the derivative backward through the function, just aplly the chain rule. This is much with simpler functions, so understand your algorithm down to the smallest computable units in the graph.
Does this only work for mathematical programs?
I don't understand how this would work for more general computations, involving loops, writes to files, etc.
This paper isn't really a great introduction into the concept of AD. That said, one way to visualize it is to see how a basic AD implementation works. There are generally two different varieties, operator overloading and source transformation. Let's focus on the first.
Say we have a function:
double myfun(std::vector <double> x) {
auto y = double(0.);
for(int i=0;i<=5;i++)
y += x[0]*x[1];
return y;
}
We may want to find the gradient of myfun, or simply the derivative of myfun with respect to x[0] and x[1]. What we're going to do is reply double with a special AD type, which requires us to code things like:
template <typename Real>
Real myfun(std::vector <Real> x) {
auto y = Real(0.);
for(int i=0;i<=5;i++)
y += x[0]*x[1];
return y;
}
Now, instead of using a double, we'll set Real to be our AD type. In C++, this'll be something like a class where all of the AD action occurs. What's in this class depends on the particular AD algorithm, which depends on the what derivatives we want and what kind of recomputations of the derivative we want to make. In general, though, it consists of some kind of tree. At its most basic level, the nodes of this tree consist of the types variable, constant, unary function, or binary function and they contain information that relates to the chain rule. When we want to compute a derivative, we traverse the tree in a very particular order to compute and accumulate the derivative information. How this is done can vary as well, which has implications for the efficiency of the algorithm. Most implementations use a tape that predefines the order of the tree traversal. It's also possible to do a topological sort to solve for a tree traversal. Anyway, traversing a tree can be expensive for a really, really large computational tree, so the goal is to only touch every necessary node once.
Returning to your original question, finding the derivative of a program involves building up this tree of computation. Control structures don't really matter in this context because we'll just keep adding to the tree regardless of loops or conditionals. This, by the way, can be really, really expensive since the computational tree can be huge for a long running loop. There's some game in determining how and when to collapse the tree in certain ways. Source code transformation tools can just leave the original, or really modified, control structures in place, which alleviates a lot of this expense, which is partially why they're faster than operator overloading.
Thanks for your explanation. So what this really does is enable one to train code using machine learning techniques such as gradient descent, traversing a code-space instead of traversing a space of numbers?
Well, from my perspective, there are three different components to machine learning techniques like multilayer perceptrons. Basically,
1. What's the model?
2. How do we define the error between the model and the data?
3. What algorithm do we use to fit the model to the data?
In the case of a multilayer perceptron with a single hidden layer, the model is
m(alpha,W,b,x) = alpha'*sigmoidal(Wx + b)
where
alpha : nhidden x 1
W : nhidden x ninput
b : nhidden
' denotes tranpose and sigmoidal is some kind of sigmoidal function like a logistic function or arc tangent. As far defining the error, we tend to use the sum of squared errors since it's differentiable and easy:
J(alpha,W,b) = sum_i (m(alpha,W,b,x[i]) - y[i])^2
Finally, we need to apply an optimization algorithm to the problem
min_{alpha,W,b} J(alpha,W,b)
Most of the time, people use a nonglobalized steepest descent. Honestly, this is terrible. A matrix-free inexact-Newton method globalized with either a line-search or trust-region method will work better.
Anyway, all good optimization algorithms require the derivative of J above. Certainly, we can grind through these derivatives by hand if we're masochistic. Well, for one layer it's not all that bad. However, as we start to nest things, it becomes a terrible pain to derive. Rather than do this, we can just use automatic differentiation to find the gradient and Hessian-vector products of J. In fact, most AD codes already do this. I can never find a paper that's run through the details, but back propagation is basically just a reverse mode AD algorithm applied to J in order to find the gradient.
What is the data that the machine learning algorithm would optimize with given an arbitrary program that is differentiated?
Is it just a faster program or something else?
For any changes you would need some kind of perfect 100% coverage regression test that proves that the optimized program is still correct and handles all cases because the differentiation only recorded one possible path through the program.
Alright, so machine learning is a pretty broad term that encompasses a lot of things. Very specifically above, I was talking about things I call curve fitters, which include multilayer perceptrons. There's also classifiers, which often people create by use a curve fitter and then thresholding it. There's other strange techniques as well that people call machine learning that don't fit into these categories as well.
Anyway, for curve fitters, we literally need a set of points {(x_i,y_i)}_{i=1}^m given to us and our goal is to map the n-dimensional vectors x_i to the scalar output y_i. If we want a multidimensional output, we just do this more than once. Now, theoretically, we can try and match any function, or program, to this data. Let phi be this model, so that we want phi(alpha,x)=y. Here, alpha are the parameters we want to tune for the matching. Also note, these parameters can't be arbitrary. We really need them to be some kind of floating point number. Now, how we do the matching can vary, but sum of squared errors is easy, so we solve the optimization problem
min_{alpha} sum_i (phi(alpha,x_i)-y_i)^2
Basically, we want to find a parameter alpha so that the output from program phi matches our data {(x_i,y_i)}_{i=1}^m as closely as possible in a least-squares sense. Since all good optimization algorithms need a derivative, we use automatic differentiation to find the gradient, and probably Hessian-vector product, of
J(alpha) = sum_i (phi(alpha,x_i)-y_i)^2
Now, if phi is a program, then so is J. It's a program that takes alpha as an input, runs a for-loop to sum up the squared errors, and then it returns the result.
Alright, that was long winded, but the key is this. We're not optimizing a program when we do this. Rather, the optimization occurs when we minimize the difference between our inputs and our outputs, which were given to us ahead of time. In this context, we have two programs. One, the program that we're trying to fit to our data. Two, the program that determines the amount of misfit between our inputs and our outputs. We need the derivative of the composition of these two programs, which is why we use automatic differentiation.