OP here. This is a dense paper since it involves many concepts (abstract algebras, tensors, automata) and I don't fully grok it (not even close).
The main points seem to be:
- Realization of the duality between programs and maps over vector space (simplification: assumes the real field).
- Define differentiation for those programs.
- Expand over current automatic differentiation approaches by having a well defined algebra, as opposed to a pure algorithmic translation of the chain rule by forward/backward propagation.
- Since the differentiation operator is well defined, get higher-order derivatives "for free" .
- Also, get composition of differentiable programs "for free".
This formalization seem to have a lot of potential IMO by allowing generalization of techniques researchers have to currently "hand craft" in neural networks (control structures, recursion, composition).
I skimmed their paper and don't claim to understand everything, but I am curious about some of your comments.
As far as higher-order derivatives, there's not really been an issue with obtaining these derivatives with traditional automatic differentiation techniques. The hard part is how to compute them efficiently. For example, Hessian-vector products can be obtained via forward-over-reverse or reverse-or-forward methods, but one runs significantly faster than the other. What's not clear to me in this paper is the practical efficiency of their technique especially for high-order derivatives.
Second, as far as neural networks, it's also not clear to me what's currently missing from using existing AD tools and what this new paper provides. Right now, there exist good AD tools that work just fine with control structures, recursion, and composition. In my opinion, a much better way to implement the training algorithms for machine learning is to just run an AD package on a particular model and then feed the resulting derivatives into an appropriate optimization algorithm. It's accurate, clean, and save time and effort. As such, what do we really gain from using this new methodology?
That said, I appreciate that the authors put together some interesting abstractions that I'd not seen before in this context. I think it's worth a read even if just for that.
> What's not clear to me in this paper is the practical efficiency of their technique especially for high-order derivatives.
On section "5.3.1 Order reduction for nested applications":
> We gained the ability of writing a differentiable program acting on derivatives of another
program, stressed as crucial (but lacking in most models) by other authors [18].
> Right now, there exist good AD tools that work just fine with control structures, recursion, and composition.
See their example implementation (https://github.com/ZigaSajovic/dCpp) to understand what "control structures, recursion and composition" mean in this context.
Thanks for the reply. Perlmutter and Siskind's paper doesn't really talk about the efficiency of higher-order derivatives either and only has a very simple run time for a gradient calculation.
I had a look at the code and I can't figure out how their abstractions would allow for efficient higher-order derivatives. Even on the front page, they list the examples
Generally, I think those abstractions make it hard to compute things like Hessian-vector products because we really need a reverse pass to generate either a program or computational tree that we can fly through with a forward pass. That involves giving an explicit instruction to run a reverse pass since we don't do it automatically since it's expensive. Also, we generally give an entire vector to the forward pass at one time since we can overlap a lot of computation. Finding second order derivatives one at a time, like `df/dxdy` for an entire Hessian tends to be slow. It looks like most of the action is in var.cpp, but I can't see how this would run fast. Do you have any idea how this scales up for derivative calculations (gradient and Hess-vec) as we get at least into the thousands of variables in terms of derivative time vs evaluation time with no AD abstractions?
Certainly, I may be misunderstanding something. That said, if the efficiency doesn't scale, it seems like the practical significance will be limited.
The idea of composing complicated differentiable programs (that have control structures, recursion, etc.) from simpler differentiable programs by means of an abstract algebra of differentiable programs is truly mind-bending.
It upends the conventional concept I have of "programming" for day-to-day purposes. It also upends the conventional concept I have of "designing" (i.e., handcrafting) machine learning models, like deep neural nets, for different problems.
Like you, I don't yet grok all the nitty-gritty details of the paper (it will take a me a long time to grok them), but it's clear to me that if the paper survives peer review (which I would expect, given that the authors have published working code), you're right: its formalization has a lot of potential for the fields of machine learning, AI, and computer science.
Holy cow.
EDIT: Removed the "could revolutionize" bit, which was over-enthusiastic.
Agreed. The paper seems to be about a lot more than AD, as AD isn't really the main subject. This approach to programming really does strike me as new and important. I too havent yet grasped all the details, having spent a few hours on it today. I commented a short synopsis bellow. I will try to get a better grasp of it tomorrow.
I got onto that based on this fantasy I was having. Basically I keep thinking that the main advantage deep learning has is probably not actually the neural nets, but the fact that they are highly parallel and very general computation techniques.
So if it doesn't really have to be a NN, maybe there could be a way to more efficiently achieve the same level of results by using something more like evolved parallel programs. Maybe have a universal virtual sensor grounding. Then if you find a useful/general procedure/dataset you could distribute it to improve lots of diverse agent's knowledge/skills immediately.
But to evolve these supposed universal function-approximating parallel programs that are more efficient and exchangable than deep NNs, you would need something like backpropagation, except for some arbitrary program structures. So when reading about backpropagation on wikipedia they mention the generalization is https://en.m.wikipedia.org/wiki/Automatic_differentiation
Anyway I don't grock all the details fully but my interpretation of all of this is that I think things like backpropagation with deep NNs are truly general purpose and when combined with grounding such as streams of virtual/real senses and training across widely varied tasks can actually produce general human and animal/like intelligence. I think the problem is it just takes a ton of computing power, iterations and time. If we wait 2-10 years that probably won't matter. But if we don't want to wait we can probably do something similar and more efficient/readily distributable if we can use algorithms that aren't NNs but are general and highly parallel and can be adjusted via AD to learn.
Based on this I personally believe we will see human/animal like general intelligence systems before 2020.
You have to be a little careful here. Neural networks are not "very general computation techniques." A dot product and a rectified linear function (or some other function of choice) are not "general computation techniques" in the sense you seem to use. They are a very specific set of operations. And the fact that you can show that two layers of these operations is a universal approximator is a red herring: decision trees and k nearest neighbors are also universal approximators.
Sure but I thought it was clear that I wasn't speaking in a precise or mathematical way. I was referring to what seems to be amazingly general abilities of certain successful deep learning-ish systems. And potentially also possibly amazingly general capabilities of some supposed imagined system a little bit like like those in terms the AD versus backpropagation.
Do you really think these abilities are not or cannot be as general as I imagine? I just see these systems like the Deep Mind projects reading pixels in 2d and 3d, and having this low-level sensory grounding. It seems likely to me that we will see very exciting and general further results and then down the line similar things on perhaps much smaller computers with the non-NN efficiencies. Do you think this is impossible or unlikely?
If you cannot describe something in a way that would allow for a precise formulation, then I don't know what you're trying to say. This is the difference between something that is implementable / testable, and something that's just a fantasy.
As far as I can tell, your definition of "general computing" is: "it works, so it must be general." While this is ironically indeed the attitude of a lot of people in deep learning, this is just faulty logic.
"I personally believe we will see human/animal like general intelligence systems before 2020."
What odds would you give me for that? If I agreed to pay you $100 in the event that we created human/animal-like general intelligence systems before 2020, how much would you agree to pay me in the event that we fail to do so?
The problem is that there are a lot of things I would consider to be 'human/animal-like general intelligence' that you probably wouldn't.. especially if it came down to money.
But for my purposes it would be something that could maybe get hints from virtual signs with simple words or phrases and then learn to do novel tasks in multiple complex 3d simulated environments where to some degree some learning was transferred. And then if it could also interact with virtual web pages on screens in those 3d worlds that had realistic use cases such as input forms, captchas, etc. Maybe we could even speak to it using language that a 3 year old would understand.
I think by the end of 2019 this is quite likely. So say 2-to-1 odds. So if I win, I get $50, if you win, you get $100.
But whether its a few years or a few decades, I truly believe that the human era is coming to an end. As in, regular people that don't have some sort of high-bandwidth brain interface to a contemporary AI, within not too many years, will be mainly irrelevant in the same way that many would consider chimpanzees to be irrelevant. At that point they will be technically transhumans, but the main drivers for our world will likely not be tied to human brains. Exactly when it will get to that point I don't know and don't think I want to bet. But my guess is before 2027.
It blows my mind that you think the world could change that much in 10 years. I'd estimate the things you describe as happening maybe by 2100, if they're going to happen at all.
Keep in mind that CPU time is a huge bottleneck. Genetic algorithms are easy to parallelize, while inductive searches are less so, also inductive searches are not cache friendly. Still, if you try them, you'll see that what I said about CPU time is a huge understating.
Turns out that the search space of all programs in a given language is big.
As I understand it, the motivation behind DeepMind's "differentiable neural computer" work is to build something Turing-complete but differentiable. The "neural" part isn't intrinsic to the goal, it's just the most promising way we currently have to get there.
A non-neural-net-based differentiable computer would be just as good, possibly better since the proposed neural architectures seem pretty inefficient.
Yes I think there are multiple serious AGI efforts using something like OpenAI Universe (there are others like that) and even though I haven't seen any earth-shattering announcements yet, I think between the AD stuff and hybrid data structure/program/NN systems and all of the investment in more powerful NN hardware and configurations there will be some surprisingly general demos running in open/varied environments in the next year or two.
Actually, I've been playing with this stuff as a hobby for about 15 years. There was a book by Stephen Levy i remember reading that kind of kicked it off for me.
I spent the last few hours reading this paper. From what I understand, the paper is not about AD, but something more. AD is "efficient means to calculating derivatives", as they put it. This paper seems to be about program analysis through calculus (AD comes into play, as derivatives are needed).
They show how to form equations with programs and calculate with them, kind of like with "normal" math. This seems to be the reason for constructing the algebra. They derive some transformation of programs similar to Fouriers I do not fully understand yet (they also use this algebra and calculus to generalize methods like Deep Dream).
They show how to use this algebra to initialize "tensor networks" from some other sub-optimal program and than train it to obtain better results than the original program. The transformation is called "Neural tensor series", it seems to be like an enhanced Taylor series for programs, but again, I do not fully understand it yet. What I do understand from it, is that through this transformation you can somehow relate a program to a tensor network, and use this to study and analyze programs and vice versa.
This algebra strikes me as interesting and important. They mention parallels to how physics is being studied. In physics, you often express a system as a solution to a system of equations, because you do not know the system in advance. In this frame, a program is like a system. Perhaps this theory allows a similar approach to programming.
There is another paper listed on github https://arxiv.org/abs/1612.02731 . It seems as though the code on github and this paper are additional material to the paper in the OP, not a full implementation of the theory, but a bridge for those familiar with AD to better understand the theory in the paper. I could be wrong though, it is getting late :)
Take the factorial function defined in one of the usual ways (recursive or iterative). How do you take the derivative of it, using this paper? Do you magically get the gamma function this way, too?
I just skimmed the paper without getting close to answering these questions -- anyone who does already understand, maybe explaining this concrete example could help the rest of us?
Thinking about the problem in my naive way earlier this year, I noticed that the usual if-then-else expression is a special case of linear interpolation, at least when you're dealing with numbers. ("lerp by t between a and b" generalizes "if t then b else a", taking false=0 and true=1.) But it's not so obvious how to handle recursion and nonnumeric data: I guess there's a hint in that the usual theory of programs has Scott continuity (https://en.wikipedia.org/wiki/Continuous_function) while differentiable functions are continuous in the reals, but I'm mathematically undereducated.
That's a helpful link, thanks. At first glance it looks like, in all their examples, the control flow is oblivious to the values in the vectors. I suppose I'll have to try it to see if/how it can differentiate factorial(x) with respect to x. ("You can't do that" would be a perfectly reasonable answer, but it's unclear to me if that's what it is.)
I could be [very] wrong but my read of the practical use of this is that it creates a really complicated-mathey AST that allows for adaptation of a program to different forms. The paper notes several times how a program could be adapted for maximum efficiency on different hardware than it was originally written for. I'm certain there a bunch more to it, but that was all I was able to grasp.
Actually, what's funny is that this is probably the most succinct explanation of an operator overloading approach to AD that I've heard. But, from what I can tell as well, yes. Most of the action is in var.cpp:
It looks like the AST is held by a collection of maps that contain pointers and other variables.
Outside of this paper, most of my time on writing AD codes has been literally just trying to do things on ASTs really, really fast, which is why your comment is really apt.
It's a bit beyond me too, but here's my take from skimming: It's already known that you can do differential calculus in some pretty abstract spaces (not just functions from R^n to R^m). Automatic differentiation is a way of automatically keeping differentiability and computing derivatives when sticking other differentiable stuff together. This puts those ideas together, so you can automatically differentiate some more interesting things stuck together in some interesting ways related to programming. That was what I got from my rusty semi-layperson's skim, at least.
Karpathy's explanation in CS231n is great: for any computational unit that is a function, you can calculate a derivative. So propagating the derivative backward through the function, just aplly the chain rule. This is much with simpler functions, so understand your algorithm down to the smallest computable units in the graph.
Does this only work for mathematical programs?
I don't understand how this would work for more general computations, involving loops, writes to files, etc.
This paper isn't really a great introduction into the concept of AD. That said, one way to visualize it is to see how a basic AD implementation works. There are generally two different varieties, operator overloading and source transformation. Let's focus on the first.
Say we have a function:
double myfun(std::vector <double> x) {
auto y = double(0.);
for(int i=0;i<=5;i++)
y += x[0]*x[1];
return y;
}
We may want to find the gradient of myfun, or simply the derivative of myfun with respect to x[0] and x[1]. What we're going to do is reply double with a special AD type, which requires us to code things like:
template <typename Real>
Real myfun(std::vector <Real> x) {
auto y = Real(0.);
for(int i=0;i<=5;i++)
y += x[0]*x[1];
return y;
}
Now, instead of using a double, we'll set Real to be our AD type. In C++, this'll be something like a class where all of the AD action occurs. What's in this class depends on the particular AD algorithm, which depends on the what derivatives we want and what kind of recomputations of the derivative we want to make. In general, though, it consists of some kind of tree. At its most basic level, the nodes of this tree consist of the types variable, constant, unary function, or binary function and they contain information that relates to the chain rule. When we want to compute a derivative, we traverse the tree in a very particular order to compute and accumulate the derivative information. How this is done can vary as well, which has implications for the efficiency of the algorithm. Most implementations use a tape that predefines the order of the tree traversal. It's also possible to do a topological sort to solve for a tree traversal. Anyway, traversing a tree can be expensive for a really, really large computational tree, so the goal is to only touch every necessary node once.
Returning to your original question, finding the derivative of a program involves building up this tree of computation. Control structures don't really matter in this context because we'll just keep adding to the tree regardless of loops or conditionals. This, by the way, can be really, really expensive since the computational tree can be huge for a long running loop. There's some game in determining how and when to collapse the tree in certain ways. Source code transformation tools can just leave the original, or really modified, control structures in place, which alleviates a lot of this expense, which is partially why they're faster than operator overloading.
Thanks for your explanation. So what this really does is enable one to train code using machine learning techniques such as gradient descent, traversing a code-space instead of traversing a space of numbers?
Well, from my perspective, there are three different components to machine learning techniques like multilayer perceptrons. Basically,
1. What's the model?
2. How do we define the error between the model and the data?
3. What algorithm do we use to fit the model to the data?
In the case of a multilayer perceptron with a single hidden layer, the model is
m(alpha,W,b,x) = alpha'*sigmoidal(Wx + b)
where
alpha : nhidden x 1
W : nhidden x ninput
b : nhidden
' denotes tranpose and sigmoidal is some kind of sigmoidal function like a logistic function or arc tangent. As far defining the error, we tend to use the sum of squared errors since it's differentiable and easy:
J(alpha,W,b) = sum_i (m(alpha,W,b,x[i]) - y[i])^2
Finally, we need to apply an optimization algorithm to the problem
min_{alpha,W,b} J(alpha,W,b)
Most of the time, people use a nonglobalized steepest descent. Honestly, this is terrible. A matrix-free inexact-Newton method globalized with either a line-search or trust-region method will work better.
Anyway, all good optimization algorithms require the derivative of J above. Certainly, we can grind through these derivatives by hand if we're masochistic. Well, for one layer it's not all that bad. However, as we start to nest things, it becomes a terrible pain to derive. Rather than do this, we can just use automatic differentiation to find the gradient and Hessian-vector products of J. In fact, most AD codes already do this. I can never find a paper that's run through the details, but back propagation is basically just a reverse mode AD algorithm applied to J in order to find the gradient.
What is the data that the machine learning algorithm would optimize with given an arbitrary program that is differentiated?
Is it just a faster program or something else?
For any changes you would need some kind of perfect 100% coverage regression test that proves that the optimized program is still correct and handles all cases because the differentiation only recorded one possible path through the program.
Alright, so machine learning is a pretty broad term that encompasses a lot of things. Very specifically above, I was talking about things I call curve fitters, which include multilayer perceptrons. There's also classifiers, which often people create by use a curve fitter and then thresholding it. There's other strange techniques as well that people call machine learning that don't fit into these categories as well.
Anyway, for curve fitters, we literally need a set of points {(x_i,y_i)}_{i=1}^m given to us and our goal is to map the n-dimensional vectors x_i to the scalar output y_i. If we want a multidimensional output, we just do this more than once. Now, theoretically, we can try and match any function, or program, to this data. Let phi be this model, so that we want phi(alpha,x)=y. Here, alpha are the parameters we want to tune for the matching. Also note, these parameters can't be arbitrary. We really need them to be some kind of floating point number. Now, how we do the matching can vary, but sum of squared errors is easy, so we solve the optimization problem
min_{alpha} sum_i (phi(alpha,x_i)-y_i)^2
Basically, we want to find a parameter alpha so that the output from program phi matches our data {(x_i,y_i)}_{i=1}^m as closely as possible in a least-squares sense. Since all good optimization algorithms need a derivative, we use automatic differentiation to find the gradient, and probably Hessian-vector product, of
J(alpha) = sum_i (phi(alpha,x_i)-y_i)^2
Now, if phi is a program, then so is J. It's a program that takes alpha as an input, runs a for-loop to sum up the squared errors, and then it returns the result.
Alright, that was long winded, but the key is this. We're not optimizing a program when we do this. Rather, the optimization occurs when we minimize the difference between our inputs and our outputs, which were given to us ahead of time. In this context, we have two programs. One, the program that we're trying to fit to our data. Two, the program that determines the amount of misfit between our inputs and our outputs. We need the derivative of the composition of these two programs, which is why we use automatic differentiation.
A set of all the possible states of the program’s memory, can be modeled by a finite dimensional real vector space V ≡ Rn.
Isn't the set of possible states of a computer's memory countable? How can one reasonably model this using a vector space over the reals? The paper claims any field will do. Well Z_p isn't a Banach space.
It seems to me if they want a notion of differentiability of a program that one would work in a field of characteristic 2 and use the module of Kahler differentials or use discrete differential forms. I certainly don't know enough about computer science but using a real vector space seems like it wouldn't be helpful.
yequalsx: yes, almost by definition, memory made up of bits should be modeled with a field of characteristic 2. That seems the most "natural" way to do it.
However, the authors of this paper are thinking about "computers" in a broader sense.
The memory space of these "neural" computers is formally modeled as R^n (n being the total number of parameters and memory addresses), and implemented approximately with 32-bit floating point numbers in practice (i.e., with a memory space of (2^32)^n bits). As you read the paper linked to by the OP, think about these neural computers instead of the one atop your desk.
EDITS: removed sloppy math and lightly edited some sentences.
Correction: on the second to last sentence above, the (2^32)^n figure is incorrect. I make this sort of mistake all the time... sorry! The point still stands, however :-)
This looks extremely interesting, but I find many of the terms difficult to map to a mental concept. Does anyone know good background material (books, other papers, blogs)?
This paper is rather hard to read and I've put it into my pile of papers to work through when it's not 1 AM, but basically this is related to machine learning and neural networks. More generally it's about how to transform arbitrary imperative programs with branching and control flow into a "differentiable" form. It means, you can train the program by using algorithms like gradient descent.
The awesome/scary thing about this line of research is that it's exploring the question of how far computer programming can be automated. Simple machine learning techniques can already be seen as "automatic computer programmers" in a sense, but their capacity is very limited. DeepMind recently made a breakthrough with their DNC architecture, described in that wiki page, which introduces a notion of memory to neural networks, and right now NN research is going at an incredible speed. The more stuff that becomes encoded in a differentiable form, the more stuff can be "trained" based on examples using standard machine learning techniques.
Well, I too put it on my "to read" stack but if it is referring to ordinary automatic differentiation, also known as "algorithmic differentiation" that beast is a bit more specific.
That is, ad allows the differentiation of functions defined as algorithms but not with arbitrary "branching and control flow" but essentially only when the "algorithm" involve a composition of primitive functions having known derivatives (example: addition, multiplication, exponentiation). That is still a big set - back propagation uses one form of AD for example.
However, for example, AD cannot be correctly applied to functions defined with loop or with arbitrary recursion.
I'm not sure if there's a relation to the DeepMind DNC but maybe if I read article I'll find out.
Isn't there a limit to completely automated programming that has to do more or less with the halting problem? Like if the program builds an infinite loop? I guess it will have to timeout, but that's going to slow the learning down. Deep neural networks don't have conditional loops in them so they don't have to deal with the halting problem.
I'm not sure exactly how it applies here, but in general you can limit the "language" so it only includes things that are guaranteed to halt. You lose a lot of expressive power, but hey, no halting problem. The obvious thing to do is guarantee loops halt by forcing them to have a monotonic countdown, or ban them entirely.
The main points seem to be:
- Realization of the duality between programs and maps over vector space (simplification: assumes the real field).
- Define differentiation for those programs.
- Expand over current automatic differentiation approaches by having a well defined algebra, as opposed to a pure algorithmic translation of the chain rule by forward/backward propagation.
- Since the differentiation operator is well defined, get higher-order derivatives "for free" .
- Also, get composition of differentiable programs "for free".
This formalization seem to have a lot of potential IMO by allowing generalization of techniques researchers have to currently "hand craft" in neural networks (control structures, recursion, composition).