CPython is built in C. Can you differentiate through that? I.e., would Python programs then also become differentiable, similar to JAX?
How much control do you have over the gradient? In some cases it can be useful to explicitly define a custom gradient, or to stop the gradient, or to change the gradient, etc.
Can you define gradients on integral types (int, char)?
Regarding differentiating Python via CPython: theoretically yes, though in practice it is likely wiser to use something like Numba, which takes Python to LLVM directly, to avoid a bunch of abstraction overhead that would otherwise have to be differentiated through. Also, fun fact: JAX can be told to simply emit LLVM and we've used that as an input for tests :)
You can explicitly define custom gradients by attaching metadata to the function you want to give a custom gradient (and Enzyme will use that even if it could differentiate the original function).
Integral types: mayyybe, depending on what exactly you mean. I can imagine using custom gradient definitions to specify how an integral type can be used in a differentiable way (say, representing a fixed point). We don't support differentiating integral types by approximating them as continuous values, if that's what you're asking. There's no reason why we couldn't add this (besides perhaps bit tricks being annoying to differentiate), but we haven't come across a use case.
Yea, I had this very rough (maybe crazy) idea in mind:
Once you can differentiate through CPython, and let's say you can also differentiate integral types via some approximation: if you have a bug in some Python code and a failing test case, you could use the output (e.g. the exception of the failing test) as an error signal and backpropagate to the Python program code, represented as a chunk of bytes. If there is some meaningful gradient, it could point you to possible source code locations where the bug might be.
Probably the gradient will be quite meaningless though, and that's why the idea does not really work in practice. But I think for some simple examples, it still might work.
To get a good approximate gradient you would also need to visit at least some of the many possible branches in the code, maybe via some MC sampling or so.
Thank you for sharing and releasing usable code! Do you know if this would work for GPU based applications? Tensorflow models that are trained on a GPU, for example?
For GPUs, there are a couple of different things that you might want to do.
You can use existing tools within LLVM to automatically generate GPU code out of existing code, and this works perfectly fine, even running Enzyme first to synthesize the derivative.
You can also consider taking an existing GPU kernel and then automatically differentiating it. We currently support a limited set of cases for this (certain CUDA instructions, shared memory, etc.), and are working on expanding it as well as doing performance improvements. AD of existing general GPU kernels is interesting [and more challenging] since racy reads in your original code become racy writes in the gradient -- which require extra care to make sure they don't conflict. To my knowledge GPU AD on general programs (i.e. not a specific code) really hasn't been done before, so it's a fun research problem to work on (and if someone knows of existing tools for this please email me at wmoses at mit dot edu).
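To make the race point concrete, here is a minimal hand-written sketch in plain C (the parallel loop is only indicated in comments, and this is not Enzyme's actual generated code): a scalar that is safely read by every thread in the forward pass turns into a single location that every thread wants to accumulate into in the reverse pass.

    #include <stddef.h>

    /* Forward: every iteration (imagine one GPU thread per i) reads the
     * shared scalar s -- concurrent reads, perfectly race-free. */
    void scale(double *out, const double *in, double s, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = s * in[i];
    }

    /* Hand-written reverse pass for illustration: the shared read of s
     * becomes a shared accumulation into d_s.  Run in parallel, that +=
     * is a racy write and needs an atomic add or a reduction. */
    void scale_grad(const double *d_out, const double *in, double s,
                    double *d_in, double *d_s, size_t n) {
        for (size_t i = 0; i < n; i++) {
            d_in[i] += d_out[i] * s;      /* private to iteration i: fine       */
            *d_s    += d_out[i] * in[i];  /* shared by all i: racy if parallel  */
        }
    }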
Enzyme needs to be able to access the IR of any potentially active functions (calls that it deduced could impact the gradient) to be able to differentiate them.
If all of the code you care about is in one compilation unit, you're immediately good to go.
Multiple compilation units can be handled in a couple of ways, depending on how much energy you want to put into setting it up (and we're working on making this easier).
The easiest way is to compile with Link-Time Optimization (LTO) and have Enzyme run during LTO, which ensures it has access to bitcode for all potentially differentiated functions.
The slightly more difficult approach is to have Enzyme ahead-of-time rather than lazily emit derivatives for any functions you may call in an active way (and incidentally this is where Enzyme's rather aggressive activity analysis is super useful). Leveraging Enzyme's support for custom derivatives in which an LLVM function declaration can have metadata that marks its derivative function, Enzyme can then be told to use the "custom" derivatives it generated while compiling other compilation units. This obviously requires more setup so I'm usually lazy and use LTO, but this can definitely be made easier as a workflow.
We go into more details in the Limitations section of the paper, but in short Enzyme requires the following properties:
* IR of active functions must be accessible when Enzyme is called (e.g. cannot differentiate dlopen'd functions)
* Enzyme must be able to deduce the types of operations being performed (see paper section on interprocedural type analysis for details why)
* Support for exceptions is limited (running with -fno-exceptions, the equivalent in a different language, or LLVM's exception-lowering pass removes these).
* Support for parallel code (CPU/GPU) is ongoing [and see the prior comment on GPU parallelism for details]
Also see the Julia package, which makes it accessible with a high-level interface and is probably one of the easier ways to play with it: https://github.com/wsmoses/Enzyme.jl.
> The Enzyme project is a tool for performing reverse-mode automatic differentiation (AD) of statically-analyzable LLVM IR. This allows developers to use Enzyme to automatically create gradients of their source code without much additional work.
Can someone please explain applications of creating gradients of my source code?
It constructs an analytical gradient from the code, which means you can compute the gradient directly. This can enable optimizations such as avoiding caching big matrices, because you don't need to keep track of states or trace the graph, and you can compute the 2nd, 3rd, 4th... and so on derivatives because you have an analytical gradient.
For example, in an affine function the gradient of the bias/intercept is the gradient of the loss wrt the layer's output, and for the weights it's the product of that same gradient and the input to the layer.
With automatic graph construction, e.g. eager TensorFlow/PyTorch, the layer needs to cache its input so that it can compute the gradient of the weights. If the layer receives inputs multiple times within the computation graph, you end up caching them multiple times.
With analytical gradients, you may be able to save memory by finding optimizations, because you have the analytical gradient; e.g. above you can sum the inputs, i.e. (dL/dz)*input1 + (dL/dz)*input2 = (dL/dz)*(input1 + input2).
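To make that concrete, here is a framework-free sketch of the affine-layer backward pass being described (z = W x + b with dL/dz given; the function and parameter names are just for illustration):

    /* Backward pass of z = W x + b, given dz = dL/dz.
     *   dL/db[j]    = dz[j]
     *   dL/dW[j][k] = dz[j] * x[k]            (needs the cached input x)
     *   dL/dx[k]    = sum_j dz[j] * W[j][k]
     */
    void affine_backward(const double *W, const double *x, const double *dz,
                         double *dW, double *db, double *dx,
                         int n_out, int n_in) {
        for (int j = 0; j < n_out; j++) {
            db[j] += dz[j];
            for (int k = 0; k < n_in; k++) {
                dW[j * n_in + k] += dz[j] * x[k];
                dx[k]            += dz[j] * W[j * n_in + k];
            }
        }
    }

The input x only shows up in the weight gradient, and it enters linearly, which is exactly what makes the sum-the-inputs trick above possible when the same dL/dz is applied to several inputs.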
Isn't the input of the layer fundamentally a part of the gradient computation? So even in this case (inspecting LLVM code) the computation still needs to look at the input.
Even so, you may be able to perform optimizations that were not possible under normal circumstances, e.g. an exponential in the output of one layer followed by a log in the next; think softmax and log-loss.
You don't always need the input to compute the gradient. For example the gradient of a sum function doesn't require the original input; it just sets all of the input derivatives to 1.
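A tiny hand-written sketch of that point (illustrative only, not Enzyme output): the reverse pass of a sum never touches the original data, so nothing from the forward pass needs to be cached.

    double sum(const double *x, int n) {
        double s = 0;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Reverse pass: d(sum)/dx[i] = 1 for every i, so the gradient only
     * needs the incoming derivative d_out, never the values of x. */
    void sum_grad(double *d_x, double d_out, int n) {
        for (int i = 0; i < n; i++)
            d_x[i] += d_out;
    }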
I think in essence what PartiallyTyped is trying to say is that one potential optimization opportunity in whole-program AD is that you can avoid having to cache the original inputs of the program if you know that the derivative computation won't need them (e.g. an input is only used in a sum and not in a product or something whose derivative depends on the value). Some ML frameworks must cache all of the inputs to an operation since they don't know whether they will be necessary for the reverse pass of that operation. You could go even further and decide to cache a different & smaller set of intermediate values that still lets you compute the gradient.
Beyond cache reduction, in our paper we demonstrate a lot of interesting ways that combining AD with a compiler can create potential speed-up. For example, we are often able to dead-code eliminate part of the original forward-pass code since it's not needed to compute the gradient.
We also have a cool example in the paper showing an asymptotic [O(N^2) => O(N)] speedup on a code for normalizing a vector because doing AD in the compiler lets Enzyme run after optimization (and in that example benefit from loop invariant code motion in a way that tools that aren't in the compiler cannot do).
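Paraphrasing that example from the paper (variable names approximate): the naive forward code recomputes the magnitude inside the loop, and differentiating it as written drags that O(N) call (and its derivative) into every iteration; running AD after optimization lets loop-invariant code motion hoist the call first.

    #include <math.h>

    double mag(const double *x, int n) {          /* O(n) */
        double s = 0;
        for (int i = 0; i < n; i++)
            s += x[i] * x[i];
        return sqrt(s);
    }

    void normalize(double *out, const double *in, int n) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] / mag(in, n);          /* loop-invariant call */
    }
    /* After -O2, LICM hoists mag(in, n) out of the loop, so AD run on the
     * optimized IR only has to differentiate one call to mag instead of n
     * of them -- hence the O(N^2) => O(N) change in the gradient. */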
Yeah my best guess at that is that they were trying to say you'd only need to store one value: the sum, rather than the two individual values -- but I'm not completely sure.
The essence of what I was trying to say with the example is that a layer may be used multiple times through the computation graph. Without an analytical gradient, you may end up caching all of the inputs to the layer to compute the gradient. The alternative is to sum up the inputs, because the gradient is linear with respect to the inputs; with an analytical gradient you can find that and compute it within the code instead of looking for ad hoc optimisations within the graph.
AFAIK, it's mainly used for implementing gradient descent, which is used for training neural networks.
Frameworks like PyTorch and TensorFlow use backpropagation to calculate the gradient of a multidimensional function, but it involves tracing, and storing the network state during the forward pass.
Static automatic differentiation should be faster, and should look a lot like how differentiation is done mathematically rather than numerically.
Of course there are more applications to AD in scientific computing.
You're right about Swift for TensorFlow. It is an interesting development. I have installed Swift on Linux just to play with it. I was disappointed to see that Swift for TensorFlow was a fork of the Swift compiler, and haven't followed whether it's being refreshed, or whether there are any plans to merge it back.
Ideally an LLVM tool should allow languages that compile to LLVM (I'm mostly interested in Rust and Julia) to leverage it without dramatically changing their compilers.
It would be interesting to see something like JAX in Rust, exposing the AD functionality in the Standard Library, paired with a high performance SIMD/GPU array type. Things could get very interesting.
I don't see how static AD removes the need to store the network state. Is this a fundamental property of static AD?
Also, your statement sounds like pyTorch/TF are doing AD numerically, which is not the case. They build the analytical gradient from the traced computation graph.
Reverse mode AD can always get into situations where it needs to store original values (i.e. network state).
One advantage, however, of doing a more whole-program approach to AD rather than individual operators is that one might be able to avoid caching values unnecessarily. For example if an input isn't modified (and still exists) by the time the value is needed in the reverse pass, you don't need to cache it but can simply use the original input without a copy.
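A small hand-sketched illustration of that (not actual Enzyme output): the reverse pass of an elementwise product needs the original operands, but if neither has been overwritten by the time the reverse pass runs, it can just read them in place rather than keeping copies.

    /* Forward: y[i] = x[i] * w[i]. */
    void mul(double *y, const double *x, const double *w, int n) {
        for (int i = 0; i < n; i++)
            y[i] = x[i] * w[i];
    }

    /* Reverse pass needs x and w.  If they are still live and unmodified
     * when this runs, no cached copies are required; only if they were
     * overwritten in between would the forward pass have to save them. */
    void mul_grad(double *d_x, double *d_w, const double *d_y,
                  const double *x, const double *w, int n) {
        for (int i = 0; i < n; i++) {
            d_x[i] += d_y[i] * w[i];
            d_w[i] += d_y[i] * x[i];
        }
    }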
And yes PyTorch/TF tend to perform a (limited) form of AD as well, rather than do numerical differentiation (though I do think there may be an option for numerical?)
I wouldn't really position a tool like Enzyme as a competitor to PyTorch/TF (they may have some better domain-specific knowledge after all), but rather a really nice complement. Enzyme can take derivatives of arbitrary functions, in any LLVM-based language rather than the DSL of operators supported by PyTorch/TF. In fact, we built a plugin for PyTorch/TF that uses Enzyme to import custom foreign code as a differentiable layer!
> One advantage, however, of doing a more whole-program approach to AD rather than individual operators
I was under the impression that the big ML frameworks (and surely JAX with jit) are doing optimization on the complete compute graph, too.
I didn't want to make this discussion too TF/pyTorch focused (I'm not even a ML researcher). But your optimization claims sound like the other AD frameworks are not doing any optimization at all, which is not the case.
I was also thinking about derivatives of functions which are doing something iterative on the inside, like a matrix decomposition (combined with linear solve and/or matrix inversion). While a "high level" AD tracer can identify an efficient derivative of these operations, your LLVM introspection would only be able to compute the derivative through all the internal steps of the matrix decomposition?
Oh for sure, any ML framework worth its salt should do some amount of graph rewriting / transformations.
I was (perhaps poorly) trying to explain how while yes AD (regardless of implementation in Enzyme, PyTorch, etc) _can_ avoid caching values using clever tricks, they can't always get away with it. The cache-reduction optimizations really depend on the abstraction level chosen for what tools can do. If a tool can only represent the binary choice of whether an input is needed or not, it could miss out on the fact that perhaps only the first element (and not the whole array/tensor) is needed.
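A concrete instance of that granularity point (a made-up example, nothing Enzyme-specific): here the reverse pass only needs x[0] and one scalar derived from y, so a tool that can only answer "is this input needed: yes/no" ends up keeping the whole arrays for nothing.

    /* f(x, y) = x[0] * (y[0] + ... + y[n-1]) */
    double f(const double *x, const double *y, int n) {
        double s = 0;
        for (int i = 0; i < n; i++)
            s += y[i];
        return x[0] * s;
    }

    /* Reverse pass: df/dx[0] = sum(y), df/dy[i] = x[0] for every i.
     * Only the scalar x0 and the scalar sum_y are needed -- not the
     * full arrays x and y. */
    void f_grad(double *d_x, double *d_y, double d_out,
                double x0, double sum_y, int n) {
        d_x[0] += d_out * sum_y;
        for (int i = 0; i < n; i++)
            d_y[i] += d_out * x0;
    }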
Regarding Enzyme vs JAX/etc, again I think that's the wrong way to think about these tools. They solve problems at different levels and in fact can be used together for mutual benefit.
For example a high-level AD tool in a particular DSL might know that algebraically you don't need to compute the derivative of something since from the domain knowledge it is always a constant. Without that domain knowledge, a tool will have to actually compute it. On the other side of the coin, there's no way such a high level AD tool would do all the drudgery of invariant code motion, or even lower level scheduling/register allocation (and see Enzyme paper for reasons why these can be really useful optimizations for AD).
In an ideal world you want to combine all this together and have AD done in part whenever there's some amount of meaningful optimization (and ideally remove abstraction barriers like say a black box call to cudnn). We demonstrate this high and low level AD in a minimal test case against Zygote [high level Julia AD], replacing a scalar code which is something Zygote is particularly bad at. This thus enables both the high level algebraic transformations of Zygote and the low level scalar performance of Enzyme, which is what you'd really want to do.
It looks like the discussion of this has dropped off for now, but I'm sure shoyer would be able to do a much better job of listing interesting high-level tricks JAX does [and perhaps low-level ones it misses] as a consequence of its choice of where to live on the abstraction spectrum.
Also thanks for reminding me about matrix decomposition, I actually think there's a decent chance of doing that somewhat nicely at a low level from various loop analyses, but I got distracted by a large fortran code for nuclear particles.
You might be right, I seem to have confused the PyTorch (and TF eager mode) differentiation approaches with numerical methods.
I'm making a PyTorch-inspired ML framework, and indeed each op node also defines a backward pass, which is a manual definition of its derivative. Going backwards over the op graph and combining the derivatives of each op via the chain rule to get the final gradient does indeed look like a runtime analytical method rather than a numerical one.
The advantage of automatic AD is not having to define the backward pass for each op, and having the function that calculates the derivative be generated at compile time.
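For reference, the per-op boilerplate being described looks roughly like this hand-rolled multiply op (a sketch, not code from any particular framework): the forward pass caches its inputs precisely so that the hand-written backward can apply the chain rule later.

    typedef struct {
        double x, y;    /* inputs cached for the backward pass */
    } MulNode;

    /* Forward: z = x * y, saving what backward will need. */
    double mul_forward(MulNode *node, double x, double y) {
        node->x = x;
        node->y = y;
        return x * y;
    }

    /* Backward: chain rule written by hand for this one op.
     *   dL/dx = dL/dz * y,   dL/dy = dL/dz * x */
    void mul_backward(const MulNode *node, double d_out,
                      double *d_x, double *d_y) {
        *d_x += d_out * node->y;
        *d_y += d_out * node->x;
    }

With a compile-time AD tool, that backward function is what would get generated for you.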
I've let the project marinate a bit, so the little knowledge I had is fading away.
Say you have some existing virus simulation codebase that you want to use ML on to derive an effective policy. Without an AD tool like Enzyme, you'd have to spend significant time and effort understanding and rewriting those obnoxious 100K lines of Fortran into TensorFlow, when you could've been spending it solving your problem. The reason you need to do this rewriting is that many ML algorithms require the derivatives of functions to be able to use them, and Enzyme provides an easy way to generate derivatives of existing code.
This is also useful in the scientific world where derivatives of functions are commonplace.
You could also use it in more performance-engineering/computer systems ways as well by using the derivatives to perform uncertainty quantification and perhaps decide to use 32-bit floats rather than 64-bit doubles.
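As a rough sketch of that last use (plain first-order error propagation; the helper name and the approach here are my own, not an Enzyme API): once AD gives you df/dx, you can estimate how much round-off at a given precision moves the output, and use that to decide whether 32-bit floats are good enough.

    #include <math.h>
    #include <float.h>

    /* First-order estimate of the output error caused by rounding the input
     * to a given precision: |f'(x)| * |x| * eps, where dfdx would come from
     * an AD tool and eps is FLT_EPSILON (float) or DBL_EPSILON (double). */
    double output_error_estimate(double x, double dfdx, double eps) {
        return fabs(dfdx) * fabs(x) * eps;
    }

    /* If output_error_estimate(x, dfdx, FLT_EPSILON) is well below the
     * accuracy you need, single precision is likely fine for that input. */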
I'm a big believer in autodiff, but I'm skeptical that any autodiff tool would differentiate a 100k-line simulation code correctly and efficiently without manual intervention. I'd certainly love to be proven wrong, though, and absolutely AD can be a big time saver :)
One could, but automatic differentiation is much more efficient than numerical differentiation, thus for high performance applications it is preferable to use automatic differentiation.
Adding onto this, numerical derivatives have two potential problems which is why they tend not to be used in big scientific/ML frameworks.
First of all, they suffer from accuracy decay. For example, if you were to do the standard f'(x) ≈ [f(x+h) - f(x)]/h, you'd subtract two similar numbers and waste many bits of precision. In contrast, if you were to generate the derivative function directly like below, you'd end up far more accurate.
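Something like the square example from the Enzyme README (paraphrased; it has to be compiled with the Enzyme plugin loaded): the generated derivative is equivalent to 2 * x, with no step size h and no catastrophic cancellation.

    double square(double x) { return x * x; }

    /* Enzyme's entry point: pass the function to differentiate plus its
     * arguments, get the derivative back. */
    extern double __enzyme_autodiff(void *, double);

    double dsquare(double x) {
        /* Equivalent to returning 2 * x. */
        return __enzyme_autodiff((void *)square, x);
    }

    /* Compare with the finite-difference version, which subtracts two
     * nearly equal numbers and throws away precision:
     *     (square(x + h) - square(x)) / h                              */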
Secondly, from a performance perspective numerical differentiation is really slow -- especially for gradient computation. For example, you would need to evaluate the function at least once per argument in numerical differentiation to get the whole gradient. In contrast, reverse-mode AD lets you generate the entire gradient in one call.
In addition to these generic issues, we illustrate in our paper how doing this at a compiler level allows for significant additional optimization (by removing unnecessary code from the forward pass, finding common expressions, etc).
These issues are also amplified as you take higher-order derivatives and so on.
Funny, I worked on Tapenade (one of the automatic differentiation tools compared in the paper). I'm happy that it still reaches 60% of the performance of something written directly inside an optimizing compiler.
The first time my co-advisor said "you're going to take the derivative of that code by the end of the day" I felt like I'd taken a wrong turn into crazy-town. I enjoyed reading about & understanding Tapenade — such a great piece of code!
Enzyme is named such as it's a tool that "synthesizes derivatives" and also as a pun referencing Zygote (another AD tool) since Enzyme operates at a lower level (LLVM rather than Julia).
Tapenade is just a traditional recipe from the south-east of France, where the research team developing Tapenade is based (they are at INRIA Sophia-Antipolis).
do you have any sense for how this would integrate with rust? as someone who isn't familiar with how it works, it's not clear whether that would be as easy as normal C ffi interop or more involved.
Oh man that was a fun hack to write. Basically we demonstrated an easy-to-setup AD on rust by leveraging link-time optimization (LTO) as a way to make sure Enzyme's generate derivatives "optimization pass" was run.
We're currently working with the Rust ML infrastructure group to make a nice integration of Enzyme into Rust (e.g. nice type-checking, safety, etc). If you're interested in helping, you should join the Rust ML meetings and/or Enzyme weekly meetings and check out https://github.com/rust-ml/Meetings and https://github.com/tiberiusferreira/oxide-enzyme/tree/c-api . There's a bunch of interesting optimizations and nicer UX for interoperability we want to add so more manpower is really helpful!
The most interesting thing from the Rust standpoint is that ideally we'd want Enzyme to be loaded into the Rust compiler as a plugin (much like it is for Julia, Clang for C/C++, etc) -- but Rust doesn't support the option for that yet. This means we can either help push for plugins/custom codegen in Rust, make script-based compilation tools within rustc [I don't remember the specific name but someone who is more of a Rust expert I'm sure can chime in], or do the sketchy LTO approach above [not always desirable as it requires running LTO].
Alternatively Enzyme can just become part of LLVM mainline so everyone can use it without a plugin :P We're not quite there yet but we're in the process of becoming a formal LLVM incubator project!
Enzyme does indeed handle mutable arrays (both in Enzyme.jl and any other frontend)! If you want to try it out, be forewarned that we're currently upgrading Enzyme.jl for better JIT integration (dynamic re-entry, custom derivative passthrough, nicer garbage collection), so there may be some falling debris.
Some more relevant links for the curious
Github: https://github.com/wsmoses/Enzyme
Paper: https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b...
Basically the long story short is that Enzyme has a couple of interesting contributions:
1) Low-level Automatic Differentiation (AD) IS possible and can be high performance
2) By working at LLVM we get cross-language and cross-platform AD
3) Working at the LLVM level actually can give more speedups (since it's able to be performed after optimization)
4) We made a plugin for PyTorch/TF that uses Enzyme to import foreign code into those frameworks with ease!